Apache Hive Replication: Encryption Zones
We have been working on Apache Hive replication for almost three years now, solving various use cases, some of which I have talked about in my previous posts here, here and here.
This post is about how we enhanced Hive replication to add support for HDFS encryption zones. Customers could have deployments wherein the source (primary) cluster is on an older version than the target (secondary) cluster.
Intro To Encryption Zones in HDFS
The customer has various databases in their Hive warehouse, each under a different encryption zone. An encryption zone in HDFS is a location where all data within that location and its sub-directories is encrypted. There can be multiple encryption zones in any given deployment of HDFS. At a high level, a key is provided for each encryption zone root directory (ez-root), called the key encrypting key (KEK). Whenever a file is written under the ez-root, an encryption key is generated for that file, called the data encryption key (DEK). The DEK is encrypted with the KEK of the given encryption zone, and the encrypted DEK is stored along with the file's metadata on the NameNode.
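The KEK/DEK relationship above is a form of envelope encryption. Below is a minimal sketch of the idea, assuming a toy XOR keystream in place of the real cipher so that it runs with only the Python standard library; the function and variable names are hypothetical and nothing here reflects actual HDFS or KMS code.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with a SHA-256 counter keystream.

    Illustration only (an assumption of this sketch); it is not what HDFS uses
    and must not be used to protect real data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


# One key per encryption zone root (the KEK in the text above).
zone_kek = secrets.token_bytes(32)

# A fresh DEK is generated for the data being written.
dek = secrets.token_bytes(32)
ciphertext = keystream_xor(dek, b"row1,row2,row3")

# Only the encrypted form of the DEK is kept with the file's metadata.
encrypted_dek = keystream_xor(zone_kek, dek)

# Reading: decrypt the DEK with the zone's KEK, then decrypt the data.
recovered_dek = keystream_xor(zone_kek, encrypted_dek)
plaintext = keystream_xor(recovered_dek, ciphertext)
```

The point of the layering is that the data can only be read by someone who can first decrypt the DEK, which in turn requires the key of that particular encryption zone.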
Customer Deployments using Encryption Zones with Hive
Setup One
Customers have just one encryption zone in HDFS, and all their Hive databases are under this encryption zone. This can be achieved by specifying the database location when creating a database in Hive. Tables (managed/external) within the respective databases follow the same storage philosophy.
Setup Two
Customers have multiple encryption zones, with different Hive databases confined to different encryption zones. Tables (external/managed) within these databases follow the same storage philosophy as the database.
Setup Three (we designed for this, although we were not aware of any customer with this deployment)
Hive databases can be in either an encryption zone or a non-encryption zone. However, tables (specifically managed tables) within these databases can also be created with a user-provided location, which can be in a different zone than the database they are created under. I think this is no longer possible in the latest version of Hive; however, we had customers on older versions where it was possible, and we had to provide replication capability for such source cluster deployments as well.
The primary problem to solve for encryption zones was the ability to provide point-in-time replication for managed tables. This is problematic because HDFS does not allow files to be renamed across encryption zone boundaries, yet we had a single global configuration for the change management (CM) directory, set using
hive.repl.cmrootdir
To support the various use cases mentioned above, we introduced logic in Hive that automatically creates a CM directory under each encryption zone root directory, at a relative path given by the following Hive config
hive.repl.cm.encryptionzone.rootdir
The CM root directory is identified based on the table data location. Any destructive operation on a table or partition leads Hive to use the corresponding table's or partition's CM directory to enable point-in-time replication.
However, customers currently using Setup One who want the ability to move to either Setup Two or Setup Three would have problems with upgrades, since we encode the CM paths in the replication dump output. These paths are in the _files file that we create, which contains the actual location of the data on the source and the possible location of the data under the CM directory. If a destructive operation (such as drop, delete or truncate) happens on the source before the repl load command (more details here) is run, Hive moves the data to the possible location previously written in _files. To get around this problem, we introduced an additional configuration
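As a rough illustration of picking a CM directory from the data location, here is a minimal sketch. It assumes the list of encryption zone roots is already known and that the relative directory is whatever hive.repl.cm.encryptionzone.rootdir is set to; the function names, the example paths and the ".cmroot" value are hypothetical and are not taken from Hive's code.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional


def find_zone_root(data_path: str, zone_roots: Iterable[str]) -> Optional[PurePosixPath]:
    """Return the encryption zone root that contains data_path, if any."""
    path = PurePosixPath(data_path)
    for root in (PurePosixPath(z) for z in zone_roots):
        # Compare whole path components, so /zones/sales2 is not under /zones/sales.
        if root == path or root in path.parents:
            return root
    return None


def cm_dir_for(data_path: str, zone_roots: Iterable[str], ez_relative_dir: str) -> Optional[str]:
    """CM directory for data inside an encryption zone, else None."""
    zone = find_zone_root(data_path, zone_roots)
    return None if zone is None else str(zone / ez_relative_dir)


# Hypothetical layout: two zones, one table partition inside the second.
zones = ["/zones/finance", "/zones/sales"]
sales_cm = cm_dir_for("/zones/sales/sales.db/orders/part=1", zones, ".cmroot")
plain_cm = cm_dir_for("/warehouse/plain.db/t1", zones, ".cmroot")
```

In this sketch sales_cm resolves to a directory under /zones/sales, so data removed from that table stays inside its own zone, while plain_cm is None because the path is not under any zone.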
hive.repl.cm.nonencryptionzone.rootdir
which provides the ability to create a CM directory for databases that are not under any encryption zone.
If you are using encryption zones in any other configuration that is not supported above, we would love to hear from you.
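To show how the three settings could fit together, here is a small sketch. Only the property names come from this post; the values are made up, and the precedence (zone-relative directory first, then the non-encryption-zone directory, then the global directory) is an assumption for illustration rather than a statement of Hive's exact behaviour.

```python
from typing import Optional

# Hypothetical values; only the property names come from the text above.
settings = {
    "hive.repl.cmrootdir": "/user/hive/cmroot",
    "hive.repl.cm.encryptionzone.rootdir": ".cmroot",
    "hive.repl.cm.nonencryptionzone.rootdir": "/user/hive/cmroot-plain",
}


def choose_cm_root(zone_root: Optional[str], conf: dict) -> str:
    """Pick a CM root for data whose encryption zone root (or None) is known.

    Assumed precedence, not confirmed against Hive's implementation.
    """
    if zone_root is not None:
        # Data inside a zone: CM directory relative to that zone's root.
        relative = conf["hive.repl.cm.encryptionzone.rootdir"]
        return zone_root.rstrip("/") + "/" + relative
    # Data outside any zone: prefer the dedicated setting when it is set.
    plain = conf.get("hive.repl.cm.nonencryptionzone.rootdir", "")
    return plain if plain else conf["hive.repl.cmrootdir"]


in_zone = choose_cm_root("/zones/sales", settings)
outside_zone = choose_cm_root(None, settings)
```

With these values, in_zone lands under the sales zone and outside_zone uses the non-encryption-zone directory; leaving that setting empty makes the sketch fall back to the global directory.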