Apache Hive Replication: Doing right by External Tables
Over the past couple of years we have been focusing on building a ground-up replication solution for Apache Hive, the most popular data warehousing tool out there. We have made significant progress, and we see a lot of customers really excited about this capability being available natively in Apache Hive.
I had written the first post about what we were doing and some of the decisions we took here. Among those decisions, there was one use case which received significant push back from Hive users: migrating external tables to managed tables. We had to quickly disable this replication mode since none of the users were remotely excited about it. First let me explain our reasoning behind doing this, following which I will outline what we have done to make sure external tables are replicated as external tables on the target warehouse. I will then cover some of the challenges we faced with the implementation and conclude with some performance considerations to take care of.
To start off, why did we initially decide to convert external tables to managed tables? The answer revolves around how we saw Apache Hive progressing with newer versions. In our mind there were fundamentally two modes of using Hive:
- one where users want to use it as a SQL engine on top of storage (HDFS, S3, etc.)
- the other where users want traditional database capabilities/performance from Hive for their use cases.
Among these, we anticipated that most users would want to use Hive as a database engine, well, who doesn't want better performance :), and this was possible with the second mode, since Hive would very tightly control the flow of data to and from the storage layer. This allows Hive to write data in an optimized format, update relevant cache structures (some already built, like the query results cache, and some in the process of being built, like the metastore cache), create additional index/filter structures like bloom filters, automatically update stats, and so on, allowing Hive to serve user queries faster. In the SQL engine mode, since data can be accessed from outside Hive, real-time updates to these structures are not possible, and in a few cases some caching capabilities cannot be used at all. However, Hive does provide commands to build some of these structures and improve performance. As with most large enterprises, given the amount of data under management in Hive, along with the string of applications built on top of this data in HDFS using MR/Tez/Spark, it will take some time for users to move to using Hive as a traditional database. The concept of an external table prevented Hive from implementing features that provide traditional database capabilities, and hence we thought of at least moving replicated data to a more traditional, database-centric format.
Since then we have realized that this does not provide an easy migration path for customers, since they would now have to interact with these tables differently on the source and target warehouses. Additionally, we wanted to build failover and failback capabilities for Hive across two data centers, and the table type change would make this more challenging for us and more complicated for users who want to effectively interact with the warehouse. So we revisited how to replicate external tables on the source as external tables on the target, and the new design and code to achieve this should be available with Apache Hive 4.0.
Design/Capabilities of External Table Replication
- External tables cannot adhere to the point-in-time replication capability that managed tables have, because we don't know who interacts with the storage layer to change data and when, hence there are no NOTIFICATION_LOG entries. We could have achieved a partial point-in-time capability for newly created tables plus changes done to external tables via Hive, as these can be tracked. We did not go this route, since it would be an unnecessary burden for users to understand which kinds of interactions with an external table on the source cluster would give them this capability. Given that there will always be cases where we cannot provide this guarantee, we decided to implement external table replication without a point-in-time guarantee.
- This also means every incremental replication has to copy all the directories associated with external tables from source to target. Distcp came to the rescue here: given the top-level directories, it can diff the data and copy/delete only the differences.
- Since the location of an external table can be anywhere in HDFS, when multiple source clusters replicate data to the same target cluster there is a high possibility of data from different sources overwriting each other. To prevent this, we mandate that a base directory configuration (hive.repl.replica.external.table.base.dir) be provided in the WITH clause of REPL LOAD (see the sketch after this list).
- For ease of use, external table replication is part of the same REPL commands. This means that even though the data is copied out of band, the metadata still follows the point-in-time impression on the target.
- There is no change management system for the data in external tables, since change management was created primarily for the point-in-time capability.
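To make the base directory requirement concrete, here is a minimal sketch of kicking off a load with that configuration. The database name, dump location, and base directory value are placeholders, and the exact REPL LOAD grammar differs between Hive releases, so treat this as an illustration of where hive.repl.replica.external.table.base.dir goes rather than exact syntax.

```python
# Hypothetical example: pass the mandatory base directory for external table
# data in the WITH clause of REPL LOAD on the target warehouse. The database
# name, dump path and base directory below are placeholders.
base_dir = "/users/admins/warehouse/one/external_tables"

repl_load = (
    f"REPL LOAD sales FROM '/apps/hive/repl/sales/next_dump' "
    f"WITH ('hive.repl.replica.external.table.base.dir'='{base_dir}')"
)

# Submit through beeline, JDBC, or any other Hive client pointed at the target warehouse.
print(repl_load)
```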
Implementation Details
We wanted the process of replicating external tables to be similar during the bootstrap and incremental phases. To achieve this we segregated the data and metadata operations for external tables. Metadata replication follows the existing mode used for other tables, wherein during bootstrap/incremental we dump the table info in the _metadata file and keep _files empty. For data replication we created a new file named _external_tables_info under the root directory of the bootstrap and incremental dump directories. It contains one comma-separated entry per table_name / partition_name (partitions only in some cases; more on this in the performance section) with the base64-encoded, fully qualified HDFS path. During REPL LOAD we do the following:
- for metadata we had to modify the location of the external table on the target based on the value of the base directory configuration. So if base_directory=/users/admins/warehouse/one/external_tables, then when replicating a table whose HDFS location on the source was /user/anishek/table1, its location would become /users/admins/warehouse/one/external_tables/user/anishek/table1 on the target. This meant we had to modify the metadata definition based on the table type before passing it to DDLTask to do the necessary work (see the sketch after this list).
- for data we created independent distcp jobs for each entry in the _external_tables_info file.
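To illustrate the two steps above, here is a small sketch (not Hive's actual code) of producing and consuming one _external_tables_info entry: the entry format follows the description above (name plus base64-encoded fully qualified path), the target location is re-rooted under the base directory exactly as in the /user/anishek/table1 example, and the differential copy is handed to distcp with its -update/-delete options. The helper names are illustrative.

```python
# A sketch, not Hive source: write one _external_tables_info entry during dump,
# then on load decode it, re-root the path under the configured base directory,
# and build a distcp invocation for the differential copy.
import base64
import posixpath

def dump_entry(name, qualified_path):
    """One line of _external_tables_info: <name>,<base64(fully qualified HDFS path)>."""
    encoded = base64.b64encode(qualified_path.encode("utf-8")).decode("ascii")
    return f"{name},{encoded}"

def load_entry(line, base_dir):
    """Decode an entry and produce the source path, target path and a distcp command."""
    name, encoded = line.strip().split(",", 1)
    src = base64.b64decode(encoded).decode("utf-8")
    if "://" in src:
        # strip the scheme and authority, e.g. hdfs://source-nn:8020
        src_path = "/" + src.split("://", 1)[1].split("/", 1)[1]
    else:
        src_path = src
    # prefix with base_dir, mirroring the metadata location rewrite described above
    tgt = posixpath.join(base_dir, src_path.lstrip("/"))
    distcp = ["hadoop", "distcp", "-update", "-delete", src, tgt]
    return name, src, tgt, distcp

entry = dump_entry("table1", "hdfs://source-nn:8020/user/anishek/table1")
print(load_entry(entry, "/users/admins/warehouse/one/external_tables"))
# target: /users/admins/warehouse/one/external_tables/user/anishek/table1
```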
When using external tables, most users also have storage-based authorization enabled, hence we had to do some work to make directory permissions on the target match the source. Permissions and ownership for most of the sub-directories were preserved using distcp options; however, for the top-level directories we had to do this explicitly, since distcp did not carry these over.
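Roughly, the flow looks like the sketch below: the -pugp option on distcp preserves user, group and permission bits for whatever it copies, and the top-level directory is fixed up explicitly afterwards. The paths are placeholders, Hive presumably does this through the Hadoop FileSystem API rather than by shelling out, and the hdfs dfs -stat %a format requires a reasonably recent Hadoop.

```python
# A hedged sketch of preserving ownership/permissions on the target:
# distcp -p preserves attributes of the copied files and sub-directories,
# while the pre-existing top-level directory is fixed up explicitly.
import subprocess

src = "hdfs://source-nn:8020/user/anishek/table1"
tgt = "hdfs://target-nn:8020/users/admins/warehouse/one/external_tables/user/anishek/table1"

# differential copy, preserving user (u), group (g) and permission (p) bits
subprocess.run(["hadoop", "distcp", "-pugp", "-update", "-delete", src, tgt], check=True)

# distcp does not rewrite attributes of the existing top-level target directory,
# so copy the source directory's owner, group and mode over explicitly
owner_group = subprocess.check_output(["hdfs", "dfs", "-stat", "%u:%g", src], text=True).strip()
mode = subprocess.check_output(["hdfs", "dfs", "-stat", "%a", src], text=True).strip()
subprocess.run(["hdfs", "dfs", "-chown", owner_group, tgt], check=True)
subprocess.run(["hdfs", "dfs", "-chmod", mode, tgt], check=True)
```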
Performance Considerations
Since replication has been built (and is recommended) to use the parallel execution capability in Hive (enabled via hive.exec.parallel=true), we wanted to make sure external tables also leverage this. Queries are converted to DAGs to be executed by the execution engine, and vertices at the same level of a DAG can execute in parallel. Since the bootstrap DAG is already built with vertices at the same level, which can be executed in parallel, we didn't have to do much apart from creating a set of distcp jobs as root vertices of the DAG. For incremental replication the DAG is always linear (no parallelism is possible, since events on the source are serialized to represent point-in-time actions), but we could still use parallelism for the distcp jobs. This again required creating multiple root vertices. However, since DAGs for replication can be really long, we had earlier implemented dynamic DAG generation (creating the DAG at runtime with configurable limits on its size), so we had to make sure we do proper bookkeeping of the tasks we create, especially during the incremental phase. For example: if there are 100 external tables and the incremental event count is 10,000, but the configured DAG size is only 200 (hive.repl.approx.max.load.tasks) and the parallelism allowed in the system is 8 threads (hive.exec.parallel.thread.number), we should only create 7 external table distcp jobs at the root level, with the first-pass DAG covering 193 of the 10,000 incremental events.
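The bookkeeping in that example can be expressed as a tiny calculation. The sketch below is illustrative (the function name and the exact splitting rule are my reading of the example, not Hive's code): one execution slot is kept for the serialized chain of incremental events, the remaining threads run distcp tasks at the root of the DAG, and the event budget is whatever is left of the configured DAG size.

```python
# Illustrative bookkeeping for the first-pass DAG during incremental load:
# reserve one slot for the serialized event chain, use the remaining parallel
# threads for distcp root tasks, and fill the rest of the DAG budget with events.
def first_pass_split(num_external_tables, num_events, max_load_tasks, parallel_threads):
    distcp_tasks = min(num_external_tables, parallel_threads - 1)
    event_tasks = min(num_events, max_load_tasks - distcp_tasks)
    return distcp_tasks, event_tasks

# the example from the text: 100 external tables, 10,000 incremental events,
# hive.repl.approx.max.load.tasks=200, hive.exec.parallel.thread.number=8
print(first_pass_split(100, 10_000, 200, 8))   # -> (7, 193)
```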
External tables can have a large number of partitions, possibly in the range of 100,000; in that case we don't want that many entries in the _external_tables_info file, and it would also create the additional overhead of as many distcp jobs, one per partition. We could just have one entry per table and not bother about partitions, but that relies on the assumption that all partition directories live under the table directory. However, partitions can be created with locations that are not children of the table-level directory. So we determine this parent/child relationship every time we write to _external_tables_info, and write partition entries only when the partition directories do not fall under the table-level directory (see the sketch below).
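A small sketch of that parent/child check, with illustrative helper names: the table always gets an entry, and a partition gets its own entry only when its location does not live under the table's directory.

```python
# Illustrative check: emit an _external_tables_info entry per table, and per
# partition only when the partition's location falls outside the table directory.
import posixpath

def is_under(child, parent):
    child, parent = posixpath.normpath(child), posixpath.normpath(parent)
    return child == parent or child.startswith(parent + "/")

def entries_for(table, table_loc, partition_locs):
    yield table, table_loc                        # one entry covers the table directory
    for part, loc in partition_locs.items():
        if not is_under(loc, table_loc):          # only "outlier" partitions get entries
            yield f"{table}.{part}", loc

print(list(entries_for(
    "events",
    "/data/external/events",
    {"dt=2019-01-01": "/data/external/events/dt=2019-01-01",   # child: skipped
     "dt=2018-12-31": "/archive/events/dt=2018-12-31"})))      # elsewhere: gets an entry
```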
Closing Statement
Hopefully, this capability will make it easier for users to adopt Hive replication for their use cases. We would love to hear about your use cases and any other performance problems you have encountered with Hive replication, so we can make it better.