Apache Hive Replication: Overview of changes in ACID
Earlier I had published a post introducing what we have been doing with Replication V2 in Apache Hive. In this post I am going to provide a very brief overview of couple of challenges we faced to support replication of ACID tables in Apache Hive. There was no change done to the SQL command syntax of Replication, to support ACID tables.
First a brief introduction to ACID tables in hive. The minimal set of configurations to be enabled to have ACID tables in hive are ( these are for the HiveServer2 + beeline mode rather than the older Hive Cli mode):
HiveServer 2 (all instances in case of HA)
hive.support.concurrency — true
hive.enforce.bucketing — true (Not required as of Hive 2.0)
hive.exec.dynamic.partition.mode — nonstrict
hive.txn.manager — org.apache.hadoop.hive.ql.lockmgr.DbTxnManagerHiveMetastore
hive.compactor.initiator.on — true
hive.compactor.worker.threads — a positive number on at least one instance of the Thrift metastore service
A more detailed set of configurations can be changed using this doc. There are two types of ACID tables created in hive,
- Full Transactional tables: These provide full ACID semantics for all insert / update / delete operations. These tables currently require the file format to be ORC.
- Insert only Tables(MicroManaged): These tables provide ACID semantics, but allow only Insert operations. These tables can store data in any format.
Concurrent access to ACID tables in hive is provided by having the MVCC architecture in the storage layer. This requires storing additional info per row per table in the storage layer, called as ROW_ID. This currently contains:
- original_transaction_id: The original transaction_id that created this row.
- bucket_id: bucket that this row belongs to.
- row_id: current row number of this row in this bucket.
- current_transaction_id: The current transaction_id that has modified this row
Among these the first three <original_transaction_id, bucket_id, row_id> form the primary key for this row, which means there are no two rows in the ORC files for a given table, which have the same value of the primary key. The transaction_id in hive is a global sequence number for the warehouse, used to provide mutually exclusive, monotonically increasing, transaction_id for every new transaction. This allows transactions to be concurrent across databases within the same warehouse ( MVCC ).
In context of ACID replication, the use cases we had to support were.
- Ability to replicate a source database (can contain ACID / non-ACID tables) to target database.
- The target warehouse can have multiple database’s, with some of them being part of replication, while others being just databases used on the target warehouse.
- Ability to run read workloads on replicated ACID tables on target.
- Allow ability to build fail-over + fail-back capability in future for hive.
Challenge 1 (most complex)
Supporting all the above with the existing implementation of ACID, with a global transaction_id in hive for ACID tables was not feasible. There was no way to sync / map these ids reliably across different warehouse’s where currently even a read operation (which by default will run within a transaction on a Hive warehouse enabled for ACID tables) can result in creation of new values of transaction_id.
To allow replication sub system to tag the data correctly on both source/target warehouse’s to provide ACID semantics, we had to make sure that the transaction semantics, per table level, was independent with transaction semantics from other tables. This however had to also couple with the global transaction management implementation across the warehouse to provide ACID semantics. This would provide, replication, the ability to have the same primary key for same data on both source and target, which was essential to achieve all the goals for ACID replication. This lead us to changing the ACID semantics in hive by introducing the concept of a transaction_id per table, per database(write_id) in hive which provided transaction semantics to be independent per table in a database. To continue supporting ACID, we now had to have additional mapping of the global transaction_id to table level write_id. There were few other changes like additional mapping tables, changing the structure of valid transaction list definition per transaction(this allows recently committed / aborted transactions, visible to the to the current transaction, since the acid housekeeping work happens separately) ,its serialization / de serialization.
Challenge 2
As part of replication as i had discussed in my previous post, one of the goals of replication has been to prevent redundant resource usage, we wanted to extend the same to ACID table replication as well.
We do point in time replication, by depending on Events ( log stream of changes ) generated for changes on the source warehouse. The same event stream was enhanced to now have events for Open/Commit/Abort Transaction, allocation of write_id etc. A single transactional operation on hive would generate multiple events and the replication sub system in hive should be able to replicate events without understanding the transaction boundaries.
Currently hive only supports single statement transaction. This however was a limitation of hive and there is going to be future work to allow multi-statement transactions in hive. Since replication works from a event stream without understanding transaction boundaries, we wanted to minimize the amount of data we copied, if the transaction was aborted, in future. This was achieved my moving all the data copy operations to be performed for ACID tables to be done as part of the COMMIT TRANSACTION event and not as part of the regular INSERT / UPDATE events when we run the REPL LOAD command.