Apache Hive Beeline : Progress Bar

anishek agarwal
4 min read · Mar 22, 2017

Apache Hive needs no introduction for anyone working in the big data space. It's the default go-to SQL-on-Hadoop solution used in most enterprises. At a high level, it comprises the following components:

  • Metastore RDBMS DB: This stores all the metadata about the various databases, functions, etc. It's the equivalent of the system tables in any RDBMS server, which hold the metadata about the database schema itself. This information is stored in an RDBMS such as MySQL or Postgres.
  • Execution component: This component is responsible for supporting Hive's query syntax, building the query plans, talking to the actual execution framework (MR, Tez, etc.) to execute the plan, and showing the results to the user. There are two ways of running it: hive-cli (legacy mode) or HiveServer2 + Beeline (preferred mode).

I will concentrate on the latter, as it makes the system easier to manage for both administrators and users.

HiveServer2: This component runs as a server providing all Hive capabilities, exposing the functionality via a Thrift interface (binary + transport).

Beeline: This is the JDBC client implementation that talks to HiveServer2 to allow users to interact with the system.

Since HiveServer2 tries to provide a SQL interface for big data, it also tries to provide a similar user experience when interacting with the system. The RDBMS was the established storage framework that allowed many organizations to build useful applications. Since a lot of existing users are comfortable with one RDBMS or another, Hive tries to leverage that know-how to let them interact with it easily, which is why it provides a JDBC connector to talk to HiveServer2.

Since the JDBC API was not designed for long-running queries over terabytes of data, it does not address all the requirements for a seamless user experience. For example, in the RDBMS world, once a query is sent for execution via a JDBC client, the user expects it to return results within a second or so; if it takes longer, that most probably indicates that the data layout is wrong or the query is written incorrectly. This is not true for queries that run on Hive. They run on large datasets, and it's not always possible to define the most appropriate data layout: new data sources keep being onboarded, the quantities of data are large, and the use cases the system has to solve are not restricted, so most of the data is stored after only a very basic ETL process.

Hence the need for the Beeline client to talk to HiveServer2 using specific APIs outside the scope of the JDBC API, to allow for a better user experience. Beeline currently provides this information by printing minimal, relevant logs about the query execution. However, the user has to read those logs to work out how far the query has progressed, which is again time consuming. hive-cli already had a solution for this in the form of a progress bar printed while the query runs.

This shows a lot of relevant data that is easily understood by the user without knowing too many details about the system, so the next logical step for Beeline was to emulate a similar user experience. This feature is now complete and available in the Apache Hive master branch, and it should be available to users in the next release, slated to be 2.2.0. The progress bar functionality is controlled via the following configurations on HiveServer2:

set hive.async.log.enabled=false;

The above is required because most of the information needed for showing the progress bar is made available in the user session, and async logging for now does not allow access to session-level objects. This is a server-side property (changing it requires a restart of the server), even though a user session is allowed to set a value for it.

set hive.execution.engine=tez;

For now, the progress bar is only supported if the underlying Hive execution engine is tez; it's not available for “mr” and “spark”. When using the tez execution engine, however, it's available for both “llap” (the new in-memory execution engine for BI applications with sub-second query response times) and “container” (which falls back to the mr style of execution), set via the hive.execution.mode Hive configuration. This is also a server-side property unless the administrator has configured the Hadoop infrastructure to work with both mr and tez.

set hive.server2.in.place.progress=true;

This allows the server to provide the information required to display the progress bar on the Beeline side. It is a session-level property and can be switched on or off per session.
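As a rough illustration of the session-level settings above, the following Python sketch builds a Beeline command line that passes them with Beeline's `--hiveconf` option. The helper name `beeline_command` and the host `hs2.example.com` are hypothetical, and whether `hive.execution.engine` can be overridden per session depends on how the administrator has set up the cluster, as noted earlier. `hive.async.log.enabled` is left out on purpose because it is a server-side setting.

```python
def beeline_command(jdbc_url, session_conf):
    """Build a beeline argument list with one --hiveconf flag per setting."""
    cmd = ["beeline", "-u", jdbc_url]
    for key, value in session_conf.items():
        cmd += ["--hiveconf", f"{key}={value}"]
    return cmd


# Hypothetical endpoint; replace with your own HiveServer2 host and port.
JDBC_URL = "jdbc:hive2://hs2.example.com:10000/default"

# Session-level settings discussed in the text.
SESSION_CONF = {
    "hive.execution.engine": "tez",
    "hive.server2.in.place.progress": "true",
}

print(" ".join(beeline_command(JDBC_URL, SESSION_CONF)))
```

The returned list can be handed to `subprocess.run` as is; keeping it as a list avoids shell quoting problems with the JDBC URL.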

The server log4j configuration should have the “INFO” log level configured, since this is the bare minimum log level required for the logging framework at the backend to write progress logs. The logging framework is used to channel this information from within the HiveServer2 code to a log file, which is then made available via the FetchResults method of the Hive Thrift API. The logger subsystem configured to provide this ability is internally referred to as OperationLogs. It essentially creates a log file per session, using the session_id in the log file path, and writes session-specific logs to that file. The OperationLogs location can be configured in HiveServer2 using

set hive.server2.logging.operation.log.location=[root_path]
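To make the idea of a per-session log file that is served incrementally more concrete, here is a minimal Python sketch. It is not Hive's implementation: the path layout in `operation_log_path`, the file name `operation.log`, and the helper `read_new_lines` are all assumptions for illustration. It only shows how a reader can repeatedly pick up the lines appended since its last fetch, which is roughly what a client sees when it keeps asking the server for new log output.

```python
import os


def operation_log_path(root_path, session_id):
    # Assumed layout for illustration only: <root_path>/<session_id>/operation.log
    return os.path.join(root_path, session_id, "operation.log")


def read_new_lines(log_path, offset):
    """Return (lines, new_offset) for complete lines appended after `offset`."""
    if not os.path.exists(log_path):
        # No log file: nothing to show, which is the silent case for the client.
        return [], offset
    with open(log_path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    last_newline = data.rfind(b"\n")
    if last_newline == -1:
        # Only a partially written line so far; wait for the next fetch.
        return [], offset
    complete = data[: last_newline + 1]
    return complete.decode("utf-8").splitlines(), offset + len(complete)
```

A caller would keep the returned offset and pass it back on the next call, so each fetch returns only the log lines it has not seen yet.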

One potential issue to be careful about with hive.server2.logging.operation.log.location: the Linux user used to start HiveServer2 must have permission to write to the configured directory, otherwise no logs will be created and the progress bar will not be visible in Beeline. This is a silent failure on the HiveServer2 side, and it does not affect query processing.
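Because this failure is silent, it can be worth checking the directory up front. The sketch below is one possible check, assuming it is run as the same Linux user that starts HiveServer2; the function name `can_write_operation_logs` is hypothetical and not part of Hive.

```python
import os


def can_write_operation_logs(log_root):
    """Return True if the current user can create files under log_root."""
    # Creating a file in a directory needs both write and search (execute) permission.
    return os.path.isdir(log_root) and os.access(log_root, os.W_OK | os.X_OK)
```

Call it with the value configured for hive.server2.logging.operation.log.location. A result of `False` points to the missing-logs, missing-progress-bar situation described above.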
