Quick note about Hive and Presto

I am working to build application using Treasuredata. Treasuredata provide Hive or Presto to execute jobs. So here are my mote for those two.

Hive is a program to manage big data, built on top of Hadoop.

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop.


How Hive works

Hive translates SQL queries into multiple stages of MapReduce and it is powerful enough to handle huge numbers of jobs (Although as Arun C Murthy pointed out, modern Hive runs on Tez whose computational model is similar to Spark’s). MapReduce is fault-tolerant since it stores the intermediate results into disks and enables batch-style data processing. Many of our customers issue thousands of Hive queries to our service on a daily basis. A key advantage of Hive over newer SQL-on-Hadoop engines is robustness: Other engines like Cloudera’s Impala and Presto require careful optimizations when two large tables (100M rows and above) are joined. Hive can join tables with billions of rows with ease and should the jobs fail it retries automatically. Furthermore, Hive itself is becoming faster as a result of the Hortonworks Stinger initiative.

How Presto Works

Presto is a SQL engine, built on top of Hadoop.

In some instances simply processing SQL queries is not enough—it is necessary to process queries as quickly as possible so that data scientists and analysts can use Treasure Data for quickly gaining insights from their data collections. For these instances Treasure Data offers the Presto query engine. Presto is an in-memory distributed SQL query engine developed by Facebook that has been open-sourced since November 2013.Presto has been adopted at Treasure Data for its usability and performance.

Presto versus Hive: What You Need to Know

Use Hive for batch – routine jobs. Use Presto to fetch smaller (simpler) data.