There is a lot of hype around Azure Databricks, and we must say it is probably deserved. It puts the Spark in-memory engine to work for you without much effort, with a decent amount of polish and clusters that scale in a few clicks.
We also have to remember that Spark is a somewhat old horse in the zoo, as it has been available in Azure HDInsight for a long time now.
This post intends to shed some light on the integration of Azure Databricks with the Azure HDInsight ecosystem, as customers tend not to understand the “glue” between all these different Big Data technologies.
1 – If you use Azure HDInsight or any Hive deployment, you can use the same “metastore”.
One of the great things about the Apache Hive project (not everything about the metastore is great, btw) is the metastore, which is basically a relational database that stores all the Hive metadata: tables, partitions, statistics, column names, data types, etc.
Azure Databricks can use an external metastore, so Spark-SQL can query both the metadata and the data itself. You need to take care of 3 different types of parameters:
- Connect to the external metastore (spark.hadoop.javax.jdo.option.ConnectionDriverName, ConnectionURL, ConnectionUserName, ConnectionPassword)
- Prevent Databricks from updating your Hive metastore schema (hive.metastore.schema.verification.record.version = true, hive.metastore.schema.verification = true)
- Give access to the Azure Data Lake Store or Azure Blob Storage account that contains your Hive data.
PS: This means that the same scaling issues you might have with the Hive metastore will also be present when Databricks accesses it.
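To make the first two parameter groups more concrete, here is a minimal sketch in Python that assembles them into the "key value" lines a cluster's Spark config usually expects. The function names, the placeholder JDBC URL and the credentials are our own illustration, and the SQL Server driver class is only an assumption about where your metastore database lives. Storage access (the third group) depends on your account type and is not covered here.

```python
from typing import Dict

# Prefix used by the connection options named in this post.
JDO_PREFIX = "spark.hadoop.javax.jdo.option."


def external_metastore_conf(url: str, user: str, password: str, driver: str) -> Dict[str, str]:
    """Collect the metastore connection and schema-protection settings."""
    return {
        JDO_PREFIX + "ConnectionURL": url,
        JDO_PREFIX + "ConnectionUserName": user,
        JDO_PREFIX + "ConnectionPassword": password,
        JDO_PREFIX + "ConnectionDriverName": driver,
        # Schema-protection keys, written exactly as listed above.
        "hive.metastore.schema.verification": "true",
        "hive.metastore.schema.verification.record.version": "true",
    }


def to_spark_config_lines(conf: Dict[str, str]) -> str:
    """Render one 'key value' pair per line (assumed Spark config text format)."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    # Every value below is a placeholder, not a real endpoint or credential.
    conf = external_metastore_conf(
        url="jdbc:sqlserver://<server>.database.windows.net:1433;database=<metastore-db>",
        user="<user>",
        password="<password>",
        driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    )
    print(to_spark_config_lines(conf))
```

In practice you would not paste a real password into the cluster configuration; treat the output as a template to fill in through whatever secret handling your workspace offers.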
2 – Use and abuse Spark-SQL on top of “Hive” tables
Here is a list of things you can do with Spark-SQL on top of your Hive tables: “almost everything” 🙂
That is, you can run any type of query that you would run on Azure HDInsight with Hive, with four important exceptions:
- Updates to ACID tables are not supported by Spark-SQL
- Writing to Hive bucketed tables is not supported
- Automatic fetching of column statistics is not supported by Spark-SQL
- Automatic merging of small files is not supported either.
We’d say that most of these cases are rare, except maybe for some ACID tables used for SCD (slowly changing dimensions).
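As an illustration of the "almost everything" case, here is a small sketch of a plain read-only aggregate over a Hive table through Spark-SQL. The table and column names are hypothetical, and we assume a SparkSession with Hive support is passed in (in a Databricks notebook that is normally the predefined `spark` variable).

```python
def daily_totals_sql(table: str, date_column: str, amount_column: str) -> str:
    """Build a read-only aggregate query; all identifiers are caller-supplied."""
    return (
        f"SELECT {date_column}, SUM({amount_column}) AS total "
        f"FROM {table} "
        f"GROUP BY {date_column} "
        f"ORDER BY {date_column}"
    )


def run_daily_totals(spark, table: str, date_column: str, amount_column: str):
    """Run the query on an existing session; the table is resolved via the metastore."""
    return spark.sql(daily_totals_sql(table, date_column, amount_column))
```

A call such as `run_daily_totals(spark, "sales.orders", "order_date", "amount").show()` would then print the result, assuming a table with that name and those columns exists in your metastore. Since this only reads data, none of the four exceptions above come into play.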
3 – Imagine that Spark ML (Spark Machine Learning library) can access your already deployed Hive tables
So, if you have deployed Azure HDInsight with a large number of Hive tables with correct data types and partitioning, and you are wondering how to leverage the Spark Machine Learning libraries, have no worries. Feeding Spark with the data you already store in Hive tables is easy and works as expected.
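A rough sketch of what that can look like with the PySpark ML API follows. It assumes a Hive table whose feature columns are all numeric and a session with access to the metastore. The function names and the choice of a linear regression are ours, purely for illustration.

```python
from typing import List, Sequence


def pick_feature_columns(columns: Sequence[str], label: str, skip: Sequence[str] = ()) -> List[str]:
    """Keep every column except the label and any skipped ones (e.g. partition keys)."""
    excluded = {label, *skip}
    return [name for name in columns if name not in excluded]


def train_regression(spark, table: str, label: str, skip: Sequence[str] = ()):
    """Fit a linear regression on a Hive table (assumes numeric feature columns)."""
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    df = spark.table(table)  # table name is resolved through the Hive metastore
    features = pick_feature_columns(df.columns, label, skip)
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    regression = LinearRegression(featuresCol="features", labelCol=label)
    return Pipeline(stages=[assembler, regression]).fit(df)
```

For example, `train_regression(spark, "sales.orders", label="amount", skip=["order_date"])` would train on every remaining column of a hypothetical `sales.orders` table. The point is simply that the Hive table arrives as a regular DataFrame, so the usual Spark ML pipeline stages apply to it unchanged.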