Using Spark, Hive and OpenMetadata
It might feel like one of those stories starting with “Once upon a time,” but not everyone is (yet) on the cloud. However, metadata has been and will always be there. We just need to learn how to make the most out of it.
While data aims to give organizations insights into their businesses, metadata helps teams ensure that their data is reliable and meaningful. Moreover, metadata allows us to close the gap between data and people, diving into topics such as data ownership, usage, and governance.
Metadata ingestion is the first step to unlocking this huge value chain. In this post, we will look at the past and prepare a local Spark application storing its data into an external Hive metastore. We will then use OpenMetadata to extract the generated metadata into a centralized platform.
We are going to follow the steps listed below, using updated versions.
The goal is to spin up Hive, which uses Hadoop underneath.
After downloading the packages, you can update your .bashrc file with the following exports. In my case, the files are located under ~/dev:

export HADOOP_HOME=~/dev/hadoop-3.3.4
# HIVE_HOME should point at your Hive install; apache-hive-3.1.3-bin is an assumption here
export HIVE_HOME=~/dev/apache-hive-3.1.3-bin
export PATH=$HIVE_HOME/bin:$PATH
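To make sure the exports are picked up, you can reload the shell and print both versions as a quick sanity check:

source ~/.bashrc
$HADOOP_HOME/bin/hadoop version
hive --version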
For Hadoop, we are going to follow the Pseudo-Distributed Operation from the official guide:
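In short, that guide has us point fs.defaultFS at hdfs://localhost:9000 in $HADOOP_HOME/etc/hadoop/core-site.xml (the same URL we will reference from Hive later):

<!-- $HADOOP_HOME/etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Then, format the filesystem once before the first start:

$HADOOP_HOME/bin/hdfs namenode -format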
Run $HADOOP_HOME/sbin/start-all.sh to spin up datanodes and namenodes locally:
~/dev/hadoop-3.3.4 ❯ $HADOOP_HOME/sbin/start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as pmbrull in 10 seconds.
Starting namenodes on [localhost]
Starting secondary namenodes [Peres-MBP.lan]
Note that this last step needs your user to SSH into localhost without a password. If you can't ssh localhost, then you need to run:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
If you are a macOS user, you might need to enable Remote Login in System Preferences.
Following the setup guide, we are going to create the /tmp and the metastore warehouse directories in HDFS with the necessary permissions to write data in them:
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir -p /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The next step will be to set up where Hive will store the metadata about the tables. For that, we will use a local MySQL instance and execute a schema migration to prepare the required database structure.
- Inside your MySQL instance, create a database and provide a user:
mysql> create database demo_hive;
Query OK, 1 row affected (0.01 sec)

mysql> create user APP identified by 'password';
Query OK, 0 rows affected (0.01 sec)

mysql> GRANT ALL PRIVILEGES ON demo_hive.* TO 'APP'@'%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.00 sec)
- Prepare your hive-site.xml file based on the provided template:
cp $HIVE_HOME/conf/hive-default.xml.template $HIVE_HOME/conf/hive-site.xml
- And edit $HIVE_HOME/conf/hive-site.xml with the following properties:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://127.0.0.1:3306/demo_hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>APP</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://localhost:9000/user/hive/warehouse</value>
</property>
Note that we are selecting the MySQL JDBC driver here. You can download it from the MySQL Connector/J downloads page and place it under $HIVE_HOME/lib, where Hive picks up extra jars.
- We can now run the schematool to populate the database with the configuration above:
$HIVE_HOME/bin/schematool -dbType mysql -initSchema
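If you want to double-check the migration, schematool can also report the schema version it finds using the same connection settings:

$HIVE_HOME/bin/schematool -dbType mysql -info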
- Finally, let’s start the metastore with:
$HIVE_HOME/bin/hive --service metastore
This command will start the Hive metastore, listening locally at thrift://localhost:9083 (the default port).
We will now use Spark 3.3.0 (pre-built for Hadoop 3.3+) to create and persist some data and explore how the different ingredients are involved.
Note that you can prepare this export as well:
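# Assuming Spark 3.3.0 pre-built for Hadoop 3 was also unpacked under ~/dev
export SPARK_HOME=~/dev/spark-3.3.0-bin-hadoop3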
The only requirement here is to point Spark to the external Hive metastore. We can do so by editing $SPARK_HOME/conf/hive-site.xml to look like this:
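<configuration>
  <!-- Assuming the metastore started above listens on the default port 9083 -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>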
Then, create a Spark shell pointing to the right metastore jars and the warehouse directory configured when setting up Hadoop:
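# A sketch of the invocation: spark.sql.hive.metastore.version must match the
# Hive version installed above, and the metastore jars are loaded from $HIVE_HOME/lib
$SPARK_HOME/bin/spark-shell \
  --conf spark.sql.hive.metastore.version=3.1.2 \
  --conf spark.sql.hive.metastore.jars=path \
  --conf spark.sql.hive.metastore.jars.path="$HIVE_HOME/lib/*.jar" \
  --conf spark.sql.warehouse.dir=hdfs://localhost:9000/user/hive/warehouse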
Once inside the shell, let’s materialize a sample DataFrame:
import spark.implicits._

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

// Convert the DataFrame into a table we can reach
// from within Spark SQL
someDF.createOrReplaceTempView("mytempTable")

// Materialize the table
spark.sql("create table some_df as select * from mytempTable");
Once these operations have run successfully, we can zoom into the architecture to verify what has happened:
- As we defined spark.sql.warehouse.dir to point to the HDFS path /user/hive/warehouse, we can see that data files have indeed appeared for our new table:
❯ $HADOOP_HOME/bin/hadoop fs -ls /user/hive/warehouse/some_df
Found 3 items
-rwxr-xr-x 3 pmbrull supergroup 6 2022-08-27 22:11 /user/hive/warehouse/some_df/part-00000-e99ca76a-477c-47d3-829a-c4aa4e03c3a3-c000
-rwxr-xr-x 3 pmbrull supergroup 9 2022-08-27 22:11 /user/hive/warehouse/some_df/part-00001-e99ca76a-477c-47d3-829a-c4aa4e03c3a3-c000
-rwxr-xr-x 3 pmbrull supergroup 10 2022-08-27 22:11 /user/hive/warehouse/some_df/part-00002-e99ca76a-477c-47d3-829a-c4aa4e03c3a3-c000
- As the Hive metastore is plugged into the MySQL instance, the metadata about the new table was created there as well:
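For instance, querying the TBLS table of the metastore schema we migrated earlier should list the new entry, with some_df showing up as a MANAGED_TABLE:

mysql> SELECT TBL_ID, TBL_NAME, TBL_TYPE FROM demo_hive.TBLS;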
In this last step, we will use OpenMetadata to ingest the metadata we just generated into this open-source metadata management solution.
Data, as well as metadata, is enjoyed the most in company.
The goal is to break metadata silos and bring the organization together, collaborating in a single place. The easiest way to get started is using the Local Docker Deployment.
After the setup is complete, we can go on and configure a DeltaLake service.
Note that while OpenMetadata is running inside Docker, the Hive metastore lives on our local machine. To reach the host's network, we need to write the URL as host.docker.internal:9083 instead of localhost:9083.
Following the steps to prepare an Ingestion Pipeline, we can regularly bring any changes on the metastore into OpenMetadata. After the first ingestion, the newly created table can be explored and shared.
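For reference, if you drive the ingestion through the CLI instead of the UI, the workflow boils down to a small YAML file. A minimal sketch, assuming the DeltaLake connector and the host URL above; field names vary between OpenMetadata versions, so double-check the connector docs:

# Sketch of an ingestion workflow config
source:
  type: deltalake
  serviceName: local_deltalake  # hypothetical service name
  serviceConnection:
    config:
      type: DeltaLake
      # the metastore settings may be nested differently in your version
      metastoreConnection:
        metastoreHostPort: host.docker.internal:9083
  sourceConfig:
    config:
      type: DatabaseMetadata
sink:
  type: metadata-rest
  config: {}
workflowConfig:
  openMetadataServerConfig:
    hostPort: http://localhost:8585/api
    authProvider: openmetadata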
Making our data discoverable is the first step when shifting towards Data Products. Data that is shared can be discussed and improved.
In this post, we have:
- Configured HDFS to store the data,
- Created an external Hive Metastore plugged into a MySQL instance,
- Generated some data using Spark,
- Ingested the metadata with OpenMetadata.
Managing the entire lifecycle of the data architecture allows us to better understand all the ingredients involved, in a world where cloud providers abstract most of the pieces and solutions almost seem like magic.