Where in the Hive Metastore can the S3 locations of Databricks (Spark) tables be found?


I have a few Databricks clusters: some share a single Hive Metastore (HMS), call them PROD_CLUSTERS, and an additional cluster, ADHOC_CLUSTER, has its own HMS.

All my data is stored in S3 as Databricks Delta tables: PROD_CLUSTERS have read-write access to those S3 buckets, and ADHOC_CLUSTER has read-only access.

I need to set up ADHOC_CLUSTER so that it has a set of views on top of all the tables (as S3 paths) defined by the PROD_CLUSTERS.

For example, if one of the PROD_CLUSTERS created a table prod_schema.employees with LOCATION "s3://somebucket/employees", then in ADHOC_CLUSTER I would run (via a Databricks notebook):

CREATE VIEW live_views_prod_schema.employees
AS
SELECT * FROM delta.`s3://somebucket/employees`

What I tried

My thinking was to have a script in ADHOC_CLUSTER run a SELECT against the HMS of the PROD_CLUSTERS (i.e. connect over JDBC to the MySQL database backing that HMS and query the relational schema directly), retrieve all table names and S3 locations, and then programmatically issue the necessary CREATE VIEW statements in ADHOC_CLUSTER (sketched below).
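
For reference, this is roughly the script I had in mind, as a minimal PySpark sketch run from an ADHOC_CLUSTER notebook. The MySQL host and database, the secret scope used for credentials, the prod_schema filter and the live_views_prod_schema target schema are all placeholders/assumptions, and it assumes the standard HMS relational layout (TBLS joined to DBS on DB_ID and to SDS on SD_ID):

# Sketch only: read table names and locations from the PROD_CLUSTERS' HMS
# over JDBC, then create one view per table in ADHOC_CLUSTER.
# <hms-host>, <hms-db> and the "hms" secret scope are placeholders; a
# MySQL-compatible JDBC driver must be available on the cluster.

hms_query = """
    SELECT d.NAME AS schema_name, t.TBL_NAME AS table_name, s.LOCATION AS location
    FROM TBLS t
    JOIN DBS d ON t.DB_ID = d.DB_ID
    JOIN SDS s ON t.SD_ID = s.SD_ID
    WHERE d.NAME = 'prod_schema'
"""

tables = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://<hms-host>:3306/<hms-db>")
          .option("query", hms_query)
          .option("user", dbutils.secrets.get("hms", "user"))        # assumed secret scope/keys
          .option("password", dbutils.secrets.get("hms", "password"))
          .load())

# Target schema for the views in ADHOC_CLUSTER's own HMS.
spark.sql("CREATE SCHEMA IF NOT EXISTS live_views_prod_schema")

for row in tables.collect():
    spark.sql(f"""
        CREATE OR REPLACE VIEW live_views_prod_schema.{row.table_name}
        AS SELECT * FROM delta.`{row.location}`
    """)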

I imagined this would be simple, since in the HMS the SDS table should hold the location (s3://somebucket/employees) in the SDS.LOCATION column for every Delta table created by the PROD_CLUSTERS (and for some JSON-based tables). The problem is: in SDS I only find the correct location for some of the tables.

For other tables, the S3 path listed in the HMS's SDS.LOCATION points to the schema's default bucket (even though the table was created in a different bucket), with a path that looks like

s3a://<default-bucket>/prod_schema.db/employees-__PLACEHOLDER__

I couldn't get past this, nor could I find the actual data file locations anywhere else in the HMS.

My question

So, the question is: how can one query the HMS to get the full path of the data files for all tables defined in that HMS?

Thanks!
