I have a few Databricks clusters. Some share a single Hive Metastore (HMS); call them PROD_CLUSTERS. There is also an additional cluster, ADHOC_CLUSTER, which has its own HMS.
All my data is stored in S3 as Databricks Delta tables: PROD_CLUSTERS have read-write access on those S3 buckets, and ADHOC_CLUSTER has read-only access.
I need to set up ADHOC_CLUSTER so that it has a set of views on top of all the tables (as S3 paths) defined by the PROD_CLUSTERS.
For example, if one of the PROD_CLUSTERS created a table prod_schema.employees with LOCATION "s3://somebucket/employees", then in ADHOC_CLUSTER I would run (via a Databricks notebook):
CREATE VIEW live_views_prod_schema.employees
AS
SELECT * FROM delta.`s3://somebucket/employees`
What I tried
My plan was to have a script in ADHOC_CLUSTER run a SELECT against the HMS of the PROD_CLUSTERS (i.e., JDBC directly to the MySQL database backing that HMS), fetch all table names and S3 locations, and then programmatically issue the necessary CREATE VIEW statements in ADHOC_CLUSTER.
I imagined this would be simple, since in the HMS the SDS table holds the location (s3://somebucket/employees) in the SDS.LOCATION column for every Delta table created by the PROD_CLUSTERS (plus some JSON-based tables).
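For reference, here is roughly what I had in mind, as a sketch only: DBS, TBLS, and SDS are the standard Hive metastore tables, but the connection details and the live_views_ prefix are illustrative placeholders, not my real setup.

```python
# Sketch: generate CREATE VIEW DDL from rows pulled out of the Hive
# metastore's MySQL database. DBS/TBLS/SDS are the standard HMS tables;
# host, credentials and naming below are placeholders.

HMS_QUERY = """
    SELECT d.NAME AS schema_name,
           t.TBL_NAME AS table_name,
           s.LOCATION AS location
    FROM TBLS t
    JOIN DBS d ON t.DB_ID = d.DB_ID
    JOIN SDS s ON t.SD_ID = s.SD_ID
"""

def make_view_ddl(schema_name: str, table_name: str, location: str,
                  prefix: str = "live_views_") -> str:
    """Build the CREATE VIEW statement for one table."""
    return (
        f"CREATE VIEW IF NOT EXISTS {prefix}{schema_name}.{table_name} "
        f"AS SELECT * FROM delta.`{location}`"
    )

# In the ADHOC_CLUSTER notebook this would be driven by Spark, e.g.:
#   rows = (spark.read.format("jdbc")
#           .option("url", "jdbc:mysql://<hms-host>:3306/<hms-db>")
#           .option("query", HMS_QUERY)
#           .option("user", "...").option("password", "...")
#           .load().collect())
#   for r in rows:
#       spark.sql(make_view_ddl(r.schema_name, r.table_name, r.location))
```

(The target schemas, e.g. live_views_prod_schema, would need a CREATE SCHEMA IF NOT EXISTS first.)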
The problem: in SDS I only find the correct location for some of the tables.
For other tables, the S3 path listed in SDS.LOCATION points to the schema's default bucket (even though the table was created in a different bucket), with a path like
s3a://<default-bucket>/prod_schema.db/employees-__PLACEHOLDER__
I couldn't get past this, nor find the actual data file locations anywhere else in the HMS.
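For what it's worth, these broken rows are easy to detect mechanically, since the literal __PLACEHOLDER__ marker appears in SDS.LOCATION; what I can't find is where the real path lives. A minimal check (the function name is mine, for illustration):

```python
def is_placeholder_location(location: str) -> bool:
    """True for SDS.LOCATION values holding the synthetic
    <default-bucket>/<schema>.db/<table>-__PLACEHOLDER__ path
    instead of the table's real data location."""
    return "__PLACEHOLDER__" in location
```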
My question
So the question is: how can one query the HMS to get the full data file path for every table defined in that HMS?
Thanks!