Databricks SQL Query Never Executes (15+ hour runtime)

19 Views Asked by At

I am connecting to a database to run SQL queries that is hosted on Azure Databricks.

My cluster specs below:

enter image description here

I have multiple SQL queries in my notebook, all nearly identical in structure. I pull data from the database to form my "cohort" of information I am interested in, filter my table/temporary view (I have tried using both), and then run analyses (count/sum/min/max etc.) to get the calculations I need.

Here is an example of my code:

%sql
CREATE OR REPLACE TEMP VIEW temp_drc2_hg AS (
  WITH cte AS (
    SELECT 
        dr.person_id,
        dr.location_id,
        dr.fv_date,
        hgb.measurement_date,
        hgb.value_as_number
    FROM dr_cohort2 dr
    LEFT JOIN dr_cohort_hemoglobin_measurements_31224 hgb 
    ON dr.person_id = hgb.person_id
    WHERE hgb.measurement_date <= dr.fv_date -- only keep ones where the measurement date is before/on the first visit date
)
SELECT 
    person_id,
    MAX(location_id) as loc,
    ROUND(AVG(value_as_number), 1) AS avg_hgba1c,
    CASE 
        WHEN AVG(value_as_number) <= 6.5 THEN '<= 6.5'
        WHEN AVG(value_as_number) > 6.5 AND AVG(value_as_number) <= 8 THEN '6.5-8'
        WHEN AVG(value_as_number) > 8 AND AVG(value_as_number) <= 10 THEN '8-10'
        WHEN AVG(value_as_number) > 10 THEN '> 10'
        ELSE 'Missing'
    END AS avg_hgba1c_group
FROM cte
GROUP BY person_id
)

%sql
CREATE OR REPLACE TEMP VIEW drc2_hg AS ( -- fix patients switching sites
WITH cte AS (
  SELECT person_id,
    MAX(location_id) AS location_id,
    age
  FROM drc2_dem
  GROUP BY person_id, age
)

SELECT 
    n.person_id,
    n.avg_hgba1c_group,
    c.location_id AS location_id
  FROM cte AS c
  LEFT JOIN temp_drc2_hg AS n
    ON c.person_id = n.person_id
)

%sql
SELECT d.avg_hgba1c_group,
  COUNT(DISTINCT person_id) as num,
  (SELECT 
  COUNT(DISTINCT person_id) AS total
  FROM drc2_hg) As group_total,
  23230 - (SELECT 
  COUNT(DISTINCT person_id) AS total
  FROM drc2_hg) AS diff,
  (COUNT(DISTINCT person_id) / 23230)*100 as per
FROM drc2_hg d
GROUP BY d.avg_hgba1c_group, group_total, diff
ORDER BY d.avg_hgba1c_group

Each %sql denotes a separate cell in which I run the code block.

Each table/temporary view has anywhere from 10,000-60,000 rows.

I have 13-15 such sets of tables/temporary views like the above. The first 7 have run fine; I have also dropped tables/temporary views that I no longer need. I have checked my memory on my cluster and it shows that I have memory to spare (my max is 128 GBs):

enter image description here

The issue is with the 8th set of code cells. The code is identical, but now when I run it, the query stalls and never completes:

enter image description here

For instance, this code cell ran for over 9 hours.

I have tried CLEAR CACHE and spark.catalog.clearCache() to free up memory, but to no avail.

I don't understand why near identical code I ran before is no longer running when I have enough space on my cluster.

Another example is like this:

enter image description here

enter image description here

enter image description here

The table/temporary view I am running is smaller than any of my other tables/temporary views (less than 10,000 rows).

Yet when I checked the CPU and server distribution while it was running, it was much higher than the other tables/temporary views.

I need help troubleshooting why my cluster performance is so slow when 1) I have enough memory available and 2) previous code that is nearly identical in data size was able to run previously very quickly.

I've tried clearing my cache using the SQL command CLEAR CACHE;.

I also tried switching all my tables from permanent tables CREATE OR REPLACE TABLE to temporary views CREATE OR REPLACE TEMP VIEW. Both results never finish running.

I broke up SQL queries into individual cells by saving common table expressions as temporary views themselves and running it separately to reduce the workload per cell.

0

There are 0 best solutions below