I work with PostGIS and the data is quite large, around 1.5 billion rows, I have a grid table along with its administrative areas, I need to get a value which is the number of residents in a particular administrative area to carry out further calculations. The steps I took were to create a sum_residents column and fill it with the aggregate number of residents grouped by administrative area name (adm_col1, adm_col2, adm_col3)
Query I ran:
UPDATE table_grid AS gd
SET sum_resident_grid = subquery.total_sum
FROM (
SELECT adm_col1, adm_col2, adm_col3, SUM(resident_grid) AS total_sum
FROM table_grid
GROUP BY adm_col1, adm_col2, adm_col3
) AS subquery
WHERE gd.adm_col1 = subquery.adm_col1
AND gd.adm_col2 = subquery.adm_col2
AND gd.adm_col3 = subquery.adm_col3;
Initially for relatively small data, the query was no problem, but when the data was very large, the query took a very long time, it could take more than a day. Is there a faster way to get the sum_resident_grid value?
Stage 2: After getting the sum_resident_grid, I actually use this value to carry out weighting calculations, where the formula is resident_grid/sum_resident_grid*actual_resident_value obtained by joining the administrative table, the query I ran:
UPDATE table_grid AS gd
SET
sum_income = ROUND(gd.resident_grid::float/gd.sum_resident_grid*de.actual_resident_value)::float/de.actual_resident_value*de.total_inco ,
sum_expend = ROUND(gd.resident_grid::float/gd.sum_resident_grid*de.actual_resident_value)::float/de.actual_resident_value*de.total_expe ,
resident_grid = ROUND(gd.resident_grid::float/gd.sum_resident_grid*de.actual_resident_value)
FROM table_administrative de
WHERE gd.sum_resident_grid !=0
AND gd.adm_code1 = de.adm_code1;
so is there any other way that can speed up the query that I am running? Thank in advance for your help!
This answer is mainly aiming to provide a general approach which can alleviate the need to do the costly full table aggregation/summation. Instead of doing your current query, you may consider keeping a running summation. Each time a new record is inserted into the
table_gridtable, you can update the running total. In Postgres, you would accomplish this via a trigger:In other words, for every insert to the
table_gridtable, we keep a running total of theresident_gridfor each(adm_col1, adm_col2, adm_col3)tuple. To access these sums, you may simply query thetable_grid_totaltable at any time.