Aside from doing a direct match on something like a whitespace normalized hash of a query, what might be a useful (but-not-necessarily-perfect) way to handle query cache in a partial manner? For example, let's take the following basic case:
SELECT
Product, # VARCHAR
Revenue # DOUBLE
FROM
Sales
WHERE
Country='US'
This potentially could be used as a 'base-cache' upon which a further query could be executed to potentially improve performance:
SELECT
Product, # VARCHAR
Revenue # DOUBLE
FROM
Sales
WHERE
Country='US' AND State='CA'
So, assuming the data in the from table(s) don't change, the following might serve as a starting point for determining cache:
- fields: [field:type, ...] // can be less but not more
- from: hash of table(s)+joins
- filters: [filter1, filter2, ...] // can be less but not more
- aggregations: [agg1, agg2, ...] // can be less but not more
- having: [having1, having2, ...] // can be less but not more
- order+limit+offset if limited result-set // can be less but not more
However, this becomes quite tricky when we think about something like the following case:
SELECT
ProductGroup AS Product, # Would produce a Product:VARCHAR hash
Revenue
FROM
Sales
WHERE
Country='US'
What might be a realistic starting point for how a partial- query cache could be implemented.
Use case: writing SQL to query data in a non-DBMS-managed source, such as a CSV file which will take ~20s or so to issue any query and we cannot create indexes on the file. https://en.wikipedia.org/wiki/SQL/MED or Spark-like.
I think the following might be a good starting place for a basic cache implementation that allows the usage of a cache that can be further queried for refinements:
Here would be an example of how the data might look saved:
Now let's try a few examples, I will use human-readable hashes for easier readability:
Does the above seem like a valid initial implementation?