MariaDB: optimizing a query for averaging


I have a database like this (in reality there are over 30 different sKey values):

+----+------+------+---------------------+
|ID  | sKey | sVal |      timestamp      |
+----+------+------+---------------------+
| 1  | temp |   19 | 2023-07-14 20:32:06 |
| 2  | humi |   60 | 2023-07-14 20:33:06 |
| 3  | temp |   20 | 2023-07-14 20:34:06 |
| 4  | humi |   65 | 2023-07-14 20:35:06 |
| 5  | pres | 1023 | 2023-07-14 20:36:06 |
| 6  | temp |   22 | 2023-07-14 20:37:06 |
| 7  | temp |   21 | 2023-07-14 20:38:06 |
| 8  | pres | 1028 | 2023-07-14 20:39:06 |
| 9  | temp |   20 | 2023-07-14 20:40:06 |
|10  | temp |   19 | 2023-07-14 20:43:06 |  <-time glitch
|11  | pres | 1022 | 2023-07-14 20:44:06 |
|12  | temp |   19 | 2023-07-14 20:45:06 |
|13  | humi |   66 | 2023-07-14 20:46:06 |
|14  | humi |   63 | 2023-07-14 20:47:06 |
|15  | temp |   19 | 2023-07-14 20:48:06 |
|16  | pres | 1029 | 2023-07-14 20:49:06 |
|20  | temp |   19 | 2023-07-14 20:50:06 | <- ID not consecutive (deleted records)
|21  | pres | 1022 | 2023-07-14 21:01:06 |
|22  | temp |   19 | 2023-07-14 21:02:06 |
|23  | pres | 1029 | 2023-07-14 21:03:06 |
+----+------+------+---------------------+

Now I want to get averages for each sKey (with the value in sVal). I have a working solution, but it is too slow. Currently I run 3 separate queries like this:

SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='temp'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)
SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='humi'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)
SELECT AVG(`sVal`), `timestamp` FROM `Test` WHERE sKey='pres'  AND timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06' GROUP BY FLOOR(TO_SECONDS(`timestamp`)/180)

As the database already holds over 2,000,000 records, a single query takes around 3 seconds. I assume that if I can somehow combine them into one query, with only the WHERE sKey=... part differing, the result should be faster. How can I improve this?

Or maybe my approach is wrong altogether. The desired result is the averaged sVal per sKey. The averaging interval can't be based on record numbers (ID), since some records may be deleted; even when the IDs are consecutive, some readings may be missing. So I think only an interval on the timestamp itself can be used. But I'm a novice in SQL and may be missing something.

The averaging interval (3 minutes in the example) may be any value.


3 Answers

Turo (accepted answer)

I don't understand why you don't group by sKey. What's wrong with:

SELECT AVG(`sVal`), `skey`, `timestamp` FROM `Test`
WHERE sKey in ('temp','humi','pres')  AND 
   timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06'
GROUP BY  FLOOR(TO_SECONDS(`timestamp`)/180), `skey`

And while MySQL/MariaDB is lenient about grouping and allows your construct (selecting timestamp even though it isn't in the GROUP BY), shouldn't it really be:

SELECT AVG(`sVal`), `skey`, FLOOR(TO_SECONDS(`timestamp`)/180) FROM `Test`
WHERE sKey in ('temp','humi','pres')  AND 
   timestamp between '2023-07-14 20:34:06' and '2023-07-14 20:51:06'
GROUP BY  FLOOR(TO_SECONDS(`timestamp`)/180), `skey`
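
The grouped expression FLOOR(TO_SECONDS(`timestamp`)/180) yields an opaque bucket number. If a readable bucket start time is wanted instead, one option (a sketch, using UNIX_TIMESTAMP/FROM_UNIXTIME rather than TO_SECONDS, which counts seconds from year 0) is:

```sql
-- Label each 180-second bucket with its start time instead of a raw bucket number.
-- UNIX_TIMESTAMP(...) DIV 180 * 180 rounds the epoch seconds down to the bucket start.
SELECT
    `sKey`,
    AVG(`sVal`) AS `avg_val`,
    FROM_UNIXTIME(UNIX_TIMESTAMP(`timestamp`) DIV 180 * 180) AS `bucket_start`
FROM `Test`
WHERE sKey IN ('temp', 'humi', 'pres')
  AND timestamp BETWEEN '2023-07-14 20:34:06' AND '2023-07-14 20:51:06'
GROUP BY `sKey`, UNIX_TIMESTAMP(`timestamp`) DIV 180;
```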
nbk

You can use a conditional AVG.

This query will benefit from an index on timestamp; you should also test whether a combined index on (sKey, timestamp) increases the speed further.

SELECT 
    -- NULL (not 0) for non-matching rows, so AVG ignores them instead of
    -- diluting the average with zeros
    AVG(IF(sKey = 'temp', `sVal`, NULL)) AS `temp`,
    AVG(IF(sKey = 'humi', `sVal`, NULL)) AS `humi`,
    AVG(IF(sKey = 'pres', `sVal`, NULL)) AS `pres`,
    MIN(`timestamp`) As `timestamp`
FROM
    `Test`
WHERE
    timestamp BETWEEN '2023-07-14 20:34:06' AND '2023-07-14 20:51:06'
GROUP BY FLOOR(TO_SECONDS(`timestamp`) / 180);
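
The index suggestions above can be sketched as follows (the index names are illustrative, not from the original answer):

```sql
-- Single-column index on the range-filtered column:
CREATE INDEX idx_test_timestamp ON `Test` (`timestamp`);

-- Combined index that also covers the sKey filter; compare EXPLAIN output
-- for both before deciding which one to keep:
CREATE INDEX idx_test_skey_timestamp ON `Test` (`sKey`, `timestamp`);
```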
user1191247

Assuming (timestamp, sKey) is unique, you will probably get a significant improvement by dropping that surrogate PK.

Check (timestamp, sKey) is unique:

SELECT IF(COUNT(*) = COUNT(DISTINCT timestamp, sKey), 'unique', 'not unique')
FROM Test;

Create new table and populate:

CREATE TABLE `Test2` (
  `sKey` varchar(4) NOT NULL,
  `sVal` smallint unsigned NOT NULL,
  `timestamp` datetime NOT NULL,
  PRIMARY KEY (`timestamp`,`sKey`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO Test2
SELECT sKey, sVal, timestamp
FROM Test;

As mentioned in my earlier comment, you have not included enough detail in your question to determine what other changes may be worthwhile. For example, if most of your queries do interval arithmetic on the timestamp column, and the number of rows meeting the criteria is significant, changing the timestamp column to a BIGINT storing the Unix timestamp will further improve performance.
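
A minimal sketch of that change (the table name Test3 and column name ts are assumptions for illustration; try it on a copy of the data first):

```sql
-- Store epoch seconds instead of DATETIME; grouping then needs no TO_SECONDS call.
CREATE TABLE `Test3` (
  `sKey` varchar(4) NOT NULL,
  `sVal` smallint unsigned NOT NULL,
  `ts` bigint unsigned NOT NULL,
  PRIMARY KEY (`ts`, `sKey`)
) ENGINE=InnoDB;

INSERT INTO Test3
SELECT sKey, sVal, UNIX_TIMESTAMP(`timestamp`)
FROM Test;

-- Grouping becomes plain integer arithmetic:
SELECT sKey, AVG(sVal), FROM_UNIXTIME(ts DIV 180 * 180) AS bucket_start
FROM Test3
WHERE ts BETWEEN UNIX_TIMESTAMP('2023-07-14 20:34:06')
             AND UNIX_TIMESTAMP('2023-07-14 20:51:06')
GROUP BY sKey, ts DIV 180;
```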

Again, if grouping large result sets, using a DIV b instead of FLOOR(a/b) can have a surprisingly significant impact.

Have a play with these to see what I mean:

SELECT BENCHMARK(1000000, FLOOR(TO_SECONDS('2023-02-01 13:00:00')/180));
SELECT BENCHMARK(1000000, TO_SECONDS('2023-02-01 13:00:00') DIV 180);

SELECT BENCHMARK(1000000, FLOOR(1675256400/180));
SELECT BENCHMARK(1000000, 1675256400 DIV 180);

The differences will be far more significant if your MariaDB server is resource constrained.