Given the below SQL file I would expect that only fully aggregated result (for each hour) is written to MySQL db table.
However, a separate row is created for each intermediate aggregation result...
i.e.
|eventDate|userId|responseTotalTime|
|2020-08-21 01:00:00|1|10
|2020-08-21 01:00:00|1|20
|2020-08-21 01:00:00|1|30
....
I'm only interested in the latest (most actual row)..
Shouldn't Flink Window only emit one final result once it's closed?
Thanks in advance,
CREATE TABLE test_source
(
`eventDate` TIMESTAMP_LTZ(3),
`userId` STRING,
`responseTime` INT
WATERMARK FOR eventDate AS eventDate - INTERVAL '1' MINUTE
) WITH (
'connector' = 'kafka',
...
);
CREATE TABLE IF NOT EXISTS test_sync
(
`eventDate` STRING,
`userId` STRING,
`responseTotalTime` INT
PRIMARY KEY(`eventDate`, `userID`) NOT ENFORCED
) WITH (
'connector' = 'jdbc',
...
);
INSERT INTO test_sync
SELECT
CAST(window_start AS VARCHAR) AS eventDate,
userId,
SUM(responseTime) as responseTotalTime
FROM TABLE(TUMBLE(TABLE test_source, DESCRIPTOR(eventDate), INTERVAL '1' HOUR)) PA
GROUP BY window_start, userId;
p.s. After some further investigations and testing I can conclude the following:
- If we GROUP BY window_start only - then this is a typical Group Aggregation function from Flink perspective, watermarks are not really working, updates are immediate and window is never closed. To avoid memory issues we need to set
table.exec.state.ttlconfiguration. - If we GROUP BY window_start, window_end - then Window Aggregation is used, watermarks work as expected and window content is emitted only when the window is closed.
However, this is not explicitly mentioned in Flink documentation anyhwere.. Wish it was better documented.
Still doubt - is it the expected behavior (for option 1) or a possible defect in Flink?