Flink SQL Streaming Window TVF Aggregation Sinks Intermediate Results to JDBC Connector


Given the SQL below, I would expect that only the fully aggregated result (one row per hour) is written to the MySQL table.
However, a separate row is created for each intermediate aggregation result, i.e.

|eventDate|userId|responseTotalTime|
|---|---|---|
|2020-08-21 01:00:00|1|10|
|2020-08-21 01:00:00|1|20|
|2020-08-21 01:00:00|1|30|
|...|...|...|

I'm only interested in the latest (most recent) row.

Shouldn't a Flink window emit only one final result once it closes?

Thanks in advance,

CREATE TABLE test_source
(
    `eventDate` TIMESTAMP_LTZ(3),
    `userId` STRING,
    `responseTime` INT,
    WATERMARK FOR eventDate AS eventDate - INTERVAL '1' MINUTE
) WITH (
      'connector' = 'kafka',
     ...
);

CREATE TABLE IF NOT EXISTS test_sync
(
    `eventDate` STRING,
    `userId` STRING,
    `responseTotalTime` INT,
    PRIMARY KEY (`eventDate`, `userId`) NOT ENFORCED
 ) WITH (
           'connector' = 'jdbc',
          ...
 );
 
 
INSERT INTO test_sync
SELECT   
    CAST(window_start AS VARCHAR) AS eventDate,
    userId,
    SUM(responseTime) as responseTotalTime
FROM TABLE(TUMBLE(TABLE test_source, DESCRIPTOR(eventDate), INTERVAL '1' HOUR)) PA
GROUP BY window_start, userId;

p.s. After some further investigation and testing, I can conclude the following:

  1. If we GROUP BY window_start only, Flink treats this as a regular group aggregation: watermarks are effectively ignored, updates are emitted immediately, and the window is never closed. To avoid unbounded state growth we need to set the table.exec.state.ttl configuration option.
  2. If we GROUP BY window_start, window_end, a window aggregation is used: watermarks work as expected, and the window result is emitted only once the window closes.
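Based on finding 2 above, here is a sketch of the corrected INSERT (assuming the same table and column names as in the question) that adds window_end to the GROUP BY, so the planner recognizes it as a window TVF aggregation and emits a single final row per window:

```sql
-- Grouping by both window_start and window_end lets Flink's planner
-- treat this as a window aggregation: one final row is emitted per
-- window once the watermark passes window_end, instead of a stream
-- of intermediate updates.
INSERT INTO test_sync
SELECT
    CAST(window_start AS VARCHAR) AS eventDate,
    userId,
    SUM(responseTime) AS responseTotalTime
FROM TABLE(TUMBLE(TABLE test_source, DESCRIPTOR(eventDate), INTERVAL '1' HOUR))
GROUP BY window_start, window_end, userId;
```

Note that window_end does not need to appear in the SELECT list; it only needs to be part of the GROUP BY key for the window aggregation to kick in.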

However, this is not explicitly mentioned anywhere in the Flink documentation. I wish it were better documented.

I still wonder: is this the expected behavior (for option 1), or a possible defect in Flink?
