SQLAlchemy isn't batching rows / using server side cursor via `yield_per`


Following the documentation and the code snippet provided at https://docs.sqlalchemy.org/en/14/core/connections.html#streaming-with-a-fixed-buffer-via-yield-per (posted directly below), my query is not being batched into partitions of 50_000 rows.

with engine.connect() as conn:
    result = conn.execution_options(yield_per=100).execute(text("select * from table"))

    for partition in result.partitions():
        # partition is an iterable that will be at most 100 items
        for row in partition:
            print(f"{row}")

Here I am writing the results of a SQL query to a CSV file:

with engine.connect() as conn: # connect to database
    with open(csv_path, mode='w', newline='') as f:
        
        c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        c.writerow(columns) # write column headers

        result = conn.execution_options(yield_per=50_000).execute(sql_string) # execute query that results in 110k rows, expecting batches of 50k rows
        for partition in result.partitions():
            for row in tqdm.tqdm(partition): # each partition only has 1 row, when I expect 50k rows
                c.writerow(row)

This sql_string results in 110k rows, so I expect 3 iterations of result.partitions(); instead, I am seeing 110k iterations. I fear this is equivalent to DDoS-ing my own SQL Server database.
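Printing the length of each partition (same query and execution options as above) confirms what the per-row tqdm counter suggests; this is just a quick diagnostic, not part of the CSV export:

    with engine.connect() as conn:
        result = conn.execution_options(yield_per=50_000).execute(sql_string)
        for i, partition in enumerate(result.partitions()):
            # Expected: up to 50_000 rows per partition; observed: 1 row each
            print(f'partition {i}: {len(partition)} rows')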

I've also tried doing this without partitions(), and the same thing happens - no batching.

    with engine.connect() as conn:
        with open(csv_path, mode='w', newline='') as f:
            c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
            c.writerow(columns)
            for row in tqdm.tqdm(conn.execution_options(yield_per=50_000).execute(sql_string)): # I expect 3 iterations, but I get 110k.
                c.writerow(row)

The server seems to handle this OK (I get 22k iterations/second), but I still wonder: am I DDoS-ing myself?

I am using SQLAlchemy 1.4.42 (pip install --upgrade 'sqlalchemy==1.4.42') and Microsoft SQL Server 2019 - 15.0.X (X64).
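The engine creation isn't shown above; a minimal sketch of the kind of setup assumed here (mssql+pyodbc dialect, placeholder credentials, host, database, and ODBC driver name) would be:

    from sqlalchemy import create_engine

    # Hypothetical connection URL; substitute real credentials, host, database and driver
    engine = create_engine(
        'mssql+pyodbc://user:password@host/mydb?driver=ODBC+Driver+17+for+SQL+Server'
    )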

Accepted answer (by scrollout):

The yield_per argument was having no effect in execution_options. In the example query snippets I posted in my question, fetchone() gets called N times, where N is the query result's row count. I discovered these fetchone() calls by reading the traceback after hitting Ctrl-C during an execution. That's a lot of server calls.

Instead of concerning myself with yield_per, I was able to achieve pagination by using fetchmany(batch_size).

    with engine.connect() as conn:
        print('Connected!...')
        with open(csv_path, mode='w', newline='', encoding='utf-8') as f:
            c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
            c.writerow(columns)
            result = conn.execute(sql_string)
            counter = 1
            while True:
                print(f'Querying batch {counter}...')
                rows = result.fetchmany(50_000) # batch size
                print(f'Contains {len(rows)} rows...')
                if not rows:
                    break
                print(f'Writing batch {counter} to CSV...')
                for row in rows:
                    c.writerow(row)
                counter += 1
            result.close()

I don't understand the purpose of yield_per in SQLAlchemy 1.4.42, or why the SQLAlchemy docs don't mention using fetchmany() instead. Oh well.
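For anyone reusing this, the fetchmany() loop can be pulled into a small generator. This is just a sketch of the same approach as above; batched_rows is my own helper name, not something from SQLAlchemy:

    import csv

    def batched_rows(result, batch_size=50_000):
        """Yield lists of up to batch_size rows from a SQLAlchemy Result."""
        while True:
            rows = result.fetchmany(batch_size)
            if not rows:
                break
            yield rows

    with engine.connect() as conn, open(csv_path, mode='w', newline='', encoding='utf-8') as f:
        c = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        c.writerow(columns)
        result = conn.execute(sql_string)
        for batch in batched_rows(result):
            c.writerows(batch)  # write the whole batch in one call
        result.close()

writerows() takes each batch in one call, which keeps the per-row Python overhead down compared to an inner loop.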