I need to find the fastest way of reading a large OpenEdge table (100 million rows plus), preferably programmatically (in c#) and outside of ETL tools such as SSIS or staging formats such as text file extracts.
I'm currently using ODBC (driver: Progress OpenEdge 11.5) to query the OpenEdge 11.5 tables in batches using the OFFSET and FETCH modifiers
SELECT COL_1, COL_2
FROM PUB.TABLE_1
ORDER BY ROWID ASC
OFFSET {currentBatchStart} ROWS
FETCH NEXT {batchSize} ROWS ONLY
I'm querying via a system DSN with FetchArraySize: 25 and QueryTimeout: -1. And I'm connecting to an OpenEdge server group set up for SQL only access with message buffer size: 1024.
I'm finding the performance is poor (about 1 million records every 15 minutes) and I suspect it will only slow down as I advance through the table when using the OFFSET FETCH modifiers.
My question is are there any methods I can adopt or settings I can play with to tune the query performance?
- For example are there better ways of constructing my SQL query? e.g. should I order by columns in an index rather than ROWID?
- Should I increase the message buffer size on the sql server group
Or should I be looking at alternative methods to read the data out of the table?
Note: Each batch is subsequently sqlbulkcopy'ed into a SQL Server table
I'm not much on ODBC - from what I can make of your code this will have increasing performance issues as you get further down the table as you surmise.
My suggestion would to be to identify a unique index on that table and use that index's keys to determine what values to get next. Then your query becomes something like this:
Then the db engine can use your field values to find the offset and get the next values - this'll be much more performant than what you have now.
If this will be an ongoing concern 11.7 has Change Data Capture for logging data changes for replication elsewhere, and Progress sells the Pro2 tool to provide ongoing replication of data.