How to efficiently extract large dataset from Oracle to a file?


I have an Oracle server from which I need to extract data into files using Python. These files are used as inputs by downstream systems.

A few technical details: Oracle and Python run on different servers. The database is hosted by the client, while all scripts run on an AWS RHEL EC2 server. Details of the EC2 instance were attached as a screenshot (not shown here).

My approach: To do this I chose Python's pyodbc library to connect to the remote Oracle database and extract data using a SQL query. Below is an extract of the code that fetches the data for the SQL query provided.

def fetch_data_to_file(self,curr,query,dataset):
    try:
        self.logger.info('Executing query: {}'.format(query))
        start = time.time()
        curr.execute(query)
        query_time = time.time()-start
        start = time.time()
        rowcount=0
        with open(dataset,'a+') as f:
            writer = csv.writer(f,delimiter='|')
            writer.writerow([i[0] for i in curr.description])
            self.logger.info('Writing file: {}'.format(dataset))
            while True:
                rows = curr.fetchmany(self.batch_limit)
                if not rows:
                    break
                self.logger.info('Writing {} rows'.format(len(rows)))
                rowcount += len(rows)
                # write the fetched batch to the file
                writer.writerows(rows)

        self.timer.info('{} Query Execution Time: {} seconds'.format(dataset,query_time))
        self.timer.info('{} File Writing Time: {} seconds. at {} rows/second'.format(dataset,time.time()-start,int(rowcount / (time.time()-start))))

        self.logger.info("File written.")
    except Exception as e:
        self.error.info("Error in fetching data. Error: {}".format(e))
        raise SystemExit(1)

The dataset that I am extracting is close to 8 GB uncompressed (close to 35 million rows returned), and the code takes ~1.5 hours to download the file onto my EC2 server. I tested multiple values of batch_limit and found 1-2 million rows to be the optimal batch size, but I am unsure whether there is a more systematic way to figure out what my batch size should be.
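
One way to make the batch-size choice less of a guess is to time a bounded sample of the query with a few candidate sizes. This is only a rough sketch under my setup's assumptions: curr is an open pyodbc cursor and query is the extract SQL as above, and the candidate sizes and the 1-million-row cap are arbitrary.

import time

def rows_per_second(curr, query, batch_size, max_rows=1_000_000):
    # Fetch up to max_rows with the given batch size and report throughput.
    curr.execute(query)
    fetched = 0
    start = time.perf_counter()
    while fetched < max_rows:
        rows = curr.fetchmany(batch_size)
        if not rows:
            break
        fetched += len(rows)
    elapsed = time.perf_counter() - start
    return fetched / elapsed if elapsed else 0.0

for size in (10_000, 100_000, 500_000, 1_000_000, 2_000_000):
    print(size, int(rows_per_second(curr, query, size)), 'rows/sec')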

What else have I looked into: I searched online for ways to write large datasets to files using Python, and many suggested using pandas. I tried to figure that out but failed. Also, it is important that I preserve the data and its datatypes while extracting them to files.
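
For reference, this is roughly the chunked pandas approach I was attempting, as a hedged sketch: it assumes pandas can read from the existing pyodbc connection (newer pandas versions prefer a SQLAlchemy engine and will warn otherwise), and conn is the connection object that owns the cursor shown above.

import pandas as pd

# conn is the open pyodbc connection; query and dataset are as in the code above.
first_chunk = True
for chunk in pd.read_sql(query, conn, chunksize=500_000):
    # Append each chunk; write the header only once.
    chunk.to_csv(dataset, sep='|', index=False, mode='a', header=first_chunk)
    first_chunk = False

Note that CSV output is still plain text, so datatypes have to be re-applied downstream regardless of whether csv.writer or pandas does the writing.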

My ask here is: is there anything I can do to make this code more efficient? Is Python the best-suited language for this? (Please note, I need to be able to automate the jobs in whatever language I choose. Licensed libraries are difficult at this point due to internal pricing challenges at my firm.)

Also, not sure if this helps, but here is a snapshot of my memory usage while the code was downloading data (htop screenshot not shown here).


There are 3 best solutions below

Adam vonNieda (BEST ANSWER)

One possibility would be to download the free "SQLcl" utility, basically a Java-based SQL*Plus that does much more. Download here. What you can do with SQLcl is drop it on the client machine and use it to extract the data, while it takes care of the delimiter for you. Here I'm setting the delimiter to the pipe symbol. This may be more efficient than doing it via Python, and you'd still be able to script it and call it from Python, or wherever.

$ sqlcl username/password@'<hostname>:<port>/ORACLE_SID'
> Set sqlformat delimited |
> Spool <some file to hold the data>
> Select * from <some table>;
> Spool off

And the above can easily be dropped into a shell script.

#!/bin/bash

sqlcl username/password@'<hostname>:<port>/ORACLE_SID' <<EOF
Set sqlformat delimited |
Spool <some file to hold the data>
Select * from <some table>;
Spool off
EOF
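
And since you'd still be able to drive it from Python, a minimal sketch of calling the script from your existing jobs with subprocess (the script path below is a placeholder):

import subprocess

# Placeholder path to the wrapper script shown above.
subprocess.run(['/opt/scripts/extract_with_sqlcl.sh'], check=True)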

Example

sqlcl> Select * from emp;
"EMPNO"|"ENAME"|"JOB"|"MGR"|"HIREDATE"|"SAL"|"COMM"|"DEPTNO"
7839|"KING"|"PRESIDENT"||17-NOV-81|5000||10
7698|"BLAKE"|"MANAGER"|7839|01-MAY-81|2850||30
7782|"CLARK"|"MANAGER"|7839|09-JUN-81|2450||10
7566|"JONES"|"MANAGER"|7839|02-APR-81|2975||20
7788|"SCOTT"|"ANALYST"|7566|09-DEC-82|3000||20
7902|"FORD"|"ANALYST"|7566|03-DEC-81|3000||20
7369|"SMITH"|"CLERK"|7902|17-DEC-80|800||20
7499|"ALLEN"|"SALESMAN"|7698|20-FEB-81|1600|300|30
7521|"WARD"|"SALESMAN"|7698|22-FEB-81|1250|500|30
7654|"MARTIN"|"SALESMAN"|7698|28-SEP-81|1250|1400|30
7844|"TURNER"|"SALESMAN"|7698|08-SEP-81|1500|0|30
7876|"ADAMS"|"CLERK"|7788|12-JAN-83|1100||20
7900|"JAMES"|"CLERK"|7698|03-DEC-81|950||30
7934|"MILLER"|"CLERK"|7782|23-JAN-82|1300||10
Christopher Jones

A very fast solution for dumping data in CSV format is SQL*Plus 12.2's CSV mode. If you don't already have it, you can get SQL*Plus for free from the Instant Client packages.

Create a SQL script ex.sql like:

set feedback off
set arraysize 500
select * from locations;
exit

You can/should tune ARRAYSIZE for optimum performance.

Then invoke SQL*Plus with the -m 'csv on' option. This uses a new, fast I/O subsystem:

sqlplus -l -s -m 'csv on delim |' cj@'"localhost/orclpdb1"' @ex.sql

Note that the -s (silent) option will make the prompt for the password invisible.

The output will be like:

"LOCATION_ID"|"STREET_ADDRESS"|"POSTAL_CODE"|"CITY"|"STATE_PROVINCE"|"COUNTRY_ID"
1000|"1297 Via Cola di Rie"|"00989"|"Roma"||"IT"
. . .

My release announcement blog post has some more details.
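
If the job still needs to be driven from Python, a hedged sketch of wrapping this SQL*Plus invocation with subprocess and capturing its output into the target file (the credentials, host/service name, and file names below are placeholders):

import subprocess

# Placeholders: real credentials, host/service, script and output paths will differ.
cmd = ['sqlplus', '-l', '-s', '-m', 'csv on delim |',
       'app_user/secret@//dbhost:1521/ORCLPDB1', '@ex.sql']
with open('locations.dat', 'w') as out:
    subprocess.run(cmd, stdout=out, check=True)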

Paul C

Just a thought: create a Python method that is multithreaded, breaking your dataset into parts. Depending on your target database, these files can be read as external tables to eliminate one more step if you want to push the data into a table. A rough sketch is below.
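
This is only a sketch under assumed names: the connection string, table, and key column are placeholders, and the ORA_HASH-based predicate is just one way to split the rows evenly.

import concurrent.futures
import csv
import pyodbc

CONN_STR = 'DSN=ORACLE_PROD;UID=app_user;PWD=secret'   # placeholder connection string
N_PARTS = 4

def dump_partition(part):
    # Hypothetical partitioning predicate: any evenly distributed key works.
    query = ('SELECT * FROM big_table '
             'WHERE MOD(ORA_HASH(id_col), ?) = ?')
    conn = pyodbc.connect(CONN_STR)            # one connection per thread
    curr = conn.cursor()
    curr.execute(query, N_PARTS, part)
    with open('big_table_part{}.dat'.format(part), 'w', newline='') as f:
        writer = csv.writer(f, delimiter='|')
        # Each part file gets its own header; drop it if the parts are concatenated later.
        writer.writerow([c[0] for c in curr.description])
        while True:
            rows = curr.fetchmany(100_000)
            if not rows:
                break
            writer.writerows(rows)
    conn.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=N_PARTS) as pool:
    list(pool.map(dump_partition, range(N_PARTS)))

Because most of the time is spent waiting on the network and the ODBC driver, threads (rather than processes) are usually enough here; each thread just needs its own connection and cursor.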