Efficiently migrate BLOBs from an Oracle table into S3 Glacier


Scenario: I have a live table with a few String columns (<10) and one BLOB column (holding the binary of a PDF). This is in an on-premises cloud, running Oracle 19c.

Each BLOB is about 2 MB on average, and the table is currently around 9 TB in size.

Requirement: The table has 10M rows in total. To save space in Oracle, I need to delete the BLOB values from 9M rows (selected by a boolean expiry flag column), preserve those BLOBs in an S3 Glacier vault, and update an archive_id column in the same table to hold the reference to the Glacier archive.
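For clarity, the per-row end state I need after a successful upload is essentially the following (a sketch using Spring's JdbcTemplate; the table and column names are placeholders, not my real schema):

// Sketch only: FILE_STORE, PDF_BLOB, ARCHIVE_ID, FILE_ID and EXPIRED are placeholder names.
// Uses org.springframework.jdbc.core.JdbcTemplate.
void clearBlobAndStoreArchiveId(JdbcTemplate jdbcTemplate, long fileId, String archiveId) {
    jdbcTemplate.update(
        "UPDATE FILE_STORE SET PDF_BLOB = EMPTY_BLOB(), ARCHIVE_ID = ? " +
        "WHERE FILE_ID = ? AND EXPIRED = 1",
        archiveId, fileId);
}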

My initial approach: I created a Java app using the AWS SDK Glacier client to upload the files and then update the archive_id column in my table with the archive_id received in the response.

I am currently executing it in batches of 500 records at a time to avoid Java heap errors. My current process works as follows:

// FileEntity has three fields: fileId, a byte[] field named pdf, and archiveId
void migrateBlobs() {
    List<FileEntity> files = jpaRepository.getNextBatch(); // batch of 500 items; takes around 16 seconds to fetch
    files.forEach(fileItem -> {
        String archiveId = uploadClient.uploadFile(fileItem.getPdf());
        fileItem.setArchiveId(archiveId);
        jpaRepository.save(fileItem); // each item takes 2-4 seconds to save
    });
}
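For comparison, a variant of the same loop that defers persistence to a single saveAll per batch would look roughly like this (a sketch, not what currently runs; JDBC statement batching would additionally need hibernate.jdbc.batch_size to be set):

// Same migration loop, but one saveAll() per 500-row batch instead of a save() per row
void migrateBlobsBatchedSave() {
    List<FileEntity> files = jpaRepository.getNextBatch();
    files.forEach(fileItem -> {
        String archiveId = uploadClient.uploadFile(fileItem.getPdf());
        fileItem.setArchiveId(archiveId);
    });
    jpaRepository.saveAll(files); // persists the whole batch in one repository call
}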

// My code for initializing the client and uploading

@Bean 
public AmazonGlacier glacierClient(){
    return AmazonGlacierClient.builder()
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(endpoint, "region"))
                .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
                .build();
} 

public String uploadFile(byte[] pdf) {
        UploadArchiveRequest uploadRequest = new UploadArchiveRequest()
                .withVaultName(vaultName) // vault name comes from an environment variable
                .withBody(new ByteArrayInputStream(pdf))
                .withContentLength((long) pdf.length);
        UploadArchiveResult uploadResult = glacierClient.uploadArchive(uploadRequest);
        return uploadResult.getArchiveId();
}

My problem: This is a one-time activity and it is taking a long time in production. A dry run could only process 1,500 records in 1 hour; at that rate, the 9M rows would take roughly 250 days. My team is looking to improve the performance so that this completes within 1 to 2 weeks if possible.
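For illustration, a parallelized variant of the loop could look roughly like this (pool size is arbitrary, error handling is omitted, and this is not what currently runs):

// Rough sketch: upload each batch concurrently, then persist once.
// Requires java.util.concurrent.* and java.util.stream.Collectors.
void migrateBlobsParallel() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(8); // pool size is arbitrary
    try {
        List<FileEntity> files = jpaRepository.getNextBatch();
        List<Future<?>> uploads = files.stream()
                .map(fileItem -> pool.submit(() ->
                        fileItem.setArchiveId(uploadClient.uploadFile(fileItem.getPdf()))))
                .collect(Collectors.toList());
        for (Future<?> upload : uploads) {
            upload.get(); // surfaces any upload failure
        }
        jpaRepository.saveAll(files);
    } finally {
        pool.shutdown();
    }
}

Even so, each PDF is still a separate HTTP call, hence the questions below.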

Questions:

  1. Can I upload 500 PDFs to the S3 Glacier vault in a single HTTP call instead of a separate call for each item?
  2. Are there any AWS services I could use to upload the files and write the archive_id back to my table, rather than building a Java app myself?