Issues with processing large Excel files from SFTP using Roo gem in Rails application


I'm encountering difficulties processing large Excel files fetched from an SFTP server in my Rails application. The application comprises a Rails server and a Karafka server, and it needs to download large Excel (.xlsx) files (up to 1,000,000 records) from an SFTP path and parse them for further processing. Only one file arrives per day, but I still loop over the SFTP directory because there can be at most two files, and only one of them will have a huge number of records.

To fetch files from the SFTP server, I'm using the sftp.dir.glob method to loop through the directory and download the files. Here's the relevant code snippet:

# Download every file in @file_path matching @file_regex into memory
sftp.dir.glob(@file_path, "#{@file_regex}") do |file|
  begin
    Rails.logger.info "The file name is #{file.name}"
    # download! with a single argument returns the file contents as an in-memory string
    file_data = sftp.download!("#{@file_path}#{file.name}")
    @downloaded_files << { file_name: file.name, file_data: file_data }
  rescue StandardError => e
    Rails.logger.error("Error occurred while pulling SFTP data from file: #{@file_path}#{@file_regex} #{e.message}")
    raise StandardError.new(e.message)
  end
end
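
For reference on the reliability part of my question (see below), the in-memory download above could instead write the remote file straight to a local path; a minimal sketch of that variant (the tmp directory and the file_path key are just examples):

# Alternative sketch: let net-sftp stream the remote file straight to disk
# instead of holding the whole payload in memory as a string.
local_path = Rails.root.join("tmp", file.name).to_s

# With a second argument, download! writes to the given local path
sftp.download!("#{@file_path}#{file.name}", local_path)
@downloaded_files << { file_name: file.name, file_path: local_path }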

After everything is saved in @downloaded_files, I parse each file. Below is the logic for parsing a file:

def convert_excel_file_data
  temp_file = Tempfile.new(["sftp_#{SecureRandom.alphanumeric(8)}_#{file_name}", ".xlsx"])
  temp_file.binmode # the .xlsx payload is binary data
  temp_file.write(file_data)
  temp_file.rewind
  Rails.logger.info "Tempfile created for file_name: #{file_name}"

  excel = Roo::Excelx.new(temp_file.path)
  header = excel.row(1)

  Rails.logger.info "Start fetching data from file_name: #{file_name}"
  items = []

  # Skip the header row and collect every remaining row as a header => value hash
  excel.each_row_streaming(offset: 1) do |row|
    row_data = Hash[header.zip(row.map(&:value))]
    items << row_data
  end

  Rails.logger.info "Data fetch completed for file_name: #{file_name}"

  temp_file.close
  temp_file.unlink # Remove the temporary file

  Rails.logger.info "Tempfile deleted for file_name: #{file_name}, items count: #{items.length}"

  items
end

After all the rows are pushed into the items array, I save them to the PostgreSQL table in chunks.
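
For illustration, the chunked save is along these lines (a sketch; ItemRecord is a placeholder for my actual model, and I'm assuming the spreadsheet headers map 1:1 to column names):

# Sketch only: ItemRecord and the header-to-column mapping are assumptions.
BATCH_SIZE = 5_000

items.each_slice(BATCH_SIZE) do |batch|
  # insert_all issues one multi-row INSERT per batch (Rails 6+)
  ItemRecord.insert_all(batch)
end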

However, even with a relatively small file (8 MB, ~35k records), the application appears to hang indefinitely at the line Roo::Excelx.new(temp_file.path). Execution seems to stop at that point, which causes the Karafka server to retry processing the message.

I'm seeking guidance on how to troubleshoot and resolve this issue. Is there a better approach to handling large Excel files with the Roo gem in a streaming manner, or are there ways to optimize the parsing so it doesn't hang? Additionally, I would like to know whether my current SFTP download approach is reliable for large files.
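
To make the "streaming manner" part concrete, this is roughly the shape I have in mind: parse and persist in batches without ever building the full items array (a sketch; ItemRecord and BATCH_SIZE are placeholders, and it still depends on Roo::Excelx.new not hanging):

# Sketch of a streaming-style parse: consume rows in batches instead of
# accumulating everything in the items array first.
BATCH_SIZE = 5_000

excel = Roo::Excelx.new(temp_file.path)
header = excel.row(1)

excel.each_row_streaming(offset: 1).each_slice(BATCH_SIZE) do |rows|
  batch = rows.map { |row| Hash[header.zip(row.map(&:value))] }
  ItemRecord.insert_all(batch)
end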

I tried the entire process with a small file (~1k records), and that works fine.

Rails version: 7.0.5
Karafka version: 2.1.11
Roo version: 2.10.0
