When transforming an XML file to JSON, the Cloud Data Fusion pipeline, configured in Autoscaling mode with up to 84 cores, stops with an error.

Can anybody help me to make it work?

The roughly 100-page raw log file seems to indicate that the possible errors were:

  • +ExitOnOutOfMemoryError
  • Container exited with a non-zero exit code 3. Error file: prelaunch.err

It happened with the following configuration:

The strange thing is that the very same pipeline, with an XML file ten times smaller (only 141 MB), worked correctly:

Can anybody help me understand why the Cloud Data Fusion pipeline, set to Autoscaling mode with up to 84 cores, succeeds with the 141 MB XML file but fails with the 1.4 GB XML file?

For clarity, here are all the detailed steps:

Answer by Fernando Velasquez:

Parsing a 1 GB XML file requires a significant amount of memory in your workers.

Looking at your pipeline JSON, the pipeline is currently configured to allocate 2 GB of RAM per executor:

"config": {
    "resources": {
        "memoryMB": 2048,
        "virtualCores": 1
    },
    "driverResources": {
        "memoryMB": 2048,
        "virtualCores": 1
    },
    ...
}

This is likely insufficient: the ~1.1 GB parsed JSON payload, once expanded into in-memory objects, can occupy several times its on-disk size in JVM heap, so it will not fit in a 2 GB executor.

Try increasing the amount of executor memory in the Config -> Resources -> Executor section. I would suggest starting with 8 GB of RAM for your example; the equivalent change in the exported pipeline JSON is sketched below.

Resources Config Example
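
For reference, here is a minimal sketch of what that change looks like in the exported pipeline JSON, assuming the rest of your configuration stays as-is (8192 MB is just the suggested starting point, not a definitive figure):

"config": {
    "resources": {
        "memoryMB": 8192,
        "virtualCores": 1
    },
    "driverResources": {
        "memoryMB": 2048,
        "virtualCores": 1
    },
    ...
}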

EDIT: When using the Default or Autoscaling compute profile, CDF will create workers with 2 vCPU cores and 8 GB of RAM. You will need to increase this using the following runtime arguments:

system.profile.properties.workerCPUs = 4 
system.profile.properties.workerMemoryMB = 22528 

Runtime arguments to increase worker CPU and Memory allocation

This will increase the worker size to 4 vCPUs and 22 GB of RAM, which is large enough to fit the requested executor in the worker.
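
As a rough sanity check (assuming Spark's default executor memory overhead of about 10% of executor memory, with a 384 MB floor), an 8 GB executor needs roughly 8192 MB + 819 MB ≈ 9 GB per container, which fits easily inside a 22528 MB worker but could never be scheduled on the default 8 GB worker. The two system.profile.properties arguments above can be supplied in the pipeline's Runtime Arguments dialog before starting the run.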