Argo Workflows container and/or script cannot read CSV downloaded from S3


I am writing a workflow that downloads a CSV from S3, converts it to another file format using a Docker container, and then uploads the converted file back to an S3 bucket.

AWS EKS K8S version: 1.29

Helm chart:

    chart: argo-workflows
    targetRevision: 0.40.14 # Version of Argo Workflows

Problem: The template seems to download the file, but the container or script cannot locate it. I also noticed something odd: the init container log says the file was downloaded to a location that is not defined in my input artifact. Why? I cannot find that directory referenced anywhere (Helm chart values, ConfigMaps, or my codebase). Where is it coming from?

Here is what I have tried:

Note: all three debug attempts are included (commented/uncommented as needed) in the Workflow template provided below.

  1. I tried the simple approach of using an S3 input artifact and spinning up a container that reads the CSV via ls -l /data && cd /data && ls -l && cat data.csv, but I cannot cd into the directory. Here is the output of the main container:
    /usr/bin/sh: 1: cd: can't cd to /data
    time="2024-03-26T19:41:59.188Z" level=info msg="sub-process exited" argo=true error="<nil>"
    Error: exit status 2

As observed, /data is reported as 228 bytes, which matches the size of the file in S3. That leads me to believe the file was downloaded there, so why can't I cd into the directory?

  2. Next, I tried to list the directory from a Python script to validate that it exists, but it returned an error. Here is the output of the main container:
    hello
    Traceback (most recent call last):
      File "/argo/staging/script", line 5, in <module>
        print(os.listdir(path='/data'))
              ^^^^^^^^^^^^^^^^^^^^^^^^
    NotADirectoryError: [Errno 20] Not a directory: '/data'
    time="2024-03-26T19:44:08.838Z" level=info msg="sub-process exited" argo=true error="<nil>"
    Error: exit status 1
  3. Next, I tried using a PVC, but I got the error templates.download.inputs.artifacts[0].path '/data' already mounted in container.volumeMounts.workdir when I referenced the PVC in the same template as the S3 input artifact. So I split it into two templates that reference the PVC: the first downloads the S3 file and the second performs the conversion (sketched below). Same result as above.
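
For reference, here is roughly the shape of that two-template split (reconstructed from memory; the /mnt/work mount path and the cp step are only illustrative):

    # Rough shape of the two-template PVC attempt (paths/names illustrative)
    - name: download
      inputs:
        artifacts:
        - name: storage
          path: /data                  # same S3 artifact as before
          s3:
            endpoint: s3.amazonaws.com
            bucket: <BUCKET_NAME>
            key: data.csv
            region: us-east-1
            useSDKCreds: true
      container:
        image: debian:latest
        command: [sh, -c]
        args: ["cp /data/data.csv /mnt/work/"]   # copy the CSV onto the PVC
        volumeMounts:
        - name: workdir
          mountPath: /mnt/work

    - name: convert
      container:
        image: <ECR_IMAGE>
        command: ["/tools/data_cli/data_cli"]
        args: ["format_data", "--input_file=/mnt/work/data.csv", "--input_format=CSV", "--output_file=/mnt/work/DELTA_0000000000000001", "--output_format=DELTA"]
        volumeMounts:
        - name: workdir
          mountPath: /mnt/work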

init container logs:

│ time="2024-03-26T19:32:33.499Z" level=info msg="Starting Workflow Executor" version=v3.5.5                                                                                                                                                │
│ time="2024-03-26T19:32:33.508Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5                                                                                                                  │
│ time="2024-03-26T19:32:33.508Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=kvs-csv-to-delta-download-2650783132 templateName=download version="&Vers │
│ ion{Version:v3.5.5,BuildDate:2024-02-29T20:59:20Z,GitCommit:c80b2e91ebd7e7f604e8442f45ec630380ffa0,GitTag:v3.5.5,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}"                                              │
│ time="2024-03-26T19:32:33.628Z" level=info msg="Start loading input artifacts..."                                                                                                                                                         │
│ time="2024-03-26T19:32:33.628Z" level=info msg="Downloading artifact: storage"                                                                                                                                                            │
│ time="2024-03-26T19:32:33.628Z" level=info msg="S3 Load path: /argo/inputs/artifacts/storage.tmp, key: data.csv"                                                                                                                          │
│ time="2024-03-26T19:32:33.650Z" level=info msg="Creating minio client using AWS SDK credentials"                                                                                                                                          │
│ time="2024-03-26T19:32:33.655Z" level=info msg="Getting file from s3" bucket=<REMOVED> endpoint=s3.amazonaws.com key=data.csv path=/argo/inputs/artifacts/storage.tmp                                              │
│ time="2024-03-26T19:32:33.743Z" level=info msg="Load artifact" artifactName=storage duration=115.271126ms error="<nil>" key=data.csv                                                                                                      │
│ time="2024-03-26T19:32:33.744Z" level=info msg="Detecting if /argo/inputs/artifacts/storage.tmp is a tarball"                                                                                                                             │
│ time="2024-03-26T19:32:33.744Z" level=info msg="Successfully download file: /argo/inputs/artifacts/storage"                                                                                                                               │
│ time="2024-03-26T19:32:33.744Z" level=info msg="Alloc=10853 TotalAlloc=16524 Sys=23141 NumGC=4 Goroutines=7"                                                                                                                              │
│ Stream closed EOF for argo/kvs-csv-to-delta-download-2650783132 (init)                                                                                                                                                                    │

Workflow Template

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: kvs-csv-to-delta
spec:
  entrypoint: diamond
  
  volumes:
  - name: workdir
    persistentVolumeClaim:
      claimName: s3-pv-claim
  
  templates:
  - name: download
    inputs:
      artifacts:
      - name: storage
        path: /data
        mode: 0777
        s3:
          endpoint: s3.amazonaws.com
          bucket: <BUCKET_NAME>
          key: data.csv
          region: us-east-1
          useSDKCreds: true

    # DEBUG 1 ========================      
    container:
      image: debian:latest
      command: [sh, -c]
      args: ["ls -l /data && cd /data && ls -l && cat data.csv"]

    
    # DEBUG 2 ========================
    # script:
    #   image: python:alpine
    #   imagePullPolicy: IfNotPresent
    #   command: [ python ]
    #   source: |
    #       import os
    #       import time
    #       print("hello")
          
    #       print(os.listdir(path='/data'))
    #       print("\n listing files for /data: \n")
    
    # DEBUG 3 ========================
    # container:
    #   image: <ECR_IMAGE>
    #   command: ["/tools/data_cli/data_cli"]
    #   args: ["format_data", "--input_file=/data/data.csv", "--input_format=CSV", "--output_file=/data/DELTA_0000000000000001", "--output_format=DELTA"]
    #   volumeMounts:
    #   - name: workdir
    #     mountPath: /data



  - name: diamond
    dag:
      tasks:
      - name: A
        template: download

I suspect this is a trivial and common use case. Am I missing something simple in the Workflow? Is it a configuration issue? Any help would be greatly appreciated. Thank you.

1 Answer

Answered by TitaniuM:

I figured it out. The CSV is being downloaded from S3 just fine. The "/data" value defined for the path attribute is not a directory here; it is the path of the downloaded data.csv file itself. So if I pass cat /data as the container args, it displays the content of the downloaded CSV file.
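
Concretely, the debug container only needs to read the artifact path directly; a minimal sketch using the same debian image as in the question:

    container:
      image: debian:latest
      command: [sh, -c]
      args: ["ls -ld /data && cat /data"]   # /data is the downloaded file itself, not a directory

The artifact definition itself stays unchanged: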

      artifacts:
      - name: storage
        path: /data
        mode: 0777
        s3:
          endpoint: s3.amazonaws.com
          bucket: <BUCKET_NAME>
          key: data.csv
          region: us-east-1
          useSDKCreds: true
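
Alternatively, if you want /data to be a directory containing data.csv (so that the original cd /data && cat data.csv works), I believe including the filename in the artifact path achieves that; a sketch, not verified against this exact setup:

    artifacts:
    - name: storage
      path: /data/data.csv   # artifact is written to this file path, so /data becomes a directory
      mode: 0777
      s3:
        endpoint: s3.amazonaws.com
        bucket: <BUCKET_NAME>
        key: data.csv
        region: us-east-1
        useSDKCreds: true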