I am writing a workflow that downloads a CSV file from S3, converts it to another format using a Docker container, and then uploads the converted file back to an S3 bucket.
AWS EKS K8S version: 1.29
Helm chart:
chart: argo-workflows
targetRevision: 0.40.14 # argo-workflows Helm chart version
Problem: The template appears to download the file, but my containers and scripts cannot locate it. I also noticed something odd: the init container log says the file was downloaded to a location that is not defined in my input artifact. Why? And I cannot find that directory referenced anywhere (Helm chart values, ConfigMaps, or my codebase). Where is it coming from?
Here is what I have tried:
Note: all three debug options are included in the Workflow template below; DEBUG 1 is active, while DEBUG 2 and DEBUG 3 are commented out.
- I tried a simple approach: use an S3 input artifact and spin up a container that `cat`s the CSV file using `ls -l /data && cd /data && ls -l && cat data.csv`, but I cannot `cd` into the dir. Here is the output of the `main` container:
```
/usr/bin/sh: 1: cd: can't cd to /data
time="2024-03-26T19:41:59.188Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 2
```
As observed (from the `ls -l` output), `/data` is reported as 228 bytes, which matches the size of the file in S3. This leads me to believe the file was downloaded into the dir. But why can't I `cd` into it?
- Next, I tried to validate that the directory exists by listing it from a Python script, but it returned an error. Here is the output of the `main` container:
```
hello
Traceback (most recent call last):
  File "/argo/staging/script", line 5, in <module>
    print(os.listdir(path='/data'))
          ^^^^^^^^^^^^^^^^^^^^^^^^
NotADirectoryError: [Errno 20] Not a directory: '/data'
time="2024-03-26T19:44:08.838Z" level=info msg="sub-process exited" argo=true error="<nil>"
Error: exit status 1
```
- Next, I tried using a PVC, but when I referenced the PVC in the same template as the S3 input artifact I got the error `templates.download.inputs.artifacts[0].path '/data' already mounted in container.volumeMounts.workdir`. So I tried creating two templates that reference the PVC: the first downloads the S3 file, and the second performs the action. Same result as above.
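For reference, here is the single-template PVC variant I was aiming for. This is only a sketch: it assumes (and I have not verified) that Argo accepts an artifact path that points at a file *inside* the volume mount, so that the artifact path and the `mountPath` no longer collide exactly; it reuses the `s3-pv-claim` PVC from my spec below.

```yaml
# Sketch only: download the artifact to a file inside the mounted volume
# (/data/data.csv) rather than to the mount path itself (/data).
- name: download-to-pvc
  inputs:
    artifacts:
      - name: storage
        path: /data/data.csv        # file inside the mount, not the mount path
        s3:
          endpoint: s3.amazonaws.com
          bucket: <BUCKET_NAME>
          key: data.csv
          region: us-east-1
          useSDKCreds: true
  container:
    image: debian:latest
    command: [sh, -c]
    args: ["ls -l /data && cat /data/data.csv"]
    volumeMounts:
      - name: workdir               # PVC defined in spec.volumes
        mountPath: /data
```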
Init container logs:
```
time="2024-03-26T19:32:33.499Z" level=info msg="Starting Workflow Executor" version=v3.5.5
time="2024-03-26T19:32:33.508Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2024-03-26T19:32:33.508Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=kvs-csv-to-delta-download-2650783132 templateName=download version="&Version{Version:v3.5.5,BuildDate:2024-02-29T20:59:20Z,GitCommit:c80b2e91ebd7e7f604e8442f45ec630380ffa0,GitTag:v3.5.5,GitTreeState:clean,GoVersion:go1.21.7,Compiler:gc,Platform:linux/amd64,}"
time="2024-03-26T19:32:33.628Z" level=info msg="Start loading input artifacts..."
time="2024-03-26T19:32:33.628Z" level=info msg="Downloading artifact: storage"
time="2024-03-26T19:32:33.628Z" level=info msg="S3 Load path: /argo/inputs/artifacts/storage.tmp, key: data.csv"
time="2024-03-26T19:32:33.650Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2024-03-26T19:32:33.655Z" level=info msg="Getting file from s3" bucket=<REMOVED> endpoint=s3.amazonaws.com key=data.csv path=/argo/inputs/artifacts/storage.tmp
time="2024-03-26T19:32:33.743Z" level=info msg="Load artifact" artifactName=storage duration=115.271126ms error="<nil>" key=data.csv
time="2024-03-26T19:32:33.744Z" level=info msg="Detecting if /argo/inputs/artifacts/storage.tmp is a tarball"
time="2024-03-26T19:32:33.744Z" level=info msg="Successfully download file: /argo/inputs/artifacts/storage"
time="2024-03-26T19:32:33.744Z" level=info msg="Alloc=10853 TotalAlloc=16524 Sys=23141 NumGC=4 Goroutines=7"
Stream closed EOF for argo/kvs-csv-to-delta-download-2650783132 (init)
```
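Since the init container reports `Successfully download file: /argo/inputs/artifacts/storage`, a single debug step that checks exactly what ends up at `/data` seems useful. Here is a sketch of such a template (untested; it uses the same `debian:latest` image as DEBUG 1, and the shell commands are just my idea of a useful check, nothing Argo-specific):

```yaml
# Sketch only: report whether /data is a directory or a plain file,
# then list it or dump its first bytes accordingly.
- name: inspect
  inputs:
    artifacts:
      - name: storage
        path: /data
        s3:
          endpoint: s3.amazonaws.com
          bucket: <BUCKET_NAME>
          key: data.csv
          region: us-east-1
          useSDKCreds: true
  container:
    image: debian:latest
    command: [sh, -c]
    args:
      - |
        ls -ld /data
        if [ -d /data ]; then
          echo "/data is a directory" && ls -l /data
        else
          echo "/data is a regular file" && head -c 256 /data
        fi
```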
Workflow Template
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: kvs-csv-to-delta
spec:
  entrypoint: diamond
  volumes:
    - name: workdir
      persistentVolumeClaim:
        claimName: s3-pv-claim
  templates:
    - name: download
      inputs:
        artifacts:
          - name: storage
            path: /data
            mode: 0777
            s3:
              endpoint: s3.amazonaws.com
              bucket: <BUCKET_NAME>
              key: data.csv
              region: us-east-1
              useSDKCreds: true
      # DEBUG 1 ========================
      container:
        image: debian:latest
        command: [sh, -c]
        args: ["ls -l /data && cd /data && ls -l && cat data.csv"]
      # DEBUG 2 ========================
      # script:
      #   image: python:alpine
      #   imagePullPolicy: IfNotPresent
      #   command: [ python ]
      #   source: |
      #     import os
      #     import time
      #     print("hello")
      #     print(os.listdir(path='/data'))
      #     print("\n listing files for /data: \n")
      # DEBUG 3 ========================
      # container:
      #   image: <ECR_IMAGE>
      #   command: ["/tools/data_cli/data_cli"]
      #   args: ["format_data", "--input_file=/data/data.csv", "--input_format=CSV", "--output_file=/data/DELTA_0000000000000001", "--output_format=DELTA"]
      #   volumeMounts:
      #     - name: workdir
      #       mountPath: /data
    - name: diamond
      dag:
        tasks:
          - name: A
            template: download
```
I sense that this is a trivial and ubiquitous use case. Am I missing something simple in the Workflow? Is it a configuration issue? Any help would be greatly appreciated. Thank you.
Update: I figured it out. The CSV is being downloaded from S3 just fine. The value `/data` defined for the `path` attribute is the path of the downloaded `data.csv` file itself, not a directory containing it. So if I pass `cat /data` as the container args, it prints the content of the downloaded CSV file.
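For anyone who hits the same thing, only the container args in my download template needed to change. A minimal sketch of the corrected template follows (same spec as above with `cat /data` instead of `cd /data`; the comment about using `path: /data/data.csv` instead is my assumption and not something I re-tested):

```yaml
# Sketch of the corrected template: the artifact path IS the downloaded CSV
# file, so the container reads /data directly. (Alternatively, setting
# path: /data/data.csv should make /data a directory containing data.csv --
# my assumption, not re-tested.)
- name: download
  inputs:
    artifacts:
      - name: storage
        path: /data                 # the downloaded CSV file, not a directory
        s3:
          endpoint: s3.amazonaws.com
          bucket: <BUCKET_NAME>
          key: data.csv
          region: us-east-1
          useSDKCreds: true
  container:
    image: debian:latest
    command: [sh, -c]
    args: ["ls -l /data && cat /data"]
```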