I have a CSV file that is tab-separated. The following code:
import sys

import pandas as pd
import pyarrow.csv as pa_csv

# pandas reads the file without complaint
df = pd.read_csv(sys.argv[1], sep='\t', header=0, dtype='object')

# pyarrow, pointed at the same file with the same delimiter
parse_options = pa_csv.ParseOptions(delimiter='\t')
data = pa_csv.read_csv(sys.argv[1], parse_options=parse_options)
fails on the pyarrow read:
Having looked at the data I have been given, it seems the number of columns varies:
awk '{print NF}' data.csv:
200651
200651
200651
200653
200651
200651
200651
How does pandas handle this case, and why doesn't pyarrow do the same?
Can pyarrow be forced to behave in the same way?
EDIT
The number of columns doesn't vary. I didn't tell awk to use tab as the delimiter, so it split on any whitespace and counted fields containing spaces more than once.
awk -F'\t' '{print NF}'
200669
200669
200669
200669
200669
200669
200669
200669
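For anyone who wants to repeat this check without awk, here is a minimal Python sketch of the same idea. The function name `field_count_histogram` and the sample rows are made up for illustration; the point is only that splitting on tab and splitting on arbitrary whitespace can give different field counts for the same lines.

```python
from collections import Counter


def field_count_histogram(lines, delimiter="\t"):
    """Return a Counter mapping 'fields per line' to 'number of lines'."""
    counts = Counter()
    for line in lines:
        # Strip only the line terminator so trailing empty fields still count.
        counts[len(line.rstrip("\r\n").split(delimiter))] += 1
    return counts


# Tiny in-memory stand-in for the real file (hypothetical data).
sample = ["a b\tc\td\n", "e\tf g h\ti\n"]

# Splitting on tab: both lines have the same number of fields.
by_tab = field_count_histogram(sample)

# Splitting on any whitespace (awk's default): the counts differ per line.
by_whitespace = Counter(len(line.split()) for line in sample)

print(by_tab, by_whitespace)
```

If the tab-based histogram has a single key, the file is rectangular and the varying counts came from spaces inside fields.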
So what is causing the error?
Update
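Adding
read_options=pa_csv.ReadOptions(block_size=1e9)
to the pa_csv.read_csv call solved the issue. I guess it is down to the large number of columns: with roughly 200,000 fields, a single row is probably bigger than pyarrow's default block size.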
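A rough sketch of that workaround, assuming the failure really is caused by one row not fitting into a single block. Instead of hard-coding a size, it measures the longest line and sizes the block from that. The helper names `longest_line_bytes` and `read_wide_tsv` and the `headroom` parameter are my own inventions, not part of pyarrow; only `ReadOptions(block_size=...)`, `ParseOptions(delimiter=...)` and `read_csv` come from `pyarrow.csv`.

```python
import os
import tempfile


def longest_line_bytes(path):
    """Return the length in bytes of the longest line in the file."""
    longest = 0
    with open(path, "rb") as handle:
        for line in handle:
            longest = max(longest, len(line))
    return longest


def read_wide_tsv(path, headroom=2):
    """Read a tab-separated file with a block big enough for its widest row."""
    # Imported here so longest_line_bytes() stays usable without pyarrow.
    import pyarrow.csv as pa_csv

    # block_size is a byte count; passing an integer is the safe choice.
    # Never go below 1 MiB, and leave some headroom above the widest row.
    block_size = max(1 << 20, headroom * longest_line_bytes(path))
    return pa_csv.read_csv(
        path,
        read_options=pa_csv.ReadOptions(block_size=block_size),
        parse_options=pa_csv.ParseOptions(delimiter="\t"),
    )


# Small self-check on a throwaway file (hypothetical data, no pyarrow needed).
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write(b"a\tb\tc\n1\t2\t3333\n")
widest = longest_line_bytes(tmp.name)  # the second line is the longest
os.remove(tmp.name)
```

With the original script this would be called as `data = read_wide_tsv(sys.argv[1])`. Scanning the file first costs an extra pass over the data, so for a one-off a fixed large integer such as `block_size=10**9` is simpler.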