How can I read data from DataTap using CPython?


I would like to read data from DataTap using CPython.

In Spark, I can do something like:

df = spark.read.csv("dtap://MaprClus2/tmp/airline-safety.csv")

How can I do the same with plain CPython, for example when I don't have a PySpark Jupyter kernel?

Chris Snow (best answer)

One option is to use a subprocess to call the Hadoop CLI:

from subprocess import check_output
import pandas as pd
from io import BytesIO

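def hdfs_read(fpath):
    """Return the output of `hadoop fs -cat fpath` as an in-memory byte buffer."""
    # check_output raises CalledProcessError if the command exits non-zero
    out = check_output(['hadoop', 'fs', '-cat', fpath])
    return BytesIO(out)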

data = hdfs_read("dtap://MaprClus2/tmp/airline-safety.csv")

# the first line of the output is a Hadoop CLI warning, so skip it
pd.read_csv(data, sep=",", skiprows=1)
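
If the command fails (wrong path, hadoop not on the PATH's configuration, and so on), check_output only raises a CalledProcessError without showing what the CLI printed. A minimal sketch of a variant that captures stderr and includes it in the error is below. The name cat_to_buffer and the base_cmd parameter are hypothetical, introduced here only so the command can be swapped out; the default is the same `hadoop fs -cat` call as above.

```python
import subprocess
from io import BytesIO


def cat_to_buffer(fpath, base_cmd=("hadoop", "fs", "-cat")):
    """Run `<base_cmd> <fpath>` and return its stdout as a BytesIO.

    base_cmd is a hypothetical parameter (not part of any Hadoop API);
    it defaults to the same CLI call used in the answer above.
    """
    result = subprocess.run(
        [*base_cmd, fpath],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        # Surface whatever the CLI wrote to stderr instead of a bare exit code
        message = result.stderr.decode(errors="replace")
        raise RuntimeError(f"command failed ({result.returncode}): {message}")
    return BytesIO(result.stdout)


# Usage (assumes pandas is installed and the hadoop CLI is available):
# import pandas as pd
# data = cat_to_buffer("dtap://MaprClus2/tmp/airline-safety.csv")
# df = pd.read_csv(data, sep=",", skiprows=1)
```

Whether skiprows=1 is still needed depends on where the warning is written in your environment; this sketch only changes how failures are reported, not what appears on stdout.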