Fetch a column value into a variable in pyspark without collect

My objective is to fetch a column's values from a PySpark DataFrame into a variable, if possible as a list.

Expected output = ["a", "b", "c", ... ]

I tried:

[
  col.__getitem__("x")
  for col in data.select("x").collect()
]

But it gives a list of Row objects.

Output : [Row(x='a'), Row(x='b'), Row(x='c'), ...]

I don't want to use collect, and I don't need Row objects.

I tried another method:

data.select(f.collect_list("x")).collect()

This is slightly better than the earlier version, but it returns:

Output : [Row(collect_list(x)=['a', 'b', 'c', ...])]

Thanks in advance and Happy new year!

Best answer:

I tried three different solutions:

df.select(f.collect_list("x").alias("temp")).first()["temp"] 
Time taken : 32.43s

df.select("x").rdd.flatMap(lambda x:x).collect()
Time taken : 13.19s

[col.__getitem__("x") for col in df.select("x").collect()]
Time taken : 22.77s

Even though it still uses collect, the second approach (rdd.flatMap) was faster than the other two. P.S. df.count() ~ 116M