I've created trigrams, how do I save this as a parquet file? How can I getItems from column _1 when it's not recognised as a column? (PySpark)

72 Views Asked by anon_e At 04 August 2022 at 21:58

Part one

Here's my code:

trigrams = ngrams(cleaned_text, 3)
trigramsCounts = Counter(trigrams)
trigramDf = trigramsCounts.most_common(100)

Sample of the output when displayed (using made up data for this example):

_1	_2
"_1":"how","_2":"are","_3":"you"	102
"_1":"good","_2":"thank","_3":"you"	96
"_1":"are","_2":"you","_3":"okay"	72

(The column _1 text is actually wrapped in braces {} as well. I'm not sure if that's relevant, but Stack Overflow won't let me post with them.)

I have been trying to use getItem to put each word into a separate column, and then concatenate those columns into a single string of the 3 words. This is the code:

finalDf = trigramDf.withColumn('Word_1', col('_1').getItem(0))
finalDf = finalDf.withColumn('Word_2', col('_1').getItem(1))
finalDf = finalDf.withColumn('Word_3', col('_1').getItem(2))

But I get this error (which I assume is because the trigramDf variable isn't actually being recognised as a data frame).

AttributeError Traceback (most recent call last)
/tmp/ipykernel_25874/2346936649.py in
----> 1 finalDf = finalDf.withColumn('Word', col('_1').getItem(0))

AttributeError: 'list' object has no attribute 'withColumn'

Part two

I also want to save the output as a parquet file so I can use these to form data visualisations (e.g. a word cloud), but again I keep getting an error.

This is the code (example):

finalDf.write.parquet('abfss://datalake.dfs.core.windows.net/desired_folder_location',mode = 'overwrite')

This is the error:

AttributeError Traceback (most recent call last)
/tmp/ipykernel_25874/3576806399.py in
----> 1 finalDf.write.parquet('abfss://datalake.dfs.core.windows.net/desired_folder_location', mode = 'overwrite')

AttributeError: 'list' object has no attribute 'write'

How do I get the trigramDf to be recognised as a df?
Why won't it let me save it as a parquet file?

I appreciate this is a lengthy query but any help will be appreciated - thank you.


There is 1 best solution below

anon_e On 17 August 2022 at 11:42

Resolved! Here's how:

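from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# trigramDf is a plain Python list of ((word1, word2, word3), count) pairs,
# so collect the rows in a list and pass them to spark.createDataFrame.
emptyString = []
schema = StructType([
    StructField('_1', StringType(), True),    # trigram as one space-separated string
    StructField('_2', IntegerType(), True)])  # number of occurrences

for w in trigramDf:
    # w[0] is the tuple of three words, w[1] is its count
    emptyString.append((' '.join(w[0]), w[1]))

BOWdf = spark.createDataFrame(emptyString, schema)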

BOWdf is now a Spark DataFrame rather than a Python list, so saving it as a parquet file works as well.
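For anyone who also wants the three words in separate columns, as in the question, here is a rough sketch of one way it might be done. The function names top_trigram_rows and save_trigrams and the column names Word_1, Word_2, Word_3, Count and Trigram are made up for illustration, and the zip call is only a stand-in for the ngrams(cleaned_text, 3) call in the question. It assumes the input is a Python list of tokens and that a SparkSession is passed in as spark.

```python
from collections import Counter


def top_trigram_rows(tokens, top_n=100):
    """Return (word_1, word_2, word_3, count) tuples for the most common trigrams.

    Hypothetical helper; `tokens` is assumed to be a list of strings.
    """
    # Zipping three offset views of the list yields consecutive trigrams.
    # This stands in for ngrams(cleaned_text, 3) from the question.
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    # most_common returns a plain Python list of ((w1, w2, w3), count) pairs,
    # which is why .withColumn and .write are not available on it.
    return [(*words, count) for words, count in Counter(trigrams).most_common(top_n)]


def save_trigrams(spark, rows, path):
    """Build a Spark DataFrame from the rows and write it as parquet.

    Hypothetical helper; `spark` is assumed to be an existing SparkSession.
    """
    # Imported here so the pure-Python helper above runs without PySpark.
    from pyspark.sql.functions import concat_ws
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    schema = StructType([
        StructField("Word_1", StringType(), True),
        StructField("Word_2", StringType(), True),
        StructField("Word_3", StringType(), True),
        StructField("Count", IntegerType(), True),
    ])
    df = spark.createDataFrame(rows, schema)
    # concat_ws joins the three word columns into one space-separated string.
    df = df.withColumn("Trigram", concat_ws(" ", "Word_1", "Word_2", "Word_3"))
    df.write.parquet(path, mode="overwrite")
    return df
```

Usage would be along the lines of rows = top_trigram_rows(cleaned_text) followed by save_trigrams(spark, rows, 'abfss://datalake.dfs.core.windows.net/desired_folder_location'). Building the rows in plain Python first keeps the Counter step separate from Spark, so the list can be inspected before it is turned into a DataFrame.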
