Part one

Here's my code:

from collections import Counter
from nltk.util import ngrams  # assumption: ngrams is NLTK's helper

trigrams = ngrams(cleaned_text, 3)
trigramsCounts = Counter(trigrams)
trigramDf = trigramsCounts.most_common(100)

Sample of the output when displayed (using made up data for this example):

_1 _2
"_1":"how","_2":"are","_3":"you" 102
"_1":"good","_2":"thank","_3":"you" 96
"_1":"are","_2":"you","_3":"okay" 72

(The column _1 text is actually wrapped in braces {} as well. I'm not sure whether that's relevant, but Stack Overflow won't let me post with them.)

I have been trying to use getItem so that I can put each word into a separate column, and then concatenate these to create a single string of the 3 words. This is the code:

finalDf = trigramDf.withColumn('Word_1', col('_1').getItem(0))
finalDf = finalDf.withColumn('Word_2', col('_1').getItem(1))
finalDf = finalDf.withColumn('Word_3', col('_1').getItem(2))

But I get this error, which I assume is because the trigramDf variable isn't actually being recognised as a data frame:

AttributeError Traceback (most recent call last)
/tmp/ipykernel_25874/2346936649.py in
----> 1 finalDf = finalDf.withColumn('Word', col('_1').getItem(0))

AttributeError: 'list' object has no attribute 'withColumn'


Part two

I also want to save the output as a parquet file so I can use it to build data visualisations (e.g. a word cloud), but again I keep getting an error.

This is the code (example):

finalDf.write.parquet('abfss://datalake.dfs.core.windows.net/desired_folder_location', mode='overwrite')

This is the error:

AttributeError Traceback (most recent call last)
/tmp/ipykernel_25874/3576806399.py in
----> 1 finalDf.write.parquet('abfss://datalake.dfs.core.windows.net/desired_folder_location', mode = 'overwrite')

AttributeError: 'list' object has no attribute 'write'


  1. How do I get the trigramDf to be recognised as a df?
  2. Why won't it let me save it as a parquet file?

I realise this is a lengthy query, but any help would be appreciated. Thank you.

anon_e On

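Resolved! Counter.most_common(100) returns a plain Python list of (trigram, count) tuples, not a Spark data frame, which is why withColumn and write were missing. The list has to be converted explicitly. Here's how: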

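from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# Rows for the data frame: (trigram as one string, count)
emptyString = []
schema = StructType([
    StructField('_1', StringType(), True),
    StructField('_2', IntegerType(), True)])

# trigramDf is the list from most_common(); w[0] is the word tuple, w[1] the count
for w in trigramDf:
    emptyString.append((' '.join(w[0]), w[1]))

# Assumes an active SparkSession named spark (as in a notebook)
BOWdf = spark.createDataFrame(emptyString, schema)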
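
To see why the loop above is needed, it may help to look at what most_common actually returns. This is only a minimal sketch on made-up tokens: simple_ngrams is a hypothetical stand-in for the question's ngrams call (assumed to behave like NLTK's), so the snippet needs nothing beyond the standard library.

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Hypothetical stand-in for ngrams(): yield each run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


# Made-up sample text, already split into tokens
tokens = "how are you how are you good thank you".split()

# most_common() returns a plain Python list of (trigram_tuple, count) pairs
top = Counter(simple_ngrams(tokens, 3)).most_common(2)
print(type(top).__name__)

# Same transformation as the loop above: join each word tuple into one string
rows = [(" ".join(trigram), count) for trigram, count in top]
print(rows[0])
```

Since top is an ordinary list, it has neither withColumn nor write, which matches both errors in the question.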

Because BOWdf is now a Spark data frame rather than a list, it has a write attribute, so saving the output as a parquet file is no longer an issue either.
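
If the separate word columns from the question are still wanted, something along these lines might work. It is a sketch under assumptions rather than a tested pipeline: the helper names (trigram_rows, save_trigrams), the column names and the output_path argument are all made up for illustration, and spark is assumed to be an existing SparkSession.

```python
def trigram_rows(most_common_output):
    """Flatten Counter.most_common() output into (w1, w2, w3, trigram, count) rows."""
    return [
        (w1, w2, w3, " ".join((w1, w2, w3)), count)
        for (w1, w2, w3), count in most_common_output
    ]


def save_trigrams(spark, most_common_output, output_path):
    """Build a Spark data frame from the rows and write it as parquet.

    spark is assumed to be an active SparkSession; output_path is a
    placeholder for wherever the file should go.
    """
    # Imported lazily so trigram_rows can be tried without pyspark installed
    from pyspark.sql.types import StructField, StructType, StringType, IntegerType

    # Hypothetical column names, chosen to mirror the question
    schema = StructType([
        StructField("Word_1", StringType(), True),
        StructField("Word_2", StringType(), True),
        StructField("Word_3", StringType(), True),
        StructField("Trigram", StringType(), True),
        StructField("Count", IntegerType(), True),
    ])
    df = spark.createDataFrame(trigram_rows(most_common_output), schema)
    df.write.parquet(output_path, mode="overwrite")
    return df


# Made-up data in the shape most_common() produces
sample = [(("how", "are", "you"), 102), (("good", "thank", "you"), 96)]
print(trigram_rows(sample)[0])
```

Splitting the words in plain Python before creating the data frame avoids the getItem calls entirely, since each word already sits at a known position in the tuple.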