Part one
Here's my code:
trigrams = ngrams(cleaned_text, 3)
trigramsCounts = Counter(trigrams)
trigramDf = trigramsCounts.most_common(100)
Sample of the output when displayed (using made-up data for this example):
| _1 | _2 |
|---|---|
| "_1":"how","_2":"are","_3":"you" | 102 |
| "_1":"good","_2":"thank","_3":"you" | 96 |
| "_1":"are","_2":"you","_3":"okay" | 72 |
(The text in column _1 is actually wrapped in braces {} as well. I'm not sure if that's relevant, but Stack Overflow won't let me post with them.)
I have been trying to use getItem to put each word into a separate column, so that I can then concatenate them into a single string of the 3 words. This is the code:
finalDf = trigramDf.withColumn('Word_1', col('_1').getItem(0))
finalDf = finalDf.withColumn('Word_2', col('_1').getItem(1))
finalDf = finalDf.withColumn('Word_3', col('_1').getItem(2))
But I get this error (which I assume is because the trigramDf variable isn't actually being recognised as a data frame):
AttributeError Traceback (most recent call last)
/tmp/ipykernel_25874/2346936649.py in
----> 1 finalDf = finalDf.withColumn('Word', col('_1').getItem(0))
AttributeError: 'list' object has no attribute 'withColumn'
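The assumption above can be checked without Spark. This is a minimal sketch with made-up tokens; `make_trigrams` is a hypothetical stand-in for `ngrams(cleaned_text, 3)` (I'm assuming that is NLTK's `ngrams`), used only so the snippet runs on its own. It shows that `Counter.most_common` returns a plain Python list of `(trigram, count)` tuples, which has no `withColumn` or `write`.

```python
from collections import Counter


def make_trigrams(tokens):
    # Hypothetical stand-in for ngrams(tokens, 3): yields consecutive 3-word tuples.
    return zip(tokens, tokens[1:], tokens[2:])


# Made-up tokens, not real data.
tokens = "how are you how are you okay".split()

# Same shape of call as trigramsCounts.most_common(100) in the question.
top = Counter(make_trigrams(tokens)).most_common(100)

print(type(top).__name__)          # name of the type that most_common returns
print(top[0])                      # one (trigram_tuple, count) pair
print(hasattr(top, "withColumn"))  # whether the object has DataFrame methods
```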
Part two
I also want to save the output as a parquet file so I can use these to form data visualisations (e.g. a word cloud), but again I keep getting an error.
This is the code (example):
finalDf.write.parquet('abfss://datalake.dfs.core.windows.net/desired_folder_location',mode = 'overwrite')
This is the error:
AttributeError Traceback (most recent call last)
/tmp/ipykernel_25874/3576806399.py in
----> 1 finalDf.write.parquet('abfss://datalake.dfs.core.windows.net/desired_folder_location', mode = 'overwrite')
AttributeError: 'list' object has no attribute 'write'
- How do I get the trigramDf to be recognised as a df?
- Why won't it let me save it as a parquet file?
I realise this is a lengthy query, but any help would be appreciated. Thank you.
Resolved! Here's how:
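One way to do this is sketched below; it is not necessarily the exact code used. The idea is to flatten each `(trigram, count)` pair from `most_common` into a plain row and then hand the rows to Spark. The names `trigram_rows`, `to_spark_df` and `COLUMNS` are hypothetical, the data is made up, and `spark` is assumed to be an existing SparkSession (notebooks usually provide one).

```python
from collections import Counter

# Hypothetical column names for the resulting data frame.
COLUMNS = ["Word_1", "Word_2", "Word_3", "Trigram", "Count"]


def trigram_rows(counts, top_n=100):
    """Flatten Counter.most_common output into (w1, w2, w3, phrase, count) rows."""
    rows = []
    for (w1, w2, w3), count in counts.most_common(top_n):
        rows.append((w1, w2, w3, " ".join((w1, w2, w3)), count))
    return rows


def to_spark_df(spark, rows):
    """Build a Spark DataFrame from the rows.

    `spark` is assumed to be an existing SparkSession; this function is not
    called below, so the snippet itself runs without Spark installed.
    """
    return spark.createDataFrame(rows, schema=COLUMNS)


# Made-up counts matching the sample table in the question.
counts = Counter({
    ("how", "are", "you"): 102,
    ("good", "thank", "you"): 96,
    ("are", "you", "okay"): 72,
})
rows = trigram_rows(counts)
print(rows[0])
```

With a session available, `finalDf = to_spark_df(spark, rows)` gives a real data frame, so `withColumn` and `finalDf.write.parquet(..., mode='overwrite')` are available on it.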
Since this is now a data frame, saving the output as a parquet file is no longer an issue either.