AWS Comprehend and PySpark - F.when Not Working


I have rows of sentences in different languages, with the language code in a separate column. I only want to process certain languages (en, es, fr, or de), since I know AWS Comprehend does not support 'nl' (Dutch). However, I keep getting an error saying that 'nl' is not supported, even though it is not listed in my when condition and should therefore never be sent through the Comprehend UDF. Any ideas on what might be wrong?

Here is my code:

import boto3
import pyspark.sql.functions as F

def detect_sentiment(text,language):
    comprehend = boto3.client(service_name='comprehend', region_name='us-west-2')
    sentiment_analysis = comprehend.detect_sentiment(Text=text, LanguageCode=language)
    return sentiment_analysis


detect_sentiment_udf = F.udf(detect_sentiment)

reviews_4 = reviews_3.withColumn('RAW_SENTIMENT_SCORE', \
        F.when( (F.col('LANGUAGE')=='en') | (F.col('LANGUAGE')=='es') | (F.col('LANGUAGE')=='fr') | (F.col('LANGUAGE')=='de') , \
               detect_sentiment_udf('SENTENCE', 'LANGUAGE')).otherwise(None) )

reviews_4.show(50)

I get this error:

botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the DetectSentiment operation: 
Value 'nl' at 'languageCode' failed to satisfy constraint: Member must satisfy enum value set: [ar, hi, ko, zh-TW, ja, zh, de, pt, en, it, fr, es]

There is 1 best solution below

user3242036 On

Still unsure why my code above wouldn't work, but I found the following workaround to be effective.

import boto3
import pyspark.sql.functions as F

def detect_sentiment(text, language):
    # Only call Comprehend for the languages we want to process.
    if language in ('en', 'es', 'fr', 'de'):
        comprehend = boto3.client(service_name='comprehend', region_name='us-west-2')
        sentiment_analysis = comprehend.detect_sentiment(Text=text, LanguageCode=language)
        return sentiment_analysis
    # Every other language (including 'nl') is skipped.
    return None

detect_sentiment_udf = F.udf(detect_sentiment)

reviews_4 = reviews_3.withColumn('RAW_SENTIMENT_SCORE', detect_sentiment_udf('SENTENCE','LANGUAGE'))

reviews_4.show(50)
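
A possible explanation for the original behaviour, offered as a guess rather than a confirmed diagnosis: Spark does not guarantee that the branches of F.when are evaluated only for matching rows, and Python UDFs are generally run over whole batches of rows before the when condition is applied, so the UDF can still receive 'nl'. Putting the check inside the function avoids depending on evaluation order. Below is a minimal sketch of that guard which runs without Spark or AWS. The names SUPPORTED_LANGUAGES, get_client and FakeComprehend are hypothetical and not part of the original code, and returning a JSON string is an assumption made so the value fits a string-typed column.

```python
import json

# Languages this job chooses to send to Comprehend (taken from the question).
SUPPORTED_LANGUAGES = frozenset({"en", "es", "fr", "de"})


def detect_sentiment(text, language, get_client):
    """Return a JSON string with the sentiment, or None for skipped rows.

    `get_client` is a hypothetical zero-argument callable returning an object
    with a `detect_sentiment(Text=..., LanguageCode=...)` method.
    """
    # Guard inside the function: it must be safe for every row it may see.
    if text is None or language not in SUPPORTED_LANGUAGES:
        return None
    response = get_client().detect_sentiment(Text=text, LanguageCode=language)
    # Keep only the two fields of interest and serialise them to a string.
    return json.dumps({
        "Sentiment": response["Sentiment"],
        "SentimentScore": response["SentimentScore"],
    })


class FakeComprehend:
    """Hypothetical stand-in for the boto3 client, for local testing only."""

    def __init__(self):
        self.calls = []  # language codes that actually reached the "service"

    def detect_sentiment(self, Text, LanguageCode):
        self.calls.append(LanguageCode)
        return {"Sentiment": "NEUTRAL", "SentimentScore": {"Neutral": 1.0}}


fake = FakeComprehend()
results = [
    detect_sentiment("Hello", "en", lambda: fake),
    detect_sentiment("Hallo", "nl", lambda: fake),  # skipped, no service call
    detect_sentiment(None, "fr", lambda: fake),     # null text is skipped too
]
```

In a real job, get_client would presumably create the boto3 client as in the code above, and the function would be wrapped with F.udf in the same way as detect_sentiment_udf.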