How to apply googletrans to a Spark DataFrame?

I'm trying to translate a column using the Google Translate API via googletrans. When applying the UDF function, it raises the following error: PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object.

However, when using the translator on a single string, it works properly.
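For example, this works fine on its own:

from googletrans import Translator

translator = Translator()
# Prints the English translation of the German sentence
print(translator.translate("Dies ist der Text auf Deutsch.", dest='en').text)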

Here is my current code:

!pip install googletrans==3.1.0a0
import spacy
from googletrans import Translator
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType

# Clients created on the driver (the spaCy model name is just the one I use)
translator = Translator()
nlp = spacy.load("en_core_web_sm")  # spaCy pipeline used to tokenize the text

# Sample data
data = [("Dies ist der Text auf Deutsch.",),
        ("Este es el texto en español.",),
        ("Ceci est le texte en français.",),
        ("Questo è il testo in italiano.",)]

# Define the schema with a single StringType column named "TEXT"
schema = StructType([StructField("TEXT", StringType(), True)])

# Create the DataFrame with the specified schema
df = spark.createDataFrame(data, schema)

# Function to translate to english
def translate_to_english(text):
  # Tokenize text
  tokens = nlp(text)   
  # Initialize an empty string to store the translated text
  translated_text = ''  
  for token in tokens:
    try:
      # Translate the token text
      translated = translator.translate(token.text, dest='en')  
      # Append the translated word and a space
      translated_text += translated.text + ' '     
    except Exception:
      # If translation fails, use the original word
      translated_text += token.text + ' '    
  # Remove the trailing space and return the translated text
  return translated_text.strip()  

# Register the UDF
translate_to_english_udf = udf(translate_to_english, StringType())

# Apply the UDF to dataframe
df = df.withColumn("TRANSLATED_TEXT", translate_to_english_udf(df["TEXT"]))
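
From what I can tell, the problem may be that the UDF closure captures the driver-side translator and nlp objects, which hold thread locks and therefore can't be pickled. Would constructing the client inside the function avoid that? Here is a sketch of what I mean (translate_to_english_v2 is just a name for this variant; it translates the whole string at once instead of token by token):

def translate_to_english_v2(text):
  # Create the Translator inside the function so the UDF closure
  # contains nothing unpicklable from the driver
  from googletrans import Translator
  translator = Translator()
  try:
    return translator.translate(text, dest='en').text
  except Exception:
    # If translation fails, keep the original text
    return text

translate_to_english_v2_udf = udf(translate_to_english_v2, StringType())
df = df.withColumn("TRANSLATED_TEXT", translate_to_english_v2_udf(df["TEXT"]))

Or is something like mapPartitions or a pandas UDF the preferred approach here, so the client is created once per partition rather than once per row?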