I have a large Spanish dataset in Stata with more than 2500 variables, and I want to translate it into English. Many of these variables have value labels. I am using Google's API for translation. For now I have taken 10 observations and 2 variables (p4 and p5) that have value labels, and I am trying to write code to translate them. However, there is a problem with the translation of the value labels. In my original dataset the p4 variable has the following value labels:
4 Educación Básica o Preparatoria completa
6 Educación Media o Humanidades completa
7 Instituto Profesional o Centros de Formación Técnica incompl
8 Instituto Profesional o Centros de Formación Técnica complet
9 Universitaria incompleta
10 Universitaria completa
However, the translated dataset (p4 variable) shows the following labels:
0 Complete Basic or High School Education
1 Secondary Education or Complete Humanities
2 Professional Institute or Technical Training Centers incomplete
3 Professional Institute or Complete Technical Training Centers
4 incomplete university
5 Complete university
Basically, the numeric codes of the value labels are not recorded correctly in the final dataset, which is again in .dta format. How do I modify my Python code to solve this?
My code follows. Please suggest how to modify it to solve the above issue.
import pandas as pd
from googletrans import Translator, LANGUAGES
# Initialize the translator
translator = Translator()
# Step 1: Read the Stata dataset into Python
df = pd.read_stata('C:\\transl_trial.dta')
# Step 2: Identify the variables with value labels
columns_to_translate = ['p4', 'p5']
from pandas.api.types import CategoricalDtype
# Step 3: Translate the value labels
for col in columns_to_translate:
    # Extract value labels for the column
    value_labels = df[col].cat.categories.tolist()
    print(value_labels)
    translations = {}
    for label in value_labels:
        # Translate from Spanish to English
        translated_text = translator.translate(label, src='es', dest='en').text
        translations[label] = translated_text
    print(translations)
    # Replace the original categories with their translated versions
    df[col] = df[col].replace(translations).astype('category')
output_path = r'C:\\translated_dataset.dta'
df.to_stata(output_path, write_index=False)
As far as I remember, value labels in Stata datasets are attached to numeric codes. pandas reads labelled columns as categoricals by default, and when the translated categorical is written back with to_stata the categories are renumbered from 0, so the original codes are lost. You therefore need to preserve those codes when translating. I think this code solves the issue:
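Here is a minimal sketch of that idea. It assumes pandas 1.4 or later, because it relies on the `value_labels` argument of `to_stata`. The helper names `code_to_label` and `translate_stata_labels` and the `translate` parameter are made up for illustration. `translate` can be any function that maps a string to a string, so the sketch does not depend on a particular translation library. The file is read twice, once with labels and once with raw codes, to pair each code with its label. Only codes that actually occur in the data are picked up this way.

```python
import pandas as pd


def code_to_label(coded, labeled):
    """Pair each numeric code that occurs in a column with its value label."""
    pairs = pd.DataFrame({"code": coded, "label": labeled.astype(object)})
    pairs = pairs.dropna().drop_duplicates()
    # Keep only real text labels; unlabelled codes come back as numbers.
    return {
        int(code): label
        for code, label in zip(pairs["code"], pairs["label"])
        if isinstance(label, str)
    }


def translate_stata_labels(src_path, dst_path, columns, translate):
    """Write a copy of src_path with translated value labels for `columns`,
    keeping the original numeric codes."""
    labeled = pd.read_stata(src_path)                           # labels as categories
    coded = pd.read_stata(src_path, convert_categoricals=False)  # raw numeric codes
    new_labels = {}
    for col in columns:
        mapping = code_to_label(coded[col], labeled[col])
        new_labels[col] = {code: translate(text) for code, text in mapping.items()}
    # Write the numeric columns and attach the translated labels to the same codes.
    coded.to_stata(dst_path, write_index=False, value_labels=new_labels)
    return new_labels
```

With the translator from the question, the call could look like `translate_stata_labels('C:\\transl_trial.dta', 'C:\\translated_dataset.dta', ['p4', 'p5'], lambda s: translator.translate(s, src='es', dest='en').text)`. After that, p4 should still carry the codes 4, 6, 7, 8, 9 and 10 with English labels attached.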