I'm currently using Hugging Face's transformers library for zero-shot classification to analyze customer reviews of products (in Spanish), but I'm facing a scalability problem.
At first I was using the model below, but it takes too long to process each review (I need to process around 5k to 10k reviews daily).
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")
Then I switched to this smaller version of the model. The processing time is much better, but the quality of the results is very poor compared to the previous model.
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli")
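To put numbers on "too long" versus "much better", the throughput of each model could be measured with something like the sketch below. The helper name `reviews_per_second` and the dummy classifier are hypothetical stand-ins so the snippet runs without downloading a model; in practice the callable would wrap one of the pipelines above.

```python
import time
from typing import Callable, Sequence


def reviews_per_second(classify: Callable[[str], object], texts: Sequence[str]) -> float:
    """Call classify once per text and return the number of texts processed per second."""
    start = time.perf_counter()
    for text in texts:
        classify(text)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")


def dummy_classify(text: str) -> dict:
    # Hypothetical stand-in for a real pipeline call; it only simulates some latency.
    time.sleep(0.001)
    return {"sequence": text, "labels": [], "scores": []}


rate = reviews_per_second(dummy_classify, ["reseña de prueba"] * 20)
print(f"{rate:.1f} reviews/second")
```

With a real model, `classify` could be `lambda text: classifier(text, candidate_labels, multi_label=True)`, run on a small sample of reviews for each model.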
I'd like to know if there are ways to improve this situation, or whether a completely different approach would be better (this is my first time doing NLP). My main objective is to check each review and see whether it's related to certain topics (good product quality, bad product quality, correct size, wrong size, correct color, wrong color, damaged product, ...) so I can detect problems with the products or publications and see differences between brands, suppliers, categories, etc.
The code:
from transformers import pipeline

# classifier = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli")
# Each label is listed once; duplicates only add extra work per review.
candidate_labels = ['buena talla', 'mala talla', 'buen tamaño', 'mal tamaño', 'color equivocado', 'no me gustó el color', 'buen producto', 'mal producto',
    'producto se encuentra dañado', 'producto no es el que pedí', 'le faltan partes al pedido', 'entrega rápida', 'demora en llegar', 'buena calidad', 'mala calidad',
    'se rompe', 'talla grande', 'talla pequeña', 'comodo', 'incomodo', 'buena experiencia', 'mala experiencia', 'lo recomiendo',
    'no lo recomiendo', 'no era lo que esperaba', 'descripción incorrecta']
# df is a pandas DataFrame with the review text in the 'COMENTARIO' column.
for index, row in df.iterrows():
    # One pipeline call per review; multi_label=True scores every label independently.
    output = classifier(row['COMENTARIO'], candidate_labels, multi_label=True)
    # Labels come back sorted by score, so each score is looked up by its label.
    df.at[index, 'BUENA_TALLA'] = output['scores'][output['labels'].index('buena talla')]
    df.at[index, 'MALA_TALLA'] = output['scores'][output['labels'].index('mala talla')]
    df.at[index, 'BUEN_TAMANO'] = output['scores'][output['labels'].index('buen tamaño')]
    df.at[index, 'MAL_TAMANO'] = output['scores'][output['labels'].index('mal tamaño')]
    df.at[index, 'COLOR_EQUIVOCADO'] = output['scores'][output['labels'].index('color equivocado')]
    df.at[index, 'NO_GUSTA_COLOR'] = output['scores'][output['labels'].index('no me gustó el color')]
    df.at[index, 'BUEN_PRODUCTO'] = output['scores'][output['labels'].index('buen producto')]
    df.at[index, 'MAL_PRODUCTO'] = output['scores'][output['labels'].index('mal producto')]
    df.at[index, 'PRODUCTO_DANADO'] = output['scores'][output['labels'].index('producto se encuentra dañado')]
    df.at[index, 'NO_CORRESPONDE'] = output['scores'][output['labels'].index('producto no es el que pedí')]
    df.at[index, 'PRODUCTO_INCOMPLETO'] = output['scores'][output['labels'].index('le faltan partes al pedido')]
    df.at[index, 'ENTREGA_RAPIDA'] = output['scores'][output['labels'].index('entrega rápida')]
    df.at[index, 'DEMORA_LLEGAR'] = output['scores'][output['labels'].index('demora en llegar')]
    df.at[index, 'BUENA_CALIDAD'] = output['scores'][output['labels'].index('buena calidad')]
    df.at[index, 'MALA_CALIDAD'] = output['scores'][output['labels'].index('mala calidad')]
    df.at[index, 'SE_ROMPE'] = output['scores'][output['labels'].index('se rompe')]
    df.at[index, 'TALLA_GRANDE'] = output['scores'][output['labels'].index('talla grande')]
    df.at[index, 'TALLA_PEQUENA'] = output['scores'][output['labels'].index('talla pequeña')]
    df.at[index, 'COMODO'] = output['scores'][output['labels'].index('comodo')]
    df.at[index, 'INCOMODO'] = output['scores'][output['labels'].index('incomodo')]
    df.at[index, 'BUENA_EXP'] = output['scores'][output['labels'].index('buena experiencia')]
    df.at[index, 'MALA_EXP'] = output['scores'][output['labels'].index('mala experiencia')]
    df.at[index, 'RECOMIENDO'] = output['scores'][output['labels'].index('lo recomiendo')]
    df.at[index, 'NO_RECOMIENDO'] = output['scores'][output['labels'].index('no lo recomiendo')]
    df.at[index, 'NO_ERA_LO_QUE_ESPERABA'] = output['scores'][output['labels'].index('no era lo que esperaba')]
    df.at[index, 'DESCRIPCION_INCORRECTA'] = output['scores'][output['labels'].index('descripción incorrecta')]
    print(f"Review: {row['COMENTARIO']}")
    print("Predicted labels:", output['labels'])
    print("Scores:", output['scores'])
    print("="*50)
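A variation on the loop above, not benchmarked here, would be to pass all reviews to the pipeline in one call so it can batch them, and to turn each result into a label-to-score mapping instead of looking up every label by index. This is only a sketch: `score_reviews` and `fake_classifier` are hypothetical names, and the fake classifier exists solely so the snippet runs without downloading a model. It assumes the pipeline returns one dict with `labels` and `scores` per input text when given a list, and accepts a `batch_size` argument at call time.

```python
from typing import Callable, Dict, List, Sequence


def score_reviews(
    classifier: Callable,
    texts: Sequence[str],
    labels: Sequence[str],
    batch_size: int = 16,
) -> List[Dict[str, float]]:
    """Return one {label: score} dict per review, in the same order as texts."""
    # Drop duplicate labels while keeping their original order.
    unique_labels = list(dict.fromkeys(labels))
    # A single call with the full list lets the pipeline group inputs into batches.
    outputs = classifier(list(texts), unique_labels, multi_label=True, batch_size=batch_size)
    return [dict(zip(out["labels"], out["scores"])) for out in outputs]


def fake_classifier(texts, labels, multi_label=True, batch_size=1):
    # Hypothetical stand-in that mimics the output shape with a constant score.
    return [
        {"sequence": text, "labels": list(labels), "scores": [0.5] * len(labels)}
        for text in texts
    ]


rows = score_reviews(
    fake_classifier,
    ["La talla es muy pequeña", "Llegó roto"],
    ["mala talla", "producto se encuentra dañado", "mala talla"],
)
print(rows[0])
```

With the real pipeline, the resulting list of dicts could be converted with `pd.DataFrame(rows, index=df.index)` and joined to `df`, giving one score column per label.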