How to rejoin a word that a tokenizer split apart after machine translation, using NLP?

Machine translation from Russian produces the following result. Is there an NLP function we can use to rejoin "Europe ' s" into "Europe's" in this string?

"Nitzchia Protector Todibo can go to one of Europe ' s top clubs"

Best answer, by alvas

Try a detokenizer. Its rules expect to rewrite x 's -> x's, but not x ' s -> x's directly, so you might have to apply the detokenizer iteratively, e.g. using sacremoses:

>>> from sacremoses import MosesDetokenizer
>>> md = MosesDetokenizer(lang='en')

>>> md.detokenize("Nitzchia Protector Todibo can go to one of Europe ' s top clubs".split())
"Nitzchia Protector Todibo can go to one of Europe 's top clubs"

>>> md.detokenize(md.detokenize("Nitzchia Protector Todibo can go to one of Europe ' s top clubs".split()).split())
"Nitzchia Protector Todibo can go to one of Europe's top clubs"
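
If you don't know in advance how many passes a string needs, one option is to repeat the detokenizer until the output stops changing. The following is only a minimal sketch of that idea: `detokenize_until_stable` and `max_rounds` are hypothetical names, not part of sacremoses, and the helper assumes it is given any callable that takes a list of tokens and returns a string (such as `md.detokenize` above).

```python
from typing import Callable, List


def detokenize_until_stable(
    detokenize: Callable[[List[str]], str],
    text: str,
    max_rounds: int = 5,
) -> str:
    """Re-apply `detokenize` until the output no longer changes.

    `detokenize` is assumed to accept a list of tokens and return a string.
    `max_rounds` is a hypothetical safety cap against endless looping.
    """
    for _ in range(max_rounds):
        # Split on whitespace again so the previous output becomes new input.
        result = detokenize(text.split())
        if result == text:
            # Fixed point reached: another pass would change nothing.
            break
        text = result
    return text
```

With the `md` object from above, the call would be `detokenize_until_stable(md.detokenize, "... Europe ' s top clubs")`. The cap on rounds is a design choice to keep the loop bounded if a detokenizer never settles.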