Seq2Seq Neural Machine Translation step for aligning right-to-left languages with English (or any LTR language)

So far I have only worked with left-to-right languages, where NLTK worked fine for tokenization. Now I am working on a research paper that covers several languages, including RTL languages, and the same procedure gives me completely inaccurate translations. What is the standard practice in neural machine translation when working with languages like Persian or Hebrew?

I tried following the steps in the NMT with attention tutorial, changing the two regexes to cover the Farsi and Urdu scripts (along with other languages) and to separate the punctuation:

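import tensorflow as tf
import tensorflow_text as tf_text

def lowerSplitPunct(text):
  # NFKC normalization: folds compatibility forms (e.g. Arabic presentation
  # forms) into their base letters.
  text = tf_text.normalize_utf8(text, 'NFKC')
  text = tf.strings.lower(text)
  # Keep Arabic-script (U+0600-U+06FF) and Bengali (U+0980-U+09FF) characters,
  # ZWNJ, RLM, a to z, and select punctuation; everything else is removed.
  text = tf.strings.regex_replace(text, '[^\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F\u0980-\u09FFa-z۔؟،«»।ا.?!,]', '')
  # Add spaces around every character in this class (punctuation, plus alef).
  text = tf.strings.regex_replace(text, '[۔؟،«»ا।.?!,]', r' \0 ')
  # Strip leading and trailing whitespace.
  text = tf.strings.strip(text)

  # Wrap the sentence in start and end tokens.
  text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
  return text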

but it still doesn't solve my problem.
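
To make the function easier to inspect outside TensorFlow, here is a minimal pure-Python sketch of the same two substitutions, using `re` and `unicodedata` in place of `tf.strings` and `tf_text`. The names `lower_split_punct_py`, `KEEP`, `PUNCT`, `KEEP_WITH_SPACE`, and `PUNCT_NO_ALEF` are hypothetical and not part of any library. The second pair of classes is only an assumption about what may have been intended (a space added to the keep list, alef dropped from the punctuation list), not a confirmed fix.

```python
import re
import unicodedata

# Character classes copied from the TensorFlow function above.
KEEP = '\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F\u0980-\u09FFa-z۔؟،«»।ا.?!,'
PUNCT = '۔؟،«»ا।.?!,'

# Assumed variants: keep the space character, and do not treat the
# letter alef (U+0627) as punctuation.
KEEP_WITH_SPACE = ' ' + KEEP
PUNCT_NO_ALEF = PUNCT.replace('\u0627', '')


def lower_split_punct_py(text, keep=KEEP, punct=PUNCT):
    """Pure-Python mirror of the two regex steps, for inspection only."""
    # Normalize and lowercase, as in the TensorFlow version.
    text = unicodedata.normalize('NFKC', text).lower()
    # Delete every character that is not in the keep class.
    text = re.sub(f'[^{keep}]', '', text)
    # Surround every character in the punct class with spaces.
    text = re.sub(f'[{punct}]', r' \g<0> ', text)
    return f'[START] {text.strip()} [END]'


if __name__ == '__main__':
    sample = 'سلام دنیا.'
    # Tokens produced by the classes as written in the question.
    print(lower_split_punct_py(sample).split())
    # Tokens produced by the assumed variants.
    print(lower_split_punct_py(sample, KEEP_WITH_SPACE, PUNCT_NO_ALEF).split())
```

With the classes as written, the space between the two words is deleted and every alef becomes its own token, so the word boundaries are lost before the tokenizer sees the text. Neither variant reverses anything: the strings are stored in logical (reading) order regardless of how RTL text is displayed.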
