So far I've only worked with left-to-right languages, and NLTK worked fine for tokenization. But while working on a research paper that covers several languages, including RTL languages, my usual procedure has been giving me completely inaccurate translations. Could anyone let me know what the norm is in neural machine translation when working with right-to-left languages like Persian or Hebrew?
I've tried following the steps in the TensorFlow NMT with attention tutorial, where I changed the two regexes to cover the Farsi and Urdu scripts (along with a few other languages) and to separate out the punctuation:
import tensorflow as tf
import tensorflow_text as tf_text

def lowerSplitPunct(text):
    # Apply Unicode compatibility normalization (NFKC) and lowercase any Latin text.
    text = tf_text.normalize_utf8(text, 'NFKC')
    text = tf.strings.lower(text)
    # Keep space, ZWNJ/RLM, the Arabic and Bengali blocks, a to z, and select punctuation.
    text = tf.strings.regex_replace(
        text, '[^ \u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F\u0980-\u09FFa-z۔؟،«»।.?!,]', '')
    # Add spaces around punctuation marks.
    text = tf.strings.regex_replace(text, '[۔؟،«»।.?!,]', r' \0 ')
    # Strip surrounding whitespace and add the start/end tokens.
    text = tf.strings.strip(text)
    text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
    return text
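For reference, this is roughly how I call it on a single sentence; the Persian example below is just a made-up placeholder, not a line from my actual dataset:

example = tf.constant('او کتاب می‌خواند.')
print(lowerSplitPunct(example).numpy().decode('utf-8'))
# Prints something like: [START] او کتاب می‌خواند . [END]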
Even with this preprocessing, it still doesn't solve my problem.
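In case it matters, I plug the function into the vectorization layer the same way the tutorial does. This is only a sketch of my setup, and max_vocab_size is a placeholder value rather than my real setting:

max_vocab_size = 5000

input_text_processor = tf.keras.layers.TextVectorization(
    standardize=lowerSplitPunct,
    max_tokens=max_vocab_size,
    ragged=True)

# Then adapted on the source side of my training dataset, e.g.:
# input_text_processor.adapt(train_ds.map(lambda src, tgt: src))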