Add diacritics to Arabic ligature


I have an Arabic word without diacritics, “الآن”. The Unicode string representing it is as follows:

  • \uFEE5\uFEF5\uFE8D

This string corresponds to three code points:

  • FEE5 → ARABIC LETTER NOON ISOLATED FORM

  • FEF5 → ARABIC LIGATURE LAM WITH ALEF WITH MADDA ABOVE ISOLATED FORM

  • FE8D → ARABIC LETTER ALEF ISOLATED FORM

I would like to add the diacritics to this word, so that it becomes "اَلْآَنْ", and reflect that change in the Unicode string. The ligature is the problem. How do I do this? What is the code corresponding to this modification?

  • \u... ?

Answer by Rob Napier:

First, a basic note: your Unicode expansion is backwards. The code points are listed in the reverse of their logical order.

(Though looking over your earlier question I have a feeling you may just be on the completely wrong track here. Unicode is not intended to encode presentation. You're going to fight it at every turn if you try to force it to. For this question, I'm going to assume you mean \uFE8D\uFEF5\uFEE5 because that's answerable. Perhaps the real problem is "why do you have isolated forms in the first place? This is not what they're for." Maybe that's best discussed in comments, and we can refine this answer. If the question is "how do I encode in Unicode the ligature lam-alef with proper diacritic layout on each portion within the glyph," the answer is that's not possible in Unicode. Unicode doesn't encode layout at all. You would need a font, or possibly a layout engine, that understood this construction.)

Also, your textual forms do not actually match your descriptions, which makes things a bit harder to answer. In your question, you write الآن, but that isn't \uFE8D\uFEF5\uFEE5, it's \u0627\u0644\u0622\u0646. I'm going to assume the following:

  • You have: \uFE8D\uFEF5\uFEE5 (alef isolated, lam-alef-madda isolated, noon isolated).
  • You want to get to a valid Unicode representation of: alef+fatha, lam+sukun, alef+madda+fatha, noon+sukun

Yes, the pre-composed ligature is the problem. You need to decompose this first. Unicode can apply fatha and sukun to the lam-alef, but no layout engine is likely to properly place the sukun over the lam and the fatha over the alef-madda once you've composed everything.

The closest you can get would be \uFE8D\u064E\uFEF5\u0652\u064E\uFEE5\u0652, which is probably not what you'd want: ﺍَﻵَْﻥْ. I'm sure readers can figure it out, but... no.

(Note that the actual placement of diacritics is not part of Unicode. Unicode just describes the language units. It does not prescribe specifically how to display them. Display is a matter for rendering engines and fonts. So, in principle, a font would be free to display \uFEF5\u0652\u064E the way you want without violating Unicode, but I doubt any ever would. You could, of course, make a custom font to handle this if needed.)
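
One way to see why the ligature cannot carry "sukun on the lam, fatha on the alef" is to look at how the two marks behave after a single base character. The sketch below is only an illustration using Python's standard `unicodedata` module, and the variable names are made up. It shows that both mark orders normalize to the same sequence, so the order of the marks cannot record which letter each one belongs to.

```python
import unicodedata

FATHA = "\u064E"
SUKUN = "\u0652"
LAM_ALEF_MADDA = "\uFEF5"  # the pre-composed ligature from the question

# Both marks are combining characters with different, non-zero combining classes.
for mark in (SUKUN, FATHA):
    print(unicodedata.name(mark), unicodedata.combining(mark))

# Attach the marks to the ligature in both possible orders.
sukun_first = LAM_ALEF_MADDA + SUKUN + FATHA
fatha_first = LAM_ALEF_MADDA + FATHA + SUKUN

# Canonical ordering makes the two sequences equivalent after normalization.
same_after_nfc = (
    unicodedata.normalize("NFC", sukun_first)
    == unicodedata.normalize("NFC", fatha_first)
)
print(same_after_nfc)
```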

The form you want here is probably NFKC. How you normalize depends on your language and tools. As an example, however, icu4c includes a command-line tool, uconv, which can do this (the output below is piped through uni, a separate utility, only to list the code points):

$ echo ﺍﻵﻥ | uni id
     CPoint  Dec    UTF8        HTML       Name (Cat)
'ﺍ'  U+FE8D  65165  ef ba 8d    &#xfe8d;   ARABIC LETTER ALEF ISOLATED FORM (Other_Letter)
'ﻵ'  U+FEF5  65269  ef bb b5    &#xfef5;   ARABIC LIGATURE LAM WITH ALEF WITH MADDA ABOVE ISOLATED FORM (Other_Letter)
'ﻥ'  U+FEE5  65253  ef bb a5    &#xfee5;   ARABIC LETTER NOON ISOLATED FORM (Other_Letter)

$ echo ﺍﻵﻥ | uconv -x any-nfkc | uni id
     CPoint  Dec    UTF8        HTML       Name (Cat)
'ا'  U+0627  1575   d8 a7       &#x627;    ARABIC LETTER ALEF (Other_Letter)
'ل'  U+0644  1604   d9 84       &#x644;    ARABIC LETTER LAM (Other_Letter)
'آ'  U+0622  1570   d8 a2       &#x622;    ARABIC LETTER ALEF WITH MADDA ABOVE (Other_Letter)
'ن'  U+0646  1606   d9 86       &#x646;    ARABIC LETTER NOON (Other_Letter)
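
If you are working in Python rather than on the command line, the same normalization can be sketched with the standard `unicodedata` module. This is just one possible equivalent of the uconv call above, and the variable names are illustrative.

```python
import unicodedata

# Alef isolated, lam-alef-madda isolated, noon isolated.
presentation = "\uFE8D\uFEF5\uFEE5"

# NFKC replaces the presentation forms with ordinary letters and splits
# the ligature into lam followed by alef-with-madda.
normalized = unicodedata.normalize("NFKC", presentation)

code_points = [f"U+{ord(ch):04X}" for ch in normalized]
print(code_points)  # ['U+0627', 'U+0644', 'U+0622', 'U+0646']
```

NFKD would go one step further and split the alef-with-madda into a plain alef plus a combining madda, which is why NFKC is the more convenient form here.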

In this form, adding the diacritics is straightforward:

\u0627\u064E\u0644\u0652\u0622\u064E\u0646\u0652: اَلْآَنْ
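
Putting the two steps together, a rough Python sketch might look like the following. The `add_marks` helper is hypothetical, and the list of marks is supplied by hand for this one word; deciding which diacritic belongs on which letter is a separate problem that this does not attempt to solve.

```python
import unicodedata

FATHA = "\u064E"
SUKUN = "\u0652"

def add_marks(letters, marks):
    """Follow each base letter with its combining mark, pairwise."""
    if len(letters) != len(marks):
        raise ValueError("need exactly one mark per letter")
    return "".join(letter + mark for letter, mark in zip(letters, marks))

# Step 1: get rid of the presentation forms and the ligature.
base = unicodedata.normalize("NFKC", "\uFE8D\uFEF5\uFEE5")

# Step 2: alef+fatha, lam+sukun, alef-madda+fatha, noon+sukun.
vocalized = add_marks(base, [FATHA, SUKUN, FATHA, SUKUN])

print(vocalized.encode("unicode_escape").decode("ascii"))
```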