What is an algorithm to prove that a transliteration scheme is lossless?

89 Views Asked by At

Say you want to convert one set of characters to another. Instead of there being a 1-1 mapping of each character in set A to a corresponding character in set B, it's slightly more complicated. Think of like a natural language transliterator from a script like Thai or Korean to Latin script. In the Korean example, you get something like this:

SKATS:  LUM CU LE  MEG KUGG BE.
Hangul: 김치가 맛있다.

There is a 1-to-1 mapping of SKATS to Korean Hangul. LU is one letter, M another, CU another, etc. There is no correspondence between these letters and the pronunciation of the Hangul characters, these are just picked from a hat basically.

The issue starts to come in when you have multiple characters represent a single character.

Say for example you have an "aspirated t" sound, represented as in IPA, and you also have the letters t and h, and you have the english th sound (θ in IPA). Say you decide to represent the sound as th. Well, you could have this actual sequence:

tʰ-tʰ-th-t-t-h-t-tʰ-h-th

If you translated that it would be:

 th-th-th-t-t-h-t-th-h-th

Combined how it would actually look is:

 thththtthtthhth

Now the problem is how to get back to the original character sequence. We could interpret this in many ways:

 th-th-th-t-th-t-th-h-th
 th-th-th-t-t-h-t-th-h-th
 th-th-th-t-t-h-t-t-h-h-th
 th-th-th-t-t-h-t-t-h-h-t-h
 th-th-th-t-t-h-t-t-h-h-tʰ
 th-th-th-t-t-h-t-tʰ-h-tʰ
 etc.

How do you write an algorithm that will check if your mapping is 1:1 lossless to prevent this problem? I am trying to wrap my head around it for a few days but I haven't gotten anywhere.

Another example is if you have "single letters" be transcoded into "ts", "s", "t", "h", "th", "tsh", "thsh", etc. So then you have a real problem.

0

There are 0 best solutions below