Say you want to convert one set of characters to another. Instead of there being a 1-1 mapping of each character in set A to a corresponding character in set B, it's slightly more complicated. Think of like a natural language transliterator from a script like Thai or Korean to Latin script. In the Korean example, you get something like this:
SKATS: LUM CU LE MEG KUGG BE.
Hangul: 김치가 맛있다.
There is a 1-to-1 mapping of SKATS to Korean Hangul. LU is one letter, M another, CU another, etc. There is no correspondence between these letters and the pronunciation of the Hangul characters, these are just picked from a hat basically.
The issue starts to come in when you have multiple characters represent a single character.
Say for example you have an "aspirated t" sound, represented as tʰ in IPA, and you also have the letters t and h, and you have the english th sound (θ in IPA). Say you decide to represent the tʰ sound as th. Well, you could have this actual sequence:
tʰ-tʰ-th-t-t-h-t-tʰ-h-th
If you translated that it would be:
th-th-th-t-t-h-t-th-h-th
Combined how it would actually look is:
thththtthtthhth
Now the problem is how to get back to the original character sequence. We could interpret this in many ways:
th-th-th-t-th-t-th-h-th
th-th-th-t-t-h-t-th-h-th
th-th-th-t-t-h-t-t-h-h-th
th-th-th-t-t-h-t-t-h-h-t-h
th-th-th-t-t-h-t-t-h-h-tʰ
th-th-th-t-t-h-t-tʰ-h-tʰ
etc.
How do you write an algorithm that will check if your mapping is 1:1 lossless to prevent this problem? I am trying to wrap my head around it for a few days but I haven't gotten anywhere.
Another example is if you have "single letters" be transcoded into "ts", "s", "t", "h", "th", "tsh", "thsh", etc. So then you have a real problem.