I was wondering if collation is still a big issue when working with databases without legacy data.
For example, in something like BigQuery or Snowflake the character encoding is UTF-8. BigQuery actually supports only two collations: the default and the case-insensitive 'und:ci'. Snowflake has some additional collations.
In my own usage, I have only ever chosen between case-sensitive and case-insensitive collation on a string/character column. Are there other cases where collation matters? I apologize if this is a naive question (perhaps this is related to my only knowing English and never having had to deal much with sorting in other languages).
It is difficult to answer in general, but if you have to ask, it probably doesn't matter for you.
Collation is about ordering strings alphabetically (as opposed to numerically). Does it matter to you whether 'a' comes before or after 'A', or what the order of 'AaA', 'aBA', 'ABa', etc. is? Is '111' before 'AAA' or after 'ZZZ'? And what about accented characters: next to the base character, or among the symbols? In most applications we do not care; at most we want a consistent ordering. Phone books used a different ordering than most dictionaries, so there is no single collation even for a single language. And between languages there are strange rules ('ll' in Spanish, 'å' in Danish, not forgetting that 'Å' is also a unit symbol).
To make things more complex, an application may now be multilingual, so a single collation for the database is not enough, and probably neither is one per table or per field. So it is now good to select the collation at query time (using the language of the user), but that breaks indices (you cannot build an index before knowing the ordering). Or we just use the Unicode Collation Algorithm, which is easier to understand (and has fewer historical exceptions). It works well for most languages.
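To make the ordering questions concrete, here is a small Python sketch (not tied to BigQuery or Snowflake; the word list is made up for illustration). It compares a plain code-point sort with a case-insensitive sort. Neither is a language-aware collation; they only show how much the result depends on the rule chosen.

```python
# Illustrative sample; "\u00e9" is the precomposed letter e with acute accent.
words = ["b", "a", "B", "A", "\u00e9", "z", "111"]

# Python's default string ordering compares Unicode code points:
# digits first, then uppercase ASCII, then lowercase ASCII, then accented letters.
by_code_point = sorted(words)

# Case-insensitive ordering: compare case-folded keys.
# The sort is stable, so strings that differ only by case keep their input order.
case_insensitive = sorted(words, key=str.casefold)

print(by_code_point)
print(case_insensitive)
```

Note that in both orderings the accented letter ends up after 'z', which is rarely what a reader of a dictionary would expect. That is the kind of gap a language-specific collation or the Unicode Collation Algorithm is meant to close.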
So, it is up to you. Are you building an online dictionary in several languages? Then you need a language-specific collation that matches what people expect from a dictionary. Otherwise, it doesn't matter so much. We now rely more on search than on sorted indices (and for search we normalize strings, so there are fewer surprises with accents).
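As a rough illustration of normalizing strings for search, the sketch below folds case and strips accents using Python's standard `unicodedata` module. The helper name `search_key` is hypothetical, and the approach is deliberately lossy: it treats accented and unaccented letters as equal, which is fine for forgiving search but wrong for languages where the accented letter is a distinct letter.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical helper: build a case- and accent-insensitive match key."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # casefold() is a more aggressive lower() intended for caseless matching.
    return without_marks.casefold()


# Both spellings map to the same key, so a search for one finds the other.
print(search_key("R\u00e9sum\u00e9") == search_key("RESUME"))
```

Storing such a key in a separate column lets it be indexed once, independently of whichever collation is later used for display ordering.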
So, if you do not have a particular need, use the default, or the Unicode default collation. If people complain, then you know a better collation is needed, and you will also have more information about the use cases. But I would not over-engineer for a case that probably nobody uses or cares about, and that may also slow down indexing.