Accent sensitive sort in PostgreSQL

178 Views Asked by At

How to sort text in PostgreSQL in Latvian alphabetic order, so any letter without accent always comes before the letter with accent?

For example this

select * from (
    select 'bā' as name
    union select 'āa'
    union select 'ac'
    union select 'Āb'
    union select 'Ā'
) t
order by name COLLATE "lv-x-icu"

returns Ā āa Āb ac bā

but the expected result is ac Ā āa Āb bā

I have tried to create some custom collations but none of them returned the expected result.

2

There are 2 best solutions below

3
Laurenz Albe On BEST ANSWER

This is an unusual requirement. In natural language collations (including the Latuvian collation), strings are compared in different steps:

  • In the first step, punctuation, case and accents are ignored. If the strings compare differently, that determines the sorting order. This is called a difference on the primary level.

  • If the strings are equal on the primary level, only punctuation and case are ignored, but accents matter. If the strings compare differently, that determines the sorting order. This is a difference on the secondary level.

  • If the strings are equal on the secondary level, only punctuation is ignored, but accents and case matter. This is the tertiary level.

  • Finally, if the strings are equal on the tertiary level, differences in punctuation (the quaternary level) matter.

That explains the sort order you see: since a < aa < ab < ac <ba on the primary level, accents are not taken into account.

If you want ac < Ā, you want accents to be considered different on the primary level, as if a and ā were different letters. For that, you'd have to use an ICU collation and customize it with additional rules. This is only possible from v16 on:

CREATE COLLATION latuvian_custom (
   PROVIDER = icu,
   LOCALE = 'lv',
   RULES = '& a < ā & A < Ā & c < č & C < Č & e < ē & E < Ē & g < ģ & G < Ģ & k < ķ & K < Ķ & l < ļ & L < Ļ & n < ņ & N < Ņ & s < š & S < Š & u < ū & U < Ū & z < ž & Z < Ž'
);

select * from (
    select 'bā' as name
    union select 'āa'
    union select 'ac'
    union select 'Āb'
    union select 'Ā'
) t
order by name COLLATE latuvian_custom;

 name 
══════
 ac
 Ā
 Āb
 āa
 bā
(5 rows)

The rules introduce a primary level difference (<) between the unaccented and the accented characters.

1
Zegarek On

A small supplement to @Laurenz' great answer: in order to achieve the desired result of ac Ā āa Āb bā specified in the question, accents need to be considered at the primary level, but case shouldn't. Demo:

CREATE COLLATION latvian_custom (PROVIDER=icu,LOCALE='lv',
   RULES='&a<Ā=ā &c<Č=č &e<Ē=ē &g<Ģ=ģ &i<Ī=ī &k<Ķ=ķ 
          &l<Ļ=ļ &n<Ņ=ņ &s<Š=š &u<Ū=ū &z<Ž=ž'
);
SELECT string_agg(sample, E'\t' ORDER BY sample COLLATE "latvian_custom") 
FROM (VALUES ('bā'),('āa'),('ac'),('Āb'),('Ā')) AS _(sample);
string_agg
ac Ā āa Āb bā

At that stage, all non-accented characters are already equal to their uppercase variant, so it should be enough to anchor the equal accented pairs behind either (e.g. &A<Ā=ā or &a<Ā=ā). Both case variants of each accented character need to be anchored as equal - as additions, they are treated independently, so it wouldn't work if you only do &a<ā &c<č...&z<ž.

Here's a comparison:

collation_name sorted_samples
expected ac Ā āa Āb bā (OP's example)
default ac bā Ā Āb āa
latUvian_custom ac Ā Āb āa bā
latvian_custom ac Ā āa Āb bā (the only match)
latvian_lower_first Ā āa Āb ac bā
latvian_lower_first_custom ac Ā Āb āa bā
latvian_upper_first Ā āa Āb ac bā
latvian_upper_first_custom ac Ā Āb āa bā
lv-LV-x-icu Ā āa Āb ac bā
lv-x-icu Ā āa Āb ac bā
unicode Ā āa Āb ac bā