Accent sensitive sort in PostgreSQL

Question

Accent sensitive sort in PostgreSQL

178 Views Asked by va. At 13 October 2023 at 13:27

How to sort text in PostgreSQL in Latvian alphabetic order, so any letter without accent always comes before the letter with accent?

For example this

select * from (
    select 'bā' as name
    union select 'āa'
    union select 'ac'
    union select 'Āb'
    union select 'Ā'
) t
order by name COLLATE "lv-x-icu"

returns Ā āa Āb ac bā

but the expected result is ac Ā āa Āb bā

I have tried to create some custom collations but none of them returned the expected result.

Original Q&A

There are 2 best solutions below

Zegarek On 20 October 2023 at 14:51

A small supplement to @Laurenz' great answer: in order to achieve the desired result of ac Ā āa Āb bā specified in the question, accents need to be considered at the primary level, but case shouldn't. Demo:

CREATE COLLATION latvian_custom (PROVIDER=icu,LOCALE='lv',
   RULES='&a<Ā=ā &c<Č=č &e<Ē=ē &g<Ģ=ģ &i<Ī=ī &k<Ķ=ķ 
          &l<Ļ=ļ &n<Ņ=ņ &s<Š=š &u<Ū=ū &z<Ž=ž'
);
SELECT string_agg(sample, E'\t' ORDER BY sample COLLATE "latvian_custom") 
FROM (VALUES ('bā'),('āa'),('ac'),('Āb'),('Ā')) AS _(sample);

string_agg
ac Ā āa Āb bā

At that stage, all non-accented characters are already equal to their uppercase variant, so it should be enough to anchor the equal accented pairs behind either (e.g. &A<Ā=ā or &a<Ā=ā). Both case variants of each accented character need to be anchored as equal - as additions, they are treated independently, so it wouldn't work if you only do &a<ā &c<č...&z<ž.

Here's a comparison:

collation_name	sorted_samples
expected	ac Ā āa Āb bā (OP's example)
default	ac bā Ā Āb āa
latUvian_custom	ac Ā Āb āa bā
latvian_custom	ac Ā āa Āb bā (the only match)
latvian_lower_first	Ā āa Āb ac bā
latvian_lower_first_custom	ac Ā Āb āa bā
latvian_upper_first	Ā āa Āb ac bā
latvian_upper_first_custom	ac Ā Āb āa bā
lv-LV-x-icu	Ā āa Āb ac bā
lv-x-icu	Ā āa Āb ac bā
unicode	Ā āa Āb ac bā

**Laurenz Albe** · Accepted Answer · 2023-10-13T13:47:54.207000

This is an unusual requirement. In natural language collations (including the Latuvian collation), strings are compared in different steps:

In the first step, punctuation, case and accents are ignored. If the strings compare differently, that determines the sorting order. This is called a difference on the primary level.
If the strings are equal on the primary level, only punctuation and case are ignored, but accents matter. If the strings compare differently, that determines the sorting order. This is a difference on the secondary level.
If the strings are equal on the secondary level, only punctuation is ignored, but accents and case matter. This is the tertiary level.
Finally, if the strings are equal on the tertiary level, differences in punctuation (the quaternary level) matter.

That explains the sort order you see: since a < aa < ab < ac <ba on the primary level, accents are not taken into account.

If you want ac < Ā, you want accents to be considered different on the primary level, as if a and ā were different letters. For that, you'd have to use an ICU collation and customize it with additional rules. This is only possible from v16 on:

CREATE COLLATION latuvian_custom (
   PROVIDER = icu,
   LOCALE = 'lv',
   RULES = '& a < ā & A < Ā & c < č & C < Č & e < ē & E < Ē & g < ģ & G < Ģ & k < ķ & K < Ķ & l < ļ & L < Ļ & n < ņ & N < Ņ & s < š & S < Š & u < ū & U < Ū & z < ž & Z < Ž'
);

select * from (
    select 'bā' as name
    union select 'āa'
    union select 'ac'
    union select 'Āb'
    union select 'Ā'
) t
order by name COLLATE latuvian_custom;

 name 
══════
 ac
 Ā
 Āb
 āa
 bā
(5 rows)

The rules introduce a primary level difference (<) between the unaccented and the accented characters.

Accent sensitive sort in PostgreSQL

There are 2 best solutions below

Related Questions in POSTGRESQL

Related Questions in SQL-ORDER-BY

Related Questions in COLLATION

Related Questions in ALPHABETICAL-SORT

Related Questions in ACCENT-SENSITIVE

Trending Questions

Popular # Hahtags

Popular Questions