PostgreSQL pattern matching with Unicode graphemes

267 Views Asked by TechnoSam At 18 November 2020 at 03:25

Is there any way to pattern match with Unicode graphemes?

As a quick example, when I run this query:

CREATE TABLE test (
    id SERIAL NOT NULL, 
    name VARCHAR NOT NULL,
    PRIMARY KEY (id), 
    UNIQUE (name)
);
INSERT INTO test (name) VALUES (' One');
INSERT INTO test (name) VALUES (' Two');

SELECT * FROM public.test WHERE test.name LIKE '%';

I get both rows returned, rather than just ' Two'. Postgres seems to be just comparing code points, but I want it to compare full graphemes, so it should only match ' Two', because is a different grapheme.

Is this possible?

Original Q&A

There are 1 best solutions below

S-Man On 18 November 2020 at 08:02 BEST ANSWER

It's a very interesting question!

I am not quite sure if it is possible anyway:

The skinned emojis are, in fact, two joined characters (like ligatures). The first character is the yellow hand which is followed by an emoji skin modifier

This is how the light skinned hand is stored internally. So, for me, your result makes sense:

When you query any string, that begins with , it will return:

Two (trivial)
_ One (ignore the underscore, I try to suppress the automated ligature with this)

So, you can see, the light skinned emoji internally also starts with . That's why I believe, that your query doesn't work the way you like.

Workarounds/Solutions:

You can add a space to your query. This ensures, that there's no skin modifier after your character. Naturally, this only works in your case, where all data sets have a space after the hand:
```
SELECT * FROM test WHERE name LIKE ' %';
```

You can simply extend the WHERE clause like this:

SELECT * FROM test 
WHERE name LIKE '%'
    AND name NOT LIKE '%'
    AND name NOT LIKE '%'
    AND name NOT LIKE '%'
    AND name NOT LIKE '%'
    AND name NOT LIKE '%'

You can use regular expression pattern matching to exclude the skins:
```
SELECT * FROM test 
WHERE name  ~ '^[^]*$'
```

see demo:db<>fiddle (note that the fiddle seems not to provide automated ligatures, so both characters are separated displayed there)

PostgreSQL pattern matching with Unicode graphemes

There are 1 best solutions below

Related Questions in POSTGRESQL

Related Questions in UNICODE

Related Questions in GRAPHEME

Trending Questions

Popular # Hahtags

Popular Questions