Why does RLIKE match emoji in MariaDB 10.2?

67 Views Asked by At

DB: MariaDB 10.2

Why is this simple regexp matching emoji when the emoji is 4 bytes long. Shouldn't it just match question mark character?

([email protected]:3306) [test]> select '' RLIKE '^[?]+$';
+-----------------------------------+
| '\xF0\x9F\x98\x83' RLIKE '^[?]+$' |
+-----------------------------------+
|                                 1 |
+-----------------------------------+
1 row in set (0,00 sec)


([email protected]:3306) [test]> SHOW VARIABLES LIKE 'collation%';
+----------------------+--------------------+
| Variable_name        | Value              |
+----------------------+--------------------+
| collation_connection | utf8mb4_general_ci |
| collation_database   | utf8mb4_general_ci |
| collation_server     | utf8mb4_general_ci |
+----------------------+--------------------+
3 rows in set (0,00 sec)
1

There are 1 best solutions below

1
On

I can replicate on 10.6.12 with:

set @@character_set_connection=utf8mb4;
select '' rlike '^[?]+$';

I thought it might be related to this issue:

https://jira.mariadb.org/browse/MDEV-11777?jql=project%20%3D%20MDEV%20AND%20text%20~%20%22regexp%22

The REGEXP_REPLACE function converts supplementary characters (4 byte utf8 encoding) into "?" when the charset is utf8mb4.

but the response to danblack's MDEV-32904 indicates it may be happening because of a mismatch in your character_set% variables. For example:

MariaDB [(none)]> show variables like 'character_set_%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb3                    |
| character_set_connection | utf8mb3                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb3                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8mb3                    |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.001 sec)

MariaDB [(none)]> select hex('');
+-------------------------+
| hex('\xF0\x9F\x98\x83') |
+-------------------------+
| F09F9883                |
+-------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> set @@character_set_connection=utf8mb4;
Query OK, 0 rows affected (0.000 sec)

MariaDB [(none)]> select hex('');
+-------------------------+
| hex('\xF0\x9F\x98\x83') |
+-------------------------+
| 3F3F3F3F                |
+-------------------------+
1 row in set (0.000 sec)

MariaDB [(none)]> set names utf8mb4;
Query OK, 0 rows affected (0.000 sec)

MariaDB [(none)]> select hex('');
+----------+
| hex('?') |
+----------+
| F09F9883 |
+----------+
1 row in set (0.000 sec)

MariaDB [(none)]>