I have a Galera master-master cluster on MariaDB 10.4 with galera-4 as the replication library. It runs perfectly until a network issue occurs between one node and the others. After the node reconnects and resyncs its state, the performance of the whole cluster drops drastically, and the only way to restore it is to rebuild the whole cluster from a donor node.
This is what I got in the log:
2023-03-19 1:23:20 0 [Note] WSREP: ####### Adjusting cert position: 61149358 -> 61149359
2023-03-19 1:23:20 0 [Note] WSREP: Service thread queue flushed.
2023-03-19 1:23:20 0 [Note] WSREP: Lowest cert index boundary for CC from ist: 61149189
2023-03-19 1:23:20 0 [Note] WSREP: Min available from gcache for CC from ist: 60988106
2023-03-19 1:23:20 0 [Note] WSREP: Receiving IST...100.0% (545/545 events) complete.
2023-03-19 1:23:21 2 [Note] WSREP:
================================================
View:
  id: 6caf7137-be5b-11ed-952d-8e2998c314a7:61149359
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 2
  members(4):
0: 38ac2156-c3e1-11ed-b954-ba79e6d7687d, DB1
1: 637c05dc-c17f-11ed-b3f2-0e364b289bef, DB3
2: 667386bc-c5db-11ed-ad68-ee03169ac5b9, DB4
3: c9c34048-c3e0-11ed-8cbc-13957f2241e4, DB2
=================================================
2023-03-19 1:23:21 2 [Note] WSREP: Server status change initialized -> joined
2023-03-19 1:23:21 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-03-19 1:23:21 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-03-19 1:23:21 2 [Note] WSREP: Draining apply monitors after IST up to 61149359
2023-03-19 1:23:21 2 [Note] WSREP: IST received: 6caf7137-be5b-11ed-952d-8e2998c314a7:61149359
2023-03-19 1:23:21 2 [Note] WSREP: Lowest cert index boundary for CC from sst: 61149189
2023-03-19 1:23:21 2 [Note] WSREP: Min available from gcache for CC from sst: 60988107
2023-03-19 1:23:21 0 [Note] WSREP: 2.0 (DB4): State transfer from 3.0 (DB2) complete.
2023-03-19 1:23:21 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 61149618)
2023-03-19 1:23:21 0 [Note] WSREP: Processing event queue:... 0.0% ( 0/260 events) complete.
2023-03-19 1:23:32 2 [Note] WSREP: Processing event queue:... 56.8% (208/366 events) complete.
2023-03-19 1:23:40 0 [Note] WSREP: Member 2.0 (DB4) synced with group.
2023-03-19 1:23:40 0 [Note] WSREP: Processing event queue:...100.0% (402/402 events) complete.
2023-03-19 1:23:40 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 61149756)
2023-03-19 1:23:43 2 [Note] WSREP: Server DB4 synced with group.
2023-03-19 1:23:43 2 [Note] WSREP: Server status change joined -> synced
2023-03-19 1:23:43 2 [Note] WSREP: Synchronized with group, ready for connections
2023-03-19 1:23:43 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-03-19 1:23:59 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61150337, -110 (Connection timed out)
2023-03-19 1:24:15 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61151132, -110 (Connection timed out)
2023-03-19 1:24:17 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61151205, -110 (Connection timed out)
2023-03-19 1:24:23 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61151478, -110 (Connection timed out)
2023-03-19 1:24:33 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61151724, -110 (Connection timed out)
2023-03-19 1:24:38 0 [Note] InnoDB: Buffer pool(s) load completed at 230319 1:24:38
2023-03-19 1:24:41 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61152114, -110 (Connection timed out)
2023-03-19 1:24:43 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61152261, -110 (Connection timed out)
2023-03-19 1:24:46 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61152364, -110 (Connection timed out)
2023-03-19 1:27:32 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61158126, -110 (Connection timed out)
2023-03-19 1:27:36 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61158269, -110 (Connection timed out)
2023-03-19 1:27:40 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61158385, -110 (Connection timed out)
2023-03-19 1:27:47 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61158486, -110 (Connection timed out)
2023-03-19 1:28:12 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61159789, -110 (Connection timed out)
2023-03-19 1:28:15 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61159889, -110 (Connection timed out)
2023-03-19 1:28:18 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61160056, -110 (Connection timed out)
2023-03-19 1:28:21 0 [Warning] WSREP: Failed to report last committed 6caf7137-be5b-11ed-952d-8e2998c314a7:61160171, -110 (Connection timed out)
My wsrep config:
[galera]
wsrep_on=ON
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_address="gcomm://xx.xx.xx.xxx,xxx.xx.xx.xxx,xx.xxx.xx.xxx,xx.xxx.xxx.xx"
wsrep_sst_method=mariabackup
wsrep_sst_auth=xxxxxxxxx:xxxxxxxxx
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
wsrep_cluster_name="xxxxxxx"
wsrep_node_address="xx.xxx.xxx.xx"
wsrep_node_name="DBX"
What is causing this performance issue and how can I fix it? Please help.
Generally speaking, this should not happen. Resyncing a node does not impact that node's performance afterwards; it will work as it is supposed to. The resync process itself has significant overhead because data has to be transferred, but once it completes, no further impact is expected.
The log you have shared shows a typical picture of a node joining the cluster and performing a state transfer via IST (Incremental State Transfer). There are some warnings at the end of the log that may indicate network hiccups, but it is hard to comment based only on the short time frame covered by the log.
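If you want more visibility into whether replication itself is throttling the cluster during such periods (this is not something the log excerpt shows, just a generic diagnostic), you can collect the standard Galera status counters on each node while the slowdown is happening:
-- Run on every node during the slowdown. A receive queue that keeps growing,
-- or a rising wsrep_flow_control_paused value, points at a node that cannot
-- keep up with replication rather than at individual slow queries.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue%';
SHOW GLOBAL STATUS LIKE 'wsrep_local_send_queue%';
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';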
Having said that, there is a potential reason why a node may slow down after a rebuild. It should affect only particular queries and should not be visible in short, fast, properly indexed ones.
MySQL, or to be precise its InnoDB storage engine, keeps track of the contents of the tables. It calculates statistics about the data: how many distinct values a column contains, whether the majority of rows share the same value, and so on. It does not analyze all the rows, only a small sample. Without going into the details, these statistics affect how the optimizer builds the query execution plan. As a result, if you rebuild a node, especially a full rebuild via SST, the statistics may change (a different sample has been taken), and this may change the query execution plan.
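If you want to see what the optimizer is actually working with, the persistent InnoDB statistics live in the mysql schema, so you can compare them on a rebuilt node against the others (the database and table names below are just placeholders):
-- Inspect the persistent statistics for one table on each node.
-- 'mydb' and 'orders' are placeholder names for your own schema and table.
SELECT * FROM mysql.innodb_table_stats
 WHERE database_name = 'mydb' AND table_name = 'orders';
SELECT * FROM mysql.innodb_index_stats
 WHERE database_name = 'mydb' AND table_name = 'orders';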
This is quite unlikely, but it is something you can easily check. If you know the execution plan for a particular query that slowed down (EXPLAIN SELECT ...), you can check that plan again after the rebuild and see whether it differs.
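A minimal check could look like this, using a placeholder query; if the plan after the rebuild picks a different index or examines far more rows, refreshing the statistics usually brings the old plan back:
-- Compare the plan for the slow query before and after the rebuild
-- (placeholder table and columns, substitute your own query).
EXPLAIN SELECT id, created_at FROM orders WHERE customer_id = 123;

-- If the plan changed, re-sampling the table statistics typically restores it.
ANALYZE TABLE orders;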
Finally, the fact that the node has been restarted may also impact performance. The buffer pool has to be filled up again, either by regular query traffic or from the buffer pool dump saved before the restart. This would be a temporary slowdown, though.
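Your log already shows the reload finishing ("Buffer pool(s) load completed" about a minute after the node rejoined). Whether that automatic dump/restore is enabled, and how far along it is after a restart, can be checked with the standard InnoDB variables (both settings default to ON in 10.4):
-- Buffer pool dump/restore settings.
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_dump_at_shutdown';
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_load_at_startup';

-- Progress of the buffer pool reload after a restart.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status';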
We hope this points you in a good direction; the truth is, it is quite hard to analyze such cases without hands-on access to the databases and the data.