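Runtime Observability in MySQL

Seeing what a SQL statement is doing while it runs

1. Introduction

If you can observe how much disk I/O a SQL statement generates, how much memory it uses, and whether it creates temporary tables while it runs, you can find bottlenecks sooner and defuse problems before they turn into incidents.

performance_schema has been enabled by default since MySQL 5.6, MySQL 5.7 added the sys schema, and MySQL 8.0 extended both further. Together they expose a lot of useful information while a statement is running, which gives a reasonable degree of observability.

The examples below show how to observe a statement and which metrics are worth watching.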

2. Installing the employees sample database

Download the employees sample database that MySQL provides from https://dev.mysql.com/doc/index-other.html, unpack it, and load it:

$ mysql -f < employees.sql;

INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
00:00:37

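MySQL also publishes documentation for the sample database: https://dev.mysql.com/doc/employee/en/

The tests in this article use GreatSQL 8.0.32-24 running in an MGR (MySQL Group Replication) environment: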

greatsql> \s
...
Server version:         8.0.32-24 GreatSQL, Release 24, Revision 3714067bc8c
...

greatsql> select MEMBER_ID, MEMBER_ROLE, MEMBER_VERSION from performance_schema.replication_group_members;
+--------------------------------------+-------------+----------------+
| MEMBER_ID                            | MEMBER_ROLE | MEMBER_VERSION |
+--------------------------------------+-------------+----------------+
| 2adec6d2-febb-11ed-baca-d08e7908bcb1 | SECONDARY   | 8.0.32         |
| 2f68fee2-febb-11ed-b51e-d08e7908bcb1 | ARBITRATOR  | 8.0.32         |
| 5e34a5e2-feb6-11ed-b288-d08e7908bcb1 | PRIMARY     | 8.0.32         |
+--------------------------------------+-------------+----------------+

3. Observing a running SQL statement

First, look up the connection ID and the internal thread ID of the current session:

greatsql> select processlist_id, thread_id from performance_schema.threads where processlist_id = connection_id();
+----------------+-----------+
| processlist_id | thread_id |
+----------------+-----------+
|            110 |       207 |
+----------------+-----------+

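The current connection ID is 110 and the internal thread ID is 207.

P.S. This article was put together over several sessions, so the thread_id values in the examples below are not always the same.

3.1 Observing memory usage of a running statement

The following statement sums the salary of every employee, grouped by employee number, sorts by the total in descending order, and returns the first 10 rows. Its execution plan: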

greatsql> explain select emp_no, sum(salary) as total_salary from salaries group by emp_no order by total_salary desc limit 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: index
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 7
          ref: NULL
         rows: 2838426
     filtered: 100.00
        Extra: Using temporary; Using filesort

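The plan is a full index scan (effectively a full table scan, because the index is PRIMARY), and it also needs a temporary table plus an extra filesort.

Before actually running the statement, open a second session in another window and check how much memory the first session (thread_id 207) currently has allocated: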

greatsql> select * from sys.x$memory_by_thread_by_current_bytes where thread_id = 207\G
*************************** 1. row ***************************
         thread_id: 207
              user: root@localhost
current_count_used: 9
 current_allocated: 26266
 current_avg_alloc: 2918.4444
 current_max_alloc: 16464
   total_allocated: 30311

After the statement has finished, query the memory allocation again:

greatsql> select * from sys.x$memory_by_thread_by_current_bytes where thread_id = 207\G
*************************** 1. row ***************************
         thread_id: 207
              user: root@localhost
current_count_used: 13
 current_allocated: 24430
 current_avg_alloc: 1879.2308
 current_max_alloc: 16456
   total_allocated: 95719

The value that matters here changed as follows:

Metric            Before   After
total_allocated   30311    95719

In other words, the memory allocated while the statement ran was 95719 - 30311 = 65408 bytes.
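The before/after comparison is easy to script. The following is a minimal sketch, assuming the two snapshots were copied by hand from sys.x$memory_by_thread_by_current_bytes into dictionaries; the helper name snapshot_delta is made up for this illustration and is not part of MySQL or the sys schema.

```python
# Snapshots copied from the two queries above (bytes, or counts for *_count_used).
before = {"current_count_used": 9, "current_allocated": 26266, "total_allocated": 30311}
after = {"current_count_used": 13, "current_allocated": 24430, "total_allocated": 95719}


def snapshot_delta(before, after):
    """Return after - before for every key present in both snapshots."""
    return {key: after[key] - before[key] for key in before if key in after}


delta = snapshot_delta(before, after)
# total_allocated only grows, so its delta is the memory allocated by the statement.
print(delta["total_allocated"])
```

current_allocated can go down between snapshots because memory is freed again, which is why total_allocated is the value to compare.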

3.2 Observing other costs of a running statement

The performance_schema.status_by_thread table exposes per-session status counters. After the statement has finished, query it like this:

greatsql> select * from performance_schema.status_by_thread where thread_id = 207;
...
|       207 | Created_tmp_disk_tables             | 0                        |
|       207 | Created_tmp_tables                  | 0                        |
...
|       207 | Handler_read_first                  | 1                        |
|       207 | Handler_read_key                    | 1                        |
|       207 | Handler_read_last                   | 0                        |
|       207 | Handler_read_next                   | 2844047                  |
|       207 | Handler_read_prev                   | 0                        |
|       207 | Handler_read_rnd                    | 0                        |
|       207 | Handler_read_rnd_next               | 0                        |
|       207 | Handler_rollback                    | 0                        |
|       207 | Handler_savepoint                   | 0                        |
|       207 | Handler_savepoint_rollback          | 0                        |
|       207 | Handler_update                      | 0                        |
|       207 | Handler_write                       | 0                        |
|       207 | Last_query_cost                     | 286802.914893            |
|       207 | Last_query_partial_plans            | 1                        |
...
|       207 | Select_full_join                    | 0                        |
|       207 | Select_full_range_join              | 0                        |
|       207 | Select_range                        | 0                        |
|       207 | Select_range_check                  | 0                        |
|       207 | Select_scan                         | 1                        |
|       207 | Slow_launch_threads                 | 0                        |
|       207 | Slow_queries                        | 1                        |
|       207 | Sort_merge_passes                   | 0                        |
|       207 | Sort_range                          | 0                        |
|       207 | Sort_rows                           | 10                       |
|       207 | Sort_scan                           | 1                        |
...

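Only some of the more important counters are listed above. They corroborate what the slow query log shows: no temporary table was created.

For reference, here is the slow query log entry for this statement: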

# Query_time: 0.585593  Lock_time: 0.000002 Rows_sent: 10  Rows_examined: 2844057 Thread_id: 110 Errno: 0 Killed: 0 Bytes_received: 115 Bytes_sent: 313 Read_first: 1 Read_last: 0 Read_key: 1 Read_next: 2844047 Read_prev: 0 Read_rnd: 0 Read_rnd_next: 0 Sort_merge_passes: 0 Sort_range_count: 0 Sort_rows: 10 Sort_scan_count: 1 Created_tmp_disk_tables: 0 Created_tmp_tables: 0 Start: 2023-07-06T10:06:01.438376+08:00 End: 2023-07-06T10:06:02.023969+08:00 Schema: employees Rows_affected: 0
# Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# InnoDB_trx_id: 0
# Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: Yes  Filesort_on_disk: No  Merge_passes: 0
#   InnoDB_IO_r_ops: 0  InnoDB_IO_r_bytes: 0  InnoDB_IO_r_wait: 0.000000
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 4281
use employees;
SET timestamp=1688609161;
select emp_no, sum(salary) as total_salary from salaries group by emp_no order by total_salary desc limit 10;

Created_tmp_disk_tables, Created_tmp_tables, Handler_read_next, Sort_rows and Sort_scan have the same values in both places, and Select_full_join and Select_scan agree with the Full_join: No and Full_scan: Yes flags.
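This cross-check can be automated. The sketch below parses an abridged copy of the slow query log header line and compares a few of its fields with counters copied from status_by_thread. The field-to-counter mapping is an assumption inferred from the names in the two outputs, and the function names are made up for illustration.

```python
import re

# Abridged copy of the first header line of the slow query log entry.
SLOW_LOG_LINE = (
    "# Query_time: 0.585593  Lock_time: 0.000002 Rows_sent: 10  "
    "Rows_examined: 2844057 Read_first: 1 Read_key: 1 Read_next: 2844047 "
    "Sort_scan_count: 1 Created_tmp_disk_tables: 0 Created_tmp_tables: 0"
)

# Counters copied from performance_schema.status_by_thread for thread 207.
STATUS = {
    "Handler_read_first": 1,
    "Handler_read_key": 1,
    "Handler_read_next": 2844047,
    "Sort_scan": 1,
    "Created_tmp_disk_tables": 0,
    "Created_tmp_tables": 0,
}

# Slow log field -> status variable (assumed mapping, inferred from the names).
FIELD_TO_STATUS = {
    "Read_first": "Handler_read_first",
    "Read_key": "Handler_read_key",
    "Read_next": "Handler_read_next",
    "Sort_scan_count": "Sort_scan",
    "Created_tmp_disk_tables": "Created_tmp_disk_tables",
    "Created_tmp_tables": "Created_tmp_tables",
}


def parse_slow_log_fields(line):
    """Split a 'Name: value Name: value ...' header line into a dict of strings."""
    return dict(re.findall(r"(\w+):\s+(\S+)", line))


def find_mismatches(fields, status):
    """Return {status_name: (slow_log_value, status_value)} where the two differ."""
    mismatches = {}
    for field, status_name in FIELD_TO_STATUS.items():
        if field in fields and status_name in status:
            pair = (int(fields[field]), status[status_name])
            if pair[0] != pair[1]:
                mismatches[status_name] = pair
    return mismatches


fields = parse_slow_log_fields(SLOW_LOG_LINE)
mismatches = find_mismatches(fields, STATUS)
print(mismatches or "all compared counters match")
```

Keep in mind that status_by_thread counters accumulate over the whole session, so the comparison only holds when the statement is the first one counted in that session or when you diff two snapshots.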

You can also check the I/O latency of the statement by querying before and after it runs and comparing the two results:

greatsql> select * from sys.io_by_thread_by_latency where thread_id = 207;
+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+
| user           | total | total_latency | min_latency | avg_latency | max_latency | thread_id | processlist_id |
+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+
| root@localhost |     7 | 75.39 us      | 5.84 us     | 10.77 us    | 22.12 us    |       207 |            110 |
+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+

...

greatsql> select * from sys.io_by_thread_by_latency where thread_id = 207;
+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+
| user           | total | total_latency | min_latency | avg_latency | max_latency | thread_id | processlist_id |
+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+
| root@localhost |     8 | 85.29 us      | 5.84 us     | 10.66 us    | 22.12 us    |       207 |            110 |
+----------------+-------+---------------+-------------+-------------+-------------+-----------+----------------+

So the I/O latency attributable to this statement is 85.29 - 75.39 = 9.9 us.
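The sys views return latencies as formatted strings such as "75.39 us", so they have to be converted before they can be subtracted. A minimal sketch follows; the set of unit suffixes handled is an assumption based on the output shown here (longer units are left out), and latency_to_us is a hypothetical helper name.

```python
from decimal import Decimal

# Multipliers that convert a formatted value to microseconds.
# Only sub-minute units are handled in this sketch.
UNIT_TO_US = {
    "ps": Decimal("0.000001"),
    "ns": Decimal("0.001"),
    "us": Decimal("1"),
    "ms": Decimal("1000"),
    "s": Decimal("1000000"),
}


def latency_to_us(text):
    """Convert a string like '75.39 us' to a Decimal number of microseconds."""
    value, unit = text.split()
    return Decimal(value) * UNIT_TO_US[unit]


before = latency_to_us("75.39 us")
after = latency_to_us("85.29 us")
# Decimal avoids the binary floating point noise that 85.29 - 75.39 would show.
print(after - before)
```

The formatted values are already rounded, so the difference is approximate. For exact numbers, the x$ variants of the sys views return raw picosecond values instead.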

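3.3 Observing the progress of a running statement

Once a statement has finished, PROFILING can show how long each stage took. But what if you want to see the per-stage timing while it is still running?

Since MySQL 5.7, the performance_schema.events_stages_% tables show the stages a statement goes through and how long each one takes. A few settings have to be changed first: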

# Check whether monitoring is enabled for all hosts and users
greatsql> SELECT * FROM performance_schema.setup_actors;
+------+------+------+---------+---------+
| HOST | USER | ROLE | ENABLED | HISTORY |
+------+------+------+---------+---------+
| %    | %    | %    | NO      | NO      |
+------+------+------+---------+---------+

# Enable monitoring and history for all hosts and users
greatsql> UPDATE performance_schema.setup_actors
 SET ENABLED = 'YES', HISTORY = 'YES'
 WHERE HOST = '%' AND USER = '%';

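# Enable the statement and stage consumers in setup_consumers (the MySQL manual also enables the matching statement/stage instruments in setup_instruments)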
greatsql> UPDATE performance_schema.setup_consumers
 SET ENABLED = 'YES'
 WHERE NAME LIKE '%events_statements_%';

greatsql> UPDATE performance_schema.setup_consumers
 SET ENABLED = 'YES'
 WHERE NAME LIKE '%events_stages_%'; 

With these settings in place, a statement can be observed while it runs.

While the statement is running, look up its EVENT_ID from another window:

greatsql> SELECT EVENT_ID, TRUNCATE(TIMER_WAIT/1000000000000,6) as Duration, SQL_TEXT        FROM performance_schema.events_statements_history WHERE thread_id = 85 order by event_id desc limit 5;
+----------+----------+-------------------------------------------------------------------------------------------------------------------------------+
| EVENT_ID | Duration | SQL_TEXT                                                                                                                      |
+----------+----------+-------------------------------------------------------------------------------------------------------------------------------+
|   149845 |   0.6420 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 |
|   149803 |   0.6316 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 |
|   149782 |   0.6245 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 |
|   149761 |   0.6361 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 |
|   149740 |   0.6245 | select emp_no, sum(salary) as total_salary, sleep(0.000001) from salaries group by emp_no order by total_salary desc limit 10 |
+----------+----------+-------------------------------------------------------------------------------------------------------------------------------+

# Then use the EVENT_ID to query events_stages_history_long
greatsql> SELECT thread_id ,event_Id, event_name AS Stage, TRUNCATE(TIMER_WAIT/1000000000000,6) AS Duration  FROM performance_schema.events_stages_history_long WHERE NESTING_EVENT_ID = 149845 order by event_id;
+-----------+----------+------------------------------------------------+----------+
| thread_id | event_Id | Stage                                          | Duration |
+-----------+----------+------------------------------------------------+----------+
|        85 |   149846 | stage/sql/starting                             |   0.0000 |
|        85 |   149847 | stage/sql/Executing hook on transaction begin. |   0.0000 |
|        85 |   149848 | stage/sql/starting                             |   0.0000 |
|        85 |   149849 | stage/sql/checking permissions                 |   0.0000 |
|        85 |   149850 | stage/sql/Opening tables                       |   0.0000 |
|        85 |   149851 | stage/sql/init                                 |   0.0000 |
|        85 |   149852 | stage/sql/System lock                          |   0.0000 |
|        85 |   149854 | stage/sql/optimizing                           |   0.0000 |
|        85 |   149855 | stage/sql/statistics                           |   0.0000 |
|        85 |   149856 | stage/sql/preparing                            |   0.0000 |
|        85 |   149857 | stage/sql/Creating tmp table                   |   0.0000 |
|        85 |   149858 | stage/sql/executing                            |   0.6257 |
|        85 |   149859 | stage/sql/end                                  |   0.0000 |
|        85 |   149860 | stage/sql/query end                            |   0.0000 |
|        85 |   149861 | stage/sql/waiting for handler commit           |   0.0000 |
|        85 |   149862 | stage/sql/closing tables                       |   0.0000 |
|        85 |   149863 | stage/sql/freeing items                        |   0.0000 |
|        85 |   149864 | stage/sql/logging slow query                   |   0.0000 |
|        85 |   149865 | stage/sql/cleaning up                          |   0.0000 |
+-----------+----------+------------------------------------------------+----------+

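This output shows the stages the statement went through and how long each one took, the same information PROFILING gives. Once you know which stages a statement has to go through, you can use this output to estimate roughly how much longer it will run and decide whether to kill it early.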
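When there are many stage rows, it helps to total them per stage name and pick out the one that dominates. The sketch below does this for a handful of rows copied from the output above (the 0.0000 rows are abridged to a few representatives); the helper names are made up for illustration.

```python
from collections import defaultdict

# (stage name, duration in seconds) rows copied from events_stages_history_long.
ROWS = [
    ("stage/sql/starting", 0.0),
    ("stage/sql/starting", 0.0),
    ("stage/sql/Opening tables", 0.0),
    ("stage/sql/optimizing", 0.0),
    ("stage/sql/Creating tmp table", 0.0),
    ("stage/sql/executing", 0.6257),
    ("stage/sql/logging slow query", 0.0),
    ("stage/sql/cleaning up", 0.0),
]


def total_by_stage(rows):
    """Sum durations per stage name, since a stage can appear more than once."""
    totals = defaultdict(float)
    for name, duration in rows:
        totals[name] += duration
    return dict(totals)


def slowest_stage(totals):
    """Return the (name, seconds) pair with the largest total duration."""
    return max(totals.items(), key=lambda item: item[1])


totals = total_by_stage(ROWS)
name, seconds = slowest_stage(totals)
print(name, seconds)
```

For this statement practically all of the time is spent in stage/sql/executing, which matches the full scan plus aggregation seen in the execution plan.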

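To watch the progress of a DDL statement, see the article "不用MariaDB/Percona也能查看DDL的进度" (Viewing DDL progress without MariaDB/Percona).

There are more metrics and dimensions worth exploring; I will write about them when I get the chance.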

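Further reading
Query Profiling Using Performance Schema, https://dev.mysql.com/doc/refman/8.0/en/performance-schema-query-profiling.html
不用MariaDB/Percona也能查看DDL的进度 (Viewing DDL progress without MariaDB/Percona)
事件记录 | performance_schema全方位介绍 (Event records, from a series introducing performance_schema)
内存分配统计视图 | 全方位认识 sys 系统库 (Memory allocation views, from a series on the sys schema)

End of article :)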

With GIPKs, DBAs and developers no longer need to battle

GIPKs solve some long-standing problems

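1. What GIPKs are

MySQL 8.0.30 introduced a feature called GIPKs, short for Generated Invisible Primary Keys. With GIPKs enabled, when a new InnoDB table is created without an explicit primary key, the server automatically adds a primary key column named my_row_id, defined as bigint unsigned NOT NULL AUTO_INCREMENT and marked INVISIBLE.

2. What GIPKs are good for

Some MySQL branches implemented this long ago, and the need for it has been pressing. Official support is late but welcome, and it solves a couple of long-standing problems:
1. DBAs no longer have to argue with developers about every table having an explicit auto-increment primary key. Of course, when someone insists on an explicit non-auto-increment primary key (a UUID/VARCHAR column, for example), the DBA still cannot do much about it.
2. In an MGR setup, it is no longer necessary to require an explicitly defined primary key on every InnoDB table.

Both cases benefit from GIPKs, because the invisible my_row_id primary key column is created automatically.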

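One small inconvenience: with GIPKs enabled, the column name my_row_id is effectively reserved, so a table that gets a generated key cannot also define its own column with that name. In the example below the server rejects the statement; the message it prints points at the existing AUTO_INCREMENT column id: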

greatsql> create table t2(
id bigint unsigned not null auto_increment,
my_row_id int NOT NULL);
ERROR 4109 (HY000): Failed to generate invisible primary key. Auto-increment column already exists.

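Note that with traditional replication or MGR, the GIPKs setting itself is not replicated; it only affects the node where it is set. This does not interfere with replication: a table created on the primary without an explicit primary key replicates to the replicas normally, provided binlog_format = row is used. MGR requires the row format anyway.

mysqldump also gained a new option, --skip-generated-invisible-primary-key, which controls whether the generated invisible primary key columns are left out of the dump.

3. GIPKs in practice

The following examples, run in an MGR environment, show how GIPKs behave: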

# GreatSQL 8.0.32-24 is used here
greatsql> \s
..
Server version:     8.0.32-24 GreatSQL, Release 24, Revision 3714067bc8c
...

# Tested in an MGR environment
greatsql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+----------------------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST   | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION | MEMBER_COMMUNICATION_STACK |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+----------------------------+
| group_replication_applier | 2adec6d2-febb-11ed-baca-d08e7908bcb1 | 192.168.5.160 |        3307 | ONLINE       | SECONDARY   | 8.0.32         | XCom                       |
| group_replication_applier | 2f68fee2-febb-11ed-b51e-d08e7908bcb1 | 192.168.5.160 |        3308 | ONLINE       | ARBITRATOR  | 8.0.32         | XCom                       |
| group_replication_applier | 5e34a5e2-feb6-11ed-b288-d08e7908bcb1 | 192.168.5.160 |        3306 | ONLINE       | PRIMARY     | 8.0.32         | XCom                       |
+---------------------------+--------------------------------------+---------------+-------------+--------------+-------------+----------------+----------------------------+

# Confirm that GIPKs are enabled
greatsql> show variables like 'sql_generate_invisible_primary_key';
+------------------------------------+-------+
| Variable_name                      | Value |
+------------------------------------+-------+
| sql_generate_invisible_primary_key | ON    |
+------------------------------------+-------+

# Create a table without an explicit primary key
greatsql> create table t1 ( id int not null, c1 varchar(10) not null, unique key(id));

greatsql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `my_row_id` bigint unsigned NOT NULL AUTO_INCREMENT /*!80023 INVISIBLE */,
  `id` int NOT NULL,
  `c1` varchar(10) NOT NULL,
  PRIMARY KEY (`my_row_id`),
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

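The table was created with a unique index on a NOT NULL column (which InnoDB could have used as the clustered index), but because no primary key was specified explicitly, an invisible primary key column my_row_id was still added. The column stays invisible unless its visibility is changed manually.

# Even SELECT * does not return the invisible primary key column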
greatsql> select * from t1;
+----+----+
| id | c1 |
+----+----+
|  1 | c1 |
|  2 | c2 |
+----+----+

# ...unless the invisible primary key column is made visible
greatsql> alter table t1 alter column my_row_id set visible;

# Now the column shows up
greatsql> select * from t1;
+-----------+----+----+
| my_row_id | id | c1 |
+-----------+----+----+
|         1 |  1 | c1 |
|         2 |  2 | c2 |
+-----------+----+----+

# Check the table definition again
greatsql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `my_row_id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `id` int NOT NULL,
  `c1` varchar(10) NOT NULL,
  PRIMARY KEY (`my_row_id`),
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

# It can be made invisible again
greatsql> alter table t1 alter column my_row_id set invisible;

greatsql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `my_row_id` bigint unsigned NOT NULL AUTO_INCREMENT /*!80023 INVISIBLE */,
  `id` int NOT NULL,
  `c1` varchar(10) NOT NULL,
  PRIMARY KEY (`my_row_id`),
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

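If you no longer want this generated column as the primary key, a statement like the following replaces it:

# Drop the generated column and the primary key, and add a user-defined primary key column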
greatsql> alter table t1 drop column my_row_id, drop primary key, add aid bigint unsigned not null auto_increment primary key first;

# Check the table definition and the data again
greatsql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `aid` bigint unsigned NOT NULL AUTO_INCREMENT,
  `id` int NOT NULL,
  `c1` varchar(10) NOT NULL,
  PRIMARY KEY (`aid`),
  UNIQUE KEY `id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

greatsql> select * from t1;
+-----+----+----+
| aid | id | c1 |
+-----+----+----+
|   1 |  1 | c1 |
|   2 |  2 | c2 |
+-----+----+----+

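As you can see, GIPKs are flexible and practical.

P.S. The recently released GreatSQL 8.0.32-24 already includes this feature, so it can be used with confidence.

End of article.

Enjoy MySQL :)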

Further reading
– Generated Invisible Primary Keys, https://dev.mysql.com/doc/refman/8.0/en/create-table-gipks.html
– Changes in GreatSQL 8.0.32-24, https://gitee.com/GreatSQL/GreatSQL-Doc/blob/master/relnotes/greatsql-803224/changes-greatsql-8-0-32-24-20230605.md

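Learning MySQL debugging and tracing from scratch (2)

The previous article described how to set up a gdb debugging environment. This one shows how to use error log information to narrow down the likely cause of a problem, and how to look for clues in a coredump file.

1. Enabling coredump

A program can terminate abnormally or crash while running. When that happens the OS can save the memory state of the process to a core file; this is called a coredump, and gdb can load the core file for debugging.

The stack of the failing thread preserved in the core file helps developers locate a problem faster. If a failure occurs intermittently, consider enabling coredump for a while.

Enabling coredump requires a few OS-level settings first: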

$ ulimit -c unlimited
$ sysctl -w fs.suid_dumpable=2
$ echo "core.%e.%p.%t" > /proc/sys/kernel/core_pattern

Persist these changes in the corresponding files as well (assuming the MySQL/GreatSQL server process runs as the user mysql):

$ echo "mysql  -  core   unlimited" >> /etc/security/limits.conf
$ echo "fs.suid_dumpable=2" >> /etc/sysctl.conf
$ echo "kernel.core_pattern=core.%e.%p.%t" >> /etc/sysctl.conf
$ sysctl -p

Next, add the following two lines to the [mysqld] section of the my.cnf configuration file:

core_file
innodb_buffer_pool_in_core_file=OFF

Then restart the GreatSQL server process for the settings to take effect, and verify them:

mysql> show global variables like '%core%';
+---------------------------------+-------+
| Variable_name                   | Value |
+---------------------------------+-------+
| core_file                       | ON    |
| innodb_buffer_pool_in_core_file | OFF   |
+---------------------------------+-------+

With this in place, a core file is written to the datadir directory when the server crashes.
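The name of the core file follows the kernel.core_pattern value persisted in /etc/sysctl.conf, core.%e.%p.%t, where %e is the executable name, %p the process ID and %t the dump time as a Unix timestamp. The sketch below mimics that expansion for these three specifiers only (the kernel supports more, see core(5)); expand_core_pattern is a hypothetical helper, and the sample values are the ones that appear in the directory listing later in this article.

```python
def expand_core_pattern(pattern, executable, pid, timestamp):
    """Expand the %e, %p, %t and %% specifiers of a kernel.core_pattern value."""
    values = {"e": executable, "p": str(pid), "t": str(timestamp), "%": "%"}
    parts = []
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "%" and i + 1 < len(pattern) and pattern[i + 1] in values:
            # Replace the two-character specifier with its value.
            parts.append(values[pattern[i + 1]])
            i += 2
        else:
            # Copy everything else through unchanged.
            parts.append(char)
            i += 1
    return "".join(parts)


name = expand_core_pattern("core.%e.%p.%t", "mysqld-debug", 2658134, 1676903816)
print(name)
```

Including the PID and timestamp in the pattern keeps repeated crashes from overwriting each other's core files.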

2. Producing a coredump on purpose

Sending the SIGSEGV (11) signal to the mysqld process simulates a crash, for example:

$ kill -s SIGSEGV `pidof mysqld`

The GreatSQL error log and the core file now show that a coredump happened:

$ ls -la
...
-rw-------   1 mysql mysql 1081147392 Feb 20 22:36 core.mysqld-debug.2658134.1676903816
...

$ less error.log
...
14:36:56 UTC - mysqld got signal 11 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.

Build ID: 1f4232b893100742b7c519df2fa714648c2d76d9
Server Version: 8.0.25-16-debug Source distribution

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x80000
/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug(my_print_stacktrace(unsigned char const*, unsigned long)+0x43) [0x4b04d26]
/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug(handle_fatal_signal+0x2cb) [0x39a7d22]
/lib64/libpthread.so.0(+0x12c20) [0x7fc3e669ac20]
/lib64/libc.so.6(__poll+0x51) [0x7fc3e45c4a41]
/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug(Mysqld_socket_listener::listen_for_connection_event()+0x57) [0x3995195]
/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug(Connection_acceptor<Mysqld_socket_listener>::connection_event_loop()+0x30) [0x355a024]
/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug(mysqld_main(int, char**)+0x27d2) [0x354e4a6]
/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug(main+0x20) [0x32de906]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x7fc3e44f6493]
/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug(_start+0x2e) [0x32de82e]
Please help us make Percona Server better by reporting any
bugs at https://bugs.percona.com/
...

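If you work on the front line and need to ask developers for support or report a failure, first read the article "MySQL报障之coredump收集处理流程" (the coredump collection procedure for MySQL failure reports). The following information should be collected:
– The error log from the time of the failure.
– The core file produced by the failure.
– The general log, if enabled (about one hour or 100,000 lines leading up to the failure).
– The DDL of the tables involved in the crash and the SQL statements that triggered it; if necessary, real data (or sample data) as well.

3. Tracing a real failure

In GreatSQL 8.0.25-15 (the previous release), the InnoDB parallel query feature had a bug that caused a crash in a specific scenario. The log looked like this: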

mysqld-debug: /opt/greatsql-8.0.25/sql/item.cc:6047: virtual void Item_field::make_field(Send_field*): Assertion `item_name.is_set()' failed.
01:59:20 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.

Build ID: 1f4232b893100742b7c519df2fa714648c2d76d9
Server Version: 8.0.25-debug Source distribution

Thread pointer: 0x7fb4a9a0b000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fb4f7aa53b0 thread_stack 0x80000
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(my_print_stacktrace(unsigned char const*, unsigned long)+0x43) [0x4b04d26]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(handle_fatal_signal+0x2cb) [0x39a7d22]
/lib64/libpthread.so.0(+0x12c20) [0x7fb5146cac20]
/lib64/libc.so.6(gsignal+0x10f) [0x7fb51253a37f]
/lib64/libc.so.6(abort+0x127) [0x7fb512524db5]
/lib64/libc.so.6(+0x21c89) [0x7fb512524c89]
/lib64/libc.so.6(+0x2fa76) [0x7fb512532a76]  #<-- this frame and everything above it is error handling
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(Item_field::make_field(Send_field*)+0x9e) [0x3338758]  #<-- the real failure location starts here; note the address "0x3338758"
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(THD::send_result_metadata(mem_root_deque<Item*> const&, unsigned int)+0x19d) [0x36977ab]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(Query_result_send::send_result_set_metadata(THD*, mem_root_deque<Item*> const&, unsigned int)+0x2d) [0x35f3ff9]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(Query_expression::ExecuteIteratorQuery(THD*)+0x1f1) [0x38d057b]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(Query_expression::execute(THD*)+0xed) [0x38d0d7d]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(Sql_cmd_dml::execute_inner(THD*)+0x1c1) [0x381db25]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(Sql_cmd_dml::execute(THD*)+0x5c7) [0x381cfab]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(mysql_execute_command(THD*, bool)+0x565c) [0x37a1a2b]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(dispatch_sql_command(THD*, Parser_state*, bool)+0x769) [0x37a3a1d]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x1491) [0x3799819]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug(do_command(THD*)+0x51c) [0x3797c48]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug() [0x3991168]
/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug() [0x52e4b22]
/lib64/libpthread.so.0(+0x817a) [0x7fb5146c017a]
/lib64/libc.so.6(clone+0x43) [0x7fb5125ffdc3]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7fb4a9a65028): SELECT  ...  FROM t1 WHERE ...  #<-- the SQL statement that triggered the bug
Connection ID (thread ID): 8
Status: NOT_KILLED

Following the procedure above, we collected all the relevant information and were able to reproduce the failure in a test environment.
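Picking the interesting frame out of a long backtrace can be scripted. The sketch below parses a few frames copied from the error log above and returns the first mysqld frame that follows handle_fatal_signal and the libc/libpthread frames. That rule is a heuristic assumed for this illustration, not something the server defines, and the function names are made up.

```python
import re

BINARY = "/usr/local/GreatSQL-8.0.25-Linux-glibc2.28-x86_64/bin/mysqld-debug"

# A subset of the frames from the error log, with the annotations removed.
LOG_LINES = [
    BINARY + "(handle_fatal_signal+0x2cb) [0x39a7d22]",
    "/lib64/libpthread.so.0(+0x12c20) [0x7fb5146cac20]",
    "/lib64/libc.so.6(abort+0x127) [0x7fb512524db5]",
    "/lib64/libc.so.6(+0x2fa76) [0x7fb512532a76]",
    BINARY + "(Item_field::make_field(Send_field*)+0x9e) [0x3338758]",
    BINARY + "(THD::send_result_metadata(mem_root_deque<Item*> const&, unsigned int)+0x19d) [0x36977ab]",
]

# binary(symbol+offset) [address]
FRAME_RE = re.compile(
    r"^(?P<binary>[^(\s]+)\((?P<symbol>.*)\)\s+\[(?P<address>0x[0-9a-f]+)\]"
)


def parse_frame(line):
    """Parse one backtrace line into a dict, or return None if it is not a frame."""
    match = FRAME_RE.match(line.strip())
    if match is None:
        return None
    function, _, offset = match.group("symbol").rpartition("+")
    return {
        "binary": match.group("binary"),
        "function": function,
        "offset": offset,
        "address": match.group("address"),
    }


def first_server_frame(frames, binary_suffix="mysqld-debug"):
    """Return the first server frame after handle_fatal_signal (heuristic)."""
    seen_handler = False
    for frame in frames:
        if "handle_fatal_signal" in frame["function"]:
            seen_handler = True
            continue
        if seen_handler and frame["binary"].endswith(binary_suffix):
            return frame
    return None


frames = [f for f in (parse_frame(line) for line in LOG_LINES) if f is not None]
frame = first_server_frame(frames)
print(frame["function"], frame["address"])
```

The address found this way is the value that gets passed to gdb as a breakpoint location, prefixed with "*".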

Next, use gdb to locate and analyze the cause:

$ gdb path/bin/mysqld-debug path/core.mysqld-debug.2657287.1657270311
GNU gdb (GDB) Red Hat Enterprise Linux 9.2-4.el8
...
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./bin/mysqld-debug...
...
[New LWP 2675795]
[New LWP 2675825]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `./bin/mysqld-debug --defaults-extra-file=./my.cnf'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
56        return (INTERNAL_SYSCALL_ERROR_P (val, err)
[Current thread is 1 (Thread 0x7fb4f7aa7700 (LWP 2676055))]
(gdb)
(gdb) b *0x3338758  #<-- the address noted above, prefixed with "*", sets a breakpoint there
Breakpoint 1 at 0x3338758: file /opt/greatsql-8.0.25/sql/item.cc, line 6048.  #<-- the source location that likely triggered the problem
(gdb)
(gdb) bt  #<-- print the detailed backtrace
#0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1  0x0000000004b04f1d in my_write_core (sig=6) at /opt/greatsql-8.0.25/mysys/stacktrace.cc:409
#2  0x00000000039a7f84 in handle_fatal_signal (sig=6) at /opt/greatsql-8.0.25/sql/signal_handler.cc:199
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5  0x00007fb512524db5 in __GI_abort () at abort.c:79
#6  0x00007fb512524c89 in __assert_fail_base (fmt=0x7fb51268d698 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x57a1835 "item_name.is_set()",
    file=0x57a1400 "/opt/greatsql-8.0.25/sql/item.cc", line=6047, function=<optimized out>) at assert.c:92
#7  0x00007fb512532a76 in __GI___assert_fail (assertion=0x57a1835 "item_name.is_set()", file=0x57a1400 "/opt/greatsql-8.0.25/sql/item.cc", line=6047,
    function=0x57a3a40 "virtual void Item_field::make_field(Send_field*)") at assert.c:101
#8  0x0000000003338758 in Item_field::make_field (this=0x7fb4a9b5bcf8, tmp_field=0x7fb4f7aa2380) at /opt/greatsql-8.0.25/sql/item.cc:6047
#9  0x00000000036977ab in THD::send_result_metadata (this=0x7fb4a9a0b000, list=..., flags=5) at /opt/greatsql-8.0.25/sql/sql_class.cc:2824
#10 0x00000000035f3ff9 in Query_result_send::send_result_set_metadata (this=0x7fb4a9a0fda0, thd=0x7fb4a9a0b000, list=..., flags=5)
    at /opt/greatsql-8.0.25/sql/query_result.cc:76
#11 0x00000000038d057b in Query_expression::ExecuteIteratorQuery (this=0x7fb4a9a65178, thd=0x7fb4a9a0b000) at /opt/greatsql-8.0.25/sql/sql_union.cc:1150
#12 0x00000000038d0d7d in Query_expression::execute (this=0x7fb4a9a65178, thd=0x7fb4a9a0b000) at /opt/greatsql-8.0.25/sql/sql_union.cc:1321
#13 0x000000000381db25 in Sql_cmd_dml::execute_inner (this=0x7fb4a9a0fd68, thd=0x7fb4a9a0b000) at /opt/greatsql-8.0.25/sql/sql_select.cc:814
#14 0x000000000381cfab in Sql_cmd_dml::execute (this=0x7fb4a9a0fd68, thd=0x7fb4a9a0b000) at /opt/greatsql-8.0.25/sql/sql_select.cc:585
#15 0x00000000037a1a2b in mysql_execute_command (thd=0x7fb4a9a0b000, first_level=true) at /opt/greatsql-8.0.25/sql/sql_parse.cc:4684
#16 0x00000000037a3a1d in dispatch_sql_command (thd=0x7fb4a9a0b000, parser_state=0x7fb4f7aa41d0, update_userstat=false)
    at /opt/greatsql-8.0.25/sql/sql_parse.cc:5284
#17 0x0000000003799819 in dispatch_command (thd=0x7fb4a9a0b000, com_data=0x7fb4f7aa5370, command=COM_QUERY) at /opt/greatsql-8.0.25/sql/sql_parse.cc:1940
#18 0x0000000003797c48 in do_command (thd=0x7fb4a9a0b000) at /opt/greatsql-8.0.25/sql/sql_parse.cc:1388
#19 0x0000000003991168 in handle_connection (arg=0x7fb4ba094500) at /opt/greatsql-8.0.25/sql/conn_handler/connection_handler_per_thread.cc:307
#20 0x00000000052e4b22 in pfs_spawn_thread (arg=0x7fb511e44320) at /opt/greatsql-8.0.25/storage/perfschema/pfs.cc:2899
#21 0x00007fb5146c017a in start_thread (arg=<optimized out>) at pthread_create.c:479
#22 0x00007fb5125ffdc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

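With this information, developers can track down the root cause much more easily.

This article gave a short demonstration of using a core file to trace and analyze the likely cause of a crash. There are more interesting and practical techniques to discover; let's keep exploring.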


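Learning MySQL debugging and tracing from scratch (1)

Tracking down a failure sometimes requires debugging the MySQL/GreatSQL source code. This article describes how to build a source-level debugging environment for MySQL/GreatSQL on Linux.

I was a beginner at this too, so let's start from scratch together.

The examples use GreatSQL 8.0.25-16 on CentOS 8.x.

1. Building GreatSQL

Check the system environment: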

$ cat /etc/system-release

CentOS Linux release 8.4.2105

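First, download the GreatSQL 8.0.25-16 source package from https://gitee.com/GreatSQL/GreatSQL/releases/

Source Code
Package                     Size
greatsql-8.0.25-16.tar.gz   503M

Next, set up the build environment as described in the article "在Linux下源码编译安装GreatSQL" (building and installing GreatSQL from source on Linux). Then compile the GreatSQL source with the debug-related options added or changed, so that the resulting binary supports debugging, for example: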

$ cd /opt/greatsql-8.0.25-16
$ mkdir -p build
$ cd build
$ cmake3 .. \
-DBOOST_INCLUDE_DIR=/opt/boost_73_0 \
-DLOCAL_BOOST_DIR=/opt/boost_73_0 \
-DCMAKE_INSTALL_PREFIX=/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64 \
-DWITH_ZLIB=bundled \
-DWITH_NUMA=ON \
-DCMAKE_EXE_LINKER_FLAGS="-ljemalloc" \
-DBUILD_CONFIG=mysql_release \
-DWITH_TOKUDB=OFF \
-DWITH_ROCKSDB=OFF \
-DMAJOR_VERSION=8 \
-DMINOR_VERSION=0 \
-DPATCH_VERSION=25 \
-DWITH_UNIT_TESTS=OFF \
-DWITH_NDBCLUSTER=OFF \
-DWITH_SSL=system \
-DWITH_SYSTEMD=ON \
-DWITH_LDAP=OFF \
-DWITH_AUTHENTICATION_LDAP=OFF \
-DWITH_DEBUG=1 \
-DCMAKE_BUILD_TYPE=Debug \
&& make -j8 VERBOSE=1 && make install

The key is to add the two options -DWITH_DEBUG=1 and -DCMAKE_BUILD_TYPE=Debug. Make sure -DCMAKE_BUILD_TYPE=RelWithDebInfo is not present.

When the build finishes you have a GreatSQL binary with debug support. Check it with:

$ cd /usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64
$ ./bin/mysqld-debug --verbose --version

/usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug  Ver 8.0.25-16-debug for Linux on x86_64 (Source distribution)

The output contains the keyword debug, which means the build succeeded.

2. Installing gdb

Install gdb with yum:

$ yum install -y gdb
$ gdb --version
GNU gdb (GDB) Red Hat Enterprise Linux 9.2-4.el8
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

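The most commonly used gdb debugging commands are:

Command          Abbrev.  Notes
attach                    Attach to the process (by pid) to be debugged
detach                    Detach from the attached process
list             l        Show several lines of source code
break            b        Set a breakpoint; the program stops when it reaches it
info             i        Describe the state of the program
run              r        Start running the program
display          disp     Track a variable and show its value every time the program stops
step             s        Execute the next statement; if it is a function call, step into the function and stop at its first statement
next             n        Execute the next statement; if it is a function call, run it without stepping into it
print            p        Print the value of a variable
continue         c        Resume execution until the next breakpoint
set var name=v            Set the value of a variable
start            st       Start the program and stop before the first statement of main
file                      Load the program to be debugged
kill             k        Terminate the program being debugged
watch                     Watch a variable for changes
backtrace        bt       Show the function call stack
frame            f        Show a stack frame
quit             q        Exit gdb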

3. Debugging the GreatSQL source

The first time you run gdb for debugging, you may see messages like the following:

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
0x00007ffb358ada41 in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install keyutils-libs-1.5.10-9.el8.x86_64 ...

This means some debuginfo packages are missing. Install them as the message suggests, for example:

dnf debuginfo-install keyutils-libs-1.5.10-9.el8.x86_64 ...

If the packages cannot be found:

Could not find debuginfo package for the following installed packages: keyutils-libs-1.5.10-9.el8.x86_64 ...

In that case, check the yum configuration file /etc/yum.repos.d/CentOS-Linux-Debuginfo.repo and confirm that enabled=1 is set, for example:

# CentOS-Linux-Debuginfo.repo
#
# All debug packages are merged into a single repo, split by basearch, and are
# not signed.

[debuginfo]
name=CentOS Linux $releasever - Debuginfo
baseurl=http://debuginfo.centos.org/$releasever/$basearch/
gpgcheck=1
enabled=1    #<--- this must be set to 1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-centosofficial

In addition, extract the GreatSQL 8.0.25-16 source package into /opt:

$ tar zxf PATH/greatsql-8.0.25-16.tar.gz -C /opt/

Next, a walkthrough of tracing and debugging.

First initialize the GreatSQL data files, then start the GreatSQL server process:

# Initialize GreatSQL (same datadir as the one used to start the server below)
$ cd /usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64
$ ./bin/mysqld-debug --no-defaults --initialize-insecure --user=mysql --datadir=./data1

# Start GreatSQL
$ ./bin/mysqld-debug --no-defaults --user=mysql --datadir=./data1 &

# Find the process pid
$ ps -ef | grep mysqld
...
mysql    2644322 2542442  3 14:38 pts/7    00:00:01 ./bin/mysqld-debug --no-defaults --user=mysql --datadir=./data1

# In another terminal (terminal #2), connect to GreatSQL
$ mysql -S/tmp/mysql.sock
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 7
Server version: 8.0.25-16-debug Source distribution
...
mysql>\s
...
Server version:         8.0.25-16-debug Source distribution
...

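Now start gdb and trace GreatSQL. Several different approaches are shown below.

3.1 Setting breakpoints with gdb

In terminal #1, start gdb and attach to the GreatSQL process to begin tracing: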

$ gdb -p 2644322
GNU gdb (GDB) Red Hat Enterprise Linux 9.2-4.el8
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...
Attaching to process 2644322
[New LWP 2643482]
[New LWP 2643483]
[New LWP 2643484]
...
[New LWP 2643522]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--  #<-- press Enter here to continue
0x00007fb7ae93ba41 in __GI___poll (fds=0x7fb7ae229140, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29        return SYSCALL_CANCEL (poll, fds, nfds, timeout);
(gdb)
(gdb) p mysql_sysvar_version  #<-- print a variable to check the version (the output is the innodb_version system variable)
$1 = {flags = 68101, name = 0x7f10d1c6cc90 "innodb_version", comment = 0x6c47f92 "InnoDB version", check = 0x37dd9e2
     <check_func_str(THD*, SYS_VAR*, void*, st_mysql_value*)>, update = 0x37ddeb0 <update_func_str(THD*, SYS_VAR*, void*, void const*)>,
  value = 0x7e7c768 <innodb_version_str>, def_val = 0x6c38440 "8.0.25-15"}
(gdb) 
(gdb) 
(gdb) b mysql_execute_command  #<--- "b <function>" sets a breakpoint: GreatSQL stops when execution reaches that function (Breakpoint 1, set earlier the same way, is on dispatch_command)
Breakpoint 3 at 0x379c3f2: file /opt/greatsql-8.0.25-16/sql/sql_parse.cc, line 2875.
(gdb)

Switch to terminal #2 and run any SQL statement:

mysql> select 'debug' from dual;

Back in terminal #1, continue debugging:

(gdb) 
(gdb) bt  #<-- print the function call stack
#0  dispatch_command (thd=0x7f10a3a0b000, com_data=0x7f10d12a7370, command=COM_QUERY) at /opt/greatsql-8.0.25-16/sql/sql_parse.cc:1605
#1  0x0000000003797c48 in do_command (thd=0x7f10a3a0b000) at /opt/greatsql-8.0.25-16/sql/sql_parse.cc:1388
#2  0x0000000003991168 in handle_connection (arg=0x7f10d1f9d120) at /opt/greatsql-8.0.25-16/sql/conn_handler/connection_handler_per_thread.cc:307
#3  0x00000000052e4b22 in pfs_spawn_thread (arg=0x7f10e8a45660) at /opt/greatsql-8.0.25-16/storage/perfschema/pfs.cc:2899
#4  0x00007f10eb1e917a in start_thread (arg=<optimized out>) at pthread_create.c:479
#5  0x00007f10e9128dc3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb)
(gdb) p thd->m_query_string  #<-- print the SQL statement
$14 = {str = 0x7f10a3a0e828 "select 'debug' from dual", length = 24}
(gdb)
(gdb) c  #<-- continue; the SQL statement blocked in terminal #2 can now run
Continuing.

Switch back to terminal #2 to see the result of the SQL statement:

mysql> select 'debug' from dual;
+-------+
| debug |
+-------+
| debug |
+-------+
1 row in set (12 min 11.55 sec)

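Because the statement was blocked the whole time, the request took more than 12 minutes. When the connection in terminal #2 disconnects and exits, gdb shows a corresponding message as well: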

Thread 39 "mysqld-debug" hit Breakpoint 1, dispatch_command (thd=0x7f10a3a0b000, com_data=0x7f10d12a7370, command=COM_QUIT)
    at /opt/greatsql-8.0.25-16/sql/sql_parse.cc:1605
1605      bool error = false;
(gdb)

When you are done tracing, enter q or quit to exit gdb.

(gdb) quit
A debugging session is active.

        Inferior 1 [process 2644322] will be detached.

Quit anyway? (y or n) y
Detaching from program: /usr/local/GreatSQL-8.0.25-16-Linux-glibc2.28-x86_64/bin/mysqld-debug, process 2644322
[Inferior 1 (process 2644322) detached]
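The interactive session above can also be run unattended, which keeps the time the server spends stopped at a breakpoint short. The sketch below writes a gdb command file that repeats the same steps (break on dispatch_command, print the stack and the SQL text, then detach). The function name write_gdb_script and the script path are hypothetical; the gdb commands themselves are the ones used in this section, plus set pagination off to avoid the paging prompt.

```shell
#!/bin/sh
# Hypothetical helper: write a gdb command file that breaks once in
# dispatch_command, prints the call stack and the SQL text, then detaches.
write_gdb_script() {
    script_path="$1"
    cat > "$script_path" <<'EOF'
set pagination off
break dispatch_command
continue
bt
print thd->m_query_string
detach
quit
EOF
}
```

A possible invocation, assuming the pid found earlier with ps: write_gdb_script /tmp/trace.gdb && gdb -p 2644322 -batch -x /tmp/trace.gdb. gdb then waits at continue until a client sends the next command to the server.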

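3.2 Debugging with a trace file

You can also set the debug variable to different values from a GreatSQL client. GreatSQL then writes the modules and functions it calls, status information, and other details of its execution to a local file. Usage example: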

mysql> SET SESSION debug = 'debug_options';

The debug variable accepts options in the following form:

debug_options = field_1:field_2:...:field_N
field = [+|-]flag[,modifier,modifier,...,modifier]

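+ and - add options to, or remove them from, the current debug value.

The available flags are:

flag  Description
d     Enable DBUG output
f     Trace only the named functions
F     Print the source file name on each line of output
i     Print the process ID or thread ID on each line of output
L     Print the source line number on each line of output
n     Print the function nesting depth
N     Number each line of output
o     Write output to the specified file
O     Like o, but the file is flushed (closed and reopened) on every write
p     Limit output to the named processes (matched against DBUG_PROCESS)
P     Print the process name on each line of output
t     Print function call and exit lines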
Example 1 (simple mode)

# Set the debug options
mysql> set session debug='d:t:o,/tmp/mysqld.trace';

# Run an SQL statement
mysql> select 'debug' from dual;
+-------+
| debug |
+-------+
| debug |
+-------+
1 row in set (0.00 sec)

View the generated trace file:

$ cat /tmp/mysqld.trace
...
>do_command
| >THD::clear_error
| <THD::clear_error
| >Diagnostics_area::reset_diagnostics_area
| <Diagnostics_area::reset_diagnostics_area
| >my_net_set_read_timeout
| | enter: timeout: 28800
| | >vio_socket_timeout
| | <vio_socket_timeout
| <my_net_set_read_timeout
| >vio_is_blocking
| <vio_is_blocking
| >net_read_raw_loop
| | >vio_read
| | | >vio_is_blocking
| | | <vio_is_blocking
| | | >vio_io_wait
| | | <vio_io_wait
| | <vio_read
| <net_read_raw_loop
| THD::enter_stage: 'starting' /opt/greatsql-8.0.25-16/sql/conn_handler/init_net_server_extension.cc:102
...

Example 2 (detailed mode)
This adds the file name and line number to each line, which makes it easier to locate things.

mysql> set session debug='d:t:L:F:o,/tmp/mysqld.trace';
mysql> select 'debug' from dual;
...

View the generated trace file:

$ cat /tmp/mysqld.trace
...
  sql_parse.cc: <do_command
  sql_parse.cc:  1269: >do_command
   sql_class.h:  3287: | >THD::clear_error
   sql_class.h: | <THD::clear_error
  sql_error.cc:   357: | >Diagnostics_area::reset_diagnostics_area
  sql_error.cc: | <Diagnostics_area::reset_diagnostics_area
   net_serv.cc:  2246: | >my_net_set_read_timeout
   net_serv.cc:  2247: | | enter: timeout: 28800
  viosocket.cc:   380: | | >vio_socket_timeout
  viosocket.cc: | | <vio_socket_timeout
   net_serv.cc: | <my_net_set_read_timeout
  viosocket.cc:   373: | >vio_is_blocking
  viosocket.cc: | <vio_is_blocking
   net_serv.cc:  1341: | >net_read_raw_loop
  viosocket.cc:   169: | | >vio_read
  viosocket.cc:   373: | | | >vio_is_blocking
  viosocket.cc: | | | <vio_is_blocking
  viosocket.cc:  1118: | | | >vio_io_wait
  viosocket.cc: | | | <vio_io_wait
  viosocket.cc: | | <vio_read
   net_serv.cc: | <net_read_raw_loop
  sql_parse.cc:   320: | THD::enter_stage: 'starting' /opt/greatsql-8.0.25-16/sql/conn_handler/init_net_server_extension.cc:102
...
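Trace files grow quickly, so a summary is often more useful than reading them line by line. The sketch below counts how many times each function was entered, based on the ">function" entry markers visible in both trace formats above. The function name trace_call_counts is hypothetical, and the pattern assumes the entry marker is the last token on the line, as in the samples shown here.

```shell
#!/bin/sh
# Hypothetical helper: count function entries (">name" markers) in a DBUG
# trace file and list them, most frequently entered first.
trace_call_counts() {
    awk '
        # An entry line ends with ">" followed by a (possibly qualified) name.
        match($0, />[A-Za-z_][A-Za-z0-9_:~]*$/) {
            name = substr($0, RSTART + 1, RLENGTH - 1)
            count[name]++
        }
        END { for (n in count) printf "%d %s\n", count[n], n }
    ' "$1" | sort -k1,1nr -k2,2
}
```

A possible call on the file produced above: trace_call_counts /tmp/mysqld.trace | head -n 20.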

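This article briefly demonstrated several ways to trace and debug GreatSQL. There are more interesting and practical methods still to be explored.

P.S. I also built a vscode-based debugging environment on macOS, but I still prefer working in a Linux terminal, so this article does not cover debugging with vscode. Interested readers can set one up by following other material.

Further reading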
– https://www.cnblogs.com/radondb/p/15702898.html
– https://developer.aliyun.com/article/727403
– https://mytechshares.com/2022/05/18/debug-mysql-with-vscode/

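『叶问』#41: Can a three-node MGR cluster keep working after two nodes go down?

Learn a little MGR every week.

1. Can a three-node MGR cluster keep working after two nodes go down?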

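It depends on the situation.

If the two nodes are shut down normally, they send a leave signal to the MGR cluster and exit the group cleanly. The one remaining node is promoted to Primary and keeps working, accepting both reads and writes, but with no availability redundancy left. Once the other nodes start up and rejoin the cluster, normal service is restored.

If instead the nodes are lost because of a network failure, or because the mysqld process hit OOM, was killed by mistake, or exited for some other reason, they are marked UNREACHABLE, and each is formally expelled from the cluster only after group_replication_member_expel_timeout (in seconds) has elapsed. In this case, once a majority of the nodes are UNREACHABLE, the whole cluster becomes unavailable and cannot serve reads or writes. To recover, the MGR service has to be restarted on the remaining nodes.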
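The majority rule behind the second case can be written down in a few lines. This is a minimal sketch that models unexpected failures only: a group of N members needs floor(N/2)+1 reachable members to keep a majority. It does not model clean shutdowns, where the group size itself shrinks. The function name mgr_has_quorum is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: does a group of $1 members still have a majority
# when only $2 of them are reachable? Models unexpected failures only.
mgr_has_quorum() {
    total="$1"
    reachable="$2"
    majority=$(( total / 2 + 1 ))
    if [ "$reachable" -ge "$majority" ]; then
        echo "quorum"
    else
        echo "no quorum"
        return 1
    fi
}
```

For a three-node group, mgr_has_quorum 3 2 reports that one failed node is tolerated, while mgr_has_quorum 3 1 reports that the single surviving node has lost the majority.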

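Under normal circumstances, do not set group_replication_member_expel_timeout too high, and try to avoid the AFTER transaction consistency level in MGR, so that the whole cluster does not become unavailable. For details, see the article "为什么MGR一致性模式不推荐AFTER" (why the AFTER consistency mode is not recommended for MGR).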

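2. Can MGR run with only two nodes, like source-replica replication?

MGR can be bootstrapped with only two nodes, or even just one, but that defeats the purpose of MGR. With fewer than three nodes there is no way to hold a majority vote, so when a network failure or similar occurs, the group cannot vote on which nodes should be expelled.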