MySQL MHA实现集群无故障迁移

830 查看

MySQL的主从复制功能(replication)可以配置一台机器为master,另外一台或者多台机器作为Slave,从master同步数据。整个过程主要包括以下几点
1. Master将update event写入bin-log
2. Master通过Binlog Dump线程将update event发给Slave
3. Slave通过IO Thread读取Master发送过来的update event并写入relay-log
4. Slave通过SQL Thread读取relay-log将update event应用到Slave

可以看到MySQL的replication是异步的,Master只负责将update event写入bin-log,至于Binlog Dump线程什么时候把event发送给Slave以及Slave什么时候把event应用在自己端,它其实是不会管的。所以极有可能update event已经在Master上生效,但是Slave上却没有的情况发生。

MySQL自5.5开始提供了半同步的复制(semisynchronous replication),客户端在发送update event给Master后,Master必须保证至少一个Slave已经接收到了update event才能将update commit,但也有一个例外,如果所有的Slave都已经死了或者网络不通,那么Master在经过一定的时间(Rpl_semi_sync_master_tx_wait_time)后发现还是没有任何一台Slave接收到了update event,那么会立即提交并且半同步复制退化为异步复制。

不管是异步复制还是半同步复制,MySQL集群都逃不了Master的单点问题,当然程序员的智慧是无穷的,有以下方案来实现MySQL的"HA"。

Master - Master

主主模式包含两台服务器,每一个都被设置成为对方的备库,它们是一对主库,如图

这样的配置不能对两台Master同时进行读操作,否则会引起冲突,假设同时执行下面两条语句

  • 在第一台Master上

    update tb1 set col = col + 1;
    
  • 在第二台Master上

    update tb1 set col = col * 2;
    

如此会引起两个Master数据不一致而没有任何的错误,所以对于主主模式一般将其中一台设置为只读,所有的写请求全部都写到一台Master上,只有在某台Master死机之后,写请求才到另外一台上。
这种模式的缺点
1. Master只允许死机一次,如果第二次死机,则不能实现故障转移
2. 如果Master的某些update event没有及时发到另一台Master上,则数据丢失
3. 只有两台机器,对于读请求大的应用存在瓶颈

Master - Backup Master - Multiple Slaves

这种方案使用一台机器作为Master,另外一台机器作为备用的Master,其他多台机作为Slave,备用Master是在现在的Master死机之后的替代,如图

这种方案跟以上的方案一样的存在数据丢失的问题,同时在Master死机的时候可能Slave的数据已经不一致了,直接把Slave的Master都切换到Backup Master存在风险

那么在Master死机的时候会存在哪些情况呢,下面具体分析一下
1. 所有的Slave都已经接收到了master的update event, 如下所示

这种情况是最幸运也是最简单的情况,任何一台Slave都可以提升为master,而且不用做任何的数据恢复, 假如将Slave 1提升为master,然后再slave 2和slave3上执行change master to ...就完成了master failover
2. 所有的slave都接收到了master的相同位置的update event,但是还有event在master端没有发送出来,如下所示

在这种情况下,如果直接将一台slave提升为master,那么master的id为102的event丢失。当然可以使用前面提到过的半同步复制来减少这样的问题的出现。其实还可以做的就是,如果master只是MySQL进程死掉了,而机器仍然可以访问,那么可以到master机器上读取bin-log,将id为102的event保存下来,并应用到要提升的slave上,然后再提升这台机器为master,如此就没有数据丢失了
3. slave间本身的就是不同步的,如下所示

这种情况下需要在最新的slave(slave 2)上拿到其它slave没有应用的event,然后在其它slave上执行,以保持所有slave的同步

以上说的所有的情况,MHA已经做了处理,下面来说说MHA。
MHA是一个日本的大神用Perl写的,代码托管在Google code,Google最近被封的厉害,你懂得

MHA分为两个角色,manager和node, manager是用来监控master的状态的,node是在master死机后做恢复的时候apply最新的event的

安装配置

环境说明:
一共四台机

hh1, MySQL Master

hh2, MySQL Slave

hh3, MySQL Slave

hh4, MHA Manager

MHA在master死机后会通过SSH的方式去读取master的bin-log,在同步slave的时候也会通过SSH的方式去读取slave的relay-log以比较两者的差异,所以需要配置四台机都能通过SSH互相访问

  • 在hh1, hh2, hh3, hh4上分别执行

    ssh-keygen -t rsa
    ssh-copy-id -i .ssh/id_rsa.pub hh1
    ssh-copy-id -i .ssh/id_rsa.pub hh2
    ssh-copy-id -i .ssh/id_rsa.pub hh3
    ssh-copy-id -i .ssh/id_rsa.pub hh4
    
  • 下载mha manager和node 在Google Driver

  • 在hh1,hh2, hh3上安装node

    apt-get install libdbd-mysql-perl
    dpkg -i mha4mysql-node_0.54-0_all.deb
    
  • 在hh4上安装manager

    apt-get install libdbd-mysql-perl
    apt-get install libconfig-tiny-perl
    apt-get install liblog-dispatch-perl
    apt-get install libparallel-forkmanager-perl
    dpkg -i mha4mysql-node_0.54-0_all.deb
    
    dpkg -i mha4mysql-manager_0.55-0_all.deb
    
  • 在hh4上创建配置文件

    vim /etc/app1.cnf
    

    在配置文件中加入如下

    [server default]
    user=root
    password=password
    ssh_user=root
    manager_workdir=/usr/share/bi/app1
    remote_workdir=/usr/share/bi/app1
    
    [server1]
    hostname=hh1
    
    [server2]
    hostname=hh2
    
    [server3]
    hostname=hh3
    
  • 检查SSH配置

    masterha_check_ssh --conf=/etc/app1.cnf
    
  • 启动Manager

    nohup masterha_manager --conf=/etc/app1.cnf < /dev/null > /usr/share/bi/app1/app1.log 2>&1 &
    

启动完成之后查看manager的日志/usr/share/bi/app1/app1.log,以下截取部分做分析

...
Sat Jun 28 01:55:52 2014 - [info] Dead Servers:
Sat Jun 28 01:55:52 2014 - [info] Alive Servers:
Sat Jun 28 01:55:52 2014 - [info]   hh1(10.3.40.102:3306)
Sat Jun 28 01:55:52 2014 - [info]   hh2(10.3.40.37:3306)
Sat Jun 28 01:55:52 2014 - [info]   hh3(10.3.40.71:3306)
...
Sat Jun 28 01:56:09 2014 - [info] Set master ping interval 3 seconds.
...
Sat Jun 28 01:56:09 2014 - [info] Starting ping health check on hh1(10.3.40.102:3306)..
Sat Jun 28 01:56:09 2014 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..

可以从日志中看到,Manager检测到了三个节点,并且它已经发现了master是hh1,然后开始在检查他是否一直活着了

现在我在hh1上手动停止MySQL服务

service mysql stop

再观察日志文件(以下截取部分)

Sat Jun 28 01:56:57 2014 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Sat Jun 28 01:56:57 2014 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/var/lib/mysql,/var/log/mysql --output_file=/usr/share/bi/app1/save_binary_logs_test --manager_version=0.55 --binlog_prefix=binlog
  Creating /usr/share/bi/app1 if not exists..    ok.
  Checking output directory is accessible or not..
   ok.
  Binlog found at /var/lib/mysql, up to binlog.000002
Sat Jun 28 01:56:59 2014 - [info] HealthCheck: SSH to hh1 is reachable.
Sat Jun 28 01:57:00 2014 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.3.40.102' (111))
Sat Jun 28 01:57:00 2014 - [warning] Connection failed 1 time(s)..
Sat Jun 28 01:57:03 2014 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.3.40.102' (111))
Sat Jun 28 01:57:03 2014 - [warning] Connection failed 2 time(s)..
Sat Jun 28 01:57:06 2014 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '10.3.40.102' (111))
Sat Jun 28 01:57:06 2014 - [warning] Connection failed 3 time(s)..
Sat Jun 28 01:57:06 2014 - [warning] Master is not reachable from health checker!
Sat Jun 28 01:57:06 2014 - [warning] Master hh1(10.3.40.102:3306) is not reachable!

从日志中可以看到,Manager发现hh1的MySQL不再响应了,同时它做了三次检查以确保MySQL服务确实已经死了,那么它发现master死了之后怎么办了,这个过程比较复杂,简单描述如下:

  • Configuration check: 重新检查一遍配置文件,因为master死机可能是在整个系统上线后很长时间才发生的,比如说在两个月后,那么在这段时间内很有可能MySQL集群的配置已经改变,所以需要重新检查一遍配置文件
  • Dead master shutdown: 强制关闭Manager认为已经死机的master

    • master_ip_failover_script 在关闭已经死机的master的时候会调用这个用户可自定义的脚本,可以用来做IP转移
  • Master Recovery: 新的master恢复阶段

    • Getting Latest (and oldest) Slaves: 检查所有的slave的状态,得到同步最新和最旧的slave
    • Saving Dead Master's Binlog(node): 如果死机的master还能通过SSH访问,那么,Manager会通过node取得它的bin-log
    • Determining New Master(latest? candidate master?): 选取新的master,有两种策略,第一种是最简单的,如果用户在配置文件里配置了哪台slave作为candidate master,那么直接用那台slave做新的master,第二种是在用户没有配置的情况下用的,选择同步到最新的slave作为master
    • New Master Diff Log Generation: 因为node已经取得了死掉的master的bin-log,所以可以跟新选择的master的log进行比较,生成diff log
    • Master Log Apply: 将diff log应用在新的master上

      • Apply log(node): 由node来应用diff log,以使新的master的数据跟死掉的master的数据是一致的
      • master_ip_failover_script: 然后再调用一次用户自定义的脚本,同样可以用来做IP转移
  • Slaves Recovery: 恢复所有的slave阶段

    • Starting Parallel Slave Diff Log Generation: 同样的生成diff log,因为可能有多个slave,所以是parallel(并行)的
    • Starting Parallel Slave Log Apply: 将diff log应用在slave上

      • Apply log(node): 由node来应用diff log
      • Change master to…: 将slave的master指向新的master
  • New master cleanup: 新的master原来是slave,所以要清掉他的slave信息

整个过程就是这样的,不是很复杂,也不是很简单,这个过程基本上可以保证数据不会丢失,同时所有的slave都能同步到一样的状态。可以查看log日志验证以上步骤,log比较长,已经做了删减,不想看的可以忽略:

Sat Jun 28 01:57:07 2014 - [info] Starting master failover.
Sat Jun 28 01:57:07 2014 - [info] 
Sat Jun 28 01:57:07 2014 - [info] * Phase 1: Configuration Check Phase..
Sat Jun 28 01:57:07 2014 - [info] 
Sat Jun 28 01:57:07 2014 - [info] Dead Servers:
Sat Jun 28 01:57:07 2014 - [info]   hh1(10.3.40.102:3306)
Sat Jun 28 01:57:07 2014 - [info] Checking master reachability via mysql(double check)..
Sat Jun 28 01:57:07 2014 - [info]  ok.
Sat Jun 28 01:57:07 2014 - [info] Alive Servers:
Sat Jun 28 01:57:07 2014 - [info]   hh2(10.3.40.37:3306)
Sat Jun 28 01:57:07 2014 - [info]   hh3(10.3.40.71:3306)
Sat Jun 28 01:57:07 2014 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Jun 28 01:57:07 2014 - [info] 
Sat Jun 28 01:57:07 2014 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Jun 28 01:57:07 2014 - [info] 
Sat Jun 28 01:57:07 2014 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Jun 28 01:57:07 2014 - [info] Executing master IP deactivatation script:
Sat Jun 28 01:57:07 2014 - [info]   /usr/share/bi/master_ip_failover.sh --orig_master_host=hh1 --orig_master_ip=10.3.40.102 --orig_master_port=3306 --command=stopssh --ssh_user=root  
Sat Jun 28 01:57:08 2014 - [info]  done.
Sat Jun 28 01:57:08 2014 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Jun 28 01:57:08 2014 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Jun 28 01:57:08 2014 - [info] 
Sat Jun 28 01:57:08 2014 - [info] * Phase 3: Master Recovery Phase..
Sat Jun 28 01:57:08 2014 - [info] 
Sat Jun 28 01:57:08 2014 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Jun 28 01:57:08 2014 - [info] 
Sat Jun 28 01:57:08 2014 - [info] The latest binary log file/position on all slaves is binlog.000002:120
...
Sat Jun 28 01:57:08 2014 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Sat Jun 28 01:57:08 2014 - [info] 
Sat Jun 28 01:57:09 2014 - [info] Fetching dead master's binary logs..
Sat Jun 28 01:57:13 2014 - [info] scp from root@10.3.40.102:/usr/share/bi/app1/saved_master_binlog_from_hh1_3306_20140628015707.binlog to local:/usr/share/bi/app1/saved_master_binlog_from_hh1_3306_20140628015707.binlog succeeded.
Sat Jun 28 01:57:17 2014 - [info] * Phase 3.3: Determining New Master Phase..
Sat Jun 28 01:57:17 2014 - [info] Searching new master from slaves..
Sat Jun 28 01:57:17 2014 - [info]  Candidate masters from the configuration file:
Sat Jun 28 01:57:17 2014 - [info]  Non-candidate masters:
Sat Jun 28 01:57:17 2014 - [info] New master is hh2(10.3.40.37:3306)
Sat Jun 28 01:57:17 2014 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Sat Jun 28 01:57:19 2014 - [info] * Phase 3.4: Master Log Apply Phase..
Sat Jun 28 01:57:19 2014 - [info] Starting recovery on hh2(10.3.40.37:3306)..
Sat Jun 28 01:57:19 2014 - [info]  Generating diffs succeeded.
Sat Jun 28 01:57:19 2014 - [info] Waiting until all relay logs are applied.
Sat Jun 28 01:57:21 2014 - [info] ** Finished master recovery successfully.
Sat Jun 28 01:57:21 2014 - [info] * Phase 3: Master Recovery Phase completed.
Sat Jun 28 01:57:21 2014 - [info] * Phase 4: Slaves Recovery Phase..
Sat Jun 28 01:57:21 2014 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Sat Jun 28 01:57:21 2014 - [info] -- Slave diff file generation on host hh3(10.3.40.71:3306) started, pid: 29373. Check tmp log /usr/share/bi/app1/hh3_3306_20140628015707.log if it takes time..
Sat Jun 28 01:57:21 2014 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Sat Jun 28 01:57:21 2014 - [info] -- Slave recovery on host hh3(10.3.40.71:3306) started, pid: 29375. Check tmp log /usr/share/bi/app1/hh3_3306_20140628015707.log if it takes time..
Sat Jun 28 01:57:24 2014 - [info] scp from local:/usr/share/bi/app1/saved_master_binlog_from_hh1_3306_20140628015707.binlog to root@hh3:/usr/share/bi/app1/saved_master_binlog_from_hh1_3306_20140628015707.binlog succeeded.
Sat Jun 28 01:57:24 2014 - [info] Starting recovery on hh3(10.3.40.71:3306)..
Sat Jun 28 01:57:24 2014 - [info]  Generating diffs succeeded.
...
Sat Jun 28 01:57:26 2014 - [info] -- Slave recovery on host hh3(10.3.40.71:3306) succeeded.
Sat Jun 28 01:57:26 2014 - [info] All new slave servers recovered successfully.
Sat Jun 28 01:57:26 2014 - [info] 
Sat Jun 28 01:57:26 2014 - [info] * Phase 5: New master cleanup phase..
Sat Jun 28 01:57:26 2014 - [info] 
Sat Jun 28 01:57:26 2014 - [info] Resetting slave info on the new master..
Sat Jun 28 01:57:26 2014 - [info]  hh2: Resetting slave info succeeded.
Sat Jun 28 01:57:26 2014 - [info] Master failover to hh2(10.3.40.37:3306) completed successfully.

MHA的优点:

  • 实现了故障转移,快速将从服务器晋级为主服务器(通常在10-30s)
  • 不会有性能损耗,容易安装
  • 不必更改现有的部署环境,适用于任何存储引擎
  • 使用半同步复制,可以大大降低数据丢失的风险

MHA的缺点

  • 虽然MHA试图从死机的master上保存二进制日志,但如果死机的master是由于硬件损坏等原因导致的,无法通过SSH访问到,那么也是有可能引起数据丢失的
  • 当master死机,切换到另外的服务器后,即使恢复了master,也不能立即加入到MHA系统的监控中去,要重新部署

以上,Hope you enjoy !