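I. Introduction
The article "Seven Common Scenarios for Galera Cluster Recovery" (《Galera集群恢复的常见七种场景》) describes seven recovery scenarios for Galera cluster failures. Leaving aside the split-brain scenario (Scenario 7), this post covers the first six with a shell script, check-or-recover-galera.sh, which checks the Galera cluster and recovers it from failure. The script can be run as part of the business system's service auto-start.
Galera cluster environment: three nodes (controller01, controller02 and controller03), CentOS 7.1, MariaDB 10.1.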
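II. Script
1. Recovery process
The overall flow has three parts: service check, cluster recovery, and status check.
In detail: the script first checks the state of the mariadb service on all three nodes, then uses the overall service state to decide which recovery scenario applies.
- If the service has stopped on one or two nodes (Scenarios 1, 2, 4 and 5), simply start the service on those nodes.
- If the service has stopped on all nodes (Scenarios 3 and 6), the script must identify the node that holds the most recent transaction state, bootstrap the Galera cluster on that node first, and then start the service on the other nodes. The key problem becomes finding that node, which splits into two cases:
  - All nodes were stopped gracefully (Scenario 3): read the last committed transaction number, seqno, from grastate.dat on each node and sort the values. The node with the largest seqno is the bootstrap node.
  - At least one node was not stopped gracefully (Scenario 6 and others): this is usually caused by a data-center power outage or similar. The seqno in grastate.dat is -1 on every such node, so the script reads the Primary Component record in gvwstate.dat instead. The node whose my_uuid equals the UUID on its view_id line is the bootstrap node.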
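The two lookups above can be sketched as small shell helpers. This is only a minimal sketch under assumptions: the function names `read_seqno`, `pick_latest_node` and `is_primary_member` are hypothetical, and each helper reads a local copy of the file, whereas the script fetches the files from /var/lib/mysql on each node over ssh.

```shell
# Hypothetical helpers for illustration; they are not part of the script.

# Print the seqno recorded in a grastate.dat file, e.g. "1234" or "-1".
read_seqno() {
    awk '$1 == "seqno:" {print $2}' "$1"
}

# Read "seqno host" lines on stdin and print the host with the largest seqno.
pick_latest_node() {
    sort -n -r | head -n 1 | awk '{print $2}'
}

# Succeed when my_uuid equals the UUID on the view_id line of a gvwstate.dat file.
# Assumed line shapes: "my_uuid: <uuid>" and "view_id: <type> <uuid> <seq>".
is_primary_member() {
    awk '$1 == "my_uuid:" {my = $2}
         $1 == "view_id:" {view = $3}
         END {exit !(my != "" && (my "") == (view ""))}' "$1"
}
```

For example, `printf '100 controller01\n250 controller02\n' | pick_latest_node` prints controller02, and `is_primary_member gvwstate.dat && echo bootstrap-here` prints only on a node whose my_uuid matches its view_id.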
Once the bootstrap node is found, the script recovers the Galera cluster and finally runs the status check.
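The decision flow can be condensed into one function that maps the counts to an action. This is a sketch rather than the script's own code: `choose_action` and its action labels are hypothetical names. The branch for exactly one abnormal stop follows the script, which bootstraps from the single node whose seqno is -1; the list above does not spell that case out.

```shell
# Hypothetical helper: map the node counts to a recovery action label.
#   $1 = number of nodes whose mariadb service is not running
#   $2 = number of those nodes whose grastate.dat seqno is -1
choose_action() {
    case "$1" in
        0)   echo "none" ;;                    # cluster is healthy
        1|2) echo "start-stopped-nodes" ;;     # Scenarios 1, 2, 4, 5
        3)
            case "$2" in
                0)   echo "bootstrap-max-seqno" ;;     # Scenario 3
                1)   echo "bootstrap-abnormal-node" ;; # the only node with seqno -1
                2|3) echo "bootstrap-primary-view" ;;  # compare my_uuid and view_id
                *)   echo "error" ;;
            esac ;;
        *)   echo "error" ;;
    esac
}
```

For example, `choose_action 3 0` prints bootstrap-max-seqno and `choose_action 2 1` prints start-stopped-nodes.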
2. Script content
The content of check-or-recover-galera.sh is as follows:
```shell
#!/bin/bash
# Uses bash arrays, so it must run under bash rather than plain sh.

password_galera_root=a263f6a89fa2
STOP_NODES=()
UUID=$(uuidgen)
rm -rf /tmp/GTID_*

# Among the nodes whose seqno is -1, print the one whose my_uuid matches
# the UUID on the view_id line of gvwstate.dat.
findBootstrapNode(){
    for host in $(grep -e "^-1 " /tmp/GTID_${UUID} | awk '{print $2}')
    do
        VIEW_ID=$(ssh ${host} cat /var/lib/mysql/gvwstate.dat | grep view_id | awk '{print $3}')
        MY_UUID=$(ssh ${host} cat /var/lib/mysql/gvwstate.dat | grep my_uuid | awk '{print $2}')
        if [ -n "$MY_UUID" ] && [ "$VIEW_ID" = "$MY_UUID" ]; then
            echo $host
            break
        fi
    done
}

### 1. Check the mariadb service on every node
for i in 01 02 03; do
    FLAG=$(ssh controller$i systemctl status mariadb | grep Active: | grep running | wc -l)
    if [ "${FLAG}" = "0" ]; then
        echo "[INFO] controller$i is down!"
        STOP_NODES+=("controller$i")
        # Record "seqno host" for every stopped node
        seqno=$(ssh controller$i cat /var/lib/mysql/grastate.dat | grep seqno: | awk '{print $2}')
        echo "$seqno controller$i" >> /tmp/GTID_$UUID
    elif [ "$FLAG" = "1" ]; then
        echo "[INFO] controller$i is up!"
    else
        echo "[ERROR] Failed to get the mariadb status of controller$i!"
        exit 127
    fi
done

### 2. Recover Galera Cluster
CLUSTER_SIZE=$((3 - ${#STOP_NODES[@]}))
if [ "${CLUSTER_SIZE}" = "3" ]; then
    echo "[INFO] Galera is OK!"
elif [ "$CLUSTER_SIZE" = "2" ] || [ "$CLUSTER_SIZE" = "1" ]; then
    echo "[INFO] One or two MariaDB nodes are down!"
    ## 2.1 Only start the mariadb service on the stopped nodes
    for node in "${STOP_NODES[@]}"; do
        ssh ${node} systemctl start mariadb
    done
elif [ "${CLUSTER_SIZE}" = "0" ]; then
    echo "[INFO] All MariaDB nodes are down!"
    # Number of nodes that did not shut down gracefully (seqno is -1)
    ABNORMAL_SIZE=$(grep -c -e "^-1 " /tmp/GTID_$UUID)
    ## 2.2 Find the node with the latest state, bootstrap it, then start the others
    ## 2.2.1 All three nodes were gracefully stopped
    if [ "$ABNORMAL_SIZE" = "0" ]; then
        BOOTSTRAP_NODE=$(sort -n -r /tmp/GTID_$UUID | head -n 1 | awk '{print $2}')
        echo "[INFO] All three nodes are gracefully stopped!"
    ## 2.2.2 Some or all nodes went down without a proper shutdown procedure
    elif [ "$ABNORMAL_SIZE" = "1" ]; then
        BOOTSTRAP_NODE=$(grep -e "^-1 " /tmp/GTID_$UUID | awk '{print $2}')
        echo "[INFO] One node disappeared from the Galera Cluster! Two nodes are gracefully stopped!"
    elif [ "$ABNORMAL_SIZE" = "2" ]; then
        echo "[INFO] Two nodes disappeared from the Galera Cluster! One node is gracefully stopped!"
        BOOTSTRAP_NODE=$(findBootstrapNode)
    elif [ "$ABNORMAL_SIZE" = "3" ]; then
        echo "[INFO] All nodes went down without proper shutdown procedure!"
        BOOTSTRAP_NODE=$(findBootstrapNode)
    else
        echo "[ERROR] No grastate.dat or gvwstate.dat file!"
        exit 127
    fi
    if [ -z "$BOOTSTRAP_NODE" ]; then
        echo "[ERROR] Could not determine the bootstrap node!"
        exit 127
    fi

    ### Recover Galera
    echo "[INFO] The bootstrap node is: $BOOTSTRAP_NODE"
    # PID of any leftover process still listening on the Galera port 4567
    MYSQL_PID=$(ssh $BOOTSTRAP_NODE netstat -ntlp | grep 4567 | awk '{print $7}' | awk -F "/" '{print $1}')
    # The heredoc is unquoted on purpose so that $MYSQL_PID expands locally.
    ssh $BOOTSTRAP_NODE /bin/bash << EOF
[ -n "$MYSQL_PID" ] && kill -9 $MYSQL_PID
mv /var/lib/mysql/gvwstate.dat /var/lib/mysql/gvwstate.dat.bak
galera_new_cluster
EOF
    for i in 01 02 03; do
        if [ "controller$i" = "$BOOTSTRAP_NODE" ]; then
            echo "[INFO] controller$i's mariadb service status: $(ssh controller$i systemctl status mariadb | grep Active:)"
        else
            echo "[INFO] controller$i start service:"
            ssh "controller$i" systemctl start mariadb
        fi
    done
else
    echo "[ERROR] Failed to recover the Galera Cluster!"
    exit 127
fi

### 3. Check Galera Status
sleep 5
WSREP_CLUSTER_SIZE=$(mysql -uroot -p"$password_galera_root" -e "SHOW STATUS LIKE 'wsrep_cluster_size';" | grep wsrep_cluster_size | awk '{print $2}')
echo "[INFO] Galera cluster CLUSTER_SIZE: $WSREP_CLUSTER_SIZE"
if [ "${WSREP_CLUSTER_SIZE}" = "3" ]; then
    echo "[INFO] Galera Cluster is OK!"
    exit 0
elif [ "$WSREP_CLUSTER_SIZE" = "2" ]; then
    echo "[INFO] One MariaDB node is down!"
    exit 2
elif [ "$WSREP_CLUSTER_SIZE" = "1" ]; then
    echo "[INFO] Two MariaDB nodes are down!"
    exit 1
else
    echo "[INFO] All MariaDB nodes are down!"
    exit 3
fi
```
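Because the script reports its result through the exit code (0, 2, 1 and 3 from the final status check, 127 on errors), a caller such as a service start-up wrapper can branch on it. A minimal sketch, assuming a hypothetical helper named `describe_exit_code`; the wording of the messages is illustrative only:

```shell
# Hypothetical helper: translate the script's exit code into a short message.
describe_exit_code() {
    case "$1" in
        0)   echo "healthy: wsrep_cluster_size is 3" ;;
        2)   echo "degraded: one node is down" ;;
        1)   echo "degraded: two nodes are down" ;;
        3)   echo "down: cluster size could not be read or is 0" ;;
        127) echo "error: the check or the recovery failed" ;;
        *)   echo "unexpected exit code: $1" ;;
    esac
}

# Possible usage, assuming the script is in the current directory:
#   ./check-or-recover-galera.sh
#   describe_exit_code $?
```

Note that the codes are not ordered by severity (2 means one node down, 1 means two nodes down), so a caller should match them explicitly rather than compare them numerically.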