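Background: after a machine-room outage, one node in our redis-cluster went down. I restarted the node, rejoined it to the cluster, and thought no more of it.
Then I heard, "Xiao Xu, come over here, a disk alert just fired on a server." I looked at the Tomcat log on the application server: it was being flooded with the Redis error CLUSTERDOWN Hash slot not served
Alarmed, I went straight to the Redis cluster to check whether the node I had just rejoined was the problem.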
[root@admin ~]# redis-cli -c -h 192.168.207.251 -p 7001
192.168.207.251:7001>set testfxkj 6666
(error) CLUSTERDOWN Hash slot not served
192.168.207.251:7001>exit
The error points to a problem with the cluster's hash slots.
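For context, Redis Cluster maps every key to one of 16384 hash slots (CRC16 of the key, modulo 16384), and a write fails with this error when the key's slot has no owner. The following is a minimal sketch of that mapping, not code from the incident; the function name key_slot is my own, and it relies on Python's binascii.crc_hqx with an initial value of 0 being the same CRC16 variant that Redis uses.

```python
from binascii import crc_hqx

SLOT_COUNT = 16384  # Redis Cluster always has slots 0..16383


def key_slot(key: str) -> int:
    """Return the hash slot a key maps to, honoring {hash tags}."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16-CCITT (XMODEM) variant.
    return crc_hqx(data, 0) % SLOT_COUNT


# Slot of the test key used in this article.
print(key_slot("testfxkj"))
```

On a live cluster, CLUSTER KEYSLOT <key> returns the same number, which makes it easy to see which slot a failing write is aimed at.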
Exit redis-cli and check the cluster with redis-trib.rb check.
Note: I am running Redis 3.2, so I use redis-trib.rb check.
On Redis 5 or later, redis-trib.rb has been folded into redis-cli; use redis-cli --cluster check instead.
[root@admin ~]# redis-trib.rb check 192.168.207.251:7001
The check reports: [ERR] Not all 16384 slots are covered by nodes
In other words, at least one of the 16384 hash slots (numbered 0 to 16383) is not assigned to any node.
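To see which slot is missing without reading the ranges by eye, the coverage check can be sketched in a few lines. This is only an illustration, not how redis-trib.rb is implemented; the function name uncovered_slots is hypothetical, and the ranges are copied by hand from the three masters listed in the tool output that follows (0-5459, 5461-10922, 10923-16383).

```python
SLOT_COUNT = 16384  # slots are numbered 0..16383


def uncovered_slots(ranges):
    """Return the sorted slot numbers that no (start, end) range covers."""
    covered = [False] * SLOT_COUNT
    for start, end in ranges:
        for slot in range(start, end + 1):  # ranges are inclusive
            covered[slot] = True
    return [slot for slot, ok in enumerate(covered) if not ok]


# Ranges copied by hand from the masters in the tool output.
master_ranges = [(0, 5459), (5461, 10922), (10923, 16383)]
print(uncovered_slots(master_ranges))  # [5460]
```

The single gap at 5460 matches the "List of not covered slots" line that the fix command prints.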
Next, try to repair the cluster with the officially recommended redis-trib.rb fix command (again, on Redis 5 or later use redis-cli --cluster fix).
[root@admin ~]# redis-trib.rb fix 192.168.207.251:7001
>>> Performing Cluster Check (using node 192.168.207.251:7001)
M: 3d1808df8a82dd705a67e8fef26cb8c8482bd69a 192.168.207.251:7001
slots:0-5459 (5460 slots) master
1 additional replica(s)
S: 6fa8513ee71e0153e2e746fef2a3b1ce08348766 192.168.207.159:7003
slots: (0 slots) slave
replicates a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32
M: 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c 192.168.207.160:7001
slots:10923-16383 (5461 slots) master
1 additional replica(s)
S: d057d13bc64b9be9f68c0291700965b3751127c0 192.168.207.159:7005
slots: (0 slots) slave
replicates 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c
M: a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32 192.168.207.160:7002
slots:5461-10922 (5462 slots) master
1 additional replica(s)
S: a84cb46e811925ee48f0ff8c25e753be752fc45c 192.168.207.159:7004
slots: (0 slots) slave
replicates 3d1808df8a82dd705a67e8fef26cb8c8482bd69a
[ERR] Nodes don't agree about configuration!
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.
>>> Fixing slots coverage...
List of not covered slots: 5460
Slot 5460 has keys in 0 nodes:
The folowing uncovered slots have no keys across the cluster:
5460
Fix these slots by covering with a random node? (type 'yes' to accept): yes
>>> Covering slot 6829 with 192.168.207.251:7001
/usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis/client.rb:114:in `call': ERR Slot 6829 is already busy (Redis::CommandError)
from /usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis.rb:2646:in `block in method_missing'
from /usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis.rb:57:in `block in synchronize'
from /usr/local/ruby/lib/ruby/2.2.0/monitor.rb:211:in `mon_synchronize'
from /usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis.rb:57:in `synchronize'
from /usr/local/ruby/lib/ruby/gems/2.2.0/gems/redis-3.2.2/lib/redis.rb:2645:in `method_missing'
from /usr/local/bin/redis-trib.rb:462:in `block in fix_slots_coverage'
from /usr/local/bin/redis-trib.rb:459:in `each'
from /usr/local/bin/redis-trib.rb:459:in `fix_slots_coverage'
from /usr/local/bin/redis-trib.rb:398:in `check_slots_coverage'
from /usr/local/bin/redis-trib.rb:361:in `check_cluster'
from /usr/local/bin/redis-trib.rb:1139:in `fix_cluster_cmd'
from /usr/local/bin/redis-trib.rb:1695:in `<main>'
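The fix failed with ERR Slot 6829 is already busy: from this node's point of view, slot 6829 is still assigned.
Log in to the node and run CLUSTER DELSLOTS 6829 (this makes a particular Redis Cluster node forget which master serves the given hash slot).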
[root@admin ~]# redis-cli -c -h 192.168.207.251 -p 7001
192.168.207.251:7001> CLUSTER DELSLOTS 6829
OK
Run redis-trib.rb fix again to repair the cluster:
[root@admin ~]# redis-trib.rb fix 192.168.207.251:7001
>>> Performing Cluster Check (using node 192.168.207.251:7001)
M: 3d1808df8a82dd705a67e8fef26cb8c8482bd69a 192.168.207.251:7001
slots:0-6829 (6829 slots) master
1 additional replica(s)
S: 6fa8513ee71e0153e2e746fef2a3b1ce08348766 192.168.207.159:7003
slots: (0 slots) slave
replicates a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32
M: 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c 192.168.207.160:7001
slots:10923-16383 (5461 slots) master
1 additional replica(s)
S: d057d13bc64b9be9f68c0291700965b3751127c0 192.168.207.159:7005
slots: (0 slots) slave
replicates 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c
M: a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32 192.168.207.160:7002
slots:5461-10922 (5462 slots) master
1 additional replica(s)
S: a84cb46e811925ee48f0ff8c25e753be752fc45c 192.168.207.159:7004
slots: (0 slots) slave
replicates 3d1808df8a82dd705a67e8fef26cb8c8482bd69a
[ERR] Nodes don't agree about configuration!
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.
>>> Fixing slots coverage...
List of not covered slots: 6829
Slot 5460 has keys in 0 nodes:
The folowing uncovered slots have no keys across the cluster:
5460
Fix these slots by covering with a random node? (type 'yes' to accept): yes
>>> Covering slot 6829 with 192.168.207.251:7001
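The [ERR] Nodes don't agree about configuration! line is worth a second look: in the output as pasted above, the range shown for 192.168.207.251:7001 (0-6829) overlaps the one shown for 192.168.207.160:7002 (5461-10922). Assuming those ranges are accurate, a small sketch like the following can list the slots claimed by more than one master. The function name overlapping_slots is hypothetical and the ranges are typed in by hand, not read from a live cluster.

```python
def overlapping_slots(owners):
    """Map each slot claimed by more than one node to the sorted claimants."""
    claims = {}
    for node, ranges in owners.items():
        for start, end in ranges:
            for slot in range(start, end + 1):  # ranges are inclusive
                claims.setdefault(slot, set()).add(node)
    return {s: sorted(n) for s, n in claims.items() if len(n) > 1}


# Ranges typed in by hand from the masters in the output above.
owners = {
    "192.168.207.251:7001": [(0, 6829)],
    "192.168.207.160:7002": [(5461, 10922)],
    "192.168.207.160:7001": [(10923, 16383)],
}
conflicts = overlapping_slots(owners)
print(len(conflicts), min(conflicts), max(conflicts))  # 1369 5461 6829
```

Comparing CLUSTER NODES output from each master would be the way to confirm whether such an overlap really exists on the cluster or is only an artifact of the pasted output.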
Run the tool once more to check the result:
[root@admin ~]# redis-trib.rb fix 192.168.207.251:7001
>>> Performing Cluster Check (using node 192.168.207.251:7001)
M: 3d1808df8a82dd705a67e8fef26cb8c8482bd69a 192.168.207.251:7001
slots:0-6829 (6829 slots) master
1 additional replica(s)
S: 6fa8513ee71e0153e2e746fef2a3b1ce08348766 192.168.207.159:7003
slots: (0 slots) slave
replicates a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32
M: 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c 192.168.207.160:7001
slots:10923-16383 (5461 slots) master
1 additional replica(s)
S: d057d13bc64b9be9f68c0291700965b3751127c0 192.168.207.159:7005
slots: (0 slots) slave
replicates 36dc79859c644cb03e6c1b0ec4a2d8bf56cfa39c
M: a10cf2bdaf1c445c34d6ae447a2f79bc5d4e9c32 192.168.207.160:7002
slots:5460-10922 (5463 slots) master
1 additional replica(s)
S: a84cb46e811925ee48f0ff8c25e753be752fc45c 192.168.207.159:7004
slots: (0 slots) slave
replicates 3d1808df8a82dd705a67e8fef26cb8c8482bd69a
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
All 16384 hash slots are now assigned.
Now log in to the cluster again and test writing a key:
[root@admin ~]# redis-cli -c -h 192.168.207.251 -p 7001
192.168.207.251:7001>set testfxkj 6666
OK
The key was written successfully.
The Tomcat log has also stopped printing CLUSTERDOWN Hash slot not served
Source: https://www.cnaaa.net. If you repost this article, please credit the source: https://www.cnaaa.net/archives/8182