Notes from an ES Repair

0x00 How It Started

A while ago the company's log server went down, and the developers complained that they could not access the logs. Logging into one of the servers, I found that Elasticsearch would not start: the data disk was not mounted, and there was no mount entry for it in fstab. The cluster runs on Alibaba Cloud, and a DHCP-failure remediation script had been run on the failed ES node, so I suspect that script triggered a server reboot.
After mounting the data disk and starting the ES node, the cluster status was still red, but the node count was back to 3.

$ curl -XGET -u user:passwd 'http://127.0.0.1:9200/_cluster/health?pretty'
{
  "cluster_name" : "es_cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 1124,
  "active_shards" : 1126,
  "relocating_shards" : 0,
  "initializing_shards" : 8,
  "unassigned_shards" : 1178,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 48.702422145328725
}
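The active_shards_percent_as_number figure is simply active shards divided by the total shard count; a quick check with the numbers above:

```shell
# active_shards_percent = active / (active + initializing + unassigned) * 100
active=1126; initializing=8; unassigned=1178
total=$((active + initializing + unassigned))    # 2312 shards in total
# bash has no floating point, so do the division in awk
pct=$(awk -v a=$active -v t=$total 'BEGIN { printf "%.2f", a / t * 100 }')
echo "$pct"    # matches the 48.70... reported by the health API
```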

0x01 Red to Yellow

After watching for a while, unassigned_shards was not going down. A round of Googling pointed to shards stuck in the unassigned state, so I used the ES API to list them:

curl -XGET -u user:passwd http://127.0.0.1:9200/_cat/shards  | grep UNASSIGNED

There were a large number of unassigned shards: in the third column, p marks a primary shard and r a replica.

logstash-2018.12.11 2 r UNASSIGNED
logstash-2018.12.11 1 r UNASSIGNED
logstash-2018.12.11 0 r UNASSIGNED
logstash-2019.05.06 3 p UNASSIGNED
logstash-2019.05.06 3 r UNASSIGNED
logstash-2019.05.06 4 p UNASSIGNED
logstash-2019.05.06 4 r UNASSIGNED
logstash-2019.05.06 2 p UNASSIGNED
logstash-2019.05.06 2 r UNASSIGNED
logstash-2019.05.06 1 p UNASSIGNED
logstash-2019.05.06 1 r UNASSIGNED
logstash-2019.05.06 0 p UNASSIGNED
logstash-2019.05.06 0 r UNASSIGNED
logstash-2019.03.16 4 r UNASSIGNED
logstash-2019.03.16 2 r UNASSIGNED
logstash-2019.03.16 1 r UNASSIGNED
logstash-2019.02.20 3 r UNASSIGNED
logstash-2019.02.20 2 r UNASSIGNED
logstash-2019.02.20 0 r UNASSIGNED
logstash-2019.02.06 3 r UNASSIGNED
logstash-2019.02.06 1 r UNASSIGNED
logstash-2019.02.06 0 r UNASSIGNED
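To get a feel for how much of each kind was missing, the third column of that listing can be tallied (a few sample rows are reused here; on the live cluster you would pipe the curl output in instead):

```shell
# Count unassigned primaries (p) vs replicas (r) in _cat/shards output.
shards='logstash-2019.05.06 3 p UNASSIGNED
logstash-2019.05.06 3 r UNASSIGNED
logstash-2019.05.06 4 p UNASSIGNED
logstash-2019.05.06 4 r UNASSIGNED'
primaries=$(echo "$shards" | awk '$3 == "p" { n++ } END { print n }')
replicas=$(echo "$shards" | awk '$3 == "r" { n++ } END { print n }')
echo "$primaries unassigned primaries, $replicas unassigned replicas"
```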

I then tried allocating them to nodes by hand. Primaries could be assigned with allocate_empty_primary (which creates an empty shard, hence the accept_data_loss flag below), but trying to allocate a replica the same way failed with an "already exists" error — replicas are recovered by the cluster once their primary is assigned.

# Shard allocation API.
# "index": the index name  (first column of the listing above)
# "shard": the shard number (second column)
# "node":  the target node name
# (JSON does not allow comments, so the notes are kept out of the payload.)
curl -H "Content-Type: application/json" -XPOST -u user:passwd http://127.0.0.1:9200/_cluster/reroute -d '{
  "commands" : [ {
    "allocate_empty_primary" : {
      "index" : "logstash-2019.05.06",
      "shard" : "3",
      "node" : "node-1",
      "accept_data_loss" : true
    }
  } ]
}'

With the command verified on one shard, I wrote a script to repair all of them:

#!/bin/bash
# First dump the unassigned primaries into the error_shard file:
#   curl -XGET -u user:passwd http://127.0.0.1:9200/_cat/shards | grep 'p UNASSIGNED' > error_shard
es_arry=(
    "node-1"
    "node-2"
    "node-3"
)
sum=0

while read line
do
    index=$(echo $line | awk '{print $1}')
    shard=$(echo $line | awk '{print $2}')
    sum=$((sum + 1))
    node=$((sum % 3))    # rotate the target across the three nodes
    curl -H "Content-Type: application/json" -XPOST -u user:passwd http://127.0.0.1:9200/_cluster/reroute -d '{
        "commands" : [ {
            "allocate_empty_primary" : {
                "index" : "'${index}'",
                "shard" : "'${shard}'",
                "node" : "'${es_arry[node]}'",
                "accept_data_loss" : true
            }
        } ]
    }'
done < error_shard
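The `sum % 3` line is what spreads the repaired primaries round-robin across the three nodes; a dry run of just that logic:

```shell
# Shard i lands on es_arry[i % 3]; with sum starting at 1 the cycle is
# node-2, node-3, node-1, node-2, ...
es_arry=("node-1" "node-2" "node-3")
sum=0
for shard in 0 1 2 3; do
    sum=$((sum + 1))
    node=$((sum % 3))
    echo "shard $shard -> ${es_arry[node]}"
done
```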

After all the unassigned primaries were repaired, the cluster status went from red to yellow.

0x02 A Detour Caused by Deletion

After a night of observation the cluster was still yellow. Checking disk space showed one server at 83% usage and the other two at 87%; the nodes would no longer accept new shards or replicas, so the leftover shards and replicas were left dangling with nowhere to go. After some discussion we decided to keep only one month of data.
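This matches Elasticsearch's disk-based allocation thresholds: by default a node above the 85% low watermark stops accepting new shards. The thresholds can be inspected or adjusted through the cluster settings API — a sketch (the values shown are the documented defaults, not a recommendation):

```shell
# Confirm/adjust the disk watermarks via the cluster settings API.
curl -H "Content-Type: application/json" -XPUT -u user:passwd \
  'http://127.0.0.1:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'
```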

curl -u user:password -H 'Content-Type: application/json' -d '{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-30d",
        "format": "epoch_millis"
      }
    }
  }
}' -XPOST "http://127.0.0.1:9200/*-*/_delete_by_query?pretty"

The command above ran for half a day with no visible progress. When I checked, the cluster was red again and the node count had dropped to 2 — the delete had knocked an ES node over. (_delete_by_query removes documents one by one, which is far more expensive than simply dropping indices.)
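In hindsight, a query this large is better launched as a background task, so the shell returns immediately and progress can be polled through the task API; a sketch of that variant:

```shell
# Launch delete-by-query asynchronously; the response contains a task id.
curl -u user:password -H 'Content-Type: application/json' \
  -XPOST "http://127.0.0.1:9200/*-*/_delete_by_query?wait_for_completion=false" -d '{
  "query": { "range": { "@timestamp": { "lt": "now-30d" } } }
}'
# Poll the running delete-by-query tasks.
curl -u user:password -XGET "http://127.0.0.1:9200/_tasks?detailed=true&actions=*/delete/byquery"
```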

0x03 Yellow to Green

After a round of firefighting I pulled the dead ES node back into the cluster and got the status from red back to yellow. Then I switched to a different way of deleting old data — dropping entire indices:

1
curl -X DELETE -u user:passwd 'http://127.0.0.1:9200/logstash-2019.03*'

Manually deleting a large batch of historical indices freed up plenty of disk space, and after a few hours of waiting the cluster returned to green.

0x04 A Program for Scheduled Deletion

package main

import (
	"flag"
	"fmt"
	"net/http"
	"strings"
	"time"
)

type values struct {
	index  string
	format string
	day    int
}

func main() {
	f := &values{}
	f.valueFlag()
	now := time.Now()
	// The date n days ago; its formatted name is the index to delete.
	t := now.AddDate(0, 0, -f.day)
	err := deleteEsIndex(f.index + t.Format(f.format))
	if err != nil {
		fmt.Println(err)
	}
}

func (v *values) valueFlag() {
	index := flag.String("index", "logstash-", "index name prefix")
	format := flag.String("format", "yyyy.mm.dd", "date format used in index names")
	day := flag.Int("n", 30, "delete the index from n days ago")
	flag.Parse()
	// Translate the human-readable pattern into Go's reference-time layout.
	*format = strings.Replace(*format, "yyyy", "2006", -1)
	*format = strings.Replace(*format, "mm", "01", -1)
	*format = strings.Replace(*format, "dd", "02", -1)
	v.index = *index
	v.format = *format
	v.day = *day
}

func deleteEsIndex(index string) error {
	url := "http://user:password@127.0.0.1:9200/" + index
	request, err := http.NewRequest("DELETE", url, nil)
	if err != nil {
		return err
	}
	request.Header.Set("Content-Type", "application/json")
	client := &http.Client{}
	resp, err := client.Do(request)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("delete %s: unexpected status %s", index, resp.Status)
	}
	return nil
}
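The same index-name arithmetic can be done from the shell (assuming GNU date), which is handy for a one-off run or a crontab entry alongside the Go binary:

```shell
# Build the name of the logstash index from 30 days ago, e.g. logstash-2019.04.06.
index="logstash-$(TZ=UTC date -d '30 days ago' +%Y.%m.%d)"
echo "$index"
# Sanity-check the format string against a fixed timestamp (epoch 0, UTC):
fmt_check=$(TZ=UTC date -d @0 +%Y.%m.%d)
```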

Notes

Other handy maintenance endpoints:

  • Set the number of replicas (0 disables replicas)
    curl -XPUT -u user:password -H 'Content-Type: application/json' 'http://127.0.0.1:9200/index/_settings' -d '{"number_of_replicas": 0}'

  • List all indices
    curl -XGET -u user:password http://127.0.0.1:9200/_cat/indices\?v

  • View detailed cluster-wide statistics
    curl -XGET -u user:password 'http://127.0.0.1:9200/_cluster/stats?pretty'
