Network unavailability caused by the connection-tracking module

Today I shifted one service's traffic to a newly deployed machine. A few days earlier I had already canaried another service onto it, and it had been stable, so I planned to move more traffic over.
Around 11:00 I started directing the service's traffic to this machine. Shortly afterwards, the service team reported that it was unreachable. I rushed the traffic back to the original servers and kept the new machine aside for investigation.

The machine runs CentOS 7:

```shell
[root@localhost product]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@localhost product]# uname -a
Linux localhost 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
```

Finding the cause

Checking the service logs turned up the following:

```
2019-07-30 11:49:04 [ERROR] [metrics] sendto failed: operation not permitted
2019-07-30 11:49:04 [ERROR] can't watch /cloud/redisservice/xxx.xxx.xxx.xxx:9207: Get http://config.xxx.xxx.com/: dial tcp: lookup config.xxx.xxx.com on xxx.xxx.xxx.xxx:53: write udp xxx.xxx.xxx.xxx:42381->xxx.xxx.xxx.xxx:53: write: operation not permitted
2019-07-30 11:49:05 [ERROR] [metrics] sendto failed: operation not permitted
2019-07-30 11:49:08 [ERROR] [metrics] sendto failed: operation not permitted
2019-07-30 11:49:09 [ERROR] [metrics] sendto failed: operation not permitted
2019-07-30 11:49:11 [ERROR] [metrics] sendto failed: operation not permitted
2019-07-30 11:49:12 [ERROR] [metrics] sendto failed: operation not permitted
2019-07-30 11:49:13 [ERROR] [metrics] sendto failed: operation not permitted
```

The log is full of UDP errors: DNS lookups and our metrics sender (we ship metrics over UDP), all failing with "operation not permitted". On Linux, when netfilter drops an outgoing packet, the send syscall typically fails with EPERM, which is exactly this error. Searching turned up a similar case ("Linux UDP sendto error: Operation not permitted"): the connection-tracking table was full, causing packets to be dropped. So I pursued that direction first.

Checking kernel messages

Use dmesg to look at the kernel ring buffer:

```
[27772673.752270] nf_conntrack: table full, dropping packet
[27772678.728802] net_ratelimit: 1367 callbacks suppressed
[27772678.728809] nf_conntrack: table full, dropping packet
[27772678.728857] nf_conntrack: table full, dropping packet
[27772678.729754] nf_conntrack: table full, dropping packet
[27772678.732953] nf_conntrack: table full, dropping packet
[27772678.733410] nf_conntrack: table full, dropping packet
[27772678.734449] nf_conntrack: table full, dropping packet
[27772678.740774] nf_conntrack: table full, dropping packet
[27772678.742810] nf_conntrack: table full, dropping packet
[27772678.743280] nf_conntrack: table full, dropping packet
[27772678.746589] nf_conntrack: table full, dropping packet
[27783682.308966] sysctl (13991): drop_caches: 3
[27786553.698652] sysctl (2160): drop_caches: 3
```

Sure enough, the connection-tracking table was full.
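Besides dmesg, conntrack keeps per-CPU counters in /proc/net/stat/nf_conntrack (hex values; the `conntrack -S` command from conntrack-tools prints the same data). A small sketch that sums the insert_failed and drop columns (which I take to be columns 10 and 11); a fabricated sample row is fed in so it runs anywhere, while on a real host you would redirect from the file instead:

```shell
# Sum per-CPU insert_failed (col 10) and drop (col 11) counters.
# The data row below is fabricated; on a real host feed it with:
#   tail -n +2 /proc/net/stat/nf_conntrack
fail=0; drop=0
while read -r _ _ _ _ _ _ _ _ _ f d rest; do
  fail=$(( fail + 0x$f ))   # failed insertions into the table
  drop=$(( drop + 0x$d ))   # packets dropped by conntrack
done <<'EOF'
00001189 00000000 000f41a0 000210c3 0000004a 0000103c 00020f49 00020f49 000210c3 00000000 0000000a 00000000 00000000 00000000 00000000 00000000 0000026d
EOF
echo "insert_failed=$fail drop=$drop"
```

Non-zero drop counters here corroborate the "table full, dropping packet" messages.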

Inspecting the tracked connections

Since we've confirmed the table is full, let's look at what it is tracking:

```shell
[root@dbl14195 redisservice_product]# tail -f /proc/net/nf_conntrack
ipv4 2 tcp 6 28 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=63518 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=63518 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 64 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=60390 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=60390 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 86 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8788 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=8788 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 111 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=57070 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=57070 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 37 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=12400 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=12400 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 9 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=43830 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=43830 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 40 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=33886 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=33886 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 30 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=41580 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=41580 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 42 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=51388 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=51388 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 117 TIME_WAIT src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=63766 dport=8063 src=xxx.xxx.xxx.xxx dst=xxx.xxx.xxx.xxx sport=8063 dport=63766 [ASSURED] mark=0 zone=0 use=2
```

The table holds not just UDP entries but TCP connections as well.
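To see which states dominate the table, a one-liner over /proc/net/nf_conntrack helps: for TCP entries, the state name is field 6. A self-contained sketch, with two sample lines standing in for the real file:

```shell
# Count tracked TCP connections per state. On a real host:
#   awk '$3 == "tcp" { print $6 }' /proc/net/nf_conntrack | sort | uniq -c
summary=$(printf '%s\n' \
  'ipv4 2 tcp 6 28 TIME_WAIT src=10.0.0.1 dst=10.0.0.2 sport=63518 dport=8063' \
  'ipv4 2 tcp 6 120 ESTABLISHED src=10.0.0.1 dst=10.0.0.2 sport=60390 dport=8063' \
  | awk '$3 == "tcp" { print $6 }' | sort | uniq -c)
echo "$summary"
```

In our case the output would have been dominated by TIME_WAIT, the signature of many short-lived connections.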

Check some of the nf_conntrack parameters:

```shell
[root@localhost product]# cat /proc/sys/net/netfilter/nf_conntrack_count
4489
[root@localhost product]# cat /proc/sys/net/netfilter/nf_conntrack_max
65536
```

By default, nf_conntrack_max is 65536. The likely cause: a burst of short-lived connections hit this machine, the number of tracked connections exceeded that maximum, packets were dropped in bulk, and the service became unavailable.
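With the two values above, table usage is easy to compute, and nf_conntrack_max can be raised at runtime as a stopgap (the 262144 below is only an illustrative value, not a recommendation):

```shell
# Table usage from the two /proc values read above (hardcoded here).
count=4489    # /proc/sys/net/netfilter/nf_conntrack_count
max=65536     # /proc/sys/net/netfilter/nf_conntrack_max
echo "conntrack table usage: $(( count * 100 / max ))%"

# Temporary relief (root required; value is illustrative):
#   sysctl -w net.netfilter.nf_conntrack_max=262144
# When raising the max, also consider resizing the hash table so
# bucket chains stay short:
#   echo 65536 > /sys/module/nf_conntrack/parameters/hashsize
```

Note the count was sampled after traffic had been switched back, so it no longer reflects the peak.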

The fix

Trying to remove the conntrack modules

Check the conntrack-related modules:

```shell
[root@localhost product]# lsmod|grep conntrack
xt_conntrack 12760 1
nf_conntrack_ipv4 14862 2
nf_defrag_ipv4 12729 1 nf_conntrack_ipv4
nf_conntrack 105702 6 xt_CT,nf_nat,nf_nat_ipv4,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv4
```

This machine has the nf_conntrack family of modules loaded; the machine the service previously ran on does not. That explains why the service ran fine before and broke as soon as it moved here.

Try removing the modules:

```shell
[root@localhost product]# modprobe -r nf_conntrack_netbios_ns nf_conntrack_ipv4 xt_conntrack
modprobe: FATAL: Module nf_conntrack_ipv4 is in use.
[root@localhost product]# lsmod|grep nf_conntrack_ipv4
nf_conntrack_ipv4 14862 2
nf_defrag_ipv4 12729 1 nf_conntrack_ipv4
nf_conntrack 105702 6 xt_CT,nf_nat,nf_nat_ipv4,xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv4
[root@localhost product]# rmmod nf_conntrack
rmmod: ERROR: Module nf_conntrack is in use by: xt_CT nf_nat nf_nat_ipv4 xt_conntrack nf_nat_masquerade_ipv4 nf_conntrack_ipv4
```

The module is in use by other modules, which are in turn used by yet others, so it is hard to remove; unloading that many modules is also a fairly risky operation.

Disabling connection tracking

Raising the table's maximum is only a stopgap: the peak connection count is unknown, and tracking itself costs resources. Since the modules can't be unloaded for now, the next best option is to disable tracking.

A simple approach turned up: the following commands stop tracking for all connections:

```shell
iptables -t raw -A PREROUTING -j NOTRACK
iptables -t raw -A OUTPUT -j NOTRACK
```

To disable tracking only for a specific port, use:

```shell
iptables -t raw -I PREROUTING -p tcp --dport 80 -j NOTRACK
iptables -t raw -I PREROUTING -p tcp --sport 80 -j NOTRACK
iptables -t raw -I OUTPUT -p tcp --dport 80 -j NOTRACK
iptables -t raw -I OUTPUT -p tcp --sport 80 -j NOTRACK
```

To make the iptables configuration survive a reboot, run the command below; it persists the current rules to /etc/sysconfig/iptables:

```shell
service iptables save
```

Check the current connection-tracking rules:

```shell
iptables -t raw -vnL
```

If you are setting up a high-throughput machine, pay close attention to conntrack. Linux tuning guides cover the conntrack parameters, and there are plenty of write-ups online.
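For reference, here is a sketch of the kind of sysctl settings such guides discuss; the values are illustrative examples only and must be tuned to the workload:

```
# /etc/sysctl.conf fragment (illustrative values, not recommendations)
net.netfilter.nf_conntrack_max = 1048576
# Shorten how long closed/idle flows stay tracked
# (defaults: 120 s for TIME_WAIT, 5 days for ESTABLISHED)
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_established = 1200
```

Shorter timeouts trade earlier eviction of idle flows against the risk of dropping state for genuinely long-lived but quiet connections.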

As the article "When Linux conntrack is no longer your friend" explains, even with the conntrack table sized at 128k entries, 1,100 short-lived connections per second are enough to overflow it:

  • The most obvious case is if your server handles an extremely high number of simultaneously active connections. For example, if your conntrack table is configured to be 128k entries but you have >128k simultaneous connections, you’ll definitely hit issues!
  • The slightly less obvious case is if your server handles an extremely high number of connections per second. Even if the connections are short-lived, connections continue to be tracked by Linux for a short timeout period (120s by default). For example, if your conntrack table is configured to be 128k entries, and you are trying to handle 1,100 connections per second, that’s going to exceed the conntrack table size even if the connections are very short-lived (128k / 120s = 1092 connections/s).
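The second bullet's arithmetic can be checked directly (120 s is the default tracking timeout mentioned in the quote):

```shell
# Sustainable connection rate = table size / tracking timeout.
table_size=$(( 128 * 1024 ))  # 128k conntrack entries
timeout=120                   # default tracking timeout, seconds
echo "max sustainable short-lived connection rate: $(( table_size / timeout ))/s"
```

At our default of 65536 entries, the same formula gives only about 546 connections per second, which a burst of short-lived traffic can easily exceed.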