Recently we ran into an issue in production where a VM's NIC receive queues were loaded unevenly: even with the NIC queue interrupts bound evenly across the CPUs, one core would still run far hotter than the rest:
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.9 hi, 0.9 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 97.5 id, 0.0 wa, 0.8 hi, 1.7 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni, 99.1 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
%Cpu3 : 0.9 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
%Cpu4 : 0.0 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu5 : 0.0 us, 0.0 sy, 0.0 ni, 97.4 id, 0.0 wa, 0.9 hi, 1.7 si, 0.0 st
%Cpu6 : 0.0 us, 0.0 sy, 0.0 ni, 97.4 id, 0.0 wa, 0.9 hi, 1.7 si, 0.0 st
%Cpu7 : 0.0 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu8 : 0.0 us, 0.0 sy, 0.0 ni, 46.3 id, 0.0 wa, 3.4 hi, 50.3 si, 0.0 st
%Cpu9 : 0.0 us, 0.0 sy, 0.0 ni, 97.4 id, 0.0 wa, 0.9 hi, 1.7 si, 0.0 st
%Cpu10 : 0.0 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu11 : 0.0 us, 0.0 sy, 0.0 ni, 99.1 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
%Cpu12 : 0.0 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu13 : 0.0 us, 0.0 sy, 0.0 ni, 99.1 id, 0.0 wa, 0.0 hi, 0.9 si, 0.0 st
%Cpu14 : 0.0 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
%Cpu15 : 0.0 us, 0.0 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 1.7 si, 0.0 st
You can see that most of the other cores are fairly even, but CPU8 is indeed much busier than the rest. After some investigation, it turned out to be related to the VM receiving traffic over an IPIP tunnel.
In the user's setup, a machine at the network entry point acts as a load balancer and forwards requests to this VM over an IPIP tunnel. Because IPIP works by wrapping the original IP packet in another IP header, the hash that RSS computes only sees the outer IPs. So even if the inner packets' 5-tuples are spread perfectly evenly, all of the tunnel traffic still ends up on a single core.
For flexibility, the VM's network traffic is forwarded through a DPDK program. Its logic is very simple: receive packets from the NIC's rx queue N and deliver them to the VM's rx queue N. So if the VM's receive queues are imbalanced, the packets were already unevenly distributed when they came off the NIC.
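To make that 1:1 mapping concrete, here is a rough sketch of the pass-through loop, not the actual program: struct vm_ctx, vm_enqueue_burst() and forward_queue_once() are hypothetical stand-ins for the real VM-side delivery.
#include <rte_ethdev.h>
#include <rte_mbuf.h>
struct vm_ctx;  // hypothetical VM handle
uint16_t vm_enqueue_burst(struct vm_ctx *vm, uint16_t queue_id,
                          struct rte_mbuf **pkts, uint16_t n);  // hypothetical VM-side enqueue
static void forward_queue_once(uint16_t nic_port, uint16_t queue_id, struct vm_ctx *vm) {
    struct rte_mbuf *pkts[32];
    // Read a burst from NIC rx queue N ...
    uint16_t nb = rte_eth_rx_burst(nic_port, queue_id, pkts, 32);
    // ... and hand it, unchanged, to VM rx queue N: whatever skew the NIC's
    // RSS produced is reproduced exactly on the VM side.
    if (nb > 0)
        vm_enqueue_burst(vm, queue_id, pkts, nb);
}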
So how do we solve this? One obvious idea is to recompute the packet hash after the DPDK program receives it: if the packet turns out to be IPIP, hash on the inner IP header instead, use that hash to pick the VM's receive queue, and forward the packet there. That way the traffic reaching the VM is balanced regardless of whether it was balanced coming off the NIC. This is certainly very flexible (which is also why we added the DPDK layer instead of using plain NIC passthrough in the first place), but it clearly increases the DPDK program's workload and would cost a fair amount of performance.
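Here is a minimal sketch of that software fallback, assuming untagged IPv4 frames and using DPDK's software Toeplitz helper rte_softrss(); pick_vm_queue(), vm_queue_count and the hash key are illustrative, not the production code:
#include <netinet/in.h>
#include <rte_byteorder.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_mbuf.h>
#include <rte_thash.h>

// Any 40-byte Toeplitz key is enough for this illustration.
static const uint8_t soft_rss_key[40] = { [0 ... 39] = 0x6d };

// Decide which VM rx queue a packet should go to: for IPIP packets,
// hash on the inner IPv4 addresses so tunnelled flows spread out.
static uint16_t pick_vm_queue(struct rte_mbuf *m, uint16_t vm_queue_count) {
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);
    struct rte_ipv4_hdr *outer = (struct rte_ipv4_hdr *)(eth + 1);
    const struct rte_ipv4_hdr *ip = outer;

    if (outer->next_proto_id == IPPROTO_IPIP) {
        // Skip the outer header (IHL is in 32-bit words) to reach the inner one.
        ip = (const struct rte_ipv4_hdr *)
             ((const uint8_t *)outer + (outer->version_ihl & 0x0f) * 4);
    }

    uint32_t tuple[2] = {
        rte_be_to_cpu_32(ip->src_addr),
        rte_be_to_cpu_32(ip->dst_addr),
    };
    uint32_t hash = rte_softrss(tuple, 2, soft_rss_key);
    return (uint16_t)(hash % vm_queue_count);
}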
Could the NIC itself provide a "smarter" RSS algorithm for IPIP packets? Modern NICs are packed with features, so it seemed quite likely that the hardware could do RSS based on the tunnel's inner IP header directly. If it can be done at the NIC level, that is the ideal solution.
After talking to the NIC vendor, we confirmed that the hardware does support this, and with the Linux kernel driver it is even enabled by default. In other words, with plain NIC passthrough this problem would never show up. But since we forward traffic through DPDK, the behavior is not enabled by default; to turn on RSS based on the inner IP header of IPIP tunnels, a flow rule like this has to be installed on the NIC:
flow create 0 group 0 ingress pattern eth / ipv4 proto is 4 / ipv4 / tcp / end actions rss queues 0 1 2 3 4 end level 2 / end
Roughly translated: match packets with ipv4.proto == 4 (the IPIP tunnel protocol), have the NIC compute RSS on the inner IP header, and distribute them across queues 0 1 2 3 4.
With this rule in hand, we can try it out with testpmd:
testpmd> set fwd rxonly
Set rxonly packet forwarding mode
testpmd>
testpmd> start
rxonly packet forwarding - ports=1 - cores=1 - streams=16 - NUMA support enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 16 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=2 (socket 0) -> TX P=0/Q=2 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=3 (socket 0) -> TX P=0/Q=3 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=4 (socket 0) -> TX P=0/Q=4 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=5 (socket 0) -> TX P=0/Q=5 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=6 (socket 0) -> TX P=0/Q=6 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=7 (socket 0) -> TX P=0/Q=7 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=8 (socket 0) -> TX P=0/Q=8 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=9 (socket 0) -> TX P=0/Q=9 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=10 (socket 0) -> TX P=0/Q=10 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=11 (socket 0) -> TX P=0/Q=11 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=12 (socket 0) -> TX P=0/Q=12 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=13 (socket 0) -> TX P=0/Q=13 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=14 (socket 0) -> TX P=0/Q=14 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=15 (socket 0) -> TX P=0/Q=15 (socket 0) peer=02:00:00:00:00:00
rxonly packet forwarding packets/burst=32
nb forwarding cores=1 - nb forwarding ports=1
port 0: RX queue number: 16 Tx queue number: 16
Rx offloads=0x0 Tx offloads=0x10000
RX queue: 0
RX desc=4096 - RX free threshold=64
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x0
TX queue: 0
TX desc=4096 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x10000 - TX RS bit threshold=0
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...
------- Forward Stats for RX Port= 0/Queue=12 -> TX Port= 0/Queue=12 -------
RX-packets: 139389 TX-packets: 0 TX-dropped: 0
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 139389 RX-dropped: 0 RX-total: 139389
TX-packets: 0 TX-dropped: 0 TX-total: 0
----------------------------------------------------------------------------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 139389 RX-dropped: 0 RX-total: 139389
TX-packets: 0 TX-dropped: 0 TX-total: 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Done.
As you can see, by default all of the packets end up in Queue=12. The default RSS behavior clearly has a problem here, so let's try again after installing the flow rule:
testpmd> flow create 0 group 0 ingress pattern eth / ipv4 proto is 4 / ipv4 / tcp / end actions rss queues 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 end level 2 / end
Flow rule #0 created
testpmd> start
rxonly packet forwarding - ports=1 - cores=1 - streams=16 - NUMA support enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 16 streams:
RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=2 (socket 0) -> TX P=0/Q=2 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=3 (socket 0) -> TX P=0/Q=3 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=4 (socket 0) -> TX P=0/Q=4 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=5 (socket 0) -> TX P=0/Q=5 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=6 (socket 0) -> TX P=0/Q=6 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=7 (socket 0) -> TX P=0/Q=7 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=8 (socket 0) -> TX P=0/Q=8 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=9 (socket 0) -> TX P=0/Q=9 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=10 (socket 0) -> TX P=0/Q=10 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=11 (socket 0) -> TX P=0/Q=11 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=12 (socket 0) -> TX P=0/Q=12 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=13 (socket 0) -> TX P=0/Q=13 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=14 (socket 0) -> TX P=0/Q=14 (socket 0) peer=02:00:00:00:00:00
RX P=0/Q=15 (socket 0) -> TX P=0/Q=15 (socket 0) peer=02:00:00:00:00:00
rxonly packet forwarding packets/burst=32
nb forwarding cores=1 - nb forwarding ports=1
port 0: RX queue number: 16 Tx queue number: 16
Rx offloads=0x0 Tx offloads=0x10000
RX queue: 0
RX desc=4096 - RX free threshold=64
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
RX Offloads=0x0
TX queue: 0
TX desc=4096 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX offloads=0x10000 - TX RS bit threshold=0
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...
------- Forward Stats for RX Port= 0/Queue= 0 -> TX Port= 0/Queue= 0 -------
RX-packets: 6001 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 1 -> TX Port= 0/Queue= 1 -------
RX-packets: 5894 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 2 -> TX Port= 0/Queue= 2 -------
RX-packets: 5931 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 3 -> TX Port= 0/Queue= 3 -------
RX-packets: 5759 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 4 -> TX Port= 0/Queue= 4 -------
RX-packets: 5821 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 5 -> TX Port= 0/Queue= 5 -------
RX-packets: 5787 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 6 -> TX Port= 0/Queue= 6 -------
RX-packets: 5893 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 7 -> TX Port= 0/Queue= 7 -------
RX-packets: 5909 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 8 -> TX Port= 0/Queue= 8 -------
RX-packets: 6013 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue= 9 -> TX Port= 0/Queue= 9 -------
RX-packets: 5956 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue=10 -> TX Port= 0/Queue=10 -------
RX-packets: 5735 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue=11 -> TX Port= 0/Queue=11 -------
RX-packets: 5885 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue=12 -> TX Port= 0/Queue=12 -------
RX-packets: 5771 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue=13 -> TX Port= 0/Queue=13 -------
RX-packets: 5878 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue=14 -> TX Port= 0/Queue=14 -------
RX-packets: 5844 TX-packets: 0 TX-dropped: 0
------- Forward Stats for RX Port= 0/Queue=15 -> TX Port= 0/Queue=15 -------
RX-packets: 5930 TX-packets: 0 TX-dropped: 0
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 94007 RX-dropped: 0 RX-total: 94007
TX-packets: 0 TX-dropped: 0 TX-total: 0
----------------------------------------------------------------------------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 94007 RX-dropped: 0 RX-total: 94007
TX-packets: 0 TX-dropped: 0 TX-total: 0
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Done.
After installing the flow rule, the traffic is spread fairly evenly across all 16 queues (0-15), which confirms that the NIC feature works. What remains is to install this rule from code and integrate it into the forwarding logic:
static int create_ipip_rss_flow(dpdk_port_t port_id) {
    // Equivalent testpmd rule:
    // flow create 0 group 0 ingress pattern eth / ipv4 proto is 4 / ipv4 / end actions rss queues 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 end level 2 / end
    struct rte_flow_error error;
    struct rte_flow *flow;
    struct rte_flow_attr flow_attr = {
        .ingress = 1,
    };

    // Spread the matched traffic across all 16 rx queues.
    uint16_t queue_list[16];
    for (int i = 0; i < 16; i++) {
        queue_list[i] = i;
    }

    // Pattern: eth / outer ipv4 with next_proto_id == 4 (IPIP) / inner ipv4.
    struct rte_flow_item patterns[] = {
        {
            .type = RTE_FLOW_ITEM_TYPE_ETH,
        },
        {
            .type = RTE_FLOW_ITEM_TYPE_IPV4,
            .spec = &(struct rte_flow_item_ipv4){
                .hdr.next_proto_id = IPPROTO_IPIP,
            },
            .mask = &(struct rte_flow_item_ipv4){
                .hdr.next_proto_id = 0xFF,
            },
        },
        {
            .type = RTE_FLOW_ITEM_TYPE_IPV4,
        },
        {
            .type = RTE_FLOW_ITEM_TYPE_END,
        },
    };

    // Action: RSS with level = 2, i.e. hash on the inner (encapsulated) header.
    struct rte_flow_action actions[] = {
        {
            .type = RTE_FLOW_ACTION_TYPE_RSS,
            .conf = &(struct rte_flow_action_rss){
                .queue_num = 16,
                .queue = queue_list,
                .level = 2,
            },
        },
        {
            .type = RTE_FLOW_ACTION_TYPE_END,
        },
    };

    flow = rte_flow_create(port_id, &flow_attr, patterns, actions, &error);
    if (!flow) {
        log_error("Failed to create ipip_rss_flow: %s", error.message);
        return -1;
    }
    return 0;
}
During actual development we tweaked the flow rule slightly: instead of matching the tcp/udp layer, it only matches down to the ipv4 layer, so both TCP and UDP are covered at once.
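For reference, a hypothetical call site could look like the following, assuming the rule is installed once during port bring-up, after the rx queues are configured and the port has been started (some PMDs also accept flow rules before start):
    // Hypothetical bring-up sequence: rte_eth_dev_configure() and the 16
    // rte_eth_rx_queue_setup() calls have already been made at this point.
    if (rte_eth_dev_start(port_id) != 0) {
        log_error("Failed to start port %u", port_id);
        return -1;
    }
    // Without this rule the NIC hashes IPIP packets on the outer header only.
    if (create_ipip_rss_flow(port_id) != 0)
        return -1;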
Finally, a big thank-you to GitHub Copilot and Cursor for the enormous help during development!