Uneven DPDK RSS Queue Distribution Caused by IPIP Tunnels

Recently we ran into an issue online where a VM's NIC receive queues were unevenly loaded: even with the NIC queue interrupts evenly pinned across CPUs, one core would still run much hotter than the rest:

%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.9 hi,  0.9 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni, 97.5 id,  0.0 wa,  0.8 hi,  1.7 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.9 si,  0.0 st
%Cpu3  :  0.9 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.9 si,  0.0 st
%Cpu4  :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni, 97.4 id,  0.0 wa,  0.9 hi,  1.7 si,  0.0 st
%Cpu6  :  0.0 us,  0.0 sy,  0.0 ni, 97.4 id,  0.0 wa,  0.9 hi,  1.7 si,  0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni, 46.3 id,  0.0 wa,  3.4 hi, 50.3 si,  0.0 st
%Cpu9  :  0.0 us,  0.0 sy,  0.0 ni, 97.4 id,  0.0 wa,  0.9 hi,  1.7 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu11 :  0.0 us,  0.0 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.9 si,  0.0 st
%Cpu12 :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.9 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu15 :  0.0 us,  0.0 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st

Most cores are fairly even, but CPU 8 is clearly much busier than the others. After some investigation, this turned out to be related to the VM receiving traffic through an IPIP tunnel.

The user's setup has a machine at the network edge acting as a load balancer, which forwards requests to this VM over an IPIP tunnel. Since IPIP works by wrapping the original IP packet in another IP header, the RSS hash computation only sees the outer IP header. So even if the inner packets' 5-tuples are spread out perfectly, all of the tunneled traffic ends up on a single core.

For network flexibility, the VM's traffic is relayed through a DPDK program. Its logic is very simple: receive packets from NIC rx queue N and forward them to the VM's rx queue N. So if the VM's receive queues are unbalanced, the packets were already unevenly distributed when they arrived from the NIC.
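
For reference, the per-queue forwarding described above boils down to roughly the following loop (a minimal sketch only: nic_port, vm_port, and the assumption that the VM side is exposed as another DPDK ethdev port, e.g. vhost-user, are illustrative, not the actual production code):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Sketch of the 1:1 queue forwarding described above: packets received
 * on NIC rx queue q are sent unchanged to the VM-side port on tx queue q,
 * so the VM's queue distribution is exactly what the NIC's RSS produced. */
static void forward_queue(uint16_t nic_port, uint16_t vm_port, uint16_t q)
{
    struct rte_mbuf *bufs[BURST_SIZE];
    uint16_t nb_rx = rte_eth_rx_burst(nic_port, q, bufs, BURST_SIZE);
    if (nb_rx == 0)
        return;

    uint16_t nb_tx = rte_eth_tx_burst(vm_port, q, bufs, nb_rx);
    /* Free whatever the VM-side queue could not accept. */
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(bufs[i]);
}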

How do we fix this? One obvious approach is to recompute the hash in the DPDK program after receiving each packet: if the packet is an IPIP packet, hash on the inner IP header instead, use that hash to pick the VM receive queue, and forward the packet there. That way the traffic reaching the VM is balanced regardless of whether the NIC distributed it evenly. This is certainly very flexible (which is exactly why we added the DPDK layer instead of using NIC passthrough in the first place), but it clearly adds computation to the DPDK program and would cost a fair amount of performance.
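
A rough sketch of what that software path could look like is below (assumptions: the helper name, the fallback to the NIC-provided mbuf hash, and the Toeplitz key, which is just the widely used default key, are all illustrative; rte_softrss() from rte_thash.h computes the Toeplitz hash in software):

#include <netinet/in.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_tcp.h>
#include <rte_thash.h>
#include <rte_mbuf.h>

/* Widely used default 40-byte Toeplitz key, for illustration only. */
static const uint8_t rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Pick a VM rx queue for one packet: if it is IPIP, hash the inner 5-tuple
 * in software; otherwise fall back to the RSS hash the NIC wrote into the
 * mbuf (assuming the RSS offload is enabled). */
static uint16_t select_vm_queue(struct rte_mbuf *m, uint16_t nb_queues)
{
    struct rte_ether_hdr *eth = rte_pktmbuf_mtod(m, struct rte_ether_hdr *);

    if (eth->ether_type == rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) {
        struct rte_ipv4_hdr *outer = (struct rte_ipv4_hdr *)(eth + 1);
        if (outer->next_proto_id == IPPROTO_IPIP) {
            /* The inner IPv4 header follows the outer header directly. */
            uint8_t outer_len = (outer->version_ihl & RTE_IPV4_HDR_IHL_MASK) *
                                RTE_IPV4_IHL_MULTIPLIER;
            struct rte_ipv4_hdr *inner =
                (struct rte_ipv4_hdr *)((uint8_t *)outer + outer_len);
            uint8_t inner_len = (inner->version_ihl & RTE_IPV4_HDR_IHL_MASK) *
                                RTE_IPV4_IHL_MULTIPLIER;
            struct rte_tcp_hdr *l4 =
                (struct rte_tcp_hdr *)((uint8_t *)inner + inner_len);

            /* Toeplitz input: inner src/dst address plus src/dst port. */
            uint32_t tuple[3];
            tuple[0] = rte_be_to_cpu_32(inner->src_addr);
            tuple[1] = rte_be_to_cpu_32(inner->dst_addr);
            tuple[2] = ((uint32_t)rte_be_to_cpu_16(l4->src_port) << 16) |
                       rte_be_to_cpu_16(l4->dst_port);

            return rte_softrss(tuple, 3, rss_key) % nb_queues;
        }
    }
    return m->hash.rss % nb_queues;
}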

Could the NIC itself offer a "smarter" RSS algorithm for IPIP packets? Modern NICs are packed with features, so it is quite possible the hardware can do RSS on the tunnel's inner IP header directly. If it can be handled at the NIC level, that would be the ideal solution.

After talking with the NIC vendor, we confirmed that the NIC does support this feature. In fact, with the Linux kernel driver it is enabled by default, so with NIC passthrough this problem would never show up at all. But since we relay traffic through DPDK, this is not the default behavior; to enable RSS on the inner IP header of IPIP-tunneled packets, the following flow rule has to be installed on the NIC:

flow create 0 group 0 ingress pattern eth / ipv4 proto is 4 / ipv4 / tcp / end actions rss queues 0 1 2 3 4 end level 2 / end

In short, the rule matches packets with ipv4.proto == 4 (the IPIP tunnel protocol) and tells the NIC to compute RSS on the inner IP header (level 2), spreading them across queues 0 1 2 3 4.

With this rule in hand, we can try it out with testpmd:

testpmd> set fwd rxonly
Set rxonly packet forwarding mode
testpmd>
testpmd> start
rxonly packet forwarding - ports=1 - cores=1 - streams=16 - NUMA support enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 16 streams:
  RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=2 (socket 0) -> TX P=0/Q=2 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=3 (socket 0) -> TX P=0/Q=3 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=4 (socket 0) -> TX P=0/Q=4 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=5 (socket 0) -> TX P=0/Q=5 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=6 (socket 0) -> TX P=0/Q=6 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=7 (socket 0) -> TX P=0/Q=7 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=8 (socket 0) -> TX P=0/Q=8 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=9 (socket 0) -> TX P=0/Q=9 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=10 (socket 0) -> TX P=0/Q=10 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=11 (socket 0) -> TX P=0/Q=11 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=12 (socket 0) -> TX P=0/Q=12 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=13 (socket 0) -> TX P=0/Q=13 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=14 (socket 0) -> TX P=0/Q=14 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=15 (socket 0) -> TX P=0/Q=15 (socket 0) peer=02:00:00:00:00:00

  rxonly packet forwarding packets/burst=32
  nb forwarding cores=1 - nb forwarding ports=1
  port 0: RX queue number: 16 Tx queue number: 16
    Rx offloads=0x0 Tx offloads=0x10000
    RX queue: 0
      RX desc=4096 - RX free threshold=64
      RX threshold registers: pthresh=0 hthresh=0  wthresh=0
      RX Offloads=0x0
    TX queue: 0
      TX desc=4096 - TX free threshold=0
      TX threshold registers: pthresh=0 hthresh=0  wthresh=0
      TX offloads=0x10000 - TX RS bit threshold=0
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...

  ------- Forward Stats for RX Port= 0/Queue=12 -> TX Port= 0/Queue=12 -------
  RX-packets: 139389         TX-packets: 0              TX-dropped: 0

  ---------------------- Forward statistics for port 0  ----------------------
  RX-packets: 139389         RX-dropped: 0             RX-total: 139389
  TX-packets: 0              TX-dropped: 0             TX-total: 0
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
  RX-packets: 139389         RX-dropped: 0             RX-total: 139389
  TX-packets: 0              TX-dropped: 0             TX-total: 0
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Done.

As shown above, by default all packets land on Queue=12, so the default RSS behavior clearly has the problem. Next, let's install the flow rule and try again:

testpmd> flow create 0 group 0 ingress pattern eth / ipv4 proto is 4 / ipv4 / tcp / end actions rss queues 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 end level 2 / end
Flow rule #0 created
testpmd> start
rxonly packet forwarding - ports=1 - cores=1 - streams=16 - NUMA support enabled, MP allocation mode: native
Logical Core 1 (socket 0) forwards packets on 16 streams:
  RX P=0/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=1 (socket 0) -> TX P=0/Q=1 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=2 (socket 0) -> TX P=0/Q=2 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=3 (socket 0) -> TX P=0/Q=3 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=4 (socket 0) -> TX P=0/Q=4 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=5 (socket 0) -> TX P=0/Q=5 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=6 (socket 0) -> TX P=0/Q=6 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=7 (socket 0) -> TX P=0/Q=7 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=8 (socket 0) -> TX P=0/Q=8 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=9 (socket 0) -> TX P=0/Q=9 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=10 (socket 0) -> TX P=0/Q=10 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=11 (socket 0) -> TX P=0/Q=11 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=12 (socket 0) -> TX P=0/Q=12 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=13 (socket 0) -> TX P=0/Q=13 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=14 (socket 0) -> TX P=0/Q=14 (socket 0) peer=02:00:00:00:00:00
  RX P=0/Q=15 (socket 0) -> TX P=0/Q=15 (socket 0) peer=02:00:00:00:00:00

  rxonly packet forwarding packets/burst=32
  nb forwarding cores=1 - nb forwarding ports=1
  port 0: RX queue number: 16 Tx queue number: 16
    Rx offloads=0x0 Tx offloads=0x10000
    RX queue: 0
      RX desc=4096 - RX free threshold=64
      RX threshold registers: pthresh=0 hthresh=0  wthresh=0
      RX Offloads=0x0
    TX queue: 0
      TX desc=4096 - TX free threshold=0
      TX threshold registers: pthresh=0 hthresh=0  wthresh=0
      TX offloads=0x10000 - TX RS bit threshold=0
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...

  ------- Forward Stats for RX Port= 0/Queue= 0 -> TX Port= 0/Queue= 0 -------
  RX-packets: 6001           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 1 -> TX Port= 0/Queue= 1 -------
  RX-packets: 5894           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 2 -> TX Port= 0/Queue= 2 -------
  RX-packets: 5931           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 3 -> TX Port= 0/Queue= 3 -------
  RX-packets: 5759           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 4 -> TX Port= 0/Queue= 4 -------
  RX-packets: 5821           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 5 -> TX Port= 0/Queue= 5 -------
  RX-packets: 5787           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 6 -> TX Port= 0/Queue= 6 -------
  RX-packets: 5893           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 7 -> TX Port= 0/Queue= 7 -------
  RX-packets: 5909           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 8 -> TX Port= 0/Queue= 8 -------
  RX-packets: 6013           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue= 9 -> TX Port= 0/Queue= 9 -------
  RX-packets: 5956           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue=10 -> TX Port= 0/Queue=10 -------
  RX-packets: 5735           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue=11 -> TX Port= 0/Queue=11 -------
  RX-packets: 5885           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue=12 -> TX Port= 0/Queue=12 -------
  RX-packets: 5771           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue=13 -> TX Port= 0/Queue=13 -------
  RX-packets: 5878           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue=14 -> TX Port= 0/Queue=14 -------
  RX-packets: 5844           TX-packets: 0              TX-dropped: 0

  ------- Forward Stats for RX Port= 0/Queue=15 -> TX Port= 0/Queue=15 -------
  RX-packets: 5930           TX-packets: 0              TX-dropped: 0

  ---------------------- Forward statistics for port 0  ----------------------
  RX-packets: 94007          RX-dropped: 0             RX-total: 94007
  TX-packets: 0              TX-dropped: 0             TX-total: 0
  ----------------------------------------------------------------------------

  +++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
  RX-packets: 94007          RX-dropped: 0             RX-total: 94007
  TX-packets: 0              TX-dropped: 0             TX-total: 0
  ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Done.

With the flow rule installed, traffic is distributed fairly evenly across all 16 queues (0-15), so the NIC feature works as expected.

What remains is to integrate this rule into the forwarding logic in code:

#include <netinet/in.h>
#include <rte_flow.h>

static int create_ipip_rss_flow(dpdk_port_t port_id) {
    // flow create 0 group 0 ingress pattern eth / ipv4 proto is 4 / ipv4 / end actions rss queues 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 end level 2 / end
    struct rte_flow_error error;
    struct rte_flow *flow;
    struct rte_flow_attr flow_attr = {
        .ingress = 1,
    };

    /* Spread matched packets across rx queues 0-15. */
    uint16_t queue_list[16];
    for (int i = 0; i < 16; i++) {
        queue_list[i] = i;
    }

    /* Match eth / outer ipv4 with proto == 4 (IPIP) / inner ipv4. */
    struct rte_flow_item patterns[] = {
        {
            .type = RTE_FLOW_ITEM_TYPE_ETH,
        },
        {
            .type = RTE_FLOW_ITEM_TYPE_IPV4,
            .spec = &(struct rte_flow_item_ipv4){
                .hdr.next_proto_id = IPPROTO_IPIP,
            },
            .mask = &(struct rte_flow_item_ipv4){
                .hdr.next_proto_id = 0xFF,
            },
        },
        {
            .type = RTE_FLOW_ITEM_TYPE_IPV4,
        },
        {
            .type = RTE_FLOW_ITEM_TYPE_END,
        },
    };

    /* RSS on the inner headers (level = 2), distributing to the 16 queues. */
    struct rte_flow_action actions[] = {
        {
            .type = RTE_FLOW_ACTION_TYPE_RSS,
            .conf = &(struct rte_flow_action_rss){
                .queue_num = 16,
                .queue = queue_list,
                .level = 2,
            },
        },
        {
            .type = RTE_FLOW_ACTION_TYPE_END,
        },
    };

    flow = rte_flow_create(port_id, &flow_attr, patterns, actions, &error);
    if (!flow) {
        log_error("Failed to create ipip_rss_flow: %s", error.message);
        return -1;
    }
    return 0;
}
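
Installing the rule is then just a matter of calling this once per NIC port after the port has been started, for example along these lines (an illustrative sketch only, assuming dpdk_port_t is compatible with the standard uint16_t port id):

/* Illustrative sketch: install the rule on every available port after
 * rte_eth_dev_start(); in practice only the uplink port(s) carrying
 * tunneled traffic need it. */
static int setup_ipip_rss(void)
{
    uint16_t port_id;

    RTE_ETH_FOREACH_DEV(port_id) {
        if (create_ipip_rss_flow(port_id) != 0)
            return -1;
    }
    return 0;
}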

During actual development we tweaked the flow rule slightly: instead of matching tcp/udp, it only matches down to the ipv4 layer, so the same rule covers both TCP and UDP.

Finally, many thanks to GitHub Copilot and Cursor for their great help during development!
