RSS hash for fragmented packets


I am using a Mellanox Technologies MT27800 Family [ConnectX-5] NIC with DPDK, multiple RX queues, and RSS set to "ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP".

I analyze traffic and need all packets of the same session to arrive at the same process (for now, a session can be identified by IP + port).

So packets that have the same IP + port arrive at the same queue.

But if some packets are IP fragmented, they arrive at a different process. This is a problem!

How can I calculate the hash value in C++ code, the same way the card does, so that I can reassemble the packets and send them to the same process as the non-fragmented packets?


There are 2 best solutions below

1
lukashino On

Instead of ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP, you can use only ETH_RSS_IP so that the RSS hash is calculated from the packet's IP addresses alone. This way, even if a packet is fragmented, all of its fragments arrive at the same CPU core.

The RSS value of a packet can also be calculated in software using the rte_thash library: https://doc.dpdk.org/api/rte__thash_8h.html While this option is possible, I would still recommend that you try the ETH_RSS_IP-only setting first.

When ETH_RSS_IP | ETH_RSS_UDP | ETH_RSS_TCP is enabled, the RSS function takes the IP addresses plus the source and destination ports to calculate the hash value. The ports are not present in every IP fragment, so the value computed for fragments cannot match the value computed for the non-fragmented packets of the same flow.
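To make the difference concrete, here is a minimal, DPDK-free sketch of the Toeplitz hash that RSS is based on. It is an illustration, not the NIC's or rte_thash's implementation: the names `kRssKey`, `toeplitz_hash`, `rss_hash_l3` and `rss_hash_l3l4` are made up for this example, and the key is the verification key published with the Microsoft RSS specification. It is an assumption that your port uses the same key; to match the card you would have to use the key actually configured on the port.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Verification key from the Microsoft RSS specification (assumption: replace
// it with the key configured on your port to reproduce the NIC's values).
static const std::array<std::uint8_t, 40> kRssKey = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67,
    0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb,
    0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30,
    0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa};

// Toeplitz hash: for every set input bit (MSB first), XOR in the 32-bit
// window of the key that starts at the same bit position.
std::uint32_t toeplitz_hash(const std::uint8_t* data, std::size_t len) {
    std::uint32_t hash = 0;
    std::uint32_t window = (std::uint32_t(kRssKey[0]) << 24) |
                           (std::uint32_t(kRssKey[1]) << 16) |
                           (std::uint32_t(kRssKey[2]) << 8) |
                           std::uint32_t(kRssKey[3]);
    for (std::size_t i = 0; i < len; ++i) {
        for (int bit = 7; bit >= 0; --bit) {
            if (data[i] & (1u << bit)) hash ^= window;
            // Slide the window one bit to the right along the key.
            window <<= 1;
            const std::size_t next = i + 4;
            if (next < kRssKey.size() && (kRssKey[next] & (1u << bit)))
                window |= 1u;
        }
    }
    return hash;
}

// Appends a value in network (big-endian) byte order.
static void put_be(std::vector<std::uint8_t>& out, std::uint32_t v, int bytes) {
    for (int shift = 8 * (bytes - 1); shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>(v >> shift));
}

// Hash over src IP + dst IP only (what an IP-only RSS setting covers).
std::uint32_t rss_hash_l3(std::uint32_t src_ip, std::uint32_t dst_ip) {
    std::vector<std::uint8_t> in;
    put_be(in, src_ip, 4);
    put_be(in, dst_ip, 4);
    return toeplitz_hash(in.data(), in.size());
}

// Hash over src IP + dst IP + src port + dst port (needs the L4 header).
std::uint32_t rss_hash_l3l4(std::uint32_t src_ip, std::uint32_t dst_ip,
                            std::uint16_t src_port, std::uint16_t dst_port) {
    std::vector<std::uint8_t> in;
    put_be(in, src_ip, 4);
    put_be(in, dst_ip, 4);
    put_be(in, src_port, 2);
    put_be(in, dst_port, 2);
    return toeplitz_hash(in.data(), in.size());
}
```

With the specification's test vector (source 66.9.149.187:2794, destination 161.142.100.80:1766), `rss_hash_l3` gives 0x323e8fc2 and `rss_hash_l3l4` gives 0x51ccc178. The two values differ, which is why a fragment hashed without ports ends up on a different queue than the non-fragmented packets of the same flow.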

You can either:

  • reassemble the IP fragments to form a complete IP packet and then compute the RSS value with the rte_thash library, or
  • compute the RSS value from the IP addresses only (use the ETH_RSS_IP setting alone).

As you are only load-balancing across CPU cores, I think the latter option suits your use case well enough.
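If you do go the software route, a computed hash still has to be turned into a queue number. The sketch below shows the usual idea of a redirection table (RETA) indexed by the low bits of the hash. The names `make_round_robin_reta` and `queue_for_hash` are hypothetical, and both the table size of 128 and the round-robin fill are assumptions; on a real port you would read the actual table back from the driver instead of guessing it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a redirection table filled round-robin over the RX queues.
// Assumption: this mirrors a common default; the real table may differ.
std::vector<std::uint16_t> make_round_robin_reta(std::size_t reta_size,
                                                 std::uint16_t nb_queues) {
    std::vector<std::uint16_t> reta(reta_size);
    for (std::size_t i = 0; i < reta_size; ++i)
        reta[i] = static_cast<std::uint16_t>(i % nb_queues);
    return reta;
}

// Picks the queue for a hash: the low-order bits of the hash index the table
// (for a power-of-two table size, modulo equals masking the low bits).
std::uint16_t queue_for_hash(const std::vector<std::uint16_t>& reta,
                             std::uint32_t hash) {
    return reta[hash % reta.size()];
}
```

For example, with a 128-entry table over 4 queues, the hash 0x323e8fc2 selects entry 66, which maps to queue 2.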

0
Vipin Varghese On

@Davidboo based on the question and the explanation in the comments, what you have described is:

  1. all packets of the same session must arrive at the same process (a session for now being IP + port), which means you are looking for a symmetric hash;
  2. some packets are IP fragmented and arrive at a different process, so you need packet reassembly before symmetric RSS;
  3. Mellanox Technologies MT27800 Family [ConnectX-5]: the current NIC does not support reassembly in the NIC (embedded switch).

Hence the actual question is what the right way is to solve the problem under these constraints. There are 3 solutions (1 HW and 2 SW).

  • option 1 (HW): Use a smart NIC or network appliance offload card that can ingress the traffic and reassemble fragmented packets before sending them to the host server.
  • option 2 (SW): Disable RSS and use a single RX queue. Check whether each packet is a fragment. If yes, reassemble it, and then use rte_distributor or rte_eventdev with atomic flows to spread traffic to the worker cores.
  • option 3 (SW): Disable RSS, but use n + 1 RX queues and n SW rings. By default all packets will be received on queue 0. Based on JHASH (for SW RSS), add rte_flow rules pinning the flows to queues 1 to n.

Assuming you cannot change the NIC, my recommendation is option 2 with eventdev, for the following reasons.

  1. It is much easier and simpler to let either a HW or SW DLB (eventdev) spread traffic across multiple cores.
  2. rte_distributor is similar to HW RSS in that static pinning will lead to congestion and packet drops.
  3. Unlike option 3, you need not maintain a stateful flow table to keep track of the flows.

How to achieve this:

  1. Use the DPDK skeleton example code to create the basic port initialization.
  2. Enable PTYPES in DPDK rte_ethdev to identify whether a packet is IP, non-IP, or IP fragmented without parsing the frame and payload.
  3. Check whether packets are fragmented by using RTE_ETH_IS_IPV4_HDR and rte_ipv4_frag_pkt_is_fragmented for IPv4, and RTE_ETH_IS_IPV6_HDR and rte_ipv6_frag_get_ipv6_fragment_header for IPv6 (refer to the DPDK ip_reassembly example).
  4. Extract SRC-IP, DST-IP, SRC-PORT, DST-PORT (if the packet is not SCTP, UDP or TCP, set the source and destination ports to 0) and hash them, for example with rte_hash_crc_2byte. Feeding the two endpoints in the same order for both directions keeps the hash symmetric (SW RSS).
  5. Then feed the packet with the hash value to the eventdev stage or the distributor stage (refer to the eventdev_pipeline or distributor examples).
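Steps 3 and 4 can be sketched without DPDK roughly as follows. This is only an illustration under stated assumptions: `FlowTuple`, `ipv4_is_fragment` and `symmetric_flow_hash` are hypothetical names, the fragment test mirrors the usual check on the IPv4 flags/offset field (more-fragments bit or a non-zero offset), and FNV-1a is swapped in for the CRC hash purely to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Hypothetical holder for the fields extracted in step 4 (host byte order).
struct FlowTuple {
    std::uint32_t src_ip;
    std::uint32_t dst_ip;
    std::uint16_t src_port;  // 0 when the packet is not SCTP, UDP or TCP
    std::uint16_t dst_port;  // 0 when the packet is not SCTP, UDP or TCP
};

// Step 3 (IPv4): a packet is a fragment if the "more fragments" flag is set
// or the fragment offset is non-zero. Input is the flags/offset field in
// host byte order.
bool ipv4_is_fragment(std::uint16_t frag_off_host) {
    const std::uint16_t kMoreFragments = 0x2000;
    const std::uint16_t kOffsetMask = 0x1fff;
    return (frag_off_host & (kMoreFragments | kOffsetMask)) != 0;
}

// FNV-1a over the low `bytes` bytes of a value (stand-in for a CRC hash).
static std::uint32_t fnv1a(std::uint32_t h, std::uint32_t v, int bytes) {
    for (int i = 0; i < bytes; ++i) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

// Step 4: order the two endpoints canonically before hashing so that both
// directions of a session produce the same value.
std::uint32_t symmetric_flow_hash(const FlowTuple& t) {
    std::pair<std::uint32_t, std::uint16_t> a(t.src_ip, t.src_port);
    std::pair<std::uint32_t, std::uint16_t> b(t.dst_ip, t.dst_port);
    if (b < a) std::swap(a, b);
    std::uint32_t h = 2166136261u;  // FNV-1a 32-bit offset basis
    h = fnv1a(h, a.first, 4);
    h = fnv1a(h, a.second, 2);
    h = fnv1a(h, b.first, 4);
    h = fnv1a(h, b.second, 2);
    return h;
}
```

A reassembled packet would go through `symmetric_flow_hash` exactly like a non-fragmented one, and the result (for instance taken modulo the number of workers, or used as the eventdev flow id) decides which worker handles the session.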

Note:

  • Each worker core will be running the business logic.
  • A Broadwell Xeon core can handle around 30 Mpps.
  • An Ice Lake Xeon core can handle around 100 Mpps.