Calico + ipvs/strict_arp kube-proxy + Metallb l2 mode multi interface problem


I am experimenting with MetalLB in L2 mode and iptables routing on an Ubuntu 22.04 system with two interfaces.

I have ens160 (on all nodes, master and worker) for all local traffic and ens192 (only on my worker), through which MetalLB has access to my public IP network. I configured MetalLB to use only the worker nodes where ens192 is available. Ubuntu 22.04 uses netplan by default, so I used it to set up a few routing rules for the interface ens192.

The interface ens192 has no IP address configured directly. According to the MetalLB and kube-proxy documentation, running kube-proxy in IPVS mode with strict ARP is the way this should work, and the IPs should be announced via ARP. As ingress I am using nginx, which successfully gets an IP assigned by MetalLB. When I check the dummy interface kube-ipvs0, I can see the assigned IP address.
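For reference, the proxy mode and strict ARP setting can be read back from the kube-proxy configuration. This is only a minimal sketch: it assumes a kubeadm-style cluster where the configuration lives in a ConfigMap named kube-proxy in the kube-system namespace (not shown in this post), and kube_proxy_arp_settings is a hypothetical helper name.

```shell
# Hypothetical helper: print the "mode" and "strictARP" lines found in a
# kube-proxy configuration read from stdin, with leading indentation removed.
kube_proxy_arp_settings() {
  grep -E '^[[:space:]]*(strictARP|mode):' | sed -e 's/^[[:space:]]*//'
}

# Usage on a cluster (assumption: ConfigMap kube-proxy in namespace kube-system):
#   kubectl -n kube-system get configmap kube-proxy -o yaml | kube_proxy_arp_settings
```

With strict ARP enabled in IPVS mode, the output should include a line reading strictARP: true.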

kubectl -n nginx-ingress get svc
NAME                                 TYPE           CLUSTER-IP      EXTERNAL-IP     PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.249.73.29    1.2.3.17   80:30311/TCP,443:31050/TCP   21d
ingress-nginx-controller-admission   ClusterIP      10.247.82.117   <none>          443/TCP                      21d

From inside the cluster I can access the service, but not from outside: the connection times out.

My routing rules, set with netplan, are as follows:


network:
  version: 2
  renderer: networkd
  ethernets:
    ens160:
      dhcp4: no
      dhcp6: no
      addresses:
        - 172.31.16.20/24
      routes:
      - to: default
        via: 172.31.16.254
      nameservers:
        addresses:
          - 10.2.2.2
          - 10.7.2.2
        search:
          - esrv.local
    ens192:
      dhcp4: no
      dhcp6: no
      routing-policy:
      - from: 1.2.3.16/28
        table: 1019
        priority: 100
      - from: 1.2.3.16/28
        to: 192.168.0.0/16
        priority: 99
      routes:
      - to: default        
        via: 1.2.3.30
        table: 1019
      - to: 1.2.3.16/28
        table: 1019
      - to: 1.2.3.16/28

Route information:

ip rule show
0: from all lookup local
99: from 1.2.3.16/28 to 192.168.0.0/16 lookup main proto static
100: from 1.2.3.16/28 lookup 1019 proto static
32766: from all lookup main
32767: from all lookup default
ip route list
default via 172.31.16.254 dev ens160 proto static
172.31.16.0/24 dev ens160 proto kernel scope link src 172.31.16.20
192.168.135.64/26 via 192.168.135.65 dev vxlan.calico onlink
blackhole 192.168.177.192/26 proto 80
192.168.177.232 dev calid7e72cc188e scope link
192.168.177.233 dev cali3542ba50312 scope link
192.168.177.234 dev cali101d1e0fb1d scope link
1.2.3.16/28 dev ens192 proto static scope link
ip route list table 1019
default via 1.2.3.30 dev ens192 proto static onlink
1.2.3.16/28 dev ens192 proto static scope link
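
To cross-check which policy rules actually point at a given table, the rule list can be filtered by its lookup target. A small sketch; rules_for_table is a hypothetical helper name, and it only parses text in the format shown above.

```shell
# Hypothetical helper: from "ip rule show" output on stdin, print only the
# rules whose "lookup" target equals the table given as $1 (name or number).
rules_for_table() {
  awk -v t="$1" '{
    for (i = 1; i < NF; i++)
      if ($i == "lookup" && $(i + 1) == t) print
  }'
}

# Usage on the node:
#   ip rule show | rules_for_table 1019
#   ip route list table 1019
```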

When I remove the rule "100: from 1.2.3.16/28 lookup 1019 proto static", I can see that the reply traffic gets routed out through ens160, which is expected in this case because of the default route.


tcpdump -n -e -q -vvvvv -i any port 80

tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:05:07.961685 ens192 In ifindex 3 70:70:8b:1d:6a:bf (tos 0x0, ttl 58, id 63476, offset 0, flags [DF], proto TCP (6), length 60)
[CLIENT PUB IP].10400 > 1.2.3.17.80: tcp 0
12:05:07.961967 cali3542ba50312 Out ifindex 6 ee:ee:ee:ee:ee:ee (tos 0x0, ttl 57, id 63476, offset 0, flags [DF], proto TCP (6), length 60)
172.31.16.20.14633 > 192.168.177.233.80: tcp 0
12:05:07.962018 cali3542ba50312 In ifindex 6 e6:d1:f8:03:b9:b7 (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
192.168.177.233.80 > 172.31.16.20.14633: tcp 0
12:05:07.962062 ens160 Out ifindex 2 00:50:56:a6:1e:38 (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60)
1.2.3.17.80 > [CLIENT PUB IP].10400: tcp 0
12:05:07.962290 ens160 In ifindex 2 54:75:d0:5b:10:fc (tos 0x0, ttl 255, id 43249, offset 0, flags [none], proto TCP (6), length 40)
[CLIENT PUB IP].10400 > 1.2.3.17.80: tcp 0
12:05:07.962344 cali3542ba50312 Out ifindex 6 ee:ee:ee:ee:ee:ee (tos 0x0, ttl 254, id 43249, offset 0, flags [none], proto TCP (6), length 40)
172.31.16.20.14633 > 192.168.177.233.80: tcp 0
^C
6 packets captured
8 packets received by filter
0 packets dropped by kernel

But when I add the rule "100: from 1.2.3.16/28 lookup 1019 proto static" again, it seems to use routing table 1019, but I can't see the traffic being routed out.

tcpdump -n -e -q -vvvvv -i any port 80

tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:08:26.444843 ens192 In ifindex 3 70:70:8b:1d:6a:bf (tos 0x0, ttl 58, id 20253, offset 0, flags [DF], proto TCP (6), length 60)
[CLIENT PUB IP].10400 > 1.2.3.17.80: tcp 0
12:08:26.444975 cali3542ba50312 Out ifindex 6 ee:ee:ee:ee:ee:ee (tos 0x0, ttl 57, id 20253, offset 0, flags [DF], proto TCP (6), length 60)
172.31.16.20.38026 > 192.168.177.233.80: tcp 0
12:08:26.445009 cali3542ba50312 In ifindex 6 e6:d1:f8:03:b9:b7 (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
192.168.177.233.80 > 172.31.16.20.38026: tcp 0
12:08:27.467228 cali3542ba50312 In ifindex 6 e6:d1:f8:03:b9:b7 (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
192.168.177.233.80 > 172.31.16.20.38026: tcp 0
12:08:27.492653 ens192 In ifindex 3 70:70:8b:1d:6a:bf (tos 0x0, ttl 58, id 20254, offset 0, flags [DF], proto TCP (6), length 60)
[CLIENT PUB IP].10400 > 1.2.3.17.80: tcp 0
12:08:27.492742 cali3542ba50312 Out ifindex 6 ee:ee:ee:ee:ee:ee (tos 0x0, ttl 57, id 20254, offset 0, flags [DF], proto TCP (6), length 60)
172.31.16.20.38026 > 192.168.177.233.80: tcp 0
12:08:27.492773 cali3542ba50312 In ifindex 6 e6:d1:f8:03:b9:b7 (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
192.168.177.233.80 > 172.31.16.20.38026: tcp 0
^C
7 packets captured
9 packets received by filter
0 packets dropped by kernel
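
Because the capture above filters on port 80, it would not show ARP traffic on ens192. One thing that may be worth checking (an assumption, not a confirmed cause) is how the node resolves the gateway 1.2.3.30 on an interface without its own IP address, given that kube-proxy's strict ARP option is documented as setting arp_ignore to 1 and arp_announce to 2. The sketch below prints the relevant sysctls; show_arp_sysctls is a hypothetical helper name.

```shell
# Hypothetical helper: print arp_ignore, arp_announce and rp_filter for "all"
# and for one interface.
# $1 = interface name, $2 = optional sysctl base directory (defaults to /proc).
show_arp_sysctls() {
  iface="$1"
  base="${2:-/proc/sys/net/ipv4/conf}"
  for scope in all "$iface"; do
    for key in arp_ignore arp_announce rp_filter; do
      file="$base/$scope/$key"
      # Skip entries that do not exist or are not readable.
      if [ -r "$file" ]; then
        printf '%s.%s=%s\n' "$scope" "$key" "$(cat "$file")"
      fi
    done
  done
}

# Usage on the worker node (replace <client-ip> with the external client address):
#   show_arp_sysctls ens192
#   ip route get <client-ip> from 1.2.3.17   # which table and device the reply would use
#   tcpdump -n -e -i ens192 arp              # whether ARP requests for 1.2.3.30 get answered
```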

IP Info:

2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:a6:1e:38 brd ff:ff:ff:ff:ff:ff
altname enp3s0
inet 172.31.16.20/24 brd 172.31.16.255 scope global ens160
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fea6:1e38/64 scope link
valid_lft forever preferred_lft forever
3: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:a6:8d:f5 brd ff:ff:ff:ff:ff:ff
altname enp11s0
inet6 fe80::250:56ff:fea6:8df5/64 scope link
valid_lft forever preferred_lft forever
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
inet 1.2.3.17/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever

Environment:

Ubuntu 22.04 with kernel 5.15.0-76-generic
Kubernetes: 1.26.5
Calico cluster: v3.25.0
Metallb: 0.13.10
Kube Proxy in ipvs mode with strict arp

Calico config:

helm install calico projectcalico/tigera-operator --version v3.25.0 -f calico-config.yaml --namespace tigera-operator

---
installation:
  cni:
    type: Calico
  calicoNetwork:
    bgp: Disabled
    ipPools:
    - cidr: 192.168.0.0/16
      encapsulation: VXLAN

Metallb was installed using helm with default parameters. Metallb config:

 cat metallb-namespace.yml 
apiVersion: v1
kind: Namespace
metadata:
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
  name: metallb
cat metallb-crds.yml 
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: isp-vlan1086-ipp
spec:
  addresses:
  - 1.2.3.17 - 1.2.3.27
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: isp-vlan1086-adv
spec:
  ipAddressPools:
  - isp-vlan1086-ipp
  nodeSelectors:
  - matchLabels:
      kubernetes.io/hostname: itsrv4635.esrv.local
  interfaces:
  - ens192

I tried to follow this article with no luck: https://itnext.io/configuring-routing-for-metallb-in-l2-mode-7ea26e19219e

I hope somebody has a clue what is going on here. I have been struggling with this issue for weeks and don't know what I am missing.

Over the last few weeks I worked through more than 20 different threads and GitHub issues with no luck. The most important ones, I guess: https://github.com/projectcalico/calico/issues/6789 https://github.com/metallb/metallb/issues/610

In addition, I went through an article that describes how the routing should be set up: https://itnext.io/configuring-routing-for-metallb-in-l2-mode-7ea26e19219e

I began with RHEL 9, which had problems with Rook Ceph. I then changed to RHEL 8, on which I had no luck with routing, and ended up on Ubuntu 22.04, where I currently have no luck either.

EDIT: I switched from Calico to flannel, applied source-based routing, and can now see that the traffic gets stuck after cni0:

09:12:50.867851 ens192 In  ifindex 3 70:70:8b:1d:6a:bf (tos 0x0, ttl 59, id 54409, offset 0, flags [DF], proto TCP (6), length 60)
    [Client PUB IP].54660 > 1.2.3.17.80: tcp 0
09:12:50.868209 cni0  Out ifindex 6 6a:2a:26:51:8c:94 (tos 0x0, ttl 58, id 54409, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.2.1.60619 > 192.168.2.10.80: tcp 0
09:12:50.868218 vethc393243b Out ifindex 9 6a:2a:26:51:8c:94 (tos 0x0, ttl 58, id 54409, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.2.1.60619 > 192.168.2.10.80: tcp 0
09:12:50.868258 vethc393243b P   ifindex 9 ea:59:75:f8:df:bc (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.2.10.80 > 192.168.2.1.60619: tcp 0
09:12:50.868258 cni0  In  ifindex 6 ea:59:75:f8:df:bc (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.2.10.80 > [Client PUB IP].54660: tcp 0

It now seems that the traffic is simply unable to leave via ens192.
