EC2 ARM instance crashes randomly "ens5 Could not set DHCPv4 address: Connection timed out"


After seven months of running fine, my AWS EC2 instances running on ARM started having this issue at random. It is inconsistent: sometimes once a day, sometimes once a week. Here are some things I have already explored:

  • CPU does not max out or get close. The instance has 8 cores, and my application, written in Go, never uses more than 4 cores and has not gone over 50% CPU usage.
  • Disk usage sits at 56%.
  • Memory capacity is 15 GB, and we never get close to maxing it out.
  • We use an EBS volume for disk storage.
  • There have been no changes to the VPC.

From what I can tell, the EC2 instance uses systemd (systemd-networkd) to set the internal network address via DHCP. When this fails, the instance no longer acts as though it is in the VPC, then snapd and other services crash, and the system needs a reboot before it can be accessed again.

I found nothing in the logs that points to why this happens; the messages below just appear at random.

I have read a bunch of other threads about ens5 issues, but none of them seem to apply to our setup. Any ideas on what is happening here?

Aug 12 17:02:55 systemd-networkd[491]: ens5: Could not set DHCPv4 address: Connection timed out
Aug 12 17:03:04 systemd-networkd[491]: ens5: Failed
Aug 12 17:04:05 systemd[1]: snapd.service: Watchdog timeout (limit 5min)!
Aug 12 17:04:19 systemd[1]: snapd.service: Killing process 545 (snapd) with signal SIGABRT.

cofiem On

It looks like you might be running into this issue of systemd-networkd not "handling a timeout of the netlink reconfiguration stage of a DHCPv4 refresh". It seems to be a bug in systemd-networkd.

"Reproduction steps:

  1. Configure a machine with a DHCPv4 lease on a network with a DHCPv4 server.
  2. Place machine under unusual load sufficient to cause a timeout on netlink requests.
  3. Observe the interface failing with the following logs:
systemd-networkd[139370]: eth0: Could not set DHCPv4 address: Connection timed out
systemd-networkd[139370]: eth0: Failed

It appears to be much easier/more common to produce this situation with unusually high load in a credit based virtualized compute environment."
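
Since the failure shows up as a recognizable log line, one way to catch it before the instance becomes unreachable is to scan the systemd-networkd journal for that line. The following is only a minimal sketch, assuming the message format matches the lines quoted in this thread; the function name `interfaces_with_dhcp_timeout` and the regular expression are illustrative and not part of systemd.

```python
import re

# Pattern based on the log lines quoted in this thread, e.g.
# "systemd-networkd[491]: ens5: Could not set DHCPv4 address: Connection timed out"
# Assumption: the message format is the same on your systemd version.
DHCP_TIMEOUT = re.compile(
    r"systemd-networkd\[\d+\]: (?P<iface>\S+): "
    r"Could not set DHCPv4 address: Connection timed out"
)


def interfaces_with_dhcp_timeout(log_lines):
    """Return the interfaces that logged the DHCPv4 timeout, in first-seen order."""
    seen = []
    for line in log_lines:
        match = DHCP_TIMEOUT.search(line)
        if match and match.group("iface") not in seen:
            seen.append(match.group("iface"))
    return seen
```

The function takes any iterable of log lines, for example the output of `journalctl -u systemd-networkd` split into lines. What to do on a match (alerting, or trying a restart of systemd-networkd instead of a full reboot) is left open here, since the thread does not confirm that a restart recovers the interface.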