How to alert on anomalous network traffic jumps with Prometheus?


We want to detect whether a VM in our IaaS infrastructure is under a DDoS attack.

We have several symptoms and metrics to work with, such as node_nf_conntrack_entries, node_network_receive_packets_total and libvirt_domain_interface_stats_receive_packets_total.

We do not want false positives from a fixed trigger point of the form "traffic > n, then alert".

rate(libvirt_domain_interface_stats_receive_packets_total{host="x"}[5m])


rate(node_network_receive_packets_total{instance="y1"}[5m])


sum(node_nf_conntrack_entries_limit - node_nf_conntrack_entries) by (instance) < 1000



There are 2 best solutions below

2
valyala On BEST ANSWER

You can compare the average packet rate over the last 5 minutes with the average packet rate over the preceding 5 minutes. If it has grown by more than 10x, then alert:

(
  rate(node_network_receive_packets_total[5m])
    /
  rate(node_network_receive_packets_total[5m] offset 5m)
) > 10

See docs for offset modifier.

This query may produce false alerts, though. For example, if the network traffic was close to zero and then grew by more than 10x, the ratio fires even though the absolute packet rate is still negligible. This can be solved by adding a filter that excludes low network traffic. For example, the following query alerts only if the ratio exceeds 10 and the average per-second packet rate over the last 5 minutes is greater than 1000:

((
  rate(node_network_receive_packets_total[5m])
    /
  rate(node_network_receive_packets_total[5m] offset 5m)
) > 10)
  and
(
  rate(node_network_receive_packets_total[5m]) > 1000
)

This query can miss a slowly growing DoS attack, where the network traffic grows by less than 10x per 5 minutes. This can be fixed by tuning the offset value or by adding an absolute maximum packet rate above which the query alerts unconditionally. For example, the following query also alerts whenever the average packet rate over the last minute exceeds 100K/sec:

(
  ((
    rate(node_network_receive_packets_total[5m])
      /
    rate(node_network_receive_packets_total[5m] offset 5m)
  ) > 10)
    and
  (
    rate(node_network_receive_packets_total[5m]) > 1000
  )
)
  or
(
  rate(node_network_receive_packets_total[1m]) > 100000
)

See these docs for and and or operators.
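As an illustration of tuning the offset, here is a minimal sketch that compares the current 5-minute rate with the rate one hour earlier, so that a ramp-up spread over many 5-minute steps can still exceed the ratio. The 1h offset is an assumption chosen for the example, not a recommended value; the 10x and 1000 packets/sec thresholds are carried over from the queries above.

```promql
# Sketch only: the 1h offset is an illustrative assumption.
# Ratio of the current 5-minute packet rate to the rate one hour ago.
(
  (
    rate(node_network_receive_packets_total[5m])
      /
    rate(node_network_receive_packets_total[5m] offset 1h)
  ) > 10
)
  and
(
  # Ignore interfaces whose absolute traffic is still low.
  rate(node_network_receive_packets_total[5m]) > 1000
)
```

A longer offset catches slower growth but also makes ordinary daily traffic cycles more likely to cross the ratio, so the offset and the ratio threshold probably need to be tuned together against your own traffic history.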

0
Thomas Decaux On

To detect peaks, you could use the max_over_time function:

max_over_time(range-vector): the maximum value of all points in the specified interval.

That way you don't lose peak information between evaluations of the alerting rule by Prometheus.
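Because max_over_time takes a range vector, applying it to a rate needs a subquery. A minimal sketch, assuming a Prometheus version with subquery support (2.7 or later) and reusing the 100K/sec limit from the other answer; the 10m range and 1m resolution are illustrative assumptions:

```promql
# Sketch only: the 10m range and 1m step are illustrative assumptions.
# Highest 1-minute packet rate observed during the last 10 minutes,
# evaluated at 1-minute steps via a subquery.
max_over_time(
  rate(node_network_receive_packets_total[1m])[10m:1m]
) > 100000
```

With this form a short burst keeps the expression true for the length of the outer range, even if the rate has already dropped again by the time the rule is evaluated.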