Grafana Graphite query: detect when a value differs from the average for the same hour over the past week


My data source is Graphite.

I'm collecting a simple metric: the number of responses from my app, grouped by route alias and response code, like the following:

my_app.prod.count.api.*.*

I can thus group error responses with 4xx and 5xx response codes like this:

my_app.prod.count.api.*.4*
my_app.prod.count.api.*.5*

The problem is that I don't want to compare the current value to the one right before it: the app traffic changes over the day, and the pattern seems to be pretty much the same every day (even on weekends). See the attached image for a 14d period: [screenshot of Grafana time series graph over a 14-day period]

Thus, I would love the "Ok"-meter to compare the current 1h value to the value for this exact hour of day (averaged over the last week or two, to compensate for sudden peaks during faulty deployments or something), and to trigger a "not-ok" state when the value differs from that average by around 3 times the std dev.

My question is: what would the query approximately look like?

I've seen something similar described here as

my_metric / avg_over_time(my_metric[1w])
my_metric / avg_over_time(my_metric[1w] offset 1w)

but I'm not sure how to adapt it to my case (as pointed out in the comments, it's Prometheus, not Graphite), with triggering at 3 times the std dev.

Eduard Sukharev On

TL;DR: What I came up with looks horrible, but seems to do the job just right:

aliasByNode(diffSeriesLists(aliasByNode(my_app.prod.count.newgs.api.*.4*, 5, 6), aliasByNode(scale(groupByNodes(timeStack(my_app.prod.count.newgs.api.*.4*, '1w', 0, 20), "stddev", 5, 6), 3), 0, 1)), 0, 1)

Query breakdown: First, using timeStack, we get a stack of series at 1-week offsets, going 20 weeks into the past and including the current week:

timeStack(my_app.prod.count.newgs.api.*.4*, '1w', 0, 20)

(This could also be done with 1-day offsets: substitute 1w and 20 with 1d and however many days you want to look back.)

Then we calculate the standard deviation for each route and error code. The node naming is my_app.prod.count.newgs.api.*.4*, which means we want to group by the 5th and 6th nodes (0-based indexing here) and apply the stddev aggregation function, using groupByNodes:

groupByNodes(
  timeStack(
    my_app.prod.count.newgs.api.*.4*, '1w', 0, 20
  ), "stddev", 5, 6
)
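To make the stacking step concrete, here is a minimal Python sketch of the arithmetic: for each point in the latest period, collect the values at the same offset in earlier periods and take their standard deviation. It is only an illustration, not Graphite code. The function name `stacked_stddev` and the toy data are made up, and it assumes evenly spaced points, no nulls, and a population standard deviation (an assumption about how the stddev aggregation behaves, not something stated above).

```python
from statistics import pstdev


def stacked_stddev(series, period, copies):
    """Standard deviation across `copies` period-shifted views of `series`.

    For every slot in the most recent `period` points, gather the value at
    that slot from each of the last `copies` periods (shift 0 is the current
    period) and return the population standard deviation per slot.
    """
    if len(series) < period * copies:
        raise ValueError("need at least period * copies points")
    start = len(series) - period  # index where the current period begins
    result = []
    for slot in range(period):
        same_slot = [series[start + slot - shift * period]
                     for shift in range(copies)]
        result.append(pstdev(same_slot))
    return result


# Toy data: a "period" of 2 points and 3 periods of history.
# Slot 0 drifts (1, 3, 5); slot 1 is constant (10, 10, 10).
history = [1, 10, 3, 10, 5, 10]
spread = stacked_stddev(history, period=2, copies=3)
print(spread)
```

With hourly data and weekly offsets, `period` would be 168 and `copies` 20, matching the '1w', 0, 20 arguments in the query.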

Since we want to compare the current value to 3 std devs, we multiply the resulting series by a constant using scale (multiplySeries works only on series, not constants):

scale(
  groupByNodes(
    timeStack(
      my_app.prod.count.newgs.api.*.4*, '1w', 0, 20
    ), "stddev", 5, 6
  ), 3
)

Then we want to know by how much the current value exceeds 3 times the stddev calculated from the past period, using diffSeriesLists.

But that requires both series lists to have the same naming, so we apply aliasByNode to both operands, and another one on the outside to get nice naming:

aliasByNode(
  diffSeriesLists(
    aliasByNode(
      my_app.prod.count.newgs.api.*.4*, 5, 6
    ),
    aliasByNode(
      scale(
        groupByNodes(
          timeStack(
            my_app.prod.count.newgs.api.*.4*, '1w', 0, 20
          ), "stddev", 5, 6
        ), 3
      ), 0, 1
    )
  ), 0, 1
)

Note: the pretty formatting is solely for demonstration purposes; the actual query editor in Grafana does not allow newlines or indentation in queries.
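As a rough illustration of what the final series means, the last two steps of the query (scale by 3, then subtract from the current series) boil down to the per-point arithmetic below. This is a Python sketch with invented numbers, and `excess_over_threshold` is a hypothetical helper, not a Graphite function. A positive result means the current count is above three standard deviations of the historical values, so an alert threshold of 0 on the resulting series would be one way to use it.

```python
def excess_over_threshold(current, stddev, factor=3):
    """Per-point `current - factor * stddev`.

    Mirrors the arithmetic of diffSeriesLists(current, scale(stddev, factor))
    for two equally long lists of values.
    """
    if len(current) != len(stddev):
        raise ValueError("both series must have the same number of points")
    return [c - factor * s for c, s in zip(current, stddev)]


# Invented sample values for one route/error-code pair.
current = [4, 12, 30]       # current error counts
stddev = [2.0, 3.0, 5.0]    # stddev of the same slots in past periods

excess = excess_over_threshold(current, stddev)
# Indices of points where the current value is above 3 * stddev.
breaches = [i for i, e in enumerate(excess) if e > 0]
print(excess, breaches)
```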