I'm trying to extend the base NGINX ingress controller log pipeline, which already has 6 pipelines that do a lot of the parsing. All I want to do is extract part of the URL path from the log. For example, given the log line below, I would like to extract datadoghq out of the URL path and set it as a variable called service_name in the final parsed output.
172.16.99.64 - - [19/Mar/2020:16:02:20 +0000] "GET /datadoghq/company?test=var1%20Pl HTTP/1.1" 503 605 "-" "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0" 4033 0.000 [proxyname-8080] [] - - - - abcdefg12345abcdef
I'm not sure if I should edit the Grok parser, which already has a bunch of primary parsing rules in place (a rough sketch of the edit I had in mind is below the helper rules):
access.common %{_client_ip}(?: - \[%{notSpace}\])? - %{_ident} \[%{_date_access}\] "(?>%{_method} |)%{_url}(?> %{_version}|)" %{_status_code} (?>%{_bytes_written}|-) "%{_referer}" "%{_user_agent}" %{_request_size} %{_duration} \[%{_proxy_name}\](?: \[%{_alternate_proxy_name}?\])? (?:%{_upstream_ip}:%{_upstream_port}|-)(?:, %{notSpace})?(?:, %{notSpace})? (?:%{_bytes_read}|-)(?:, %{number}|, -)?(?:, %{number}|, -)? (?:%{_upstream_time}|-)(?:, %{number}|, -)?(?:, %{number}|, -)? (?:%{_upstream_status}|-)(?:, %{number}|, -)?(?:, %{number}|, -)?(?: %{_request_id})?.*
error.format %{date("yyyy/MM/dd HH:mm:ss"):date_access} \[%{word:level}\] %{data:error.message}(, %{data::keyvalue(": ",",")})?
controller_format %{regex("\\w"):level}%{date("MMdd HH:mm:ss.SSSSSS"):date_access}\s+%{number} %{notSpace:logger.name}:%{number:lineno}\] .*
and 21 helper rules:
_request_id %{notSpace:http.request_id}
_upstream_status %{number:http.upstream_status_code}
_upstream_time %{number:http.upstream_duration}
_bytes_read %{number:network.bytes_read}
_upstream_port %{number:network.destination.port}
_upstream_ip %{ipOrHost:network.destination.ip}
_proxy_name %{notSpace:proxy.name}
_alternate_proxy_name %{notSpace:proxy.alternate_name}
_duration %{number:duration:scale(1000000000)}
_request_size %{number:network.request_size}
_bytes_written %{integer:network.bytes_written}
_client_ip %{ipOrHost:network.client.ip}
_version HTTP\/%{regex("\\d+\\.\\d+"):http.version}
_url %{notSpace:http.url}
_ident %{notSpace:http.ident:nullIf("-")}
_user_agent %{regex("[^\\\"]*"):http.useragent}
_referer %{notSpace:http.referer}
_status_code %{integer:http.status_code}
_method %{word:http.method}
_date_access %{date("dd/MMM/yyyy:HH:mm:ss Z"):date_access}
_x_forwarded_for %{regex("[^\\\"]*"):http._x_forwarded_for:nullIf("-")}
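If I went the route of editing the Grok parser, I'm assuming I'd have to clone the base pipeline first (the out-of-the-box integration pipelines are read-only as far as I know) and then tweak the _url helper rule. The untested sketch below is roughly what I had in mind: wrap the first path segment in a lookahead so http.url is still captured in full while that segment also gets pulled out as service_name. I'm not sure the Grok implementation even accepts a matcher inside a lookahead group, which is part of why I'm asking.

_url (?=\/%{regex("[^\\/\\?\\s]+"):service_name})%{notSpace:http.url}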
After it parses the logs it sets an http.url variable in the output, and I feel like it would be simple to parse this further so that http.url stays intact but another variable called service_name is created that is simply datadoghq (a rough sketch of what I have in mind follows the output below). The parsed output currently looks like this:
{
  "duration": 0,
  "http": {
    "url": "/datadoghq/company?test=var1%20Pl",
    "status_code": 503,
    "version": "1.1",
    "referer": "-",
    "request_id": "abcdefg12345abcdef",
    "useragent": "Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0",
    "method": "GET"
  },
  "proxy": {
    "name": "proxyname-8080"
  },
  "date_access": 1584633740000,
  "network": {
    "client": {
      "ip": "172.16.99.64"
    },
    "bytes_written": 605,
    "request_size": 4033
  }
}
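The other option I'm considering is leaving the base pipeline alone and adding a separate Grok Parser processor in my own pipeline that runs after it, pointed at the http.url attribute instead of message (I believe the Grok parser's advanced settings let you extract from a different attribute). Something along these lines, again untested, and the rule name extract_service is just something I made up:

extract_service \/%{regex("[^\\/\\?]+"):service_name}.*

That should grab everything between the leading slash and the next slash or question mark, so /datadoghq/company?test=var1%20Pl would yield service_name = datadoghq while leaving http.url untouched.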
Does anyone have any recommendations about the best way to go about this? Should I edit the Grok parser, which means sifting through the many parsing rules, or can I create another pipeline to do what I need on top of the parsed object?
Thanks!