OpenTelemetry Datadog exporter sending duplicate metrics


I have an OpenTelemetry Collector running as a DaemonSet on each Kubernetes node (EKS), with the Datadog exporter configured. I noticed that when a metric is shipped through the Datadog exporter it is duplicated: the `host` tag carries not only the pod name but also the EC2 instance ID of the node the metric is reported from. The same metric, when shipped directly by the Datadog Agent instead of the OTel Collector, is reported correctly.
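
For context, the relevant slice of the collector configuration is roughly as follows (the full ConfigMap is included further below):

```yaml
exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
      site: datadoghq.com
    host_metadata:
      enabled: true
      hostname_source: first_resource
service:
  pipelines:
    metrics:
      receivers: [prometheus, hostmetrics]
      processors: [k8sattributes, memory_limiter, attributes/k8s_cluster_name, batch]
      exporters: [otlp/newrelic, datadog, debug]
```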

Example from the Datadog metric:


| TAG KEY | COUNT | TAG VALUES           |
| ------- | ----- | -------------------- |
| host    | 8     | host:i-XXXXXXX       |
|         |       | host:Y-pod-server-0  |
|         |       | host:i-XXXX          |
|         |       | host:i-XXXX          |
|         |       | host:Y-pod-server-2  |
|         |       | host:i-XXXXX         |
|         |       | host:Y-pod-server-3  |
|         |       | host:Y-pod-server-1  |

       

Ideally there would be just 4 values. When I use the Datadog Agent to export the metric directly instead of the OTel Collector, the `host` tag looks like this in Datadog:


| TAG KEY | COUNT | TAG VALUES           |
| ------- | ----- | -------------------- |
| host    | 4     | host:Y-pod-server-0  |
|         |       | host:Y-pod-server-2  |
|         |       | host:Y-pod-server-3  |
|         |       | host:Y-pod-server-1  |

Below is my OTel Collector ConfigMap:

- apiVersion: v1
  data:
    relay: |
      exporters:
        datadog:
          api:
            key: ${env:DD_API_KEY}
            site: datadoghq.com
          host_metadata:
            enabled: true
            hostname_source: first_resource
        debug: {}
      extensions:
        health_check: {}
        memory_ballast:
          size_in_percentage: 40
      processors:
        resourcedetection/eks:
          detectors: [env, eks]
          timeout: 2s
          override: true
        attributes/k8s_cluster_name:
          actions:
          - action: upsert
            key: k8s.cluster.name
            value: devel_plane_regional_us-west-2
        batch: {}
        k8sattributes:
          extract:
            labels:
            - from: pod
              key_regex: (.*)
              tag_name: $$1
            metadata:
            - k8s.namespace.name
            - k8s.deployment.name
            - k8s.statefulset.name
            - k8s.daemonset.name
            - k8s.cronjob.name
            - k8s.job.name
            - k8s.node.name
            - k8s.pod.name
            - k8s.pod.uid
            - k8s.pod.start_time
          filter:
            node_from_env_var: K8S_NODE_NAME
          passthrough: false
          pod_association:
          - sources:
            - from: resource_attribute
              name: k8s.pod.ip
          - sources:
            - from: resource_attribute
              name: k8s.pod.uid
          - sources:
            - from: connection
        memory_limiter:
          check_interval: 5s
          limit_percentage: 80
          spike_limit_percentage: 25
      receivers:
        hostmetrics:
          collection_interval: 10s
          root_path: /hostfs
          scrapers:
            cpu: null
            disk: null
            filesystem:
              exclude_fs_types:
                fs_types:
                - autofs
                - binfmt_misc
                - bpf
                - cgroup2
                - configfs
                - debugfs
                - devpts
                - devtmpfs
                - fusectl
                - hugetlbfs
                - iso9660
                - mqueue
                - nsfs
                - overlay
                - proc
                - procfs
                - pstore
                - rpc_pipefs
                - securityfs
                - selinuxfs
                - squashfs
                - sysfs
                - tracefs
                match_type: strict
              exclude_mount_points:
                match_type: regexp
                mount_points:
                - /dev/*
                - /proc/*
                - /sys/*
                - /run/k3s/containerd/*
                - /var/lib/docker/*
                - /var/lib/kubelet/*
                - /snap/*
            load: null
            memory: null
            network: null
        jaeger:
          protocols:
            grpc:
              endpoint: ${env:MY_POD_IP}:14250
            thrift_compact:
              endpoint: ${env:MY_POD_IP}:6831
            thrift_http:
              endpoint: ${env:MY_POD_IP}:14268
        otlp:
          protocols:
            grpc:
              endpoint: ${env:MY_POD_IP}:4317
            http:
              endpoint: ${env:MY_POD_IP}:4318
        prometheus:
          config:
            scrape_configs:
            - job_name: opentelemetry-collector_sp_regional_us-west-2
              scrape_interval: 10s
              static_configs:
              - targets:
                - ${env:MY_POD_IP}:8888
            - job_name: sp_control_plane_regional_us-west-2
              kubernetes_sd_configs:
              - role: pod
              relabel_configs:
              - action: keep
                regex: true
                source_labels:
                - __meta_kubernetes_pod_annotation_prometheus_io_scrape
              - action: keep
                regex: true
                source_labels:
                - __meta_kubernetes_pod_annotation_custom_telemetry
              - source_labels:
                - __meta_kubernetes_pod_container_port_name
                regex: metric
                action: keep
              - regex: __meta_kubernetes_pod_node_name
                action: labeldrop
              - action: replace
                regex: (.+)
                source_labels:
                - __meta_kubernetes_pod_annotation_prometheus_io_path
                target_label: __metrics_path__
              scrape_interval: 10s
        zipkin:
          endpoint: ${env:MY_POD_IP}:9411
      service:
        extensions:
        - health_check
        - memory_ballast
        pipelines:
          logs:
            exporters:
            - debug
            processors:
            - k8sattributes
            - memory_limiter
            - batch
            receivers:
            - otlp
          metrics:
            exporters:
            - otlp/newrelic
            - datadog
            - debug
            processors:
            - k8sattributes
            - memory_limiter
            - attributes/k8s_cluster_name
            - batch
            receivers:
            - prometheus
            - hostmetrics
          traces:
            exporters:
            - debug
            processors:
            - k8sattributes
            - memory_limiter
            - batch
            receivers:
            - otlp
            - jaeger
            - zipkin
        telemetry:
          metrics:
            address: ${env:MY_POD_IP}:8888
  kind: ConfigMap
  metadata:
    annotations:
      meta.helm.sh/release-name: otel-collector
      meta.helm.sh/release-namespace: otel-collector
    creationTimestamp: "2023-10-11T23:32:52Z"
    labels:
      app.kubernetes.io/instance: otel-collector
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: opentelemetry-collector
      app.kubernetes.io/version: 0.86.0
      helm.sh/chart: opentelemetry-collector-0.69.2
    name: otel-collector-opentelemetry-collector-agent
    namespace: otel-collector
kind: List
metadata: {}




  • In the debug log of the OTel Collector I noticed the following line:
2023-10-12T18:30:11.678Z        info    provider/provider.go:59 Resolved source {"kind": "exporter", "data_type": "metrics", "name": "datadog", "provider": "ec2", "source": {"Kind":"host","Identifier":"i-XXXXXXXXX"}}

I am wondering whether this happens because Datadog is unable to detect that it is running on EKS, and that is why it reports the EC2 instance ID as the host instead of just the pod names.
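
To see which resource attributes (and therefore which host) actually reach the exporter, I believe the debug exporter that is already in the pipeline can be turned up to detailed output. A minimal sketch, assuming the debug exporter's standard `verbosity` option:

```yaml
exporters:
  debug:
    verbosity: detailed   # prints each metric's full resource attributes, so the detected host.name / host.id can be inspected
```

I also notice that the `resourcedetection/eks` processor is defined in the config above but is not referenced by any pipeline; I have not verified whether adding it would change the resolved source.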

To address this, I tried setting the `host_metadata` field in the Datadog exporter to `first_resource` instead of `config_or_system`, but it changed nothing:

          host_metadata:
            enabled: true
            hostname_source: first_resource
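
Something I have not tried yet is overriding the hostname per resource. As far as I understand, the Datadog exporter honours a `datadog.host.name` resource attribute; a sketch of setting it from `k8s.pod.name` with the transform processor (the processor name and the choice of `k8s.pod.name` as the desired host value are my assumptions) would be:

```yaml
processors:
  transform/datadog_host:
    metric_statements:
      - context: resource
        statements:
          # assumption: the pod name is the value Datadog should use as host
          - set(attributes["datadog.host.name"], attributes["k8s.pod.name"]) where attributes["k8s.pod.name"] != nil
```

The processor would also have to be added to the metrics pipeline, e.g. between `k8sattributes` and `batch`.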
  • I also tried dropping the node-name label via the Prometheus receiver's kubernetes_sd_config relabeling, but that does not appear to work either:
              - regex: __meta_kubernetes_pod_node_name
                action: labeldrop
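
As far as I understand, this rule is likely a no-op: `labeldrop` matches label names, and all `__`-prefixed meta labels are discarded after target relabeling anyway. A sketch of dropping a label that actually ends up on the scraped series would instead use `metric_relabel_configs` (the label name `node` here is just a placeholder for whatever label actually appears on the series):

```yaml
            - job_name: sp_control_plane_regional_us-west-2
              kubernetes_sd_configs:
              - role: pod
              metric_relabel_configs:
              - action: labeldrop
                regex: node    # placeholder: the label name as it appears on the scraped series
```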