
DNS Observability on Linux: Troubleshooting with dig, dnstap, and Prometheus

LinuxProfessionals

When your application is slow, you check CPU, memory, disk, and network. You check the database query log. You check the application traces. But how often do you check DNS? A misconfigured resolver, a failing DNSSEC chain, or a flapping upstream server can add hundreds of milliseconds to every request — and most monitoring stacks have zero DNS visibility. This guide builds a complete DNS observability practice: from expert-level dig techniques through dnstap streaming to Prometheus dashboards that catch problems before users notice.

dig Mastery: Beyond the Basics

Every engineer knows dig example.com. Here is what most do not know:

Trace the Full Resolution Path

# +trace follows the entire resolution chain from root servers
dig +trace example.com

# Output shows each delegation step:
# .                   518400 IN NS a.root-servers.net.
# com.                172800 IN NS a.gtld-servers.net.
# example.com.        172800 IN NS ns1.example.com.
# example.com.        300    IN A  93.184.216.34

# This is invaluable for diagnosing delegation failures

Measure Resolution Time at Every Hop

# Query specific servers to isolate where latency lives
dig @a.root-servers.net com. NS +norecurse +stats | grep "Query time"
# Query time: 12 msec

dig @a.gtld-servers.net example.com NS +norecurse +stats | grep "Query time"
# Query time: 24 msec

dig @ns1.example.com example.com A +norecurse +stats | grep "Query time"
# Query time: 45 msec

# Compare recursive resolver performance
for ns in 127.0.0.1 1.1.1.1 8.8.8.8 9.9.9.9; do
    time=$(dig @$ns example.com +stats | grep "Query time" | awk '{print $4}')
    echo "$ns: ${time}ms"
done

EDNS Client Subnet Testing

# Test how CDNs route based on your source IP
# +subnet tells the resolver to include your subnet in the query
dig @8.8.8.8 cdn.example.com +subnet=203.0.113.0/24
# Shows what IP the CDN would return for clients in that subnet

dig @8.8.8.8 cdn.example.com +subnet=198.51.100.0/24
# Different subnet may get a different answer (geo-routing)

Identify Which Server Answered

# +nsid requests the Name Server Identifier
dig @1.1.1.1 example.com +nsid
# If the server sends NSID, the response identifies the answering
# instance or PoP (anycast operators often encode a site code here)

# +identify shows the server address in the response
dig @1.1.1.1 example.com +identify

# Check for EDNS cookie support
dig @1.1.1.1 example.com +cookie

Batch Queries from a File

# Create a query file
cat > /tmp/queries.txt << 'EOF'
example.com A
example.com AAAA
example.com MX
example.com TXT
mail.example.com A
www.example.com CNAME
_dmarc.example.com TXT
EOF

# Run all queries
dig -f /tmp/queries.txt +short

# With timing for each
dig -f /tmp/queries.txt +stats 2>&1 | grep -E "(^;|Query time)"

YAML Output for Scripting

# Structured YAML output (dig 9.14+); dig has no native JSON mode,
# but yq can convert the YAML for jq-style pipelines
dig example.com +yaml

# Compact answer-only output for scripting
dig example.com +noall +answer

# Specific field extraction
dig example.com MX +short | sort -n
# 10 mail.example.com.
# 20 mail2.example.com.
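The sorted `+short` output pipes cleanly into scripts. A tiny helper (the function name is mine) that picks the preferred mail exchanger from that output:

```shell
# primary_mx: read `dig MX +short` output on stdin and print the host
# with the lowest preference value (i.e. the preferred MX)
primary_mx() {
    sort -n | awk 'NR == 1 { print $2 }'
}

# Example:
#   printf '20 mail2.example.com.\n10 mail.example.com.\n' | primary_mx
#   -> mail.example.com.
```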

delv: DNSSEC Chain Validation

delv, the DNS lookup and validation utility shipped with BIND, does what dig +dnssec cannot — it validates the entire DNSSEC trust chain locally and reports exactly where validation fails.

# Full DNSSEC validation
delv example.com

# Output for valid DNSSEC:
# ; fully validated
# example.com.    300 IN A 93.184.216.34
# example.com.    300 IN RRSIG A ...

# Output for broken DNSSEC:
delv dnssec-failed.org
# ;; resolution failed: SERVFAIL
# ;; validating dnssec-failed.org/A: no valid signature found

# Trace the validation chain
delv +vtrace example.com
# Shows each step of DNSSEC validation:
# - Root KSK → Root ZSK → .com DS
# - .com KSK → .com ZSK → example.com DS
# - example.com KSK → example.com ZSK → A RRSIG

# Test with a specific trust anchor file (-a)
delv @127.0.0.1 -a /var/lib/unbound/root.key example.com

drill: Lightweight Alternative to dig

# Install drill (part of ldns tools)
sudo dnf install -y ldns-utils  # RHEL
sudo apt install -y ldnsutils   # Debian

# Basic query
drill example.com

# DNSSEC trace
drill -DT example.com
# Shows DNSSEC trust chain with visual indicators

# Chase DNSSEC signatures
drill -S example.com
# Follows the chain of trust and validates signatures

# Reverse lookup
drill -x 93.184.216.34

dnstap: Structured DNS Logging

Traditional DNS logging (query logs) produces massive text files that are difficult to parse. dnstap is a binary protocol that captures DNS queries and responses in a structured, efficient format. It is supported by BIND9, Unbound, CoreDNS, and Knot Resolver.

Enable dnstap in Unbound

# /etc/unbound/unbound.conf
dnstap:
    dnstap-enable: yes
    dnstap-socket-path: "/var/run/unbound/dnstap.sock"
    dnstap-send-identity: yes
    dnstap-send-version: yes
    dnstap-log-resolver-query-messages: yes
    dnstap-log-resolver-response-messages: yes
    dnstap-log-client-query-messages: yes
    dnstap-log-client-response-messages: yes

Enable dnstap in BIND9

# /etc/named.conf — the dnstap statements belong inside the options {} block
dnstap {
    client;
    resolver;
    auth;
    forwarder;
};
dnstap-output unix "/var/run/named/dnstap.sock";

Process dnstap with dnstap-read and dnscollector

# Read dnstap data directly
dnstap-read /var/log/dns/dnstap.log

# Install dnscollector for production processing
go install github.com/dmachard/go-dnscollector@latest

# dnscollector config: receive dnstap, export to Prometheus + Loki
cat > /etc/dnscollector/config.yml << 'EOF'
global:
  trace:
    verbose: false

multiplexer:
  collectors:
    - name: dnstap-collector
      dnstap:
        listen-ip: 0.0.0.0
        listen-port: 6000

  loggers:
    - name: prometheus-logger
      prometheus:
        listen-ip: 0.0.0.0
        listen-port: 9165

    - name: loki-logger
      lokiclient:
        server-url: "http://loki:3100/loki/api/v1/push"
        job-name: "dnscollector"

  routes:
    - from: [dnstap-collector]
      to: [prometheus-logger, loki-logger]
EOF

# Run the collector (go install names the binary go-dnscollector)
go-dnscollector -config /etc/dnscollector/config.yml
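To keep the collector running in production, a systemd unit can supervise it. A minimal sketch — the binary path and config path are assumptions to adapt:

```ini
# /etc/systemd/system/dnscollector.service (paths are assumptions)
[Unit]
Description=DNS-collector dnstap pipeline
After=network-online.target

[Service]
ExecStart=/usr/local/bin/go-dnscollector -config /etc/dnscollector/config.yml
Restart=on-failure
DynamicUser=yes

[Install]
WantedBy=multi-user.target
```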

tcpdump and tshark: When dnstap Is Not Available

# Capture DNS packets on port 53
sudo tcpdump -i any port 53 -nn -l

# Capture to file for analysis
sudo tcpdump -i any port 53 -w /tmp/dns-capture.pcap -c 10000

# Parse with tshark (Wireshark CLI)
tshark -r /tmp/dns-capture.pcap -T fields \
    -e frame.time \
    -e ip.src \
    -e dns.qry.name \
    -e dns.qry.type \
    -e dns.flags.rcode \
    -e dns.time \
    -Y "dns.flags.response == 1" | \
    sort -t$'\t' -k6 -rn | head -20

# Real-time DNS query monitoring
sudo tshark -i any -f "port 53" -T fields \
    -e ip.src -e dns.qry.name -e dns.qry.type \
    -Y "dns.flags.response == 0" 2>/dev/null

# Find slow DNS responses (>100ms)
tshark -r /tmp/dns-capture.pcap \
    -Y "dns.flags.response == 1 && dns.time > 0.1" \
    -T fields -e dns.qry.name -e dns.time | sort -t$'\t' -k2 -rn
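On large captures, raw per-query times are easier to digest as a summary. A small helper (the name and the nearest-rank p95 approximation are mine) that reads response times in seconds, one per line — e.g. the dns.time column extracted above — and reports count, approximate p95, and max:

```shell
# dns_latency_summary: summarize DNS response times (seconds, one per line)
dns_latency_summary() {
    sort -n | awk '{ v[NR] = $1 }
        END {
            if (NR == 0) { print "no samples"; exit }
            i = int(NR * 0.95); if (i < 1) i = 1   # nearest-rank p95
            printf "count=%d p95=%.3fs max=%.3fs\n", NR, v[i], v[NR]
        }'
}

# Example:
#   tshark -r /tmp/dns-capture.pcap -Y "dns.flags.response == 1" \
#       -T fields -e dns.time | dns_latency_summary
```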

Prometheus DNS Monitoring

Blackbox Exporter: Active DNS Probing

# /etc/blackbox_exporter/config.yml
modules:
  dns_internal:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      transport_protocol: "udp"
      preferred_ip_protocol: "ip4"
      valid_rcodes:
        - NOERROR
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          - ".*93\\.184\\.216\\.34.*"

  dns_mx:
    prober: dns
    dns:
      query_name: "company.com"
      query_type: "MX"
      valid_rcodes:
        - NOERROR

  dns_soa:
    prober: dns
    dns:
      query_name: "company.com"
      query_type: "SOA"
      valid_rcodes:
        - NOERROR

# Prometheus scrape config (prometheus.yml)
scrape_configs:
  - job_name: 'dns-probes'
    metrics_path: /probe
    params:
      module: [dns_internal]
    static_configs:
      - targets:
          - '10.0.0.1'      # Internal resolver
          - '10.0.0.2'      # Secondary resolver
          - '1.1.1.1'       # Cloudflare (baseline)
          - '8.8.8.8'       # Google (baseline)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Grafana Dashboard Queries

# DNS resolution latency by resolver
probe_dns_lookup_time_seconds{job="dns-probes"}

# DNS probe success rate
avg_over_time(probe_success{job="dns-probes"}[5m]) * 100

# Unbound cache hit ratio (via unbound_exporter)
rate(unbound_cache_hits_total[5m]) /
(rate(unbound_cache_hits_total[5m]) + rate(unbound_cache_misses_total[5m]))

# BIND9 query rate by type (via bind_exporter)
rate(bind_incoming_queries_total[5m])

# CoreDNS SERVFAIL rate
rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])
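These queries translate directly into alerting rules. A sketch of two alerts — the thresholds and severity labels are illustrative, not prescriptive:

```yaml
# /etc/prometheus/rules/dns-alerts.yml (thresholds are illustrative)
groups:
  - name: dns-alerts
    rules:
      - alert: DNSProbeFailing
        expr: avg_over_time(probe_success{job="dns-probes"}[5m]) < 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DNS probe to {{ $labels.instance }} failing"

      - alert: DNSResolutionSlow
        expr: probe_dns_lookup_time_seconds{job="dns-probes"} > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "DNS resolution via {{ $labels.instance }} above 100ms"
```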

Application-Level DNS Measurement

# Measure DNS impact on HTTP requests with curl
curl -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" \
    -o /dev/null -s https://api.company.com/health

# Output:
# DNS: 0.023s
# Connect: 0.045s
# TLS: 0.089s
# Total: 0.124s

# DNS is 18% of total request time in this example

# Batch test across multiple endpoints
for url in api.company.com app.company.com cdn.company.com mail.company.com; do
    dns_time=$(curl -w "%{time_namelookup}" -o /dev/null -s "https://$url/")
    echo "$url: ${dns_time}s"
done

# Build SLIs around DNS resolution time
# SLI: 99th percentile of DNS resolution time < 50ms
# SLO: 99.9% of requests meet the SLI over a rolling 30-day window
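The SLI above can be materialized as a Prometheus recording rule over the blackbox metric (the rule name is mine):

```yaml
groups:
  - name: dns-sli
    rules:
      # 99th percentile of probed DNS lookup time over 5m windows;
      # report against the 50ms SLI target
      - record: sli:dns_lookup_seconds:p99_5m
        expr: quantile_over_time(0.99, probe_dns_lookup_time_seconds{job="dns-probes"}[5m])
```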

Systematic Troubleshooting Workflow

#!/bin/bash
# dns-troubleshoot.sh — Systematic DNS diagnosis
# Usage: ./dns-troubleshoot.sh example.com

DOMAIN="${1:?Usage: $0 <domain>}"
RESOLVER=$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)

echo "=== DNS Troubleshooting: $DOMAIN ==="
echo "System resolver: $RESOLVER"
echo ""

# Step 1: Can the system resolver resolve it?
echo "--- Step 1: Local resolution ---"
RESULT=$(dig @$RESOLVER "$DOMAIN" +short +time=5)
RCODE=$(dig @$RESOLVER "$DOMAIN" +noall +comments | grep -oP 'status: \K\w+')
echo "Result: $RESULT"
echo "RCODE: $RCODE"

if [ "$RCODE" == "SERVFAIL" ]; then
    echo "SERVFAIL — possible DNSSEC failure or upstream timeout"
    echo "Testing without DNSSEC validation:"
    dig @$RESOLVER "$DOMAIN" +cd +short
fi

# Step 2: Can a public resolver resolve it?
echo ""
echo "--- Step 2: Public resolver test ---"
for ns in 1.1.1.1 8.8.8.8 9.9.9.9; do
    result=$(dig @$ns "$DOMAIN" +short +time=3 2>/dev/null)
    echo "$ns: ${result:-FAILED}"
done

# Step 3: Query the authoritative servers directly
echo ""
echo "--- Step 3: Authoritative server test ---"
NS_SERVERS=$(dig NS "$DOMAIN" +short 2>/dev/null)
if [ -z "$NS_SERVERS" ]; then
    # Try parent zone
    PARENT=$(echo "$DOMAIN" | cut -d. -f2-)
    NS_SERVERS=$(dig NS "$PARENT" +short 2>/dev/null | head -2)
    echo "No NS for $DOMAIN, checking parent: $PARENT"
fi

for ns in $NS_SERVERS; do
    result=$(dig @$ns "$DOMAIN" A +norecurse +short +time=3 2>/dev/null)
    echo "$ns: ${result:-NO_ANSWER}"
done

# Step 4: DNSSEC validation
echo ""
echo "--- Step 4: DNSSEC check ---"
delv "$DOMAIN" 2>&1 | head -5

# Step 5: Latency
echo ""
echo "--- Step 5: Resolution latency ---"
dig @$RESOLVER "$DOMAIN" +stats 2>&1 | grep "Query time"

echo ""
echo "=== Diagnosis complete ==="

Detecting DNS Security Threats

# Detect DNS tunneling (data exfiltrated through encoded subdomain labels)
# Legitimate: www.example.com
# Tunnel: payload encoded into labels, e.g. aGVsbG8gd29ybGQ.t.example.com

# Alert on queries with labels longer than 50 characters
tshark -i any -f "port 53" -T fields -e dns.qry.name \
    -Y "dns.flags.response == 0" 2>/dev/null | \
    awk -F. '{for(i=1;i<=NF;i++) if(length($i)>50) print}'

# Detect DGA (Domain Generation Algorithm) domains
# High-entropy names that mostly return NXDOMAIN indicate malware C2
# (RCODE is only meaningful on responses, so filter on response == 1)
tshark -i any -f "port 53" -T fields -e dns.qry.name \
    -Y "dns.flags.response == 1 and dns.flags.rcode == 3" 2>/dev/null | \
    sort | uniq -c | sort -rn | head -20
# High NXDOMAIN rate for random-looking domains = potential DGA

# Monitor for unexpected zone transfers
tshark -i any -f "port 53" -T fields -e ip.src -e dns.qry.name \
    -Y "dns.qry.type == 252" 2>/dev/null
# Type 252 = AXFR (zone transfer) — should only come from known secondaries
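"Random-looking" can be made measurable. A sketch of a Shannon-entropy filter — the 3.5-bit and 12-character thresholds are illustrative starting points, not tuned values:

```shell
# entropy_filter: read query names on stdin (e.g. from the tshark commands
# above) and print those whose character entropy suggests machine-generated
# names, prefixed with the entropy score in bits
entropy_filter() {
    awk '{
        s = tolower($0); n = length(s)
        split("", f)                               # reset per-line counts
        for (i = 1; i <= n; i++) f[substr(s, i, 1)]++
        h = 0
        for (c in f) { p = f[c] / n; h -= p * log(p) / log(2) }
        if (h > 3.5 && n > 12) printf "%.2f  %s\n", h, $0
    }'
}
```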

DNS observability turns "the network is slow" from a mystery into a diagnosis. When you can see resolution latency per resolver, cache hit ratios over time, SERVFAIL spikes correlated with upstream changes, and query patterns that indicate misconfiguration or attack — you stop guessing and start fixing. Build the dashboard, set the alerts, and never debug a "DNS problem" blind again.
