DNS Privacy Stack — Troubleshooting

Troubleshooting

Every issue documented here was encountered in production. Each section includes the symptoms, root cause, diagnosis steps, and fix.

`dig @127.0.0.1` Times Out but LAN IP Works

Symptoms:

$ dig @127.0.0.1 google.com +timeout=3
;; communications error to 127.0.0.1#53: timed out
;; no servers could be reached

$ dig @<YOUR_SERVER_IP> google.com +timeout=3
142.251.220.206   # Works!

Root cause: AdGuard Home's allowed_clients does not include 127.0.0.0/8. When a query arrives from 127.0.0.1, AdGuard refuses it because the source IP isn't in the allowlist. TCP queries get REFUSED; UDP queries are silently dropped.

Diagnosis:

## TCP shows REFUSED (the clue)
dig @127.0.0.1 google.com +tcp +timeout=3
## status: REFUSED

## Check allowed_clients
docker exec adguardhome grep -A5 'allowed_clients' /opt/adguardhome/conf/AdGuardHome.yaml

Fix:

docker exec adguardhome sed -i 's/  allowed_clients:/  allowed_clients:\n    - 127.0.0.0\/8/' \
  /opt/adguardhome/conf/AdGuardHome.yaml
docker restart adguardhome
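To confirm what the allowlist actually contains before and after the fix, a small parser can help. The following is a minimal sketch, not part of AdGuard Home: the function names `list_allowed_clients` and `loopback_allowed` are made up for this example, and it assumes the block-style YAML list (`- entry` lines under the key) that the sed command above also expects.

```shell
# list_allowed_clients FILE
# Prints each entry of the allowed_clients list, one per line.
# Assumes a block-style YAML list ("- entry" lines directly after the key).
list_allowed_clients() {
    awk '
        /^[ \t]*allowed_clients:/ { in_list = 1; next }
        in_list && /^[ \t]*-/ {
            sub(/^[ \t]*-[ \t]*/, "")
            print
            next
        }
        { in_list = 0 }
    ' "$1"
}

# loopback_allowed FILE
# Succeeds if any allowlist entry starts with 127. (optionally quoted).
loopback_allowed() {
    list_allowed_clients "$1" | grep -Eq "^[\"']?127\."
}
```

One possible usage: copy the config out with `docker cp adguardhome:/opt/adguardhome/conf/AdGuardHome.yaml /tmp/agh.yaml`, then run `loopback_allowed /tmp/agh.yaml && echo "loopback is allowed"`. The check only looks at the text of each entry, so a single-host entry such as 127.0.0.1 also counts.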

Unbound Returns SERVFAIL on Everything

Symptoms: Every query returns status: SERVFAIL. Unbound is running and listening.

Possible causes (check in order):

1. `use-caps-for-id: yes`

This feature randomizes the letter case of query names (0x20 encoding) to detect spoofed replies. Many authoritative servers don't echo the case back exactly, so Unbound treats their responses as spoofed and resolution fails.

grep 'use-caps-for-id' /etc/unbound/unbound.conf.d/adguard.conf
## If yes, change to no

Logs will show module_event_capsfail repeatedly.

2. `harden-referral-path: yes`

This option makes Unbound send extra queries to validate the referral chain. If any of those validation queries fails, the entire resolution fails. Remove it entirely; the security benefit is minimal for a forwarder.

3. DNSSEC trust anchor priming failure

info: failed to prime trust anchor -- could not fetch DNSKEY rrset

If Unbound can't validate the root DNSSEC key (common with ISP hijacking) and val-permissive-mode is no, all queries fail. Set val-permissive-mode: yes to log validation failures without blocking the answers.

4. `subnetcache` module interference

On Ubuntu 24.04, Unbound 1.19.2 has the subnetcache module compiled in. It loads automatically even without send-client-subnet in the config. Combined with serve-expired and prefetch, it produces warnings:

warning: subnetcache: serve-expired is set but not working for data originating from the subnet module cache

This doesn't break forwarding, but it did cause issues with recursive resolution. Because the module auto-loads, don't try to exclude it with module-config: "validator iterator"; that can cause worse problems on some builds.
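The first three causes can be checked in one pass over the config files. The sketch below is an illustration only: `audit_unbound_conf` is a hypothetical helper name, and it flags explicit settings only. An option that is absent from the file falls back to Unbound's built-in default, which this script does not inspect.

```shell
# audit_unbound_conf FILE...
# Prints one line per explicit setting that this section associates with
# SERVFAIL. Commented-out lines are ignored.
audit_unbound_conf() {
    awk '
        /^[ \t]*#/ { next }
        {
            line = $0
            sub(/#.*/, "", line)        # drop trailing comments
            gsub(/[ \t]+/, "", line)    # "opt: yes" becomes "opt:yes"
        }
        line == "use-caps-for-id:yes"      { print FILENAME ": use-caps-for-id is yes (cause 1)" }
        line == "harden-referral-path:yes" { print FILENAME ": harden-referral-path is yes (cause 2)" }
        line == "val-permissive-mode:no"   { print FILENAME ": val-permissive-mode is no (cause 3)" }
    ' "$@"
}
```

A typical call would be `audit_unbound_conf /etc/unbound/unbound.conf.d/*.conf`. No output means none of the three settings is set explicitly in those files.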

ISP DNS Hijacking Breaks Recursive Resolution

Symptoms: Unbound forwarding to 1.1.1.1 works, but recursive resolution (no forward-zone) times out on every root server query.

Root cause: Your ISP transparently redirects all port 53 traffic (UDP and TCP) to their own DNS servers. When Unbound sends a non-recursive query (RD=0) to a root server, the ISP's proxy intercepts it. The proxy can't handle non-recursive queries, so it drops them or returns garbage.

How to confirm:

## This works (dig sends RD=1 by default, so the ISP resolver handles it)
dig @198.41.0.4 google.com +short +timeout=3
## Returns an IP — but you're NOT actually talking to the root server

## This is what Unbound sends (RD=0) — fails because ISP can't handle it
dig @198.41.0.4 . NS +norec +timeout=3
## Timeout or SERVFAIL

The definitive test from RIPE Labs:

dig @198.41.0.4 hostname.bind CH TXT +timeout=3
## If hijacked: timeout, SERVFAIL, or wrong answer
## If not hijacked: returns the root server's hostname
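When running these probes from a script, the outcome has to be read from dig's text output. The sketch below shows one way to reduce that output to a single word; `dig_status` is a hypothetical helper name, and it relies on dig printing its usual `status:` header line, so it should not be combined with +short.

```shell
# dig_status: reads dig output on stdin and prints the response status
# (NOERROR, SERVFAIL, REFUSED, ...), TIMEOUT if dig reported a timeout,
# or UNKNOWN if neither could be found.
dig_status() {
    _dig_out=$(cat)
    _rcode=$(printf '%s\n' "$_dig_out" | sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
    if [ -n "$_rcode" ]; then
        printf '%s\n' "$_rcode"
    elif printf '%s\n' "$_dig_out" | grep -q 'timed out'; then
        echo TIMEOUT
    else
        echo UNKNOWN
    fi
}
```

For example, `dig @198.41.0.4 . NS +norec +timeout=3 | dig_status` prints TIMEOUT or SERVFAIL in the failure cases listed above.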

Fix: Use DNS-over-TLS forwarding instead of recursive resolution. DoT runs over TCP port 853, which is outside the port 53 redirect described above:

forward-zone:
  name: "."
  forward-tls-upstream: yes
  forward-addr: 194.242.2.2@853#dns.mullvad.net

ISPs known to hijack: Many ISPs in Asia (China, India, Indonesia), Brazil, and Turkey. If you're on one of these, recursive resolution will not work without a VPN tunnel.

Unbound Swap Thrashing Causes Cascading DNS Outage

Symptoms: After 3-7 days, the server becomes unresponsive. SSH sessions freeze. All containers lose DNS. Server swap usage is 12GB+.

Root cause: Unbound's caches were set too large (1GB rrset + 512MB msg + 256MB key ≈ 1.75GB). With malloc overhead, actual usage is ~2.5x the configured value (~4.4GB). On a 16GB server running 100+ containers, this pushes the system into heavy swap usage. Unbound's cache access patterns then cause constant page faults, which cascade into I/O wait, which blocks DNS responses, which makes every container retry, which increases load further.

Fix: Right-size the caches. For a homelab with ~10,000 unique domains:

rrset-cache-size: 64m
msg-cache-size: 32m
key-cache-size: 16m
neg-cache-size: 8m
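The sizing arithmetic can be repeated for any set of values. This is a rough sketch that applies the ~2.5x overhead factor quoted in the root cause as a worst-case ceiling for full caches; `estimate_unbound_mem_mb` is a name invented for this example, and the factor is this guide's observation, not a number published by Unbound.

```shell
# estimate_unbound_mem_mb RRSET_MB MSG_MB KEY_MB [NEG_MB]
# Prints the configured cache total multiplied by 2.5, in MB,
# using integer arithmetic (total * 5 / 2).
estimate_unbound_mem_mb() {
    _total=$(( $1 + $2 + $3 + ${4:-0} ))
    echo $(( _total * 5 / 2 ))
}
```

With the original oversized settings, `estimate_unbound_mem_mb 1024 512 256` prints 4480. With the right-sized values above, `estimate_unbound_mem_mb 64 32 16 8` prints 300.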

Monitor:

## Check Unbound memory
systemctl status unbound | grep Memory
## Should show ~20-30MB, peak ~50MB. If it exceeds 200MB, caches are too large.

## Check system swap
free -m | grep Swap
## Swap used should be under 1GB for healthy operation
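For a cron job or alerting script, the swap check can be turned into a pass/fail test. The following is a sketch under the assumption that `free -m` prints its usual English-labelled table (a `Swap:` row with total, used, free columns); `swap_used_mb` and `swap_ok` are hypothetical names.

```shell
# swap_used_mb: reads `free -m` output on stdin and prints the "used"
# column of the Swap row.
swap_used_mb() {
    awk '$1 == "Swap:" { print $3 }'
}

# swap_ok LIMIT_MB: succeeds if swap used (read from stdin) is below LIMIT_MB.
swap_ok() {
    _used=$(swap_used_mb)
    [ -n "$_used" ] && [ "$_used" -lt "$1" ]
}
```

For example: `free -m | swap_ok 1024 || echo "swap usage is above 1GB"`.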

GL.iNet Router: DNS Stops Working After Configuration

Symptoms: After changing the router's DNS settings, ping google.com returns bad address on the router and all clients lose internet.

Trap 1: `force_dns='1'`

GL.iNet routers have force_dns='1' by default. This creates iptables DNAT rules that redirect ALL port 53 traffic passing through the router to dnsmasq. If you set DHCP option 6 to <YOUR_SERVER_IP>, clients try to reach your DNS server, but the router intercepts the traffic and hands it to dnsmasq. dnsmasq then tries to forward the query, that traffic gets intercepted as well, and the result is a DNS loop and total failure.

Fix: Always disable before changing DNS:

uci set dhcp.@dnsmasq[0].force_dns='0'
uci commit dhcp
/etc/init.d/dnsmasq restart
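Before touching any other DNS setting, it is worth confirming the current state of this flag. The sketch below parses the output of `uci show dhcp`; `force_dns_state` is a hypothetical helper name, and it assumes the option lives in the first dnsmasq section, as in the commands above.

```shell
# force_dns_state: reads `uci show dhcp` output on stdin and prints
# "enabled", "disabled", or "unset" for dhcp.@dnsmasq[0].force_dns.
force_dns_state() {
    # \047 is the single quote that uci puts around values
    _value=$(awk -F= '$1 == "dhcp.@dnsmasq[0].force_dns" { gsub(/\047/, "", $2); print $2; exit }')
    case "$_value" in
        1) echo enabled ;;
        0) echo disabled ;;
        *) echo unset ;;
    esac
}
```

Run it on the router as `uci show dhcp | force_dns_state`. Anything other than "disabled" means the option has not been explicitly turned off yet.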

Trap 2: Setting `noresolv` and WAN DNS

Setting noresolv='1' and pointing the router's upstream to <YOUR_SERVER_IP> sounds logical but is fragile. If AdGuard/Unbound restarts, the router itself loses DNS, which can prevent dnsmasq from resolving anything — including the path back to <YOUR_SERVER_IP> if it goes through DNS.

Safe approach: Only use DHCP option 6. Leave the router's own DNS untouched (ISP DNS). This way the router always works, and only clients use your privacy DNS.

## SAFE: Only affects clients
uci add_list dhcp.lan.dhcp_option='6,<YOUR_SERVER_IP>'
uci commit dhcp
/etc/init.d/dnsmasq restart

## DANGEROUS: Don't do this
## uci set dhcp.@dnsmasq[0].noresolv='1'
## uci add_list dhcp.@dnsmasq[0].server='<YOUR_SERVER_IP>'

Trap 3: `/etc/init.d/network restart`

Never restart the router's network stack when testing DNS changes. It takes down all interfaces briefly, which disconnects your SSH session and can leave the router in a bad state. Only restart dnsmasq.

`systemd` Watchdog Kills Unbound Every 60 Seconds

Symptoms: Monitoring alerts show Unbound entering failed state every 60 seconds:

unbound.service: Failed with result 'watchdog'
unbound.service: Killing process with signal SIGABRT

Root cause: WatchdogSec=60 in the systemd override requires Unbound to send sd_notify(WATCHDOG=1) pings. Unbound does not implement systemd watchdog. After 60 seconds without a ping, systemd kills it.

Fix:

sudo tee /etc/systemd/system/unbound.service.d/override.conf << 'EOF'
[Service]
LimitNOFILE=65536
LimitNPROC=512
Restart=on-failure
RestartSec=5
EOF

sudo systemctl daemon-reload
sudo systemctl restart unbound
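If the restarts continue after rewriting the override, another unit file or drop-in may still set the watchdog. A minimal sketch for locating it follows; `find_watchdog_settings` is a name invented here, and which files to pass to it is an assumption about your layout.

```shell
# find_watchdog_settings FILE...
# Prints "file:line:content" for every uncommented WatchdogSec= assignment.
# Exits non-zero when none is found.
find_watchdog_settings() {
    grep -HnE '^[[:space:]]*WatchdogSec[[:space:]]*=' "$@"
}
```

For example: `find_watchdog_settings /etc/systemd/system/unbound.service.d/*.conf`. Running `systemctl cat unbound` lists the unit file and every drop-in that systemd actually loads, which shows the paths worth checking.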

AdGuard Container Takes 10+ Seconds to Start

Symptoms: After docker restart adguardhome, port 53 returns "connection refused" for 10-15 seconds. Scripts that test immediately after restart fail.

Root cause: AdGuard Home enumerates all Docker veth interfaces on startup (host networking mode). With 100+ containers, this takes 7-15 seconds before the DNS listener starts.

Workaround: When scripting, poll instead of using a fixed sleep:

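docker restart adguardhome
## SECONDS is bash's built-in elapsed-time counter
SECONDS=0
ready=0
for i in $(seq 1 12); do
    ## +tries=1 limits each probe to a single 3-second attempt
    r=$(dig @127.0.0.1 google.com +short +timeout=3 +tries=1 2>&1 | head -1)
    if [[ "$r" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
        echo "Ready after ${SECONDS}s"
        ready=1
        break
    fi
    sleep 5
done
[[ $ready -eq 1 ]] || echo "AdGuard still not answering after ${SECONDS}s" >&2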

`resolv.conf` Resets After Reboot

Symptoms: After rebooting the server, cat /etc/resolv.conf shows the ISP/DHCP nameserver instead of 127.0.0.1. All containers that use the host resolver fail.

Root cause: Netplan or DHCP client overwrites /etc/resolv.conf on boot.

Fix: Create a netplan override:

sudo tee /etc/netplan/99-dns-override.yaml << 'EOF'
network:
  version: 2
  ethernets:
    <YOUR_INTERFACE>:
      nameservers:
        addresses: [127.0.0.1]
      dhcp4-overrides:
        use-dns: false
EOF

sudo chmod 600 /etc/netplan/99-dns-override.yaml
sudo netplan apply
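To verify after the next reboot that the override held, a one-line check of the first nameserver entry is usually enough. This is a sketch; `first_nameserver` is a hypothetical helper name, and it assumes the plain resolv.conf format with one `nameserver` entry per line.

```shell
# first_nameserver [FILE]
# Prints the first nameserver address (FILE defaults to /etc/resolv.conf).
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "${1:-/etc/resolv.conf}"
}
```

For example: `[ "$(first_nameserver)" = "127.0.0.1" ] || echo "resolv.conf was overwritten"`. This fits well in a boot-time health check next to the other diagnostics.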

Quick Diagnostic Commands

## Full stack test (run all at once; each step runs even if an earlier one fails)
echo "=== resolv.conf ==="; cat /etc/resolv.conf
echo "=== Unbound ==="; dig @127.0.0.1 -p 5335 google.com +short +timeout=5
echo "=== AdGuard localhost ==="; dig @127.0.0.1 google.com +short +timeout=5
echo "=== AdGuard LAN ==="; dig @<YOUR_SERVER_IP> google.com +short +timeout=5
echo "=== System ==="; ping -c 1 google.com | head -2

## Check Unbound is using DoT
ss -tnp | grep :853

## Check Unbound memory
systemctl status unbound | grep Memory

## Check AdGuard is running
docker ps --filter name=adguardhome --format '{{.Status}}'

## Check caching (run twice and compare "Query time")
dig @127.0.0.1 google.com +timeout=5 | grep "Query time"
## Second run should report 0 msec (or close to it)
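The query-time check shows that one name is cached. For an overall hit rate, Unbound's own counters are more informative. The sketch below computes a percentage from the `total.num.cachehits` and `total.num.cachemiss` lines printed by `unbound-control stats_noreset`; `cache_hit_rate` is a hypothetical helper name, and it assumes remote control is enabled in your Unbound config so that unbound-control works.

```shell
# cache_hit_rate: reads `unbound-control stats_noreset` output on stdin and
# prints the cache hit percentage as an integer (truncated), or "n/a" if no
# queries have been counted.
cache_hit_rate() {
    awk -F= '
        $1 == "total.num.cachehits" { hits = $2 }
        $1 == "total.num.cachemiss" { miss = $2 }
        END {
            if (hits + miss == 0) { print "n/a"; exit }
            printf "%d\n", (hits * 100) / (hits + miss)
        }
    '
}
```

For example: `sudo unbound-control stats_noreset | cache_hit_rate`. Using stats_noreset rather than stats leaves the counters intact for other monitoring tools.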
