Problem Solving
Troubleshooting & Real Issues
Issue 1: "Iringa Customers Can't Reach Moshi"
Problem: Wednesday morning, 10:15 AM. Dennis gets calls: "Internet is down in Iringa!" But Mbeya customers are fine, and Moshi is fine too.
DIAGNOSIS (Irene checks):
1. Check fiber physical layer (Layer 1)
Mbeya CCR optics:
[Iringa port] RX power: -3.5 dBm (normal: -7 to 0 dBm)
[Moshi port] RX power: -2.8 dBm (normal)
Fiber to Iringa appears OK (signal level normal)
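A sanity check like Irene's can be scripted. A minimal sketch, assuming a -7 to 0 dBm receive window (a typical SFP range, not a vendor spec):

```python
# Minimal sketch: flag an optical port whose receive power falls outside
# an acceptable window. The -7 to 0 dBm window is an assumed typical
# SFP receive range, not a vendor spec.
def rx_power_ok(rx_dbm: float, low: float = -7.0, high: float = 0.0) -> bool:
    return low <= rx_dbm <= high

print(rx_power_ok(-3.5))  # Iringa port -> True (within window)
print(rx_power_ok(-2.8))  # Moshi port  -> True
print(rx_power_ok(-8.5))  # e.g. a dirty connector -> False
```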
2. Check routing (Layer 3)
Mbeya: "show route"
192.168.2.0/24 via 192.168.2.1 (OSPF, distance 110)
192.168.3.0/24 via 192.168.3.1 (OSPF, distance 110)
Both routes EXIST in routing table
3. Test Layer 3 connectivity
$ ping 192.168.2.1
Response: OK, 1.2ms
$ ping 192.168.2.100 (Iringa customer)
Response: TIMEOUT (bad!)
$ ping 192.168.3.1
Response: OK, 0.8ms
$ traceroute 192.168.2.100
1. 192.168.1.1 (Mbeya, OK)
2. 192.168.2.1 (Iringa, OK)
3. 192.168.2.100 (TIMEOUT - packet lost here!)
4. Conclusion: the problem is BEYOND the Iringa router!
Irene thinks: "Switch at Iringa? Or customer CPE?"
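Reading a traceroute this way can be sketched as: find the last hop that answered, then investigate one step beyond it. The hop data below is the scenario's, in an illustrative form:

```python
# Illustrative sketch: given (hop address, responded?) pairs from a
# traceroute, return the last hop that answered. The fault usually sits
# just beyond it.
def last_responding_hop(hops):
    last = None
    for addr, responded in hops:
        if not responded:
            break
        last = addr
    return last

hops = [
    ("192.168.1.1", True),     # Mbeya router
    ("192.168.2.1", True),     # Iringa router
    ("192.168.2.100", False),  # customer: timeout
]
print(last_responding_hop(hops))  # -> 192.168.2.1, so look beyond Iringa
```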
5. Check Iringa switch (Layer 2)
SSH to 192.168.2.1
$ show interfaces
* Ports 1-12, 14-24: UP
* But Port 13 (customer segment): DOWN!
Msabi is in the field. Irene calls him: "Port 13 on the switch shows no link."
Msabi goes to the building, checks the cable:
"RJ45 plugged into the patch panel, but..."
Msabi notices: Cable is unplugged on SWITCH side!
Fiber damage? No. A cable simply came loose.
6. SOLUTION
Msabi re-plugs the cable into port 13 on the Iringa switch
Waits 2 seconds... LED lights up: LINK UP
From Mbeya:
$ ping 192.168.2.100
Response: OK, 1.2ms!
All customers' PCs get internet again.
TOTAL OUTAGE: 15 minutes
ROOT CAUSE: Physical layer (cable unplugged, possibly from a power surge or vibration)
PREVENTION: Cable clips, surge protectors, scheduled inspections
THE LESSON:
Always start at Layer 1. Before debugging routing, check the cable!
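The bottom-up method from this issue can be sketched as a small checker: run checks from Layer 1 upward and report the first failure. The check results below are stand-ins for the real probes Irene ran, not live measurements:

```python
# Sketch of the bottom-up troubleshooting order: report the first
# failing layer so physical faults surface before routing gets blamed.
# The booleans are illustrative stand-ins, not real probes.
def diagnose(checks):
    for layer, ok in checks:
        if not ok:
            return layer
    return "all layers OK"

checks = [
    ("Layer 1: backbone fiber RX power", True),
    ("Layer 2: Iringa switch port 13 link", False),  # the actual fault
    ("Layer 3: OSPF routes present", True),
]
print(diagnose(checks))  # the Layer 2 check fails first
```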
Issue 2: "NAT Not Working — Customer Gets Blocked"
Problem: Customer at 192.168.1.105 in Mbeya complains: "I can't access my work VPN. It keeps timing out."
ANALYSIS (Irene + Dennis collaborate):
1. Dennis tests from his workstation:
$ curl http://192.168.1.105:8080 (VPN app)
Response: Connection refused
Hmm, no service is even listening locally, so the failure must be in the outbound connection itself.
Customer's PC specs: "Windows 10, OpenVPN client"
Customer says: "OpenVPN opens, tries to connect, then times out after 60 sec"
2. Irene checks firewall state & NAT:
Customer's OpenVPN uses port 1194 (UDP)
Customer connects to: 198.51.100.10 (workplace VPN server)
Expected flow:
[SRC 192.168.1.105:59234] → [DST 198.51.100.10:1194]
↓ NAT-OUT ↓
[SRC 197.248.25.100:59234] → [DST 198.51.100.10:1194]
Firewall rule check: "Outbound UDP"?
Firewall rule: ALLOW (default allow for established sessions)
But WAIT: UDP is stateless!
Firewall NAT rule: "Masquerade outbound UDP"?
Does the rule exist?
$ show firewall nat
/ip firewall nat
chain=srcnat protocol=tcp action=masquerade
chain=srcnat protocol=icmp action=masquerade
Aha! TCP and ICMP are allowed, but NO UDP rule!
3. ROOT CAUSE
OpenVPN (UDP port 1194) isn't in the NAT rules!
Packets leave 192.168.1.105 with original IP
Workplace VPN server sees packet from 192.168.1.105 (private IP)
Tries to respond to 192.168.1.105
Response can't route back (192.168.1.105 is a private address, not routable on the internet)
Customer never sees reply
Timeout!
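The failure mode above can be illustrated with a toy masquerade function. The IP addresses come from the scenario; the function is a simplification, not real RouterOS behaviour:

```python
# Toy masquerade: only protocols in NAT_PROTOCOLS get their source
# address rewritten to the public IP. A simplification of the scenario,
# not real firewall behaviour.
PUBLIC_IP = "197.248.25.100"
NAT_PROTOCOLS = {"tcp", "icmp"}  # the broken config: no udp

def masquerade(proto, src_ip, src_port):
    if proto in NAT_PROTOCOLS:
        return (PUBLIC_IP, src_port)  # translated: reply path is routable
    return (src_ip, src_port)         # private source leaks out, reply lost

print(masquerade("tcp", "192.168.1.105", 59234))  # ('197.248.25.100', 59234)
print(masquerade("udp", "192.168.1.105", 59234))  # ('192.168.1.105', 59234)

NAT_PROTOCOLS.add("udp")  # Irene's fix
print(masquerade("udp", "192.168.1.105", 59234))  # ('197.248.25.100', 59234)
```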
4. SOLUTION
Irene adds NAT rule:
/ip firewall nat add chain=srcnat protocol=udp action=masquerade
OR more specific:
/ip firewall nat add chain=srcnat dst-address=198.51.100.10/32 protocol=udp action=masquerade
Customer re-opens OpenVPN
Now packets are NATted:
[SRC 192.168.1.105] → [SRC 197.248.25.100]
Workplace VPN responds to 197.248.25.100
Firewall de-NATs the return traffic: [DST 197.248.25.100] → [DST 192.168.1.105]
Customer receives response
VPN connects successfully!
THE LESSON:
NAT rules must be explicit. UDP, TCP, and ICMP are treated separately.
A protocol-restricted masquerade rule covers only what it names. VPN, DNS, and gaming traffic (UDP) then need explicit rules.
Issue 3: "BGP Route Flapping"
Problem: Every 30 seconds, traffic to internet cuts out for 5 seconds. Customers report: "Websites load, then hang, then load again."
INVESTIGATION:
1. Irene checks BGP status:
$ show bgp summary
BGP session to 203.0.198.1: FLAPPING (alternates UP/DOWN)
BGP log:
[10:00:01] Session established
[10:00:30] Session lost (TCP connection reset)
[10:00:32] Reconnect attempt
[10:00:33] Session established
[10:00:59] Session lost again
Pattern: Lost every ~30 seconds
2. Check firewall rules:
Is port 179 (BGP) being rate-limited?
$ show firewall filter stats
"Found it!" One rule: "Limit BGP to 10 packets/second"
BGP keepalive + updates exceed 10 pps
Firewall drops excess packets
TCP connection times out
Session drops
Reconvergence takes 30 seconds
Repeat!
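Why a 10 pps cap is fatal can be shown with toy rate-limit arithmetic (packet counts are illustrative, not measured):

```python
# Toy rate-limit arithmetic: with a hard cap, any burst above the cap
# loses packets. Counts are illustrative.
def apply_rate_limit(offered_pps, limit_pps):
    passed = min(offered_pps, limit_pps)
    return passed, offered_pps - passed

passed, dropped = apply_rate_limit(40, 10)  # a BGP update burst vs. the cap
print(passed, dropped)  # 30 control packets lost in that second
```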
3. SOLUTION
Irene removes the overly strict BGP rate limit:
/ip firewall filter remove [find comment="Limit BGP"]
OR adjust it:
/ip firewall filter set [find comment="Limit BGP"] limit=1000/s
BGP session stabilizes
Traffic is now consistent
Websites load smoothly
THE LESSON:
Firewalls can help but also hurt if misconfigured.
BGP control traffic MUST NOT be rate-limited.
Critical protocols: DNS (53), NTP (123), BGP (179), SSH (22) — rate-limit these only with great care, if at all.
Issue 4: "DNS Broken — Websites Won't Load"
Problem: Customers report: "Internet is working, but Google, Facebook, YouTube — nothing loads by name. But if I type in an IP address directly (8.8.8.8), it works!"
DIAGNOSIS:
Customer's symptoms:
• ping google.com → times out (can't resolve domain)
• ping 8.8.8.8 → works perfectly (IP works)
• Browser shows "DNS_PROBE_FINISHED_NXDOMAIN" or "Failed to resolve"
This is a CLASSIC DNS failure signature!
STEP 1: Test DNS from customer location
Dennis (customer) runs on his PC:
$ nslookup google.com
Server: 192.168.2.1 (ISP nameserver)
*** Can't find google.com: Server failed
That's VERY bad. ISP's nameserver is not responding.
STEP 2: Irene tests from Mbeya HQ
Irene runs:
$ nslookup google.com 192.168.2.2
Server: 192.168.2.2 (Iringa DNS server)
*** Can't find google.com: Server failed
Same problem! But wait:
$ nslookup google.com 8.8.8.8
Name: google.com
Address: 142.251.41.14
✓ Works!
This tells Irene:
• 8.8.8.8 (Google's DNS) works
• 192.168.2.2 (SprintMbale's DNS) broken
• Customer's router points to broken nameserver
STEP 3: Check SprintMbale's DNS server
Irene SSH's to dns-server.local (192.168.2.2):
$ systemctl status named
● named.service - BIND DNS Server
Loaded: loaded
Active: inactive (dead)
BOOM! DNS service crashed!
Check the log:
$ tail -50 /var/log/named.log
[Jun 05 14:23:22] Error: Zone file corrupt
[Jun 05 14:23:23] Shutting down...
[Jun 05 14:23:24] fatal: unable to load zone "example.local"
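A log watcher along these lines could have caught the crash sooner. A sketch only, using markers and log lines taken from the excerpt above, not a production monitor:

```python
# Sketch: scan recent named log lines and flag entries indicating the
# service died. Markers and sample lines come from the excerpt above.
FATAL_MARKERS = ("fatal:", "Shutting down")

def find_fatal(lines):
    return [l for l in lines if any(m in l for m in FATAL_MARKERS)]

log = [
    "[Jun 05 14:23:22] Error: Zone file corrupt",
    "[Jun 05 14:23:23] Shutting down...",
    '[Jun 05 14:23:24] fatal: unable to load zone "example.local"',
]
for line in find_fatal(log):
    print("ALERT:", line)
```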
STEP 4: Restart DNS
Irene restarts:
$ systemctl restart named
$ systemctl status named
● named.service - BIND DNS Server
Loaded: loaded
Active: active (running)
Now test:
$ nslookup google.com 192.168.2.2
Name: google.com
Address: 142.251.41.14
✓ Works!
Dennis immediately reports:
• Browser works
• google.com loads
• YouTube loads
• No more DNS_PROBE errors
STEP 5: Prevent future failures
Issue: What if the DNS server crashes again?
Solution: Set up redundant DNS
Option A - Customer uses Google DNS directly:
$ dhcp-client config
dns: 8.8.8.8, 8.8.4.4
Pro: faster than SprintMbale's DNS, unaffected by local DNS crashes
Con: depends on reachability to Google; less control, privacy concerns
Option B - Add fallback nameserver:
$ cat /etc/resolv.conf
nameserver 192.168.2.2 (Primary)
nameserver 8.8.8.8 (Secondary)
If 192.168.2.2 fails → tries 8.8.8.8
Timeout on primary: ~3 seconds
Then tries backup: ~1 second
Total delay: ~4 seconds
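The fallback timing above can be sketched as follows; the 3 s / 1 s figures are the assumed values from the text, not measurements:

```python
# Sketch of resolv.conf fallback timing: wait the full timeout on each
# dead server before trying the next one. Timeouts are the assumed
# values from the text.
def resolve_with_fallback(servers, alive, timeout=3.0, query_time=1.0):
    elapsed = 0.0
    for ns in servers:
        if alive.get(ns):
            return ns, elapsed + query_time
        elapsed += timeout  # timed out waiting on a dead server
    return None, elapsed

ns, delay = resolve_with_fallback(
    ["192.168.2.2", "8.8.8.8"],
    {"192.168.2.2": False, "8.8.8.8": True},
)
print(ns, delay)  # the backup answers after ~4 seconds total
```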
Option C - Set up secondary DNS server
192.168.2.3 (DNS-2 mirror)
Replicates zones from 192.168.2.2 (Primary)
Zone transfer (AXFR) every hour
$ dig @192.168.2.2 example.local AXFR
Customers use both:
nameserver 192.168.2.2
nameserver 192.168.2.3
Irene implements C + uses 8.8.8.8 as safety net
COMMON DNS FAILURES & HOW THEY MANIFEST:
1. DNS SERVER DOWN (this issue)
↓ Symptom: "Can't resolve ANY domain"
↓ Fix: Restart service or failover to backup
2. DNS CACHE POISONED
↓ Some domains resolve to WRONG IP
↓ Symptom: "google.com resolves to 1.1.1.1 instead of 142.251.41.14"
↓ Fix: FLUSH DNS CACHE
$ rndc flush
↓ Cause: Malicious DNS response cached (DNSSEC should prevent this)
3. DNSSEC VALIDATION FAILURE
↓ Domain has DNSSEC enabled, but signature invalid
↓ Symptom: "google.com not found" (even though it exists)
↓ Check: $ dig +dnssec google.com
↓ Fix: Check time sync (DNSSEC uses timestamps!)
$ timedatectl
Verify NTP is running
$ ntpq -p
4. FIREWALL BLOCKING PORT 53
↓ DNS uses UDP 53 and TCP 53 (zone transfers)
↓ Symptom: "Queries timeout, never get response"
↓ Check firewall rules:
$ show firewall filter (UDP 53 blocked?)
↓ Fix: Allow port 53
/ip firewall filter add protocol=udp dst-port=53 action=accept
5. RECURSIVE QUERY LOOP
↓ DNS server points to itself
↓ Symptom: "No response, query hangs 30 seconds"
↓ Config error: forwarders = 192.168.2.2 (itself!)
↓ Fix: forwarders = upstream.dns.server.com
6. WRONG DNS CONFIG VIA DHCP
↓ Router sends bad DNS server in DHCP offers
↓ Symptom: "Works for me, but not for other customers"
↓ Check DHCP server config:
$ show dhcp server
dns-servers = 192.168.2.99 (doesn't exist!)
↓ Fix: Update DHCP pool
$ dhcp set dns-servers 192.168.2.2
THE LESSON:
DNS is critical but fragile. It's often overlooked until it breaks.
Key DNS facts:
• UDP 53 (queries), TCP 53 (zone transfers and large responses); firewalls often block TCP 53 by mistake
• DNS timeouts hide root causes (server down? Network? Firewall?)
• Always have DNS monitoring (check /var/log/named.log regularly)
• DNSSEC adds complexity but prevents poisoning attacks
• Redundant DNS (primary + secondary) is non-negotiable for ISPs
• NTP (timekeeping) is critical for DNSSEC validation
• Customers need both ISP DNS + fallback (8.8.8.8) for resilience
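The UDP 53 queries mentioned above have a very simple wire format. A minimal sketch of building one (RFC 1035 header layout; in practice, use a resolver library):

```python
import struct

# Minimal sketch of a DNS A-record query as it travels over UDP 53
# (RFC 1035 header layout). For illustration only; use a resolver
# library in real code.
def build_dns_query(name, qid=0x1234):
    # header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # question: length-prefixed labels, terminating zero, QTYPE=A, QCLASS=IN
    qname = b"".join(bytes([len(lbl)]) + lbl.encode() for lbl in name.split("."))
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)

pkt = build_dns_query("google.com")
print(len(pkt))  # 28 bytes: 12-byte header + 16-byte question
```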
The Complete Picture
How all 5 Sessions Come Together:
Issue 1: Physical cable unplugged
↓
Layer 1: Electrical signal lost on the customer segment
Layer 2: Switch port down, no MAC learning for that segment
Layer 3: Routes to Iringa stay up, but packets to the customer die at the last hop
All customer traffic behind port 13 dropped
Issue 2: NAT rule missing for UDP
↓
Layer 4: OpenVPN uses UDP port 1194
Firewall sees outbound UDP, no NAT rule
Packet leaves with original private IP (192.168.1.105)
Remote VPN server can't respond to private IP
Connection times out
Issue 3: BGP rate-limited
↓
Layer 3/4: BGP (a Layer 3 routing protocol) rides on TCP port 179
Firewall drops packets exceeding 10 pps
BGP keepalives lost
Session drops every 30 seconds
Traffic reconverges but oscillates
Issue 4: DNS server crashed
↓
Layer 4: DNS uses UDP 53 (and TCP 53 for zone transfers)
DNS service crashed, not responding to queries
Firewall rule might also block port 53 (misconfiguration)
Customers can't resolve domains by name
IP addresses still work (routing intact)
Users see "Can't reach server" or "DNS lookup failed"
All required Sessions' knowledge to solve:
• Session 01: IP addresses, binary, understanding 192.168.1.0/24
• Session 02: Sockets & ports, TCP 179 for BGP, UDP 1194 for VPN, UDP 53 for DNS, DNS resolution process
• Session 03: Port understanding, opening 1194, not blocking SSH or DNS
• Session 04: Full stack — layer 1 cable, layer 2 switch port, layer 3 routing, layer 4 firewall/NAT/DNS
• Session 05 (This one): Real scenarios piecing it all together, troubleshooting methodology