Problem Solving
Troubleshooting & Real Issues
Issue 1: "Iringa Customers Can't Reach Moshi"
Problem: Wednesday morning, 10:15 AM. Dennis gets calls: "Internet is down in Iringa!" But Mbeya customers are fine, and Moshi is fine too.
DIAGNOSIS (Irene checks):
1. Check fiber physical layer (Layer 1)
Mbeya CCR optics:
[Iringa port] RX power: -3.5 dBm (normal: -7 to 0 dBm)
[Moshi port] RX power: -2.8 dBm (normal)
Fiber to Iringa appears OK (signal level normal)
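A sanity check like Irene's can be scripted. A minimal sketch, assuming a -7 to 0 dBm receive window (a typical SFP range, not a vendor spec):

```python
# Minimal sketch: flag an optical port whose receive power falls outside
# an acceptable window. The -7 to 0 dBm window is an assumed typical
# SFP receive range, not a vendor spec.
def rx_power_ok(rx_dbm: float, low: float = -7.0, high: float = 0.0) -> bool:
    return low <= rx_dbm <= high

print(rx_power_ok(-3.5))  # Iringa port -> True (within window)
print(rx_power_ok(-2.8))  # Moshi port  -> True
print(rx_power_ok(-8.5))  # e.g. a dirty connector -> False
```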
2. Check routing (Layer 3)
Mbeya: "show route"
192.168.2.0/24 via 192.168.2.1 (OSPF, distance 110)
192.168.3.0/24 via 192.168.3.1 (OSPF, distance 110)
Both routes EXIST in routing table
3. Test Layer 3 connectivity
$ ping 192.168.2.1
Response: OK, 1.2ms
$ ping 192.168.2.100 (Iringa customer)
Response: TIMEOUT (bad!)
$ ping 192.168.3.1
Response: OK, 0.8ms
$ traceroute 192.168.2.100
1. 192.168.1.1 (Mbeya, OK)
2. 192.168.2.1 (Iringa, OK)
3. 192.168.2.100 (TIMEOUT - packet lost here!)
4. Conclusion: the problem is BEYOND the Iringa router!
Irene thinks: "Switch at Iringa? Or customer CPE?"
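Reading a traceroute this way can be sketched as: find the last hop that answered, then investigate one step beyond it. The hop data below is the scenario's, in an illustrative form:

```python
# Illustrative sketch: given (hop address, responded?) pairs from a
# traceroute, return the last hop that answered. The fault usually sits
# just beyond it.
def last_responding_hop(hops):
    last = None
    for addr, responded in hops:
        if not responded:
            break
        last = addr
    return last

hops = [
    ("192.168.1.1", True),     # Mbeya router
    ("192.168.2.1", True),     # Iringa router
    ("192.168.2.100", False),  # customer: timeout
]
print(last_responding_hop(hops))  # -> 192.168.2.1, so look beyond Iringa
```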
5. Check Iringa switch (Layer 2)
SSH to 192.168.2.1
$ show interfaces
* Ports 1-12, 14-24: UP
* But Port 13 (customer segment): DOWN!
Msabi is in the field. Irene calls him: "Port 13 on the switch shows no link."
Msabi goes to the building, checks the cable:
"RJ45 plugged into the patch panel, but..."
Msabi notices: Cable is unplugged on SWITCH side!
Fiber damage? No. A cable simply came loose.
6. SOLUTION
Msabi re-plugs the cable into port 13 on the Iringa switch
Waits 2 seconds... LED lights up: LINK UP
From Mbeya:
$ ping 192.168.2.100
Response: OK, 1.2ms!
All customers' PCs get internet again.
TOTAL OUTAGE: 15 minutes
ROOT CAUSE: Physical layer (cable unplugged, possibly from a power surge or vibration)
PREVENTION: Cable clips, surge protectors, scheduled inspections
THE LESSON:
Always start at Layer 1. Before debugging routing, check the cable!
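The bottom-up method from this issue can be sketched as a small checker: run checks from Layer 1 upward and report the first failure. The check results below are stand-ins for the real probes Irene ran, not live measurements:

```python
# Sketch of the bottom-up troubleshooting order: report the first
# failing layer so physical faults surface before routing gets blamed.
# The booleans are illustrative stand-ins, not real probes.
def diagnose(checks):
    for layer, ok in checks:
        if not ok:
            return layer
    return "all layers OK"

checks = [
    ("Layer 1: backbone fiber RX power", True),
    ("Layer 2: Iringa switch port 13 link", False),  # the actual fault
    ("Layer 3: OSPF routes present", True),
]
print(diagnose(checks))  # the Layer 2 check fails first
```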
Issue 2: "NAT Not Working — Customer Gets Blocked"
Problem: Customer at 192.168.1.105 in Mbeya complains: "I can't access my work VPN. It keeps timing out."
ANALYSIS (Irene + Dennis collaborate):
1. Dennis tests from his workstation:
$ curl http://192.168.1.105:8080 (VPN app)
Response: Connection refused
Hmm, no service is even listening locally, so the failure must be in the outbound connection itself.
Customer's PC specs: "Windows 10, OpenVPN client"
Customer says: "OpenVPN opens, tries to connect, then times out after 60 sec"
2. Irene checks firewall state & NAT:
Customer's OpenVPN uses port 1194 (UDP)
Customer connects to: 198.51.100.10 (workplace VPN server)
Expected flow:
[SRC 192.168.1.105:59234] → [DST 198.51.100.10:1194]
↓ NAT-OUT ↓
[SRC 197.248.25.100:59234] → [DST 198.51.100.10:1194]
Firewall rule check: "Outbound UDP"?
Firewall rule: ALLOW (default allow for established sessions)
But WAIT: UDP is stateless!
Firewall NAT rule: "Masquerade outbound UDP"?
Does the rule exist?
$ show firewall nat
/ip firewall nat
chain=srcnat protocol=tcp action=masquerade
chain=srcnat protocol=icmp action=masquerade
Aha! TCP and ICMP are allowed, but NO UDP rule!
3. ROOT CAUSE
OpenVPN (UDP port 1194) isn't in the NAT rules!
Packets leave 192.168.1.105 with original IP
Workplace VPN server sees packet from 192.168.1.105 (private IP)
Tries to respond to 192.168.1.105
Response can't route back (192.168.1.105 is a private address, not routable on the internet)
Customer never sees reply
Timeout!
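The failure mode above can be illustrated with a toy masquerade function. The IP addresses come from the scenario; the function is a simplification, not real RouterOS behaviour:

```python
# Toy masquerade: only protocols in NAT_PROTOCOLS get their source
# address rewritten to the public IP. A simplification of the scenario,
# not real firewall behaviour.
PUBLIC_IP = "197.248.25.100"
NAT_PROTOCOLS = {"tcp", "icmp"}  # the broken config: no udp

def masquerade(proto, src_ip, src_port):
    if proto in NAT_PROTOCOLS:
        return (PUBLIC_IP, src_port)  # translated: reply path is routable
    return (src_ip, src_port)         # private source leaks out, reply lost

print(masquerade("tcp", "192.168.1.105", 59234))  # ('197.248.25.100', 59234)
print(masquerade("udp", "192.168.1.105", 59234))  # ('192.168.1.105', 59234)

NAT_PROTOCOLS.add("udp")  # Irene's fix
print(masquerade("udp", "192.168.1.105", 59234))  # ('197.248.25.100', 59234)
```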
4. SOLUTION
Irene adds NAT rule:
/ip firewall nat add chain=srcnat protocol=udp action=masquerade
OR more specific:
/ip firewall nat add chain=srcnat dst-address=198.51.100.10/32 protocol=udp action=masquerade
Customer re-opens OpenVPN
Now packets are NATted:
[SRC 192.168.1.105] → [SRC 197.248.25.100]
Workplace VPN responds to 197.248.25.100
Firewall de-NATs the return traffic: [DST 197.248.25.100] → [DST 192.168.1.105]
Customer receives response
VPN connects successfully!
THE LESSON:
NAT rules must be explicit. UDP, TCP, and ICMP are treated separately.
A protocol-restricted masquerade rule covers only what it names. VPN, DNS, and gaming traffic (UDP) then need explicit rules.
Issue 3: "BGP Route Flapping"
Problem: Every 30 seconds, traffic to internet cuts out for 5 seconds. Customers report: "Websites load, then hang, then load again."
INVESTIGATION:
1. Irene checks BGP status:
$ show bgp summary
BGP session to 203.0.198.1: FLAPPING (alternates UP/DOWN)
BGP log:
[10:00:01] Session established
[10:00:30] Session lost (TCP connection reset)
[10:00:32] Reconnect attempt
[10:00:33] Session established
[10:00:59] Session lost again
Pattern: Lost every ~30 seconds
2. Check firewall rules:
Is port 179 (BGP) being rate-limited?
$ show firewall filter stats
"Found it!" One rule: "Limit BGP to 10 packets/second"
BGP keepalive + updates exceed 10 pps
Firewall drops excess packets
TCP connection times out
Session drops
Reconvergence takes 30 seconds
Repeat!
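Why a 10 pps cap is fatal can be shown with toy rate-limit arithmetic (packet counts are illustrative, not measured):

```python
# Toy rate-limit arithmetic: with a hard cap, any burst above the cap
# loses packets. Counts are illustrative.
def apply_rate_limit(offered_pps, limit_pps):
    passed = min(offered_pps, limit_pps)
    return passed, offered_pps - passed

passed, dropped = apply_rate_limit(40, 10)  # a BGP update burst vs. the cap
print(passed, dropped)  # 30 control packets lost in that second
```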
3. SOLUTION
Irene removes the overly strict BGP rate limit:
/ip firewall filter remove [find comment="Limit BGP"]
OR adjust it:
/ip firewall filter set [find comment="Limit BGP"] limit=1000/s
BGP session stabilizes
Traffic is now consistent
Websites load smoothly
THE LESSON:
Firewalls can help but also hurt if misconfigured.
BGP control traffic MUST NOT be rate-limited.
Critical protocols: DNS (53), NTP (123), BGP (179), SSH (22) — rate-limit these only with great care, if at all.
Issue 4: "DNS Broken — Websites Won't Load"
Problem: Customers report: "Internet is working, but Google, Facebook, YouTube — nothing loads by name. But if I type in an IP address directly (8.8.8.8), it works!"
DIAGNOSIS:
Customer's symptoms:
• ping google.com → times out (can't resolve domain)
• ping 8.8.8.8 → works perfectly (IP works)
• Browser shows "DNS_PROBE_FINISHED_NXDOMAIN" or "Failed to resolve"
This is a CLASSIC DNS failure signature!
STEP 1: Test DNS from customer location
Dennis (customer) runs on his PC:
$ nslookup google.com
Server: 192.168.2.1 (ISP nameserver)
*** Can't find google.com: Server failed
That's VERY bad. ISP's nameserver is not responding.
STEP 2: Irene tests from Mbeya HQ
Irene runs:
$ nslookup google.com 192.168.2.2
Server: 192.168.2.2 (Iringa DNS server)
*** Can't find google.com: Server failed
Same problem! But wait:
$ nslookup google.com 8.8.8.8
Name: google.com
Address: 142.251.41.14
✓ Works!
This tells Irene:
• 8.8.8.8 (Google's DNS) works
• 192.168.2.2 (SprintMbale's DNS) broken
• Customer's router points to broken nameserver
STEP 3: Check SprintMbale's DNS server
Irene SSH's to dns-server.local (192.168.2.2):
$ systemctl status named
● named.service - BIND DNS Server
Loaded: loaded
Active: inactive (dead)
BOOM! DNS service crashed!
Check the log:
$ tail -50 /var/log/named.log
[Jun 05 14:23:22] Error: Zone file corrupt
[Jun 05 14:23:23] Shutting down...
[Jun 05 14:23:24] fatal: unable to load zone "example.local"
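A log watcher along these lines could have caught the crash sooner. A sketch only, using markers and log lines taken from the excerpt above, not a production monitor:

```python
# Sketch: scan recent named log lines and flag entries indicating the
# service died. Markers and sample lines come from the excerpt above.
FATAL_MARKERS = ("fatal:", "Shutting down")

def find_fatal(lines):
    return [l for l in lines if any(m in l for m in FATAL_MARKERS)]

log = [
    "[Jun 05 14:23:22] Error: Zone file corrupt",
    "[Jun 05 14:23:23] Shutting down...",
    '[Jun 05 14:23:24] fatal: unable to load zone "example.local"',
]
for line in find_fatal(log):
    print("ALERT:", line)
```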
STEP 4: Restart DNS
Irene restarts:
$ systemctl restart named
$ systemctl status named
● named.service - BIND DNS Server
Loaded: loaded
Active: active (running)
Now test:
$ nslookup google.com 192.168.2.2
Name: google.com
Address: 142.251.41.14
✓ Works!
Dennis immediately reports:
• Browser works
• google.com loads
• YouTube loads
• No more DNS_PROBE errors
STEP 5: Prevent future failures
Issue: What if the DNS server crashes again?
Solution: Set up redundant DNS
Option A - Customer uses Google DNS directly:
$ dhcp-client config
dns: 8.8.8.8, 8.8.4.4
Pro: faster than SprintMbale's DNS, unaffected by local DNS crashes
Con: depends on reachability to Google; less control, privacy concerns
Option B - Add fallback nameserver:
$ cat /etc/resolv.conf
nameserver 192.168.2.2 (Primary)
nameserver 8.8.8.8 (Secondary)
If 192.168.2.2 fails → tries 8.8.8.8
Timeout on primary: ~3 seconds
Then tries backup: ~1 second
Total delay: ~4 seconds
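The fallback timing above can be sketched as follows; the 3 s / 1 s figures are the assumed values from the text, not measurements:

```python
# Sketch of resolv.conf fallback timing: wait the full timeout on each
# dead server before trying the next one. Timeouts are the assumed
# values from the text.
def resolve_with_fallback(servers, alive, timeout=3.0, query_time=1.0):
    elapsed = 0.0
    for ns in servers:
        if alive.get(ns):
            return ns, elapsed + query_time
        elapsed += timeout  # timed out waiting on a dead server
    return None, elapsed

ns, delay = resolve_with_fallback(
    ["192.168.2.2", "8.8.8.8"],
    {"192.168.2.2": False, "8.8.8.8": True},
)
print(ns, delay)  # the backup answers after ~4 seconds total
```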
Option C - Set up secondary DNS server
192.168.2.3 (DNS-2 mirror)
Replicates zones from 192.168.2.2 (Primary)
Zone transfer (AXFR) every hour
$ dig @192.168.2.2 example.local AXFR
Customers use both:
nameserver 192.168.2.2
nameserver 192.168.2.3
Irene implements C + uses 8.8.8.8 as safety net
COMMON DNS FAILURES & HOW THEY MANIFEST:
1. DNS SERVER DOWN (this issue)
↓ Symptom: "Can't resolve ANY domain"
↓ Fix: Restart service or failover to backup
2. DNS CACHE POISONED
↓ Some domains resolve to WRONG IP
↓ Symptom: "google.com resolves to 1.1.1.1 instead of 142.251.41.14"
↓ Fix: FLUSH DNS CACHE
$ rndc flush
↓ Cause: Malicious DNS response cached (DNSSEC should prevent this)
3. DNSSEC VALIDATION FAILURE
↓ Domain has DNSSEC enabled, but signature invalid
↓ Symptom: "google.com not found" (even though it exists)
↓ Check: $ dig +dnssec google.com
↓ Fix: Check time sync (DNSSEC uses timestamps!)
$ timedatectl
Verify NTP is running
$ ntpq -p
4. FIREWALL BLOCKING PORT 53
↓ DNS uses UDP 53 and TCP 53 (zone transfers)
↓ Symptom: "Queries timeout, never get response"
↓ Check firewall rules:
$ show firewall filter (UDP 53 blocked?)
↓ Fix: Allow port 53
/ip firewall filter add protocol=udp dst-port=53 action=accept
5. RECURSIVE QUERY LOOP
↓ DNS server points to itself
↓ Symptom: "No response, query hangs 30 seconds"
↓ Config error: forwarders = 192.168.2.2 (itself!)
↓ Fix: forwarders = upstream.dns.server.com
6. WRONG DNS CONFIG VIA DHCP
↓ Router sends bad DNS server in DHCP offers
↓ Symptom: "Works for me, but not for other customers"
↓ Check DHCP server config:
$ show dhcp server
dns-servers = 192.168.2.99 (doesn't exist!)
↓ Fix: Update DHCP pool
$ dhcp set dns-servers 192.168.2.2
THE LESSON:
DNS is critical but fragile. It's often overlooked until it breaks.
Key DNS facts:
• UDP 53 (queries), TCP 53 (zone transfers and large responses); firewalls often block TCP 53 by mistake
• DNS timeouts hide root causes (server down? Network? Firewall?)
• Always have DNS monitoring (check /var/log/named.log regularly)
• DNSSEC adds complexity but prevents poisoning attacks
• Redundant DNS (primary + secondary) is non-negotiable for ISPs
• NTP (timekeeping) is critical for DNSSEC validation
• Customers need both ISP DNS + fallback (8.8.8.8) for resilience
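The UDP 53 queries mentioned above have a very simple wire format. A minimal sketch of building one (RFC 1035 header layout; in practice, use a resolver library):

```python
import struct

# Minimal sketch of a DNS A-record query as it travels over UDP 53
# (RFC 1035 header layout). For illustration only; use a resolver
# library in real code.
def build_dns_query(name, qid=0x1234):
    # header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT/NSCOUNT/ARCOUNT=0
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    # question: length-prefixed labels, terminating zero, QTYPE=A, QCLASS=IN
    qname = b"".join(bytes([len(lbl)]) + lbl.encode() for lbl in name.split("."))
    return header + qname + b"\x00" + struct.pack(">HH", 1, 1)

pkt = build_dns_query("google.com")
print(len(pkt))  # 28 bytes: 12-byte header + 16-byte question
```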
The Complete Picture
How all 5 Sessions Come Together:
Issue 1: Physical cable unplugged
↓
Layer 1: Electrical signal lost on the customer segment
Layer 2: Switch port down, no MAC learning for that segment
Layer 3: Routes to Iringa stay up, but packets to the customer die at the last hop
All customer traffic behind port 13 dropped
Issue 2: NAT rule missing for UDP
↓
Layer 4: OpenVPN uses UDP port 1194
Firewall sees outbound UDP, no NAT rule
Packet leaves with original private IP (192.168.1.105)
Remote VPN server can't respond to private IP
Connection times out
Issue 3: BGP rate-limited
↓
Layer 3/4: BGP (a Layer 3 routing protocol) rides on TCP port 179
Firewall drops packets exceeding 10 pps
BGP keepalives lost
Session drops every 30 seconds
Traffic reconverges but oscillates
Issue 4: DNS server crashed
↓
Layer 4: DNS uses UDP 53 (and TCP 53 for zone transfers)
DNS service crashed, not responding to queries
Firewall rule might also block port 53 (misconfiguration)
Customers can't resolve domains by name
IP addresses still work (routing intact)
Users see "Can't reach server" or "DNS lookup failed"
All required Sessions' knowledge to solve:
• Session 01: IP addresses, binary, understanding 192.168.1.0/24
• Session 02: Sockets & ports, TCP 179 for BGP, UDP 1194 for VPN, UDP 53 for DNS, DNS resolution process
• Session 03: Port understanding, opening 1194, not blocking SSH or DNS
• Session 04: Full stack — layer 1 cable, layer 2 switch port, layer 3 routing, layer 4 firewall/NAT/DNS
• Session 05 (This one): Real scenarios piecing it all together, troubleshooting methodology