This is Failure Friday. We end the week with a story about the invisible layer of the internet that breaks exactly when you need it to work.
It works on my machine.
In 2024, we decided to migrate our main application from a legacy AWS EC2 instance to a modern Vercel deployment. The plan was simple:
Deploy the new site on Vercel. (Done)
Update the DNS records (A Record) to point to the new IP. (Done)
Shut down the old EC2 instance to save money. (Done)
I ran ping my-site.com on my laptop. It replied from the new Vercel IP. I asked the CEO to check. It worked for him. We high-fived. We went home.
Two hours later, support tickets started flooding in.
"Site is down."
"I'm getting a 502 Bad Gateway."
"I can't log in."
But when I checked, it was up. I was seeing the Future. Our customers were stuck in the Past.
1. The Failure: The TTL Trap
I had fallen victim to DNS Propagation. The Domain Name System is effectively a massive, distributed, eventually-consistent cache.
When you tell the world "My IP is now 1.2.3.4," you aren't telling the users directly. You are updating the record on your authoritative name server. Users ask a Recursive Resolver (like Google's 8.8.8.8 or their ISP's), which fetches the answer from the authoritative server and then holds onto it for a specific amount of time, defined by the TTL (Time To Live).
My TTL was set to 86400 seconds (24 hours). Even though I updated the record, any resolver that had already cached the old answer was allowed to keep serving the old (dead) IP address for up to another full day. I had shut down the old server, so every user behind one of those resolvers was trying to connect to a ghost.
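To make the trap concrete, here is a toy model of a caching resolver. It is a sketch, not how any real resolver is implemented: the CachingResolver class, the simulated clock, and the two documentation-range IP addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    ip: str
    expires_at: float  # seconds on the simulated clock


class CachingResolver:
    """Toy recursive resolver: reuses a cached answer until its TTL runs out."""

    def __init__(self, authoritative: dict):
        # authoritative maps name -> (ip, ttl_seconds). It stands in for
        # the record you edit at your DNS provider.
        self.authoritative = authoritative
        self.cache = {}

    def resolve(self, name: str, now: float) -> str:
        hit = self.cache.get(name)
        if hit is not None and now < hit.expires_at:
            return hit.ip  # still "fresh" as far as the resolver knows
        ip, ttl = self.authoritative[name]
        self.cache[name] = CachedAnswer(ip, now + ttl)
        return ip


# Hypothetical addresses: old EC2 box, 24-hour TTL.
records = {"my-site.com": ("203.0.113.10", 86400)}
resolver = CachingResolver(records)

first = resolver.resolve("my-site.com", now=0)             # caches the old IP
records["my-site.com"] = ("198.51.100.20", 86400)          # we "update DNS"
two_hours_later = resolver.resolve("my-site.com", now=2 * 3600)
next_day = resolver.resolve("my-site.com", now=86400)
```

Two hours after the update, this resolver still hands out the old address. It only asks the authoritative side again once the 24 hours are up.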
2. THE FIX: The "Idempotency Check" & Separation
You cannot force the internet to update. But you can trick it.
If you are planning a migration on Friday, you must start on Monday.
Monday: Log in to your DNS provider (Cloudflare/GoDaddy/AWS).
Action: Lower the TTL from 86400 (24 hours) to 300 (5 minutes).
Wait: Wait at least 24 hours. This ensures the old "long" cache expires everywhere.
Friday (Migration Day): Now, when you update the IP, resolvers will only cache the old one for 5 minutes. The world switches over almost instantly.
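The Monday-to-Friday arithmetic can be sketched in a few lines. The function name migration_plan and the example date are assumptions for illustration. The model also assumes resolvers honor the TTL they were given.

```python
from datetime import datetime, timedelta


def migration_plan(ttl_lowered_at: datetime, old_ttl: int, new_ttl: int):
    """Return (earliest_safe_cutover, worst_case_stale_window).

    A resolver may have cached the record one moment before the TTL was
    lowered, so the old TTL must fully elapse before the short TTL is
    the only one in circulation.
    """
    earliest_cutover = ttl_lowered_at + timedelta(seconds=old_ttl)
    # After cutover, a cached old IP can survive for at most the new TTL.
    stale_window = timedelta(seconds=new_ttl)
    return earliest_cutover, stale_window


lowered = datetime(2024, 1, 1, 9, 0)  # hypothetical Monday, 09:00
cutover, stale = migration_plan(lowered, old_ttl=86400, new_ttl=300)
print(cutover)  # 2024-01-02 09:00:00
print(stale)    # 0:05:00
```

Starting on Monday for a Friday migration leaves several days of slack on top of the 24-hour minimum.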
The Engineering Takeaway: Never shut down the old infrastructure until traffic drops to zero. In a migration, "Old" and "New" must run in parallel for at least 48 hours. This is called a Blue/Green Deployment at the infrastructure level.
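One way to turn that takeaway into a guard rail is a small check you run before anyone touches the "terminate" button. This is only a sketch: the function name, the idea of feeding it hourly request counts from the old server's logs, and the 6-hour quiet window are my assumptions. The 48 hours comes from the rule above.

```python
def safe_to_decommission(
    hourly_requests: list[int],
    min_parallel_hours: int = 48,
    quiet_hours: int = 6,
) -> bool:
    """Decide whether the old server can be shut down.

    hourly_requests holds one request count per hour since cutover,
    oldest first. Shutdown is allowed only if old and new have run in
    parallel long enough AND the old server has recently seen no traffic.
    """
    if len(hourly_requests) < min_parallel_hours:
        return False  # not enough parallel running time yet
    recent = hourly_requests[-quiet_hours:]
    return all(count == 0 for count in recent)


print(safe_to_decommission([5] * 10))               # False: only 10 hours in
print(safe_to_decommission([100] * 42 + [0] * 6))   # True: 48 hours, last 6 quiet
print(safe_to_decommission([100] * 47 + [1]))       # False: still getting hits
```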
3. THE CEREBRAL GYM: Solution & New Puzzle
Yesterday's solution (Distributed Locking)
The puzzle was: You use a Redis key lock:job to prevent two servers from running the same cron job. If the server crashes, the lock stays forever (Deadlock). What parameter fixes this?
The Answer: TTL (Time To Live) / Expiration. When you set the lock, you must attach an expiry: SET lock:job "server-1" EX 60 NX
EX 60: If I die, this key self-destructs in 60 seconds.
NX: Only set this if it does Not eXist (ensures only one winner).
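To see why the two options work together, here is an in-memory stand-in that mimics just the NX and EX behavior. FakeRedis is not real Redis and the simulated clock is an assumption. With the redis-py client, the equivalent real call would be r.set("lock:job", "server-1", nx=True, ex=60).

```python
class FakeRedis:
    """In-memory model of SET with NX and EX (illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ex, nx, now):
        current = self._data.get(key)
        if current is not None and now >= current[1]:
            current = None  # an expired key behaves as if it never existed
        if nx and current is not None:
            return False    # someone else still holds the lock
        self._data[key] = (value, now + ex)
        return True


r = FakeRedis()
# server-1 grabs the lock, then crashes without releasing it.
a = r.set("lock:job", "server-1", ex=60, nx=True, now=0)
# server-2 tries 10 seconds later: the lock is still held.
b = r.set("lock:job", "server-2", ex=60, nx=True, now=10)
# After the 60-second expiry the key is gone, so server-2 wins.
c = r.set("lock:job", "server-2", ex=60, nx=True, now=61)
print(a, b, c)  # True False True
```

Without EX, the second attempt would fail forever. Without NX, both servers would "win" at the same time.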
Today's puzzle (The Split Brain): System Design Friday
You have a database cluster with one Leader (Read/Write) and two Followers (Read-Only). The network cable gets cut.
Zone A: Has the original Leader.
Zone B: Has the two Followers. They can't see the Leader, so they vote and elect a new Leader among themselves.
Now you have two Leaders accepting writes. When the network comes back, the data is corrupted because both sides have different histories.
The Question: What is the specific concept (a number) required in the voting process to ensure that Zone A knows it is too small to stay the leader and shuts itself down?
(Reply with the term!)
4. THE PULSE: Tools of the Week
DNSChecker.org Before you celebrate a migration, check this site. It looks up your domain using DNS servers in 30 different countries and shows which answer each one returns. It gives you a reality check: "It works in New York, but it's broken in Tokyo." Link: dnschecker.org
Bruno (The Postman Killer) Postman has become bloated, slow, and forces cloud sync. Bruno is a new open-source API client that lives offline. It saves your API collections as simple text files in your Git repo (so you can version control them!). It is fast, clean, and free. Link: usebruno.com
HeyGen (Video Avatar) This is scary good. I used this to generate a "Demo Video" for a client. You type the text, and an AI avatar speaks it with perfect lip-sync. For documentation or onboarding videos, it beats recording yourself 50 times. Link: heygen.com
5. THE LATENT SPACE
"It's always DNS."
We like to think we control our software. But once a packet leaves your data center, it travels through a chaotic web of routers, caches, and cables that you do not own. Resiliency isn't about controlling the network. It's about assuming the network is lying to you.
Have a safe weekend. Don't deploy today.
See you tomorrow.
Harsh Kathiriya - Query & Context

