Field Notes · 4 min read

Our Platform Went Down on the Biggest Sales Day of the Year

Ernest Barkhudarian, Founder

Lessons from scaling a 200-location delivery network — and everything that went wrong

The server crashed at 11 PM on one of the biggest sales days of the year.

The developer who managed the infrastructure was visiting family in another city. The hosting provider's support line had a 45-minute queue. Nobody else on the team had admin access to the server. The monitoring system had alerts configured — but they were going to a Slack channel nobody was watching.

By morning, sales had been down for eight hours. Customers went to competitors. The analytics dashboard showed the gap clearly: roughly $100K in orders that a night of that scale would normally have brought in.

One night. One server. No backup plan.

Why It Happened

Looking back, every single failure point was preventable:

  • Load had tripled versus normal capacity — but nobody had done even basic load testing before the peak period
  • A "quick update" was deployed two days before the peak — introducing instability at the worst possible time
  • Monitoring existed but wasn't actionable — alerts were configured, but nobody was assigned to watch them during off-hours
  • Admin access was concentrated in one person — when that person was unavailable, the entire team was locked out
  • There was no incident response plan — no "who do you call at midnight?" document

None of these are exotic problems. They're basic infrastructure hygiene: the kind of work that gets deprioritized while everything is running fine, and becomes catastrophic the moment it isn't.

The Pre-Peak Checklist

After that incident, I built a checklist that I now run before every high-stakes period — whether it's a seasonal peak, a product launch, or any event that will put unusual load on the system:

One week before:

  • Freeze all non-critical deployments and updates
  • Run load testing at 3x expected peak volume (a minimal load-test sketch follows this list)
  • Verify that at least two people have full admin access to all critical systems
  • Confirm monitoring alerts go to a channel that someone will actually watch (phone notifications, not just Slack)
  • Update the incident response contact list — who to call, in what order, for what types of issues
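
Here is what the load-testing item can look like in practice: a minimal sketch in Python, assuming a staging health endpoint and made-up traffic numbers. A dedicated tool such as k6 or Locust will do this more precisely; the point is simply to rehearse 3x peak before the real thing arrives.

    # Minimal load-test sketch: fire roughly 3x the expected peak volume at a
    # staging endpoint and report errors and slow responses.
    # The URL, rates, and thresholds below are placeholders, not real values.
    import time
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    TARGET_URL = "https://staging.example.com/health"  # hypothetical staging endpoint
    EXPECTED_PEAK_RPS = 50     # your normal peak, requests per second
    TEST_MULTIPLIER = 3        # test at 3x expected peak
    DURATION_SECONDS = 60
    SLOW_THRESHOLD = 1.0       # seconds; tune to your own tolerance

    def one_request(_):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
                ok = 200 <= resp.status < 400
        except Exception:
            # Any failure (timeout, connection error, 5xx) counts as an error.
            ok = False
        return ok, time.monotonic() - start

    total = EXPECTED_PEAK_RPS * TEST_MULTIPLIER * DURATION_SECONDS
    with ThreadPoolExecutor(max_workers=100) as pool:
        results = list(pool.map(one_request, range(total)))

    errors = sum(1 for ok, _ in results if not ok)
    slow = sum(1 for ok, t in results if ok and t > SLOW_THRESHOLD)
    print(f"{total} requests: {errors} errors, {slow} slower than {SLOW_THRESHOLD}s")

Run it against staging, not production, and treat rising error rates or latency at 3x volume as a signal to fix capacity before the peak, not during it.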

48 hours before:

  • Verify backups are current and tested (not just "backup ran" — actually restore and verify; see the restore sketch after this list)
  • Confirm hosting provider support contacts and expected response times
  • Brief the on-call person: "If X breaks, do Y. If Y breaks, call Z."
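
"Actually restore and verify" is the part most teams skip, so here is a minimal sketch of the idea, assuming a PostgreSQL dump, a throwaway scratch database, and a placeholder orders table for the sanity check. Adapt the commands, names, and thresholds to whatever your stack actually uses.

    # Minimal restore-and-verify sketch: load last night's backup into a scratch
    # database and check that it actually contains data.
    # Paths, database names, table names, and thresholds are placeholders.
    import subprocess

    BACKUP_FILE = "/backups/db_latest.dump"   # hypothetical backup path
    SCRATCH_DB = "restore_verify"             # throwaway database, not production
    MIN_EXPECTED_ORDERS = 100_000             # rough sanity floor for your data

    def run(cmd):
        return subprocess.run(cmd, check=True, capture_output=True, text=True)

    # Recreate the scratch database and restore into it (PostgreSQL assumed).
    run(["dropdb", "--if-exists", SCRATCH_DB])
    run(["createdb", SCRATCH_DB])
    run(["pg_restore", "--dbname", SCRATCH_DB, "--no-owner", BACKUP_FILE])

    # Verify the restored copy contains data, not just that the command exited 0.
    count = run(["psql", "-d", SCRATCH_DB, "-t", "-A",
                 "-c", "SELECT count(*) FROM orders;"]).stdout.strip()
    if int(count) < MIN_EXPECTED_ORDERS:
        raise SystemExit(f"Backup looks wrong: only {count} orders in restored copy")
    print(f"Restore OK: {count} orders found in {SCRATCH_DB}")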

During peak:

  • Designate someone to watch monitoring dashboards (not as a side task — as their primary job); see the watcher sketch after this list
  • Establish a communication channel for real-time status updates
  • Pre-authorize emergency decisions: "If the site goes down, you can restart without asking for approval"
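
The "watch the dashboards" job is easier to enforce when the check itself is automated and loud. Below is a minimal watcher sketch: poll the storefront, post failures to Slack, and after a few consecutive failures escalate to something that rings a phone. The URLs and webhooks are placeholders; in practice the escalation path is usually a paging service such as PagerDuty or Opsgenie.

    # Minimal uptime watcher sketch: poll the site, and after repeated failures
    # escalate past Slack to a channel that actually rings a phone.
    # URLs and webhooks below are placeholders, not real endpoints.
    import json
    import time
    import urllib.request

    SITE_URL = "https://shop.example.com/health"         # hypothetical health endpoint
    SLACK_WEBHOOK = "https://hooks.slack.com/services/PLACEHOLDER"
    PAGER_WEBHOOK = "https://pager.example.com/trigger"  # stand-in for a paging service
    FAILURES_BEFORE_PAGING = 3
    CHECK_INTERVAL = 30  # seconds

    def notify(webhook, message):
        body = json.dumps({"text": message}).encode()
        req = urllib.request.Request(webhook, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=10)

    failures = 0
    while True:
        try:
            with urllib.request.urlopen(SITE_URL, timeout=10) as resp:
                healthy = 200 <= resp.status < 400
        except Exception:
            healthy = False

        if healthy:
            failures = 0
        else:
            failures += 1
            notify(SLACK_WEBHOOK, f"Health check failed ({failures} in a row)")
            if failures == FAILURES_BEFORE_PAGING:
                # Slack alone is what failed us last time; this one has to reach a phone.
                notify(PAGER_WEBHOOK, "Site is down - paging the on-call person")
        time.sleep(CHECK_INTERVAL)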


Why Franchise Networks Should Care

If you run a franchise network on a shared platform, your infrastructure risk is amplified in a way that single-location businesses don't experience.

When your platform goes down, every location goes down simultaneously. A restaurant franchise with 100 locations on a shared ordering platform doesn't lose one location's worth of revenue during an outage — it loses 100 locations' worth. The financial impact scales linearly with network size.

Franchisees experience the outage but can't fix it. The frustration is acute: your franchisees are watching orders dry up and customers walk away, with zero ability to resolve the issue themselves. Every minute of downtime erodes their trust in HQ.

Peak periods are when infrastructure matters most — and when it's most likely to fail. The same seasonal surges that drive franchise revenue are the ones that expose infrastructure weaknesses. Black Friday, holiday season, local events — these are the moments when load spikes, when the platform needs to perform, and when the cost of failure is highest.

The basics are the same regardless of scale:

  • Load test before peak periods
  • Freeze deployments before high-stakes windows
  • Ensure multiple people have admin access
  • Have a documented incident response plan (a minimal escalation sketch follows this list)
  • Make sure monitoring actually reaches someone who can act
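
The incident response plan in particular does not need to be elaborate. A minimal sketch, with made-up names and numbers, is just an ordered escalation list that anyone on the team can pull up at 2 AM:

    # Minimal incident response plan sketch: who to call, in what order, for what.
    # Roles, phone numbers, and issue types are placeholders for your own network.
    ESCALATION = {
        "site_down": [
            ("On-call engineer", "+1-555-0100"),
            ("Hosting provider support", "+1-555-0101"),
            ("Founder / decision maker", "+1-555-0102"),
        ],
        "payments_failing": [
            ("On-call engineer", "+1-555-0100"),
            ("Payment provider support", "+1-555-0103"),
        ],
    }

    def who_to_call(issue: str) -> None:
        """Print the escalation order for a given issue type."""
        for step, (role, phone) in enumerate(ESCALATION.get(issue, []), start=1):
            print(f"{step}. {role}: {phone}")

    who_to_call("site_down")

Whether it lives in code, a wiki page, or a laminated sheet matters far less than the fact that it exists, is current, and is known to everyone on call.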

The Lesson

Infrastructure reliability isn't glamorous. Nobody gets excited about load testing or access management documentation. But the cost of ignoring it is concentrated into the worst possible moments — and in a franchise network, that cost multiplies across every location in the network simultaneously.

The $100K we lost that night was preventable with maybe half a day of preparation. The checklist I use now takes a few hours before each peak period. The math is simple — and yet I still see businesses skip it, every year, and learn the lesson the expensive way.


Author

Ernest Barkhudarian, Founder

17 years building tech for multi-location businesses — from flower delivery networks to e-commerce operations. Writes about what he learned scaling operations across hundreds of locations, and why he built Franchise.Family.
