When a website suddenly "falls ill" in the middle of an ad campaign, the damage is not only wasted Ads spend but also lost orders, lost reputation, and internal turmoil. This satellite article walks through a 2-hour P1 troubleshooting process (practical, concise, with checklists) so that your team can calmly restore the website as quickly as possible while limiting wasted ad budget.
This article supplements the main article on website maintenance services in Ho Chi Minh City (HCM). If you need the full roadmap (SLA, process, tools), see Website maintenance service Ho Chi Minh – Pillar Page.
Typical situations when the site goes down while ads are running
Sudden load from a heavy push by TVC/Influencer/Performance campaigns → CPU/RAM/DB "burning red", requests queue until connections are exhausted, and the site times out.
Rushed updates (plugin/theme/core) right before a campaign → conflicts, PHP/JS errors, blank pages.
Weak infrastructure: cache/CDN not optimized, DB without indexes/caching, no autoscale.
DDoS/Layer 7 attack right at the peak moment, suspected to come from competitors/botnets.
Payment/cart problems: gateway timeouts, webhook failures → users cannot pay, revenue is lost.
P1 targets in 2 hours
Restore access (at minimum the homepage/landing/checkout pages) at a "good enough" level.
Reduce budget burn: point Ads to a temporary landing page, reduce the budget, or take a controlled pause.
Isolate the cause & prevent recurrence within the incident window.
Deliver a short post-mortem report within 24 hours to capture lessons learned.
120-minute (minute-by-minute) process
T-00 → T-15: QUICK REVIEW & NOTIFICATION
Enable a friendly maintenance page (503 + Retry-After) or a static failover landing page (HTML + form/CTA) to limit bounce.
Send an internal notification (Marketing/CS/Sales/Founder) in one sentence: “P1 – site downtime, ETA 120’ – updates every 15’.”
Quick check:
TTFB/uptime (StatusCake/UptimeRobot).
CPU/RAM/IO/DB connections (Cloud/Hosting dashboard).
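As an illustration of the "503 + Retry-After" maintenance page mentioned above, here is a minimal sketch using only Python's standard library. The 15-minute retry interval, the page wording, and the `build_server` helper are assumptions made for this example; in practice the same response is usually served by the CDN or web server rather than by a separate process.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

RETRY_AFTER_SECONDS = 900  # assumption: ask clients and crawlers to retry in 15 minutes

MAINTENANCE_HTML = b"""<!doctype html>
<html><head><meta charset="utf-8"><title>Urgent maintenance</title></head>
<body>
<h1>We will be back shortly</h1>
<p>The system is undergoing urgent maintenance. Please order via hotline/inbox.</p>
</body></html>"""


class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answers every GET/HEAD request with 503 and a Retry-After header."""

    def _send_maintenance(self, include_body: bool) -> None:
        self.send_response(503)
        self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(MAINTENANCE_HTML)))
        self.send_header("Cache-Control", "no-store")  # do not let caches keep the error page
        self.end_headers()
        if include_body:
            self.wfile.write(MAINTENANCE_HTML)

    def do_GET(self) -> None:
        self._send_maintenance(include_body=True)

    def do_HEAD(self) -> None:
        self._send_maintenance(include_body=False)

    def log_message(self, format, *args) -> None:
        # Keep the console quiet during the incident.
        pass


def build_server(host: str = "127.0.0.1", port: int = 8080) -> HTTPServer:
    """Hypothetical helper: returns a server that is not yet running."""
    return HTTPServer((host, port), MaintenanceHandler)


# To run it: build_server("0.0.0.0", 8080).serve_forever()
```

The 503 status tells search engines and ad platforms the outage is temporary, and `Cache-Control: no-store` reduces the risk that the error page stays cached after recovery.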
T-15 → T-30: FIRST AID TO GET THE SITE "BREATHING"
Suspected high load: enable/tighten CDN/WAF, cache pages, reduce TTL; temporarily turn off heavy queries (search, filter).
Suspected code conflict: roll back the latest release (Git/Backup), turn off the suspect plugin.
Suspected DDoS Layer 7: turn on Under Attack Mode (Cloudflare), block abnormal IPs/ASNs, rate-limit heavy endpoints (/search, /cart, /checkout).
DB congestion: flush the slow-query cache, temporarily increase max_connections, restart the service if necessary.
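Rate limiting of heavy endpoints is normally configured at the CDN/WAF, but the idea can be sketched at application level. The example below is a minimal sliding-window limiter per client IP; the class name, the limits, and the injectable clock are assumptions for illustration, and only the endpoint list comes from the text above.

```python
import time
from collections import defaultdict, deque

# Heavy endpoints named in the text; everything else is left unthrottled.
HEAVY_PREFIXES = ("/search", "/cart", "/checkout")


class SlidingWindowLimiter:
    """Allows at most `max_requests` heavy requests per IP in any `window_seconds` span."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent heavy requests

    def allow(self, client_ip: str, path: str) -> bool:
        if not path.startswith(HEAVY_PREFIXES):
            return True  # light pages are not limited in this sketch
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # caller would typically answer with HTTP 429
        hits.append(now)
        return True
```

A request handler would call `limiter.allow(ip, path)` before doing any database work and return 429 when it is `False`. The in-memory dictionary only works for a single process; a shared store would be needed behind a load balancer.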
T-30 → T-60: MINIMUM RECOVERY & REDUCE ADS COSTS
Reopen the sales/landing/checkout pages first; the blog/about pages can wait.
Marketing:
If the site is not yet stable: shift budget to a static landing page (AMP/static HTML, CDN) or a backup lead form (Google Form/Typeform).
Reduce budget by 30–70% for the ad groups that are spending hardest; temporarily turn off poor-quality placements.
Update Customer Service: “The system is undergoing urgent maintenance, please order via hotline/inbox”.
Quick QA: desktop/mobile, login/register/cart/checkout (sandbox), lead form, pixel/GA4/UTM.
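Part of the quick QA pass can be scripted. The sketch below checks that a list of critical pages answers with a non-error status; the path list and both function names are hypothetical, and it does not replace manual checks of the cart, sandbox checkout, or tracking pixels.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Sequence

# Hypothetical paths; replace them with the real critical pages of your site.
CRITICAL_PATHS = ("/", "/landing", "/login", "/register", "/cart", "/checkout")


def http_status(base_url: str, timeout: float = 5.0) -> Callable[[str], int]:
    """Builds a fetcher that returns the HTTP status code for a path under base_url."""

    def fetch(path: str) -> int:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code  # 4xx/5xx still carry a status code

    return fetch


def smoke_check(fetch_status: Callable[[str], int],
                paths: Sequence[str] = CRITICAL_PATHS) -> Dict[str, bool]:
    """Returns {path: ok}, where ok means the page answered with a 2xx/3xx status."""
    results: Dict[str, bool] = {}
    for path in paths:
        try:
            status = fetch_status(path)
        except Exception:
            status = 0  # timeouts and connection errors count as failures
        results[path] = 200 <= status < 400
    return results
```

Usage would look like `smoke_check(http_status("https://example.com"))`, where the domain is a placeholder. Any `False` entry means that page should be fixed before budget is moved back to it.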
T-60 → T-90: STABILIZE & HARDEN
Additional protection:
Cache full-page HTML (except cart/checkout).
Defer/async JS, disable unnecessary scripts.
Limit heavy features: live search, comparison, related suggestions.
Infrastructure:
Temporarily increase vCPU/RAM, enable autoscale if available.
Move media to a CDN if not already done.
DB:
Add/optimize indexes for frequently queried tables (orders, posts, products).
Turn on object cache (Redis/Memcached).
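The rule "cache full-page HTML except cart/checkout" can be sketched as a small in-memory cache with a bypass list. The class, the 60-second TTL, and the render callback are assumptions for illustration. A real setup would use the CDN or a caching plugin and would also have to skip logged-in users and personalized pages.

```python
import time
from typing import Callable, Dict, Tuple

# From the text: cart and checkout must never be served from the page cache.
BYPASS_PREFIXES = ("/cart", "/checkout")


def is_cacheable(method: str, path: str) -> bool:
    """Only GET requests outside the bypass list are cached."""
    return method.upper() == "GET" and not path.startswith(BYPASS_PREFIXES)


class PageCache:
    """Tiny TTL cache for rendered HTML, keyed by path."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable for testing
        self._store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_render(self, method: str, path: str, render: Callable[[], str]) -> str:
        if not is_cacheable(method, path):
            return render()  # always fresh for cart/checkout and non-GET requests
        now = self.clock()
        entry = self._store.get(path)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            self.hits += 1
            return entry[1]
        self.misses += 1
        html = render()
        self._store[path] = (now, html)
        return html
```

The `hits` and `misses` counters give the hit/miss measurement that shows whether the cache is actually doing its job.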
T-90 → T-120: LIGHT LOAD TEST & REALLOCATE ADS
Light load test (main user journey scenario, 5–10 req/s).
Reallocate ads: start with small groups, then increase gradually according to the load threshold.
Incident logging: timeline – preliminary cause – applied changes – backlog to be done later.
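"Gradually increasing according to load threshold" can be turned into a simple rule: raise the budget one step while the site is healthy, and cut it when it is not. The sketch below is one possible rule. The thresholds (p95 latency of 800 ms, 1% error rate) and the step factors are assumptions to tune per site, not values from this article.

```python
def next_budget(current: float, full: float, p95_ms: float, error_rate: float,
                max_p95_ms: float = 800.0, max_error_rate: float = 0.01,
                step_up: float = 1.25, step_down: float = 0.5) -> float:
    """Suggests the next budget level for an ad group.

    current    : budget currently applied
    full       : budget the group had before the incident (upper bound)
    p95_ms     : 95th percentile response time observed at the current level
    error_rate : share of failed requests (0.0 to 1.0) at the current level
    """
    healthy = p95_ms <= max_p95_ms and error_rate <= max_error_rate
    if healthy:
        # Step up, but never beyond the pre-incident budget.
        return min(full, current * step_up)
    # The site is struggling: cut back and let it recover before retrying.
    return current * step_down
```

For example, a group restarted at 100 with healthy metrics would move to 125 on the next check, while a group showing a 1,500 ms p95 would be halved.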
Roles & communication channels (5 people are enough)
Incident Lead (DevOps/Tech Lead): makes technical decisions, sends updates every 15’.
Web Engineer: rollback, hotfix, feature off/on, technical QA.
Performance/Marketing: adjusts Ads/landing pages and customer service messages.
CS/CRM: receives leads/orders manually, reassures customers, collects feedback.
Stakeholder (PM/Founder): approves public messages & prioritizes resources.
Channel: 1 general chat group (Slack/Telegram), 1 task board (Trello/Jira) → avoid information chaos.
Quick combat checklists
A. Technical (restore in 2 hours)
Enable 503 or static landing (with CTA).
CDN/WAF: Under Attack, rate-limit, bot fight.
Roll back code/plug-ins, disable anything newly updated.
Reopen key pages first; turn off heavy functions.
Redis/Memcached + full page cache; reduce DNS/CDN TTL.
DB: temporarily increase connections, optimize slow queries, restart if locked.
QA checkout/form/pixel/GA4/UTM.
B. Marketing/CS (reduce budget burn)
Adjust campaign budgets, temporarily turn off risky ad groups.
Redirect to static landing/backup form.
Brief notice on fanpage/CS: urgent maintenance underway.
Manual lead collection (hotline/inbox), entered into the CRM later.
C. After stabilization (24h)
Post-mortem: root cause (RCA) + preventive actions.
Infrastructure optimization plan (cache/CDN/WAF/autoscale).
Schedule for restore drills & safe updates.
To understand the overall monthly maintenance picture (calendar, cadence, reports), refer to Monthly website maintenance process and the strategic perspective in the articles on maintenance & after-sales.
6 common root causes & how to "plug the hole"
Code/plug-in conflicts right before the campaign
Prevention: staging → QA → deployment with a checklist; freeze code 48 hours before the campaign starts.
Cache/CDN not optimized
Monitor: cache hit/miss, TTFB.
Fix: turn on full page cache, move media to CDN, reduce scripts.
DB congestion, slow queries
Prevention: add indexes, review query plans, pagination, caching.
Monitor: slow query log, max connections.
Fix: quickly optimize heavy queries, increase resources, split off replicas if available.
DDoS/Layer 7 at the peak moment
Prevention: WAF, bot management, thresholds per IP/ASN.
Monitor: unusual spikes by country/ASN.
Fix: Under Attack mode, challenges, block ranges.
Weak infrastructure, no autoscale
Prevention: load benchmarks, autoscale, separate DB/file storage.
Fix: temporarily upgrade the configuration, then plan the architecture after the campaign.
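The "Monitor" lines above become actionable once they are tied to numbers. As a small sketch, the functions below compute a cache hit ratio from hit/miss counters and flag when it, or TTFB, crosses a threshold. The 80% and 600 ms thresholds are assumptions for illustration only.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Share of requests served from cache; 0.0 when there is no traffic yet."""
    total = hits + misses
    return hits / total if total else 0.0


def needs_attention(hits: int, misses: int, ttfb_ms: float,
                    min_ratio: float = 0.8, max_ttfb_ms: float = 600.0) -> bool:
    """True when the cache is underperforming or the server is slow to respond."""
    return cache_hit_ratio(hits, misses) < min_ratio or ttfb_ms > max_ttfb_ms
```

With 50 hits and 50 misses the ratio is 0.5, so the check fires even if TTFB still looks acceptable. That is often the earlier warning of the two.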
Sample communication message (short – reassuring – with an alternative)
“Sorry for the inconvenience! A sudden increase in traffic has temporarily overloaded the system. The technical team expects to restore service within 120 minutes. In the meantime, you can order via the backup link or message our inbox/Hotline 09xx… for immediate support. Thank you for your understanding!”
After the incident: 7 things to do to no longer have P1
Freeze code 48–72 hours before the campaign.
Staging required + QA checklist, with 1-click rollback.
CDN + full page cache + object cache that actually work (measure hit/miss).
WAF/Bot management presets for peak hours.
Autoscale plan (at least a temporary scale-up).
DR/Backup: snapshot before the campaign; run restore drills periodically.
Runbook “P1-Ads”: who does what, on which channel, with which message; post it in the war room right away.
When should you outsource a maintenance team?
You don't have 24/7 on-call coverage, or don't have a DevOps engineer.
In that case, see the Web Maintenance service page to choose the appropriate package. If you need a long-term strategic perspective in HCM (SLA, roles, processes), read the pillar article on website maintenance in Ho Chi Minh.
Receive a free 60’ technical audit and a specific SLA proposal.
Refer to the actual maintenance roadmap in HCM in the in-depth article and the monthly maintenance process.
Or contact us directly via Web Maintenance Services for advice on suitable packages.
A website going down while running Ads is a risk that can be controlled if the team has a clear P1 runbook, prioritizes minimum recovery within 120 minutes, keeps budget burn under control, and closes architectural weaknesses after the incident. Don't wait for the next downtime to make a plan.
Need a 90-day plan to be "bulletproof" during the campaign season?