Incident Response
How to respond to outages, security incidents, and other critical issues.
This document covers what to do when things go very wrong.
Incident Severity Levels
P0 - Critical (All Hands)
The service is down or severely impaired.
- Users cannot access Emberly at all
- Database is corrupted or down
- Data loss is occurring
- Security breach is active
- Major data leak detected
Response:
- ALL engineers drop what they're doing
- Ops lead declares P0
- Updates posted every 15 minutes
- Estimated recovery: < 1 hour
- Post-incident review mandatory
P1 - Major (Urgent Response)
Core functionality is broken.
- Uploads failing for many users
- Downloads broken
- Auth system down (can't log in)
- Payments system broken
- Email system down
Response:
- Senior engineers engage immediately
- Updates posted every 30 minutes
- Estimated recovery: < 4 hours
- Post-incident review required
P2 - Moderate (Prompt Response)
An important feature is broken, but the core service still works.
- OCR not processing
- Custom domains not working
- 2FA broken for some users
- Storage quotas calculated incorrectly for some users
- Report system not functioning
Response:
- Relevant team notified
- Updates posted every 2 hours
- Estimated recovery: < 24 hours
- Brief incident summary after fix
P3 - Low (Normal Troubleshooting)
A small feature or edge case is broken.
- A single user's storage total displays incorrectly
- Minor UI bug
- Incompatibility with a particular browser
- Documentation typo
Response:
- File ticket, normal priority
- Fix when team has capacity
- No status updates needed
- No incident review
P0 Incident Response
Phase 1: Declare & Investigate (0-15 min)
Immediately (first 5 minutes):
- Confirm the incident is real
  - Is emberly.ca down?
  - Can you reproduce it?
  - Is it affecting all users or specific ones?
- Sound the alarm
  - Slack: Post in #incidents
  - Format: [P0 INCIDENT] Service is down - [brief description] - @channel to notify everyone
  - Start an incident thread
- Determine scope
- Assign incident commander
  - Usually the most senior engineer available
  - Their job: coordinate the response, post updates, make decisions
  - Post in Slack: [Name] is incident commander
By 15 minutes:
- Slack alert posted
- Scope determined
- Incident commander assigned
- Team notified
- Initial update posted: "We're investigating [issue]"
Phase 2: Status Updates (Ongoing)
Post updates every 15 minutes while investigating; see Communication Templates below for example wording.
Phase 3: Fix & Restore (Varies)
Actions depend on the issue:
Database Down
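A minimal triage sketch, assuming a PostgreSQL database managed with systemd on a single host (the host name and database user are placeholders; adapt to the actual setup):

```bash
# Is Postgres reachable at all?
pg_isready -h db.internal -p 5432        # db.internal is a placeholder host

# Check the service and recent logs for the crash reason
sudo systemctl status postgresql
sudo journalctl -u postgresql --since "15 minutes ago"

# If the process is down and disk/memory look sane, restart it
sudo systemctl restart postgresql

# Verify the app can query again before declaring recovery
psql -h db.internal -U emberly -c "SELECT 1;"
```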
Deployment Gone Wrong
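If the errors line up with a recent deploy, roll back first and debug later. A rough sketch, assuming releases are tagged in git and deployed by a script; `./deploy.sh`, the tag name, and the health endpoint are placeholders for the real pipeline:

```bash
# Find the last known-good release tag
git tag --sort=-creatordate | head -n 5

# Redeploy the previous release
./deploy.sh v1.42.0

# Confirm recovery before standing down
curl -fsS https://emberly.ca/healthz && echo "health check OK"   # hypothetical endpoint
```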
Security Breach / Data Leak
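Containment comes before diagnosis: revoke whatever credential or path the attacker is using, and preserve evidence before it rotates away. A sketch of the first few minutes, with placeholder paths, schema, and IP:

```bash
# Preserve logs before they rotate (log path is a placeholder)
sudo cp -a /var/log/emberly/ /var/incident-evidence-$(date +%Y%m%d%H%M)/

# Revoke the compromised credential, e.g. mark a leaked API key revoked
# psql -c "UPDATE api_keys SET revoked = true WHERE key_id = '...';"   # hypothetical schema

# If the attack comes from one address, block it at the firewall
sudo ufw deny from 203.0.113.7
```

Then follow the escalation chain below: security and legal immediately, CTO, and CEO if it is a data breach.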
Out of Capacity
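First establish which resource is exhausted; the fix differs for disk, memory, CPU, and connections. A quick triage sketch using standard Linux tooling:

```bash
df -h        # disk usage per filesystem
free -h      # memory and swap
uptime       # load average vs. core count
ss -s        # socket/connection summary

# If disk is the problem, see the "Out of Disk Space" scenario below.
# If it's traffic, scale out app servers or rate-limit while adding capacity.
```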
Phase 4: Communicate (Every Phase)
To Users
Slack/Discord (public announcements): post short, plain-language updates as the situation changes (see Communication Templates below).
Status page (status.emberly.ca):
- Update status to "Investigating" → "Identified" → "Monitoring" → "Resolved"
- Post updates as things change
- Include ETA when confident
Email affected users directly if the incident is major and prolonged.
To Team (if public incident)
Send an internal email after 30 minutes of any P0.
Phase 5: Resolution
When the issue is fixed:
- Verify the fix is real
  - Test from multiple clients
  - Check monitoring for stability
  - Ask a customer to test
  - Run smoke tests
- Update status
- Stand down the incident
Phase 6: Post-Incident Review (Next Day)
Schedule within 24 hours of resolution.
Attendees:
- Incident commander
- Engineers involved in fix
- On-call lead
- Someone from ops/management
Agenda:
- What happened? (timeline)
- Why did it happen? (root cause)
- Why didn't we catch it? (monitoring gaps)
- How do we prevent it next time? (action items)
Output:
- Write incident report
- Assign fixes to teams
- Update monitoring/alerts
- Update runbooks
Example report:
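A possible skeleton, mirroring the agenda above (all names, times, and causes are placeholders to fill in):

```
Incident: [P0] [brief description]
Date: [date], [start]-[end] UTC
Commander: [Name]

Timeline:
  [hh:mm]  Alerts fire, #incidents thread started
  [hh:mm]  Scope confirmed: [what was affected]
  [hh:mm]  Root cause identified: [cause]
  [hh:mm]  Fix deployed, monitoring stable
  [hh:mm]  Incident stood down

Root cause: [one paragraph]
Why we didn't catch it: [monitoring gap]
Action items:
  - [fix] - owner, due date
  - [monitoring/alert change] - owner, due date
```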
P1 Incident Response
Similar to P0 but less urgent:
- Post alert to Slack
- Assign incident owner (senior eng)
- Update status page
- Post updates every 30 minutes
- Post resolution
- Brief postmortem summary (less formal than P0)
Timeline: Try to resolve within 4 hours.
P2 Incident Response
- File ticket
- Assign to relevant team
- Update every 2 hours if public-facing
- Resolve within 24 hours
- No special review needed
Common Incident Scenarios
Database Connection Pool Exhaustion
Symptoms:
- Uploads failing with "database unavailable"
- Non-upload features working fine
- Intermittent (works sometimes, fails others)
Quick Fix:
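A minimal sketch, assuming PostgreSQL: free up connections first, then restart the app tier if the pool doesn't recover (the idle threshold and the service unit name are placeholders):

```bash
# How many connections are in use, and what are they doing?
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Terminate connections that have been idle for a long time
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
         WHERE state = 'idle' AND state_change < now() - interval '10 minutes';"

# Restart the app servers so they rebuild their pools cleanly
sudo systemctl restart emberly-app    # hypothetical unit name
```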
Root Cause Check:
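Look for what is holding connections open: long-running queries, a missing pool limit, or a slow downstream call. An example query set, assuming PostgreSQL:

```bash
# Longest-running active queries
psql -c "SELECT pid, now() - query_start AS runtime, left(query, 80)
         FROM pg_stat_activity
         WHERE state = 'active'
         ORDER BY runtime DESC LIMIT 10;"

# Compare connection count against the server limit
psql -c "SHOW max_connections;"
psql -c "SELECT count(*) FROM pg_stat_activity;"
```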
Prevention:
- Alert when pool utilization exceeds 80%
- Set connection timeout to 5 minutes
- Monitor slowest queries
Out of Disk Space
Symptoms:
- All uploads start failing
- Errors: "no space left on device"
- Database might go into read-only mode
Quick Fix:
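Find and free the biggest offenders, typically logs and temp files. A sketch with standard tools (the temp path is a placeholder; verify before deleting anything):

```bash
df -h                                                 # which filesystem is full
sudo du -xh --max-depth=2 / | sort -rh | head -20     # largest directories

# Reclaim the usual suspects
sudo journalctl --vacuum-size=500M      # trim the systemd journal
sudo rm -rf /tmp/emberly-uploads/*      # placeholder temp path
sudo apt-get clean                      # package cache, if on Debian/Ubuntu
```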
Prevention:
- Alert when disk usage exceeds 80%
- Set up log rotation
- Archive old data periodically
- Monitor largest directories
Deployment Broke Production
Symptoms:
- New deploy caused 500 errors
- Specific feature now broken
- Errors started after a recent deployment
Quick Fix:
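Roll back rather than fixing forward under pressure. A sketch assuming git-based deploys; `./deploy.sh` stands in for the actual deploy pipeline:

```bash
# Confirm which deploy introduced the errors
git log --oneline -5

# Revert the offending commit and redeploy (or redeploy the previous tag directly)
git revert --no-edit <bad-commit-sha>
./deploy.sh    # placeholder for the real deploy command

# Watch the error rate to confirm recovery before standing down
```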
Root Cause:
- Was the change tested?
- Did it pass code review?
- Did automated tests pass?
- Did it work in staging?
Prevention:
- Require staging test before production deploy
- Add automated test requirements
- Have peer review on all changes
- Practice deployments in staging first
Security Incident - API Key Leaked
Symptoms:
- API key appears in GitHub issues or logs
- Unauthorized API calls from unknown source
- Multiple failed login attempts
Immediately:
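Treat the key as compromised the moment it is seen in public. A sketch of the first steps; the rotation mechanism depends on which provider issued the key, so the file and unit names below are placeholders:

```bash
# 1. Revoke the leaked key in the issuing provider's dashboard or CLI (provider-specific).
# 2. Generate a replacement and update the secret where the app reads it, e.g.:
#      /etc/emberly/app.env          # hypothetical env file
#      API_KEY=<new value>
# 3. Restart services so the new key takes effect
sudo systemctl restart emberly-app    # hypothetical unit name

# 4. Pull recent API logs to check whether the old key was used by anyone else
```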
Investigation:
- How was it leaked? (GitHub, logs, email?)
- Fix the leak (remove from history, restrict logs)
- How long was it exposed?
- Did attacker use it?
Prevention:
- Use environment variables for secrets, not code
- Scan GitHub for leaked secrets
- Rotate keys periodically
- Monitor API key usage for anomalies
Customer Data Corruption
Symptoms:
- Customer files are missing or incorrect
- Database is in an impossible state
- File sizes don't match storage quota
Immediate:
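Stop making things worse before investigating: freeze writes to the affected data if possible, and snapshot the current state so the bad data itself can be analyzed. A sketch assuming PostgreSQL; table, database, and backup paths are placeholders:

```bash
# Snapshot the affected tables before anyone "fixes" them
pg_dump -t files -t storage_quotas emberly > /backups/incident-$(date +%s).sql

# Confirm the most recent good backup exists and is restorable to a scratch database
ls -lh /backups/ | tail -5
```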
Analysis:
- Is data actually lost, or just displaying incorrectly?
- Can we restore from backup?
- How many records affected?
- Can we notify users proactively?
Prevention:
- Test database restore procedures monthly
- Have automated backups every hour
- Monitor for data anomalies
- Have data integrity checks
Runbooks
Database Reboot Runbook
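A sketch of the steps, assuming a single PostgreSQL instance managed with systemd; adjust for replicas or managed hosting:

```bash
# 1. Announce in #incidents that the database is being restarted
# 2. Optionally put the app into maintenance mode so users see a clean error
# 3. Restart and watch it come back
sudo systemctl restart postgresql
sudo journalctl -u postgresql -f      # watch logs until "ready to accept connections"

# 4. Verify
pg_isready && psql -c "SELECT count(*) FROM pg_stat_activity;"

# 5. Take the app out of maintenance mode and monitor error rates for 15 minutes
```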
App Server Restart Runbook
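A sketch assuming the app runs under systemd (the unit name, port, and health endpoint are placeholders); restart one host at a time if there is more than one:

```bash
# 1. Check current state and recent errors first
sudo systemctl status emberly-app
sudo journalctl -u emberly-app --since "10 minutes ago"

# 2. Restart and confirm it binds and serves traffic
sudo systemctl restart emberly-app
curl -fsS http://localhost:3000/healthz   # hypothetical port and endpoint

# 3. Repeat per host, then confirm monitoring and the status page are green
```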
Escalation Chain
Who to notify for different incidents:
All P0s
- Notify all engineers (@channel)
- CTO / Engineering lead
- CEO (within 30 min if user-facing)
All P1s
- Notify relevant team
- Engineering lead
P0 Ongoing > 1 hour
- Notify CEO directly
- Notify customers if public-facing
- Prepare public status update
Security incident
- Security team immediately
- Legal team immediately
- CTO
- CEO if data breach
Billing-related
- Finance team
- CEO
- Customer success team
Communication Templates
Initial Alert (First 5 min)
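A possible wording, matching the format from Phase 1 (the incident description is an example):

```
[P0 INCIDENT] Uploads are failing site-wide - investigating - @channel
Incident commander: [Name]. Updates every 15 minutes in this thread.
```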
30-Minute Update (No Progress Yet)
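A possible wording when the cause is still unknown:

```
[P0 UPDATE - 30 min] Still investigating. Confirmed: [what you know].
Ruled out: [what you've eliminated]. Next update in 15 minutes.
```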
60-Minute Update (Found Fix)
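A possible wording once a fix is identified:

```
[P0 UPDATE - 60 min] Root cause identified: [cause]. Fix in progress: [action].
ETA to recovery: [estimate]. Next update in 15 minutes or on completion.
```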
Resolution
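A possible wording for the all-clear:

```
[P0 RESOLVED] [Issue] is fixed as of [time]. Cause: [one line].
Monitoring for recurrence. Post-incident review scheduled for [date].
```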
Post-Incident Summary
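A possible outline for the summary sent after the review:

```
Summary: [what happened, in one or two sentences]
Impact: [who was affected, for how long]
Root cause: [one paragraph]
What we're changing: [action items with owners]
```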
Tools & Resources
- Status Page: https://status.emberly.ca (update live)
- Incident Slack Channel: #incidents (ping @channel for P0)
- Database: Direct SSH access, Prisma Studio for viewing
- Logs: Check ELK stack or CloudWatch (depends on setup)
- Monitoring: [Depends on your monitoring tool]
- Playbooks: Store in internal wiki
- On-call Schedule: [Link to schedule]