Incident Response Plan
This document offers guidance and a common vocabulary for responding to incidents, whether they relate to security or to the behavior of the product.
TL;DR: If something bad happens, report it clearly in #tech-incidents, including your assessment of severity (really bad issues are Critical Severity, Sev-1, or High Severity, Sev-2). Escalate based on severity. The response team should follow the protocol below.
Escalation
Message #tech-incidents to notify the team of issues. Be a good witness: behave as if you were reporting a crime, and include specific details (links, time of observation, repro steps) about what you have discovered, as well as your assessment of the severity of the issue.
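As an illustration of what a "good witness" report might contain, here is a minimal sketch in Python. The `IncidentReport` class, its field names, and the message layout are hypothetical and not part of any existing tooling; they simply collect the details listed above into one message.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentReport:
    """Hypothetical container for the details a reporter should include."""
    summary: str
    severity: int                 # reporter's own assessment, 1 (critical) to 4 (low)
    observed_at: datetime         # time of observation
    links: list[str] = field(default_factory=list)
    repro_steps: list[str] = field(default_factory=list)

    def to_message(self) -> str:
        """Render the report as plain text suitable for pasting into a channel."""
        lines = [
            f"[Sev-{self.severity}] {self.summary}",
            f"Observed: {self.observed_at:%Y-%m-%d %H:%M}",
        ]
        if self.links:
            lines.append("Links: " + ", ".join(self.links))
        for number, step in enumerate(self.repro_steps, start=1):
            lines.append(f"Repro {number}: {step}")
        return "\n".join(lines)


if __name__ == "__main__":
    report = IncidentReport(
        summary="Checkout page returns an error",
        severity=2,
        observed_at=datetime(2023, 12, 20, 14, 5),
        links=["https://example.com/checkout"],
        repro_steps=["Add an item to the cart", "Click pay"],
    )
    print(report.to_message())
```

The severity value is only the reporter's initial assessment; the team may revise it once more is known.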
Severity
Assigning severity is not an exact science, and reasonable folks can disagree. Having a common vocabulary is a way to reduce confusion, gradually align on what constitutes an emergency, and accelerate responses. Judgment always trumps definitions.
Sev-1: Critical Severity
Critical issues relate to actively exploited security risks, or to site issues with a massively negative impact on the business. These should be extremely rare, and resolving one is the highest priority for the entire company.
Example: a malicious actor is actively leveraging a security vulnerability; confidential user information is compromised.
Reporting/Escalation: Critical severity issues should involve a message to “@channel” in #tech-incidents as well as direct messages to Bomee and Marc for awareness. Continue escalation (Slack or phone) until you receive acknowledgement that the team is on it.
Response Expectation: stakeholders should give Sev-1s absolute priority over any other task. A Sev-1 justifies extraordinary measures, e.g. phone calls at 2 AM.
Sev-2: High Severity
High severity issues relate to production site problems that severely impact our users' ability to use our systems, or to security vulnerabilities where the presence of an adversary or active exploitation has not yet been proven, and may not have occurred, but is likely to occur.
Examples:
- A critical functionality of the site is not accessible [during NY business hours], making it unusable.
- Security vulnerabilities are discovered (e.g. backdoors, malware, malicious access of business data).
Reporting/Escalation: High severity issues should involve a message to “@channel” in #tech-incidents. Continue escalation until you receive acknowledgement that the dev team is on it.
Response Expectation: the response owner(s) should give Sev-2s absolute priority over any of their other tasks, and ask for help to expedite resolution.
Sev-3 and Sev-4: Medium and Low Severity
Issues at this severity are simply suspicions or odd behaviors. They are not verified and require further investigation. There is no clear indicator that systems are at tangible risk. This includes suspicious emails, outages, strange activity on a laptop, and site issues that users can easily work around.
Examples:
- Sev-3: Links to advanced search are gone, but users can access them via direct URLs.
- Sev-4: The release documentation page isn't accessible on the site.
Reporting/Escalation: these issues can be processed via standard bug triage, or messages to the relevant Slack channels (not #tech-incidents).
Response Expectation: these issues do not require emergency response.
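The severity levels above can be summarized as a small lookup table. The sketch below is illustrative only: `EscalationPolicy`, `POLICIES`, and `escalation_steps` are hypothetical names, and the step wording paraphrases the reporting rules in this section rather than describing any existing automation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    """Hypothetical summary of the reporting rules for one severity level."""
    label: str
    notify_channel: bool               # post an "@channel" message in the incidents channel
    direct_messages: tuple[str, ...]   # people to message directly for awareness
    emergency: bool                    # True if an emergency response is expected


POLICIES = {
    1: EscalationPolicy("Critical", True, ("Bomee", "Marc"), True),
    2: EscalationPolicy("High", True, (), True),
    3: EscalationPolicy("Medium", False, (), False),
    4: EscalationPolicy("Low", False, (), False),
}


def escalation_steps(severity: int) -> list[str]:
    """Return the reporting steps for a severity level (1 to 4)."""
    policy = POLICIES[severity]
    if not policy.emergency:
        return ["File via standard bug triage or the relevant Slack channel"]
    steps = []
    if policy.notify_channel:
        steps.append("Post an @channel message in the incidents channel")
    steps += [f"Direct message {name}" for name in policy.direct_messages]
    steps.append("Keep escalating until the team acknowledges")
    return steps


if __name__ == "__main__":
    for level in sorted(POLICIES):
        print(f"Sev-{level} ({POLICIES[level].label}):")
        for step in escalation_steps(level):
            print(f"  - {step}")
```

As the Severity section notes, judgment always trumps definitions, so a table like this is a starting point rather than a rule engine.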
Response Steps
For both Sev-1 and Sev-2 issues, the response team will follow an iterative response process designed to investigate, contain exploitation, remediate the vulnerability, and document a post-mortem with the lessons of the incident.
- The person who is initially responding to the incident will take on the role of Incident Lead. They may choose to pass this baton to another person as the incident proceeds (e.g. if somebody with more experience becomes available.)
- When a member of the dev team has assessed that the issue warrants Sev-1 or Sev-2 severity, the Incident Lead should create a temporary Slack channel war room (e.g. #2023-12-20-war-room-error-500) where stakeholders can huddle and track progress.
  - A link to this temporary channel should be posted in #tech-incidents so that anybody available can join.
- The team should first focus on collecting and sharing information. The Incident Lead may direct participating team members to look at different information sources.
- The Incident Lead should, throughout the incident, determine whether the right team is gathered to work on the problem. If specific individuals are needed, they can attempt to contact them to ask them to join. Do not hesitate to pick up the phone and call somebody if the incident is of sufficient urgency.
- As theories develop, the Incident Lead should try to prioritize, and divide the available team members to work on different investigatory threads. They may check in with each group periodically via verbal prompt on the huddle, or by messaging in the slack channel. They may also simply request that each group report back every 5 or 10 minutes with a quick status update.
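The war room naming pattern shown in the example above (date, then "war-room", then a short description) could be generated consistently with a small helper. This is a sketch under assumptions: the function name `war_room_channel` is hypothetical, and the pattern is inferred from the single example channel name in this document.

```python
import re
from datetime import date


def war_room_channel(opened_on: date, description: str) -> str:
    """Build a war room channel name such as #2023-12-20-war-room-error-500."""
    # Lowercase the description and collapse anything that is not a letter
    # or digit into single hyphens, trimming hyphens from both ends.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#{opened_on.isoformat()}-war-room-{slug}"


if __name__ == "__main__":
    print(war_room_channel(date(2023, 12, 20), "Error 500"))
```

Putting the date first keeps war room channels sorted chronologically in the channel list.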
Making Changes
- It’s important that changes are made in a controlled manner. Before any change is applied, it must follow our regular change control process. It’s very easy to make a bad situation worse through a series of panicked changes which aren’t documented. Breathe!
- Whenever possible, changes should be applied to and evaluated in the staging environment.
On Incident Resolution
The Incident Lead should:
- If the root cause has yet to be identified, create a ticket for follow-up, assign it, and mark it with the appropriate priority (usually “high”).
- Announce in #tech-incidents that the incident is over, and provide a summary of the resolution. Also state whether the root cause has been identified.
- Update the Incident Log with a short summary of what occurred and what the solution was, so that if a similar issue occurs in the future, it can be resolved more quickly.
- Add an agenda item for the upcoming architecture meeting, in order to brief the team on what happened.
The team member “closest to the issue” should:
- Instantiate a post-mortem doc soon after the incident is resolved.
Response Team Members
| Name | Function | Phone |
| Chuck | Tech Lead | (646) 201-8770 |
| Francois | Head of Engineering | (831) 239-8570 |
| Bomee | CEO - Escalation | (917) 446-2049 |
| Marc | CEO - Escalation | (917) 575-6337 |
Incident Log
We maintain an Incident Log.