SOC Metrics that Matter: MTTR, MTTD, MTTI, False Negatives, and more

Grant Oviatt
July 24, 2024

Security Operations Center (SOC) metrics are quantifiable measures of an organization’s security effectiveness. They are jumping-off points for a SOC manager to dive into potential issues, improve operational workflows, and ultimately reduce risk. In a security world full of blinking lights and charts, it can be hard to know which operational metrics are worth the energy to collect and what information they can actually provide. From our perspective, there are three main areas of awareness that metrics should drive for a SOC leadership team.

  1. Threat detection and response effectiveness
  2. Analyst team cognitive load
  3. Business growth preparedness

This blog provides a starting place for managers to build out metrics for these three areas to bolster decision-making.

Threat detection and response effectiveness


When evaluating this area, you’re really trying to understand whether you have the visibility and response time to contain threat actors before they can achieve their objectives. Secureworks, in its latest State of the Threat report, reported that the median time for a ransomware operator to achieve their objectives is just under 24 hours. Meeting that window comes down to finding threats, investigating threats, and responding to threats.

What is Detection Coverage?

Detection coverage measures the percentage of techniques in a known framework, most commonly MITRE ATT&CK, for which your team has implemented and tested detections. The purpose of this measure is to illustrate the likelihood that your operations team would receive some form of signal in the event of an incident.

How do I measure detection coverage?

In ATT&CK’s case, there are 194 techniques or behaviors commonly associated with threat actors prior to data exfiltration and impact. You simply divide the number of techniques uniquely covered by your detections by the total number of techniques.
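As a quick illustration (hypothetical numbers, not pulled from a real environment), the arithmetic looks like this:

```python
# Hypothetical example: detection coverage against MITRE ATT&CK
total_techniques = 194        # techniques in scope for the framework
covered_techniques = 127      # techniques with at least one implemented, tested detection

detection_coverage = covered_techniques / total_techniques
print(f"Detection coverage: {detection_coverage:.1%}")   # -> 65.5%
```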

What does good detection coverage look like?

For most teams, striving for 100% coverage of something like MITRE ATT&CK is both inadvisable and unachievable. The operational energy to build, implement, and review the resulting detections would consume a considerable number of team cycles and require a long tail of tuning and maintenance.

However, the probabilities suggest that teams don’t need perfect coverage to identify threats. Remember, a threat actor performs a limited set of behaviors and has to accomplish every step in the threat lifecycle to achieve their objectives. With the simplifying assumptions that:

  • Every technique or behavior is equally likely to be used by a threat actor (not true in practice)
  • Your current visibility would record some form of monitored telemetry for 7 of the 10 techniques that would occur prior to Impact (assuming a minimum of 1 per lifecycle stage)
  • Your team wants to detect at least 3 of those behaviors for a threat actor without further investigation

That would require roughly 127 techniques to be covered, or ~65% detection coverage, to achieve those outcomes. Teams with less experienced investigators typically want to opt for higher coverage on the detection front to account for slower investigation times.
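One way to sanity-check an estimate like this is a simple binomial model: treat each of the 7 visible techniques as independently covered with probability equal to your coverage percentage, and ask how likely it is that at least 3 of them fire. This is a simplifying model of my own for illustration, not a precise reconstruction of the math above:

```python
from math import comb

coverage = 127 / 194          # ~65% of techniques have a tested detection
visible_techniques = 7        # techniques expected to generate telemetry before Impact
needed_detections = 3         # independent signals you want before escalating

# P(at least `needed_detections` of the visible techniques are covered),
# assuming each technique is covered independently with probability `coverage`
p_at_least = sum(
    comb(visible_techniques, k) * coverage**k * (1 - coverage) ** (visible_techniques - k)
    for k in range(needed_detections, visible_techniques + 1)
)
print(f"Chance of seeing 3+ detections: {p_at_least:.0%}")   # roughly 95%
```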

How should this metric direct action?

  1. Justify and validate that the team has appropriate telemetry visibility into the organization
  2. Direct what detections the team should focus on building. Remember, there isn’t an equal likelihood for every behavior. Make sure to invest in behaviors that are most common (as seen in threat intelligence, threat reports, or even social media).

Additional Notes: In the event your team is light on capacity and needs to suppress detections, the strong candidates will be ones with high false positive ratios that are already covered by another detection. Your effective coverage won’t decrease but your overall capacity will increase.

What is Mean Time to Detect (MTTD)?

Mean time to detect measures the average amount of time it takes for an organization to identify an incident from the point the activity started. This measure evaluates both your detection coverage and your alert ingestion pipeline.

How do I measure MTTD?

Mean Time to Detect is calculated by subtracting the “Activity Started At” time of the earliest event in an incident from the “Alerted At” time, then averaging across incidents.
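A minimal sketch of that calculation, assuming you can export per-incident timestamps (the field names below are hypothetical):

```python
from datetime import datetime, timedelta

# Hypothetical incident records exported from a SIEM or case management tool
incidents = [
    {"activity_started_at": datetime(2024, 7, 1, 9, 0),  "alerted_at": datetime(2024, 7, 1, 9, 42)},
    {"activity_started_at": datetime(2024, 7, 3, 14, 5), "alerted_at": datetime(2024, 7, 3, 16, 20)},
]

# MTTD = average of (alerted_at - activity_started_at), using the earliest event per incident
deltas = [i["alerted_at"] - i["activity_started_at"] for i in incidents]
mttd = sum(deltas, timedelta()) / len(deltas)
print(f"MTTD: {mttd}")
```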

What is a good MTTD?

The closer to zero, the better. Generally, high-performing organizations fall somewhere between 30 minutes and 4 hours. Latency tends to come from events early in the attack lifecycle, like VPN authentications using compromised but valid credentials, which can be harder to identify until the threat actor takes more direct action on a managed device in the environment.

How should this metric direct action?

  1. Validate that where you’re investing in detection coverage is paying off in incident identification
  2. Drive post-incident detection reviews to validate the earliest activity and glean any potential behaviors for new detections

What is Mean Time To Respond (MTTR)?

Mean time to respond measures the average amount of time it takes for your security team to go from security event to containment or resolution. This measure encompasses your entire operational pipeline, from ingesting security telemetry and acknowledging the alert through triage, investigation, and initial containment.

How do I measure MTTR?

Mean Time to Respond is calculated by taking the total time from activity occurrence to stopping the bleeding for each alert, then averaging those durations. Typically this is measured across a 30-day period.
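A minimal sketch over a 30-day window, assuming each alert record carries an activity-start and a containment timestamp (hypothetical field names); the same structure works for the median variant mentioned in the additional note below:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical alert records from the last 30 days
alerts = [
    {"activity_started_at": datetime(2024, 7, 2, 8, 0),   "contained_at": datetime(2024, 7, 2, 9, 30)},
    {"activity_started_at": datetime(2024, 7, 9, 13, 0),  "contained_at": datetime(2024, 7, 9, 17, 45)},
    {"activity_started_at": datetime(2024, 7, 20, 22, 0), "contained_at": datetime(2024, 7, 21, 0, 15)},
]

durations = [a["contained_at"] - a["activity_started_at"] for a in alerts]
mean_ttr = sum(durations, timedelta()) / len(durations)
median_ttr = median(durations)    # more resistant to outliers (see the additional note below)
print(f"Mean TTR: {mean_ttr}, Median TTR: {median_ttr}")
```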

What is a good MTTR?

Two to four hours is a generally acceptable range across all alert severities. The lower the MTTR, the lower the risk of a significant security incident.

Organizations may also choose to split their MTTR by detection severity and assign different levels of priority. In that case, the following would be strong starting points:

Critical: 1 hour
High: 2 hours
Medium: 4 hours
Low: 8 hours

How should this metric direct action?

  1. SOC managers should investigate if there are significant differences in MTTR between alert sources, alert types, and individuals to identify pain points or training opportunities
  2. MTTR should force a SOC manager deeper into their investigative process. Is the majority of the time spent on alert ingestion, alerts waiting to be actioned, triage and investigation, or initial response actions?

Additional Notes: I’m personally a bigger fan of Median Time to Respond over Mean Time to Respond. Medians will naturally have a greater resistance to outliers that may throw off your metrics.

What is False Negative Rate / Investigation Error Rate?

Error rate or false negative rate measures how often an analyst or automation incorrectly dispositions true positive activity as a false positive. This informs security teams of poor documentation, lack of training, or alert fatigue that may result in missed critical detections.

How do I measure false negative rate?

False negative rate is notoriously challenging to measure – after all, if you knew where your errors were, you wouldn’t have made them in the first place. Shameless plug: check out the quality control process laid out by Expel that uses sampling to simplify your false negative learnings.
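As a rough illustration of the sampling idea (a generic sketch, not Expel’s exact process), you can pull a random sample of alerts your team closed as false positives each week and have a senior analyst re-review them:

```python
import random

# Hypothetical list of alert IDs closed as false positives this week
closed_as_false_positive = [f"alert-{i}" for i in range(1, 501)]

# Pull a fixed-size random sample for senior-analyst re-review
sample = random.sample(closed_as_false_positive, k=25)

# After re-review, count how many turned out to be true positives (missed activity)
missed = 1   # example outcome of the re-review
estimated_false_negative_rate = missed / len(sample)
print(f"Estimated false negative rate: {estimated_false_negative_rate:.1%}")   # 4.0% here
```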

What is a good false negative rate?

At or below 1% error.

How should this metric direct action?

  • SOC managers should investigate if there are process improvements that need to be made or training opportunities to limit the number of false negatives.

What is SOC Capacity and Expected Work?

SOC Capacity measures how much total available time your team has to disposition security alerts.

Expected Work is the total amount of alert management work you expect in a given month.

How do I measure SOC capacity and expected work?

For a simple approach, we’re going to set aside some real-world components like arrival rates (how many alerts hit the queue at once) and periodicity (what time of day alerts typically occur) and assume a completely uniform distribution of events.

We’re assuming your analysts work 8-hour days and that 70% of their time (5.6 hours) is available to do work outside of lunch, breaks, meetings, etc. Capacity is always measured in hours.

Total SOC Capacity (20 working days ≈ 1 month) =

[# of security analysts] * 5.6 hours * [% of time spent triaging] * [# of working days]

Example:

  • You have a team of 5 security analysts with 40% of their time spent triaging alerts

5 analysts * 5.6 hours * 0.40 * 20 = 224 hours

Expected Work is calculated as MTTR * Average Alert Volume. If it takes the team 1.1 hours on average to respond to an alert and you receive 200 alerts a month, ideally your team would have a SOC capacity of around 250 hours (1.1 hours * 200 alerts * 1.15 surge-buffer multiplier ≈ 253 hours).
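Putting the two formulas together with the same hypothetical numbers:

```python
# SOC capacity: 5 analysts, 5.6 productive hours/day, 40% of time triaging, 20 working days
analysts = 5
productive_hours_per_day = 5.6
triage_fraction = 0.40
working_days = 20
capacity_hours = analysts * productive_hours_per_day * triage_fraction * working_days   # 224 hours

# Expected work: average response time per alert * monthly alert volume * surge buffer
mttr_hours = 1.1
monthly_alerts = 200
surge_buffer = 1.15
expected_work_hours = mttr_hours * monthly_alerts * surge_buffer                         # ~253 hours

print(f"Capacity: {capacity_hours:.0f} hours, Expected work (with buffer): {expected_work_hours:.0f} hours")
print("Capacity gap: consider tuning, automation, or hiring" if capacity_hours < expected_work_hours else "Covered")
```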

What does good look like?

Good is a relative term here, but generally you want to have capacity available that exceeds your average work volume by 15% in a given month so that your team can handle alert surges.

How should this metric direct action?

  • SOC managers should ensure that they have available capacity to manage their expected work in a given month and be able to account for future growth
  • Expected Work that’s exceeding capacity is ripe for support with automation or appropriate detection tuning
  • When SOC capacity is significantly below Expected Work, you run a significant risk of burnout, which isn’t sustainable
  • If your Alert Latency / Alert Dwell Time is high as a component of MTTR, take a hard look at capacity as a second-order measure to suss out whether resourcing is your constraint

Additional notes: For a deeper understanding of capacity planning beyond the simplified model above, check out Jon Hencinski’s post on the subject here.

Analyst team cognitive load


Just like a well-oiled machine, an efficient SOC isn’t “redlining” at all times. With the expanded responsibilities of SOC teams these days, pushing security teams too hard for too long can lead to attrition and burnout. SOC managers have to stay mindful by talking to their direct reports during 1:1s, but they can also use the following SOC metrics as starting points to identify morale issues.

What is Mean Time to Investigate (MTTI)?

This is the average duration from acknowledging an alert, through investigating the activity, to resolving the alert.

How do I measure MTTI?

Alert Resolved Time - Alert Acknowledged Time

What is a good MTTI?

Top organizations average between 10 minutes and an hour, depending on alert volume and automation. Keep in mind that for organizations with a high volume of false positive alerts, tuning those alerts out may artificially raise your mean time to investigate. That’s not a bad thing.

How should this metric direct action?

  1. Are there alert types with significantly higher MTTI? Is there an opportunity for automation or enrichment to support faster conclusions?
  2. Are there any alert types with high MTTI that are also high volume? These can be mind numbing for analysts. Prioritize for tuning, training, or automation.

Additional note: Like MTTR, I’m a personal fan of using Median instead of Mean for this measure for some outlier resistance.

What is Alert Latency / Alert Wait Time / Alert Dwell Time?

This measure describes the amount of time between suspicious activity occurring that triggered an alert and when an analyst or engineer acknowledges that alert.

How do I measure alert latency?

|Alert Acknowledged Time - Activity Start Time|

What is good alert latency?

Similar to MTTR, Alert Latency is typically managed based on alert severity. The following would be in the range of the top 10% of security organizations:

Critical: 20 minutes
High: 1 hour
Medium: 2 hours
Low: 6 hours

How should this metric direct action?

  1. Are there specific alerts or classes of technology that the team is consistently avoiding? Does it line up with an extended Mean Time to Investigate for those alerts? This might indicate a training or tuning opportunity.
  2. Is your alert volume higher than average? Does the SOC have the capacity to take on the workload? If not, you’ll need to invest in people or automation.

What is False Positive Rate?

This measures how accurate your detections are at finding real threat activity, as opposed to firing on benign behavior.

How do I measure false positive rate?

Number of false positive alerts / total number of alerts in a given period.

What is a good false positive rate?

Unfortunately, there’s a good bit of variation between organizations, depending on risk tolerance for specific false-positive-prone behaviors and overall alert volume.

False positive rates should be considered alongside the severity of the alerts your team generates, since high-severity alerts demand urgent responses. If your team constantly has to stay on high alert for low-priority false positives, you’ll start to see alert fatigue. Any security organization meeting the following thresholds is operating well above average.

Critical = <25% false positive rate
High = <50% false positive rate
Medium = <75% false positive rate
Low = <90% false positive rate

How should this metric direct action?

Are there alerts that require tuning or downgrading for not meeting the severity threshold?

Business growth preparedness


Security operations is often a reactive function in an organization, but anticipating future business growth allows a SOC manager to effectively resource or prioritize spend to reach business objectives.

What are Alerts per Unit of Growth?

The intent of this measure is to have a rough proxy based on historical data to show how internal growth across a single axis (customers, employees, AWS workloads) impacts alert volume as a ratio. Tracking this measure over time can help account for work variability or underscore efficiency changes to detection engineering. 

How do I measure alerts per unit of growth?

Total Monthly Alerts / Internal Growth Unit

As an example, your company has an OKR to grow to 1,000 employees by the end of the year. Your historical trend has been 1.15 alerts per employee per month. If you currently have 500 employees, you can anticipate your total monthly alert volume to grow from 575 alerts to 1,150 alerts. You can then calculate Expected Work with your current MTTR and start capacity planning.
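A quick projection with those numbers (all hypothetical), feeding straight back into the Expected Work calculation:

```python
# Historical ratio: alerts generated per employee per month (hypothetical)
alerts_per_employee = 1.15

current_employees = 500
target_employees = 1000        # end-of-year OKR

current_alert_volume = alerts_per_employee * current_employees     # 575 alerts/month
projected_alert_volume = alerts_per_employee * target_employees    # 1,150 alerts/month

# Feed the projection back into Expected Work for capacity planning
mttr_hours = 1.1
projected_expected_work = projected_alert_volume * mttr_hours      # ~1,265 hours/month
print(f"Projected monthly alerts: {projected_alert_volume:.0f}, "
      f"expected work: {projected_expected_work:.0f} hours")
```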

Additional note: This won’t accurately account for atypical growth like new product launches or newly onboarded security products.

What does good look like?

There’s no right answer here. Your goal should be to directionally forecast what the security team needs to deliver based on the business growth goals. You’ll normally come across these growth goals during Company All-Hands or OKR reviews.

How should this metric direct action?

  1. How do internal growth projections impact our overall capacity modeling? When do we need to budget for additional automation or resources to accommodate?

Additional note: In a more self-serving vein, SOC managers have an incredible opportunity to showcase non-linear growth of their security organization if they can break the Alerts / Unit of Growth trend. This is typically through internal innovation or automation.

Wrapping up

Establishing and maintaining effective SOC metrics is crucial for any security operations team aiming to stay ahead of modern threats. By focusing on key areas such as detection and response capabilities, analyst cognitive load, and preparedness for business growth, SOC managers can gain valuable insights into their operational efficiency and areas for improvement. These metrics not only provide a tangible way to measure performance but also guide strategic decisions to enhance security posture, optimize workflows, and ensure the sustainability of the team. As the security landscape continues to evolve, regularly reviewing and adapting these metrics will be essential for maintaining a robust and resilient SOC.

Further reading

Top SOC Challenges
Key SOC Tools Every Security Operations Needs
Mastering the Cybersecurity Investigation Process
What is MFA fatigue attack?
Investigating geo-impossible travel alert
Top 3 scenarios for auto remediation
Automated incident response: streamlining your SecOps
