Security Operations Center (SOC) metrics are quantifiable measures of an organization’s security effectiveness. They are jumping-off points for a SOC manager to dive into potential issues, improve operational workflows, and ultimately reduce risk. In a security world full of blinking lights and charts, it can be hard to know which operational metrics are worth the energy to collect and what they can tell you. From our perspective, there are three main areas of awareness that metrics should drive for a SOC leadership team: detection and response capability, analyst cognitive load, and preparedness for business growth.
This blog provides a starting place for managers to build out metrics for these three areas to bolster decision-making.
When evaluating this metric you’re really trying to understand if you have the appropriate visibility and response time to effectively contain threat actors before they are able to achieve their objectives. Secureworks in their latest State of the Threat report stated that the median time for a ransomware operator to achieve their objectives is just under 24 hours. That comes down to finding threats, investigating threats, and responding to threats.
Detection coverage measures the percentage of detections your team has implemented and tested that align to a known framework, namely MITRE ATT&CK. The purpose of this measure is to illustrate the likelihood that your operations team would receive some form of signal in the event of an incident.
In ATT&CK’s case, there are 194 techniques or behaviors commonly associated with threat actors prior to data exfiltration and impact. To calculate coverage, divide the number of techniques covered by at least one of your detections by the total number of techniques.
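The division described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the detection names and the mapping of detections to technique IDs are hypothetical, and the total of 194 is simply the figure quoted in this post. Techniques are collected into a set so that two detections covering the same technique are only counted once.

```python
# Total technique count quoted in this post; adjust to the ATT&CK version you track.
ATTACK_TECHNIQUE_COUNT = 194


def detection_coverage(detections: dict[str, set[str]],
                       total_techniques: int = ATTACK_TECHNIQUE_COUNT) -> float:
    """Return the fraction of techniques covered by at least one detection."""
    covered: set[str] = set()
    for technique_ids in detections.values():
        covered |= technique_ids  # set union de-duplicates overlapping coverage
    return len(covered) / total_techniques


# Hypothetical detection names mapped to the ATT&CK technique IDs they cover.
detections = {
    "suspicious_powershell": {"T1059"},
    "encoded_command_line": {"T1059", "T1027"},
    "new_scheduled_task": {"T1053"},
}

coverage = detection_coverage(detections)
print(f"Detection coverage: {coverage:.1%}")
```

Here three detections cover three unique techniques, because the overlapping T1059 coverage does not count twice.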
For most teams, striving for 100% coverage of something like MITRE ATT&CK is both inadvisable and unachievable. The operational energy needed to build, implement, and review the resulting detections would consume a considerable number of team cycles and require a long tail of tuning and maintenance.
However, the probabilities suggest that teams don’t need perfect coverage to identify threats. Remember, a threat actor performs a limited number of behaviors and has to accomplish every step in the threat lifecycle to achieve their objectives. With the simplifying assumptions that:
That would require 127 detections to be covered or ~65% detection coverage for those outcomes. Teams with less experienced investigators typically want to opt for higher coverage on the detection front to account for slower investigation times.
How should this metric direct action?
Additional Notes: In the event your team is light on capacity and needs to suppress detections, the strong candidates will be ones with high false positive ratios that are already covered by another detection. Your effective coverage won’t decrease but your overall capacity will increase.
Mean time to detect measures the average amount of time it takes for an organization to identify an incident from the point the activity started. This measure is evaluating both your detection coverage and alert ingestion pipeline.
Mean Time to Detect is calculated by subtracting the “Activity Started At” time from the “Alerted At” time for the earliest event in each incident, then averaging the results across incidents.
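As a rough sketch of that calculation, the snippet below measures the elapsed time from activity start to alert for each incident and averages the results. The incident timestamps are made up for illustration, and the function and variable names are assumptions rather than fields from any particular SIEM.

```python
from datetime import datetime, timedelta
from statistics import mean


def time_to_detect(activity_started_at: datetime, alerted_at: datetime) -> timedelta:
    """Elapsed time from the start of the activity to the first alert."""
    return alerted_at - activity_started_at


def mean_time_to_detect(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time to detect across (activity_started_at, alerted_at) pairs."""
    seconds = [time_to_detect(start, alert).total_seconds() for start, alert in incidents]
    return timedelta(seconds=mean(seconds))


# Illustrative incidents: (activity started, earliest alert).
incidents = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 45)),    # 45 minutes
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 16, 15)),  # 135 minutes
]

mttd = mean_time_to_detect(incidents)
print(f"MTTD: {mttd}")
```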
The closer to zero, the better. Generally high performing organizations will fall somewhere between 30 minutes and 4 hours. Latency tends to get added from events early in the attack lifecycle, like VPN authentications using compromised valid credentials, that may be harder to identify before more direct action on a managed device in the environment.
How should this metric direct action?
Mean time to respond measures the average amount of time it takes for your security team to go from security event to containment or resolution. This measure encompasses your entire operational pipeline from ingesting security telemetry, acknowledging the alert, triage and investigation, and initial containment.
Mean Time to Respond is calculated by taking the total time from activity occurrence to stopping the bleeding for each alert and averaging those times. Typically this is measured across a 30-day period.
Two to four hours is a generally acceptable range across all alert severities. The lower the MTTR, the lower the risk of a significant security incident.
Organizations may also choose to split their MTTR by detection severity and assign different levels of priority. In that case, the following would be strong starting points:
Critical: 1 hour
High: 2 hours
Medium: 4 hours
Low: 8 hours
How should this metric direct action?
Additional Notes: I’m personally a bigger fan of Median Time to Respond over Mean Time to Respond. Medians will naturally have a greater resistance to outliers that may throw off your metrics.
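To illustrate the outlier point, here is a small sketch comparing the mean and median of a set of response times. The response times are invented for the example; the only assumption is that one long-running incident sits among otherwise typical alerts.

```python
from statistics import mean, median

# Illustrative response times in hours; the last value is a single long-running incident.
response_hours = [1.0, 1.5, 2.0, 2.5, 40.0]

mean_ttr = mean(response_hours)      # pulled upward by the 40-hour outlier
median_ttr = median(response_hours)  # middle value, unaffected by the outlier

print(f"Mean time to respond:   {mean_ttr:.1f} hours")
print(f"Median time to respond: {median_ttr:.1f} hours")
```

The single outlier pushes the mean well outside the two to four hour range, while the median still reflects what a typical alert looks like.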
Error rate or false negative rate measures how often an analyst or automation incorrectly dispositions true positive activity as a false positive. This informs security teams of poor documentation, lack of training, or alert fatigue that may result in missed critical detections.
False Negative rate is notoriously challenging to measure – after all if you could calculate your errors you wouldn’t have them in the first place. Shameless plug: check out the quality control process laid out by Expel that uses sampling to simplify your false negative learnings.
At or below 1% error.
How should this metric direct action?
SOC Capacity measures how much total available time your team has to disposition security alerts.
Expected Work is the total amount of alert management work you expect in a given month.
For a simple approach, we’re going to remove some real world components like arrival rates (how quickly do alerts come into the queue at once) and periodicity (what time of day do these alerts typically occur) and solve for a completely uniform distribution of events.
We’re assuming your analysts are working 8 hour days and that 70% of their time (5.6 hours) is available to do work outside of lunch, breaks, meetings, etc. Capacity is always measured in hours.
Total SOC Capacity (20 working days per month) =
[# of Security Analysts] * 5.6 hours * [% time spent triaging] * [# of working days]
Example:
5 analysts * 5.6 hours * 0.40 * 20 = 224 hours
Expected Work is calculated as MTTR * Average Alert Volume. If it takes the team on average 1.1 hours to respond to an alert and you have 200 of them a month, Expected Work is 220 hours, so ideally your team would have a SOC capacity of around 250 hours (1.1 hours * 200 alerts * 1.15 to allow a 15% surge buffer).
Good is a relative term here, but generally you want to have capacity available that exceeds your average work volume by 15% in a given month so that your team can handle alert surges.
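The capacity and expected-work arithmetic above can be sketched as follows. This is the same simplified, uniform-distribution model, using the example figures from this section; the function names and the idea of reporting a "shortfall" are illustrative assumptions.

```python
def soc_capacity_hours(analysts: int, triage_fraction: float,
                       working_days: int = 20,
                       productive_hours_per_day: float = 5.6) -> float:
    """Total hours per month the team has available for alert triage."""
    return analysts * productive_hours_per_day * triage_fraction * working_days


def required_capacity_hours(mttr_hours: float, monthly_alerts: int,
                            surge_buffer: float = 0.15) -> float:
    """Expected Work (MTTR * alert volume) plus a surge buffer."""
    return mttr_hours * monthly_alerts * (1 + surge_buffer)


capacity = soc_capacity_hours(analysts=5, triage_fraction=0.40)
required = required_capacity_hours(mttr_hours=1.1, monthly_alerts=200)
shortfall = required - capacity

print(f"Capacity: {capacity:.0f} hours")
print(f"Required: {required:.0f} hours")
print(f"Shortfall: {shortfall:.0f} hours")
```

With the example numbers, five analysts spending 40% of their time on triage provide about 224 hours, which falls roughly 29 hours short of the 253 hours needed to absorb 200 alerts plus a 15% surge.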
How should this metric direct action?
Additional notes: For a deeper understanding of capacity planning beyond the simplified model above, check out Jon Hencinski’s post on the subject.
Just like a well-oiled machine, an efficient SOC isn’t “redlining” at all times. With the expanded responsibilities of SOC teams these days, pushing security teams too hard for too long can lead to attrition and burnout. SOC managers have to stay mindful of this by talking to their directs during 1:1s, but can also use the following SOC metrics as starting points to identify morale issues.
This is the average duration between acknowledging an alert, investigating the activity, and resolving the alert.
Alert Resolved Time - Alert Acknowledged Time
Top organizations will average between 10 minutes and an hour, depending on alert volume and automation. Keep in mind that for organizations with a high number of false positive alerts, tuning those alerts may artificially raise your mean time to investigate. That’s not a bad thing.
How should this metric direct action?
Additional note: Like MTTR, I’m a personal fan of using Median instead of Mean for this measure for some outlier resistance.
This measure describes the amount of time between suspicious activity occurring that triggered an alert and when an analyst or engineer acknowledges that alert.
|Alert Acknowledged Time - Activity Start Time|
Similar to MTTR, Alert Latency is typically managed based on alert severity. The following would be in the range of the top 10% of security organizations:
Critical: 20 minutes
High: 1 hour
Medium: 2 hours
Low: 6 hours
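A minimal sketch of checking alert latency against the severity targets listed above might look like this. The target values come from the list; the function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Severity targets from the list above.
LATENCY_TARGETS = {
    "critical": timedelta(minutes=20),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=2),
    "low": timedelta(hours=6),
}


def alert_latency(activity_start: datetime, acknowledged_at: datetime) -> timedelta:
    """Absolute time between the triggering activity and alert acknowledgement."""
    return abs(acknowledged_at - activity_start)


def within_target(severity: str, latency: timedelta) -> bool:
    """True when the latency meets the target for the given severity."""
    return latency <= LATENCY_TARGETS[severity]


# Illustrative alert: activity at 09:00, acknowledged at 09:35.
latency = alert_latency(datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 9, 35))

print(f"Latency: {latency}")
print(f"Meets critical target: {within_target('critical', latency)}")
print(f"Meets high target: {within_target('high', latency)}")
```

A 35-minute latency would miss the critical target but comfortably meet the high-severity one, which is why splitting the measure by severity is useful.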
How should this metric direct action?
This is how accurate your detections are at finding threat activity.
Number of false positive alerts / total number of alerts in a given period.
Unfortunately, there’s a good bit of variation between organizations depending on risk tolerance to specific high false positive prone behaviors and overall alert volume.
False positive rates should be weighed against the severity of the alerts your detections generate, as high severity alerts demand urgent responses. If your team constantly needs to be on high alert for false positives that turn out to be low priority, you’ll start to see alert fatigue. Any security organization adhering to the following is operating well above average.
Critical = <25% false positive rate
High = <50% false positive rate
Medium = <75% false positive rate
Low = <90% false positive rate
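One way to apply these thresholds is to compute the false positive rate per severity and flag any severity that is not below its ceiling. The sketch below assumes each alert is recorded as a (severity, was false positive) pair, which is a hypothetical data shape chosen for illustration; the ceilings are the figures listed above.

```python
from collections import Counter

# Ceilings from the list above; a severity should stay strictly below its value.
FP_RATE_CEILINGS = {"critical": 0.25, "high": 0.50, "medium": 0.75, "low": 0.90}


def false_positive_rates(alerts: list[tuple[str, bool]]) -> dict[str, float]:
    """False positives divided by total alerts, computed per severity."""
    totals: Counter = Counter()
    false_positives: Counter = Counter()
    for severity, is_false_positive in alerts:
        totals[severity] += 1
        if is_false_positive:
            false_positives[severity] += 1
    return {sev: false_positives[sev] / totals[sev] for sev in totals}


# Illustrative dispositions: (severity, was it a false positive?).
alerts = [
    ("critical", True), ("critical", True), ("critical", False), ("critical", False),
    ("low", True), ("low", True), ("low", True), ("low", False),
]

rates = false_positive_rates(alerts)
needs_tuning = [sev for sev, rate in rates.items() if rate >= FP_RATE_CEILINGS[sev]]

print(rates)
print(f"Severities needing tuning or downgrading: {needs_tuning}")
```

In this sample, half of the critical alerts are false positives, so that severity is flagged, while the low severity alerts sit at 75% and remain under their 90% ceiling.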
How should this metric direct action?
Are there alerts that require tuning or downgrading for not meeting the severity threshold?
Security operations is often a reactive function in an organization, but anticipating future business growth allows a SOC manager to effectively resource or prioritize spend to reach business objectives.
The intent of this measure is to have a rough proxy based on historical data to show how internal growth across a single axis (customers, employees, AWS workloads) impacts alert volume as a ratio. Tracking this measure over time can help account for work variability or underscore efficiency changes to detection engineering.
Total Monthly Alerts / Internal Growth Unit
As an example, your company has an OKR to grow to 1000 employees by the end of the year. Your historical trend for internal growth has been 1.15 alerts per user. If you currently have 500 employees you can anticipate your total monthly alert volume to move from 575 alerts to 1,150 alerts. You can then calculate Expected Work with your current MTTR and start capacity planning.
Additional note: This won’t accurately account for atypical growth like new product launches or newly onboarded security products.
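The forecasting example above can be sketched as a short calculation. The alert and employee counts come from the example; the 1.1-hour MTTR is borrowed from the capacity example earlier in this post and is an assumption here, as are the function names.

```python
def alerts_per_growth_unit(total_monthly_alerts: float, growth_units: float) -> float:
    """Historical ratio of monthly alerts to the chosen growth unit (e.g. employees)."""
    return total_monthly_alerts / growth_units


def forecast_monthly_alerts(ratio: float, target_units: float) -> float:
    """Projected monthly alert volume if the historical ratio holds."""
    return ratio * target_units


# Figures from the example: 575 monthly alerts across 500 employees today.
ratio = alerts_per_growth_unit(total_monthly_alerts=575, growth_units=500)
forecast = forecast_monthly_alerts(ratio, target_units=1000)

# Assumed MTTR of 1.1 hours, reused from the capacity example for illustration.
assumed_mttr_hours = 1.1
expected_work_hours = forecast * assumed_mttr_hours

print(f"Alerts per employee: {ratio:.2f}")
print(f"Forecast monthly alerts: {forecast:.0f}")
print(f"Expected Work: {expected_work_hours:.0f} hours")
```

Feeding the forecast back into the Expected Work formula gives a first estimate of the capacity the team would need at the target headcount.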
There’s no right answer here. Your goal should be to directionally forecast what the security team needs to deliver based on the business growth goals. You’ll normally come across these growth goals during Company All-Hands or OKR reviews.
How should this metric direct action?
Additional note: In a more self-serving vein, SOC managers have an incredible opportunity to showcase non-linear growth of their security organization if they can break the Alerts / Unit of Growth trend. This is typically through internal innovation or automation.
Establishing and maintaining effective SOC metrics is crucial for any security operations team aiming to stay ahead of modern threats. By focusing on key areas such as detection and response capabilities, analyst cognitive load, and preparedness for business growth, SOC managers can gain valuable insights into their operational efficiency and areas for improvement. These metrics not only provide a tangible way to measure performance but also guide strategic decisions to enhance security posture, optimize workflows, and ensure the sustainability of the team. As the security landscape continues to evolve, regularly reviewing and adapting these metrics will be essential for maintaining a robust and resilient SOC.
At Prophet Security, we're building an AI SOC Analyst that applies human-level reasoning and analysis to triage and investigate every alert, without the need for playbooks or complex integrations. Request a demo of Prophet AI to learn how you can triage and investigate security alerts 10 times faster.