VYPR
Research · Published May 4, 2026 · Updated May 17, 2026 · 1 source

Study: Structured Workflows, Not Models, Key to Effective AI Security Triage

New research finds that large language models fail to detect malicious activity when used in isolation, but average 93 percent accuracy on malicious cases when integrated into structured, tool-constrained investigative workflows.

A recent study by researchers at the University of Oslo and the Norwegian Defence Research Establishment has demonstrated that Large Language Models (LLMs) are ineffective at security alert triage when used in isolation, but achieve high accuracy when integrated into structured, agentic workflows (Help Net Security). The research highlights a critical gap between the marketing of AI-powered security assistants and their actual performance in real-world Security Operations Center (SOC) environments.

In the study, researchers tested four popular models (GPT-5-mini, Claude 3 Haiku, Qwen3:30B, and Gemma 3:27B) against a specific malicious scenario involving reconnaissance, brute-force login attempts, and initial access against a web server, sourced from the AIT Log Data Set V1.1 (Help Net Security). When provided only with a high-level summary of network logs, all four models failed to identify the malicious activity, with a zero percent success rate in flagging true-positive cases. In some instances, models like Gemma 3:27B exhibited extreme bias, classifying every input as benign (Help Net Security).

The performance shifted dramatically when the researchers implemented a multi-stage, agentic workflow. In this configuration, one model was tasked with planning an investigation by selecting from predefined SQL queries against Suricata logs and performing grep searches on unstructured text. A second model summarized the gathered evidence, while a third model rendered a final verdict, with the capability to request further investigation if necessary (Help Net Security).
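The three-stage loop described above can be illustrated with a minimal Python sketch. This is not the researchers' implementation: the query names, table schema, log lines, thresholds, and the `plan`, `summarize`, and `judge` functions are all hypothetical, and the three "models" are deterministic stubs standing in for LLM calls. The point is only to show how an allowlist of tools and a bounded investigate-again loop constrain what a model can do.

```python
import re
import sqlite3

# Hypothetical allowlist: the planner may only choose among these query names.
PREDEFINED_QUERIES = {
    "alerts_by_source": "SELECT src_ip, COUNT(*) FROM alerts "
                        "GROUP BY src_ip ORDER BY COUNT(*) DESC",
    "alert_signatures": "SELECT DISTINCT signature FROM alerts ORDER BY signature",
}

def run_query(db, name):
    """Run a predefined query by name; anything off the allowlist is rejected."""
    if name not in PREDEFINED_QUERIES:
        raise ValueError(f"query not allowed: {name}")
    return db.execute(PREDEFINED_QUERIES[name]).fetchall()

def grep(lines, pattern):
    """Return the unstructured log lines that match a regular expression."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

# --- Stubs standing in for the three model roles (assumed behaviour) ---
def plan(round_no):
    """Planner: choose tool calls for this investigation round."""
    if round_no == 0:
        return [("sql", "alerts_by_source")]
    return [("sql", "alert_signatures"), ("grep", r"Failed password")]

def summarize(evidence):
    """Summarizer: condense the gathered evidence into a small record."""
    return {
        "steps": [step for step, _ in evidence],
        "failed_logins": sum(len(rows) for step, rows in evidence
                             if step.startswith("grep:")),
    }

def judge(summary):
    """Judge: return a verdict, or ask for another investigation round."""
    if not any(step.startswith("grep:") for step in summary["steps"]):
        return "investigate_more"
    return "malicious" if summary["failed_logins"] >= 3 else "uncertain"

def triage(db, log_lines, max_rounds=3):
    """Plan, gather, summarize, and judge, for a bounded number of rounds."""
    evidence = []
    for round_no in range(max_rounds):
        for tool, arg in plan(round_no):
            if tool == "sql":
                evidence.append((f"sql:{arg}", run_query(db, arg)))
            elif tool == "grep":
                evidence.append((f"grep:{arg}", grep(log_lines, arg)))
            else:
                raise ValueError(f"unknown tool: {tool}")
        verdict = judge(summarize(evidence))
        if verdict != "investigate_more":
            return verdict
    return "uncertain"  # round budget exhausted: hand off to a human analyst

def make_demo_db():
    """Build a toy in-memory table standing in for parsed Suricata alerts."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE alerts (src_ip TEXT, signature TEXT)")
    db.executemany("INSERT INTO alerts VALUES (?, ?)", [
        ("192.0.2.10", "scan detected"),
        ("192.0.2.10", "login brute force"),
        ("192.0.2.77", "policy notice"),
    ])
    return db

# Toy unstructured log lines (invented for illustration).
DEMO_LOG = [
    "Failed password for admin from 192.0.2.10",
    "Failed password for admin from 192.0.2.10",
    "Failed password for root from 192.0.2.10",
    "Accepted password for admin from 192.0.2.10",
]

if __name__ == "__main__":
    print(triage(make_demo_db(), DEMO_LOG))
```

In this sketch the planner never writes SQL itself; it can only name an entry in `PREDEFINED_QUERIES`, and the loop falls back to "uncertain" once the round budget is spent, which mirrors the conservative behaviour the article attributes to the structured setup.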

Under this structured approach, the average accuracy for detecting malicious cases rose to 93 percent. GPT-5-mini, the top performer, correctly identified every malicious case across 100 test runs (Help Net Security). The researchers emphasized that the models themselves remained unchanged; the improvement was entirely due to the introduction of constrained tools, defined investigative steps, and guardrails that forced the models to interact with data in a systematic, analyst-like manner (Help Net Security).

Despite these gains, the study noted significant challenges regarding false positives. Even the most accurate models, such as GPT-5-mini, frequently classified benign cases as "uncertain," a verdict that would still require human review in a production environment (Help Net Security). The authors suggest that while this conservative approach is preferable to missing actual threats, it limits how much analyst time such systems can currently save.

This research serves as a cautionary tale for the deployment of AI in security operations, suggesting that the value of an AI security product lies less in the model itself and more in the surrounding architecture. As organizations continue to integrate AI into their security workflows, the focus must shift toward building robust, tool-constrained environments that force models to reason through evidence rather than relying on probabilistic guessing (Help Net Security).

Synthesized by Vypr AI