Description
The performance of LLM security agents depends on two factors: harness design and challenge type. A poorly designed harness prevents a model from recovering from failures, while challenge type determines baseline difficulty. We tested both factors in two experiments. In Experiment 1, we evaluated 6 harness-model combinations against 5 live HackTheBox machines requiring scanning, enumeration, exploitation, and privilege escalation across SSH, SMB, FTP, HTTP, and DNS. In Experiment 2, we benchmarked 10 frontier models via Claude Code Router on 5 Cybench challenges spanning the pwn, forensics, web, reverse, and crypto categories, using a Pass@3 metric. In our experiments, models achieved 100% with one harness but 0% with another, and the same model solved ~90% of static challenges but only ~20% of dynamic ones. The key insight is that dynamic challenges are solvable when the harness enables both efficient routine operations and a failure-recovery loop.
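The abstract does not spell out how Pass@3 is computed. Under the common reading, a challenge counts as solved if at least one of three attempts succeeds, and the score is the fraction of challenges solved. The following is a minimal sketch under that assumption; the function name `pass_at_k` and the outcome data are hypothetical and are not taken from the study.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int = 3) -> float:
    """Return the fraction of challenges solved in at least one of the first k attempts.

    Assumes the empirical reading of Pass@k: one success within k tries
    is enough to count a challenge as solved.
    """
    if not attempts:
        raise ValueError("no challenges given")
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical per-challenge outcomes, invented for illustration only.
# These are NOT results from the study.
example = {
    "pwn": [False, False, True],
    "forensics": [True, True, True],
    "web": [False, False, False],
    "reverse": [False, True, False],
    "crypto": [True, False, False],
}

print(pass_at_k(example))  # 4 of 5 challenges solved within 3 attempts -> 0.8
```

With these made-up outcomes, the same data scored with a single attempt (`k=1`) would give 0.4, which shows how much the retry budget can change a reported score.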
Publication Date
2026
Recommended Citation
Hammond, Joe; French, Eddie; and O'Brien, Austin, "LLM Security Agents: Harness Design and Static vs Dynamic Challenges" (2026). Annual Research Symposium. 77.
https://scholar.dsu.edu/research-symposium/77