Description

The performance of LLM security agents depends on two factors: harness design and challenge type. A poorly designed harness prevents models from recovering from failures, while challenge type determines baseline difficulty. We tested both factors in two experiments. In Experiment 1, we evaluated 6 harness-model combinations against 5 live HackTheBox machines that required scanning, enumeration, exploitation, and privilege escalation across SSH, SMB, FTP, HTTP, and DNS. In Experiment 2, we benchmarked 10 frontier models via Claude Code Router on 5 Cybench challenges spanning the pwn, forensics, web, reverse, and crypto categories, using a Pass@3 metric. Models that achieved 100% with one harness scored 0% with another, and the same model solved ~90% of static challenges but only ~20% of dynamic ones. The key insight is that dynamic challenges are solvable when the harness enables both efficient routine operations and a failure-recovery loop.
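
The abstract does not define Pass@3. A minimal sketch follows, assuming the common empirical reading in which a challenge counts as solved if at least one of its first three attempts succeeds. The function name `pass_at_k` and the outcomes in `example` are hypothetical illustrations, not data or code from the study.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of challenges solved in at least one of the first k attempts.

    Assumption: this is the simple empirical reading of Pass@k; the paper
    may compute its metric differently.
    """
    if not attempts:
        raise ValueError("no challenges given")
    # A challenge is solved if any of its first k attempts succeeded.
    solved = sum(1 for results in attempts.values() if any(results[:k]))
    return solved / len(attempts)


# Hypothetical outcomes for five challenge categories (illustration only).
example = {
    "pwn": [False, False, True],
    "forensics": [True, True, True],
    "web": [False, False, False],
    "reverse": [False, True, False],
    "crypto": [False, False, False],
}

score = pass_at_k(example, k=3)
print(f"Pass@3 = {score:.0%}")
```

With these made-up outcomes, three of the five challenges are solved within three attempts, so the score is 60%.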

Publication Date

2026

LLM Security Agents: Harness Design and Static vs Dynamic Challenges
