Files
Download Full Text (497 KB)
Description
Large Language Models (LLMs) are widely used in AI assistants, chatbots, and decision-support systems. To prevent harmful responses, most LLMs rely on safety alignment mechanisms that generate refusal responses when users request unsafe content. However, most safety evaluations assume that alignment is only required at the start of generation. In this research, we investigate a mid-generation jailbreak attack called Pause-and-Edit, where a refusal response is interrupted, modified, and resumed. This manipulation can cause the model to override its original safety decision and generate harmful instructions. Our study evaluates how vulnerable modern open-source LLMs are to this type of attack.
Publication Date
2026
Recommended Citation
Singh, Aman; More, Komal; Aryal, Samyam; and Spanier, Mark, "Mid-Generation Jailbreaks in Open-Source LLMs Using a Pause-and-Edit Attack" (2026). Annual Research Symposium. 83.
https://scholar.dsu.edu/research-symposium/83