Outlet Title
IEEE Access
Document Type
Article
Publication Date
2025
Abstract
Ransomware and other malware inflict devastating financial and operational damage on organizations worldwide by exploiting deeply embedded, hard-to-detect vulnerabilities in their systems. Detecting these vulnerabilities in compiled code before malicious actors exploit them remains a critical challenge in cybersecurity. This research introduces TEDVIL (Transformer-based Embeddings for Discovering Vulnerabilities in Lifted Code), a novel framework that uses transformer-based embeddings to train neural networks to detect vulnerabilities in lifted code. The framework was implemented using bidirectional (BERT and RoBERTa) and unidirectional (GPT-1 and GPT-2) transformer-based models to generate embeddings for training Long Short-Term Memory (LSTM) neural networks to detect stack-based buffer overflows in Low-Level Virtual Machine (LLVM) intermediate representation code. For comparison, simpler word2vec models (Skip-Gram and Continuous Bag of Words) were also trained, and their embeddings were used to train LSTMs. The results show that the LSTMs using GPT-2 embeddings outperformed those using GPT-1, BERT, RoBERTa, and word2vec embeddings, achieving a top accuracy of 92.5% and an F1-score of 89.7%. Notably, these results were achieved with an embedding model trained on a dataset of just 48,000 functions, demonstrating effectiveness in resource-constrained settings. The findings underscore the effectiveness of TEDVIL in identifying hard-to-detect vulnerabilities in compiled code, and lay the groundwork for future research in leveraging transformer-based models for vulnerability detection.
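The pipeline described in the abstract (transformer embeddings of lifted LLVM IR feeding an LSTM classifier) can be sketched roughly as follows. This is a minimal illustration only, not the paper's implementation: the model name ("gpt2"), sequence length, hidden size, and the toy IR snippet are all assumptions introduced for the example.

    # Hypothetical sketch: GPT-2 token embeddings of a lifted LLVM IR function
    # feeding a small LSTM classifier. Names and hyperparameters are illustrative,
    # not the TEDVIL paper's exact configuration.
    import torch
    import torch.nn as nn
    from transformers import GPT2Tokenizer, GPT2Model

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
    embedder = GPT2Model.from_pretrained("gpt2").eval()

    def embed_function(llvm_ir: str) -> torch.Tensor:
        """Return a (seq_len, hidden_size) sequence of token embeddings for one lifted function."""
        tokens = tokenizer(llvm_ir, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            hidden = embedder(**tokens).last_hidden_state   # shape: (1, seq_len, 768)
        return hidden.squeeze(0)

    class VulnLSTM(nn.Module):
        """Binary classifier: vulnerable vs. non-vulnerable function."""
        def __init__(self, embed_dim: int = 768, hidden_dim: int = 128):
            super().__init__()
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, embed_dim)
            _, (h_n, _) = self.lstm(x)
            return self.head(h_n[-1])                        # logits, shape (batch, 1)

    # Example usage on a toy (hypothetical) LLVM IR snippet, not from the paper's dataset.
    ir = "define i32 @f(i8* %buf) { %1 = call i8* @strcpy(i8* %buf, i8* @src) ret i32 0 }"
    emb = embed_function(ir).unsqueeze(0)     # add batch dimension
    prob = torch.sigmoid(VulnLSTM()(emb))     # probability the function is vulnerable

In this sketch the transformer is frozen and used purely as an embedding generator, while the LSTM head is the trainable detector; the actual study compares this setup across GPT-1, GPT-2, BERT, RoBERTa, and word2vec embeddings.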
Recommended Citation
G. A. McCully, J. D. Hastings and S. Xu, "TEDVIL: Leveraging Transformer-Based Embeddings for Vulnerability Detection in Lifted Code," in IEEE Access, vol. 13, pp. 76894-76913, 2025, doi: 10.1109/ACCESS.2025.3565980.
Included in
Artificial Intelligence and Robotics Commons, Cybersecurity Commons, Information Security Commons