Date of Award

Spring 3-2024

Document Type

Dissertation

Degree Name

Doctor of Philosophy in Cyber Operations (PhDCO)

First Advisor

Austin O'Brien

Second Advisor

Varghese Vaidyan

Third Advisor

Kyle Murbach

Abstract

This work presents the research methodology and procedures of a dissertation project documenting a novel method for extracting features from binary data and transforming them into images suitable for classification and prediction with machine learning and deep learning methods. The method segments a binary data source into regions delineated by the entropy differentials calculated between them; the resulting regions are variable-length n-grams. The size of each n-gram varies with the amount of entropy present in that section: higher-entropy areas are reduced into shorter n-grams, which yields finer granularity where specificity is most appropriate. The images preserve the details of the n-grams so that deep learning methods can act on them effectively, provided the model is designed and trained to focus on them. The work augments the n-gram extraction method with additional features, which are integrated into the images created from the n-gram features. Also presented is the application of deep learning models to perform classification and prediction on the image datasets. Multiple deep learning models, including a popular transfer learning model, are investigated to determine the best method for classification and prediction on the images produced. The results indicate that the method offers a compelling means of creating images that support high-accuracy prediction, as well as a means of investigating binary similarity. Two separate binary datasets are studied. The main dataset consists of binary malware files, which are well suited to this method because they exhibit large entropy differentials arising from packed regions that may contain compressed or encrypted data.
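To make the idea of entropy-delineated regions concrete, the following is a minimal sketch and not the dissertation's implementation. It assumes a fixed analysis window, Shannon entropy in bits per byte, and a simple threshold on the difference between adjacent windows. The names `shannon_entropy` and `segment_by_entropy` and the parameters `window` and `threshold` are hypothetical. The sketch illustrates boundary detection only; it does not reproduce how the method shortens n-grams in higher-entropy areas.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())


def segment_by_entropy(data: bytes, window: int = 64, threshold: float = 1.0) -> List[bytes]:
    """Split data wherever adjacent windows differ in entropy by more than threshold.

    Assumption: the window size and threshold are illustrative values,
    not parameters taken from the dissertation.
    """
    segments = []
    start = 0          # offset where the current segment began
    previous = None    # entropy of the previous window
    for offset in range(0, len(data), window):
        current = shannon_entropy(data[offset:offset + window])
        if previous is not None and abs(current - previous) > threshold:
            # Entropy differential found: close the segment at this window edge.
            segments.append(data[start:offset])
            start = offset
        previous = current
    if start < len(data):
        segments.append(data[start:])
    return segments


# Usage: a low-entropy run, a higher-entropy run, then another low-entropy run.
sample = bytes(256) + bytes(range(256)) + bytes(256)
regions = segment_by_entropy(sample)
print([len(region) for region in regions])
```

Concatenating the returned segments reproduces the input, so no bytes are lost by the split; only the boundaries are chosen from the entropy differentials.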
The dynamically sized n-grams extracted from the binary data are enhanced through an additional feature extraction method using the Lempel-Ziv Jaccard Distance (LZJD) metric. Attention is given to the transformation of the resulting variable-length n-grams into images so that as much information as possible is retained from the extracted n-grams; this includes the segregation of large images into multiple smaller images to ease consumption by deep learning models. The work documents factors that need mitigation with regard to the dynamic detection of entropy boundaries, the transformation of images, the development of deep learning models and, in particular, imbalance in the source datasets. The deep learning models are validated empirically using the Keras accuracy and loss metrics and the creation of ROC/AUC curves and confusion matrices. Researchers in artificial intelligence, particularly those interested in machine learning and deep learning methods, will find the information relevant to the classification and prediction of malware files and other binary sources. The method affords scientists a complementary method of feature extraction supporting the study of binary data, which are typically difficult data sources to analyze. Related work is presented for consideration and to contrast the novelty of the method. The work is detailed enough that interested researchers may perform their own analysis, extending the method or using it as a basis for their own research. The sources of the malware and architecture datasets are provided, as is a thorough discussion of the methods employed to curate the malware dataset. The methods employed to mitigate the challenges encountered may also be of interest, in particular the use of the n-gram extraction method to detect near-binary similarity among files both in the same class and in other classes.
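For readers unfamiliar with LZJD, the sketch below shows the core idea as it is commonly described: build a Lempel-Ziv set of previously unseen substrings for each input, then compare the two sets with the Jaccard distance. This is a simplified illustration under stated assumptions, not the dissertation's code. It computes the exact Jaccard distance over full sets, whereas practical LZJD implementations typically approximate it by hashing the substrings and keeping only the smallest hash values. The function names `lz_set` and `lz_jaccard_distance` are hypothetical.

```python
from typing import Set


def lz_set(data: bytes) -> Set[bytes]:
    """Collect substrings in Lempel-Ziv fashion: grow the current piece
    until it has not been seen before, record it, then start a new piece."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        piece = data[start:end]
        if piece not in seen:
            seen.add(piece)
            start = end
    return seen


def lz_jaccard_distance(a: bytes, b: bytes) -> float:
    """Exact Jaccard distance between the LZ sets of two byte strings.

    Returns 0.0 for identical LZ sets and 1.0 for sets with no shared
    substring. Assumption: no min-hash approximation is applied here.
    """
    set_a, set_b = lz_set(a), lz_set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


# Usage: identical inputs give distance 0.0; inputs sharing no substrings give 1.0.
print(lz_jaccard_distance(b"packed section", b"packed section"))
print(lz_jaccard_distance(b"aaaa", b"bbbb"))
```

A distance close to 0.0 suggests two inputs share most of their substrings, which is why this family of metrics is useful for spotting near-duplicate binaries.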
Lastly, the work discovered that the malware dataset created for the study suffered from varying degrees of binary duplication, both within the same class and across classes. Researchers who study malware and other binary data with deep learning models should pay particular attention to this. The method proved valuable when repurposed to identify these near-duplicate binaries and the degree of duplication and, as a result, may be of interest to other researchers.
