Feature Extraction and Analysis of Binaries for Classification

Micah Flack, Dakota State University

Abstract

The research project, Feature Extraction and Analysis of Binaries for Classification, provides an in-depth examination of the features shared by unlabeled binary samples in order to classify them as benign or malicious software using several different methods. Because manually analyzing or reverse engineering a binary to determine its function is time-consuming, the ability to gather features and then quickly classify samples, without explicitly programming the solution, is highly valuable. Online services can perform this classification; however, submitting a sample to a third party is not always viable, depending on the sensitivity of the binary. With Python 3 and the pefile library, we can gather the necessary features and then choose among different classifier models from the scikit-learn machine learning library. This work addresses the problem of local, automated classification: we present several classifier models, datasets, and methods that classify unknown binaries as malware or benignware with a high degree of accuracy.
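As a rough illustration of the workflow described in the abstract, the sketch below reads a handful of PE header fields with pefile and fits a scikit-learn classifier on the resulting vectors. The feature set, the helper names (`extract_features`, `to_vector`, `train_classifier`), and the choice of a random forest are assumptions made for this example; they are not necessarily the features or models used in the project.

```python
# Minimal sketch, not the project's actual pipeline.
# Assumed third-party dependencies: pefile and scikit-learn.

# Hypothetical feature set, chosen only for illustration.
FEATURE_NAMES = (
    "number_of_sections",
    "size_of_code",
    "entry_point",
    "dll_characteristics",
    "max_section_entropy",
)


def extract_features(path):
    """Read a few header fields from the PE file at `path` into a dict."""
    # Imported here so the other helpers work without pefile installed.
    import pefile

    # fast_load skips the data directories; headers and sections are still parsed.
    pe = pefile.PE(str(path), fast_load=True)
    try:
        return {
            "number_of_sections": pe.FILE_HEADER.NumberOfSections,
            "size_of_code": pe.OPTIONAL_HEADER.SizeOfCode,
            "entry_point": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
            "dll_characteristics": pe.OPTIONAL_HEADER.DllCharacteristics,
            # High section entropy can hint at packed or encrypted content.
            "max_section_entropy": max(
                (section.get_entropy() for section in pe.sections), default=0.0
            ),
        }
    finally:
        pe.close()


def to_vector(features):
    """Order a feature dict into a fixed-length list of floats."""
    return [float(features[name]) for name in FEATURE_NAMES]


def train_classifier(vectors, labels):
    """Fit a random forest on feature vectors (label 0 = benign, 1 = malicious)."""
    # Imported here so feature extraction works without scikit-learn installed.
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(vectors, labels)
    return model
```

Given a labeled collection of samples, one would call `to_vector(extract_features(path))` for each file, pass the vectors and labels to `train_classifier`, and then call the returned model's `predict` method on the vector of an unknown binary. Everything runs locally, so the sample never has to leave the analyst's machine.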