The following steps demonstrate a complete workflow: we begin with raw samples, featurize them, vectorize the results, combine the feature vectors, and finally train and test a classifier:
- Begin by enumerating our samples and assigning their labels:
    import os
    from os import listdir
    directories_with_labels = [("Benign PE Samples", 0), ("Malicious PE Samples", 1)]
    list_of_samples = []
    labels = []
    for dataset_path, label in directories_with_labels:
        samples = [f for f in listdir(dataset_path)]
        for sample in samples:
            file_path = os.path.join(dataset_path, sample)
            list_of_samples.append(file_path)
            labels.append(label)
- We perform a stratified train-test split:
    from sklearn.model_selection import train_test_split
    samples_train, ...
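To make the idea of stratification concrete, here is a minimal sketch on toy data. It is an illustration under stated assumptions rather than the recipe's own call: the file names, the `toy_*` variables, and the `test_size` and `random_state` values are all hypothetical. The only library feature it relies on is the `stratify` parameter of scikit-learn's `train_test_split`, which preserves the class ratio in both partitions.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for list_of_samples and labels (not the real dataset):
# 8 benign files (label 0) and 4 malicious files (label 1).
toy_samples = [f"benign_{i}.exe" for i in range(8)] + [
    f"malicious_{i}.exe" for i in range(4)
]
toy_labels = [0] * 8 + [1] * 4

# stratify=toy_labels asks for the same benign/malicious ratio in both splits.
# test_size and random_state are illustrative choices, not values from the text.
toy_train, toy_test, y_train, y_test = train_test_split(
    toy_samples,
    toy_labels,
    test_size=0.25,
    stratify=toy_labels,
    random_state=0,
)

# Count how many samples of each class landed in each partition.
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

Without `stratify`, a small or imbalanced dataset could end up with few or no malicious samples in the test set, which would make the evaluation unreliable.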