In the following steps, you will construct new features for the CERT insider threat dataset:
- Import numpy and pandas, and point to where the downloaded data is located:
import numpy as np
import pandas as pd

path_to_dataset = "./r42short/"
- Specify the .csv files and which of their columns to read:
log_types = ["device", "email", "file", "logon", "http"]
log_fields_list = [
    ["date", "user", "activity"],
    ["date", "user", "to", "cc", "bcc"],
    ["date", "user", "filename"],
    ["date", "user", "activity"],
    ["date", "user", "url"],
]
- We will hand-engineer a number of features and encode them, so create a dictionary to keep track of them:
features = 0
feature_map = {}

def add_feature(name):
    """Add a feature to a dictionary to be encoded."""
    ...
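To illustrate how the log types and column lists specified above might be used together, here is a minimal sketch that reads only the listed columns from each log file. The helper name `load_logs` is hypothetical, and the sketch assumes each log lives in a file named `<log_type>.csv` inside the dataset directory; adjust both to match your copy of the data.

```python
import os

import pandas as pd


def load_logs(path_to_dataset, log_types, log_fields_list):
    """Read the selected columns of each log file into a DataFrame.

    Hypothetical helper: assumes one file per log type, named
    "<log_type>.csv", located directly under path_to_dataset.
    """
    logs = {}
    # Pair each log type with the list of columns to read from it.
    for log_type, fields in zip(log_types, log_fields_list):
        csv_path = os.path.join(path_to_dataset, log_type + ".csv")
        # usecols tells pandas to parse only the named columns,
        # which keeps memory use down on large log files.
        logs[log_type] = pd.read_csv(csv_path, usecols=fields)
    return logs
```

With the variables defined in the steps above, calling `load_logs(path_to_dataset, log_types, log_fields_list)` would return a dictionary keyed by log type, for example `logs["http"]` holding only the `date`, `user`, and `url` columns.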