The first step is to load the sequences we're interested in and the classes they belong to. Because the ecoli_protein_classes.txt file is loaded into a dataframe and the later steps need a simple vector, we use the $ subset operator to extract the classes column, which returns that single column as a vector. After this, the workflow is as follows:
- Decide how much of the data should be training and how much should be test: Here, in step 1, we choose 75% of the data as the training set when we create the training_proportion variable. This is used in conjunction with num_seqs in the sample() function to randomly choose indices of the sequences to put into the training ...
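
The column extraction and the random split described above can be sketched roughly as follows. This is a minimal illustration, not the recipe's actual code: the small `protein_classes` dataframe and its values are made up to stand in for the contents of ecoli_protein_classes.txt, and the names `training_set_indices`, `training_classes`, and `test_classes`, as well as the `set.seed()` call, are assumptions added for the example. Only `num_seqs`, `training_proportion`, `sample()`, and the `$` operator come from the text.

```r
# Hypothetical stand-in for the dataframe loaded from ecoli_protein_classes.txt.
# The column name `classes` follows the text; the values are invented.
protein_classes <- data.frame(
  classes = c("cp", "im", "pp", "cp", "om", "im", "cp", "pp"),
  stringsAsFactors = FALSE
)

# The $ operator returns the single column as a plain vector
classes <- protein_classes$classes

num_seqs <- length(classes)
training_proportion <- 0.75

# Assumption: a fixed seed, used here only so the sketch is reproducible
set.seed(1)

# Randomly choose 75% of the indices for the training set
training_set_indices <- sample(
  seq_len(num_seqs),
  size = floor(num_seqs * training_proportion)
)

# Everything that was not sampled goes into the test set
training_classes <- classes[training_set_indices]
test_classes <- classes[-training_set_indices]
```

Because `sample()` draws without replacement by default, no index appears twice, and negative indexing with the same vector yields a test set that does not overlap with the training set.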