How it works...

The first step here is straightforward: we load in the sequences we're interested in and the classes they belong to. Because we're loading the ecoli_protein_classes.txt file into a dataframe, when we need a simple vector, we use the $ subset operator to extract the classes column from the dataframe. Doing so returns that single column in the vector object we need. After this, the workflow is straightforward:

  1. Decide how much of the data should be training and how much should be test: Here, in step 1, we choose 75% of the data as the training set when we create the training_proportion variable. This is used in conjunction with num_seqs in the sample() function to randomly choose indices of the sequences to put into the training ...

Get R Bioinformatics Cookbook now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.