How it works...

The first step here is straightforward: we load in the sequences we're interested in and the classes they belong to. Because we're loading the ecoli_protein_classes.txt file into a dataframe, when we need a simple vector, we use the $ subset operator to extract the classes column from the dataframe. Doing so returns that single column in the vector object we need. After this, the workflow is straightforward:

  1. Decide how much of the data should be training and how much should be test: Here, in step 1, we choose 75% of the data as the training set when we create the training_proportion variable. This is used in conjunction with num_seqs in the sample() function to randomly choose indices of the sequences to put into the training ...

Get R Bioinformatics Cookbook now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.