Performing quality control and filtering on high-throughput sequence reads can be done using the following steps:
- Load the library and connect to a file:
library(ShortRead)fastq_file <- readFastq(file.path(getwd(), "datasets", "ch8", "ERR019652.fastq.gz") )
- Filter reads with any nucleotide with quality lower than 20:
qualities <- rowSums(as(quality(fastq_file), "matrix") <= 20) fastq_file <- fastq_file[qualities == 0]
- Trim the right-hand side of the read:
cut_off_txt <- rawToChar(as.raw(40))trimmed <- trimTails(fastq_file, k =2, a= cut_off_txt)
- Set up a custom filter to remove N and homomeric runs:
custom_filter_1 <- nFilter(threshold=0)custom_filter_2 <- polynFilter(threshold = 10, nuc = c("A", "T", "C", "G"))custom_filter ...