
> mmT = matchPattern(TATA, chr22NoN, max.mismatch = 1)
> length(mmT)
[1] 102104
> mismatch(TATA, mmT[1:3])
[[1]]
[1] 2
[[2]]
[1] 5
[[3]]
[1] 7
5.6.2 Matching many query sequences
Matching a huge number of query sequences to a single target sequence is a
problem that has become relevant with high-throughput sequencing technologies.
These technologies typically yield a large number of short reads, sometimes in
the tens of millions. A common bioinformatic task is to match these reads to a
known genome. The function matchPDict can be used for this; it is based on the
Aho-Corasick algorithm.
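As a minimal sketch of the idea, the code below matches three short reads against a toy target sequence. The sequences and the names target, reads, pdict and hits are made up for illustration and are not part of the original example; the code assumes the Biostrings package is installed and that all reads have the same width.

```r
library(Biostrings)

# Toy target and a small set of equal-width reads (made-up data)
target <- DNAString("ACGTACGTTAGCACGT")
reads  <- DNAStringSet(c("ACGT", "TAGC", "GGGG"))

# Preprocess the reads into a dictionary once, then match them all together
pdict <- PDict(reads)
hits  <- matchPDict(pdict, target)

# Number of exact matches found for each read
counts <- countPDict(pdict, target)

# Start positions in the target of the matches for the first read
firstStarts <- start(hits[[1]])
```

The preprocessing step is where the cost is paid: once the dictionary has been built, the target is scanned a single time for all reads rather than once per read.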
The following example is taken from the