3 Information Entropy and Statistical Measures
In this chapter, we start with a description of information entropy and statistical measures (Section 3.1). Using these measures, we then examine “raw” genomic data. No biology or biochemistry knowledge is needed for this analysis, and yet we almost trivially rediscover a three‐element encoding scheme that is famous in biology: the codon. Analysis of information encoding in the four‐element {a, c, g, t} genomic sequence alphabet is about as simple as analysis gets (short of working with binary data), so it provides some of the introductory examples implemented here. A few simple statistical queries then suffice to work out the details of the codon encoding scheme (Section 3.2).

Once the encoding scheme is known to exist, further structure is revealed by the anomalous placement of “stop” codons; in particular, anomalously large open reading frames (ORFs) are discovered. A few more simple statistical queries from there, and the relation of ORFs to gene structure is revealed (Section 3.3). Once you have a clear structure in the sequential data that can be referenced positionally, it is then possible to gather statistical information for a Markov model. One example of this is to examine the positional base statistics at various positions “upstream” from the start codon. We thereby identify binding sites for critical molecular interactions in both transcription and translation. Since the Markov model is needed in analysis ...
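To make the starting point concrete, the Shannon entropy of a sequence's k-mer distribution can be computed in a few lines. The sketch below is illustrative only (the `shannon_entropy` helper and the toy sequences are our own, not from the chapter): entropy over the four-letter alphabet is maximal at 2 bits when all bases are equally likely, and drops toward zero as the sequence becomes more predictable.

```python
import math
from collections import Counter

def shannon_entropy(seq, k=1):
    """Shannon entropy (in bits) of the k-mer distribution of a sequence.

    Slides a window of width k along seq, counts each distinct k-mer,
    and sums -p * log2(p) over the observed frequencies.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy sequences (hypothetical, for illustration):
uniform = "acgt" * 25      # all four bases equally represented
repetitive = "aaaa" * 25   # a single base only

print(shannon_entropy(uniform))     # 2.0 bits: maximal for a 4-letter alphabet
print(shannon_entropy(repetitive))  # 0.0 bits: no uncertainty at all
```

The same function applied with k = 3 is the natural first probe for codon-scale structure: comparing k-mer entropies across window widths (and across the three possible reading frames) is one simple way anomalous triplet statistics can surface without any biological prior.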