Tracing the intracellular contraptions of life is essential to understanding how to debug our life code. Alas, without a software development kit, API, or even a manual—let alone easy-to-understand source code—we have to reverse-engineer the machine code of living things. That's the only way to find the causes of genetic diseases and to not just fix them, but to do so without breaking other parts in the process. Current methods tend to involve testing on genetically altered animals and cultured cells in long and expensive studies to find out if a certain applied change has the expected effects on protein composition, function, and organization, and to infer from this the way the system works. This can seem to even the bravest of minds like trying to fathom a recursive Rube Goldberg or Heath Robinson contraption. In this article we look at the cellular equivalent of compiler problems and the ways my company is trying to help.
All biological cells need proteins to function. Proteins are strings of amino acids, folded up in a particular way that serves a purpose, a bit like a pop-up tent. You can fit a pop-up tent in your car boot when it’s folded away; but when you reach the campsite, you need it to pop up quickly in the rain so you can peg it down and get camping. When you get home, to allow it to dry properly, you might pop it up again but not peg it down. But if you store it without its proper bag (perhaps the bag was not included with this tent because of a mistake at the factory) and other objects get crammed in there with it, you don’t want it to get stuck in a twisted shape, out of alignment and tangled up. This could cause it not only to be damaged, but to damage and get tangled with other things around it, with misaligned poles and rods finding their way to their lowest energy configurations.
With proteins, getting stuck in the useless lowest energy configuration can be a major problem. Amyloids and prions (which are among the causes of neurodegenerative diseases) are deposits of misfolded proteins that destroy any other proteins that come into contact with their ever-growing tangle. Heat, pH, and other molecules can cause some proteins to get deformed (denatured). Indeed, if something vital to keeping the protein well shaped for when it might be needed is missing or the conditions don't allow the right kind of fold, its deformation is all the more likely, and the cell might not be able to do anything more with it (Chiti and Dobson 2006).
Calculating protein folding and errors thereof has been the preserve of Stanford University's Folding@Home (Larson et al. 2002), FoldIt (Cooper et al. 2010), and Samsung's PowerSleep (Anonymous 2014) for many years, using crowd computing to make an impossibly huge task possible. More recently, this work has been incentivized by CureCoin (Cygnus-Xi and Vorksholk 2014). The recent advancements in blockchain usage for financial applications (the most famous of these being Bitcoin) have led to a variety of ‘proofs of work’ being selected for transaction verification. A great mutual benefit was seen in making the calculation of protein folding a proof of work in itself, such that cryptocurrency could be mined in the process of distributed computational scientific endeavor.
Every protein is specialized to certain tasks, conditions, and cell types. A Mongolian ger and a mountaineering tent are very different in their construction but cope with similar weather. One is meant to be semi-permanent; the other is meant for one or two nights at a given spot. A big top is different again, but they can all be described as tents. Cells have to provide equal versatility, both within the same cell during different conditions and in hundreds of different cell types. Protein folding has to be determined not only by environmental factors or utility, but by genetic and epigenetic instructions. Otherwise there would be too few ways for cells to fold their own proteins, and amyloids would run amok.
Lost in Translation
Cells often need not just any of the several possible proteins and variants of them that a particular gene theoretically encodes, but the exact protein needed for a particular cell at that time of day, with that amount of glucose, that level of stress and whilst it is in that exact position among other cells, which are at various stages of their lives. That gene must be transcribed and translated according to the correct setting of a collection of variables in both the DNA and the transcribed mRNA, as well as in the proteins handling, storing, and maintaining them.
Proteins can be coded by the same gene in several different ways at different stages of the protein’s production. This helps explain the surprisingly small number of genes humans have, compared to other, less complex species (Lander et al. 2001, Venter 2001, Ezkurdia et al. 2013).
Transcription is the process in which the relevant DNA, which was wrapped around the histone proteins that form the chromatin structure, is unspooled to reveal a gene. The gene is copied to shorter, exportable mRNA strands by the transcription complex. There are many ways in which transcription of the same gene can be selectively varied, each of which affects the mRNA available for translation. This is an area of much research focus already, and tools are available to handle bioinformatic big data about most aspects of transcription.
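For readers who think in code, the copying step can be reduced to a toy model. The sketch below assumes a short made-up DNA string and ignores everything that makes real transcription interesting (chromatin, promoters, splicing, and the selective variations mentioned above). The function names are illustrative only and do not come from any particular library.

```python
# Toy model of transcription: copy a stretch of DNA into mRNA.
# The sequence and function names are illustrative assumptions.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the opposite DNA strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand: str) -> str:
    """The mRNA carries the coding strand's sequence, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def transcribe_from_template(template_strand: str) -> str:
    """The polymerase reads the template strand, so the mRNA is the
    template's reverse complement with U in place of T."""
    return transcribe(reverse_complement(template_strand))


coding = "ATGGCCTGA"                      # made-up 9-nucleotide "gene"
template = reverse_complement(coding)     # TCAGGCCAT
print(transcribe(coding))                 # AUGGCCUGA
print(transcribe_from_template(template)) # AUGGCCUGA
```

Both routes give the same mRNA, which is simply the point that the two DNA strands carry the same information.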
Translation is the process by which the messenger RNA (mRNA) copy is read by the ribosomes, complexes of RNA and protein whose job is to assemble proteins based on the instructions in the mRNA. Ribosomes are like factories, which are assembled on demand for protein production. Their assembly and that of the proteins they make, like flat-pack furniture, relies on all the lugs and holes aligning, all the folds and magnets and coded labels matching where they are meant to be. mRNA is often derived from a much reduced portion of the original DNA sequence. However, mRNA itself contains sequences that can affect how it folds in the cytoplasm of the cell, how quickly it degrades over time before it has to be remade, and where the ribosome is able to start reading from to make a protein. If part of a notepad is crumpled up, it becomes more physically difficult to straighten and read without mistakes. The ribosome will skip such crumpled parts of the mRNA and start reading at the first sentence that makes sense. That sentence might not normally make the most sense to start from, but the sentence that normally starts things off is unavailable. This is seen in mRNAs as alternative initiation codons (AICs). Codons are sets of three nucleotides (nucleotides being A, G, C, T in DNA and A, G, C, U in RNA) that correspond to amino acids via the transfer RNAs (tRNAs) that ribosomes pick up to build their proteins with. These can be arranged such that the ribosome has multiple choices of starting places on the mRNA and will clamp on (or bind) wherever the most obvious AIC is. If the usually most obvious AIC with the best sequences around it to clamp onto is not available, the ribosome will go for the next best and so forth, with reduced success rates compared to the optimum, until all reasonable possibilities are exhausted.
An mRNA made exclusively of 'stop' codons, whose job it is to identify the end of a new amino acid chain and halt the ribosome, would be illogical. Most mRNAs will be used to make something at least on occasion, unless cellular conditions change to the point where some mRNAs are so folded up they can't be read. Therefore, contrary to the assertions of Kozak (1989), on whose 'Consensus' sequences many modern translatomic studies rest, it has been deduced that any codon which is not a stop can be a start, and it's not solely the Kozak-identified contextual codons which define this (Cowan et al. 2014, Ingolia et al. 2011, Lee et al. 2012). This is backed by the experimental data from those last three papers, the scientific details of which are too lengthy for this article.
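To make the "any non-stop codon can be a start" idea concrete, here is a minimal sketch that walks along an mRNA string and lists every possible starting place. It is an illustration only: the sequence is made up, the names `Candidate`, `classify`, and `candidate_starts` are assumptions of this sketch, and the "near-cognate" label (a codon differing from AUG by one nucleotide) is borrowed from the wider literature rather than from this article. It says nothing about how likely a ribosome is to use each site, which depends on folding and context.

```python
from typing import Iterator, NamedTuple

STOP_CODONS = {"UAA", "UAG", "UGA"}


class Candidate(NamedTuple):
    position: int    # 0-based index of the codon's first nucleotide
    codon: str
    kind: str        # "AUG", "near-cognate" or "other"
    orf_codons: int  # codons read before the next in-frame stop (or the end)


def classify(codon: str) -> str:
    """Label a codon by how closely it resembles the canonical AUG."""
    if codon == "AUG":
        return "AUG"
    mismatches = sum(a != b for a, b in zip(codon, "AUG"))
    return "near-cognate" if mismatches == 1 else "other"


def orf_length(mrna: str, start: int) -> int:
    """Count codons from `start` up to, but not including, the first
    in-frame stop. If no stop is found, count to the end of the string."""
    count = 0
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count


def candidate_starts(mrna: str) -> Iterator[Candidate]:
    """Yield every codon, in all three frames, that is not a stop."""
    mrna = mrna.upper()
    for pos in range(len(mrna) - 2):
        codon = mrna[pos:pos + 3]
        if codon in STOP_CODONS:
            continue
        yield Candidate(pos, codon, classify(codon), orf_length(mrna, pos))


for candidate in candidate_starts("CUGAUGGCCUAA"):  # made-up sequence
    print(candidate)
```

Even this twelve-nucleotide toy yields eight candidate starts, which hints at why doing this across whole genomes calls for dedicated tooling.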
If a mutation is found in a genome and seems to be related to cancer, scientists need to know which and how many of the proteins are affected in the cell and why. They will need to know which of those effects are good or bad and what is necessary in terms of treatments, to stop the bad effects without stopping the good ones. If a genetic edit is to be made, will it be possible for it to stop the negative effects of a mutation without causing negative effects of its own? Can the editing method be adjusted? And where on a gene do the AICs reside? Which ones are used by the proteins? We need a way to simulate this accurately if we are ever to cut the time it takes to solve genetic diseases (particularly those caused or affected by multiple mutations), edit genes, or trace the effects of mutations. Given the dawn of CRISPR/Cas9 (Qi et al. 2013, Ran et al. 2013), TALENs (Boch 2011), and related methods of genetic editing, addition, and removal, a means to simulate the effects of the edits they make will be especially valuable.
The effects of AICs can be phenotype defining, as seen in orchids’ matK genes, which have previously been called "pseudogenes" due to their different alternative initiation codon usages (Barthet et al. 2015a, 2015b). Another example is the Wilms’ Tumour Suppressor Gene, WT1, where novel versions of the protein are produced depending on AICs, affecting where the protein goes and what it ends up doing (Bruening and Pelletier 1996, Wegrzyn et al. 2008). Some of these effects can come from a wide variety of AICs affecting the same gene, as with dihydrofolate reductase (Peabody 1989, Wegrzyn et al. 2008). In each case, understanding bioinformatically en masse how the AICs change the proteins and why the ribosomes respond differently in different cellular conditions to the same genes’ mRNA transcripts will be invaluable (and a lot less time-consuming than present methods) to researchers wishing to avoid the cure being worse than the disease.
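How one transcript can yield different versions of a protein is easy to show with a toy translation. The sketch below uses the standard genetic code and a made-up mRNA (not WT1 or any real gene) with an upstream CUG in frame with a downstream AUG. The function name `translate_from` is an assumption of this sketch. It also writes the first residue as methionine whatever the start codon, on the assumption that the initiator tRNA carries methionine, which is the usual case but not a universal one.

```python
# Standard genetic code, built from the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_from(mrna: str, start: int) -> str:
    """Translate from `start` to the first in-frame stop codon.

    Assumption: the first residue is written as M regardless of the
    start codon, since initiation normally uses a methionine-carrying tRNA.
    """
    mrna = mrna.upper()
    residues = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the chain
            break
        residues.append("M" if i == start else amino_acid)
    return "".join(residues)


# Made-up transcript: GG | CUG GCU AUG AAA UGG UAA | CC
mrna = "GGCUGGCUAUGAAAUGGUAACC"
print(translate_from(mrna, 2))  # MAMKW  (starts at the upstream CUG)
print(translate_from(mrna, 8))  # MKW    (starts at the downstream AUG)
```

The two products share their tail but differ at the front end, and it is often the front end of a protein that carries the signal for where in the cell it should go.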
This is where we come in. Vulpine Designs is programming a software module called INITIATOR SET, which includes Initmine, based on the Intermine project from the University of Cambridge (Smith et al. 2012). This system maps AIC information to gene sequences, testing each AIC for viability of ribosome assembly, likelihood of mRNA folded structures, protein targeting to organelles, and so on. The more information is characterized, the more the system can be refined. As time goes on, we will add extra features to it. The module will draw its data from existing databases and user input, and will build a collection of all the data on a given gene and the effects of that gene and its transcription and translation.
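To give a feel for what "mapping AIC information to gene sequences" might look like as data, here is a purely hypothetical sketch. It is not the INITIATOR SET or Intermine data model: the class name, field names, the averaging in `viability`, and all the numbers are invented for illustration. The one idea it tries to capture is that each field stays empty until evidence arrives, so the record can be refined as more is characterized.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AicAnnotation:
    """Hypothetical record for one alternative initiation codon."""
    gene: str
    position: int                               # offset within the transcript
    codon: str
    ribosome_assembly: Optional[float] = None   # 0..1, None until measured
    mrna_accessibility: Optional[float] = None  # 0..1, None until measured
    targeting: Optional[str] = None             # e.g. predicted organelle
    sources: List[str] = field(default_factory=list)

    def viability(self) -> Optional[float]:
        """Average of whichever scores are known so far (an arbitrary
        choice for this sketch), or None if nothing is known yet."""
        known = [s for s in (self.ribosome_assembly, self.mrna_accessibility)
                 if s is not None]
        return sum(known) / len(known) if known else None


def rank(annotations: List[AicAnnotation]) -> List[AicAnnotation]:
    """Order scored annotations from most to least viable, leaving out
    those with no evidence at all."""
    scored = [a for a in annotations if a.viability() is not None]
    return sorted(scored, key=lambda a: a.viability(), reverse=True)


# Invented values for a placeholder gene.
records = [
    AicAnnotation("GENE_X", 120, "CUG", ribosome_assembly=0.4,
                  mrna_accessibility=0.8, sources=["user input"]),
    AicAnnotation("GENE_X", 300, "AUG", ribosome_assembly=0.9),
    AicAnnotation("GENE_X", 450, "GUG"),  # nothing characterized yet
]
for record in rank(records):
    print(record.position, record.codon, record.viability())
```

A real system would need far richer evidence and a principled way of combining it; the sketch only shows the shape of the problem.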
It is an exciting time for scientists handling all sorts of genetic diseases as we become able to grasp the many variables between genes and resultant proteins and phenotypes. It’s a privilege to be part of it, as accessibility of tools and information continues to increase through all our shared efforts. There is a possibility that one day we will be able to handle and design all aspects of the body at any age, and know the effects of our edits to the life code before we apply them.