Exponential advances in gene sequencing technology have produced an embarrassment of riches: we’re now able to almost trivially sequence an organism’s DNA, yet sifting meaning from these genomes is still an incredibly labor-intensive and haphazard task. For instance, consider the following simple questions:
How similar are the genomes of dogs and humans? How does this compare to cats and humans? What about mice and cats? How close, genetically, are mice and corn?
We have all of these genomes sequenced, but we don’t have particularly good and intuitive ways to answer these sorts of questions.
Whenever we can ask simple questions about empirical phenomena that don’t seem to have elegant answers, it’s often a sign there’s a niche for a new conceptual tool. This is a stab at a tool that I believe could deal with these questions more cogently and intelligently than current approaches.
Logarithmic Evolution Distance: an intuitive approach to quantifying the difference between genomes.
How do we currently compare two genomes and put a figure on how close they are? The currently fashionable metrics seem to be:
– Raw % similarity in genetic code, e.g., “Humans and dogs share 85% of their genetic sequence.” However, there are many ways to calculate this, depending on, e.g., how one evaluates CNVs, segment additions and subtractions, and functional parity in sequences. One may get wildly different ‘percent raw similarity’ figures depending on whom one asks (the toy sketch after this list illustrates why).
– Gene homologue analysis, e.g., “The dog genome has gene homologues for ~99.8% of the human genome.” However, this metric involves similar subjectivity to the first, and more of it, with many ways to count homologues: quantifying homologue function, deciding what constitutes a similar-enough homologue, handling additions and subtractions, and dealing with CNVs. This ‘roll up your sleeves and compare the functional nuts and bolts of two genomes’ approach is also extremely labor-intensive.
– Time since evolutionary divergence, e.g., “The last common ancestor of dogs and cats lived 60 MYA, vs. that of dogs and humans, which lived 95 MYA.” Time is often a reasonably good proxy for how far apart two genomes are, but this heuristic yields plenty of false positives and false negatives: selection strength and the rate of genetic change can vary widely across circumstances, and the heuristic breaks down very quickly for organisms with significant horizontal or inter-species gene transfer.
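To make the first point concrete, here is a minimal toy sketch (the sequences and the hand-built alignment are made up for illustration, not taken from any real comparison) showing how the same aligned pair can yield very different ‘percent similarity’ figures depending on how gaps are counted:

```python
# Toy illustration only: the same pair of aligned sequences gives different
# "percent similarity" figures depending on how gaps (indels) are counted.

# A hypothetical pairwise alignment; '-' marks a gap.
seq_a = "ATGCC--GTTAGC"
seq_b = "ATGCCAAGT--GC"

def identity_over_alignment(a: str, b: str) -> float:
    """Matches divided by total alignment columns (gaps count as mismatches)."""
    matches = sum(1 for x, y in zip(a, b) if x == y and x != '-')
    return matches / len(a)

def identity_over_ungapped(a: str, b: str) -> float:
    """Matches divided by only those columns where both sequences have a base."""
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    return sum(1 for x, y in pairs if x == y) / len(pairs)

print(f"counting gaps as mismatches: {identity_over_alignment(seq_a, seq_b):.0%}")  # ~69%
print(f"ignoring gapped columns:     {identity_over_ungapped(seq_a, seq_b):.0%}")  # 100%
```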
None of these approaches really gives wrong answers to the questions I posed, but neither do they always, or even often, give helpful and intuitive answers. They fail the ‘OK, but what does it mean?’ test.
I think it’s important to note, first, that all of life is connected. And as evolution creates gaps, it could also bridge them. I say we use these facts to build an intuitive, computational metric for quantifying how far apart two genomes are.
‘Evolution Distance’ – a rough computational estimate (useful in a relative sense) of the average number of generations of artificial selection it would take to evolve organism X into organism Y under standardized conditions, given a set of allowed types of mutations. Another label for this could be ‘permutation distance’.
To back up a bit, here’s a (rough) way to explain what this idea is about: let’s imagine we have some cats. We can breed our cats, and every generation we can take the most genetically doglike cats and breed them together. Eventually (although it’ll take a while!) we’ll get a dog. What this tool would do, essentially, is computationally estimate how many generations’ worth of mutations it would take to go from genome A (a cat) to genome B (a dog). That number of generations is the ‘evolution distance’ between the genomes. You could apply it to any two genomes under the sun and get an intuitive, fairly consistent answer.
Now, what makes a dog a dog? We can identify several different thresholds for success: an exact DNA match would be the gold standard, followed by a match of the DNA that codes for proteins, followed by estimated reproductive compatibility, followed by specific subsystem similarities, and so forth, depending on our desired level of similarity. The answer would be given as a range of X to Y generations (a 95% confidence interval), in log notation (like the Richter scale), since it could vary so widely between organisms… let’s call it LED, for Logarithmic Evolution Distance.
A starfish and a particular strain of E. coli might be 10.2-10.4. (That’s a lot!)
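As a quick worked example of the log-scale convention (the generation counts below are purely illustrative, back-derived from the 10.2-10.4 figure above rather than real estimates):

```python
import math

# Hypothetical 95% CI on the number of generations needed; these numbers are
# illustrative only, chosen to match the 10.2-10.4 example above.
generations_low, generations_high = 1.6e10, 2.5e10

# LED is just the base-10 logarithm of the generation count, so wildly
# different pairs of organisms fit on one compact scale.
led_low, led_high = math.log10(generations_low), math.log10(generations_high)

print(f"LED (95% CI): {led_low:.1f}-{led_high:.1f}")  # -> LED (95% CI): 10.2-10.4
```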
Such a metric would be:
1. potentially very useful as a relative, quantitative metric,
2. intuitive in a way current measures of genetic similarity aren’t,
3. fully computational, with a relatively straightforward interpretation: you’d set up a model, put in two genomes, and get an answer.
Thus far I’ve used ‘selection’ and ‘mutation’ somewhat interchangeably. But I think the ideal way to set up the model is to stay away from random mutations and pruning. Instead, I would suggest setting up an algorithm to map out a shortest mutational path from genome X to genome Y, given a certain amount of allowed mutation per generation. This would be less reflective of the randomness of evolution, but it would perhaps give a tighter, more tractable, and more realistic estimate of the number of generations’ worth of distance.
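As a minimal sketch of what such a shortest-path estimate might look like: assume, purely for illustration, that the only allowed mutation types are single-base substitutions, insertions, and deletions, so the shortest mutational path length is simply the Levenshtein edit distance, and divide that by a per-generation mutation budget. The function names and parameters below are hypothetical, and the quadratic dynamic program is only feasible for toy sequences; real genomes would need block-level operations and much cleverer algorithms.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-base substitutions,
    insertions, and deletions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

def logarithmic_evolution_distance(genome_x: str, genome_y: str,
                                   mutations_per_generation: float) -> float:
    """Toy LED estimate: shortest mutational path length divided by the
    allowed mutations per generation, reported on a log10 scale."""
    steps = edit_distance(genome_x, genome_y)
    generations = max(steps / mutations_per_generation, 1.0)  # avoid log10(0)
    return math.log10(generations)
```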
In general, I see this as an intuitive metric for comparing any two genomes that could see wide use; after the general model is built, the beauty of this approach is that it’s automated and quantitative. Just input any two arbitrary genomes and some mutational parameters, and you get an answer. Biology is coming into an embarrassment of riches in terms of sequenced genomes. This is a tool that can hopefully help people, both scientists and laymen, make better intuitive sense of all this data.
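Continuing the toy sketch above (and reusing its hypothetical logarithmic_evolution_distance function), usage might look something like this, here on short made-up fragments rather than real genomes:

```python
# Toy usage of the sketch above; real genome comparisons would need far
# more scalable machinery and a richer mutation model.
genome_cat_fragment = "ATGCCGTTAGCCGATTACA"
genome_dog_fragment = "ATGCAGTTGGCCGTTTACA"

led = logarithmic_evolution_distance(genome_cat_fragment,
                                     genome_dog_fragment,
                                     mutations_per_generation=0.5)
print(f"toy LED: {led:.2f}")
```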
– This comparison as explained does not deal with the complexity of sexual recombination or of horizontal gene transfer (though, to be fair, none of its competitors does either). Or, to dig a little deeper, evolution acts on gene pools, whereas this tool treats evolution only as mutation on single genomes. Does it still produce a usably unbiased result in most comparisons? (My intuition: if we’re going for an absolute estimate of an ‘evolution distance’, no; for a relative comparison, yes.)
Caveats:
1. Genome assembly is still an art as opposed to a science, especially for the organisms we care about.
2. A successful first approximation to evolution on the genome scale implicitly assumes knowledge and understanding of many processes we do not know and know not of! The concept of selective pressure applied to the genome is an extremely cutting-edge field of study.
3. Validation. The problem with any such tool would be validation, both on the level of accuracy (how close to the "truth" is the output?) and of performance (how is it better than existing phylogenetic inference methods?). For the former, the proposed method precludes the usual approach of validation against simulated evolution, since such a demonstration would beg the question. As for the latter, there is no framework within which to posit an answer, given the predicament we have with the former.
Good points. My responses:
1. I think one can make this point, but I also think it's a disappearing problem. I can't state anything too firmly here, but I'm reminded of this interview (http://radar.oreilly.com/2009/07/sequencing-a-genome-a-week.html), where the tech guy in charge of Washington University's Genome Center talks about the progress of systematizing genome sequencing and reconstruction. It sure sounds like a lot of progress has been made very recently in terms of systematizing, automating, standardizing, etc., and most of the messy problems have been 'pushed down the chain' into the hands of genomicists.
2. I would say that a first approximation can be very rough, and in this context it could still be useful if one limited discussions to relative comparisons of LED. The other relevant item here is, of course, that the process of trying to simulate genomic evolution could be a very generative endeavor.
3. Both points are very valid. I think validation is a problem with this model, though (as you perhaps allude to re: performance) it may be a problem with other models to some extent as well?
Continuing on (3), I guess my argument would be this: validating accuracy is a problem rather unique to this tool, because it attempts to model something empirical, which other tools do not. E.g., existing phylogenetic inference methods assert things, but the meaning of what they assert is more or less definitional. You may view the need for validation/accuracy as a weakness of this tool; I think it could also be viewed as a strength, in that it goes beyond other tools by asserting something that /can/ be said to be empirically accurate or inaccurate.
Thanks for the intelligent comments.