Gene Expression as a comprehensive diagnostic platform

I’m pretty sure I’ve found the future of medical diagnosis– it’s elegant, accurate, immediate, mostly doctor-less, comprehensive, and very computationally intensive. I don’t know when it’ll arrive, but it’s racing toward us and when it hits, it’ll change everything.

In short: the future of medical diagnosis is to use a gene expression panel, along with functional and correlative connections between gene expression and pathology, to perform thousands of parallel tests for every human illness we know of, whether it’s acute, chronic, pathogenic, mental, or lifestyle-related. Put another way: one simple test that’ll uncover all health problems.

What do you mean? And how would it work?

Gene expression is a measure of which of your 20,000+ genes are being made into proteins and RNA, and to what extent. A gene expression test is much like a traditional genetic test, but because it goes beyond listing which genes your body has and shows how much your body is using each one, it gives a much better view of what’s actually going on inside your body. Genes may be a blueprint of physiological potential, but gene expression is a snapshot of actual physiological function.

The vast majority of illnesses leave a significant imprint on a person’s gene expression. A failing kidney, an inflamed appendix, obesity, a manic episode: each of these will influence which genes are activated, and in very specific, fingerprint-like ways. In a real way, gene expression is what’s happening in your body. It’s possible, and I think fairly probable, that the imprints distinct physiological insults leave on gene expression will be largely unique, and so in theory we should be able to work backward from gene expression to things that we care about, like diseases, syndromes, and injuries. Your gene expression is a record of what’s happened to, and what’s happening in, your body. The information’s almost certainly in there, encoded in the relative expression levels of 20,000+ genes, if only we can interpret it.

We can decipher some of this by tracing causation, but realistically the only way to figure out what any given gene expression profile means clinically (for now) is through correlation. Once we’ve gathered a large collection of gene expression-known illness pairs (we could build this dataset by, e.g., requiring hospitals to collect a gene expression sample when a diagnosis is made), we can start to train computers to identify which gene expression patterns are connected to each illness. Finding these sorts of connections is almost impossible for humans, but there exist computational approaches which are, in theory, well suited to it.[1] We’ll know that people with X gene expression tend to have Y disease, and so forth. If we can dive deeply into gene expression and find the predictive elements for each condition, we’ll be able to say this with great confidence, rivaling or perhaps surpassing the current thresholds for diagnosis.

It’s not a trivial computation, but there’s really nothing standing in the way of gene expression tests which use broad-spectrum correlative analysis to screen for all known illnesses at once. Gene expression tests will likely be fairly non-invasive (perhaps involving drawing a bit of blood) and extremely cheap, making it practical to preventatively screen everyone for almost everything 1-2 times a year. Moreover, the nature of the test means it can test for many kinds of things we simply can’t test for now, and easily distinguish between conditions with very similar symptoms but different causes– it’s not just a replacement for nearly all of our current tests, it’s way, way better.

The large-scale data collection necessary for correlative analysis is, I think, the biggest hurdle, though finding solid correlations in a massive dataset against a background of variable application of diagnostic criteria is also non-trivial. But it’s coming. And being able to catch more problems, catch them earlier, and catch them more cheaply could save society billions of dollars in the long-run, not to mention avert an incredible amount of human suffering.



– Training computer programs (e.g., classification ANNs) requires a lot of good data, as does screening out false positives in a sample as wide as a full genome. Getting enough *good* samples where all the right diagnoses have been made will be challenging.
– Gene expression analysis is still having growing pains. E.g., “Protein sequencing gone awry: 1 sample, 27 labs, 20 results”.
– Crunching the numbers on which of 20,000+ different genes (and potentially some non-protein-coding, RNA-producing genes) are correlated with each illness is far from a trivial problem.
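To put a rough number on the false-positive problem above, here is a minimal sketch, assuming each of 20,000 genes is tested independently against one illness at the conventional 0.05 significance level. The 0.05 cutoff, the function name, and the toy p-values are my own illustrative assumptions, and Bonferroni and Benjamini-Hochberg are just two standard corrections, not necessarily what a real pipeline would use.

```python
N_GENES = 20_000   # approximate gene count used throughout this post
ALPHA = 0.05       # assumed per-gene significance level

# If no gene were truly linked to the illness, this many would still
# pass an uncorrected per-gene test by chance alone.
expected_false_hits = N_GENES * ALPHA

# Bonferroni correction: the per-gene threshold needed to keep the chance
# of even one false hit across all genes at roughly ALPHA.
bonferroni_threshold = ALPHA / N_GENES


def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of tests kept at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Remember the largest rank whose p-value sits under its threshold.
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Made-up p-values for ten hypothetical genes.
toy_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
kept = benjamini_hochberg(toy_p_values)

print(expected_false_hits, bonferroni_threshold, kept)
```

The point is only that an uncorrected genome-wide screen would bury real signals under chance hits, which is why large numbers of good samples matter so much.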

– Will gene expression from multiple locations be needed to diagnose some illnesses?
– It seems fairly safe to say we’ll be able to diagnose e.g., kidney failure or malaria from gene expression data. But what about internal bleeding? And what about some of the more tricky or subjective mental illnesses? This technology will have its theoretical limits: I think the limits are pretty wide, but what are they?

[1] This task falls outside the scope of this writeup, but going with what I know, I’d take a set of gene expression-known illness pairs, divide the gene data into smaller, more tractable pieces (perhaps along specific gene-network fault lines, perhaps randomly?), and train a classifier neural network on each piece, which will attempt to predict a specific illness based on what it finds to be the most significant data in its subset. Layer these subset-based classifier neural networks under a ‘master’ classifier neural network which gives the final yes/no prediction. Test this model on progressively larger out-of-sample data sets. Repeat for each illness. There are undoubtedly solutions orders of magnitude better than this, but it’s a baseline start.
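The layered scheme in this footnote can be sketched roughly as follows. This is a toy under loud assumptions: plain logistic regression stands in for each neural network to keep it short, the data is synthetic (40 made-up "genes", four of which shift upward in "ill" samples), the genes are split randomly into four subsets, and every function name is hypothetical.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=100, lr=0.05):
    """Fit logistic regression with plain SGD; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def make_synthetic(n, n_genes, informative, rng):
    """Synthetic stand-in data: informative genes are raised in ill samples."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in informative:
            row[g] += 1.5 * label
        X.append(row)
        y.append(label)
    return X, y


def base_outputs(base_models, subsets, row):
    # One probability per gene subset; these feed the 'master' classifier.
    return [predict_proba(m, [row[g] for g in s])
            for m, s in zip(base_models, subsets)]


def train_stacked(X, y, subsets):
    # Train base models on the first half and the master on the second half,
    # so the master only sees base predictions on data the bases never saw.
    half = len(X) // 2
    base_models = [train_logistic([[row[g] for g in s] for row in X[:half]], y[:half])
                   for s in subsets]
    meta_X = [base_outputs(base_models, subsets, row) for row in X[half:]]
    master = train_logistic(meta_X, y[half:], epochs=300, lr=0.1)
    return base_models, master


def predict_stacked(model, subsets, row):
    base_models, master = model
    return predict_proba(master, base_outputs(base_models, subsets, row)) >= 0.5


def run_demo(seed=0):
    rng = random.Random(seed)
    n_genes, informative = 40, [0, 1, 12, 25]
    genes = list(range(n_genes))
    rng.shuffle(genes)  # the "perhaps randomly?" way of dividing the genes
    subsets = [genes[i:i + 10] for i in range(0, n_genes, 10)]
    X_train, y_train = make_synthetic(400, n_genes, informative, rng)
    X_test, y_test = make_synthetic(200, n_genes, informative, rng)
    model = train_stacked(X_train, y_train, subsets)
    hits = sum(predict_stacked(model, subsets, r) == bool(t)
               for r, t in zip(X_test, y_test))
    return hits / len(X_test)  # out-of-sample accuracy


print(f"held-out accuracy: {run_demo():.2f}")
```

Splitting the training data between the base layer and the master layer is one simple way to keep the master from just trusting whichever subset classifier overfit the most; real gene expression data would need far more care than this.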

ETA 15-20 years.

Edit, 1-22-10: Based on the available information, I think gene expression is a strongly representative abstraction level from which to draw. However, I see strong arguments for also including the metabolome and metagenome if it’s feasible to do so. This doesn’t materially change the situation.

Edit, 6-22-10: My ETA may even be too conservative: a collection of researchers from various California universities recently published a method for using gene expression for diagnosis by associating arbitrary gene expression profiles with clustered sets of expression profiles with known diagnoses.
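To give a feel for what "associating a profile with sets of known-diagnosis profiles" can look like, here is a generic nearest-centroid sketch. It is not the published method (I haven't reproduced it here); the five-gene profiles, the diagnosis labels, and the function names are all made up for illustration, and Pearson correlation is just one reasonable similarity measure.

```python
import math


def centroid(profiles):
    """Average expression level per gene across a group of profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]


def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no pattern to correlate
    return cov / math.sqrt(var_a * var_b)


def build_reference(labelled_profiles):
    """Map each known diagnosis to the centroid of its profiles."""
    return {dx: centroid(ps) for dx, ps in labelled_profiles.items()}


def diagnose(profile, reference):
    """Return the diagnosis whose centroid best matches the profile."""
    scores = {dx: pearson(profile, c) for dx, c in reference.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up expression levels for five hypothetical genes.
known = {
    "healthy": [[1.0, 1.1, 0.9, 1.0, 1.0], [0.9, 1.0, 1.1, 1.0, 0.9]],
    "condition_A": [[3.0, 0.2, 1.0, 2.5, 0.1], [2.8, 0.3, 0.9, 2.7, 0.2]],
}
reference = build_reference(known)
new_patient = [2.9, 0.25, 1.1, 2.4, 0.3]
label, score = diagnose(new_patient, reference)
print(label)
```

A new profile is simply assigned to whichever known group it most resembles, which is about the simplest possible version of the correlative approach described in the main post.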

Edit, 6-12-11: This idea depends on the mid-term availability of incredibly cheap gene expression sequencing. I don’t think this is unrealistic, given these sorts of trends (courtesy of

Edit, 6-29-11: Gene expression includes an incredible amount of context and nuance, which provides it with a significant advantage over the current (very imperfect) practice of using simple biomarkers.

Edit, 11-1-11: It occurs to me that this transition from biomarkers to gene expression is similar to those going on in other fields: ephemeralization. Turning physical things into general-purpose software. And when things turn into software, they’re much easier to iterate, extend, and sell lots of.
