How “the gene” gained, then lost its meaning
Certain words or terms are central to specific scientific fields. For example, “atom” and “molecule” are essential to chemistry, while “matter”, “energy”, “quantum” and “space-time” are crucial to physics. Similarly, “gene” has been a key term in genetics. Indeed, it was proposed only three years after the field itself was named. Its significance in 20th century thought is indicated by the title of a book by the noted scholar Evelyn Fox Keller, The Century of the Gene (2002). In it, she argued that however important the word had been, it had begun to lose its usefulness. In this piece, I extend her argument. This might initially seem a mere semantic issue, but it is more than that. It bears on how people, especially geneticists, think about heredity and how organisms work.
First, a brief historical overview: in just over half a century, the “gene” went from being a completely unknown, mysterious entity to one whose meaning, by the early 1960s, seemed entirely clear. The next half-century, however, saw the concept evolve into such a state of complexity and diversity that it now eludes simple definition. We may pine for the earlier clarity, but it is not coming back.
The science of genetics was named in 1906 by the English biologist William Bateson (1861-1926). The name was based on the Greek word genesis, meaning “origin”, and the field was dedicated to solving how living things reproduce themselves faithfully, generation after generation. Three years later, the word “gene” was proposed by a Danish scientist, Wilhelm Johannsen (1857-1927), to denote the fundamental “unit of heredity”. Some word was needed to embody this idea and “gene” fit perfectly. What a gene actually was, in physical or chemical terms, however, was completely unclear. It was believed that heredity was carried on the long strings of material in the nucleus known as “chromosomes”, and after Johannsen’s coinage it was inferred that the genes were located on the chromosomes. Johannsen was content to regard them as dimensionless points, without worrying about their material nature, but of course people wanted to know more. By the late 1930s/early 1940s, there was evidence that genes were very small linear segments of chromosomes. The gene had thus acquired its first physical dimension, length.
By the 1950s and ‘60s, the “gene” had become fully three-dimensional: it was a segment of a long thin molecule named deoxyribonucleic acid, or DNA. In animals and plants, the chromosomes consist by weight of approximately equal amounts of DNA and protein, but it is the DNA that carries the hereditary information, while the proteins can be regarded as the packaging. (Of course, that packaging role is complex and the proteins do more than just wrap the DNA, but we can leave those aspects aside.) DNA is a double-stranded molecule in which the two strands wrap around each other as helices (like a spiral staircase), and one strand is copied, from one end of the gene to the other, into a related molecule, a ribonucleic acid (RNA) termed “messenger” RNA or mRNA. Its sequence of units (nucleotides) is then “translated” or “decoded” into a sequence of linked amino acids, creating a proteinaceous molecule, a polypeptide chain. It is the proteins of the cell, each composed of one or more polypeptide chains, that do the work of the cell: hemoglobin in the red blood cells carries oxygen, insulin regulates the uptake of sugars from our food, digestive enzymes in our stomach break down our food, the proteins in our muscles let them work as muscles, and so on.
The details of how genes specify protein chains are complex but the general picture is simple: a segment of DNA specifies an mRNA molecule which then specifies proteins. This is shown in the figure below, with more of the details given in the legend.
Figure: (Left) One strand of the double-helical DNA is copied in the nucleus to give an mRNA, which leaves the cell’s nucleus and is “translated” into a protein chain in the colloidal cytoplasm that surrounds the nucleus. (Right) The mRNA is “read” three nucleotides at a time, each such triplet corresponding to one of the 20 amino acids, the basic units of proteins. (Figure by Sarah Kennedy)
That was the understanding achieved by 1963 and the solution to the puzzle of the nature of the gene was (rightly) seen as a triumph. It was wonderfully appealing in its simplicity: the chemical information in the gene was mirrored closely in, indeed was “co-linear” with, the sequence of the chemical units, the amino acids, in the protein. Since it was known which amino acid was specified by any “triplet” of nucleotides, one could deduce the basic structure of the protein from the nucleotide sequence of the mRNA or the gene itself. The implication was: know the genes and you will know the proteins. (And if you know something about the proteins, you can figure out what they do in the cell.) This idea was to have tremendous impact in the selling of the genome projects in the 1980s and 1990s – the efforts to sequence all the DNA in the genome – to the funding agencies by the scientists who wanted to do this work.
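For readers who find a few lines of code clearer than prose, the triplet-reading logic can be sketched very simply. The toy Python below is not taken from any of the original work; only five of the 64 real codons are filled in, and the mRNA string is invented for illustration. It simply walks along an mRNA three letters at a time, looks each codon up in a table, and stops at a “stop” codon.

```python
# A minimal sketch of the "triplet" decoding idea: each three-nucleotide
# codon in the mRNA specifies one amino acid. Only a handful of the 64
# real codons are included here, purely for illustration.
CODON_TABLE = {
    "AUG": "Met",   # methionine, also the usual "start" signal
    "UUU": "Phe",   # phenylalanine
    "GGC": "Gly",   # glycine
    "AAA": "Lys",   # lysine
    "UAA": "STOP",  # one of the three "stop" codons
}

def translate(mrna):
    """Read an mRNA string codon by codon and return the amino-acid chain."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "???")  # unknown codons flagged
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

print(translate("AUGUUUGGCAAAUAA"))  # ['Met', 'Phe', 'Gly', 'Lys']
```

This is, of course, a toy model of the principle of co-linearity, not of the cell’s actual machinery.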
There was just one catch, which only became apparent in retrospect. The picture of gene-polypeptide colinearity was derived from work on a bacterium, Escherichia coli, or E. coli, a normal, non-pathogenic bacterial resident of our gut, and a few of the DNA viruses that infect E. coli. It was assumed that genes in far more complex organisms would be the same. After all, with such an elegant, comparatively simple mechanism, why shouldn’t Nature use it in all organisms? In fact, this was summarized in the adage “What is true for E. coli is true for the elephant”, from one of the great 20th century molecular biologists, Jacques Monod. Of course, E. coli differs from elephants in many obvious ways but Monod was referring to basic genetic mechanisms.
For about 15 years, there were few reasons to question this assumption. Animals and plants have much bigger cells than bacteria and also a lot more DNA, but it was assumed that this simply reflected far more genes in animals and plants; E. coli was believed to have about 4000 genes, while humans were estimated to have maybe 100,000 genes. Were humans to have E. coli-sized genes, this would still leave humans with much more DNA in each of their cells than they seemed to need, but this was viewed as a detail that would be sorted out.
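A rough back-of-the-envelope calculation shows the puzzle. The two figures used below are commonly cited approximations, not numbers from this article: a typical E. coli gene is on the order of 1,000 nucleotide pairs, and a human cell carries roughly 3 billion nucleotide pairs of DNA.

```python
# Back-of-the-envelope arithmetic; the gene size and genome size are rough,
# commonly cited approximations, not figures taken from this article.
presumed_human_genes = 100_000
ecoli_sized_gene = 1_000            # nucleotide pairs, approximate
human_genome = 3_000_000_000        # nucleotide pairs, approximate

fraction = presumed_human_genes * ecoli_sized_gene / human_genome
print(f"100,000 E. coli-sized genes would account for only ~{fraction:.0%} of the DNA")
```

On these assumptions, the presumed genes would account for only a few percent of the DNA in a human cell, which is why the excess was such a conspicuous loose end.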
In 1977, however, something was discovered that was immediately seen to be a major complication. Certain genes in an animal virus were found not to be co-linear with their proteins: large segments of their DNA sequence, after being copied into the RNA transcript, were cut out, and the protein-specifying segments were tied (or “spliced”) together. This soon proved to be generally true of the genes of animals and plants. In effect, genes were much larger than they needed to be, and there was no simple colinearity between the nucleotides in a gene and its encoded protein. The dispensable parts were called “introns” and the protein-coding segments “exons”. Evidently, in animals and plants, and in many of their viruses, genes exist “in pieces”.
No one had predicted “genes-in-pieces” and no one could understand what purpose the arrangement might serve or what its evolutionary origins might be. (We might return to these issues in a separate article in this newsletter.) Things soon got more complicated, however, annoyingly so for those who like simplicity (as most scientists do, a desire embodied in the idea of “Occam’s Razor”). One of these complications concerned the exons, the parts of the gene retained after the introns are cut out of the original RNA transcript: they were found not always to be the same. The cutting-out process, termed splicing, could and often did produce final transcripts that omitted some exons! This “alternative splicing” meant that the same gene could produce variant proteins, as sketched below. Evidently, an idea from classical genetics of the 1940s, “one gene-one protein” (that each gene specifies one and only one protein), was false. One gene could be the source of multiple, though related, proteins. It was also found that the different spliced transcripts could occur in different cell types and do different things. Worse, there was no way to predict this from the DNA sequence.
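To make the exon/intron picture concrete, here is a small sketch in the same spirit as the earlier one; the segment names and sequences are invented purely for illustration. Ordinary splicing joins all of the exons of a primary transcript, while alternative splicing joins only some of them, so a single gene yields more than one mature mRNA and hence more than one (related) protein.

```python
# A toy illustration (invented sequences, not real gene data) of splicing.
# The primary transcript is a list of labelled segments; splicing keeps the
# chosen exons and discards the introns.
primary_transcript = [
    ("exon1",   "AUGGCU"),
    ("intron1", "GUAAGU...AG"),   # removed during splicing
    ("exon2",   "UUUGGC"),
    ("intron2", "GUCAGU...AG"),   # removed during splicing
    ("exon3",   "AAAUAA"),
]

def splice(transcript, keep_exons):
    """Join the chosen exons in order, discarding all other segments."""
    return "".join(seq for name, seq in transcript if name in keep_exons)

# "Constitutive" splicing: all exons retained.
mrna_full = splice(primary_transcript, {"exon1", "exon2", "exon3"})

# Alternative splicing: exon2 is skipped, giving a different mature mRNA
# and hence a different, though related, protein.
mrna_alt = splice(primary_transcript, {"exon1", "exon3"})

print(mrna_full)  # AUGGCUUUUGGCAAAUAA
print(mrna_alt)   # AUGGCUAAAUAA
```

Nothing in the DNA sequence of such a gene, by itself, tells you which of these spliced forms a given cell will actually make.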
Over the next four decades, further complications piled up. I will not give a full catalog here but will mention two. First, it was found that splicing would sometimes happen between transcripts from different positions on the chromosomes. This was not an error but a “programmed” event. It was termed “trans-splicing” (as opposed to normal splicing, which happens between segments close together on the chromosome, thus “in cis”). Again, the DNA sequences involved provided no clue.
The second phenomenon mentioned here was just as applecart-upsetting, and it emerged from many discoveries starting in the 1990s. The gene had been confidently defined as a segment of DNA that specified and encoded a polypeptide chain. The only exceptions allowed from the 1960s were the genes that specify the two kinds of RNA components essential for the protein-synthesizing process (translation), namely the genes for transfer RNAs and ribosomal RNAs. (See the figure; for readers unfamiliar with these terms, the first two references in Supplementary Reading explain them more fully.) As the evidence accumulated, it became clear that there are many DNA sequences that are copied into RNAs that do something else, often regulating the synthesis of mRNAs from conventional genes. These transcripts are now classified as “non-coding RNAs” (ncRNAs). The number of genes that specify these non-traditional RNAs is in the thousands; by one estimate there are about 8000 in the human genome. Since the number of conventional, protein-coding genes in our genome is now known to be on the order of only 21,000, the number specifying ncRNAs is clearly substantial, undermining the concept of the gene as simply a protein-coding device.
Perhaps you are shrugging and thinking, “well, life is always more complicated than we imagine, and these are just another set of complications”. I believe, however, that the intellectual challenge is more wide-ranging than that. For a large part of the 20th century, it was assumed, particularly by many biologists, that “the genes” held the essential clues to what living things on Earth were. If one had the sequence of a gene one could, it was thought, predict the sequence of its encoded protein, and from there one could – with time and sufficient effort – decipher what the protein did in the organism. Today we know that it is often impossible to predict the relevant DNA sequences of the gene products one is interested in; usually one can predict only part of the sequence of the gene itself from the product. And for the ncRNAs, the sequence initially gives little clue to what the RNA is doing.
In effect, many genes of interest disappear into a cloud of overlapping sequences. The bold, confident claims of the 1980s and 1990s made to justify the expense of the genome projects, namely that determining the sequences of genomes would open up “the Book of Life”, have been debunked. Of course, the genome projects have produced a huge amount of invaluable information and have had a transformative effect on the practice of biology. But DNA sequences are not the key to understanding living things and, as we now know, never could have been. As Evelyn Fox Keller argued, the “century of the gene” was the 20th century, and biology is now in a new era.
Supplementary Reading
Watson, J. D. (2005). Molecular Biology of the Gene. Benjamin Cummings, NY. The classic text on genes and how they work, first published in 1965.
Kratz, R.F., Spock, L.J. (2024). Genetics for Dummies, 4th edn, John Wiley & Sons, NY.
All the basics of genetics and gene action, clearly laid out for laypersons.
Beurton, P., Falk, R., Rheinberger, H-J. (Eds.) (2000). The Concept of the Gene in Development and Evolution. Cambridge University Press, Cambridge.
A collection of twelve thoughtful essays on the history, philosophy and current conceptual problems of the idea of the gene.
Keller, E.F. (2002). The Century of the Gene. Harvard University Press, Cambridge.
A provocative and clear account of the history of the idea of the gene and the conceptual problems thrown up by this history.
Portin, P., Wilkins, A.S. (2017). The evolving definition of the term “gene”. GENETICS 205: 1353-1364.
A historical review of how the “gene” conceptually grew in dimensionality, from 0 (a point) to n (where the magnitude of n is still unknown but more than four), and how difficult it is to define this word today.