Why is the genetic code the way it is?
Science as a form of intellectual inquiry, as we know it today, began picking up steam in the 18th century though its origins were much earlier and rather diffuse for many centuries. By the 19th century, however, it was proceeding with great speed and confidence. A common belief among many involved in it was that, in principle, all questions about the nature of our world – physical, chemical, and biological – could be answered by science. This is tantamount to believing that the number of facts about our world is finite, though incredibly large of course, and that scientific progress consists of transferring facts and ideas from the realm of the unknown into the known. Ultimately, everything of importance would eventually be known.
If there was a high point of such confidence, it would have been in the late 19th century. By the start of the 20th century, things were beginning to be seen in a new light, as new and deeply puzzling insights in biology and physics appeared. Scientific understanding could be seen more clearly as involving not so much a steady accumulation of known facts as a continual transformation of understanding and perspectives involving novel unknowns. As soon as one major question is answered, even approximately, the answer provokes a new major question that must be addressed. This appreciation was behind one of the famous slogans about science, coined in the 1940s by an American savant, Vannevar Bush, who called science “the endless frontier”. Its character, in this respect, was in contrast to geographical frontiers, which were clearly defined. For instance, the frontier known as the American West, opened by the Lewis and Clark expedition of 1804-1806, was seen as closed by 1890.
The contrast presented by a scientific frontier is illustrated by the genetic code. This term has two different meanings and it is important to distinguish them. The term “genetic code” is often used in the popular literature as a synonym for the “genome”, the entire set of DNA molecules that define a species’ genetic inheritance. For instance, one might speak loosely of the “genetic code” of the lion, meaning its genome. This usage, however, is really a mistake. The genome of the lion is indeed the foundation for the development of this animal but in no way is it sufficient to specify how the animal comes into existence. Similarly, the genome neither depicts a genetic “blueprint” nor encodes a “program”.1
The alternative and correct version of the phrase “the genetic code” refers to the sequence of bases in DNA that specifies the different amino acids that comprise the proteins (strictly speaking, the protein polypeptide chains) that make up so much of our bodies and which carry out the activities that keep living creatures alive. Thus, the term denotes how a sequence of nitrogenous bases in the DNA comes to specify a sequence of amino acids in a protein chain. There are four bases – adenine (A), thymine (T), guanine (G), cytosine (C) – and 20 different amino acids found in proteins.
The immediate question that this raised was how 4 basic kinds of unit in one molecule (a nucleic acid) combined to generate unique specifications of 20 different small molecules (the amino acids) in another kind of molecule, a protein. (The names of the 20 amino acids are not important here but are terms like “serine”, “tryptophan”, “leucine”, “aspartic acid”, etc.) This became the problem of the genetic code that scientists had to solve.
The question arose from the Watson-Crick model of DNA structure, formulated in 1953, hence 72 years ago. In this model, genetic inheritance is embodied in the sequence of base pairs in the double-stranded DNA molecule.2 By the mid-1950s, a small number of brilliant people had begun to work actively on the problem of the genetic code. Within a few years, with a combination of logic, common sense, intuition and basic biochemistry, they had managed to set out some basic conclusions.
The most important was a numerical deduction. There could not be a one-to-one relationship between bases and amino acids because there were too many of the latter (20 amino acids vs 4 bases). Hence, the code had to involve some combination of bases to specify individual amino acids. Nor would two bases for every amino acid be enough, since two would give only 4 x 4 = 16 combinations of bases, enough to specify at most 16 amino acids, which is fewer than 20. On the other hand, combinations of three bases would be more than adequate, namely 4 x 4 x 4 or 64 possible three-letter combinations; this would be more than was needed to specify 20 amino acids.
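The counting argument is easy to check for oneself. What follows is only a minimal sketch in Python; the function name min_codon_length is made up for this illustration and appears nowhere in the original work.

```python
from itertools import product

BASES = "ATGC"       # the four DNA bases
N_AMINO_ACIDS = 20   # amino acids found in proteins


def min_codon_length(n_bases: int, n_targets: int) -> int:
    """Smallest fixed word length whose combinations cover n_targets."""
    length = 1
    while n_bases ** length < n_targets:
        length += 1
    return length


# Enumerate every possible "word" of 1, 2 and 3 bases and count them.
for k in (1, 2, 3):
    words = ["".join(p) for p in product(BASES, repeat=k)]
    enough = len(words) >= N_AMINO_ACIDS
    print(f"{k} base(s): {len(words)} combinations, enough for 20? {enough}")

print("shortest workable length:", min_codon_length(len(BASES), N_AMINO_ACIDS))
```

Words of one or two bases give 4 and 16 combinations, both short of 20, while words of three bases give 64, which is why three is the smallest fixed length that works.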
Of course, in principle, there need not be a fixed number of bases – perhaps some amino acids were specified by two bases, others by three, one or more by four, or the like. However, this would be messy and complicated. If Nature were a smart engineer, She would have picked the smallest possible number for specifying all 20 amino acids and stuck with it; that number was three. In 1961, some brilliant genetic work indicated that indeed the genetic code was based on the number three; it was a “triplet code”.3 The question then became which of the 64 possible three-base combinations, termed “codons”, designated which amino acids.
The problem had thus already achieved some clarity by 1961, only eight years after the Watson-Crick model had been published. The problem of the genetic code had seemed so challenging – this was 15 years before there was sophisticated DNA sequencing – that for many people in the then nascent field of molecular biology, solving it became the crucial problem in biology. A number of the leading people believed that the rest of biology would be trivial to solve, basically a mopping up operation, performed perhaps by mostly lesser talents (as seen by these savants). In this view, the genetic code was like the frontier of the American West, something with a defined end to the excitement, to be followed by unadventurous development.
Yet progress was much faster than initially expected. The first codon was deciphered by some creative biochemistry (for the amino acid phenylalanine) in 1961-62 and, to the surprise and delight of everyone, the identities of all the rest had been determined by 1965. Thus, deciphering the genetic code, thought to be an insuperable problem in the 1950s, had taken less than four years to achieve in the 1960s. Altogether, 61 of the 64 codons specified amino acids, ranging from one codon for each of two amino acids (tryptophan and methionine) up to six for some others (leucine, for example). The remaining three codons out of the 64 were so-called “stop codons” because they terminated translation of a protein chain; stop codons were used by organisms for just that purpose, to end elongation of the encoded polypeptide chain.
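These numbers can be verified directly from the standard codon table. The sketch below, in Python, writes the table in a compact form (one letter per amino acid, with codons ordered by the bases T, C, A, G, and "*" marking a stop codon); the variable names are my own choices for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' = stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 64 codons (TTT, TTC, TTA, ...) to its amino acid letter.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Count how many codons specify each amino acid (the "degeneracy").
degeneracy = Counter(CODON_TABLE.values())
n_stop = degeneracy.pop("*")          # remove stops from the tally
n_sense = sum(degeneracy.values())    # codons that specify an amino acid

print("total codons:", len(CODON_TABLE))
print("sense codons:", n_sense, "stop codons:", n_stop)
print("amino acids:", len(degeneracy))
for aa, n in sorted(degeneracy.items(), key=lambda item: item[1]):
    print(aa, n)
```

In the one-letter notation, W is tryptophan, M is methionine and L is leucine; the tally shows one codon each for the first two and six for the third.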
Did solution of the genetic code leave biology as a mopping up operation of lesser problems, with all the first-rate minds who had worked on the code drifting out of biology? Hardly. Over the last 60 years, biology has positively bloomed and has far more researchers and projects underway than it did in 1965. Of course, some of the key figures who had worked on the problem of the code moved into new areas of biology, in particular neurobiology, but few if any left biology.
This brings us back to the matter of closed frontiers in science. In fact, the solution of the genetic code immediately spawned a new question about it. This question was: how did it originate in the first place? This breaks down into smaller questions, such as why are specific codons associated with the particular amino acids that they are? And what is the reason for having multiple codons for many, indeed most, of the amino acids? This latter property was termed “degeneracy” (though without any connotations of immorality normally associated with this word). Degeneracy seemed a matter of redundancy. If one codon could do the job, why have multiple ones? Unlike solving the nature of the code, which took less than half a decade, these evolutionary questions are still with us, sixty years later.
Two broad alternative possibilities had been sketched as early as 1966 by Francis Crick. He was not only a co-developer of the model of DNA but had been a major intellectual force in working out the nature of the genetic code and the mechanism by which it worked. He now turned to the evolutionary question. The first possibility for the origins of the genetic code was some kind of slow gradual evolution, perhaps based on some degree of biochemical affinity between particular triplets of bases and specific amino acids. Crick thought this was unlikely and attempts by biochemists to show such relationships had failed to produce convincing evidence. (There were some claims of weak preferential affinities but not the strong relationships one might have expected.) The other idea, which Crick favoured, was that the genetic code was a “frozen accident”, namely that the associations between particular codons and particular amino acids had arisen more or less by chance but then had become fixed in place. In other words, once a particular codon specified a particular amino acid, it could not change to specifying another because this would alter the proteins made wholesale, which would be a biological catastrophe.
No one argued against Crick’s reasoning but, on the other hand, no one could propose an experiment to test this. After all, the genetic code was soon found to be apparently universal amongst living things (animals, plants, bacteria, viruses) and so must have originated only once and near the beginning of life on Earth, hence about 3.8 to 4.0 billion years ago. Furthermore, let us recall that when these ideas were first discussed, it was still more than a decade before DNA sequencing was possible and long before the kind of manipulation of DNA sequence we take for granted today.
Hence, the problem of the origins of the genetic code essentially sat there for five to six decades. A few small exceptions to the universality of the genetic code, in the form of a few codons with altered meanings (in mitochondria and a protozoan), were found but not enough to challenge Crick’s thinking on the probable frozen accident nature of the code’s origins. Perhaps the problem was not exactly frozen itself but for decades it was at least in a very deep, dreamless state.
What may mark the end of this long sleep is the set of advances in DNA sequencing and engineered DNA that have developed over decades, in particular the last two. It is now possible to make wholesale alterations to certain small genomes and then see what the biological results are. Thus, one can now change the coding properties of such genomes and then look at the consequences.
Two groups of scientists (one in Cambridge, England, the other in Cambridge, MA) have been extensively modifying the genome of the bacterium Escherichia coli (a long-time workhorse of molecular biology), producing modest reductions in the number of different codons it uses (while, of course, making sure that there is at least one, and usually more, for each amino acid). In the latest iteration of this work, the group in England has produced a version of E. coli that uses only 57 codons, a reduction of seven from the full 64. Please note: this new form of E. coli, termed Syn57, makes all of its required proteins and they should have perfectly normal molecular structures.
Creating Syn57 was a real technical tour de force, with many obstacles to overcome. Altogether, more than 100,000 changes in the genome had to be engineered to eliminate the seven codons while preserving the coding sense of the 4000 genes in the E. coli genome. A project of this scale would have been inconceivable 20 years ago and just barely imaginable 5 years ago but proved feasible now thanks to rapid technical progress. This work was reported in Science a few weeks ago.4
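The principle behind such recoding, replacing a codon with a synonym so that the protein is unchanged, can be shown on a toy scale. The following Python sketch is only an illustration: the three swaps in SWAPS and the short made-up "gene" are assumptions chosen for the example, not the published Syn57 design, and the sketch ignores real complications such as overlapping genes and regulatory sequences.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' = stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical swaps for illustration: each removed codon -> a synonym.
SWAPS = {"TCG": "AGC", "TCA": "AGT", "TAG": "TAA"}


def codons_of(dna: str) -> list[str]:
    """Split a coding sequence into successive in-frame triplets."""
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


def translate(dna: str) -> str:
    """Translate a coding sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[c] for c in codons_of(dna))


def recode(dna: str, swaps: dict[str, str]) -> str:
    """Replace every in-frame occurrence of a removed codon by its synonym."""
    return "".join(swaps.get(c, c) for c in codons_of(dna))


# Every swap must be synonymous, or the protein would change.
assert all(CODON_TABLE[old] == CODON_TABLE[new] for old, new in SWAPS.items())

gene = "ATGTCGGCATCATTGTAG"   # made-up toy sequence
recoded = recode(gene, SWAPS)

print(gene, "->", translate(gene))
print(recoded, "->", translate(recoded))
```

The recoded sequence no longer contains any of the three removed codons, yet it translates to exactly the same chain as the original. Doing this consistently across some 4000 genes is what runs the number of required changes into the tens of thousands and beyond.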
The key point is that this highly modified bacterium is fully viable and is, in its metabolism and structure, just E. coli. In this view, the “degeneracy” of the code is true redundancy. The “frozen accident” theory of the evolution of the genetic code has been given a big boost. Indeed, it had been given earlier support by a slightly smaller project started in 2019 by the Cambridge, England group, which had created a 61-codon version. (This had required “only” 18,000 changes in the genome.)
However, and this is a non-trivial point, this new form of E. coli, while metabolically and structurally fully E. coli-like, is feeble in its growth. In the normal medium used in the lab, normal E. coli cells take an hour to grow and divide. Syn57 takes four hours to do the same. Presumably, the difference in growth rate reflects a difference in the rate of protein synthesis. If so, the missing codons are actually quite helpful in protein synthesis in the wild-type bacterium. This would be no small advantage in the natural world. Thus, the apparent redundancy in the 64-codon genome is not actually superfluity but serves a real purpose. Efforts are underway to understand the growth problems of Syn57 and to see if they can be overcome.
Nevertheless, the conclusion that the seemingly unnecessary codons are actually helpful modifies the conclusion that the genetic code is described best as simply a frozen accident. There may well have been some elements of chance (“accident”) in the early steps of the evolution of the genetic code (in terms of which codons came to specify which amino acids) but there was almost certainly a strong degree of selection in expanding the codon repertoire and in the number of genes underlying that repertoire, which can vary from dozens to hundreds depending upon the species.
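To get a feel for how large that advantage is, one can run the arithmetic of doubling. The Python sketch below uses the one-hour and four-hour division times quoted above and assumes, unrealistically, that growth stays exponential for a full day with no shortage of nutrients; the function name population_after is invented for the example.

```python
def population_after(hours: float, doubling_time_hours: float, start: int = 1) -> float:
    """Cells descended from `start` cells under unrestricted doubling."""
    return start * 2 ** (hours / doubling_time_hours)


HOURS = 24
wild_type = population_after(HOURS, doubling_time_hours=1.0)   # 24 doublings
syn57 = population_after(HOURS, doubling_time_hours=4.0)       # 6 doublings

print(f"wild type after {HOURS} h: {wild_type:,.0f} cells")
print(f"Syn57 after {HOURS} h:     {syn57:,.0f} cells")
print(f"ratio: {wild_type / syn57:,.0f} to 1")
```

Under these idealised assumptions, a single wild-type cell would have about 16.8 million descendants after a day, against 64 for the recoded strain. In any direct competition the slower grower would be swamped.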
This conclusion fits my general view of most sharp binary choices in biology, denotable as A vs B. Often they are false dichotomies. Sometimes the answer is “A”, at other times “B”, and sometimes it is both, depending upon the context. In the case of the genetic code, it is almost certainly the case that chance and selection both played major roles in creating (and maintaining) it. This is a case of both A and B and the challenge is to figure out the balance and relevance of each to the different stages in the evolution of the genetic code.
My guess is that, in trying to reconstruct something that took place about 4 billion years ago, we will never have a certain answer. However, that is all right. Science may not deliver definitive, once-and-for-all answers but it does provide the excitement of chasing and arriving at new degrees of understanding. In that respect, it is indeed an endless frontier and that in itself is exciting.
1. The early stages of trying to understand how the genetic material directs the construction of the organism, following the Watson-Crick model, in the 1950s and 1960s were a time of frustration, especially given the early successes of molecular biology. Understandably, it produced a yearning for a helpful metaphor. Good metaphors in science may not really explain things but they are comforting, producing the illusion of better understanding. The metaphor of DNA as a “blueprint” for the organism was popular for a while but it really had no explanatory power and soon faded. Far more popular and long-lasting was the idea of the “genetic program” as presumably embodied in the DNA. This was much more potent as the power and potential of computers became increasingly apparent from the 1950s onward. The metaphor of the computer posits a clear distinction between the “software” and the “hardware” and this was appealing at first. However, by the 1970s it was clear that this was another false binary choice since biological development involves a far more dynamic and prolonged interplay between the genetic material and the developing organism than is given by the software/hardware dichotomy. What killed off the genetic program as an idea was the complete sequencing of genomes by the early 2000s. Genomes in themselves do not contain the temporal information needed to construct organisms and it soon became obvious that they were not “programs” in themselves.
2. For non-biologist readers who may not be all that familiar with DNA and how it works, there is no better source than the book, “The Molecular Biology of the Gene”, originally written by James Watson and first published in 1970. It is now in its 8th edition, with multiple authors, but for understanding of the basics, the 1970 edition might actually be the best.
3. This work is described in one of the great classic papers in molecular biology. See Crick, F.H. et al. (1961). General nature of the genetic code for proteins. Nature 192: 1227-1232. doi: 10.1038/1921227a0.
4. The research report is Robertson, W.E. et al. (2025). Escherichia coli with a 57-codon genetic code. Science. doi: 10.1126/science.ady4368. A popular short account is in Zimmer, C. (2025). Scientists learning to rewrite the code of life. The New York Times (Int’l edition), 6 Aug., 2025, p. 12.



Adam, in end-note 1, you write "Genomes in themselves do not contain the temporal information needed to construct organisms . . ." This suggests a future article on this temporal information. e.g., where and how it is stored in the organism? Or maybe, putting on your prophetic hat, you will reply "I'm sorry, but we will have to wait several decades for the subject to have gelled into a more coherent body of knowledge suitable for this kind of popularization."
Adam, As a complete novice in genetics, I find the "codon wheel" is an excellent visual aid in understanding the genetic code (and its redundancy) that you discuss in this article. A good version available for free is the following Wikipedia article:
https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables
For those who prefer a print version, there is an equivalent codon wheel, albeit w/o colors, as Fig. 9-1 on p.144 of the paperback Genetics for Dummies, 4th edition, by R.F. Kratz and L.J. Spock.