DNA Proofreading

Last semester I took a course called "Biological Physics", about, as you can guess, the physics of fundamental processes in biology. Among a myriad of awe-inducing topics, I found particular interest in the mechanism of DNA proofreading. This is the process by which every cell in your body maximizes fidelity while replicating your genome; that is, how a new strand of DNA is "proofread" while being written. Since I learned this in a physics course, this piece will focus mostly on the physical theory of how proofreading occurs - and why it needs to.

Because I want this to be understandable to people from all levels of prior knowledge, I'm going to start at the basics - feel free to skip around, or just read part of it, as I expect the article to increase in complexity as it goes on. Let's begin, well, at the beginning!

Friedrich Miescher's laboratory (10 years later), where he discovered nuclein.  [1]

A brief history of DNA

Nowadays, basically everyone knows about DNA, but this wasn't always the case. As with all things in science, someone had to discover it! In this case, we can thank a Swiss man named Friedrich Miescher, who aimed to understand the molecular components of living cells, namely proteins. Along the way, Miescher discovered a previously unknown molecule contained within the cell's nucleus, which he termed "nuclein" [1]. Although not right away, this led to a heated debate amongst scientists on the mechanism of heredity - the passing of traits from an organism (like you) to its offspring (your children). Was it based on proteins? Or DNA? Many scientists, like Miescher, believed that DNA was not nearly complex enough to carry the information needed for life, and thus "nuclein" took a backseat to protein for nearly the next century. This all changed in the 1940s and 50s, with a few experiments that we should look into.

Only DNA has the ability to transform cells

Imagine two bacterial strains: R and S. R is essentially harmless upon infection, whereas S is fatal. That is, if one were to inject mice with R bacteria, they'd live; if injected with S bacteria, they'd die. If the bacteria, either R or S, were to be killed and then injected, no infection happens - as expected, since the bacteria are dead. But now imagine that when you take your R-infected mouse and inject it with dead S bacteria, it suddenly develops an S-bacteria infection and dies. In this imaginary scenario, there must exist some mechanism that can transform one type of bacteria into another. Well, don't imagine too hard, because Frederick Griffith did exactly this in 1928, and it was big. Cells could somehow pass information to another cell - even if dead - and change the fate of that cell! [2]

Then in the early 1940s, scientists Avery, MacLeod, and McCarty aimed to uncover what exactly the mechanism of transfer was in these types of scenarios. They exposed the aforementioned R bacteria to both protein and DNA from S bacteria, in order to see which molecule could induce such changes. As you could have guessed, they found that only DNA had the ability to transform the non-fatal R bacteria into the deadly S strain [2].

Griffith's experiment. Rough (R) bacteria don't kill the mice, but if combined with killed smooth (S) bacteria, they take up the S DNA and kill the mouse. [source]

Viruses inject DNA into their hosts

The next huge experiment in the history of understanding DNA was performed in the early 1950s by two researchers: A.D. Hershey and Martha Chase [3]. With a technique known as isotope labeling - where elements with additional neutrons are incorporated into biological molecules, which then emit detectable radiation - Hershey and Chase were able to track both the protein and the DNA of a virus while it infected bacteria. Knowing that the virus protein contains sulfur and that DNA contains phosphorus, they created viruses that contained isotopes of these elements. When these viruses infected a group of bacteria, which then replicated the virus, the scientists could look at which radioactive isotope was contained within the progeny virus: sulfur or phosphorus. As you might have guessed, it was phosphorus. The sulfur-containing protein was nowhere to be found in the new generation of viruses or the infected bacteria. This elucidated the process of viral replication: viruses injected DNA into the host cell, and this DNA was then copied to make more viruses. While this was a critical step in demonstrating that DNA was the biomolecule responsible for heredity, the tale of DNA was not over yet...

Hershey and Chase's experiment. On top, DNA's phosphorus is labeled (blue), and the label is passed down to new viral offspring. The bottom panel shows protein's sulfur labeled (green), and no label is seen in the resulting viruses. This suggests DNA is what contains hereditary information! [source]

DNA has a double helix structure

Shortly after Hershey and Chase's landmark experiment, researchers Watson and Crick had to one-up them [4] by not only proposing the base-pairing mechanism we will come to later but also correctly identifying the structure of DNA: the double helix we all know and love today. Using X-ray diffraction data - X-rays are fired at a substance, and its structure is inferred from the resulting pattern (a topic for another post sometime) - Watson and Crick derived a model for DNA that fit the data better, in their eyes, than previously proposed structures. At this point in history, the chemical basis of DNA was known: it contained a sugar (ribose) missing one oxygen atom (deoxy-), and it was a nucleic acid (remember "nuclein" from earlier?) - hence deoxyribonucleic acid. They proposed that DNA had two strands, each with a sugar-phosphate backbone, with the nitrogen-containing parts of the nucleotides, dubbed "bases", on the inside connecting the two backbones. These two backbones spiraled around each other, forming a double helix as seen in the cartoon below.

As I mentioned earlier, this paper also proposed the correct base-pairing mechanism. In DNA, there are four bases: adenine, guanine, thymine, and cytosine, but we'll call them A, G, T, and C. In the double helix model, every base on one strand is paired with a base on the complementary strand. Watson and Crick explained that A only binds to T and vice versa; the same goes for G and C. Thus a mechanism for copying was also possible - if you know the sequence of one strand (for example, ATGC), then the other strand can be inferred (TACG).
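To make the copying idea concrete, here is a minimal Python sketch of the pairing rule. The names `PAIRS` and `complement` are my own for illustration, and the sketch pairs bases position by position, ignoring the fact that the two real strands run in opposite directions.

```python
# Pairing rules described above: A pairs with T, and G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner base for each position of `strand`, in the same order."""
    return "".join(PAIRS[base] for base in strand.upper())


print(complement("ATGC"))  # TACG
```

Applying `complement` twice gives back the original strand, which is why either strand can serve as a template for rebuilding the other.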

From this point on, research into DNA exploded, and I could probably spend the next five years writing about it and I wouldn't be able to cover it all. Thus I will sort of stop with the history lesson here and try to summarize all you need to know about the basics of DNA in the next section.

Cartoon of DNA showing the helical structure and the selective base pairs. [source]





The central dogma of molecular biology. [source]

Core concepts of DNA

DNA is replicated before each cell division

The cell is the basic unit of life; every organism on this planet is composed of one or more cells. The purpose of life - if we can say life has a purpose - is to create offspring similar to itself. Most cells do this by what biologists call "dividing", such that one cell splits down the middle and becomes two cells, each essentially identical to the original cell. In order to become two cells many things must happen, but the important one here is that the DNA must be replicated, since both daughter cells need a copy of the DNA to function! When the DNA is replicated, the cell attempts to do so as faithfully as possible, since a mutation could lead to deleterious effects...

The central dogma of molecular biology

Wait a second - we discussed how DNA is the molecule responsible for heredity, but how does this work? And why is a single mutation in the DNA so darn important? The central dogma of molecular biology, developed by Francis Crick in 1957 [5], explains why.

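The central dogma describes the flow of information across biological molecules. It goes like this: DNA is copied into more DNA (replication), DNA is transcribed into RNA (transcription), and RNA is translated into protein (translation).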

The details above were not initially in Crick's hypothesis, but I figured they would help your understanding of the process. However, as I just learned while writing this, this is merely a bastardization of the full central dogma that Crick described. There are a couple of additional ideas:

This is basically to reiterate that the DNA is the hereditary factor, and proteins cannot give rise to corresponding DNA or RNA sequences. Thus, we can imagine now that a mutation in the DNA during replication could directly result in the mutation of an amino acid in a protein, which could affect the function of that protein. I'll write a separate article about protein structure and function, but for now just know that proteins do "the work" in a cell, and structural changes to proteins can be disastrous.
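Here is a rough Python sketch of how one changed base can change one amino acid. The gene sequences are made up for illustration, `CODON_TABLE` holds only a handful of entries from the standard genetic code, and for simplicity the sketch reads the DNA three bases at a time, skipping the RNA intermediate.

```python
# A few entries from the standard genetic code, written as DNA codons.
CODON_TABLE = {"ATG": "Met", "GAG": "Glu", "GTG": "Val", "AAG": "Lys"}


def translate(dna: str) -> list:
    """Read `dna` three bases at a time and look up each amino acid."""
    usable = len(dna) - len(dna) % 3  # ignore any incomplete trailing codon
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3)]


original = "ATGGAGAAG"  # hypothetical toy gene
mutated = "ATGGTGAAG"   # same gene with a single A -> T change at the fifth base

before, after = translate(original), translate(mutated)
changed = [(i + 1, a, b) for i, (a, b) in enumerate(zip(before, after)) if a != b]

print(before)   # ['Met', 'Glu', 'Lys']
print(after)    # ['Met', 'Val', 'Lys']
print(changed)  # [(2, 'Glu', 'Val')]
```

A GAG to GTG change of this kind, swapping glutamate for valine, is the well-known substitution behind sickle cell anemia.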

It should now be clear that accurately maintaining the DNA sequence of As, Gs, Cs, and Ts during replication is a serious matter. But how do cells do this?

DNA Proofreading

Whew! If you already knew all about DNA, I apologize for the long-winded intro. If not, hopefully you learned something new. Whichever group you fall into, you are now ready to discuss one of the most important processes in life - DNA proofreading. As we talked about above, DNA contains every bit of information necessary for an organism's survival. Every protein and every decision a cell makes is at some level because of its genetic code. Therefore it is imperative that the information within the DNA is preserved with utmost fidelity upon replication, because if it isn't, the effects can be catastrophic. This kind of mistake is called a mutation, and, in case you couldn't guess, it is typically not a good thing! A mutation can lead to an incorrectly made protein, which might affect other cell functions, and could eventually lead to cell death or death of the entire organism. Diseases caused by a single point mutation - a mutation at a single base pair - include sickle cell anemia.

Then how does the cell ensure a faithful replication? You might say "well, Alex, you told us earlier that A only binds to T, and G only binds to C! How could there not be a faithful replication?!" However, nothing in nature is 100% - everything is merely a matter of good enough. The reality is that A binds to T only a little bit better than A binds to A, G, or C - in fact, a ridiculously small amount better [6]. Yet somehow, errors in genome replication occur on the order of one in one billion base pairs replicated [6].

As the DNA is replicated, a complex of proteins tracks along the template strand (the DNA strand being copied), adding new bases to the daughter strand (the new strand being synthesized) from the existing pool within the nucleus. If it's copying an A, it might bring in any one of the bases, which could then either bind to the A or float back out. It's important to look at the probability that each base might bind to the A. For the sake of example, let's say that for the matching base, T, it's 50%, but for the non-matching bases it's 10%. This 10% is still pretty likely on the scale of billions of bases to replicate across the genome - our cells would be mutating like crazy.

What the replication complex does to combat this is to briefly break the bond between the A and its new partner - now the partner needs to bind again, with the same probability as before. This means that the correct base now has a 25% chance of binding - 50% times 50%. But the incorrect bases have a chance of 10% times 10% - just 1%! Now the correct base is 25 times as likely to be incorporated as an incorrect one, instead of just 5 times! This "squaring" of the probabilities makes the smaller probability even smaller in comparison.

Yet this isn't quite enough, so the replication process has one more trick up its sleeve. The breaking step mentioned above has to happen before the next base is incorporated; once the next base is incorporated, the previous base is sealed in its position, mistake or not. The cell takes advantage of this by having the replication complex move on a tiny bit faster if the incorporated base was correct! This means there is simply more time for the breaking step to occur in the case of a mistake, increasing the probability of correct incorporation even more [6]. Intuitively, proofreading like this becomes more necessary as an organism's genome grows - that's why you don't really see it in small-genome organisms like viruses.

Another factor at play here is that this proofreading requires energy: ATP, the energy source used in cellular functions, is consumed in the bond-breaking step. So things like viruses, which want to get away with using as little energy as possible, mostly don't proofread, and studies have shown that proofreading viruses are often less fit than their error-prone counterparts [7].
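The "squaring" arithmetic is easy to check with a few lines of Python. This is only a sketch of the toy numbers used in this post (50% and 10%), not measured binding probabilities, and the function name `discrimination` is my own. Fractions are used instead of decimals so the ratios come out exact.

```python
from fractions import Fraction


def discrimination(p_correct: Fraction, p_wrong: Fraction, rounds: int) -> Fraction:
    """How many times more likely the correct base is to survive `rounds` binding attempts."""
    return (p_correct ** rounds) / (p_wrong ** rounds)


p_correct = Fraction(1, 2)  # toy 50% chance for the matching base
p_wrong = Fraction(1, 10)   # toy 10% chance for a non-matching base

print(discrimination(p_correct, p_wrong, 1))  # 5
print(discrimination(p_correct, p_wrong, 2))  # 25
```

One binding attempt gives the 5-fold preference, and forcing the base to bind a second time gives the 25-fold preference. In this toy model each extra round would multiply the preference by another factor of 5.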

The first step of DNA proofreading. The "3'->5' exonuclease" is what cuts out the wrong base, and polymerase is the protein responsible for replicating the DNA. [source]

Of course, this is all a bit of a simplification, and the numbers presented are for ease of explanation. In reality these "probabilities" and the differences between them are quite small, and I probably made egregious simplifications of the complex enzymatic processes involved. If you want to read more you can check out Ref. 6, which I totally encourage, but my goal here wasn't to leave you an expert on the subject; it was to give a solid intro to a fundamental, complex, and ultimately cool biological process.

I'd love to hear feedback or comments, so please write yours below :).

Sources:
