Scientists Have Finished Sequencing the Human Genome. Again.

Sequencing a truly complete human genome has been the work of decades. Twenty years ago, the Human Genome Project (HGP) declared its work finished, with an asterisk. Even a decade later, fully eight percent of the genome, so-called “junk DNA,” was beyond our understanding. But the idea of junk DNA stuck in science’s collective craw. Mother Nature is a cheapskate, and genes are expensive. Living things lose genes for resistance to threats they don’t experience. Why would DNA violate its own principle of parsimony? A lot of tweedy arguments ensued.

Even as we unraveled that last eight percent, a few obnoxious holdouts remained. But now, in a flurry of more than a dozen peer-reviewed papers, a coalition of researchers report that they have sequenced an entire human reference genome, from start to finish — telomere to telomere. Thanks to their efforts, not only do we know what the “junk” DNA does, we know how it does it.

“It’s a big deal,” said coauthor Erich D. Jarvis. “Every single base pair of a human genome is now complete.”

“You would think that, with 92 percent of the genome completed long ago, another eight percent wouldn’t contribute much,” added Jarvis. “But from that missing eight percent, we’re now gaining an entirely new understanding of how cells divide.”

‘Like a Broken Record’

A single “representative” copy of the human genome is about three billion base pairs in length. That’s gigantic. During sequencing, researchers break the DNA molecules into pieces of manageable length. With euchromatin, the 92 percent of our genome sequenced by the HGP, it’s easy to stitch the sequence back together. The problem arises when it comes time to sequence and reassemble heterochromatin: the DNA of that last eight percent.

Far from being junk, heterochromatin supplies important cogs in the cellular machinery that handles our DNA. Instead of coding for “normal” proteins, much of it forms structural parts of the chromosome itself, including regions called centromeres. A centromere is the part that holds the two strands of a chromosome together, and it is indispensable to cell division. But until now, centromeres have been a major obstacle in the effort to nail down a reference genome.

Some stretches of heterochromatin loop on the same series of a few nucleobases, repeating them over and over like a broken record. Others are just long stretches of the same nucleobase — think “AAAAAAAAAAAAAAAA,” but thousands of bases long. Centromeres have both. Historically, it’s been tough to tell exactly how long these repetitive stretches are, let alone align them right. However, an international group of geneticists decided to pool their efforts, calling themselves the Telomere-to-Telomere (T2T) Consortium. Jarvis’ lab used a number of tools to help T2T clean up “messy” DNA sequences and generate error-free results.
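To see why the length of a repeat is so hard to pin down, here is a toy sketch in Python. It is an illustration only, not the T2T pipeline: the repeat unit, the flanking sequences, and the read lengths are all made up, and the sketch ignores sequencing errors and read depth, which real assemblers also weigh as evidence.

```python
def reads_of_length(seq, k):
    """Return the set of every distinct length-k substring ("read") of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tandem_repeat(unit, copies, left="GATTACA", right="CCGTAGG"):
    """Build a hypothetical region: unique flank + repeated unit + unique flank."""
    return left + unit * copies + right


# Two made-up versions of the same region that differ only in repeat length.
short_allele = tandem_repeat("AATGG", 10)   # 50 bases of repeat
long_allele = tandem_repeat("AATGG", 50)    # 250 bases of repeat

# Reads shorter than the repeat never see both flanks at once, so the two
# versions produce exactly the same collection of distinct reads.
short_reads_match = reads_of_length(short_allele, 12) == reads_of_length(long_allele, 12)

# Reads long enough to span the whole shorter repeat plus both flanks
# are able to tell the two versions apart.
long_reads_match = reads_of_length(short_allele, 60) == reads_of_length(long_allele, 60)

print("12-base reads identical:", short_reads_match)
print("60-base reads identical:", long_reads_match)
```

With 12-base reads the two versions are indistinguishable by read content alone; only reads that cross the entire repeat settle how long it really is.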

Merfin’ DNA

One such tool is Merfin, software that evaluates and corrects assembled genome sequences. T2T used it to clean up some of the most error-prone stretches of the human genome, including centromeres.

“Genomes that we generate in the lab can have many errors in them,” said Giulio Formenti, a postdoc in Jarvis’ lab, who developed Merfin. “If even just one or a few base pairs are wrong, that can have big consequences for the overall accuracy of the genomic sequence.” Centromeres are long and repetitive, so they’re highly susceptible to this kind of error. But they’re important enough that we need to get them right.

“Stretches of identical base pairs, such as AAA, are hard for existing technology to assess,” added Formenti. “There are often errors in those sequences, even now. Merfin corrects them.”
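As a rough picture of what correcting a run like AAA can involve, the Python sketch below finds runs of identical bases and then picks the run length whose short substrings (k-mers) all appear in a trusted set. This is not Merfin's code or its actual algorithm; the sequences, the function names, and the choice of k are hypothetical, and the "trusted" k-mers here are simply taken from a made-up true sequence to stand in for accurate reads.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for each run of identical bases >= min_len."""
    runs, pos = [], 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


def kmer_set(seq, k):
    """All distinct k-mers (length-k substrings) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def missing_kmers(candidate, trusted, k):
    """Count k-mers in candidate that never appear in the trusted set."""
    return sum(
        1
        for i in range(len(candidate) - k + 1)
        if candidate[i:i + k] not in trusted
    )


# Hypothetical region: unique flanks around a run of six A's.
LEFT, RIGHT, K = "GATTC", "GCTTGC", 9
truth = LEFT + "A" * 6 + RIGHT
trusted = kmer_set(truth, K)        # stand-in for k-mers from accurate reads

# A draft assembly that dropped one A from the run.
draft = LEFT + "A" * 5 + RIGHT

# Try several run lengths and keep the one with the fewest unsupported k-mers.
best_len = min(
    range(4, 9),
    key=lambda n: missing_kmers(LEFT + "A" * n + RIGHT, trusted, K),
)

print("runs in draft:", homopolymer_runs(draft))
print("unsupported k-mers in draft:", missing_kmers(draft, trusted, K))
print("best-supported run length:", best_len)
```

The check only works here because k is larger than the run plus one base on each side; for the thousands-of-bases runs described earlier, short k-mers alone would not be enough.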

The T2T team focused their attention on a single genome, derived from a kind of non-viable cell created when a sperm fertilizes an egg that has no nucleus. Because of this glitch in their development, these cells have two copies of the father’s DNA — and no information from the mother. They’re diploid cells, but they have a single gene line. That made them prime targets for use as a single end-to-end genome. It also made them prime targets for Merfin.

In addition to Merfin, the researchers used Pacific Biosciences’ HiFi DNA sequencing machine, along with the Oxford Nanopore sequencing method. Nanopore is capable of reading up to a million base pairs at a time, while HiFi excels at accuracy. All of a sudden, centromeres became much easier to sequence and align. “It was the last piece of the puzzle — like putting on a new pair of glasses,” said coauthor and T2T co-chair Adam Phillippy, a researcher at the NIH.

Looking Forward

While the new reference genome is complete, it came from just one gene line. A single sequence therefore can’t represent the full diversity of human haplotypes. “To address this bias,” the researchers write in one report, “the Human Pangenome Reference Consortium has joined with the T2T Consortium to build a collection of high-quality reference haplotypes from a diverse set of samples.” In this way, the researchers intend to pursue a reference genome for the entire human race.

In the meantime, scientists intend to use this reference genome to better understand genetic diseases, aging, and the process of human evolution.

“Ever since we had the first draft human genome sequence, determining the exact sequence of complex genomic regions has been challenging,” T2T Consortium co-chair Evan Eichler said in a statement. “I am thrilled that we got the job done. The complete blueprint is going to revolutionize the way we think about human genomic variation, disease and evolution.”

Yes, “Merfin’ DNA” is a lame Beach Boys joke. I still think it’s funny, and I will die on this hill.