Repetitive
“Without proteins, DNA is nothing,” said Altemose, who earned a Ph.D. in bioengineering jointly from UC Berkeley and UC San Francisco in 2021 after having received a D.Phil. in statistics from Oxford University. “DNA is a set of instructions with no one to read it if it doesn’t have proteins around to organize it, regulate it, repair it when it’s damaged and replicate it. Protein-DNA interactions are really where all the action is happening for genome regulation, and being able to map where certain proteins bind to the genome is really important for understanding their function.”
After the T2T consortium sequenced the missing DNA, Altemose and his team used new techniques to find the place within the centromere where a big protein complex called the kinetochore solidly grips the chromosome so that other machines inside the nucleus can pull chromosome pairs apart.
“When this goes wrong, you end up with missegregated chromosomes, and that leads to all kinds of problems,” he said. “If that happens in meiosis, that means you can have chromosomal anomalies leading to spontaneous miscarriage or congenital diseases. If it happens in somatic cells, you can end up with cancer — basically, cells that have massive misregulation.”
What they found in and around the centromeres were layers of new sequences overlaying layers of older sequences, as if through evolution new centromere regions have been laid down repeatedly to bind to the kinetochore. The older regions are characterized by more random mutations and deletions, indicating they’re no longer used by the cell. The newer sequences where the kinetochore binds are much less variable, and also less methylated. The addition of a methyl group is an epigenetic tag that tends to silence genes.
All of the layers in and around the centromere are composed of repetitive lengths of DNA, based on a unit about 171 base pairs long, which is roughly the length of DNA that wraps around a group of proteins to form a nucleosome, keeping the DNA packaged and compact. These 171 base pair units form even larger repeat structures that are duplicated many times in tandem, building up a large region of repetitive sequences around the centromere.
The T2T team focused on only one human genome, obtained from a non-cancerous tumor called a hydatidiform mole, which is essentially a human embryo that rejected the maternal DNA and duplicated its paternal DNA instead. Such embryos die and transform into tumors. But the fact that this mole had two identical copies of the paternal DNA — both with the father’s X chromosome, instead of different DNA from both mother and father — made it easier to sequence.
The researchers also released this week the complete sequence of a Y chromosome from a different source, which took nearly as long to assemble as the rest of the genome combined, Altemose said. The analysis of this new Y chromosome sequence will appear in a future publication.
Altemose and his team, which included UC Berkeley project scientist Sasha Langley, also used the new reference genome as a scaffold to compare the centromeric DNA of 1,600 individuals from around the world, revealing major differences in both the sequence and copy number of repetitive DNA around the centromere. Previous studies have shown that when groups of ancient humans migrated out of Africa to the rest of the world, they took only a small sample of genetic variants with them. Altemose and his team confirmed that this pattern extends into centromeres.
“What we found is that in individuals with recent ancestry outside the African continent, their centromeres, at least on chromosome X, tend to fall into two big clusters, while most of the interesting variation is in individuals who have recent African ancestry,” Altemose said. “This isn’t entirely a surprise, given what we know about the rest of the genome. But what it suggests is that if we want to look at the interesting variation in these centromeric regions, we really need to have a focused effort to sequence more African genomes and do complete telomere-to-telomere sequence assembly.”
DNA sequences around the centromere could also be used to trace human lineages back to our common ape ancestors, he noted.
“As you move away from the site of the active centromere, you get more and more degraded sequence, to the point where if you go out to the furthest shores of this sea of repetitive sequences, you start to see the ancient centromere that, perhaps, our distant primate ancestors used to bind to the kinetochore,” Altemose said. “It’s almost like layers of fossils.”
Long-read sequencing a game changer
The T2T’s success is due to improved techniques for sequencing long stretches of DNA at once, which helps when determining the order of highly repetitive stretches of DNA. Among these are PacBio’s HiFi sequencing, which can read lengths of more than 20,000 base pairs with high