Tag Archives: genomes

Balanophora genomes display massively convergent evolution with other extreme holoparasites and provide novel insights into parasite–host interactions – Nature.com

  1. Balanophora genomes display massively convergent evolution with other extreme holoparasites and provide novel insights into parasite–host interactions Nature.com
  2. Parasitic plant convinces hosts to grow into its own flesh—it’s also an extreme example of genome shrinkage Phys.org
  3. These parasitic plants force their victims to make them dinner Popular Science
  4. Extreme parasitism: Balanophora convinces a host to grow into its tissue Earth.com
  5. Strikingly convergent genome alterations in two independently evolved holoparasites Nature.com

Read original article here

We Now Have The Largest Ever Human ‘Family Tree’, With 231 Million Ancestral Lineages

In June 2000, two rival groups of researchers shook hands in the shared success of a milestone in biology – the delivery of a rough draft of the human genome.

What started with an incomplete map of our chromosomes has since bloomed into a vast trove of individualized sequences from all corners of the globe, and in many cases stretching far back in time.

 

Somewhere in that ocean of decoded DNA is a story of our shared humanity.

Unfortunately, reading it is easier said than done. The sheer mass of data is only part of the problem: subtle differences in samples, diverse formats, and analysis techniques that prioritize different kinds of errors all present obstacles to a unified interpretation.

Now researchers from the Big Data Institute (BDI) at the University of Oxford in the UK have made a significant start, by merging a forest of more than 3,600 individual sequences from 215 populations into a single, enormous tree.

The tree’s branches comprise a mind-blowing 231 million ancestral lineages. At its base is a spread of roots represented by eight ancient, highly detailed human genome sequences, with thousands of smaller snippets used to confirm their place deep in our past.

Among them are three Neanderthal genomes, one genome from a Denisovan, and a small family who lived in Siberia more than four thousand years ago.

“Essentially, we are reconstructing the genomes of our ancestors and using them to form a series of linked evolutionary trees that we call a ‘tree sequence’,” says geneticist Anthony Wilder Wohns, who led the study while completing his doctorate at the BDI.

 

“We can then estimate when and where these ancestors lived.”

Their tree sequence method makes use of what’s known as a succinct data structure – a computing concept that aims to represent data in an optimal amount of space that also limits the amount of time needed to probe it all with questions.

We might apply similar thinking when saving files on our own computer, finding a compromise between compressing documents and squeezing them into long lists of folders, or simply saving everything on the desktop.

In this specific case, a tree sequence finds correlations between different branches of a tree to help make the large pools of information easier to study. 

By turning the data into graphs with nodes representing various lineages and mapping mutations along the edges, massive genetic databases can not only be squeezed into a relatively small space, but can be accessed more easily by algorithms designed to search for interesting statistics.
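The idea of storing lineages as nodes and placing mutations on edges can be illustrated with a small sketch. It is not the researchers' software: the class names (`Edge`, `Mutation`), the helper functions, and the toy data are all assumptions made up for illustration. Each edge records the stretch of genome over which a parent-child link holds, so a link shared by neighboring trees is stored once rather than once per tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Edge:
    left: int    # start of the genomic interval (inclusive)
    right: int   # end of the genomic interval (exclusive)
    parent: int  # ancestral node id
    child: int   # descendant node id


@dataclass(frozen=True)
class Mutation:
    position: int  # genomic coordinate of the variant
    node: int      # the mutation sits on the edge directly above this node


def local_parents(edges: List[Edge], position: int) -> Dict[int, int]:
    """Return {child: parent} for the tree that covers `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def carriers(edges: List[Edge], mutation: Mutation, samples: List[int]) -> List[int]:
    """List the samples that descend from the node carrying the mutation."""
    parents = local_parents(edges, mutation.position)
    found = []
    for sample in samples:
        node: Optional[int] = sample
        while node is not None:          # walk from the sample up to the root
            if node == mutation.node:
                found.append(sample)
                break
            node = parents.get(node)     # None once we pass the root
    return found


# Hypothetical toy data: four samples (0-3) and two local trees.
SAMPLES = [0, 1, 2, 3]
EDGES = [
    Edge(0, 100, parent=4, child=0),   # present in both trees, stored once
    Edge(0, 100, parent=4, child=1),   # present in both trees, stored once
    Edge(0, 50, parent=5, child=2),    # left tree only
    Edge(0, 50, parent=5, child=3),
    Edge(0, 50, parent=6, child=4),
    Edge(0, 50, parent=6, child=5),
    Edge(50, 100, parent=7, child=4),  # right tree only
    Edge(50, 100, parent=7, child=2),
    Edge(50, 100, parent=8, child=7),
    Edge(50, 100, parent=8, child=3),
]

print(carriers(EDGES, Mutation(position=20, node=5), SAMPLES))
print(carriers(EDGES, Mutation(position=70, node=7), SAMPLES))
```

In this toy case, ten edge rows describe two trees that would need twelve rows if each tree were stored separately. The genotypes of every sample are recovered by walking up the tree rather than being stored explicitly.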

“The power of our approach is that it makes very few assumptions about the underlying data and can also include both modern and ancient DNA samples,” says Wohns.

Incorporating labels on the geographical locations of sequences allowed the team to estimate where certain common ancestors might have once lived and how they moved about.

Not only does this reveal events we already suspect, such as how human populations migrated from Africa, it hints at changes in population densities within ancestral groups we’re still learning about, such as the Denisovans.

 

Thanks to the efficiency of this process, the already impressive tree has plenty of room to grow as more genetic data become available in the future.

Adding millions more genomes will only make any further results more accurate, pinpointing exactly where a novel sequence fits in a genealogy that stretches around the world.

“This genealogy allows us to see how every person’s genetic sequence relates to every other, along all the points of the genome,” says BDI evolutionary geneticist, Yan Wong.

Thinking even bigger, there’s no reason the same approach couldn’t be applied to other species, possibly one day contributing to a global tapestry of life on Earth.

“While humans are the focus of this study, the method is valid for most living things; from orangutans to bacteria,” says Wohns.

“It could be particularly beneficial in medical genetics, in separating out true associations between genetic regions and diseases from spurious connections arising from our shared ancestral history.”

This research was published in Science.

 


Huge Project Is Now Underway to Sequence The Genome of Every Complex Species on Earth

The Earth Biogenome Project, a global consortium that aims to sequence the genomes of all complex life on earth (some 1.8 million described species) in ten years, is ramping up.

 

The project’s origins, aims and progress are detailed in two multi-authored papers published today. Once complete, it will forever change the way biological research is done.

Specifically, researchers will no longer be limited to a few “model species” and will be able to mine the DNA sequence database of any organism that shows interesting characteristics. This new information will help us understand how complex life evolved, how it functions, and how biodiversity can be protected.

The project was first proposed in 2016, and I was privileged to speak at its launch in London in 2018. It is currently in the process of moving from its startup phase to full-scale production.

The aim of phase one is to sequence one genome from every taxonomic family on earth, some 9,400 of them. By the end of 2022, one-third of these species should be done. Phase two will see the sequencing of a representative from all 180,000 genera, and phase three will mark the completion of all the species.

The importance of weird species

The grand aim of the Earth Biogenome Project is to sequence the genomes of all 1.8 million described species of complex life on Earth. This includes all plants, animals, fungi, and single-celled organisms with true nuclei (that is, all “eukaryotes”).

While model organisms like mice, rock cress, fruit flies and nematodes have been tremendously important in our understanding of gene functions, it’s a huge advantage to be able to study other species that may work a bit differently.

 

Many important biological principles came from studying obscure organisms. For instance, genes were famously discovered by Gregor Mendel in peas, and the rules that govern them were discovered in red bread mold.

DNA was discovered first in salmon sperm, and our knowledge of some systems that keep it secure came from research on tardigrades. Chromosomes were first seen in mealworms and sex chromosomes in a beetle (sex chromosome action and evolution have also been explored in fish and platypus). And telomeres, which cap the ends of chromosomes, were discovered in pond scum.

Answering biological questions and protecting biodiversity

Comparing closely and distantly related species provides tremendous power to discover what genes do and how they are regulated. For instance, in another PNAS paper, coincidentally also published today, my University of Canberra colleagues and I discovered Australian dragon lizards regulate sex by the chromosome neighborhood of a sex gene, rather than the DNA sequence itself.

Scientists also use species comparisons to trace genes and regulatory systems back to their evolutionary origins, which can reveal astonishing conservation of gene function across nearly a billion years. For instance, the same genes are involved in retinal development in humans and in fruit fly photoreceptors. And the BRCA1 gene that is mutated in breast cancer is responsible for repairing DNA breaks in plants and animals.

 

The genome of animals is also far more conserved than has been supposed. For instance, several colleagues and I recently demonstrated that animal chromosomes are 684 million years old.

It will be exciting, too, to explore the “dark matter” of the genome, and reveal how DNA sequences that don’t encode proteins can still play a role in genome function and evolution.

Another important aim of the Earth Biogenome Project is conservation genomics. This field uses DNA sequencing to identify threatened species, which include about 28 percent of the world’s complex organisms – helping us monitor their genetic health and advise on management.

No longer an impossible task

Until recently, sequencing large genomes took years and many millions of dollars. But there have been tremendous technical advances that now make it possible to sequence and assemble large genomes for a few thousand dollars. The entire Earth Biogenome Project will cost less in today’s dollars than the human genome project, which cost about US$3 billion in total.

In the past, researchers would have to identify the order of the four bases chemically on millions of tiny DNA fragments, then paste the entire sequence together again. Today they can register different bases based on their physical properties, or by binding each of the four bases to a different dye. New sequencing methods can scan long molecules of DNA that are tethered in tiny tubes, or squeezed through tiny holes in a membrane.

 

Why sequence everything?

But why not save time and money by sequencing just key representative species?

Well, the whole point of the Earth Biogenome Project is to exploit the variation between species to make comparisons, and also to capture remarkable innovations in outliers.

There is also the fear of missing out. For instance, if we sequence only 69,999 of the 70,000 species of nematode, we might miss the one that could divulge the secrets of how nematodes can cause diseases in animals and plants.

There are currently 44 affiliated institutions in 22 countries working on the Earth Biogenome Project. There are also 49 affiliated projects, including enormous projects such as the California Conservation Genomics Project, the Bird 10,000 Genomes Project and UK’s Darwin Tree of Life Project, as well as many projects on particular groups such as bats and butterflies.

Jenny Graves, Distinguished Professor of Genetics and Vice Chancellor’s Fellow, La Trobe University.

This article is republished from The Conversation under a Creative Commons license. Read the original article.

 


A Weird Paper Tests The Limits of Science by Claiming Octopuses Came From Space

A summary of decades of research on a rather ‘out-there’ idea involving viruses from space raises questions on just how scientific we can be when it comes to speculating on the history of life on Earth.

 

It’s easy to throw around words like crackpot, rogue, and maverick in describing the scientific fringe, but then papers like this one, from 2018, come along and leave us blinking owlishly, unsure of where to even begin.

A total of 33 names were listed as authors on this review, which was published by Progress in Biophysics and Molecular Biology back in August 2018. The journal is peer reviewed and fairly well cited. So it’s not exactly small, or a niche pay-for-publish source.

Science writer Stephen Fleischfresser goes into depth on the background of two of the better known scientists involved: Edward Steele and Chandra Wickramasinghe. It’s well worth a read.

For a tl;dr version, Steele is an immunologist with a fringe reputation for his view that evolution relies on acquiring gene changes determined by the influence of the environment rather than random mutations, in what he calls meta-Lamarckism.

Wickramasinghe, on the other hand, has had a somewhat less controversial career, recognized for empirically confirming Sir Fred Hoyle’s hypothesis describing the production of complex carbon molecules on interstellar dust.

 

Wickramasinghe and Hoyle also happened to be responsible for another space biology thesis. Only this one is based on more than just the origins of organic chemistry.

The Hoyle Wickramasinghe (H-W) thesis of Cometary (Cosmic) Biology makes the rather simple claim that the direction of evolution has been significantly affected by biochemistry that didn’t start on our planet.

In Wickramasinghe’s own words, “Comets are the carriers and distributors of life in the cosmos, and life on Earth arose and developed as a result of cometary inputs.”

Those inputs, Wickramasinghe argued, aren’t limited to a generous sprinkling of space-baked amino acids, either.

Rather, they include viruses that insert themselves into organisms, pushing their evolution into whole new directions.

The report, titled “Cause of Cambrian Explosion – Terrestrial or Cosmic?”, pulls on existing research to conclude that a rain of extra-terrestrial retroviruses played a key role in the diversification of life in our oceans roughly half a billion years ago.

“Thus retroviruses and other viruses hypothesized to be liberated in cometary debris trails both can potentially add new DNA sequences to terrestrial genomes and drive further mutagenic change within somatic and germline genomes,” the authors wrote.

 

Let that sink in for a moment. And take a deep breath before continuing, because that was the tame part.

It was during this period that a group of mollusks known as cephalopods first stretched out their tentacles from beneath their shells, branching into a stunning array of sizes and shapes in what seemed like a remarkably short time frame.

The genetics of these organisms, which today include octopuses, squid, and cuttlefish, are as weird as the animals themselves, due in part to their ability to edit their DNA on the fly.

The authors of the paper make the rather audacious claim that these genetic oddities might be a sign of life from space.

Not of space viruses this time, but the arrival of whole genomes frozen in stasis before thawing out in our tepid waters.

“Thus the possibility that cryopreserved squid and/or octopus eggs, arrived in icy bolides several hundred million years ago should not be discounted,” they wrote.

In his review of the paper, medical researcher Keith Baverstock from the University of Eastern Finland conceded that there’s a lot of evidence that plausibly aligns with the H-W thesis, such as the curious timeline of the appearance of viruses. 

 

But that’s just not how science advances.

“I believe this paper justifies skepticism of the scientific value of stand alone theories of the origin of life,” Baverstock argued at the time.

“The weight of plausible, but non-definitive, evidence, great though that might be, is not the point.”

While the idea is as novel and exciting as it is provocative, nothing in the summary helps us understand the history of life on Earth any better than existing conjectures, adding little of value to our model of evolution.

Still, with solid caveats in place, maybe science can cope with a generous dose of crazy every now and then.

Journal editor Denis Noble concedes that ‘further research is needed’, which is a bit of an understatement.

But given the developments regarding space-based organic chemistry in recent years, there’s room for discussion.

“As space chemistry and biology grows in importance it is appropriate for a journal devoted to the interface between physics and biology to encourage the debates,” said Noble.

“In the future, the ideas will surely become testable.”

Just in case those tests confirm speculations, we recommend being well prepared for the return of our cephalopod overlords. Who knows when they’ll want those eggs back?

This research was published in Progress in Biophysics and Molecular Biology.

A version of this article was first published in August 2018.

 


‘Useless Specks of Dust’ Turn Out to Be Building Blocks of All Vertebrate Genomes

Originally, they were thought to be just specks of dust on a microscope slide.

Now, a new study suggests that microchromosomes – a type of tiny chromosome found in birds and reptiles – have a longer history, and a bigger role to play in mammals than we ever suspected.

 

By lining up the DNA sequence of microchromosomes across many different species, researchers have been able to show the consistency of these DNA molecules across bird and reptile families, a consistency that stretches back hundreds of millions of years.

What’s more, the team found that these bits of genetic code have been scrambled and placed on larger chromosomes in marsupial and placental mammals, including humans. In other words, the human genome isn’t quite as ‘normal’ as previously supposed.

“We lined up these sequences from birds, turtles, snakes and lizards, platypus and humans and compared them,” says geneticist Jenny Graves, from La Trobe University in Australia. “Astonishingly, the microchromosomes were the same across all bird and reptile species.

“Even more astonishingly, they were the same as the tiny chromosomes of Amphioxus – a little fish-like animal with no backbone that last shared a common ancestor with vertebrates 684 million years ago.”

By tracing these microchromosomes back to the ancient Amphioxus, the scientists were able to establish genetic links to all of its descendants. These tiny ‘specks of dust’ are actually important building blocks for vertebrates, not just abnormal extras.

It seems that most mammals have absorbed and jumbled up their microchromosomes as they’ve evolved, making them seem like normal pieces of DNA. The exception is the platypus, which has several chromosome sections that line up with microchromosomes, suggesting that this arrangement may well have acted as a ‘stepping stone’ for other mammals, according to the researchers.

Microchromosomes are consistent in birds and reptiles, but mixed up in larger chromosomes in mammals. (Paul Waters)

The study also revealed that as well as being similar across numerous species, the microchromosomes were also located in the same place inside cells.

“Not only are they the same in each species, but they crowd together in the center of the nucleus where they physically interact with each other, suggesting functional coherence,” says biologist Paul Waters, from the University of New South Wales (UNSW) in Australia.

 

“This strange behavior is not true of the large chromosomes in our genomes.”

The researchers credit recent advancements in DNA sequencing technology for the ability to sequence microchromosomes end-to-end, and to better establish where these DNA fragments came from and what their purpose might be.

It’s not clear whether there’s an evolutionary benefit to coding DNA in larger chromosomes or in microchromosomes, and the findings outlined in this paper might help scientists put that particular debate to rest – although a lot of questions remain.

The study suggests that the large chromosome approach that has evolved in mammals isn’t actually the normal state, and might be a disadvantage: genes are packed together much more tightly in microchromosomes, for example.

“Rather than being ‘normal’, chromosomes of humans and other mammals were puffed up with lots of ‘junk DNA’ and scrambled in many different ways,” says Graves.

“The new knowledge helps explain why there is such a large range of mammals with vastly different genomes inhabiting every corner of our planet.”

The research has been published in PNAS.

 


A Shockingly Small Percentage of Our DNA Is Uniquely ‘Human’, Study Finds

By now you might have heard the factoid that modern humans share a pretty large chunk of our genomes with bananas. But delving much deeper, how much of our genome is uniquely Homo sapiens?

 

A new study has suggested that number could be as small as 1.5 percent, with the rest being shared with our ancient relatives such as Neanderthals and Denisovans.

“We generate a map within human genomes of archaic ancestry and of genomic regions not shared with archaic hominins,” the team wrote in their new paper.

“We find that only 1.5 to 7 percent of the modern human genome is uniquely human.”

Untangling what is ours and what came from our ancient kin is a difficult task. How do you tell which genetic variants are due to interbreeding (also called admixing) of Neanderthals and Homo sapiens for example, rather than variants that were passed onto both species from a common ancestor?

The team wanted to create a system that could identify both admixture events as well as this shared inheritance – called incomplete lineage sorting – that would help tell us which regions of our genome are unique to us.

They created an algorithm called SARGE – Speedy Ancestral Recombination Graph Estimator – so they could map how our genes have woven through time and species, separating and joining back together at different points, using something called ancestral recombination graphs.

 

They ran SARGE on 279 modern human genomes from Africa and elsewhere, two high-quality Neanderthal genomes, and one high-quality Denisovan genome.

“Using the resulting ancestral recombination graph, we map Neanderthal and Denisovan ancestry, incomplete lineage sorting, and the absence of both across modern human genomes,” the team wrote.

“We find evidence of at least one wave of Neanderthal admixture into the ancestors of all non-Africans.”

Along with the 1.5 to 7 percent of the genome that’s unique to modern humans, they also found “evidence of multiple bursts of adaptive changes specific to modern humans within the past 600,000 years involving genes related to brain development and function”.

The researchers explain that most of the genes that were uniquely ours were not genes with unknown functions. Instead, they were well-known genes that code for proteins used in the brain.

Obviously, this is not even close to the end of the story. For starters, between 1.5 and 7 percent is a pretty large range and the team think they can make it more specific with more genomes and more research.

There have also been plenty of other analyses looking at the percentage of DNA we take from our ancient cousins, so it’s unlikely this will be the last word on the matter.

 

Plus, SARGE isn’t able to tell the researchers why those bursts of adaptive changes happened when they did.

However, the team already has some ideas.

“It’s extremely tempting to speculate that one or more of these bursts had something to do with the incredibly social behavior humans have – mediated in large part by our expert control of speech and language,” University of California, Santa Cruz paleogeneticist and one of the researchers, Richard Green, told Business Insider.

The research has been published in Science Advances.

 


An ancestral recombination graph of human, Neanderthal, and Denisovan genomes

INTRODUCTION

Much of the current genetic variation within humans predates the split, estimated at 520 to 630 thousand years (ka) ago (1), between the populations that would become modern humans and Neanderthals. The shared genetic variation present in our common ancestral population is still largely present among humans today and was present in Neanderthals up until the time of their extinction. This phenomenon, which is known as incomplete lineage sorting (ILS), means that any particular human will share many alleles with a Neanderthal that are not shared with some other humans. Therefore, humans often share genetic variation with Neanderthals not by admixture but rather by shared inheritance from a population ancestral to us both. Because of this, any effort to map ancestry from archaic hominins in human genomes must disentangle admixture from ILS. Furthermore, a technique able to identify both admixture and ILS could produce a catalog of uniquely human genomic regions that is free of both and thereby shed light on the evolutionary processes that have been important in our origin as a unique species.

Ancestral recombination graph (ARG) inference (2) is a powerful starting point for such an analysis. An ARG can be conceptualized as a series of trees, mapped to individual sites, over phased haplotypes (chromosomes) in a panel of genomes. Ancestral recombination events, or sites at which chromosome segments with different histories were joined together by historical recombination, form boundaries between trees. Each ancestral recombination event manifests as a clade of haplotypes, all of which descend from the first ancestral haplotype to have it, moving from one position in the tree upstream of the event to a new position in the downstream tree (3). ARGs are complete descriptions of phylogenomic datasets and present for recombining genomes what single trees present for nonrecombining genomes, i.e., a complete description of their genetic relationships. As prior techniques for ancestry mapping can be thought of as summaries of the ARG, higher resolution ancestry maps could be produced if the ARG were known. In addition, the ARG can be used to estimate the time to most recent common ancestor (TMRCA) between admixed and admixer haplotypes, providing additional information about historical admixture between humans and archaic hominins.
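The description of an ARG as a series of trees separated by recombination boundaries, and its use for estimating TMRCA, can be sketched in a few lines. This is only an illustration under stated assumptions: the haplotype labels (`H1`, `H2`, `N`), the node ages, and the function names are hypothetical and are not taken from SARGE or from the study's data.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# A local tree is stored as (parents, ages):
#   parents maps each node to its parent; the root has no entry
#   ages maps each node to its age in arbitrary illustrative units
LocalTree = Tuple[Dict[str, str], Dict[str, float]]


def tree_at(starts: List[int], trees: List[LocalTree], position: int) -> LocalTree:
    """Pick the local tree whose interval contains `position`.

    `starts` holds the sorted left coordinate of each tree, i.e. the
    recombination boundaries between neighboring trees.
    """
    index = bisect_right(starts, position) - 1
    if index < 0:
        raise ValueError("position lies before the first tree")
    return trees[index]


def lineage(parents: Dict[str, str], node: str) -> List[str]:
    """Return the path from `node` up to the root, inclusive."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path


def tmrca(tree: LocalTree, a: str, b: str) -> float:
    """Age of the most recent common ancestor of haplotypes a and b."""
    parents, ages = tree
    seen = set(lineage(parents, a))
    for node in lineage(parents, b):   # first shared node going upward is the MRCA
        if node in seen:
            return ages[node]
    raise ValueError("haplotypes share no ancestor in this tree")


# Hypothetical example: two human haplotypes (H1, H2) and one archaic (N).
left_tree: LocalTree = (
    {"H1": "x", "H2": "x", "x": "r", "N": "r"},
    {"H1": 0, "H2": 0, "N": 0, "x": 100, "r": 600},
)
right_tree: LocalTree = (
    {"H1": "y", "N": "y", "y": "r2", "H2": "r2"},
    {"H1": 0, "H2": 0, "N": 0, "y": 50, "r2": 600},
)
STARTS = [0, 1000]            # the second tree begins at coordinate 1000
TREES = [left_tree, right_tree]

print(tmrca(tree_at(STARTS, TREES, 500), "H1", "N"))
print(tmrca(tree_at(STARTS, TREES, 1500), "H1", "N"))
```

In the made-up right-hand tree, H1 joins the archaic haplotype far more recently than it does in the left-hand tree. That is the kind of signal that per-site TMRCA estimates from an ARG could provide when comparing admixed and admixer haplotypes.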

Given the utility of an ARG, it is expected that several methods have been devised for estimating ARGs from genetic data. These published approaches all have different strengths and limitations. BEAGLE (3), ArgWeaver (4), and Rent+ (5) were designed for small datasets and require substantial time and/or memory to be used with large sequencing panels. Margarita (6) randomly samples histories at ancestral recombination event boundaries and does not seek to produce parsimonious recombination histories (6). ArgWeaver (4), which is widely considered the gold standard in ARG inference, requires prior knowledge of demographic model parameters. Relate (7) is a relatively new method that scales well to large datasets and produces trees without polytomies and with branch lengths but, in doing so, necessarily samples some relationships that are not directly inferred from the data, as do several other methods (4, 5). The most computationally efficient approach, tsinfer (8), also scales to large datasets but assumes that frequency of an allele is correlated with its age. Since this assumption is violated at loci undergoing either admixture or selection, tsinfer is not well suited for ARG inference using genetic data from Neanderthals, Denisovans, and modern humans.

Here, we present a heuristic, parsimony-guided ARG inference algorithm called SARGE (Speedy Ancestral Recombination Graph Estimator) and use it to build a genome-wide ARG of both modern human and archaic hominin genomes. SARGE can run on thousands of phased genomes, makes no prior assumptions other than parsimony, heuristically estimates branch lengths, and avoids making inferences about unobserved relationships by leaving polytomies in output trees. We validate SARGE using simulated data and demonstrate that it has high specificity compared to existing methods in reconstructing the topology of trees, making it suitable for identifying archaic admixture segments. To achieve this high specificity, SARGE avoids describing some relationships in output trees, resulting in lower sensitivity than existing methods.

We run SARGE on a panel of 279 modern human genomes, two high-coverage Neanderthal genomes, and one high-coverage Denisovan genome. Using the resulting ARG, we map Neanderthal and Denisovan ancestry, ILS, and the absence of both across modern human genomes. We find evidence of at least one wave of Neanderthal admixture into the ancestors of all non-Africans. We also identify several long and deeply divergent Neanderthal haplotype blocks that are specific to some human populations. We find support for the hypothesis that Denisovan-like ancestry is the result of multiple introgression events from different source populations (9, 10). We also detect an excess of Neanderthal and Denisovan haplotype blocks unique to South Asian genomes. Last, we pinpoint human-specific changes likely to have been affected by selection since the split with archaic hominins, many of which are involved in brain development.

RESULTS

ARG algorithm

To build an ARG containing both modern human and archaic hominin genomes without the use of a demographic model or the need to infer ancestral haplotypes, we developed a parsimony-based ARG inference technique, SARGE. SARGE uses both shared derived alleles and inferred, shared ancestral recombination events to articulate trees (Fig. 1A and Supplementary Methods). SARGE uses the four-gamete test (11) to determine regions of recombination and the affected haplotypes. The crux of SARGE is a fast algorithm for choosing the branch movement(s) capable of explaining the highest number of discordant clades across a genomic segment that fails the four-gamete test. Once the branch movements, i.e., inferred ancestral recombinations, are determined, further definition of clades is possible. Thus, the trees are articulated by both shared alleles and shared ancestral recombination events (figs. S1 and S2 and Supplementary Methods). SARGE infers branch lengths via a heuristic method, compensating for mutation rate variation across the genome by comparing the number of mutations on each branch to the divergence to an outgroup genome in a fixed width region around each site (fig. S7 and Supplementary Methods).
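The four-gamete test mentioned above can be shown with a minimal sketch. It assumes biallelic sites coded 0 (ancestral) and 1 (derived) and no recurrent mutation; the function names and the toy haplotypes are hypothetical. It covers only the test itself, not SARGE's procedure for choosing branch movements.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def fails_four_gamete_test(site_a: Sequence[int], site_b: Sequence[int]) -> bool:
    """Return True if two biallelic sites show all four allele combinations.

    Each argument lists the 0/1 allele of every haplotype at one site, in the
    same haplotype order. Seeing 00, 01, 10 and 11 together cannot be explained
    by a single tree without recurrent mutation, so it implies recombination
    between the two sites.
    """
    if len(site_a) != len(site_b):
        raise ValueError("sites must cover the same haplotypes")
    gametes = set(zip(site_a, site_b))
    return len(gametes) == 4


def incompatible_pairs(sites: List[Sequence[int]]) -> List[Tuple[int, int]]:
    """Index pairs of sites that fail the four-gamete test."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sites), 2)
        if fails_four_gamete_test(a, b)
    ]


# Hypothetical alleles for four haplotypes (A, B, C, D) at three sites.
SITES = [
    [0, 0, 1, 1],  # site 0: derived allele shared by C and D
    [0, 1, 1, 1],  # site 1: derived allele shared by B, C and D
    [0, 1, 0, 1],  # site 2: derived allele shared by B and D
]

print(incompatible_pairs(SITES))
```

Sites 0 and 1 are compatible because one clade nests inside the other. Sites 0 and 2 define clades that overlap without nesting, so at least one recombination event must lie between them.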

Fig. 1 Data structure and performance of SARGE on simulated data.

(A) Schematic of data structure. Top: Rectangles are “tree nodes” representing clades in trees. Each clade has member haplotypes (shown with letters A to G) and a start and end coordinate (blue numbers in brackets) determined by coordinates of single-nucleotide polymorphism (SNP) sites tagging the clade (yellow numbers in braces), along with a propagation distance parameter (100 in this example). Parent/child edges (vertical arrows) also have start and end coordinates determined by the nodes. Ovals are candidates for clades sharing an ancestral recombination event that can explain four-gamete test failures; colored edges indicate potential paths between tree nodes through candidate nodes that could explain four-gamete test failures (colors indicate types of paths). The candidate node with the most edges (here, AB) is eventually chosen as the most parsimonious branch movement, allowing for the inference of new nodes. The two trees at the bottom show the “solved” ancestral recombination event with the branch movement marked in red and all clades inferred without SNP data marked with yellow stars (haplotypes A and B share an ancestral recombination event; their ancestry is shared with haplotypes C, D, and G upstream of the recombination event and haplotype E downstream of it). The coordinates of the recombination event (blue numbers in brackets) are taken to be midway between the highest-coordinate upstream site (left side) and the lowest-coordinate downstream site (right side) involved in recombination. For a more detailed overview of the data structure, see figs. S3 to S5. (B) Accuracy of SARGE on simulated data (defined as percent of all clades correct according to the true ARG in the simulation), with increasing numbers of human-like haplotypes from an unstructured population. Error bars are one SD across five replicates. (C) Number of nodes per tree with increasing number of haplotypes in simulated data.


In the interest of parsimony, our method attempts to infer a set of ancestral recombination events that each explains as many four-gamete test failures as possible. Because the four-gamete test is known to underestimate the true number of ancestral recombination events (12, 13), SARGE will systematically underestimate the true number of ancestral recombination events in a dataset by design. Because of this, SARGE is not well suited to certain tasks, such as the creation of fine-grained recombination maps. We have attempted to mitigate cases where a clade in the ARG should be broken by an unobserved ancestral recombination event, however, by introducing a propagation distance parameter that limits the genomic distance over which each observed clade is allowed to persist (Fig. 1A and Supplementary Methods).
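The four-gamete test that drives this inference can be illustrated with a minimal sketch. This is not SARGE's implementation; the function names and the toy matrix below are hypothetical, and the sketch assumes biallelic sites coded 0 (ancestral) and 1 (derived). Two sites fail the test when all four allele combinations appear among the haplotypes, which cannot happen on a single tree without recurrent mutation and therefore implies at least one recombination event between the sites.

```python
from itertools import combinations


def four_gamete_failure(site_a, site_b):
    """Return True if two biallelic sites show all four gametes (00, 01, 10, 11)."""
    gametes = set(zip(site_a, site_b))
    return len(gametes) == 4


def failing_pairs(sites):
    """List index pairs of sites that fail the four-gamete test."""
    return [
        (i, j)
        for i, j in combinations(range(len(sites)), 2)
        if four_gamete_failure(sites[i], sites[j])
    ]


# Toy data (hypothetical): rows are SNP sites, columns are haplotypes.
sites = [
    [1, 1, 0, 0, 0],  # derived allele tags clade {0, 1}
    [1, 1, 1, 0, 0],  # clade {0, 1, 2}: nests the first clade, so compatible
    [0, 1, 1, 0, 0],  # clade {1, 2}: overlaps clade {0, 1} without nesting
]
print(failing_pairs(sites))  # [(0, 2)]
```

Only the first and third sites conflict here, so a single recombination event between them is enough to explain the data. A parsimony-based method tries to explain as many such failures as possible with each inferred event.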

SARGE is scalable to large datasets and achieves higher specificity than many other methods at the cost of lower sensitivity, by leaving uncertainty (polytomies) in the output data. Using simulated data, we find that SARGE runs quickly (figs. S8D and S10), requires little memory, and has 78.93% specificity [95% confidence interval (CI), 78.09 to 78.95%] on average across a range of simulated datasets that include between 50 and 500 haplotypes (see Supplementary Methods). SARGE is at least as specific as alternative techniques (fig. S9, A and C). Conversely, SARGE’s sensitivity (25.36%; 95% CI, 25.32 to 25.40%) is lower than that of other methods (fig. S9, B and D), as SARGE leaves an increasingly large number of polytomies in output trees as the number of input haplotypes increases (Fig. 1, B and C). As expected, SARGE (as well as similar techniques) performs best when the mutation to recombination rate ratio is high, as this makes clades easier to detect (figs. S11 and S12) and suffers slightly in accuracy with increasing amounts of population structure (fig. S13).
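The trade-off between specificity and sensitivity can be made concrete with a small sketch. The exact definitions used in the paper are given in its Supplementary Methods; the sketch below assumes that specificity is the fraction of inferred clades that exist in the true tree and that sensitivity is the fraction of true clades that were inferred. The function name and the toy trees are hypothetical.

```python
def clade_accuracy(inferred, truth):
    """Return (specificity, sensitivity) for two collections of clades.

    Each clade is a frozenset of haplotype labels. The definitions used
    here are assumptions made for illustration only.
    """
    inferred, truth = set(inferred), set(truth)
    shared = inferred & truth
    specificity = len(shared) / len(inferred) if inferred else float("nan")
    sensitivity = len(shared) / len(truth) if truth else float("nan")
    return specificity, sensitivity


# True tree: ((A,B),C) joined with (D,E).
truth = {frozenset("AB"), frozenset("ABC"), frozenset("DE"), frozenset("ABCDE")}

# An inferred tree that leaves C, D, and E in a polytomy under the root.
inferred = {frozenset("AB"), frozenset("ABCDE")}

spec, sens = clade_accuracy(inferred, truth)
print(spec, sens)  # 1.0 0.5
```

Leaving unresolved clades as a polytomy costs sensitivity but never adds a wrong clade, which is the behavior described above.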

We also find that SARGE’s sensitivity can be increased by increasing the propagation distance parameter (see Supplementary Text and fig. S15), that missing clades are likely to be small and close to the leaves of trees (see Supplementary Text and fig. S16), and that incorrectly inferred clades tend to lie within a few kilobases of sites at which those clades truly exist (see Supplementary Text and fig. S17). Using simulated data, we also find that SARGE’s branch lengths do not appear to be systematically biased upward or downward (see Supplementary Text and fig. S18).

We ran SARGE on 279 phased human genomes from the Simons Genome Diversity Project (SGDP) (14), together with two high-coverage Neanderthal genomes (1, 15) and one high-coverage Denisovan genome (16). For many analyses, we relied on the modern human population labels defined by the SGDP, but we split sub-Saharan Africans into one population containing only the most deeply diverged lineages (Biaka, Mbuti, and Khomani-San), which we call “Africa-MBK,” and the remaining genomes (“Africa”). Using these data, we find that the completeness of trees in the ARG (the extent to which all possible clades are present rather than in polytomies) is positively correlated with the local mutation rate to recombination rate ratio (fig. S20A; Spearman’s rho = 0.40; P < 2.2 × 10⁻¹⁶) and that the number of inferred ancestral recombination events per genomic window agrees with a previously published population recombination map (17) (fig. S20B; Spearman’s rho = 0.46; P < 2.2 × 10⁻¹⁶), as expected. Estimates of the mean TMRCA of pairs of haplotypes, taken across all trees, were also concordant with prior knowledge (Fig. 2A).

Fig. 2 Performance of SARGE on SGDP and archaic hominin dataset.

(A) Pairwise coalescence times for randomly sampled sets of up to 10 pairs of phased genome haplotypes per population (every possible pair was considered for archaic hominins, since fewer genomes were available). Values are calibrated using a 13-Ma human-chimp divergence time (see Supplementary Methods) and averaged across every variable site in the dataset, error bars show one SD, and branch shortening values for archaic samples were incorporated into calculations using mean values reported in (1). The lower value for humans comes from removing archaic-admixed clades from trees. (B) UPGMA (unweighted pair group method with arithmetic mean) trees computed using nucleotide diversity from SNP data (top and left) against similarity matrix from shared recombination events inferred by SARGE. Light yellow boxes (similar groups) are Native Americans and Papuans. (C) Average similarity between Orcadian haplotypes in the SGDP panel and other European haplotypes calculated on the basis of the number of shared ancestral recombination events. The best matches are in England, Iceland, and Norway, as expected.


Using these data, we found SARGE’s inferences of ancestral recombination events to be accurate. Because SARGE articulates tree clades using either shared allelic variation or shared inferred ancestral recombination, it is possible to test the concordance of trees made from each source. On average, 13.2% of clades in the ARG are known only from inference of shared ancestral recombination events and not by the presence of a shared, derived allele. We created a similarity score between every pair of phased human genome haplotypes in our dataset based on how often the haplotypes share ancestral recombination events. This score recapitulates relationships among humans known from single-nucleotide polymorphism (SNP) data alone (Fig. 2, B and C; Pearson’s r² with scores from SNP data = 0.989; P < 2.2 × 10⁻¹⁶). We note that genomes with the poorest correlation between SNP-based and recombination-based similarity scores to other genomes are those most likely to contain phasing errors (table S1).
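A recombination-based similarity score of this kind can be sketched as follows. The paper's exact scoring is described in its supplementary material; this sketch simply assumes that each inferred recombination event comes with the set of haplotypes that share it and counts, for every pair, the number of events in common. The function names, haplotype labels, and values are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def shared_event_counts(events):
    """Count, for each haplotype pair, the recombination events they share."""
    counts = Counter()
    for members in events:
        for pair in combinations(sorted(members), 2):
            counts[pair] += 1
    return counts


def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical events: each set lists the haplotypes sharing one event.
events = [{"h1", "h2"}, {"h1", "h2", "h3"}, {"h3", "h4"}]
counts = shared_event_counts(events)

# Hypothetical SNP-based similarity for the same pairs, for comparison only.
pairs = [("h1", "h2"), ("h1", "h3"), ("h2", "h3"), ("h3", "h4")]
snp_scores = [0.9, 0.5, 0.6, 0.4]
recomb_scores = [counts[p] for p in pairs]
print(recomb_scores, pearson_r2(recomb_scores, snp_scores))
```

A high r² between the two score vectors indicates that recombination-derived clades carry the same relationship signal as the SNP data, which is the concordance check described above.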

Archaic hominin admixture

We used our ARG to find regions of each phased human genome that derived from admixture with archaic hominins (see Supplementary Methods and fig. S24). If humans and the archaic hominins in our panel were in populations that had sorted their lineages, then this exercise would be simple with a complete and correct ARG. However, since human genome regions are often within a clade that includes archaic hominin haplotypes due to incomplete lineage sorting (ILS), finding admixed segments requires analysis beyond simply finding clades that unite some human and archaic hominin haplotypes.

We started by selecting clades from ARG trees that united some modern humans with archaic hominins to the exclusion of some other modern humans. We then assigned each human genome haplotype in each such clade as putative Neanderthal, Denisovan, or ambiguous ancestry, depending on whether the clade contained Neanderthal, Denisovan, or both types of haplotypes. We then performed several filtering steps to remove those clades likely to result from ILS. First, we removed any clades that included more than 10% of the Africa-MBK haplotypes from the most basal human lineages, which are unlikely to be admixed. We then discarded clades that persisted for a short distance along the chromosome (which likely represent older haplotypes broken down over time by recombination) or in which the TMRCA between modern humans and archaic hominins was high (see Supplementary Methods and fig. S24). This ascertainment strategy was designed to identify haplotype blocks that we could confidently identify as archaic-introgressed and therefore likely underestimated the true extent of admixture across the genome. Because our method relies on both the haplotype block length and the TMRCA between admixed and introgressor haplotypes to identify admixed segments, we were able to identify some haplotypes that resemble archaic admixture in modern humans but that have relatively high sequence divergence to published archaic genomes (manifesting as high TMRCAs between archaic and modern genomes within these segments).
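The labeling and filtering steps above can be sketched as a simple pipeline. Only the 10% Africa-MBK cutoff comes from the text. The length and TMRCA thresholds are defined in the Supplementary Methods, so the values used below, along with the class, field, and function names, are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class CandidateClade:
    has_neanderthal: bool
    has_denisovan: bool
    mbk_fraction: float  # fraction of Africa-MBK haplotypes inside the clade
    span_bp: int         # distance the clade persists along the chromosome
    tmrca: float         # modern-archaic TMRCA within the clade (hypothetical units)


def ancestry_label(clade):
    """Label a clade that contains at least one archaic haplotype."""
    if clade.has_neanderthal and clade.has_denisovan:
        return "ambiguous"
    return "Neanderthal" if clade.has_neanderthal else "Denisovan"


def passes_filters(clade, min_span_bp, max_tmrca, max_mbk_fraction=0.10):
    """Keep clades unlikely to reflect ILS (thresholds are assumptions)."""
    return (
        clade.mbk_fraction <= max_mbk_fraction
        and clade.span_bp >= min_span_bp
        and clade.tmrca <= max_tmrca
    )


candidates = [
    CandidateClade(True, False, 0.00, 60_000, 80.0),
    CandidateClade(True, True, 0.00, 45_000, 90.0),
    CandidateClade(True, False, 0.25, 70_000, 70.0),  # too many Africa-MBK haplotypes
    CandidateClade(False, True, 0.00, 2_000, 85.0),   # persists over too short a span
]
kept = [
    ancestry_label(c)
    for c in candidates
    if passes_filters(c, min_span_bp=10_000, max_tmrca=150.0)
]
print(kept)  # ['Neanderthal', 'ambiguous']
```

Because every filter only removes candidates, a pipeline of this shape is conservative by construction, which matches the expectation that the resulting maps underestimate the true extent of admixture.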

Using the resulting maps, we calculated genome-wide percent admixture estimates across populations and compared them to estimates based on the population-wide D statistic (18, 19) using basal Africa-MBK lineages as an outgroup. Since our alleles were polarized relative to the chimpanzee genome and only sites with derived alleles present in hominins were considered, our calculations were of the form D(Africa-MBK, Test, Introgressor, Chimpanzee) / D(Africa-MBK, Introgressor1, Introgressor2, Chimpanzee), where Introgressor1 and Introgressor2 were randomly chosen subsets of half of the introgressor (Neanderthal or Denisovan) haplotypes and the derived allele frequency in chimpanzee was set to 0 at all sites in our dataset. ARG-based estimates are similar to, but lower than, D statistic–based estimates in all non-African genomes, which we expected because of our aggressive filtering strategy for eliminating ILS (see Supplementary Methods). We detected slightly more admixture in sub-Saharan Africans (excluding Africa-MBK) than using the D statistic (Fig. 3A), even when considering the lower end of 95% confidence weighted block jackknife intervals (table S2). We note that a recent study that used an outgroup-free method to detect Neanderthal ancestry blocks in human genomes also found a higher average amount of Neanderthal ancestry in African genomes than has been previously reported (20). As another quality check, we compared our maps of Neanderthal ancestry to those published in prior studies (20–23). We found that maps produced by SARGE are about as concordant with the published maps as the published maps are with each other (fig. S25).
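The ratio of D statistics described above can be sketched with the standard frequency-based ABBA-BABA estimator. This is a minimal illustration and not the authors' code: the function names and toy frequencies are hypothetical, sign conventions for D vary between papers, and the sketch assumes the convention in which a positive D(P1, P2, P3, O) indicates excess derived-allele sharing between P2 and P3.

```python
def d_statistic(p1, p2, p3, p_out=None):
    """Frequency-based D(P1, P2, P3, O) from per-site derived allele frequencies.

    If p_out is omitted, the outgroup derived frequency is taken to be 0
    at every site, as for a polarizing chimpanzee outgroup.
    """
    if p_out is None:
        p_out = [0.0] * len(p1)
    abba = baba = 0.0
    for a, b, c, o in zip(p1, p2, p3, p_out):
        abba += (1 - a) * b * c * (1 - o)  # P2 and P3 share the derived allele
        baba += a * (1 - b) * c * (1 - o)  # P1 and P3 share the derived allele
    # Raises ZeroDivisionError if no site is informative.
    return (abba - baba) / (abba + baba)


def admixture_fraction(mbk, test, intro, intro1, intro2):
    """Ratio D(MBK, Test, Intro, O) / D(MBK, Intro1, Intro2, O)."""
    return d_statistic(mbk, test, intro) / d_statistic(mbk, intro1, intro2)


# Hypothetical derived allele frequencies at three sites.
mbk = [0.0, 0.1, 0.0]
test = [0.2, 0.1, 0.0]
intro = [1.0, 1.0, 1.0]
intro1 = [1.0, 1.0, 0.5]
intro2 = [1.0, 0.5, 1.0]
print(admixture_fraction(mbk, test, intro, intro1, intro2))
```

The denominator measures how much signal a fully introgressor-derived genome would produce, so the ratio rescales the test population's D value into an approximate ancestry proportion.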

Fig. 3 Neanderthal ancestry inferred in modern humans.

(A) Genome-wide percent Neanderthal, Denisovan, and ambiguous (either Neanderthal or Denisovan) ancestry across SGDP populations, using the ARG and an estimator based on the D statistic. D statistic calculations considered only one archaic population at a time as introgressor and thus do not detect ambiguous ancestry and also might count some Denisovan ancestry as Neanderthal and vice versa. (B) For individual phased human genome haplotypes (points), mean TMRCA with Neanderthal in Neanderthal haplotype blocks (y axis) and mean Neanderthal haplotype block length (x axis) for all Neanderthal-introgressed haplotypes. TMRCA calculations assume a total of 6.5-Ma human-chimpanzee divergence and branch shortening values from (1), with a mutation rate of 1 × 10⁻⁹ per site per year. bp, base pair. (C) Overall number of Neanderthal haplotype blocks versus geographically restricted (unique to a 3000-km radius) Neanderthal haplotype blocks. (D) Same as (B), but limited to geographically restricted (unique to a 3000-km radius) Neanderthal haplotype blocks. Only haplotypes with more than 10 geographically restricted segments are shown.


We next looked for population-specific differences in archaic hominin ancestry in modern humans. Lengths of archaic haplotype segments and the TMRCA to admixer across those segments are both affected by the time of admixture and the divergence between the true admixers and available archaic hominin genomes. We therefore computed the mean of these two values for each ancestry type per phased genome haplotype and compared them across individuals from different populations to look for evidence of distinctive, population-specific admixture events. This analysis revealed distinctive population-specific patterns for Neanderthal and Denisovan ancestry, and many pairwise comparisons of these values between populations are significant (table S3). Segments of ambiguous ancestry produce a pattern resembling a mixture of Neanderthal and Denisovan ancestry, as expected (Fig. 3, B to D, and figs. S26, S28, S37, and S38). We caution, however, that our approach can artificially shorten haplotype block lengths (see Supplementary Methods and figs. S21 and S22), especially for populations such as Papuans and Australians that were absent from the 1000 Genomes Project panel (24) that was used for phasing (14). Nonetheless, Neanderthal haplotype block lengths in Oceania are not significantly shorter than in other populations (Fig. 3B), and incorrect phasing in archaic genomes does not appear to negatively affect results of admixture scans using simulated data (see Supplementary Text).

As expected, the ARG classifies a smaller fraction (0.096 to 0.46%) of sub-Saharan African genomes (excluding Mozabite and Saharawi individuals) as resulting from Neanderthal admixture compared to non-African genomes (0.73 to 1.3%). The haplotype segments of African genomes that are grouped together in clades with Neanderthal haplotypes are distinct from the haplotype segments found in the genomes of people with non-African ancestry (Fig. 3B and fig. S26A). Namely, the African haplotypes are more dissimilar to the Neanderthal haplotypes with which they are grouped and tend to be shorter. These observations are qualitatively consistent with a model in which genetic drift groups Neanderthal and African haplotypes, independent of a specific admixture event. It is also possible that these haplotypes are the result of true introgression events from unknown archaic hominins distantly related to the Neanderthal/Denisovan lineage (25). Another recent study using an inferred ARG also found mysterious, divergent haplotypes within sub-Saharan Africans that resembled unknown archaic introgression (7).

Unexpectedly, however, two of the SGDP African populations, Masai and Somali, are intermediate between non-African and African genomes when measuring lengths of archaic haplotype segments and TMRCA to admixers within them (Fig. 3B). These Neanderthal haplotype blocks may have originated in ancient European migrants to eastern Africa (26) and spread beyond eastern Africa through gene flow, which is known to have affected even the basal Africa-MBK lineages (27).

To test this hypothesis, we recomputed the mean length of, and mean TMRCA to admixer genomes within, archaic-introgressed haplotype segments across all individuals, using only geographically restricted segments. We defined these as any archaic haplotype segments found only in genomes that were sampled within a 3000-km radius of each other (using geodesic distance between sampling coordinates). This analysis showed Masai and Somali genomes to have a comparable number of geographically restricted Neanderthal haplotypes to most other African genomes (Fig. 3C), with similar haplotype block lengths and TMRCAs to admixers within these geographically restricted haplotypes (Fig. 3D). This observation is concordant with the idea that the unusual Neanderthal haplotypes in these populations originated in Eurasian migrants.
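The geographic restriction criterion can be sketched as a pairwise distance check. This sketch makes two assumptions that the text does not confirm: it approximates geodesic distance with the haversine great-circle formula on a spherical Earth, and it interprets "within a 3000-km radius of each other" as every pair of carriers being at most 3000 km apart. The function names and coordinates are hypothetical.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def great_circle_km(a, b):
    """Haversine distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def is_geographically_restricted(sample_coords, max_km=3000.0):
    """True if every pair of carrier sampling sites lies within max_km."""
    return all(
        great_circle_km(a, b) <= max_km
        for a, b in combinations(sample_coords, 2)
    )


# Hypothetical sampling coordinates of genomes carrying one haplotype block.
nearby = [(0.0, 0.0), (0.0, 10.0)]             # roughly 1100 km apart
spread = [(0.0, 0.0), (0.0, 10.0), (0.0, 40.0)]  # one pair roughly 4400 km apart
print(is_geographically_restricted(nearby))  # True
print(is_geographically_restricted(spread))  # False
```

Under this interpretation, a block carried by a single genome is trivially restricted, so any real analysis would also need a rule for singletons.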

Outside of Africa, our Neanderthal introgression maps largely agree with prior studies. We detect a mean TMRCA to Neanderthal of about 74 ka ago across all Neanderthal haplotype blocks in non-African populations, using published corrections for branch shortening in the archaic genomes (1). This value is consistent with published estimates of Neanderthal admixture times and the phylogenetic distance between the Vindija33.19 Neanderthal and the introgressing Neanderthal calculated by a recent study (28). The mean TMRCA between genomic segments detected as Neanderthal admixture segments and the Neanderthal itself is consistent within about 5000 years for all populations outside of Africa (Fig. 3B). We see slightly more Neanderthal ancestry in Central Asia, East Asia, and the Americas than in Europe, South Asia, and Southwest Asia (Fig. 3A). We also find more geographically restricted Neanderthal haplotype blocks in South Asia than elsewhere in mainland Eurasia, and the fewest geographically restricted Neanderthal haplotype blocks in the Americas (Fig. 3C and fig. S35).

Humans in Central and East Asia are known to have elevated Neanderthal ancestry compared to other populations (22). However, there is debate over whether this elevated Neanderthal ancestry is due to smaller past population size relative to other groups and the resulting stronger effect of genetic drift (22) or to additional pulses of Neanderthal admixture specific to these populations (9, 29). Although we detect more Neanderthal ancestry in Central and East Asians than in West Eurasians, we detect a similar number of geographically restricted haplotype blocks (unique to a 3000-km radius) in both groups (Fig. 3C). Further, Neanderthal haplotype blocks are shorter on average and therefore older in Central and East Asians than in West Eurasians (Fig. 3B). This implies that the excess Neanderthal ancestry in Central and East Asians mostly comprises broadly shared haplotype blocks from introgression common to all non-Africans, consistent with the drift scenario. Another recent study (20) suggested that excess Neanderthal ancestry in Central and East Asians could be mostly due to underestimating Neanderthal ancestry in West Eurasians as a result of using sub-Saharan Africans, who share some Neanderthal ancestry with West Eurasians, as a model unadmixed outgroup. We reject this explanation because removing Neanderthal haplotypes from West Eurasians would likely increase the number of geographically restricted Neanderthal haplotypes in Asians, contrary to our observation. Furthermore, South Asians appear to have a comparable amount of Neanderthal ancestry to West Eurasians (Fig. 3, B and C), despite sharing few Neanderthal haplotype blocks with Africans relative to West Eurasians (fig. S37A). Last, because our strategy for ascertaining Neanderthal haplotypes used only Mbuti, Biaka, and Khomani-San genomes as an outgroup and allowed up to 10% of these genomes to have Neanderthal ancestry, we do not believe our results were significantly biased by our choice of outgroup. This is further demonstrated by the fact that we detect a non-negligible amount of Neanderthal ancestry in all African groups (table S2).

Aside from these broadly shared haplotype blocks, we also observe geographically restricted Neanderthal haplotype blocks in each non-African population in our panel. These population-specific haplotype blocks tend to be longer than the shared haplotype blocks and to have an older TMRCA to the Neanderthal genome than the broadly shared haplotype blocks (Fig. 3D, compared to all blocks shown in Fig. 3B). These observations suggest that the population-specific haplotype blocks may be the result of more recent population-specific Neanderthal admixture, as has recently been suggested (26–28, 30).

We next investigated population-specific patterns within Denisovan ancestry segments and found that these segments probably originate from admixture with multiple, divergent individuals that were distantly related to the Denisovan genome. This implies that the Denisovan genome is not a good model for the actual population that admixed with humans with “Denisovan” ancestry. Prior studies have suggested that Denisovan-like haplotype blocks in humans have two or three distinct sources with different levels of divergence to the Denisovan genome, with the best-matching haplotype blocks in East Asia (9, 10). We uncover the same signal: Geographically restricted Denisovan haplotype blocks have the lowest TMRCA to the Denisovan genome in East Asian genomes (mean TMRCA to Denisovan of 125 ka ago) (figs. S26 and S27).

Unexpectedly, we detected many Neanderthal and Denisovan-like haplotype blocks that are unique to South Asia (Fig. 3C and figs. S26C, S45, S35, and S36) and many Neanderthal haplotype blocks that are unique to Oceania (Fig. 3C and figs. S44 and S35). These geographically restricted Neanderthal haplotype blocks are no more divergent to the Neanderthal genome than those specific to other populations (Fig. 3D), complicating any interpretation of these regions.