Ancient DNA and deep population structure in sub-Saharan African foragers

Skeletal samples

The skeletal remains that were sampled in this study are curated at the National Museum of Kenya (Kisese II), the National Museum of Tanzania (Mlambalasi), the Malawi Department of Museums and Monuments (Hora 1 and Fingira) and the Livingstone Museum (Kalemba), and sampling permissions and protocols are described in Supplementary Note 3. Individuals were chosen based on their associated LSA archaeological contexts, and skeletal samples were selected to maximize the likelihood of yielding authentic aDNA and to minimize damage. The Fingira phalanx was an isolated find from a mixed excavation context, and too small to provide both aDNA and a direct date. A list of both successful and failing samples is provided in Supplementary Table 1. Direct radiocarbon dating was attempted on five of the six successful individuals at the Pennsylvania State University Radiocarbon Laboratory using established methods and quality control measures for collagen purification43,44 before accelerator mass spectrometry analysis (Supplementary Note 4). A list of direct date and stable isotopic results for the two successfully dated individuals, and indirect dates where available for the other individuals, is provided in Supplementary Tables 3 and 4. All dates were calibrated using OxCal (v.4.4)45, with a uniform prior (U(0,100)) to model a mixture of two curves: IntCal20 (ref. 46) and SHCal20 (ref. 47).

aDNA laboratory work

We successfully generated genome-wide aDNA data from a total of six human skeletal elements: five petrous bones and one phalanx. We processed an additional six petrous bones, eight teeth and 11 other bones in the same manner but did not obtain usable DNA (Supplementary Table 1). In clean room facilities at Harvard Medical School, we cleaned the outer surfaces of the samples and then sandblasted (petrous bones)48 or drilled (other bones and teeth) to obtain powder (additional information for the 15 previously published samples reported here with increased coverage can be found in refs. 11,13,15,16). We extracted DNA49,50,51 and prepared barcoded sequencing libraries (between one and six libraries for the six newly reported individuals, and between one and eight additional libraries for the previously reported individuals: from Mota Cave in Ethiopia15 (I5950); White Rock Point in Kenya13 (I8930); Gishimangeda Cave in Tanzania13 (I13763, I13982 and I13983); Chencherere II (I4421 and I4422), Fingira (I4426, I4427 and I4468) and Hora 1 (I2967) in Malawi11; and Shum Laka in Cameroon16 (I10871, I10872, I10873 and I10874), treating in almost all cases with uracil-DNA-glycosylase (UDG) to reduce aDNA damage artefacts52,53,54. We used two rounds of targeted in-solution hybridization to enrich the libraries for molecules from the mitochondrial genome and overlapping a set of around 1.2 million nuclear SNPs55,56,57,58 and sequenced in pools on the Illumina NextSeq 500 and HiSeqX10 machines with 76 bp or 101 bp paired-end reads. Further details on each library are provided in Supplementary Table 2. For the Mota individual (I5950), we also generated whole-genome shotgun sequencing data, using the same (pre-enrichment) library, with seven lanes with 101 bp paired-end reads (on Illumina HiSeq X Ten machines) yielding approximately 26× coverage (1,176,635 sites covered from the capture SNP set).

Bioinformatics procedures

From the raw sequencing data, we used barcode information to assign reads to the proper libraries (allowing at most one mismatch per read pair). We merged overlapping reads (at least 15 bases), trimmed barcode and adapter sequences from the ends, and mapped to the mtDNA reference genome RSRS59 and the human reference genome hg19 using BWA (v.0.6.1)60. After alignment, we removed duplicate reads and reads with mapping quality less than 10 (30 for shotgun data) or with length less than 30 bases. To prepare data for analysis, we disregarded terminal bases of the reads (2 for UDG-treated libraries and 5 for untreated, to eliminate most damage-induced errors), merged the .bam files for all libraries from each individual, and called pseudohaploid genotypes (one allele chosen at random from the reads aligning at each SNP). The high coverage for the Mota whole-genome shotgun data enabled us to call diploid genotypes; we used the procedure from ref. 26, including storing the genotypes in a fasta-style format that is easily accessible through the cascertain and cTools software. Code for bioinformatics tools and data workflows is provided at GitHub (https://github.com/DReichLab/ADNA-Tools and https://github.com/DReichLab/adna-workflow).

Uniparental markers and authentication

We determined the genetic sex of each individual according to the ratio of DNA fragments mapping to the X and Y chromosomes61. We called mtDNA haplogroups using HaploGrep2 (ref. 62), comparing informative positions to PhyloTree Build 17 (ref. 63) (Supplementary Table 6). For four individuals (I2967, I4422, I4426 and I19528) with evidence of haplogroups that split partially but not fully along more specific lineages, we use the notation [HaploGrep2 call]/[sub-clade direction] (for example, L0f/L0f3 for a split on the lineage leading to L0f3 but not within L0f3). For males, we called Y-chromosome haplogroups by comparing their derived mutations with the Y-chromosome phylogeny provided by YFull (https://yfull.com).

We evaluated the authenticity of the data first by measuring the rate of characteristic aDNA damage-induced errors at the ends of sequenced molecules. We next searched directly for possible contamination by examining (1) the X/Y ratio mentioned above (in case of contamination by sequences from the opposite sex), (2) the consistency of mtDNA-mapped sequences with the haplogroup call for each individual64 and (3) the heterozygosity rate at variable sites on the X chromosome (for males only)65. Two individuals (I2966 from Hora 1 and I13763 from Gishimangeda Cave) had non-negligible evidence of contamination from these metrics and also displayed excess allele sharing with non-Africans in the admixture graph analysis; we were able to fit them in the final model after allowing ‘artificial’ admixture from a European-related source (6% and 9%, respectively). We also restricted ourselves to damaged reads in making the mtDNA haplogroup call for I2966. Further details are provided in Supplementary Table 2 and Supplementary Note 5.

Familial relatives

We searched for close family relatives by computing, for each pair of individuals, the proportion of matching alleles (from all targeted SNPs) when sampling one read at random per site from each. We then compared these proportions to the rates when sampling two alleles from the same individual—mismatches are expected to be twice as common for unrelated individuals as for within-individual comparisons, with family relatives intermediate. We found one possible instance between the two individuals from White Rock Point (approximately second-degree relatives, but uncertain due to low coverage) (Extended Data Fig. 1b)

Dataset for genome-wide analyses

We merged our newly generated data with published data from ancient and present-day individuals11,12,13,14,16,25,26,66,67. We performed our genome-wide analyses using the set of autosomal SNPs from our target enrichment (about 1.1 million).

PCA

We performed a supervised PCA using the smartpca software68, using three populations (Juǀ’hoansi, Mbuti and Dinka; four individuals each, from ref. 26, were chosen to create a broad separation in the PCA between highly divergent ancestral lineages from southern, central and eastern Africa) to define a two-dimensional plane of variation, and projected all other present-day and ancient individuals (using the lsqproject and shrinkmode options). This procedure captures the genetic structure of the projected individuals in relation to the groups used to create the axes, reducing the effects of population-specific genetic drift in determining the positions of the individuals shown in the plot, as well as bias due to missing data for the ancient individuals.

f-statistics

We computed f-statistics in ADMIXTOOLS69, with standard errors estimated by block jackknife. To facilitate the use of low-coverage data, we used a new program, qpfstats (included as part of the ADMIXTOOLS package), together with the option ‘allsnps: YES,’ for both stand-alone f4-statistics and statistics for use in qpWave and qpGraph (see below). In brief, qpfstats solves a system of equations based on f-statistic identities to enable the estimation of a consistent set of statistics while maximizing the available coverage and reducing noise in the presence of missing data; full details are provided in Supplementary Note 7. We computed statistics of the form f4(Ind1, Ind2; Ref1, Ref2), where Ind1 and Ind2 are ancient individuals from Kenya, Tanzania or Malawi/Zambia, and Ref1 and Ref2 are either ancient southern African foragers (AncSA, listed in Extended Data Table 1), the Mota individual or present-day Mbuti. These groups were chosen in light of our PCA results and the previous evidence for ancestry related to some or all of them among ancient eastern and south-central African foragers5,11,14.

qpWave analysis

The qpWave software70 estimates how many distinct sources of ancestry (from 1 to the size of the test set) are necessary to explain the allele-sharing relationships between the specified test populations and the outgroups (where ‘distinct’ means different phylogenetic split points relative to the outgroups). Each test returns results for different ranks of the allele-sharing matrix, where rank k implies k + 1 ancestry sources. For absolute fit quality, we give the ‘tail’ P value, where a higher value indicates a better fit. We also give ‘taildiff’ P values as relative measures comparing consecutive rank levels, where a higher value indicates less improvement in the fit when adding another ancestry source. As our base test set, we used the 12 ancient eastern and south-central African forager individuals (3 from Kenya, 3 from Tanzania, 5 from Malawi and 1 from Zambia) from our admixture graph Model 3 who did not have evidence of either admixture from food producers or contamination. We also compared results when adding the Mota individual to the test set. As outgroups, we used Altai Neanderthal, Mota and the following eight present-day groups: Juǀ’hoansi, ǂKhomani, Mbuti, Aka, Yoruba, French, Agaw and Aari, with the last two (as well as Mota) omitted when we moved Mota to the test set.

Dates of admixture

We inferred dates of admixture using the DATES software21. We used a minimum genetic distance of 0.6 cM, a maximum of 1 M and a bin size of 0.1 cM. As reference populations, we used ancient southern African foragers together with one of Mota, Dinka, Luhya, Yoruba or European-American individuals (the latter three from 1000 Genomes: LWK, YRI and CEU). The results assume an average generation interval of 28 years, and standard errors were estimated by block jackknife.

Admixture graph fitting

We built admixture graphs using the qpGraph software in ADMIXTOOLS69. We chose to analyse each eastern and south-central forager individual separately rather than form subgroups (for example, by site or time period) to study both broad- and fine-scale structure (through relationships between individuals with both low and high degrees of ancestral similarity). Although such an approach was facilitated by our relatively manageable sample sizes, it also relied on the ability to compute f-statistics with our qpfstats methodology (further details are provided in Supplementary Note 7 and the ‘f-statistics’ section above) to make use of all available SNPs for individuals with low-coverage data. For all of the models, we used the options ‘outpop: NULL’, ‘lambdascale: 1’ and ‘diag: 0.0001.’ We also specified larger values of the ‘initmix’ parameter to explore the space of graph parameters more thoroughly: 100,000, 150,000 and 200,000 for models 1–3 (and additional models built from them), respectively.

We began with a version of the admixture graph from ref. 16, to which we added three high-coverage ancient forager individuals (from Jawuoyo, Kisese II and Fingira) to create model 1. We then extended our model to more individuals. We used a procedure in which we (1) added each other ancient individual one by one to model 1 and evaluated the fit; (2) built an intermediate-size model 2 including a total of 11 geographically diverse eastern and south-central African foragers; (3) added the remaining individuals one by one to model 2; and (4) built our final Model 3 with all 18 individuals above a coverage threshold of 0.05× (Supplementary Note 6). In steps (1) and (3), as a starting point, we assumed a simple form of admixture (as in model 1) whereby all eastern and south-central African individuals derived their ancestry from exactly the same three sources (in varying proportions). If we found that an individual did not fit well when added in this manner, we noted the specific violation(s) to determine whether the likely cause(s) were excess relatedness to certain other individuals, distinct source(s) for the three-way admixture, admixture from other populations, or contamination or other artefacts. For the two individuals (one from Hora 1 and one from Gishimangeda) with evidence of appreciable contamination, we included dummy admixture events contributing non-African-related ancestry. Full details on our fitting procedures are provided in Supplementary Note 6.

Excess relatedness analysis

To study excess relatedness between individuals after correcting for different proportions of Mota-related, central-African-related and southern-African-related ancestry, we built an admixture graph similar to our main model 3, but in which each forager individual is descended from an independent mixture of the three ancestry components, without accounting for excess shared genetic drift. We also included four additional individuals with lower coverage (three from Kenya and one from Chencherere II in Malawi), but excluded the two early individuals from Hora 1 due to their much greater time depth compared with other individuals in the model. Finally, for individuals modelled with admixture beyond the primary three sources (that is, pastoralist-related ancestry for four individuals, western-African-related ancestry for the Panga ya Saidi individual and the excess central-African-related ancestry for the Kakapel individual, plus dummy admixture for contamination), we locked the relevant branch lengths and mixture proportions at their values from model 3 to prevent compensation for the inaccuracies in the model by these parameters. We next used the residuals (fitted minus observed values) of each outgroup f3-statistic f3(Neanderthal; X, Y) to quantify the excess relatedness between individuals X and Y that is unaccounted for by the model. In other words, we fit each individual as we did during the add-one phase of the main admixture graph inference procedure (except here all simultaneously) but now, instead of using the model violations to inform the building of a well-fitting model, we used them directly as the output of the analysis.

We plotted the excess relatedness residuals for each pair of individuals as a function of great-circle distance between sites, as computed using the haversine formula (also adding a dummy value of 0.001 km to each distance). We fit curves to the data with the functional form 1/mx, additionally allowing for translation (full equation: y = 1/(mx + a) + b, where y is excess relatedness, x is distance, and m, a and b are fitted constants) through inverse-variance-weighted least squares. We also omitted the point corresponding to the pair of individuals from White Rock Point (Kenya) because of their evidence for close familial relatedness (see above). Finally, we computed a decay scale for the curves given by the formula (e – 1)× a/m (where e is Euler’s number). We note that a residual (that is, y axis) value of zero has no special meaning in the plots.

For Mesolithic Europe, we performed two analogous analyses, one for the western part of the continent and one for eastern and northern. In the first analysis, we selected individuals with predominantly western hunter-gatherer (WHG)-related ancestry, while in the second analysis, we selected individuals who could be modelled as admixed with WHG as well as eastern hunter-gatherer (EHG)-related ancestry (Supplementary Table 12). In both cases, we built simple admixture graph models to estimate the residuals. For western Europe, we used the Upper Palaeolithic Ust’-Ishim individual from Russia71 as an outgroup and fit all of the test individuals as descending from a single ancestral lineage. For eastern and northern Europe, we used Ust’-Ishim as an outgroup, Mal’ta 1 from Siberia72 for a representative of ancient northern Eurasian ancestry, Villabruna from Italy73 for WHG, Karelia from Russia56,58,73 for EHG (admixed with ancestry related to Mal’ta and to Villabruna) and finally the test individuals each with independent mixtures of WHG and EHG-related ancestry in varying proportions.

Effective population size inference

We called ROH starting with counts of reads for each allele at the set of target SNPs (rather than our pseudohaploid genotype data), which we converted to normalized Phred-scaled likelihoods. We performed the calling using BCFtools/RoH74, which is able to accommodate unphased, relatively low-coverage data (at least for calling long ROH) and does not rely on a reference haplotype panel. The method is also robust to modest rates of genotype error, such as that which could occur here as a result of aDNA damage or contamination, although we recommend some caution in interpreting the results for I2966 (Hora 1) and I0589 (Kuumbi Cave; for this analysis only, we used the version of the published data with UDG-minus libraries included, for a total of around 2× average coverage). We also note that the nature of any possible effect on the final inferences is uncertain; errors could deflate the population size estimates by breaking up ROH, but they could also break very long ROH into shorter but still long blocks, which have the strongest influence on the population size estimates. In the absence of population-level data from related groups, we specified a single default allele frequency (‘–AF-dflt 0.4’) and no genetic map (although we subsequently converted physical positions to genetic distances using ref. 75, which we expect to be reasonably accurate at the length scales that we are interested in). For our analyses, we retained ROH blocks with length >4 cM. In three instances, we merged blocks with a gap of <0.5 cM and at most two apparent heterozygous sites between them.

From the ROH results, we applied the maximum likelihood approach from ref. 23 to estimate recent ancestral effective population sizes (Ne). We used all ROH blocks of longer than 4 cM, except for three individuals (KPL001 from Kakapel in Kenya, I9028 from St Helena, South Africa, and I9133 from Faraoskop, South Africa) with high proportions of very long ROH (a sign of familial relatedness between parents—approximately at the first-cousin level in these cases—rather than of longer-term low population size), for whom we used only blocks from 4–8 cM.

We note that, even within a randomly mating population, the number and extent of ROH can vary substantially between individuals, which is reflected in the large standard errors of the Ne estimates for small sample sizes. We also note that recent admixture can influence ROH (and therefore Ne estimates) by making coalescence between an individual’s two chromosomes less likely, but on the basis of the other results of our study, we do not expect a substantial effect for these individuals.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Read original article here

Leave a Comment