What is YFull's age estimation methodology?

Q: What is YFull's age estimation methodology?

A: YFull uses a methodology based on the research and analysis discussed in "Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data" by Adamov, Guryanov, Korzhavin, Tagankin, Urasin (2015).

The methodology is reflected in the Age Estimation table for each analyzed raw data file or VCF file and in the subclade age "info" pop-up tables linked to the YTree. If there is more than one raw data file for the same person, YFull will use the file with the best coverage for the age estimate and will not use the other file or files.

The first step is to select and count reliable derived Known and Novel SNPs for a raw data file. The number of counted SNPs appears in both tables.

The following five criteria are used to select reliable SNPs:

1. The coordinates of the SNPs must fall within the combBED regions, which are designed to select X-degenerate segments. The combBED area borders were formed by mutually overlapping the BED files taken from the work of Poznik et al. (2013) (total length of 10.45 Mbp) with the generalized BigY BED file (11.38 Mbp long) published in the BigY White Paper (2014). The result was 857 continuous segments of the Y-chromosome with a total length of 8,473,821 base pairs.

2. Insertions and deletions (called "Indels") are excluded, as are multiple nucleotide polymorphisms (variants that span more than one base position).

3. Variants detected in more than five different "localizations" are excluded. "Localization" means a group of samples from the YFull database that belong to the same subclade and carry the derived allele. In some cases, the same derived variants may be found in different subclades or different haplogroups because of mapping errors or because the standard reference sequence is based mainly on haplogroup R1b data and to a lesser extent on haplogroup G data. This causes some variants in some haplogroups to be ancestral instead of derived. YFull established the "five different localizations" criterion empirically; it is a soft threshold but is believed to be effective.

4. SNPs with only one or two "reads" are excluded.

5. SNPs are excluded if the "read quality" is less than 90%. Quality is determined pursuant to YFull's proprietary SNP rating system. See the FAQ "How does YFull determine the quality ratings for my Known SNPs and for my Novel SNPs?"

The Age Estimation table for each sample provides a high level of detail about the application of the selection criteria. Reliable Known and Novel SNPs are listed in the "+Known SNPS" and "+Novels" columns of the table, and SNPs not selected are listed in the "x Known SNPs" and "x Novels" columns, with details related to the five criteria.
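The five selection criteria can be sketched in code as follows. This is only an illustration, not YFull's implementation: the `Variant` record, the function names, and the toy coordinates are all hypothetical, and region bounds are treated as inclusive for simplicity (real BED files use a different coordinate convention). The `intersect_regions` function shows one way that two sets of regions could be overlapped to obtain combBED-style segments.

```python
from bisect import bisect_right
from dataclasses import dataclass

# Thresholds taken from criteria 3 to 5 above.
MAX_LOCALIZATIONS = 5   # more than five localizations is excluded
MIN_READS = 3           # one or two reads is excluded
MIN_QUALITY = 90.0      # read quality below 90% is excluded


@dataclass
class Variant:
    # Hypothetical record layout, not YFull's actual data model.
    position: int        # Y-chromosome coordinate
    ref: str             # reference allele
    alt: str             # derived allele
    localizations: int   # number of localizations in which the variant is seen
    reads: int           # number of reads supporting the call
    quality: float       # read quality, in percent


def intersect_regions(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end) regions."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start <= end:
            result.append((start, end))
        # Advance whichever region ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def in_regions(position, regions):
    """Return True if position lies inside one of the sorted regions."""
    starts = [start for start, _ in regions]
    i = bisect_right(starts, position) - 1
    return i >= 0 and regions[i][0] <= position <= regions[i][1]


def rejection_reasons(variant, regions):
    """Return the numbers of the criteria a variant fails (empty = reliable)."""
    failed = []
    if not in_regions(variant.position, regions):
        failed.append(1)
    if len(variant.ref) != 1 or len(variant.alt) != 1:
        failed.append(2)
    if variant.localizations > MAX_LOCALIZATIONS:
        failed.append(3)
    if variant.reads < MIN_READS:
        failed.append(4)
    if variant.quality < MIN_QUALITY:
        failed.append(5)
    return failed


# Toy coordinates, chosen only to exercise the functions.
regions = intersect_regions([(100, 300), (500, 900)], [(50, 200), (250, 600)])
good = Variant(position=150, ref="A", alt="G", localizations=1, reads=12, quality=99.0)
bad = Variant(position=400, ref="AT", alt="A", localizations=7, reads=2, quality=80.0)
print(rejection_reasons(good, regions))  # no criteria failed
print(rejection_reasons(bad, regions))   # fails all five criteria
```

Returning the list of failed criteria, rather than a simple yes or no, mirrors the way the "x Known SNPs" and "x Novels" columns give details related to the five criteria.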

The second step of the sample age determination methodology is explained in the YTree "info" pop-up tables for the YTree subclades. For each sample in a table, two formulas are applied to the number of SNPs for the sample. The first formula corrects the SNP count to an assumed (or corrected) count for the combBED bp coverage area, and the second formula establishes the age of a sample based on the corrected count. The second formula uses an assumed mutation rate of one SNP per 144.41 years (0.8178 × 10^-9 per base pair per year), which is the average of the mutation rates of the ancient Anzick-1 sample and of a group of known genealogies, and an assumed age of 60 years for living providers of YFull samples.
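The two formulas can be sketched as below. The text above does not spell out their exact form, so this sketch makes two assumptions: that the correction is a proportional scaling of the SNP count from the sample's covered base pairs up to the full combBED length, and that the 60 years are added to the product of the corrected count and the years-per-SNP figure. All function and constant names are hypothetical.

```python
COMBBED_LENGTH_BP = 8_473_821     # total combBED length quoted in criterion 1
YEARS_PER_SNP = 144.41            # assumed mutation rate, in years per SNP
RATE_PER_BP_PER_YEAR = 0.8178e-9  # the same rate, per base pair per year
ASSUMED_SAMPLE_AGE_YEARS = 60     # assumed age of living sample providers


def corrected_snp_count(snp_count, covered_bp, combbed_bp=COMBBED_LENGTH_BP):
    """Scale the counted SNPs to the full combBED area (assumed proportional)."""
    return snp_count / covered_bp * combbed_bp


def sample_age_years(corrected_count):
    """Convert a corrected SNP count into an age in years (assumed form)."""
    return corrected_count * YEARS_PER_SNP + ASSUMED_SAMPLE_AGE_YEARS


def years_per_snp(rate_per_bp_per_year, region_bp):
    """Years per SNP implied by a per-base-pair rate over a region."""
    return 1.0 / (rate_per_bp_per_year * region_bp)


# Hypothetical sample: 30 reliable SNPs counted over 8,000,000 covered bp.
corrected = corrected_snp_count(30, 8_000_000)
print(round(corrected, 2), round(sample_age_years(corrected)))

# Relation between the two ways of expressing the rate.
print(round(years_per_snp(RATE_PER_BP_PER_YEAR, COMBBED_LENGTH_BP), 2))
```

With the 8,473,821 bp length quoted above, the per-base-pair rate works out to roughly 144.3 years per SNP, close to but not exactly 144.41, so the length used in YFull's own calculation may differ slightly from the one assumed here.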

Last updated on February 20, 2021.