
Genetic genealogy for cold case and active investigations


Author Manuscript, published in final form as:
Greytak EM, Moore C, & Armentrout SL (2019). Genetic genealogy for cold case and active
investigations. Forensic Science International, 299, 103113. doi: 10.1016/j.forsciint.2019.03.039
Ellen M. Greytak, CeCe Moore, Steven L. Armentrout
Parabon NanoLabs, Inc., 11260 Roger Bacon Dr. Suite 406, Reston, VA, 20190, USA
Genetic genealogy is helping to close both cold and active investigations
Forensic DNA is uploaded to public genetic genealogy databases to find relatives
Extensive genealogy and descendancy research generate a list of possible identities
Many complicating factors can impede the research
Identity is narrowed down using a wide range of information & confirmed using STRs
Investigative genetic genealogy has rapidly emerged as a highly effective tool for using DNA to
determine the identity of unknown individuals (unidentified remains or perpetrators), generating
identifications in dozens of law enforcement cases, both cold and active. The amount of press
coverage of these cases may have given the impression that the analysis is straightforward and
the outcome guaranteed once a sample is uploaded to a database. However, the database
query results serve only as clues from which in-depth genealogy and descendancy research
must proceed to determine the possible identities of an unknown individual. While there
certainly will be more announcements of cases solved using this new technique, there are many
more cases where identification has not yet been possible due to the wide variety of
complications present in these investigations. This paper lays out the fundamentals of genetic
genealogy, along with the challenges that are encountered in many of these investigations, and
concludes with a set of case studies that demonstrate the variety of cases encountered thus far.
Keywords: Genetic genealogy; Forensic genetics; DNA; SNPs; Cold cases; Human
Traditional genealogy has been practiced for centuries, using documentary records and oral
histories to trace families backwards in time. Until recently, these were the only ways to
connect extended family members, but with the advent of direct-to-consumer (DTC) genetic
testing, it is now possible to find relatives through shared DNA. This has enabled thousands of
individuals who have lost their biological identity through adoption, abandonment, anonymous
gamete donation, misattributed parentage, etc., to regain their genetic heritage. More recently,
these same tools have been used to identify DNA from suspected perpetrators in more than
thirty law enforcement cases, only some of which have been publicly announced (Table 1).
Table 1: Cases for which law enforcement agencies have announced identification of DNA from a
suspected perpetrator with the aid of genetic genealogy (through 1/31/19). * Deceased; ** Pled guilty
Location | Crime | Year(s) | Identified As | Date Announced | Genealogy By
 | Multiple Homicides and Sexual Assaults - “Golden State Killer” | 1974 - | Joseph James | April 24, |
County, WA | Double Homicide of Jay Cook (20) and Tanya Van Cuylenborg (18) | | William Earl Talbott II | May 21, 2018 |
Tacoma, WA | Homicide of Michella Welch (12) | | Gary Charles | June 20, |
Lancaster, PA | Homicide of Christy Mirack (25) | | Raymond Charles | June 25, |
Brazos County, | Homicide of Virginia Freeman (40) | | James Otto | June 25, |
Fort Wayne, IN | Homicide of April Tinsley (8) | | John Dale Miller** | July 15, 2018 |
Woonsocket, RI | Homicide of Constance Gauthier | | Matthew Norman | July 18, 2018 |
St. George, UT | Sexual Assault of Carla Brooks | | Spencer Glen | July 28, 2018 |
Fayetteville, NC | Multiple Sexual Assaults - “Ramsey Street Rapist” | 2006 - | Darold Wayne | August 22, |
County, IL | Homicide of Holly Cassano (22) | | Michael F. A. | August 29, |
County, MD | Multiple Sexual Assaults | 2007 - | Marlon Michael | 14, 2018 |
Sarasota, FL | Homicide of Deborah Dalzell (47) | | Luke Edward | 19, 2018 | Parabon and
 | Multiple Sexual Assaults - “NorCal | 1991 - | Roy Waller | 21, 2018 |
Greenville, SC; Memphis, TN; Portageville, MO | Multiple Homicides and Sexual | 1990 - | Robert Eugene | October 5, |
Starkville, MS | Double Homicide of Betty Jones (65) and Kathryn Crigler (81) | | Michael W. | October 8, |
Greenbrier, AR | Homicide of Pam Felkins (32) | | Edward Keith | October 29, |
Fulton County, | Homicide of Lorrie Ann Smith (28) | | Jerry Lee | November 1, |
Anne Arundel County, MD | Homicide of Michael Temple (29) | | Fred Lee Frampton, Jr. | November 2, |
Orlando, FL | Homicide of Christine Franke (25) | | Benjamin L. | November 5, | Parabon & Florida Dept. of Law
Carlsbad, CA | Homicide of Jodine Serrin (39) | | David Mabrito* | 13, 2018 | Parabon and
Santa Clara, CA | Homicide of Leslie Marie Perlov | | John Arthur | 21, 2018 |
College Station, | Multiple Sexual Assaults | | Christopher Quinn | 12, 2018 |
Cedar Rapids, IA | Homicide of Michelle Martinko | | Jerry Lynn Burns | 19, 2018 |
County, FL | Sexual Assault of Unnamed Victim | | William L. Nichols* | January 10, |
Orange County, | Sexual Assaults of Two Unnamed Victims (9 and 31) | 1995 & | Kevin Konther | January 11, |
La Mesa, CA | Homicide of Scott Martinez (47) | | Zachary Aaron | January 24, |
Fremont, CA | Homicide of Jack Upton (30) | | Russell Guerrero | January 24, |
Portland, OR | Homicide of Anna Marie Hlavka | | Jerry Walter | January 31, |
Generating Data
Unlike traditional forensic DNA analysis, which uses autosomal short tandem repeats (STRs) to
generate an identity profile from ~20 loci, genetic genealogy uses hundreds of thousands of
single nucleotide polymorphisms (SNPs) spread across the autosome. Participants in genetic
genealogy have had their DNA tested by a direct-to-consumer (DTC) genetic testing company,
such as 23andMe or AncestryDNA, which use microarrays to genotype up to ~1 million SNPs.
DTC companies obtain DNA from spit kits or cheek swabs and thus always have a large amount
of high-quality single-source DNA to work with. Forensic DNA samples, on the other hand,
often only have a small amount of degraded DNA, which may be mixed with DNA from one or
more other individuals. Microarray genotyping has previously been shown to be effective and
accurate with forensic samples (Keating et al., 2013), and Parabon has used it for casework
since 2015, generating high genotyping call rates from forensic samples down to 1 ng of DNA
(Table 2). Parabon has also found it is possible to accurately deconvolute microarray data from
two-person mixtures, as long as the person-of-interest is at least 40% of the mixture and a
single-source reference sample from the second contributor is available.
Table 2: Summary of Parabon’s >250 forensic DNA samples used in genetic genealogy casework and
the resulting microarray genotyping call rates.
Sample type: Single Source, Low Mixture, High Mixture
DNA quantity: 2.5 ng, 2.5-5 ng, 5-10 ng, 10-20 ng, 20-40 ng, 40-80 ng, >80 ng
Call Rate: > 95%
Parabon’s casework currently uses the Illumina CytoSNP-850K array, an off-the-shelf chip that
contains >98% of the SNPs on the OmniExpress chip used by AncestryDNA, FamilyTreeDNA,
and MyHeritage. 23andMe previously also based their chip on the OmniExpress but has since
moved to smaller custom chips that overlap less with the other DTC companies. For law
enforcement cases, extracted DNA samples are processed at a CLIA-certified lab, and the data
is uploaded securely to Parabon.
Determining Relatedness from DNA
Given enough SNPs, it is possible to determine the degree of relatedness between two people,
which is defined by the expected amount of shared DNA, not the number of meioses (Figure 1).
Figure 1: Pedigree showing the degrees of relatedness, as defined by the expected amount of shared
DNA. Each relationship is defined with respect to the red “self / twin” box.
While several relationship inference methods had previously been proposed (Huff et al., 2011;
Manichaikul et al., 2010), 23andMe was the first DTC company to introduce an accurate,
scalable approach to inferring approximately how closely related two DNA samples are from
autosomal SNPs (Henn et al., 2012). Each person has two copies of each of the 22 autosomal
chromosomes (“autosomes”), one inherited from their mother and one inherited from their
father. Autosomes are not inherited intact from each parent; rather, each parent’s own pair of
chromosomes is randomly recombined into a new chromosome that is passed onto the child.
While recombination occurs randomly, nucleotides that are closer to one another on a
chromosome are more likely to be inherited together, while nucleotides that are far apart are
more likely to be separated by recombination. The probability of recombination between two
nucleotides is quantified as their genetic distance, which is measured in centimorgans (cM),
such that 1 cM equates to a 1% probability of recombination.
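As a rough illustration of the centimorgan scale, the sketch below converts a genetic distance into a per-meiosis recombination probability. It uses Haldane's map function, which is an assumption introduced here (the text does not name a map function) and which ignores crossover interference. For short distances it reduces to the 1 cM ≈ 1% rule stated above, while for long distances the probability levels off at 50%.

```python
import math


def recombination_probability(distance_cm: float) -> float:
    """Probability of recombination between two loci in one meiosis.

    Uses Haldane's map function (assumed here, not taken from the text):
    r = 0.5 * (1 - exp(-2d)), with d expressed in Morgans.
    """
    if distance_cm < 0:
        raise ValueError("genetic distance must be non-negative")
    morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * morgans))


if __name__ == "__main__":
    for d in (1, 10, 50, 200):
        print(f"{d:>4} cM -> {recombination_probability(d):.4f}")
```

The choice of map function only matters for loci that are far apart; at the 5-7 cM segment scale discussed later, the linear approximation is close.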
Rather than simply looking at the total number of shared SNPs, genetic genealogy takes
advantage of the fact that recombination will break up long stretches of shared DNA over the
generations, such that more closely related people will share longer stretches of DNA
(“segments”) that are identical-by-descent (IBD) (Figure 2). The more recombination events
that have occurred, the shorter the shared IBD segments will be, so the number and length of
IBD segments in cM can be used to approximate the degree of relatedness.
Figure 2: Inheritance of DNA segments on a single chromosome. The lengths of the shared segments
(shaded boxes) are summed across all 22 autosomes to give the total amount of shared DNA.
To detect IBD segments, genetic genealogy algorithms search for regions of the genome where
two individuals share at least one allele at every SNP. To be counted, these segments must
contain a minimum number of SNPs (typically ~500) and be over a certain length (typically 5-7
cM), which screens out most segments that are shared by chance rather than due to common
descent. When summed across all autosomes, the amount of DNA shared IBD strongly
correlates with the degree of relatedness between two individuals, such that more distant
relatives tend to share less DNA (Table 3). However, due to the random nature of
recombination, the amount of shared DNA can vary greatly for relatives of the same degree,
and this variation increases with more recombination events, such that ~10% of third cousins
and ~50% of fourth cousins share no detectable IBD segments.
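The segment search described above can be sketched as a scan for runs of SNPs at which two unphased genotypes share at least one allele, keeping only runs that meet the SNP-count and length thresholds. This is a simplified illustration under stated assumptions, not the algorithm used by any particular database: the function and parameter names are hypothetical, genotypes are assumed to be complete (no missing calls), and no allowance is made for genotyping errors, which production tools typically tolerate.

```python
from typing import List, Sequence, Tuple

Genotype = Tuple[str, str]  # unphased pair of alleles at one SNP


def shares_allele(a: Genotype, b: Genotype) -> bool:
    """True if the two genotypes have at least one allele in common."""
    return bool(set(a) & set(b))


def find_candidate_segments(
    geno_a: Sequence[Genotype],
    geno_b: Sequence[Genotype],
    pos_cm: Sequence[float],
    min_snps: int = 500,   # "typically ~500" in the text
    min_cm: float = 7.0,   # "typically 5-7 cM" in the text
) -> List[Tuple[int, int, float]]:
    """Return (start_index, end_index, length_cM) runs on one chromosome."""
    if not (len(geno_a) == len(geno_b) == len(pos_cm)):
        raise ValueError("inputs must have the same length")
    segments: List[Tuple[int, int, float]] = []
    start = None
    n = len(pos_cm)
    for i in range(n + 1):  # one extra step closes a run ending at the last SNP
        compatible = i < n and shares_allele(geno_a[i], geno_b[i])
        if compatible and start is None:
            start = i
        elif not compatible and start is not None:
            end = i - 1
            length = pos_cm[end] - pos_cm[start]
            if (end - start + 1) >= min_snps and length >= min_cm:
                segments.append((start, end, length))
            start = None
    return segments
```

Summing the returned lengths over all 22 autosomes gives the total shared cM that is compared against Table 3.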
Table 3: The range of DNA shared by pairs of people with each relationship. While most pairs from a
given relationship fall within a narrower range, these values represent the full ranges that have been
observed (Ball et al., 2016).
cM Range
Full Sibling
Half-Sibling, Avuncular, Double First Cousin, Grandparent / Grandchild
First Cousin (1C), Half-Avuncular, Great-Grandparent / Great-Grandchild, Great-
First Cousin Once-Removed (1C1R), Half-First Cousin (½ 1C), Half-Great-Aunt/Uncle / Half-Great-Niece/Nephew
Second Cousin (2C), First Cousin Twice-Removed (1C2R), Half-First Cousin Once-Removed (½ 1C1R)
Second Cousin Once-Removed (2C1R), Half-Second Cousin (½ 2C), First Cousin Thrice-Removed (1C3R), Half-First Cousin Twice-Removed (½ 1C2R)
Third Cousin (3C), Second Cousin Twice-Removed (2C2R)
Third Cousin Once-Removed (3C1R), Distant Cousins
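The groupings in Table 3 follow from the expected fraction of shared DNA, which halves with each additional degree. The sketch below is one way to reproduce them for collateral relatives (siblings, aunts/uncles, and cousins). The formula and function names are illustrative assumptions rather than anything stated in the text: an nth cousin m times removed is treated as degree 2n + 1 + m, with one more degree added for a half relationship. Lineal relationships (e.g., grandparent) and double cousins are not covered.

```python
def collateral_degree(cousin_level: int, times_removed: int = 0, half: bool = False) -> int:
    """Degree of relatedness for an nth cousin, m times removed.

    cousin_level=0 denotes the sibling line (0 removed = sibling,
    1 removed = aunt/uncle, 2 removed = great-aunt/uncle).
    """
    if cousin_level < 0 or times_removed < 0:
        raise ValueError("cousin_level and times_removed must be non-negative")
    return 2 * cousin_level + 1 + times_removed + (1 if half else 0)


def expected_shared_fraction(degree: int) -> float:
    """Expected fraction of autosomal DNA shared at a given degree."""
    return 0.5 ** degree


if __name__ == "__main__":
    examples = {
        "2C": collateral_degree(2),
        "1C2R": collateral_degree(1, 2),
        "half 1C1R": collateral_degree(1, 1, half=True),
        "3C": collateral_degree(3),
    }
    for name, degree in examples.items():
        print(f"{name}: degree {degree}, expected {expected_shared_fraction(degree):.3%}")
```

Second cousins, first cousins twice-removed, and half-first cousins once-removed all come out as the same degree, which is why they share a row in Table 3 and cannot be told apart by total shared DNA alone.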
Genetic Genealogy Databases and Genetic Privacy
DTC genetic testing companies’ private databases have exploded in size, with AncestryDNA
currently containing nearly 15 million individuals, 23andMe containing nearly 10 million, and
MyHeritage and FamilyTreeDNA (FTDNA) together containing roughly 3.5 million (Regalado,
2019). AncestryDNA and 23andMe maintain their databases separately, and these are not accessible
to law enforcement, as the only way to submit a sample is via a cheek swab or spit kit.
MyHeritage and FTDNA both allow uploads of data generated from other sources, but law
enforcement usage of either requires written permission from the company, as well as a court
order for MyHeritage or “the required legal documentation” for FTDNA.
GEDmatch, on the other hand, is not a DTC company. It was created by Curtis Rogers and
John Olson in 2010 as a public database where individuals from different testing companies
could compare their DNA by downloading their raw data from a DTC company’s site and
uploading it to a common database. After the Golden State Killer suspect was identified through
surreptitious use of GEDmatch, the site’s administrators decided to explicitly allow law
enforcement usage. They posted a notice on the front page of the site (Figure 3) and also
updated their Terms of Service to state that law enforcement can use, and is using, GEDmatch to
identify remains and perpetrators of violent crimes, defined as homicides or sexual assaults.
Both new and existing users were required to view these new Terms and
decide whether to accept them before using the site. Critics of genetic genealogy argue that
many people who joined the site prior to this update may not have considered the possibility that
their desire to locate relatives could lead to the discovery that they are related to someone
whose DNA is associated with a crime and to the apprehension of that relative. Indeed, it is
possible some of them still may be unaware of the new warning, and individuals who had their
data uploaded by another individual or have been inactive on the site may not have reviewed
the new Terms to decide whether to consent. However, even prior to implementing these new
Terms, GEDmatch’s Terms clearly stated that any data set to “public” would be searchable by
anyone. The law has generally allowed information made available to the public to be used in
criminal investigations. Users can easily have their data set to “private,” hiding it from all search
queries, or removed entirely. Thus, the DNA data files in a public database like GEDmatch
come from individuals who have proactively downloaded their data from a private DNA testing
company’s website, uploaded the information to a public website, reviewed the Terms of
Service that permit law enforcement usage, and opted in to public comparisons against their data.
Figure 3: Notice posted on GEDmatch’s homepage after the site’s use in the Golden State Killer
investigation was made public.
Additionally, no sensitive genetic information is disclosed to law enforcement during a genetic
genealogy search, as the raw genetic data from GEDmatch users is not accessible. Raw
genetic data can contain sensitive health-related information, and this type of private genetic
information should be protected. In keeping with this precept, no raw genotypes are displayed
or made available for download by GEDmatch. GEDmatch simply performs comparisons
among samples, returning the lengths and chromosomal locations of shared DNA segments,
which are used to determine the approximate relationship between individuals. Similarly, data
obtained from abandoned DNA at a crime scene and used for genetic genealogy are not
exposed to other users and can be prevented from appearing in search results (an option
available to all users). At Parabon, genetic data is kept on an encrypted server only accessible
to authorized employees, and the company’s GEDmatch accounts can only be accessed by the
bioinformatics team and the lead genetic genealogist, CeCe Moore. These facts mitigate many
of the privacy concerns surrounding genetic genealogy, as individuals have control over
whether their data is used as part of law enforcement investigations, and sensitive raw data is
not accessed (Greytak et al., 2018).
Unlike with familial searching of law enforcement databases, no one is legally required to
contribute to a genetic genealogy database, and the samples are not in the possession of
government agencies. The persons contributing to GEDmatch are warned explicitly that
criminal investigators as well as fellow genealogy enthusiasts are able to perform comparisons
against their data. If they choose to participate anyway, there is no reason why law
enforcement should not be able to use this information. These significant differences from
familial searching argue against automatically applying familial search policies, such as
restricting analysis to the end of an investigation, to genetic genealogy. The two techniques are
entirely independent; familial searching has previously been used in some genetic genealogy
cases and not in others. The public is strongly in favor of the use of genetic genealogy to
investigate violent crimes: GEDmatch saw a significant increase in the number of participants
after the Golden State Killer arrest (Milian, 2018), and a recent survey showed overwhelming
public support (Guerrini et al., 2018).
Database Searching
A GEDmatch one-to-many query compares the DNA of interest to all public data in the
database, returning a list of individuals who share the most autosomal DNA. Each “match”
includes the individual’s name or alias, the email address associated with their GEDmatch
account, and any haplogroup or family tree information they have chosen to share (Figure 4).
Figure 4: Top five results from a GEDmatch one-to-many comparison, with potentially identifying
information (kit numbers, names, and email addresses) removed.
A one-to-one comparison can then be run on each match using a more precise algorithm to see
the lengths and chromosomal locations of the shared segments. Comparing the amount of
shared DNA to reference data (e.g., Bettinger & Perl, 2018) gives the probability that the
relationship between the unknown individual and the match falls into each degree of
relatedness. For example, a match sharing 100 cM could be anywhere from 5th degree to >8th
degree, with 6th degree being most likely.
However, there are additional complications. First, in addition to multiple possible degrees of
relatedness, each degree contains many relationship types that must be considered (e.g., 5th
degree relatives around the same age could be second cousins, first cousins twice-removed, or
half-first cousins once-removed). Second, the amount of DNA shared by each relationship
varies among populations. Populations founded by a small number of individuals can have low
genetic diversity and high background relatedness, or endogamy. In such populations,
individuals with a given relationship will share significantly more DNA than in other populations,
such that even very distant cousins can share significant amounts of DNA. Endogamy
manifests as a large number of matches, each sharing many small segments, indicating that the
segments were actually inherited from distant ancestors (ISOGG, 2019). Another challenge is
pedigree collapse, in which the same families intermarry multiple times throughout history,
which can inflate the amount of shared DNA between their descendants.
Casework Match Results
More than 80% of samples from Parabon’s >250 law enforcement cases have resulted in a
match at the third cousin level or closer (>60 cM), with subjects of European descent having a
higher probability of success due to their overrepresentation in genetic genealogy databases
(Greytak & Moore, 2018) (Figure 5A). European descent was assessed by Snapshot DNA
Phenotyping, which infers an individual’s genetic admixture from seven continental populations
(African, Middle Eastern, European, Central/South Asian, East Asian, Oceanian, and Native
American). In this analysis, samples were considered “European” if they had at least 80%
European ancestry. Note that the law enforcement cases submitted to Parabon are primarily
from North American agencies, and samples from other regions will likely have lower match
probabilities due to lower participation in DTC genetic testing and use of GEDmatch.
The closeness of the top match is not the sole variable in determining viability for genetic
genealogy. A comprehensive assessment must include consideration not only of the closest
match, but of the quality of the supporting matches and the amount of information available
about each match. For example, progress may be difficult if the top match has unknown
parentage and/or is from a country where records are not available. Parabon assesses each
sample on a subjective scale: 1) very high probability of identification (e.g., parent-child match),
2) high probability of identification, 3) medium probability of identification, 4) low probability of
identification but likely to generate actionable information, and 5) unlikely to generate actionable
information. An assessment does not guarantee a particular outcome but is intended to help
agencies to decide how to proceed. Thus far, ~80% of European samples and ~60% of non-
European samples have been assessed as workable (assessments 1-4) (Figure 5B).
Figure 5: For Parabon’s >250 law enforcement samples, the frequency of A) the top GEDmatch one-to-
many match being in each degree of relatedness and B) samples receiving each assessment level.
Results are reported for European, non-European, and all samples, as well as for those cases that have
been solved (i.e., resulted in an identification) thus far. Degree of relatedness is based solely on the
amount of shared DNA, not the true relationship determined through genealogy: Parent-Child (>3300 cM),
Full Siblings (2200-3300), 2nd Degree (1300-2200), 3rd Degree (650-1300), 4th Degree (340-650), 5th
Degree (200-340), 6th Degree (90-200), 7th Degree (60-90), 8th Degree (30-60), >8th Degree (<30).
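The cM thresholds in the Figure 5 caption can be expressed as a simple lookup. The sketch below bins a total shared-cM value into the caption's categories; the function name is hypothetical, and the handling of values that fall exactly on a boundary (assigned to the closer relationship, except at 3300 cM) is an assumption, since the caption does not specify it. This gives only the bin based on shared DNA, not the per-degree probabilities used in casework.

```python
# Lower bounds (cM) and labels taken from the Figure 5 caption.
FIG5_BINS = [
    (2200.0, "Full Siblings"),
    (1300.0, "2nd Degree"),
    (650.0, "3rd Degree"),
    (340.0, "4th Degree"),
    (200.0, "5th Degree"),
    (90.0, "6th Degree"),
    (60.0, "7th Degree"),
    (30.0, "8th Degree"),
]


def degree_bin(shared_cm: float) -> str:
    """Map total shared autosomal cM to a Figure 5 relatedness bin."""
    if shared_cm < 0:
        raise ValueError("shared cM must be non-negative")
    if shared_cm > 3300.0:
        return "Parent-Child"
    for lower, label in FIG5_BINS:
        if shared_cm >= lower:  # boundary handling is an assumption
            return label
    return ">8th Degree"


if __name__ == "__main__":
    for cm in (3400, 1700, 100, 61, 10):
        print(f"{cm} cM -> {degree_bin(cm)}")
```

A 100 cM match lands in the 6th-degree bin, consistent with the earlier statement that 6th degree is the most likely relationship at that level of sharing.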
Importantly, just because a sample does not have sufficient promising match data today does
not mean it never will. Hundreds of new individuals upload their data to GEDmatch every day
(Milian, 2018), and as the database grows, the proportion of samples with close matches will
increase. Thus, Parabon monitors all unsolved cases for new matches on a weekly basis.
Genealogy Research
While most of the discussion surrounding genetic genealogy focuses on the database matches,
the vast majority of genetic genealogy work happens after the match list is generated. Many US
records are available to the public and have been compiled into searchable databases
accessible via subscription. For example, such services provide a mechanism for accessing a
large collection of records, such as the census through 1940, vital records (birth, marriage,
death) from many states, and the Social Security Death Index. Some users also create and share public family trees, although these can contain
errors, so they must be examined critically. People search databases and public social media
can also be used to help determine family structures. In some cases, law enforcement may be
asked to assist with this research using their greater access to records.
A previous analysis of the MyHeritage DTC database showed that ~60% of individuals of
Northern European descent will have a match at 100 cM or closer (Erlich, Shor, Pe, & Carmi,
2018). Using simulation, the authors showed that it is often possible to identify an unknown
individual from a single third cousin level match given knowledge of his or her sex, location
within 100 miles, and age within 5 years. However, in addition to the fact that such detailed
demographic information is often not available in law enforcement cases, this assumes that,
given a third cousin match, it is straightforward to obtain a complete list of the match’s relatives
at that distance (the authors determined this number to be ~850, not including half relatives). In
reality, a massive amount of work is required to expand a match into a list of relatives (Greytak
et al., 2018).
The first task is to definitively identify each match, which itself can be quite difficult. Although
GEDmatch displays the name and email address associated with each matching kit, users can
choose to use an alias or an anonymous email address, and kits are sometimes managed by
someone other than the match themselves. Moreover, even if a user associates their actual
name, it may be common (e.g., John Smith), which can complicate identification. Consequently,
the initial identification of matches is both critical and challenging, and often requires
considerable genetic genealogical skill and creative problem solving, e.g., deciphering initials,
inferring identities from other identifiable matches, and determining whose DNA a kit contains when the
kit is managed by someone else. Even though contacting matches via the given email address
might enable identification and even produce family tree information, Parabon seldom contacts
matches directly so as to minimize the number of people involved in an investigation and reduce
the risk of tipping off a suspect. Matches closer than third cousins are only contacted with the
permission of the investigating agency, and the agency can choose to make the contact instead.
Any contact includes the fact that the questions are in regard to a law enforcement investigation
(no specifics of the case are given), and the individual is informed they are free to participate or
not. If the individual asks not to be involved, they are not contacted again.
Once the matches are identified, their family trees must be constructed back to the set of
possible common ancestors with the unknown individual. The number of generations back in
time to the common ancestors of interest is determined by the distance of the matches’
relationships, although since the estimates are not usually specific to a single relationship, often
the family trees must be built even further back than these levels would imply. Building family
trees back in time requires traditional genealogy research: combing through public records to
determine the identities of each generation’s parents.
However, records are not always available - not all US states maintain an accurate and public
birth index, many families trace back to immigrants from other countries where records are not
readily available, etc. In addition, biological family trees often do not match documented family
trees due to misattributed paternity, unrecorded adoption, unknown parentage, etc., and
individuals in these situations are overrepresented in genetic genealogy databases. Surnames
and spellings also often change through the generations, further complicating the analysis.
Descendancy Research
Once possible common ancestors have been identified, the family trees must then be built
forward in time (“descendancy research” or “reverse genealogy”) to elucidate the possible
identities of the unknown individual (Figure 6).
Figure 6: A hypothetical family tree resulting from genetic genealogy research. Given a match in
GEDmatch (orange star), the family tree is built backward in time to the possible common ancestors
(orange) and then forward in time (blue) to determine the possible identities of the unknown individual (in
this case, from among the “second cousins”).
The possible ancestors from which the unknown individual descends can sometimes be
narrowed using genomic ancestry (e.g., if the family tree is Northern European, but the unknown
individual has 25% ancestry from another population, the genetic genealogist can search
among the possible grandparents for one who married someone from that ancestral group).
Shared DNA on the X-chromosome can also narrow down the possible paths between matches,
as males only inherit X-DNA from their mothers. Thus, if an unknown male shares X-DNA with
a match, they must be related through his mother, and the path between them cannot pass
through two males in a row. When available, Y-chromosome and mitochondrial (mtDNA)
haplogroups can also narrow down the possibilities, as these are passed directly from father to
son and from mother to child, respectively. Thus, individuals share a mtDNA haplogroup with
their maternal lineage, and males share a Y haplogroup with their paternal lineage.
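The X-chromosome rule above lends itself to a mechanical check. In the sketch below, a line of descent is written as a list of sexes from the common ancestor down to the individual, and X-DNA can travel along it only if no father-to-son step occurs. The function names and the list representation are illustrative assumptions.

```python
from typing import Sequence


def x_can_descend(sexes: Sequence[str]) -> bool:
    """Check whether X-DNA can pass down a line of descent.

    `sexes` lists each person from the ancestor to the descendant,
    using 'M' or 'F'. A father does not pass his X to a son, so any
    two consecutive males break the path.
    """
    if any(s not in ("M", "F") for s in sexes):
        raise ValueError("sexes must contain only 'M' or 'F'")
    return all(
        not (parent == "M" and child == "M")
        for parent, child in zip(sexes, sexes[1:])
    )


def x_match_possible(line_to_unknown: Sequence[str], line_to_match: Sequence[str]) -> bool:
    """Both lines start at the same common ancestor; both must carry X-DNA."""
    return x_can_descend(line_to_unknown) and x_can_descend(line_to_match)


if __name__ == "__main__":
    # Grandmother -> mother -> unknown male: X-DNA can be inherited.
    print(x_can_descend(["F", "F", "M"]))
    # Grandfather -> father -> unknown male: the path is excluded.
    print(x_can_descend(["M", "M", "M"]))
```

Applied to every candidate path in a tree, this prunes the branches through which a shared X segment could not have been inherited.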
DNA sharing among matches can also be used to narrow down where the unknown individual
falls in the tree. If matches do not share any DNA with one another, they are likely related to the
individual on different branches of his or her family tree, and the genetic genealogist can then
search for an intersection (“triangulation”) between the two matches’ families in the form of a
marriage that produced children or an out-of-wedlock birth (Figure 7). While there could be
hundreds or thousands of individuals who are second or third cousins to a single match, there
are typically only a few individuals who are cousins at the right distance to multiple matches.
Figure 7: Triangulation between two hypothetical family trees. Given two matches in GEDmatch who are
unrelated to one another (orange stars), family trees are built for each and then searched for an
intersection (green) in the form of a marriage or out-of-wedlock birth. Children of this intersection are
related to both matches, while all other individuals in the tree are only related to one match.
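Triangulation as described above amounts to intersecting the descendants of two sets of common ancestors. The following sketch assumes a toy representation in which the tree is a mapping from each child to a pair of parents; all identifiers are hypothetical placeholders, and real research trees are far less complete than this model implies.

```python
from typing import Dict, Iterable, Set, Tuple

Parents = Dict[str, Tuple[str, str]]  # child -> (parent, parent)


def descendants(parents: Parents, founders: Iterable[str]) -> Set[str]:
    """All people descended from any of the given founders."""
    found: Set[str] = set()
    frontier = set(founders)
    while frontier:
        next_frontier = {
            child
            for child, (p1, p2) in parents.items()
            if (p1 in frontier or p2 in frontier) and child not in found
        }
        found |= next_frontier
        frontier = next_frontier
    return found


def triangulate(parents: Parents, founders_a: Iterable[str], founders_b: Iterable[str]) -> Set[str]:
    """People descended from the common ancestors of both matches."""
    return descendants(parents, founders_a) & descendants(parents, founders_b)


if __name__ == "__main__":
    tree: Parents = {
        "a_child": ("A1", "A2"),
        "a_other": ("A1", "A2"),
        "b_child": ("B1", "B2"),
        "x1": ("a_child", "b_child"),   # children of the intersecting marriage
        "x2": ("a_child", "b_child"),
        "a_cousin": ("a_other", "spouse"),
    }
    print(sorted(triangulate(tree, {"A1", "A2"}, {"B1", "B2"})))
```

In this toy tree only the children of the intersecting couple remain, mirroring how a large pool of cousins to one match shrinks once a second, unrelated match is taken into account.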
Narrowing Down the Possible Identities
Once candidate individuals have been identified, the genetic genealogist can use a variety of
factors to include or exclude them, in addition to traditional investigative information, such as a
connection to the crime scene or the victim. Sex is known from the DNA, and some age
information may be available: for unidentified remains, age can be estimated; for perpetrators,
at minimum, they had to be alive and physically capable of committing the crime. The individual
also had to be in a given location at a given time, which may mean he or she lived nearby.
While the GEDmatch matches may be spread across the US or even the world, it is sometimes
possible to focus on a particular branch of the family that moved close to the location of interest.
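The basic inclusion and exclusion criteria above (sex, being alive, and a plausible age at the time of the crime) can be sketched as a filter over a candidate list. The record fields, function names, and the age bounds used here are hypothetical values chosen for illustration; they are not thresholds from the text, and location or other investigative information would be layered on top.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    label: str
    sex: str                      # 'M' or 'F'
    birth_year: int
    death_year: Optional[int] = None


def is_consistent(
    candidate: Candidate,
    sex: str,
    crime_year: int,
    min_age: int = 14,            # assumed bound, for illustration only
    max_age: int = 80,            # assumed bound, for illustration only
) -> bool:
    """True if the candidate matches the DNA-inferred sex and was alive
    and of a plausible age in the year of the crime."""
    if candidate.sex != sex:
        return False
    if candidate.death_year is not None and candidate.death_year < crime_year:
        return False
    age = crime_year - candidate.birth_year
    return min_age <= age <= max_age


def filter_candidates(candidates: List[Candidate], sex: str, crime_year: int) -> List[Candidate]:
    return [c for c in candidates if is_consistent(c, sex, crime_year)]


if __name__ == "__main__":
    pool = [
        Candidate("cousin_1", "M", 1950),
        Candidate("cousin_2", "F", 1952),
        Candidate("cousin_3", "M", 1975),
        Candidate("cousin_4", "M", 1940, death_year=1979),
    ]
    print([c.label for c in filter_candidates(pool, "M", 1985)])
```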
Parabon’s genetic genealogists also use Snapshot DNA Phenotyping (Greytak & Armentrout,
2015) to prioritize among individuals and confirm or exclude hypotheses. An individual’s eye
color, hair color, and skin color can often be determined from mugshots, yearbook photos, or
social media and compared to the predictions. Full siblings cannot be distinguished using
genetic genealogy, as they share all the same genealogical relationships with the matches.
However, if they differ in phenotype, this can be used to prioritize among them. Similarly, if
genealogy research leads to an individual whose phenotypes are at odds with the predictions,
this can spur continued research, while a close similarity can help corroborate an identification.
The degree to which the identity of the unknown individual can be narrowed down varies from
case to case. In the best-case scenario, a single individual or a set of siblings can confidently
be identified through matches to multiple branches of their family tree. More often, there are
multiple cousins (descendants of a particular set of common ancestors) who are consistent with
the available information. These leads can then be followed up through additional research,
traditional investigation, and/or targeted kinship testing of family members to more precisely
place the unknown individual in the family tree. Parabon’s Snapshot Kinship Inference tool uses
genome-wide SNP data to predict the precise degree of relatedness between individuals, out to
6th-degree relatives (Greytak et al., 2017). Using a machine learning model built on thousands
of reference subjects with known relationships, Snapshot predicts the probability that a pair
belongs to each degree of relatedness. Confidence is calculated using the probability of the
most likely degree and the precision calculated for that degree in cross-validation.
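As a rough illustration of the confidence calculation, the sketch below picks the most likely degree and combines its probability with the cross-validation precision for that degree. The text names both inputs but does not state how they are combined, so the product used here is an assumption made only for illustration, and all numbers are hypothetical.

```python
def kinship_call(degree_probs, cv_precision):
    """Pick the most likely degree of relatedness and attach a confidence score.

    ASSUMPTION: confidence = P(top degree) * cross-validation precision for
    that degree. The source names both inputs but not the formula.
    """
    top = max(degree_probs, key=degree_probs.get)
    return top, degree_probs[top] * cv_precision[top]


# Hypothetical model output for one pair of individuals
degree_probs = {2: 0.03, 3: 0.90, 4: 0.07}
# Hypothetical per-degree precision from cross-validation
cv_precision = {2: 0.99, 3: 0.95, 4: 0.90}

degree, confidence = kinship_call(degree_probs, cv_precision)
```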
Law Enforcement Leads
During decades-long cold case investigations, hundreds or thousands of individuals may be
investigated before the perpetrator is found. Genetic genealogy offers an efficient means of
narrowing an investigation, often to only a few individuals. The number of possible relatives
included in a genetic genealogy analysis varies depending on the number and distance of the
matches. Even when the only matches are distant and large family trees must be constructed
because common ancestors are many generations in the past, experienced genetic
genealogists can triangulate among the matches to determine the most promising branches of
the family tree and limit the amount of unnecessary tree building. Given sufficient triangulation
and time, the number of leads can be reduced to the offspring of a single couple.
No matter how confident the identification, however, genetic genealogy alone cannot prove
identity with 100% certainty. There is always a remote possibility that the unknown individual
could have been adopted or abandoned, and his or her existence could be unknown to family
and not revealed through official records. Therefore, genetic genealogy leads must be verified
through a direct DNA comparison between the person-of-interest’s STR profile and that of the
crime scene sample. It is this traditional forensic DNA match that is used for prosecution.
Case Studies
The following case studies demonstrate how genetic genealogy has been used to assist
investigators with identifying a suspect in cold case investigations. Only information approved
for public release by the investigating agencies is included, so some case details (e.g., DNA
sample source, exact GEDmatch match information) have been obfuscated.
Case Study #1: Snohomish County, WA; 31-year-old cold case (double homicide)
This case study demonstrates the ideal genetic genealogy case, where there are close matches
and clear familial connections that point to only a single conclusion. However, even seemingly
straightforward cases require a large amount of research and the expertise to recognize and
cope with confounding factors such as unknown and misattributed parentage.
The Crime: In 1987, a young Canadian couple, Jay Cook (20) and Tanya Van Cuylenborg (18),
traveled from British Columbia to Washington State in a van. After purchasing a ferry ticket to
Seattle, they were never heard from again. Days later, Tanya’s body was found in a ditch in the
woods, and a few days after that, Jay’s body and the van were found in two separate locations.
DNA evidence was obtained for an unknown suspect (“Subject”).
GEDmatch: There were two matches at approximately the 5th degree relative level, plus
additional more distant matches. The top two matches had no shared DNA between them,
meaning they were most likely related to the Subject on different branches of his family tree.
Family Trees: Family trees were constructed for both key matches back to their great-
grandparents and beyond using census records, vital records, newspaper archives, public
“people search” databases, public social media data, and public family trees. Next,
descendancy research was performed to trace the descendants of each set of ancestors to
determine if an intersection between them could be found.
A triangulating marriage was found between a granddaughter of Match #2’s great-grandparents
and a son of Match #1’s great-grandmother. Extensive research revealed that this son had
taken his stepfather’s surname, initially obscuring his true relationship to Match #1. Thus, the
children of this marriage were half first cousins once-removed to Match #1, as well as second
cousins to Match #2. While both of these relationships are 5th degree, it is critical to consider
all possible relationship types, as half relationships are quite common. No other marriages were
found between the descendants of these ancestors. There was only one son from this
marriage, William Earl Talbott II, and he was therefore the only known male who could be
carrying this mix of DNA from both matches’ families (Figure 8).
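The search for a triangulating marriage can be sketched as a set intersection problem: collect the descendants of each match's ancestors, then look for a marriage with one spouse in each set. The example below is a minimal sketch on a hypothetical toy pedigree. The names, the `children` and `marriages` structures, and the function names are illustrative assumptions, not the data or software used in the case.

```python
def descendants(children, founders):
    """Return everyone descended from any founder (founders themselves excluded)."""
    found, stack = set(), list(founders)
    while stack:
        person = stack.pop()
        for child in children.get(person, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def triangulating_marriages(children, marriages, founders_1, founders_2):
    """Marriages joining a descendant of founders_1 with a descendant of founders_2."""
    d1 = descendants(children, founders_1)
    d2 = descendants(children, founders_2)
    return [
        (a, b) for a, b in marriages
        if (a in d1 and b in d2) or (a in d2 and b in d1)
    ]


# Hypothetical pedigree: parent -> list of children
children = {
    "GGM1": ["son1", "gp_match1"],
    "gp_match1": ["parent_match1"],
    "parent_match1": ["match1"],
    "GGF2": ["child2a", "gp_match2"],
    "child2a": ["granddaughter2"],
    "gp_match2": ["parent_match2"],
    "parent_match2": ["match2"],
}
marriages = [("son1", "granddaughter2"), ("gp_match1", "outsider")]

hits = triangulating_marriages(children, marriages, {"GGM1"}, {"GGF2"})
```

In practice the difficult part is building `children` and `marriages` from documentary sources. As the surname change in this case shows, a record-based tree can hide a true biological link.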
Mr. Talbott had never been arrested for a crime that would require submitting DNA to a
database. He had no known connection to the victims and no reason to have been on the
investigators’ radar. His phenotypes matched those predicted by Snapshot, but without other
information to tie him to the crime, this had not been enough to identify him as a suspect.
Figure 8: Anonymized family tree released by the Snohomish County Sheriff’s Department as part of
their announcement of the arrest of William Earl Talbott II. The tree shows the position of Mr. Talbott
(Suspect) and two GEDmatch matches (Cousins) used to determine his identity.
Resolution: Based on the lead provided by genetic genealogy, the detectives were able to
collect DNA from a cup discarded by Mr. Talbott, which, using traditional STR analysis, was
shown to match the DNA from the crime scene. He was arrested and is currently awaiting trial.
Case Study #2: Tacoma, WA; 32-year-old cold case (homicide)
Triangulation between matches using documentary sources is sometimes not possible. In
addition to being able to tenaciously research records and meticulously build family trees, this
case study shows how genetic genealogists must be able to think creatively about possible
hypotheses to explain the available data.
The Crime: 12-year-old Michella Welch went missing on 26 March 1986. She had taken her
two younger sisters to Puget Park in Tacoma, Washington and then ridden her bicycle home to
make lunch while her sisters played nearby. When the sisters returned to the park, they found a
brown paper bag with their lunches but no Michella. By 3:10 p.m., officers arrived at the park
and started searching for the missing girl. A tracking dog found her body around 11:30 p.m.
She had been beaten and sexually assaulted and died from a cut to the neck.
The DNA: Another young Tacoma girl, Jennifer Bastian, was also killed around the same time,
and investigators had long believed one person committed both crimes. More than 10,000
investigative hours went into the cases in 1986 alone. Recent DNA testing showed that the
crimes were committed by different men, but neither DNA profile resulted in a CODIS match.
Genetic Ancestry: The Subject was predicted to be predominantly Northern European with a
small but notable amount of Northern Native American admixture (~10%).
GEDmatch: The two top matches did not share DNA, suggesting they were most likely related
to the Subject on different branches of his family tree.
Family Trees: Trees were built for the two top matches back to their great-great-grandparents
and beyond, and extensive descendancy research was performed, but no documented
intersection was found between the two families. The analyst identified a pair of brothers who
were cousins of Match #1, lived within a few miles of the crime scene in 1986, and had two
Native American great-great-grandparents on different branches of their family trees, which was
consistent with the predicted ancestry of the Subject. However, the Subject only shared about
half as much DNA with Match #1 as would be expected for a cousin, and there should have
been an intersection between the families that would connect these cousins to both matches.
When families are connected through DNA but do not intersect on paper (e.g., through a
marriage license or a birth certificate), the explanation may be misattributed paternity: two
individuals, one from each family, had a child together, but the true biological father was not recorded.
Through census record research, it was discovered that relatives of the two matches had lived
in the same small town when one of the cousins’ ancestors was conceived. This was the only
discovered geographical intersection between these families. Based on the amount of shared
DNA, it was postulated that Match #2’s relative was the unrecorded biological father of the
cousins’ ancestor (Figure 9). Under this hypothesis, the cousins would actually be half cousins
to Match #1, which matched the amount of shared DNA. They would also be related to Match
#2 at the appropriate genetic distance.
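The "half as much DNA" reasoning follows from the expected amount of sharing at each degree of relatedness. Under the degree convention used in this paper (a first cousin is a 3rd degree relative), the expected shared fraction of autosomal DNA halves with each additional degree, and a half relationship sits one degree further out than the corresponding full relationship. The sketch below is illustrative only: the function names are hypothetical, a first cousin is used as the example because the exact cousin relationship in this case is not stated, and observed sharing varies around these expectations.

```python
def expected_shared_fraction(degree: int) -> float:
    """Expected fraction of autosomal DNA shared by relatives of a given degree."""
    return 0.5 ** degree


def half_relationship_degree(full_degree: int) -> int:
    """A half relationship is one degree more distant than its full counterpart."""
    return full_degree + 1


first_cousin = 3  # 3rd degree relative, per the paper's convention
full = expected_shared_fraction(first_cousin)
half = expected_shared_fraction(half_relationship_degree(first_cousin))
ratio = half / full  # a half cousin is expected to share half as much as a full cousin
```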
Figure 9: Pedigree for two cousins of Match #1 who were identified as persons-of-interest in the Tacoma
case, showing the apparent misattributed paternity between Match #1’s relative and Match #2’s relative.
Resolution: The genetic genealogy analysis identified a pair of brothers who could be the
Subject, neither of whom had ever been arrested for a crime that would have required
submission of DNA to a database. Officers were eventually able to follow one of the brothers,
Gary Charles Hartman, into a restaurant, where they obtained a napkin he had used and
discarded. Traditional STR analysis showed that the DNA on the napkin matched the DNA
found at the crime scene. More than thirty years after Michella Welch was found murdered in a
Washington park, investigators announced that they had arrested a suspect in her murder.
Hartman is currently awaiting trial.
Case Study #3: Nearly 40-year-old cold case (homicide)
When there are not enough strong matches in GEDmatch to fully narrow down the possible
branches of a large family tree, cases cannot always be resolved efficiently through genetic
genealogy alone. If an intersection between the matches’ families cannot be found, the number
of possible identities for the Subject can be very large. However, as this case study shows, if
family members of the matches are willing to cooperate, targeted kinship testing can quickly
include or exclude various branches of the family tree and thus arrive at a small number of
included individuals. Because close relatives of the suspect were eventually found in this
investigation, the details of this case are withheld to protect their privacy.
GEDmatch: The Subject’s top two matches were both in the 6th-8th degree relative range and
had no shared DNA between them, meaning they were most likely related to the Subject on
different branches of his family tree. There were also additional, more distant matches.
Family Trees: Trees were built for the two top matches back to their great-great-grandparents,
but no intersection was found between the two families. The Subject was most likely a great-
grandson or great-great-grandson of one of Match #1’s great-great-grandparent couples, but
without triangulation, it was not possible to narrow his identity down further. Parabon
recommended more research to identify branches of the family that might have moved to the
area of the crime, as well as targeted kinship testing of members of the top match’s family.
Kinship Testing: The investigating agency obtained a voluntary buccal swab from a cousin on
Match #1’s paternal side, from which DNA was extracted, genotyped, and compared to the
Subject. Snapshot Kinship Inference predicted this individual was unrelated to the Subject, and
Match #1’s paternal family could therefore likely be excluded (assuming the familial
relationships on paper were correct). The agency then obtained a voluntary buccal swab from a
cousin on Match #1’s maternal side, who was predicted with 94.2% confidence to be a 3rd
degree relative (first cousin or genetic equivalent) to the Subject.
Targeted Family Trees: The analyst built family trees for the spouses of each of the kinship
tester’s maternal aunts and uncles back to their great-great-great-grandparents. One uncle’s
wife was determined to be a distant cousin to many of the Subject’s more distant matches. This
triangulation meant that one of the male children of this couple was most likely the Subject, as
he would be related to the GEDmatch matches on both sides of his family tree: a second cousin
once-removed (6th degree relative) to Match #1 and a distant cousin (ranging from third
cousin once-removed to fifth cousin once-removed) to Distant Matches #1-7 (Figure 10).
Importantly, barring additional independent intersections between these family trees, the
identified Persons of Interest were the only individuals who were related to both of these
families. These children were also the right age at the time of the crime, lived nearby, and all
appeared to have phenotypes consistent with the Snapshot predictions.
Figure 10: Pedigree built for Match #1’s family after the possible branches leading to the Subject were
narrowed down through targeted kinship testing and subsequent triangulation with distant matches.
Resolution: The genetic genealogy analysis identified a set of brothers who could be the
Subject, none of whom had ever been arrested for a crime that would have required submission
of DNA to a database. Officers were eventually able to narrow the investigation down to a
single brother and match his DNA to the crime scene DNA using traditional STR analysis. He
has been arrested and is awaiting trial.
Conclusion
Genetic genealogy has been called “2018’s biggest contribution to crime science” (Augenstein,
2018) and is rapidly changing the face of cold case investigations. Even for perpetrators who
are completely under the radar or long dead, given DNA from a crime scene, it may be possible
to identify them with genetic genealogy. Importantly, genetic genealogy has just as much power
to generate leads in active cases as in cold cases. In fact, it was recently used to identify a
perpetrator in a sexual assault case that had occurred only three months earlier (Havens, 2019),
and he has since pled guilty. Rather than wait until years have passed and all other leads have
been exhausted, investigators now have access to innovative forensic DNA technologies that
can generate significant new leads and prevent cases from going cold. Looking to the future,
genetic genealogy has the potential to significantly reduce the number of unsolved cold cases in
North America while also reducing the rate at which cases go cold.
References
Augenstein, S. (2018). Working Backward From Genealogy: Tracking a Dead Killer’s Trail.
Forensic Magazine.
Ball, C., Barber, M., Byrnes, J., Carbonetto, P., Chahine, K., Curtis, R., . . . Willmore, L. (2016).
Ancestry DNA Matching White Paper. Retrieved from
Bettinger, B. T., & Perl, J. (2018). The Shared cM Project 3.0 tool v4. Retrieved from
Erlich, Y., Shor, T., Pe'er, I., & Carmi, S. (2018). Identity inference of genomic data using long-
range familial searches. Science, 362(6415), 690-694. doi:10.1126/science.aau4832
Terms of Service and Privacy Policy.
Greytak, E., & Moore, C. (2018). Closing Cases with a Single SNP Array: Integrated Genetic
Genealogy, DNA Phenotyping, and Kinship Analyses. Proceedings of the 29th
International Symposium on Human Identification.
Greytak, E. M., & Armentrout, S. (2015). DNA Phenotyping: Predicting Ancestry and Physical
Appearance from Forensic DNA. Proceedings of the 26th International Symposium on
Human Identification.
Greytak, E. M., Gorden, E. M., Marshall, C. K., Sturk-Andreaggi, K., McMahon, T. P., &
Armentrout, S. L. (2017). SNP Recovery from Degraded Samples for Kinship
Greytak, E. M., Kaye, D. H., Budowle, B., Moore, C., & Armentrout, S. L. (2018). Privacy and
genetic genealogy data. Science, 361(6405), 857. doi:10.1126/science.aav0330
Greytak, E. M., Moore, C., & Armentrout, S. L. (2018). RE: Identity inference of genomic data
using long-range familial searches, Erlich et al. Science, 362(6415) (2018), 690-694
(eLetter, 10-29-18).
Guerrini, C. J., Robinson, J. O., Petersen, D., & McGuire, A. L. (2018). Should police have
access to genetic genealogy databases? Capturing the Golden State Killer and other
criminals using a controversial new forensic technique. PLOS Biology, 16(10),
e2006906. doi:10.1371/journal.pbio.2006906
Havens, E. (2019). Elderly woman in home invasion rape case: I forgive my attacker. St.
George Spectrum & Daily News. Retrieved from
Henn, B. M., Hon, L., Macpherson, J. M., Eriksson, N., Saxonov, S., Pe'er, I., & Mountain, J. L.
(2012). Cryptic distant relatives are common in both isolated and cosmopolitan genetic
samples. PLoS One, 7. doi:10.1371/journal.pone.0034267
Huff, C. D., Witherspoon, D. J., Simonson, T. S., Xing, J., Watkins, W. S., Zhang, Y., . . . Jorde,
L. B. (2011). Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome
Research, 21, 768-774. doi:10.1101/gr.115972.110
International Society of Genetic Genealogy (2019). Endogamy. Accessed January 30, 2019.
Retrieved from
Keating, B., Bansal, A. T., Walsh, S., Millman, J., Newman, J., Kidd, K., . . . Kayser, M. (2013).
First all-in-one diagnostic tool for DNA intelligence: genome-wide inference of
biogeographic ancestry, appearance, relatedness, and sex with the Identitas v1 Forensic
Chip. International Journal of Legal Medicine, 127, 559-572. doi:10.1007/s00414-012-
Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W.-M. (2010).
Robust relationship inference in genome-wide association studies. Bioinformatics, 26,
2867-2873. doi:10.1093/bioinformatics/btq559
Milian, J. (2018). Cold-case murders, rapes cracked by Lake Worth genealogy website. The
Palm Beach Post. Retrieved from
Regalado, A. (2019). More than 26 million people have taken an at-home ancestry test. MIT
Technology Review. Retrieved from
... Si se trata de un caso antiguo a veces no es posible. Las técnicas moleculares actuales permiten, en muchas ocasiones, extraer suficiente DNA a partir de dichas muestras, aunque estén degradadas o contaminadas y poder obtener un perfil genético del sospechoso (Greytak et al. 2019). ...
... Fijémonos que construimos la genealogía yendo hacia atrás en el tiempo pues buscamos los antepasados comunes. También existen dificultades adicionales (Greytak et al. 2019), como por ejemplo que para cada grado de parentesco existen distintas relaciones posibles. Se ha visto también que la cantidad de DNA compartido en una misma relación de parentesco puede variar según la población humana. ...
... Una vez se obtienen los individuos actuales que están emparentados familiarmente con el sospechoso gracias a un ancestro común, hay que deducir cuál de ellos puede serlo. Como es fácil imaginar la cantidad de personas de la generación actual (o generaciones actuales) que podrían ser de entrada nuestro sospechoso es muy grande (Greytak et al. 2018(Greytak et al. , 2019Scudder et al. 2019). Por ejemplo, cuando se detecta una coincidencia entre primos terceros se estima, a partir de simulaciones, que el número de posibles candidatos a ser el sospechoso estaría alrededor de 850 (Erlich et al. 2018). ...
RESUMEN: Recientemente ha aparecido una nueva forma de identificar sospechosos mediante el DNA, útil cuando los procedimientos estándar no dan resultado. Se trata de la búsqueda familiar de largo alcance, que ha permitido resolver casos abiertos. El objetivo de este artículo es mostrar, de manera concisa y clara, en que consiste este nuevo procedimiento. En líneas generales, se trata de extraer nuevamente DNA del sospechoso a partir muestras biológicas que se conserven. Hay que obtener su perfil genético de manera que sea compatible con los existentes en los bancos de datos usados para construir genealogías. Se debe introducir el perfil del sospechoso en dichos bancos y buscar si parte del mismo coincide con el de alguna persona. Si es así, seguramente se trata de un familiar lejano. Se busca el antepasado común y a partir de él se reconstruye el árbol genealógico familiar hasta los descendientes actuales. Entre ellos estará el sospechoso, que podrá ser identificado aplicando diferentes técnicas de filtraje. PALABRAS CLAVE: DNA, marcadores genéticos, perfil genético, bancos de datos, casos abiertos, búsqueda familiar, genealogía genética. TITLE: LONG-RANGE FAMILIAL SEARCH: ANOTHER APPROACH TO DETECT SUSPECTS BASED ON DNA ABSTRACT: A new procedure to identify suspects by means of DNA has recently emerged, useful when standard procedures fail. It is called the long-range familial search, which has allowed to solve cold-cases. The objective of this article is to present, in a concise and clear way, this new procedure. In general terms, it consists of extracting again the suspect's DNA from biological samples that are preserved. It is necessary to obtain the genetic profile, but compatible with those existing in the databanks used to build genealogies. The suspect's profile must be introduced in these data banks and to find out
... Investigative genetic genealogy (IGG), also known as forensic genetic genealogy (FGG), involves generation of a single nucleotide polymorphism (SNP) profile from an unsolved forensic case sample and entering it into a consumer genealogy database. Extended family trees are built using potential relatives surfaced from that search, seeking to locate a potential relative for direct comparison to the perpetrator's profile [12,62]. One of the first such cases was that of the Golden State Killer, who was identified as Joseph DeAngelo. ...
... As currently performed for IGG, SNP profiles can be entered into consumer genealogy databases and indirect matches found with biologically related individuals [12,62]. These indirect matches to known individuals provide the basis for genealogical research, which utilizes ascending family trees back in time to include the most recent common ancestor (MRCA) between the individuals and the forensic profile. ...
Full-text available
DNA databases effectively develop investigative leads, with database size being directly proportional to increased chances of solving crimes as demonstrated by a business case including a universal STR database example. DNA database size can be expanded physically by increasing the number and type of qualifying offenses, adding arrestees, or moving towards a universal database. The theoretical size of a DNA database can also be increased scientifically by using the inherent nature of DNA sharing by biologically related individuals by using an indirect matching strategy including Partial Matching, Familial Searching, and Investigative Genetic Genealogy (IGG). A new strategy is introduced using areas of shared DNA as a search key to locate potential relatives for further kinship evaluation. New search key strategies include Y-STR, mtDNA, and X Chromosome searching to locate potential relatives, coupled with kinship and genetic genealogical research, as well as expanded use of unidentified human remains (UHRs).
... Forensic setting. Law enforcement looms large in public opinion about genetic data since it may seek to access genetic information, an issue that has gained intense interest in the wake of high-profile cold cases that were ultimately solved using such information 183 . Over the years, there has also been an effort to expand government-run forensic databases at the federal, state and local levels 184 . ...
... Furthermore, law enforcement may also seek to exploit public databases or utilize the services of a DTC-GT company for forensic genealogy purposes in FGG/IGG. To date, law enforcement in the United States has largely focused its efforts on publicly accessible databases (for example, GEDmatch) 183 and private databases held by companies that voluntarily cooperate (for example, FamilyTreeDNA) 190 . For example, law enforcement generated leads in dozens of cold cases by uploading genetic profiles derived from crime scenes to GEDmatch, a public database where individuals can upload their DTC-GT data to learn about where their forebears came from and to locate potential genetic relatives. ...
Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information. In this Review, the authors describe technical and legal protection mechanisms for mitigating vulnerabilities in genomic data privacy. They also discuss how these protections are dependent on the context of data use such as in research, health care, direct-to-consumer testing or forensic investigations.
... From a DNA point of view, if the DNA profile from a UHR case is not matched in a local, state or national missing persons DNA database, an identification cannot be achieved and must wait for the appropriate Ante-Mortem (AM) data to be made available. In these instances, forensic genetic genealogy (FGG) -an emerging field for forensic investigation -has the potential to provide an alternate avenue for identification [2][3][4][5][6]. ...
... Comparisons to the Victorian Missing Persons DNA Database (VMPDD) -which at present holds more than 350 reference samples for missing persons -failed to produce a familial match in these cases. Where possible, biogeographical ancestry (BGA) predictions (previously derived using the Precision ID Ancestry Panel, Thermo-Fisher Scientific) were used to assist with case selection (Table 1), as subjects of European ancestry have a higher probability of success due to their over-representation in genetic genealogy databases [4]. Sample types were mostly bone samples, with one sample being a bloodstain sample (Table 1). ...
The successful application of forensic genetic genealogy (FGG) to identify Jane and John Doe cases in the United States, has raised the prospect of using the technique in Australia to assist in the reconciliation of unidentified human remains (UHRs) with long term missing persons. A study was conducted to explore the feasibility of FGG using whole genome array (WGA) data from both pristine control samples as well as compromised casework samples, with the view to explore how DNA quantity and quality impacted on the ability to generate search results when compared to a genetic genealogy database, such as GEDmatch. From this study, several insights were gained as to the impact DNA quantity and degradation had on the percentage of SNPs genotyped and heterozygote/homozygote ratio – which are critical for successful matching outcomes. It was noted in this study (using a control sample) that successful matching occurred when genotyping errors were 5% or less. Two UHR cases were matched to kits on GEDmatch PRO, which provided investigative leads for identification purposes. The effectiveness of the FGG approach to match casework samples (as well as volunteer samples used in the study) is indicative of the usage of ‘direct-to-consumer’ (DTC) genetic testing by Australians. Given the (often) limited availability of casework samples, findings from this study will assist Australian agencies considering the use of FGG, to determine if WGA is a suitable method for their application.
... Extant direct to consumer (DTC) SNP databases provided an exponentially growing, vast resource that could produce a genetic trail to anyone whose relatives participated in the consumer DNA testing market and opted into genealogical DNA databases such as GEDmatch [3,4]. The IGG method involves determining an extended DNA profile from a crime scene (or other forensic) sample using high-density SNP genotyping microarrays [5] or whole-genome sequencing (WGS) [6]. Then the SNP genotype dataset is searched in a public DNA database to identify matches with significant DNA sharing, indicating a familial relationship. ...
... (4) Local realignment of unaligned ends was performed to improve indel alignment in the mapping. (5) The alleles observed at each of the 5422 FORCE SNPs was determined using the Identify Known Mutations from Mappings tool, ignoring broken pairs in the bone set sequence data. Various analysis metrics such as the total number of reads, percent mapped, and percent duplicate reads were reported throughout the workflow using different CLC Genomics Workbench quality control (e.g., QC for Targeted Sequencing) tools. ...
Full-text available
The FORensic Capture Enrichment (FORCE) panel is an all-in-one SNP panel for forensic applications. This panel of 5422 markers encompasses common, forensically relevant SNPs (identity, ancestry, phenotype, X- and Y-chromosomal SNPs), a novel set of 3931 autosomal SNPs for extended kinship analysis, and no clinically relevant/disease markers. The FORCE panel was developed as a custom hybridization capture assay utilizing ~20,000 baits to target the selected SNPs. Five non-probative, previously identified World War II (WWII) cases were used to assess the kinship panel. Each case included one bone sample and associated family reference DNA samples. Additionally, seven reference quality samples, two 200-year-old bone samples, and four control DNAs were processed for kit performance and concordance assessments. SNP recovery after capture resulted in a mean of ~99% SNPs exceeding 10X coverage for reference and control samples, and 44.4% SNPs for bone samples. The WWII case results showed that the FORCE panel could predict first to fifth degree relationships with strong statistical support (likelihood ratios over 10,000 and posterior probabilities over 99.99%). To conclude, SNPs will be important for further advances in forensic DNA analysis. The FORCE panel shows promising results and demonstrates the utility of a 5000 SNP panel for forensic applications.
... The technology has also been approved for use within the U.S. criminal justice system for both STR profiling and mitochondrial (mt) DNA sequencing [2]. Furthermore, whole genome (shotgun) sequencing and genome-wide SNP arrays are often used for investigative genetic genealogy to provide leads in active and cold cases [3]. While more challenging cases involving historical skeletal remains, including historical figures, unmarked graves, and unidentified war victims, have seen some advances in DNA profiling with MPS, success rates are still low [4][5][6]. ...
Full-text available
The integration of massively parallel sequencing (MPS) technology into forensic casework has been of particular benefit to the identification of unknown military service members. However, highly degraded or chemically treated skeletal remains often fail to provide usable DNA profiles, even with sensitive mitochondrial (mt) DNA capture and MPS methods. In parallel, the ancient DNA field has developed workflows specifically for degraded DNA, resulting in the successful recovery of nuclear DNA and mtDNA from skeletal remains as well as sediment over 100,000 years old. In this study we use a set of disinterred skeletal remains from the Korean War and World War II to test if ancient DNA extraction and library preparation methods improve forensic DNA profiling. We identified an ancient DNA extraction protocol that resulted in the recovery of significantly more human mtDNA fragments than protocols previously used in casework. In addition, utilizing single-stranded rather than double-stranded library preparation resulted in increased attainment of reportable mtDNA profiles. This study emphasizes that the combination of ancient DNA extraction and library preparation methods evaluated here increases the success rate of DNA profiling, and likelihood of identifying historical remains.
Anthropologists are often the custodians of long-term unidentified human remains through their positions as curators of university or museum skeletal collections. Various factors decrease the solvability of these legacy cases, including the passage of time, the loss of provenience for specific cases, and lack of documentation or case records. While anthropologists can contribute important information toward identification, it is often necessary to explore novel and cross-disciplinary strategies to resolve difficult cold cases. In long cold cases, the postmortem interval, in particular, may be difficult to estimate, leading to further challenges in achieving identification. Modern advances in radiocarbon bomb pulse dating, isotope analysis, and actualistic studies have contributed to positive identification of unidentified human remains in some legacy cases, but may not be available to all forensic practitioners and law enforcement from resource-poor agencies. Pooling resources, as well as collaborating with professionals outside of forensic anthropology, is a useful strategy to pursue when anthropological methods are exhausted. The case study presented here demonstrates a collaborative approach between forensic anthropologists, forensic genetic genealogists, and law enforcement in a century-old homicide. The dismembered and mummified parts of a male body were recovered in a remote cave in 1979 and again in 1991. Despite forensic anthropologists creating and updating the biological profile over the decades from recovery to present, no identification was made until the application of forensic genetic genealogy (FGG) to the case in 2019. New interpretations of bone microstructure and trauma analysis are presented for the case, alongside the historical documentation and “proof of life” evidence used by the genealogy team. A review of the FGG methods underscores the challenges in this case (e.g., significant endogamy, multiple aliases used by the victim) and the steps taken toward resolution. Ultimately, a combined anthropology and genealogy approach resulted in a confirmed identity for a man who was murdered in 1916.
Key points: Forensic scientists should leverage a collaborative, interdisciplinary approach toward human identification. When combined with forensic anthropology methods, forensic genetic genealogy is a valuable tool linking biological and cultural-historical aspects of identity. Forensic anthropologists should review challenging cases in their labs as new methods are introduced and new resources become available.
This research undertakes a systematic review of cases cleared using forensic genetic genealogy (FGG) and has produced a dataset and annotated bibliography that can be used for further research. Data was collected to better understand the impact of FGG on a number of metrics relating to substantive, procedural, and distributive justice. FGG has been used primarily to clear cases involving serial and sexual violence against female and vulnerable victims, and in cases involving stranger victimization – cases that have traditionally been more difficult to clear. About 80% of victims were targeted for sexual violence, and about 28% belonged to social groups that are particularly vulnerable to criminal and sexual exploitation. About 79% of suspects and 48% of victims were of European ancestry, although the ancestry of many victims was unknown as they are victims of sexual assault. In the U.S., FGG investigations were primarily conducted in lower-income U.S. counties and were overrepresented in rural areas while being underused in the 50 largest metropolitan areas. The ten largest police departments in the U.S. cleared only 2% of cases and identified 5 suspects. The average time for FGG to clear a case was found to be 12.1 months. The rights of defendants to a fair trial are best secured when cases are cleared quickly and brought to trial in a timely fashion. Many of these cases had been open for many years: this allowed offenders to victimize others, while also leading to higher incidences of wrongful convictions, prosecutions, and innocent suspects being investigated by police. Rates of prior wrongful conviction in the dataset ranged from 2.8% of all crimes, to 3.7% of all homicides. Rates of wrongful prosecutions ranged from 10.1% of all crimes, to 16.5% of all rape-murders. Two persons have been exonerated as a result of an FGG investigation. 
FGG is advancing the aims of justice in several important ways and should be made more widely available, especially for victims who have been less well-served by the criminal justice system. Further steps to regulate FGG will be welcome in order to further these benefits while reassuring the public of the safety and security of their information.
On April 24, 2018, a suspect in California’s notorious Golden State Killer cases was arrested after decades of eluding the police. Using a novel forensic approach, investigators identified the suspect by first identifying his relatives using a free, online genetic database populated by individuals researching their family trees. In the wake of the case, media outlets reported privacy concerns with police access to personal genetic data generated by or shared with genealogy services. Recent data from 1,587 survey respondents, however, provide preliminary reason to question whether such concerns have been overstated. Still, limitations on police access to genetic genealogy databases in particular may be desirable for reasons other than current public demand for them.
When a forensic DNA sample cannot be associated directly with a previously genotyped reference sample by standard short tandem repeat profiling, the investigation required for identifying perpetrators, victims, or missing persons can be both costly and time consuming. Here, we describe the outcome of a collaborative study using the Identitas Version 1 (v1) Forensic Chip, the first commercially available all-in-one tool dedicated to the concept of developing intelligence leads based on DNA. The chip allows parallel interrogation of 201,173 genome-wide autosomal, X-chromosomal, Y-chromosomal, and mitochondrial single nucleotide polymorphisms for inference of biogeographic ancestry, appearance, relatedness, and sex. The first assessment of the chip's performance was carried out on 3,196 blinded DNA samples of varying quantities and qualities, covering a wide range of biogeographic origin and eye/hair coloration as well as variation in relatedness and sex. Overall, 95 % of the samples (N = 3,034) passed quality checks with an overall genotype call rate >90 % on variable numbers of available recorded trait information. Predictions of sex, direct match, and first to third degree relatedness were highly accurate. Chip-based predictions of biparental continental ancestry were on average ~94 % correct (further support provided by separately inferred patrilineal and matrilineal ancestry). Predictions of eye color were 85 % correct for brown and 70 % correct for blue eyes, and predictions of hair color were 72 % for brown, 63 % for blond, 58 % for black, and 48 % for red hair. From the 5 % of samples (N = 162) with <90 % call rate, 56 % yielded correct continental ancestry predictions while 7 % yielded sufficient genotypes to allow hair and eye color prediction. Our results demonstrate that the Identitas v1 Forensic Chip holds great promise for a wide range of applications including criminal investigations, missing person investigations, and for national security purposes.
Although a few hundred single nucleotide polymorphisms (SNPs) suffice to infer close familial relationships, high-density genome-wide SNP data make possible the inference of more distant relationships such as 2nd to 9th cousinships. In order to characterize the relationship between genetic similarity and degree of kinship given a timeframe of 100-300 years, we analyzed the sharing of DNA inferred to be identical by descent (IBD) in a subset of individuals from the 23andMe customer database (n = 22,757) and from the Human Genome Diversity Panel (HGDP-CEPH, n = 952). With data from 121 populations, we show that the average amount of DNA shared IBD in most ethnolinguistically-defined populations, for example Native American groups, Finns and Ashkenazi Jews, differs from continentally-defined populations by several orders of magnitude. Via extensive pedigree-based simulations, we determined bounds for predicted degrees of relationship given the amount of genomic IBD sharing in both endogamous and 'unrelated' population samples. Using these bounds as a guide, we detected tens of thousands of 2nd to 9th degree cousin pairs within a heterogeneous set of 5,000 Europeans. The ubiquity of distant relatives, detected via IBD segments, in both ethnolinguistic populations and in large 'unrelated' population samples has important implications for genetic genealogy, forensics and genotype/phenotype mapping studies.
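As a rough illustration of how IBD sharing falls off with cousin degree, the sketch below computes the textbook expectation for full n-th cousins in a non-endogamous pedigree. It is not the simulation method of the study above. The function names and the 3,400 cM autosomal map length are assumptions chosen for illustration; real map lengths vary by genetic map and testing company, and observed sharing scatters widely around these averages.

```python
# Assumed round figure for the autosomal genetic map length, in centimorgans.
ASSUMED_GENOME_CM = 3400.0


def coefficient_of_relatedness(cousin_degree: int) -> float:
    """Expected coefficient of relatedness r for full n-th cousins.

    Full n-th cousins descend from one ancestral couple, with n + 1
    meioses on each side, which gives r = (1/2) ** (2n + 1).
    """
    if cousin_degree < 1:
        raise ValueError("cousin_degree must be 1 or greater")
    return 0.5 ** (2 * cousin_degree + 1)


def expected_shared_cm(cousin_degree: int,
                       genome_cm: float = ASSUMED_GENOME_CM) -> float:
    """Expected length of half-identical (IBD1) sharing in centimorgans.

    Full cousins are not expected to share segments on both chromosome
    copies, so the IBD1 proportion of the genome is 2 * r.
    """
    return 2.0 * coefficient_of_relatedness(cousin_degree) * genome_cm


if __name__ == "__main__":
    for n in range(1, 6):
        r = coefficient_of_relatedness(n)
        print(f"cousin degree {n}: r = {r:.6f}, "
              f"expected sharing = {expected_shared_cm(n):.1f} cM")
```

Under these assumptions the expectation drops by a factor of four with each additional cousin degree, which is one reason distant relationships are hard to resolve from total sharing alone. Endogamy, as the abstract notes, can inflate sharing far beyond these values.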
Accurate estimation of recent shared ancestry is important for genetics, evolution, medicine, conservation biology, and forensics. Established methods estimate kinship accurately for first-degree through third-degree relatives. We demonstrate that chromosomal segments shared by two individuals due to identity by descent (IBD) provide much additional information about shared ancestry. We developed a maximum-likelihood method for the estimation of recent shared ancestry (ERSA) from the number and lengths of IBD segments derived from high-density SNP or whole-genome sequence data. We used ERSA to estimate relationships from SNP genotypes in 169 individuals from three large, well-defined human pedigrees. ERSA is accurate to within one degree of relationship for 97% of first-degree through fifth-degree relatives and 80% of sixth-degree and seventh-degree relatives. We demonstrate that ERSA's statistical power approaches the maximum theoretical limit imposed by the fact that distant relatives frequently share no DNA through a common ancestor. ERSA greatly expands the range of relationships that can be estimated from genetic data and is implemented in a freely available software package.
Genome-wide association studies (GWASs) have been widely used to map loci contributing to variation in complex traits and risk of diseases in humans. Accurate specification of familial relationships is crucial for family-based GWAS, as well as in population-based GWAS with unknown (or unrecognized) family structure. The family structure in a GWAS should be routinely investigated using the SNP data prior to the analysis of population structure or phenotype. Existing algorithms for relationship inference have a major weakness of estimating allele frequencies at each SNP from the entire sample, under a strong assumption of homogeneous population structure. This assumption is often untenable. Here, we present a rapid algorithm for relationship inference using high-throughput genotype data typical of GWAS that allows the presence of unknown population substructure. The relationship of any pair of individuals can be precisely inferred by robust estimation of their kinship coefficient, independent of sample composition or population structure (sample invariance). We present simulation experiments to demonstrate that the algorithm has sufficient power to provide reliable inference on millions of unrelated pairs and thousands of relative pairs (up to 3rd-degree relationships). Application of our robust algorithm to HapMap and GWAS datasets demonstrates that it performs properly even under extreme population stratification, while algorithms assuming a homogeneous population give systematically biased results. Our extremely efficient implementation performs relationship inference on millions of pairs of individuals in a matter of minutes, dozens of times faster than the most efficient existing algorithm known to us. Our robust relationship inference algorithm is implemented in a freely available software package, KING, available for download at∼wc9c/KING.
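To make the idea of a kinship estimate that does not depend on sample-wide allele frequencies more concrete, here is a minimal sketch. It assumes the commonly quoted within-family form of the KING-robust estimator, (N_het,het - 2 N_opposite-hom) / (N_het(i) + N_het(j)), and is not the KING software itself. The function name and the genotype encoding (0, 1, 2 copies of one allele, None for missing) are illustrative assumptions; check the KING documentation before relying on the formula.

```python
from typing import Optional, Sequence


def king_style_kinship(geno_i: Sequence[Optional[int]],
                       geno_j: Sequence[Optional[int]]) -> float:
    """Estimate a kinship coefficient from two genotype vectors.

    Genotypes are allele counts (0, 1 or 2); None marks a missing call.
    Only the two individuals' own genotypes are used, so no population
    allele frequencies are required.
    """
    if len(geno_i) != len(geno_j):
        raise ValueError("genotype vectors must have the same length")

    het_het = 0        # both individuals heterozygous
    opposite_hom = 0   # opposite homozygotes (0 vs 2), i.e. IBS0 sites
    het_i = 0          # heterozygous sites in individual i
    het_j = 0          # heterozygous sites in individual j

    for a, b in zip(geno_i, geno_j):
        if a is None or b is None:
            continue  # skip sites where either call is missing
        if a == 1:
            het_i += 1
        if b == 1:
            het_j += 1
        if a == 1 and b == 1:
            het_het += 1
        elif {a, b} == {0, 2}:
            opposite_hom += 1

    if het_i + het_j == 0:
        raise ValueError("no heterozygous sites; estimate is undefined")
    return (het_het - 2 * opposite_hom) / (het_i + het_j)
```

With this form, identical genotype vectors give 0.5 (duplicate or identical twin), and many opposite-homozygote sites push the estimate toward or below zero, as expected for unrelated pairs.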
Consumer genomics databases have reached the scale of millions of individuals. Recently, law enforcement authorities have exploited some of these databases to identify suspects via distant familial relatives. Using genomic data of 1.28 million individuals tested with consumer genomics, we investigated the power of this technique. We project that about 60% of the searches for individuals of European-descent will result in a third cousin or closer match, which can allow their identification using demographic identifiers. Moreover, the technique could implicate nearly any US-individual of European-descent in the near future. We demonstrate that the technique can also identify research participants of a public sequencing project. Based on these results, we propose a potential mitigation strategy and policy implications to human subject research.
Greytak, E. M., & Armentrout, S. (2015). DNA Phenotyping: Predicting Ancestry and Physical Appearance from Forensic DNA. Proceedings of the 26th International Symposium on Human Identification.
Greytak, E. M., Moore, C., & Armentrout, S. L. (2018). RE: Identity inference of genomic data using long-range familial searches, Erlich et al. Science, 362(6415) (2018), 690-694 (eLetter, 10-29-18).
Havens, E. (2019). Elderly woman in home invasion rape case: I forgive my attacker. St. George Spectrum & Daily News. Retrieved from