Author Manuscript, published in final form as:
Greytak EM, Moore C, & Armentrout SL (2019). Genetic genealogy for cold case and active
investigations. Forensic Science International, 299, 103–113. doi: 10.1016/j.forsciint.2019.03.039
Genetic genealogy for cold case and active investigations
Ellen M. Greytak, CeCe Moore, Steven L. Armentrout
Parabon NanoLabs, Inc., 11260 Roger Bacon Dr. Suite 406, Reston, VA, 20190, USA
• Genetic genealogy is helping to close both cold and active investigations
• Forensic DNA is uploaded to public genetic genealogy databases to find relatives
• Extensive genealogy and descendancy research generate a list of possible identities
• Many complicating factors can impede the research
• Identity is narrowed down using a wide range of information & confirmed using STRs
Investigative genetic genealogy has rapidly emerged as a highly effective tool for using DNA to
determine the identity of unknown individuals (unidentified remains or perpetrators), generating
identifications in dozens of law enforcement cases, both cold and active. The amount of press
coverage of these cases may have given the impression that the analysis is straightforward and
the outcome guaranteed once a sample is uploaded to a database. However, the database
query results serve only as clues from which in-depth genealogy and descendancy research
must proceed to determine the possible identities of an unknown individual. While there
certainly will be more announcements of cases solved using this new technique, there are many
more cases where identification has not yet been possible due to the wide variety of
complications present in these investigations. This paper lays out the fundamentals of genetic
genealogy, along with the challenges that are encountered in many of these investigations, and
concludes with a set of case studies that demonstrate the variety of cases encountered thus far.
Keywords: Genetic genealogy; Forensic genetics; DNA; SNPs; Cold cases; Human
Traditional genealogy has been practiced for centuries, using documentary records and oral
histories to trace families backwards in time. Until recently, these were the only ways to
connect extended family members, but with the advent of direct-to-consumer (DTC) genetic
testing, it is now possible to find relatives through shared DNA. This has enabled thousands of
individuals who have lost their biological identity through adoption, abandonment, anonymous
gamete donation, misattributed parentage, etc., to regain their genetic heritage. More recently,
these same tools have been used to identify DNA from suspected perpetrators in more than
thirty law enforcement cases, only some of which have been publicly announced (Table 1).
Table 1: Cases for which law enforcement agencies have announced identification of DNA from a
suspected perpetrator with the aid of genetic genealogy (through 1/31/19). * Deceased; ** Pled guilty
Multiple Homicides and Sexual Assaults - “Golden State Killer”
Double Homicide of Jay Cook (20) and Tanya Van Cuylenborg (18)
May 21, 2018
Homicide of Michella Welch (12)
Homicide of Christy Mirack (25)
Homicide of Virginia Freeman (40)
Fort Wayne, IN
Homicide of April Tinsley (8)
John Dale Miller**
July 15, 2018
Homicide of Constance Gauthier
July 18, 2018
St. George, UT
Sexual Assault of Carla Brooks
July 28, 2018
Multiple Sexual Assaults - “Ramsey Street Rapist”
Homicide of Holly Cassano (22)
Michael F. A.
Multiple Sexual Assaults
Homicide of Deborah Dalzell (47)
Multiple Sexual Assaults - “NorCal
Multiple Homicides and Sexual
Double Homicide of Betty Jones (65) and Kathryn Crigler (81)
Homicide of Pam Felkins (32)
Homicide of Lorrie Ann Smith (28)
Homicide of Michael Temple (29)
Homicide of Christine Franke (25)
Homicide of Jodine Serrin (39)
Santa Clara, CA
Homicide of Leslie Marie Perlov
Multiple Sexual Assaults
Cedar Rapids, IA
Homicide of Michelle Martinko
Jerry Lynn Burns
Sexual Assault of Unnamed Victim
William L. Nichols*
Sexual Assaults of Two Unnamed Victims (9 and 31)
La Mesa, CA
Homicide of Scott Martinez (47)
Homicide of Jack Upton (30)
Homicide of Anna Marie Hlavka
Unlike traditional forensic DNA analysis, which uses autosomal short tandem repeats (STRs) to
generate an identity profile from ~20 loci, genetic genealogy uses hundreds of thousands of
single nucleotide polymorphisms (SNPs) spread across the autosomes. Participants in genetic
genealogy have had their DNA tested by a DTC genetic testing company, such as 23andMe or
AncestryDNA; these companies use microarrays to genotype up to ~1 million SNPs.
DTC companies obtain DNA from spit kits or cheek swabs and thus always have a large amount
of high-quality single-source DNA to work with. Forensic DNA samples, on the other hand,
often only have a small amount of degraded DNA, which may be mixed with DNA from one or
more other individuals. Microarray genotyping has previously been shown to be effective and
accurate with forensic samples (Keating et al., 2013), and Parabon has used it for casework
since 2015, generating high genotyping call rates from forensic samples down to 1 ng of DNA
(Table 2). Parabon has also found it is possible to accurately deconvolute microarray data from
two-person mixtures, as long as the person-of-interest is at least 40% of the mixture and a
single-source reference sample from the second contributor is available.
Table 2: Summary of Parabon’s >250 forensic DNA samples used in genetic genealogy casework and
the resulting microarray genotyping call rates.
≤ 2.5 ng
Parabon’s casework currently uses the Illumina CytoSNP-850K array, an off-the-shelf chip that
contains >98% of the SNPs on the OmniExpress chip used by Ancestry.com, FamilyTreeDNA,
and MyHeritage. 23andMe previously also based their chip on the OmniExpress but has since
moved to smaller custom chips that overlap less with the other DTC companies. For law
enforcement cases, extracted DNA samples are processed at a CLIA-certified lab, and the data
is uploaded securely to Parabon.
Determining Relatedness from DNA
Given enough SNPs, it is possible to determine the degree of relatedness between two people,
which is defined by the expected amount of shared DNA, not the number of meioses (Figure 1).
Figure 1: Pedigree showing the degrees of relatedness, as defined by the expected amount of shared
DNA. Each relationship is defined with respect to the red “self / twin” box.
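To illustrate why degree is defined by expected sharing rather than by the number of meioses, the following minimal sketch computes the expected shared fraction of autosomal DNA from the number of meioses separating two people and the number of common ancestors they share. The function names are hypothetical, and the formula is the standard pedigree expectation, not a method described in this paper.

```python
import math

def expected_shared_fraction(meioses: int, common_ancestors: int) -> float:
    """Expected fraction of autosomal DNA shared identical-by-descent.

    Each meiosis halves the expected contribution of a common ancestor,
    and the contributions of separate common ancestors add up.
    """
    return common_ancestors * 0.5 ** meioses

def degree_of_relatedness(meioses: int, common_ancestors: int) -> int:
    """Degree d such that the expected shared fraction equals 0.5 ** d."""
    fraction = expected_shared_fraction(meioses, common_ancestors)
    return round(-math.log2(fraction))

# (meioses, common ancestors) for a few relationships
RELATIONSHIPS = {
    "parent / child": (1, 1),
    "half-sibling": (2, 1),
    "first cousin": (4, 2),
    "half-first cousin": (4, 1),
    "second cousin": (6, 2),
    "half-first cousin once-removed": (5, 1),
}

for name, (meioses, ancestors) in RELATIONSHIPS.items():
    print(name,
          expected_shared_fraction(meioses, ancestors),
          degree_of_relatedness(meioses, ancestors))
```

Under these assumptions, second cousins (six meioses, two common ancestors) and half-first cousins once-removed (five meioses, one common ancestor) fall into the same degree even though their meiosis counts differ.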
While several relationship inference methods had previously been proposed (Huff et al., 2011;
Manichaikul et al., 2010), 23andMe was the first DTC company to introduce an accurate,
scalable approach to inferring approximately how closely related two DNA samples are from
autosomal SNPs (Henn et al., 2012). Each person has two copies of each of the 22 autosomal
chromosomes (“autosomes”), one inherited from their mother and one inherited from their
father. Autosomes are not inherited intact from each parent; rather, each parent’s own pair of
chromosomes is randomly recombined into a new chromosome that is passed onto the child.
While recombination occurs randomly, nucleotides that are closer to one another on a
chromosome are more likely to be inherited together, while nucleotides that are far apart are
more likely to be separated by recombination. The probability of recombination between two
nucleotides is quantified as their genetic distance, which is measured in centimorgans (cM),
such that 1 cM equates to a 1% probability of recombination in a single meiosis.
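The 1 cM to 1% correspondence holds for short distances; over longer distances, multiple crossovers can occur between two loci, so the probability of recombination grows more slowly than the genetic distance. The sketch below uses Haldane's map function, a textbook model that assumes no crossover interference, to convert cM into a recombination probability. This model is brought in for illustration only and is not named in the text above; the function name is hypothetical.

```python
import math

def recombination_fraction(distance_cm: float) -> float:
    """Haldane map function.

    Returns the probability that two loci separated by `distance_cm`
    centimorgans are split by recombination in a single meiosis,
    assuming crossovers occur independently (no interference).
    """
    morgans = distance_cm / 100.0
    return 0.5 * (1.0 - math.exp(-2.0 * morgans))

for distance in (1, 10, 50, 200):
    print(distance, "cM ->", round(recombination_fraction(distance), 4))
```

For small distances the result is close to 1% per cM, and for very distant loci it approaches 50%, which corresponds to independent inheritance.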
Rather than simply looking at the total number of shared SNPs, genetic genealogy takes
advantage of the fact that recombination will break up long stretches of shared DNA over the
generations, such that more closely related people will share longer stretches of DNA
(“segments”) that are identical-by-descent (IBD) (Figure 2). The more recombination events
that have occurred, the shorter the shared IBD segments will be, so the number and length of
IBD segments in cM can be used to approximate the degree of relatedness.
Figure 2: Inheritance of DNA segments on a single chromosome. The lengths of the shared segments
(shaded boxes) are summed across all 22 autosomes to give the total amount of shared DNA.
To detect IBD segments, genetic genealogy algorithms search for regions of the genome where
two individuals share at least one allele at every SNP. To be counted, these segments must
contain a minimum number of SNPs (typically ~500) and be over a certain length (typically 5-7
cM), which screens out most segments that are shared by chance rather than due to common
descent. When summed across all autosomes, the amount of DNA shared IBD strongly
correlates with the degree of relatedness between two individuals, such that more distant
relatives tend to share less DNA (Table 3). However, due to the random nature of
recombination, the amount of shared DNA can vary greatly for relatives of the same degree,
and this variation increases with more recombination events, such that ~10% of third cousins
and ~50% of fourth cousins share no detectable IBD segments.
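The segment search described above can be sketched as a scan for runs of SNPs at which two unphased genotypes share at least one allele, keeping only the runs that pass both thresholds. This is a simplified illustration under stated assumptions: the function names and data layout are hypothetical, and production algorithms also tolerate genotyping errors and missing calls, which this sketch does not.

```python
from typing import List, Tuple

Genotype = Tuple[str, str]  # two unphased alleles at one SNP

def half_identical(g1: Genotype, g2: Genotype) -> bool:
    """True if the two genotypes share at least one allele."""
    return bool(set(g1) & set(g2))

def find_shared_segments(
    positions_cm: List[float],
    sample_a: List[Genotype],
    sample_b: List[Genotype],
    min_snps: int = 500,
    min_cm: float = 7.0,
) -> List[Tuple[float, float]]:
    """Return (start_cM, end_cM) for each qualifying run on one chromosome."""
    segments = []
    run_start = None  # index of the first SNP in the current matching run
    n = len(positions_cm)
    for i in range(n + 1):
        # The extra iteration at i == n closes a run that reaches the last SNP.
        matching = i < n and half_identical(sample_a[i], sample_b[i])
        if matching and run_start is None:
            run_start = i
        elif not matching and run_start is not None:
            snp_count = i - run_start
            length_cm = positions_cm[i - 1] - positions_cm[run_start]
            if snp_count >= min_snps and length_cm >= min_cm:
                segments.append((positions_cm[run_start], positions_cm[i - 1]))
            run_start = None
    return segments

def total_shared_cm(segments: List[Tuple[float, float]]) -> float:
    """Sum of segment lengths, as used to estimate relatedness."""
    return sum(end - start for start, end in segments)

# Toy data: ten SNPs spaced 1 cM apart, with thresholds lowered to match.
positions = [float(i) for i in range(10)]
person_a = [("A", "A")] * 10
person_b = [("A", "B")] * 6 + [("B", "B")] + [("A", "A")] * 3
toy_segments = find_shared_segments(positions, person_a, person_b,
                                    min_snps=4, min_cm=3.0)
print(toy_segments, total_shared_cm(toy_segments))
```

In the toy data, the opposite-homozygous genotype at the seventh SNP ends the first run, and the short run after it is discarded for containing too few SNPs.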
Table 3: The range of DNA shared by pairs of people with each relationship. While most pairs from a
given relationship fall within a narrower range, these values represent the full ranges that have been
observed (Ball et al., 2016).
Half-Sibling, Avuncular, Double First Cousin, Grandparent / Grandchild
First Cousin (1C), Half-Avuncular, Great-Grandparent / Great-Grandchild, Great-
First Cousin Once-Removed (1C1R), Half-First Cousin (½ 1C), Half-Great-Aunt/Uncle / Half-Great-Niece/Nephew
Second Cousin (2C), First Cousin Twice-Removed (1C2R), Half-First Cousin Once-Removed (½ 1C1R)
Second Cousin Once-Removed (2C1R), Half-Second Cousin (½ 2C), First Cousin Thrice-Removed (1C3R), Half-First Cousin Twice-Removed (½ 1C2R)
Third Cousin (3C), Second Cousin Twice-Removed (2C2R)
Third Cousin Once-Removed (3C1R), Distant Cousins
Genetic Genealogy Databases and Genetic Privacy
DTC genetic testing companies’ private databases have exploded in size, with AncestryDNA
currently containing nearly 15 million individuals, 23andMe containing nearly 10 million, and
MyHeritage and FamilyTreeDNA (FTDNA) together containing roughly 3.5 million (Regalado,
2019). AncestryDNA and 23andMe each maintain a separate database, neither of which is accessible
to law enforcement, as the only way to submit a sample is via a cheek swab or spit kit.
MyHeritage and FTDNA both allow uploads of data generated from other sources, but law
enforcement usage of either requires written permission from the company, as well as a court
order for MyHeritage or “the required legal documentation” for FTDNA.
GEDmatch, on the other hand, is not a DTC company. It was created by Curtis Rogers and
John Olson in 2010 as a public database where individuals from different testing companies
could compare their DNA by downloading their raw data from a DTC company’s site and
uploading it to a common database. After the Golden State Killer suspect was identified through
surreptitious use of GEDmatch, the site’s administrators decided to explicitly allow law
enforcement usage. They posted a notice on the front page of the site (Figure 3) and also
updated their Terms of Service to state that law enforcement can use, and is using, GEDmatch to
identify remains and perpetrators of violent crimes, defined as homicides or sexual assaults
(GEDmatch.com). Both new and existing users were required to view these new Terms and
decide whether to accept them before using the site. Critics of genetic genealogy argue that
many people who joined the site prior to this update may not have considered the possibility that
their desire to locate relatives could lead to the discovery that they are related to someone
whose DNA is associated with a crime and to the apprehension of that relative. Indeed, it is
possible some of them still may be unaware of the new warning, and individuals who had their
data uploaded by another individual or have been inactive on the site may not have reviewed
the new Terms to decide whether to consent. However, even prior to implementing these new
Terms, GEDmatch’s Terms clearly stated that any data set to “public” would be searchable by
anyone. The law has generally allowed information made available to the public to be used in
criminal investigations. Users can easily have their data set to “private,” hiding it from all search
queries, or removed entirely. Thus, the DNA data files in a public database like GEDmatch
come from individuals who have proactively downloaded their data from a private DNA testing
company’s website, uploaded the information to a public website, reviewed the Terms of
Service that permit law enforcement usage, and opted in to public comparisons against their
data.
Figure 3: Notice posted on GEDmatch’s homepage after the site’s use in the Golden State Killer
investigation was made public.
Additionally, no sensitive genetic information is disclosed to law enforcement during a genetic
genealogy search, as the raw genetic data from GEDmatch users is not accessible. Raw
genetic data can contain sensitive health-related information, and this type of private genetic
information should be protected. In keeping with this precept, no raw genotypes are displayed
or made available for download by GEDmatch. GEDmatch simply performs comparisons
among samples, returning the lengths and chromosomal locations of shared DNA segments,
which are used to determine the approximate relationship between individuals. Similarly, data
obtained from abandoned DNA at a crime scene and used for genetic genealogy are not
exposed to other users and can be prevented from appearing in search results (an option
available to all users). At Parabon, genetic data is kept on an encrypted server only accessible
to authorized employees, and the company’s GEDmatch accounts can only be accessed by the
bioinformatics team and the lead genetic genealogist, CeCe Moore. These facts mitigate many
of the privacy concerns surrounding genetic genealogy, as individuals have control over
whether their data is used as part of law enforcement investigations, and sensitive raw data is
not accessed (Greytak et al., 2018).
Unlike with familial searching of law enforcement databases, no one is legally required to
contribute to a genetic genealogy database, and the samples are not in the possession of
government agencies. The persons contributing to GEDmatch are warned explicitly that
criminal investigators as well as fellow genealogy enthusiasts are able to perform comparisons
against their data. If they choose to participate anyway, there is no reason why law
enforcement should not be able to use this information. These significant differences from
familial searching argue against automatically applying familial search policies, such as
restricting analysis to the end of an investigation, to genetic genealogy. The two techniques are
entirely independent; familial searching has previously been used in some genetic genealogy
cases and not in others. The public is strongly in favor of the use of genetic genealogy to
investigate violent crimes: GEDmatch saw a significant increase in the number of participants
after the Golden State Killer arrest (Milian, 2018), and a recent survey showed overwhelming
public support (Guerrini et al., 2018).
A GEDmatch one-to-many query compares the DNA of interest to all public data in the
database, returning a list of individuals who share the most autosomal DNA. Each “match”
includes the individual’s name or alias, the email address associated with their GEDmatch
account, and any haplogroup or family tree information they have chosen to share (Figure 4).
Figure 4: Top five results from a GEDmatch one-to-many comparison, with potentially identifying
information (kit numbers, names, and email addresses) removed.
A one-to-one comparison can then be run on each match using a more precise algorithm to see
the lengths and chromosomal locations of the shared segments. Comparing the amount of
shared DNA to reference data (e.g., Bettinger & Perl, 2018) gives the probability that the
relationship between the unknown individual and the match falls into each degree of
relatedness. For example, a match sharing 100 cM could be anywhere from 5th degree to >8th
degree, with 6th degree being most likely.
However, there are additional complications. First, in addition to multiple possible degrees of
relatedness, each degree contains many relationship types that must be considered (e.g., 5th
degree relatives around the same age could be second cousins, first cousins twice-removed, or
half-first cousins once-removed). Second, the amount of DNA shared by each relationship
varies among populations. Populations founded by a small number of individuals can have low
genetic diversity and high background relatedness, or endogamy. In such populations,
individuals with a given relationship will share significantly more DNA than in other populations,
such that even very distant cousins can share significant amounts of DNA. Endogamy
manifests as a large number of matches, each sharing many small segments, indicating that the
segments were actually inherited from distant ancestors (ISOGG, 2019). Another challenge is
pedigree collapse, in which the same families intermarry multiple times throughout history,
which can inflate the amount of shared DNA between their descendants.
Casework Match Results
More than 80% of samples from Parabon’s >250 law enforcement cases have resulted in a
match at the third cousin level or closer (>60 cM), with subjects of European descent having a
higher probability of success due to their overrepresentation in genetic genealogy databases
(Greytak & Moore, 2018) (Figure 5A). European descent was assessed by Snapshot DNA
Phenotyping, which infers an individual’s genetic admixture from seven continental populations
(African, Middle Eastern, European, Central/South Asian, East Asian, Oceanian, and Native
American). In this analysis, samples were considered “European” if they had at least 80%
European ancestry. Note that the law enforcement cases submitted to Parabon are primarily
from North American agencies, and samples from other regions will likely have lower match
probabilities due to lower participation in DTC genetic testing and use of GEDmatch.
The closeness of the top match is not the sole variable in determining viability for genetic
genealogy. A comprehensive assessment must include consideration not only of the closest
match, but of the quality of the supporting matches and the amount of information available
about each match. For example, progress may be difficult if the top match has unknown
parentage and/or is from a country where records are not available. Parabon assesses each
sample on a subjective scale: 1) very high probability of identification (e.g., parent-child match),
2) high probability of identification, 3) medium probability of identification, 4) low probability of
identification but likely to generate actionable information, and 5) unlikely to generate actionable
information. An assessment does not guarantee a particular outcome but is intended to help
agencies to decide how to proceed. Thus far, ~80% of European samples and ~60% of non-
European samples have been assessed as workable (assessments 1-4) (Figure 5B).
Figure 5: For Parabon’s >250 law enforcement samples, the frequency of A) the top GEDmatch one-to-
many match being in each degree of relatedness and B) samples receiving each assessment level.
Results are reported for European, non-European, and all samples, as well as for those cases that have
been solved (i.e., resulted in an identification) thus far. Degree of relatedness is based solely on the
amount of shared DNA, not the true relationship determined through genealogy: Parent-Child (>3300 cM),
Full Siblings (2200-3300), 2nd Degree (1300-2200), 3rd Degree (650-1300), 4th Degree (340-650), 5th
Degree (200-340), 6th Degree (90-200), 7th Degree (60-90), 8th Degree (30-60), >8th Degree (<30).
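The binning used in Figure 5 can be expressed as a simple threshold lookup. The sketch below uses only the cM ranges quoted in the caption; the function name is hypothetical, and the treatment of values that fall exactly on a boundary (assigned here to the closer degree) is an assumption, since the caption does not specify it.

```python
# Lower bounds in cM, taken from the Figure 5 caption, highest first.
DEGREE_THRESHOLDS = [
    (3300, "Parent-Child"),
    (2200, "Full Siblings"),
    (1300, "2nd Degree"),
    (650, "3rd Degree"),
    (340, "4th Degree"),
    (200, "5th Degree"),
    (90, "6th Degree"),
    (60, "7th Degree"),
    (30, "8th Degree"),
]

def degree_bin(shared_cm: float) -> str:
    """Map a total amount of shared DNA (cM) to its Figure 5 bin.

    Assumption: a value exactly on a boundary goes to the closer degree.
    """
    for lower_bound, label in DEGREE_THRESHOLDS:
        if shared_cm >= lower_bound:
            return label
    return ">8th Degree"

for cm in (3500, 870, 100, 61, 12):
    print(cm, "cM ->", degree_bin(cm))
```

As the caption notes, such a bin reflects only the amount of shared DNA; the true relationship still has to be established through genealogy.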
Importantly, just because a sample does not have sufficient promising match data today does
not mean it never will. Hundreds of new individuals upload their data to GEDmatch every day
(Milian, 2018), and as the database grows, the proportion of samples with close matches will
increase. Thus, Parabon monitors all unsolved cases for new matches on a weekly basis.
While most of the discussion surrounding genetic genealogy focuses on the database matches,
the vast majority of genetic genealogy work happens after the match list is generated. Many US
records are available to the public and have been compiled into searchable databases
accessible via subscription. For example, Ancestry.com provides a mechanism for accessing a
large collection of records, such as the census through 1940, vital records (birth, marriage,
death) from many states, the Social Security Death Index, and Newspapers.com. Some
Ancestry.com users also create and share public family trees, although these can contain
errors, so they must be examined critically. People search databases and public social media
can also be used to help determine family structures. In some cases, law enforcement may be
asked to assist with this research using their greater access to records.
A previous analysis of the MyHeritage DTC database showed that ~60% of individuals of
Northern European descent will have a match at 100 cM or closer (Erlich, Shor, Pe’er, & Carmi,
2018). Using simulation, the authors showed that it is often possible to identify an unknown
individual from a single third cousin level match given knowledge of his or her sex, location
within 100 miles, and age within 5 years. However, in addition to the fact that such detailed
demographic information is often not available in law enforcement cases, this assumes that,
given a third cousin match, it is straightforward to obtain a complete list of the match’s relatives
at that distance (the authors determined this number to be ~850, not including half relatives). In
reality, a massive amount of work is required to expand a match into a list of relatives (Greytak
et al., 2018).
The first task is to definitively identify each match, which itself can be quite difficult. Although
GEDmatch displays the name and email address associated with each matching kit, users can
choose to use an alias or an anonymous email address, and kits are sometimes managed by
someone other than the match themselves. Moreover, even if a user associates their actual
name, it may be common (e.g., John Smith), which can complicate identification. Consequently,
the initial identification of matches is both critical and challenging, and often requires
considerable genetic genealogical skill and creative problem solving, e.g., deciphering initials,
inferring identities from other identifiable matches, and figuring out who DNA is from when the
kit is managed by someone else. Even though contacting matches via the given email address
might enable identification and even produce family tree information, Parabon seldom contacts
matches directly so as to minimize the number of people involved in an investigation and reduce
the risk of tipping off a suspect. Matches closer than third cousins are only contacted with the
permission of the investigating agency, and the agency can choose to make the contact instead.
Any contact includes the fact that the questions are in regard to a law enforcement investigation
(no specifics of the case are given), and the individual is informed they are free to participate or
not. If the individual asks not to be involved, they are not contacted again.
Once the matches are identified, their family trees must be constructed back to the set of
possible common ancestors with the unknown individual. The number of generations back in
time to the common ancestors of interest is determined by the distance of the matches’
relationships, although since the estimates are not usually specific to a single relationship, often
the family trees must be built even further back than these levels would imply. Building family
trees back in time requires traditional genealogy research: combing through public records to
determine the identities of each generation’s parents.
However, records are not always available: not all US states maintain an accurate and public
birth index, many families trace back to immigrants from other countries where records are not
readily available, etc. In addition, biological family trees often do not match documented family
trees due to misattributed paternity, unrecorded adoption, unknown parentage, etc., and
individuals in these situations are overrepresented in genetic genealogy databases. Surnames
and spellings also often change through the generations, further complicating the analysis.
Once possible common ancestors have been identified, the family trees must then be built
forward in time (“descendancy research” or “reverse genealogy”) to elucidate the possible
identities of the unknown individual (Figure 6).
Figure 6: A hypothetical family tree resulting from genetic genealogy research. Given a match in
GEDmatch (orange star), the family tree is built backward in time to the possible common ancestors
(orange) and then forward in time (blue) to determine the possible identities of the unknown individual (in
this case, from among the “second cousins”).
The possible ancestors from which the unknown individual descends can sometimes be
narrowed using genomic ancestry (e.g., if the family tree is Northern European, but the unknown
individual has 25% ancestry from another population, the genetic genealogist can search
among the possible grandparents for one who married someone from that ancestral group).
Shared DNA on the X-chromosome can also narrow down the possible paths between matches,
as males only inherit X-DNA from their mothers. Thus, if an unknown male shares X-DNA with
a match, they must be related through his mother, and the path between them cannot pass
through two males in a row. When available, Y-chromosome and mitochondrial (mtDNA)
haplogroups can also narrow down the possibilities, as these are passed directly from father to
son and from mother to child, respectively. Thus, individuals share a mtDNA haplogroup with
their maternal lineage, and males share a Y haplogroup with their paternal lineage.
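The X-chromosome rule above lends itself to a simple check. In the sketch below, a lineage is listed from an individual upward through successive ancestors to the common ancestor, and X-DNA can only have travelled along it if no father-to-son step occurs. The function names and the list representation are hypothetical choices made for illustration.

```python
from typing import List

def can_carry_x(lineage_sexes: List[str]) -> bool:
    """Check whether X-DNA could be inherited along one ancestral line.

    `lineage_sexes` lists the sex ("M" or "F") of an individual followed by
    each successive ancestor up to the common ancestor. A male receives his
    X chromosome only from his mother, so a male followed by a male parent
    breaks the line.
    """
    return all(
        not (child == "M" and parent == "M")
        for child, parent in zip(lineage_sexes, lineage_sexes[1:])
    )

def path_allows_x_match(lineage_a: List[str], lineage_b: List[str]) -> bool:
    """Both individuals must have a valid X line up to the common ancestor."""
    return can_carry_x(lineage_a) and can_carry_x(lineage_b)

# Unknown male -> mother -> her father -> his mother (common ancestor)
print(can_carry_x(["M", "F", "M", "F"]))
# Unknown male -> father: X-DNA cannot follow this path
print(can_carry_x(["M", "M", "F"]))
```

A candidate path that fails this check can be set aside when the unknown individual and the match share X-DNA.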
DNA sharing among matches can also be used to narrow down where the unknown individual
falls in the tree. If matches do not share any DNA with one another, they are likely related to the
individual on different branches of his or her family tree, and the genetic genealogist can then
search for an intersection (“triangulation”) between the two matches’ families in the form of a
marriage that produced children or an out-of-wedlock birth (Figure 7). While there could be
hundreds or thousands of individuals who are second or third cousins to a single match, there
are typically only a few individuals who are cousins at the right distance to multiple matches.
Figure 7: Triangulation between two hypothetical family trees. Given two matches in GEDmatch who are
unrelated to one another (orange stars), family trees are built for each and then searched for an
intersection (green) in the form of a marriage or out-of-wedlock birth. Children of this intersection are
related to both matches, while all other individuals in the tree are only related to one match.
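The search for an intersection can be sketched as follows, assuming the two family trees have already been built and stored as a mapping from each person to his or her two parents. All names, the data layout, and the function names are hypothetical. The sketch looks for people with one parent in each match's line; for brevity it reports only the children of the intersecting couple, although their descendants would also be related to both matches.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set, Tuple

Parents = Dict[str, Tuple[Optional[str], Optional[str]]]

def build_children_index(parents_of: Parents) -> Dict[str, list]:
    """Invert the person -> parents mapping into parent -> children."""
    children_of: Dict[str, list] = {}
    for child, parents in parents_of.items():
        for parent in parents:
            if parent is not None:
                children_of.setdefault(parent, []).append(child)
    return children_of

def line_of_descent(children_of: Dict[str, list],
                    ancestors: Iterable[str]) -> Set[str]:
    """The ancestors plus all of their descendants (breadth-first search)."""
    seen = set(ancestors)
    queue = deque(seen)
    while queue:
        person = queue.popleft()
        for child in children_of.get(person, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def triangulate(parents_of: Parents,
                ancestors_1: Iterable[str],
                ancestors_2: Iterable[str]) -> Set[str]:
    """People with one parent in each match's line of descent."""
    children_of = build_children_index(parents_of)
    line_1 = line_of_descent(children_of, ancestors_1)
    line_2 = line_of_descent(children_of, ancestors_2)
    return {
        person
        for person, (p1, p2) in parents_of.items()
        if (p1 in line_1 and p2 in line_2) or (p1 in line_2 and p2 in line_1)
    }

# Hypothetical trees: GGP1 and GGP2 are the two matches' common ancestors.
toy_tree: Parents = {
    "A1": ("GGP1", None),
    "B1": ("A1", None),
    "A2": ("GGP2", None),
    "B2": ("A2", None),
    "Child": ("B1", "B2"),       # product of the intersecting union
    "Other": ("B1", "Spouse"),   # related to the first match only
}
print(triangulate(toy_tree, ["GGP1"], ["GGP2"]))
```

In the toy data, only the child of the union between the two lines is returned; the half-sibling from the other union is related to just one match and is excluded.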
Narrowing Down the Possible Identities
Once candidate individuals have been identified, the genetic genealogist can use a variety of
factors to include or exclude them, in addition to traditional investigative information, such as a
connection to the crime scene or the victim. Sex is known from the DNA, and some age
information may be available – for unidentified remains, age can be estimated; for perpetrators,
at minimum, they had to be alive and physically capable of committing the crime. The individual
also had to be in a given location at a given time, which may mean he or she lived nearby.
While the GEDmatch matches may be spread across the US or even the world, it is sometimes
possible to focus on a particular branch of the family that moved close to the location of interest.
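These inclusion and exclusion criteria amount to a filter over the candidate list. The sketch below is a minimal illustration with hypothetical names and fields; in particular, the age bounds are placeholder assumptions that would be set case by case, and residence near the location of interest is used only to prioritize candidates, not to exclude them.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    name: str
    sex: str                      # "M" or "F"
    birth_year: int
    death_year: Optional[int] = None
    lived_near_scene: bool = False

def plausible(candidate: Candidate, subject_sex: str, crime_year: int,
              min_age: int = 12, max_age: int = 80) -> bool:
    """Apply the hard exclusions: sex, alive at the time, and plausible age.

    The default age bounds are illustrative assumptions only.
    """
    if candidate.sex != subject_sex:
        return False
    if candidate.death_year is not None and candidate.death_year < crime_year:
        return False
    age = crime_year - candidate.birth_year
    return min_age <= age <= max_age

def shortlist(candidates: List[Candidate], subject_sex: str,
              crime_year: int) -> List[Candidate]:
    """Keep plausible candidates, listing those who lived nearby first."""
    kept = [c for c in candidates if plausible(c, subject_sex, crime_year)]
    return sorted(kept, key=lambda c: not c.lived_near_scene)

# Hypothetical cousins for a male subject and a crime committed in 1987.
cousins = [
    Candidate("Cousin E", "M", 1955),
    Candidate("Cousin A", "M", 1963, lived_near_scene=True),
    Candidate("Cousin B", "F", 1960),
    Candidate("Cousin C", "M", 1980),
    Candidate("Cousin D", "M", 1950, death_year=1985),
]
for person in shortlist(cousins, subject_sex="M", crime_year=1987):
    print(person.name)
```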
Parabon’s genetic genealogists also use Snapshot DNA Phenotyping (Greytak & Armentrout,
2015) to prioritize among individuals and confirm or exclude hypotheses. An individual’s eye
color, hair color, and skin color can often be determined from mugshots, yearbook photos, or
social media and compared to the predictions. Full siblings cannot be distinguished using
genetic genealogy, as they share all the same genealogical relationships with the matches.
However, if they differ in phenotype, this can be used to prioritize among them. Similarly, if
genealogy research leads to an individual whose phenotypes are at odds with the predictions,
this can spur continued research, while a close similarity can help corroborate an identification.
The degree to which the identity of the unknown individual can be narrowed down varies from
case to case. In the best-case scenario, a single individual or a set of siblings can confidently
be identified through matches to multiple branches of their family tree. More often, there are
multiple cousins (descendants of a particular set of common ancestors) who are consistent with
the available information. These leads can then be followed up through additional research,
traditional investigation, and/or targeted kinship testing of family members to more precisely
place the unknown individual in the family tree. Parabon’s Snapshot Kinship Inference tool uses
genome-wide SNP data to predict the precise degree of relatedness between individuals, out to
6th-degree relatives (Greytak et al., 2017). Using a machine learning model built on thousands
of reference subjects with known relationships, Snapshot predicts the probability that a pair
belongs to each degree of relatedness. Confidence is calculated using the probability of the
most likely degree and the precision calculated for that degree in cross-validation.
Law Enforcement Leads
During decades-long cold case investigations, hundreds or thousands of individuals may be
investigated before the perpetrator is found. Genetic genealogy offers an efficient means of
narrowing an investigation, often to only a few individuals. The number of possible relatives
included in a genetic genealogy analysis varies depending on the number and distance of the
matches. Even when the only matches are distant and large family trees must be constructed
because common ancestors are many generations in the past, experienced genetic
genealogists can triangulate among the matches to determine the most promising branches of
the family tree and limit the amount of unnecessary tree building. Given sufficient triangulation
and time, the number of leads can be reduced to the offspring of a single couple.
No matter how confident the identification, however, genetic genealogy alone cannot prove
identity with 100% certainty. There is always a remote possibility that the unknown individual
could have been adopted or abandoned, and his or her existence could be unknown to family
and not revealed through official records. Therefore, genetic genealogy leads must be verified
through a direct DNA comparison between the person-of-interest’s STR profile and that of the
crime scene sample. It is this traditional forensic DNA match that is used for prosecution.
The following case studies demonstrate how genetic genealogy has been used to assist
investigators with identifying a suspect in cold case investigations. Only information approved
for public release by the investigating agencies is included, so some case details (e.g., DNA
sample source, exact GEDmatch match information) have been obfuscated.
Case Study #1: Snohomish County, WA; 31-year-old cold case (double homicide)
This case study demonstrates the ideal genetic genealogy case, where there are close matches
and clear familial connections that point to only a single conclusion. However, even seemingly
straightforward cases require a large amount of research and the expertise to recognize and
cope with confounding factors such as unknown and misattributed parentage.
The Crime: In 1987, a young Canadian couple, Jay Cook (20) and Tanya Van Cuylenborg (18),
traveled from British Columbia to Washington State in a van. After purchasing a ferry ticket to
Seattle, they were never heard from again. Days later, Tanya’s body was found in a ditch in the
woods, and a few days after that, Jay’s body and the van were found in two separate locations.
DNA evidence was obtained for an unknown suspect (“Subject”).
GEDmatch: There were two matches at approximately the 5th degree relative level, plus
additional more distant matches. The top two matches had no shared DNA between them,
meaning they were most likely related to the Subject on different branches of his family tree.
Family Trees: Family trees were constructed for both key matches back to their great-
grandparents and beyond using census records, vital records, newspaper archives, public
“people search” databases, public social media data, and public family trees. Next,
descendancy research was performed to trace the descendants of each set of ancestors to
determine if an intersection between them could be found.
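The search for an intersection can be pictured as a small graph problem. The Python sketch
below is illustrative only: the people, the CHILDREN and MARRIAGES records, and both function
names are hypothetical and are not part of any tool described in this paper. It lists the
descendants of two ancestors and reports any recorded marriage that joins the two lines.

```python
# Hypothetical toy records; every identifier below is invented for illustration.
CHILDREN = {
    "A1": ["A2", "A3"],   # A1 stands in for an ancestor of Match #1
    "A2": ["A4"],
    "B1": ["B2"],         # B1 stands in for an ancestor of Match #2
    "B2": ["B3", "B4"],
}
# Marriage records stored as unordered pairs of spouses.
MARRIAGES = {frozenset({"A4", "B3"}), frozenset({"A3", "X9"})}


def descendants(root, children):
    """Return every person descended from root (root itself excluded)."""
    found, stack = set(), [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def intersecting_marriages(root_a, root_b, children, marriages):
    """List marriages joining a descendant of root_a to a descendant of root_b."""
    side_a = descendants(root_a, children)
    side_b = descendants(root_b, children)
    hits = []
    for marriage in marriages:
        p, q = sorted(marriage)
        if (p in side_a and q in side_b) or (p in side_b and q in side_a):
            hits.append((p, q))
    return sorted(hits)


result = intersecting_marriages("A1", "B1", CHILDREN, MARRIAGES)
```

Because real records are incomplete, an empty result from such a search would not rule out an
undocumented connection between the families.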
A triangulating marriage was found between a granddaughter of Match #2’s great-grandparents
and a son of Match #1’s great-grandmother. Extensive research revealed that this son had
taken his stepfather’s surname, initially obscuring his true relationship to Match #1. Thus, the
children of this marriage were half first cousins once-removed to Match #1, as well as second
cousins to Match #2. While both of these relationships are 5th degree, it is critical to consider
all possible relationship types, as half relationships are quite common. No other marriages were
found between the descendants of these ancestors. There was only one son from this
marriage, William Earl Talbott II, and he was therefore the only known male who could be
carrying this mix of DNA from both matches’ families (Figure 8).
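The equivalence of these two relationships can be checked with simple arithmetic. The sketch
below assumes the common convention that each common ancestor contributes (1/2)^(a+b) to the
expected shared DNA, where a and b are the generations separating each person from that
ancestor, and that each additional degree halves the expected sharing. The function names are
hypothetical, and the calculation applies only to collateral relatives, not to direct
ancestors and descendants.

```python
def relationship_degree(gen_a, gen_b, full=True):
    """Degree of relationship for two collateral relatives.

    gen_a, gen_b: generations from each person up to the common ancestor(s).
    full: True if both members of an ancestral couple are shared,
          False if only one ancestor is shared (a "half" relationship).
    Sharing a couple doubles the expected shared DNA relative to sharing one
    ancestor, which lowers the degree by one.
    """
    if gen_a < 1 or gen_b < 1:
        raise ValueError("each person must be at least one generation below the ancestor")
    return gen_a + gen_b - (1 if full else 0)


def expected_shared_fraction(degree):
    """Expected fraction of autosomal DNA shared at a given degree (an average only)."""
    return 0.5 ** degree


# Second cousins: great-grandparent couple shared, three generations up on each side.
second_cousins = relationship_degree(3, 3, full=True)
# Half first cousins once-removed: one shared ancestor, two and three generations up.
half_1c1r = relationship_degree(2, 3, full=False)
# First cousins and second cousins once-removed, for comparison.
first_cousins = relationship_degree(2, 2, full=True)
second_cousins_1r = relationship_degree(3, 4, full=True)
```

Under these assumptions both second cousins and half first cousins once-removed come out as
5th degree, which is why shared DNA alone cannot distinguish them.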
Mr. Talbott had never been arrested for a crime that would require submitting DNA to a
database. He had no known connection to the victims and no reason to have been on the
investigators’ radar. His phenotypes matched those predicted by Snapshot, but without other
information to tie him to the crime, this had not been enough to identify him as a suspect.
Figure 8: Anonymized family tree released by the Snohomish County Sheriff’s Department as part of
their announcement of the arrest of William Earl Talbott II. The tree shows the position of Mr. Talbott
(Suspect) and two GEDmatch matches (Cousins) used to determine his identity.
Resolution: Based on the lead provided by genetic genealogy, the detectives were able to
collect DNA from a cup discarded by Mr. Talbott, which, using traditional STR analysis, was
shown to match the DNA from the crime scene. He was arrested and is currently awaiting trial.
Case Study #2: Tacoma, WA; 32-year-old cold case (homicide)
Triangulation between matches using documentary sources is sometimes not possible. This
case study shows that, in addition to tenaciously researching records and meticulously building
family trees, genetic genealogists must be able to think creatively about possible hypotheses to
explain the available data.
The Crime: Twelve-year-old Michella Welch went missing on 26 March 1986. She had taken her
two younger sisters to Puget Park in Tacoma, Washington and then ridden her bicycle home to
make lunch while her sisters played nearby. When the sisters returned to the park, they found a
brown paper bag with their lunches but no Michella. By 3:10 p.m., officers arrived at the park
and started searching for the missing girl. A tracking dog found her body around 11:30 p.m.
She had been beaten and sexually assaulted and died from a cut to the neck.
The DNA: Another young Tacoma girl, Jennifer Bastian, was also killed around the same time,
and investigators had long believed one person committed both crimes. More than 10,000
investigative hours went into the cases in 1986 alone. Recent DNA testing showed that the
crimes were committed by different men, but neither DNA profile resulted in a CODIS match.
Genetic Ancestry: The Subject was predicted to be predominantly Northern European with a
small but notable amount of Northern Native American admixture (~10%).
GEDmatch: The two top matches did not share DNA, suggesting they were most likely related
to the Subject on different branches of his family tree.
Family Trees: Trees were built for the two top matches back to their great-great-grandparents
and beyond, and extensive descendancy research was performed, but no documented
intersection was found between the two families. The analyst identified a pair of brothers who
were cousins of Match #1, lived within a few miles of the crime scene in 1986, and had two
Native American great-great-grandparents on different branches of their family trees, which was
consistent with the predicted ancestry of the Subject. However, the Subject only shared about
half as much DNA with Match #1 as would be expected for a cousin, and there should have
been an intersection between the families that would connect these cousins to both matches.
When families are connected through DNA but do not intersect on paper (e.g., through a
marriage license or a birth certificate), the explanation may be misattributed paternity: two
individuals, one from each family, had a child together, but the true biological father was not
recorded.
Through census record research, it was discovered that relatives of the two matches had lived
in the same small town when one of the cousins’ ancestors was conceived. This was the only
discovered geographical intersection between these families. Based on the amount of shared
DNA, it was postulated that Match #2’s relative was the unrecorded biological father of the
cousins’ ancestor (Figure 9). Under this hypothesis, the cousins would actually be half cousins
to Match #1, which matched the amount of shared DNA. They would also be related to Match
#2 at the appropriate genetic distance.
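The reasoning from "half as much DNA as expected" to a half relationship can be sketched as a
comparison of competing hypotheses. The example below is a simplified illustration, not the
method used in this case: the observed value and the two candidate relationships are
hypothetical, and the comparison uses only average expected sharing, ignoring the wide natural
variation seen between real relatives.

```python
import math


def best_hypothesis(observed_fraction, hypotheses):
    """Return the hypothesis whose expected shared fraction is closest
    to the observed fraction on a log2 scale (one unit = one degree)."""
    return min(
        hypotheses,
        key=lambda name: abs(math.log2(observed_fraction / hypotheses[name])),
    )


# Hypothetical candidates with their average expected shared fractions.
HYPOTHESES = {
    "full first cousin": 1 / 8,
    "half first cousin": 1 / 16,
}

# Hypothetical observation: roughly half of what a full first cousin would share.
observed = 0.06
choice = best_hypothesis(observed, HYPOTHESES)
```

Halving the expected sharing is equivalent to moving one degree further away, so an
observation near half the expected value favors the half relationship in this toy comparison.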
Figure 9: Pedigree for two cousins of Match #1 who were identified as persons-of-interest in the Tacoma
case, showing the apparent misattributed paternity between Match #1’s relative and Match #2’s relative.
Resolution: The genetic genealogy analysis identified a pair of brothers who could be the
Subject, neither of whom had ever been arrested for a crime that would have required
submission of DNA to a database. Officers were eventually able to follow one of the brothers,
Gary Charles Hartman, into a restaurant, where they obtained a napkin he had used and
discarded. Traditional STR analysis showed that the DNA on the napkin matched the DNA
found at the crime scene. More than thirty years after Michella Welch was found murdered in a
Washington park, investigators announced that they had arrested a suspect in her murder.
Hartman is currently awaiting trial.
Case Study #3: Nearly 40-year-old cold case (homicide)
When there are not enough strong matches in GEDmatch to fully narrow down the possible
branches of a large family tree, cases cannot always be resolved efficiently through genetic
genealogy alone. If an intersection between the matches’ families cannot be found, the number
of possible identities for the Subject can be very large. However, as this case study shows, if
family members of the matches are willing to cooperate, targeted kinship testing can quickly
include or exclude various branches of the family tree and thus arrive at a small number of
included individuals. Because close relatives of the suspect were eventually identified in this
investigation, the details of this case are omitted to protect their privacy.
GEDmatch: The Subject’s top two matches were both in the 6th-8th degree relative range and
had no shared DNA between them, meaning they were most likely related to the Subject on
different branches of his family tree. There were also additional, more distant matches.
Family Trees: Trees were built for the two top matches back to their great-great-grandparents,
but no intersection was found between the two families. The Subject was most likely a great-
grandson or great-great-grandson of one of Match #1’s great-great-grandparent couples, but
without triangulation, it was not possible to narrow his identity down further. Parabon
recommended more research to identify branches of the family that might have moved to the
area of the crime, as well as targeted kinship testing of members of the top match’s family.
Kinship Testing: The investigating agency obtained a voluntary buccal swab from a cousin on
Match #1’s paternal side, from which DNA was extracted, genotyped, and compared to the
Subject. Snapshot Kinship Inference predicted this individual was unrelated to the Subject, and
Match #1’s paternal family could therefore likely be excluded (assuming the familial
relationships on paper were correct). The agency then obtained a voluntary buccal swab from a
cousin on Match #1’s maternal side, who was predicted with 94.2% confidence to be a 3rd
degree relative (first cousin or genetic equivalent) to the Subject.
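The include/exclude logic of targeted kinship testing can be sketched as simple bookkeeping
over branches of the family tree. The function and branch labels below are hypothetical, and
the sketch assumes, as noted above, that the familial relationships on paper are correct; it
does not model the confidence attached to each kinship prediction.

```python
def update_branches(branches, results):
    """Return a status for each branch after targeted kinship tests.

    branches: iterable of branch labels.
    results: iterable of (branch, is_related) pairs, one per tested relative.
    A tester predicted unrelated marks the branch "excluded"; a tester
    predicted related marks it "supported". A supported branch is never
    downgraded here; reconciling conflicting results is left to the analyst.
    """
    status = {branch: "untested" for branch in branches}
    for branch, is_related in results:
        if branch not in status:
            raise KeyError(f"unknown branch: {branch}")
        if is_related:
            status[branch] = "supported"
        elif status[branch] != "supported":
            status[branch] = "excluded"
    return status


# Mirrors the sequence described above: the paternal-side tester was predicted
# unrelated to the Subject, the maternal-side tester was predicted related.
status = update_branches(
    ["paternal", "maternal"],
    [("paternal", False), ("maternal", True)],
)
```

Each test result removes or confirms an entire branch, which is why a small number of
voluntary samples can narrow a large tree quickly.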
Targeted Family Trees: The analyst built family trees for the spouses of each of the kinship
tester’s maternal aunts and uncles back to their great-great-great-grandparents. One uncle’s
wife was determined to be a distant cousin to many of the Subject’s more distant matches. This
triangulation meant that one of the male children of this couple was most likely the Subject, as
he would be related to the GEDmatch matches on both sides of his family tree – second cousins
once-removed (6th degree relatives) to Match #1 and distant cousins (ranging from third
cousins once-removed to fifth cousins once-removed) to Distant Matches #1-7 (Figure 10).
Importantly, barring additional independent intersections between these family trees, the
identified Persons of Interest were the only individuals who were related to both of these
families. These children were also the right age at the time of the crime, lived nearby, and all
appeared to have phenotypes consistent with the Snapshot predictions.
Figure 10: Pedigree built for Match #1’s family after the possible branches leading to the Subject were
narrowed down through targeted kinship testing and subsequent triangulation with distant matches.
Resolution: The genetic genealogy analysis identified a set of brothers who could be the
Subject, none of whom had ever been arrested for a crime that would have required submission
of DNA to a database. Officers were eventually able to narrow the investigation down to a
single brother and match his DNA to the crime scene DNA using traditional STR analysis. He
has been arrested and is awaiting trial.
Genetic genealogy has been called “2018’s biggest contribution to crime science” (Augenstein,
2018) and is rapidly changing the face of cold case investigations. Even for perpetrators who
are completely under the radar or long dead, given DNA from a crime scene, it may be possible
to identify them with genetic genealogy. Importantly, genetic genealogy has just as much power
to generate leads in active cases as in cold cases. In fact, it was recently used to identify a
perpetrator in a sexual assault case that had occurred only three months earlier (Havens, 2019),
and he has since pled guilty. Rather than wait until years have passed and all other leads have
been exhausted, investigators now have access to innovative forensic DNA technologies that
can generate significant new leads and prevent cases from going cold. Looking to the future,
genetic genealogy has the potential to significantly reduce the number of unsolved cold cases in
North America while also reducing the rate at which cases go cold.
Augenstein, S. (2018). Working Backward From Genealogy: Tracking a Dead Killer’s Trail.
Ball, C., Barber, M., Byrnes, J., Carbonetto, P., Chahine, K., Curtis, R., . . . Willmore, L. (2016).
Ancestry DNA Matching White Paper. Retrieved from
Bettinger, B. T., & Perl, J. (2018). The Shared cM Project 3.0 tool v4. Retrieved from
Erlich, Y., Shor, T., Pe'er, I., & Carmi, S. (2018). Identity inference of genomic data using long-
range familial searches. Science, 362(6415), 690-694. doi:10.1126/science.aau4832
Greytak, E., & Moore, C. (2018). Closing Cases with a Single SNP Array: Integrated Genetic
Genealogy, DNA Phenotyping, and Kinship Analyses. Proceedings of the 29th
International Symposium on Human Identification.
Greytak, E. M., & Armentrout, S. (2015). DNA Phenotyping: Predicting Ancestry and Physical
Appearance from Forensic DNA. Proceedings of the 26th International Symposium on
Greytak, E. M., Gorden, E. M., Marshall, C. K., Sturk-Andreaggi, K., McMahon, T. P., &
Armentrout, S. L. (2017). SNP Recovery from Degraded Samples for Kinship
Greytak, E. M., Kaye, D. H., Budowle, B., Moore, C., & Armentrout, S. L. (2018). Privacy and
genetic genealogy data. Science, 361(6405), 857. doi:10.1126/science.aav0330
Greytak, E. M., Moore, C., & Armentrout, S. L. (2018). RE: Identity inference of genomic data
using long-range familial searches, Erlich et al. Science, 362(6415) (2018), 690-694
Guerrini, C. J., Robinson, J. O., Petersen, D., & McGuire, A. L. (2018). Should police have
access to genetic genealogy databases? Capturing the Golden State Killer and other
criminals using a controversial new forensic technique. PLOS Biology, 16(10),
Havens, E. (2019). Elderly woman in home invasion rape case: I forgive my attacker. St.
George Spectrum & Daily News. Retrieved from
Henn, B. M., Hon, L., Macpherson, J. M., Eriksson, N., Saxonov, S., Pe'er, I., & Mountain, J. L.
(2012). Cryptic distant relatives are common in both isolated and cosmopolitan genetic
samples. PLoS One, 7. doi:10.1371/journal.pone.0034267
Huff, C. D., Witherspoon, D. J., Simonson, T. S., Xing, J., Watkins, W. S., Zhang, Y., . . . Jorde,
L. B. (2011). Maximum-likelihood estimation of recent shared ancestry (ERSA). Genome
Research, 21, 768-774. doi:10.1101/gr.115972.110
International Society of Genetic Genealogy (2019). Endogamy. Accessed January 30, 2019.
Retrieved from https://isogg.org/wiki/Endogamy
Keating, B., Bansal, A. T., Walsh, S., Millman, J., Newman, J., Kidd, K., . . . Kayser, M. (2013).
First all-in-one diagnostic tool for DNA intelligence: genome-wide inference of
biogeographic ancestry, appearance, relatedness, and sex with the Identitas v1 Forensic
Chip. International Journal of Legal Medicine, 127, 559-572. doi:10.1007/s00414-012-
Manichaikul, A., Mychaleckyj, J. C., Rich, S. S., Daly, K., Sale, M., & Chen, W.-M. (2010).
Robust relationship inference in genome-wide association studies. Bioinformatics, 26,
Milian, J. (2018). Cold-case murders, rapes cracked by Lake Worth genealogy website. The
Palm Beach Post. Retrieved from https://www.palmbeachpost.com/news/20181129/cold-
Regalado, A. (2019). More than 26 million people have taken an at-home ancestry test. MIT
Technology Review. Retrieved from https://www.technologyreview.com/s/612880/more-