
Hypothesis Testing of the Critical Underlying Premise of Discernible Uniqueness in Firearms-Toolmarks Forensic Practice

Authors:
  • Forensic Engineering International, LLC

Abstract

The forensic practice of firearms-toolmarks identification rests on a currently indefensible conceptual foundation of discernible uniqueness of firearms and tools. Similar to the now-defunct forensic practice of comparative bullet-lead analysis that had been admitted for almost four decades in judicial proceedings, there has never been any scientifically acceptable hypothesis testing of the underlying premises required for practice validation. Existing studies in the domain literature typically presented in court testimonies as support for specific source attributions are pervasively and fatally flawed such that they have no external validity for extrapolation to universal assumption. They are, thus, of no value for validation of the critical premise of discernible uniqueness in real-world forensic scenarios and are largely irrelevant to any particular criminal judicial proceeding. A practical solution is offered that would allow a scientifically defensible opinion to be proffered to courts until comprehensive and meaningful hypothesis testing can be conducted by the mainstream scientific community.
WINTER 2013 121
HYPOTHESIS TESTING OF
THE CRITICAL UNDERLYING PREMISE OF DISCERNIBLE
UNIQUENESS IN FIREARMS-TOOLMARKS FORENSIC
PRACTICE
William A. Tobin* and Peter J. Blau**
ABSTRACT: The forensic practice of firearms-toolmarks (F/TM) identification
rests on a currently indefensible conceptual foundation of discernible uniqueness.
Similar to the now-defunct forensic practice of comparative bullet-lead analysis
that had been admitted for almost four decades in judicial proceedings, there has
never been any scientifically acceptable hypothesis testing of the underlying
premises required for practice validation. Existing studies in the domain literature
typically presented in court testimonies as support for specific source attributions
are pervasively and fatally flawed such that they have no external validity for
extrapolation to universal assumption. They are, thus, of no value for validation of
the critical premise of discernible uniqueness in real-world forensic scenarios and
are largely irrelevant to any particular criminal judicial proceeding. A practical
solution is offered that would allow a scientifically defensible opinion to be
proffered to courts until comprehensive and meaningful hypothesis testing can be
conducted by the mainstream scientific community.
CITATION: William A. Tobin and Peter J. Blau, Hypothesis Testing of the
Critical Underlying Premise of Discernible Uniqueness in Firearms-Toolmarks
Forensic Practice, 53 Jurimetrics J. 121–142 (2013).
*Corresponding author: 2708 Little Gunstock Rd, Bumpass, VA, (804) 448-3955;
wtobin@feintl.com.
**Materials Science & Technology Division, Oak Ridge National Laboratory, Oak Ridge,
Tennessee.
ARTICLES
Tobin & Blau
122 53 JURIMETRICS
The greatest dangers to liberty lurk in insidious encroachment by men
of zeal, well-meaning but without understanding.
Justice Louis D. Brandeis
It ain’t what you don’t know that gets you into trouble. It’s what you
know for sure that just ain’t so.
attributed to Mark Twain and others
It has repeatedly been observed in the scholarly literature that forensic
identification, of which firearms-toolmarks (F/TM) identification is a subset, rests
on an indefensible conceptual foundation of discernible uniqueness (in the present
discussion, of individual firearms and tools).
1
Remarkably, and despite admonitions
from respected voices in the scientific and legal communities that forensic individualization
2
is a fallacy and that there is no scientific foundation for such claims in
the forensic field of F/TM (also known as “ballistics”),
3
scientifically untenable
source attributions continue to be proffered to, and accepted by, criminal courts.
Reaction by the very insular community of F/TM examiners to critical recent
National Research Council of the National Academies (NAS) reports that confirm
the lack of scientific foundation has been to circle the wagons and reject opinions
from anyone other than trained F/TM examiners. This response in essence excludes
the scholarly opinions of scientists, engineers, academicians, and legal scholars,
regardless of their credentials. The cited scholarly essays
4
suggest that forensic
individualization based on a claim of uniqueness has a scientifically indefensible
conceptual foundation and is a fallacy promulgated by the forensic community. The
1
. See, e.g., Simon A. Cole, Forensics Without Uniqueness, Conclusions Without
Individualization: The New Epistemology of Forensic Identification, 8 LAW, PROBABILITY & RISK
233, 234 (2009). Forensic “identification” is synonymous with individualization, or specific source
attribution, whereby a forensic examiner opines that particular evidence is associated with, or
originated from, a specific source. Id.
2
. For this essay, the terms identification and individualization will be used interchangeably as
synonyms for specific source attribution. In the first author’s experience, that is invariably the use
in judicial proceedings.
3
. See, e.g., Michael J. Saks & Jonathan J. Koehler, The Individualization Fallacy in Forensic
Science Evidence, 61 VAND. L. REV. 199, 212–14 (2008) [hereinafter Saks & Koehler,
Individualization Fallacy]. See also Jonathan J. Koehler & Michael J. Saks, Individualization
Claims in Forensic Science: Still Unwarranted, 75 BROOK. L. REV. 1187, 1187–88 (2010)
[hereinafter Koehler & Saks, Individualization Claims].
4
. See Cole, supra note 1; Koehler & Saks, Individualization Claims, supra note 3; Saks &
Koehler, Individualization Fallacy, supra note 3.
Hypothesis Testing in Firearms-Toolmarks Forensic Practice
WINTER 2013 123
authors, and relevant mainstream scientists and colleagues with specialized forensic
expertise with whom the authors have collaborated, agree.
Forensic firearms individualization is presented as ostensibly “scientific” by
innumerable purported “validation studies” and practitioner claims based on
“training and experience,” and sold to courts without true scientific validation or
scrutiny. The perception of “uniqueness” has proliferated and become so embedded
in the public consciousness by CSI media that it has become “pathological
science,”
5
accepted as fact and entrenched in legal precedent by the U.S. judiciary,
rendering it virtually irrevocable by any but the most courageous and scientifically
savvy judges.
Two assumptions are invoked as essential and implied when bullets (or cartridge
cases) are claimed to be associated with specific firearms, a process known as
specific source attribution or individualization: uniqueness and repeatability-
reproducibility.
6
Neither premise has ever been scientifically established. There has
been no comprehensive or meaningful testing of the validity of either assumption
(hypothesis) despite a proliferation of purported validation studies in the domain
literature. Such “studies” are typically presented to courts by advocates as proof of
hypothesis validity and low rates of practice error.
7
Instead, and principally because
5
. Pathological science is a term believed to have been first used by Nobel Laureate Irving
Langmuir in his presentation at a colloquium at The Knolls Research Laboratory, December 18,
1953. It characterizes situations, with no implications of dishonesty, where people are influenced
into mistaken beliefs or false interpretations of results resulting from a lack of understanding about
how humans can deceive themselves and be led astray by subjective influences, wishful thinking,
cognitive biases, or unforeseen interactions between or among input variables known as “threshold
interactions.” Some of the beliefs have attracted a great deal of attention, such as the claims for cold
fusion and polywater, with hundreds of papers published on the topics, and have lasted decades only
to eventually fade from public memory as the beliefs and interpretations are revealed to be invalid.
6
. Repeatability and reproducibility have specific meanings in the scientific method and the
subsequent peer review process, where they denote experimental outcome. Those meanings differ
from their use in forensic firearms-toolmarks (F/TM) practice, where they denote persistence in
transfer of observed characteristics. In firearms identification practice, the terms are used
somewhat interchangeably to
denote the persistence in transfer of characteristics from tool to work piece or other receptor material
rather than the scientific process itself. However, the assumption of discernible uniqueness, not the
assumption of repeatability-reproducibility of tribological interaction characteristics, is the primary
focus of this paper.
7
. There are numerous papers in the Association of Firearms and Toolmarks Examiners (AFTE)
literature frequently presented by prosecutors to courts in arguments as support for the validity of
the forensic practice with “validation study” in the title. The major flaws in such “validation studies”
are exemplified and discussed in Clifford Spiegelman & William A. Tobin, Analysis of Experiments
in Forensic Firearms/Toolmarks Practice Offered as Support for Low Rates of Practice Error and
Claims of Inferential Certainty, LAW, PROBABILITY & RISK (Oct. 1, 2012),
http://lpr.oxfordjournals.org/content/early/2012/10/01/lpr.mgs028.full.pdf+html. Generally, the
practitioners are not scientists, the requisite premises for the practice and ultimate
opinions of individualization are, to this day, intuited.
The purpose of this paper is to critically discuss the bases for key aspects of
firearms and toolmarks identification. These include the concepts of uniqueness
and discernible uniqueness, current practices, underlying assumptions and
presumptions of the principal professional trade organization that has established
the decision-threshold guidelines used by F/TM examiners, repeatability and
reproducibility issues (primarily in the true scientific sense but also briefly in the
usage in firearms identification practice relating to the tribological transfer of
characteristics used for forensic comparison), purported bases for validation, and
the strength of certainty and uniqueness claims in light of the well-established
scientific method.
I. RELEVANCE OF UNIQUENESS PREMISE
Firearms identification, a subset of toolmarks identification, comprises the
significant majority of F/TM examinations. It is essentially a pattern-matching
practice whereby forensic examiners compare surface features (striations or linear
“scratches,” and impressions), directly or indirectly, of bullets with barrels,
cartridge cases with surface features of firing pins, breech faces, extractors, and
ejectors, produced during the cycling or firing of cartridges through firearms. When
firearms examiners observe “sufficient agreement” in the alignment and spatial
relationships of the compared striations, they will declare a “match” and typically
testify that the two items under comparison were cycled or fired through or in a
particular firearm.
Mainstream scientists are likely in agreement that, at some level above the
subatomic, every object in the universe is probably unique. Nevertheless, it is
argued in the scholarly literature, and correctly in the authors’ opinions, that
uniqueness is “largely irrelevant” to the forensic identification practices, which
would include firearms identification.
8
The more seminal issue for forensic
practice, however, is that of discernible uniqueness.
9
Similar to the familiar riddle
of a tree falling in a forest with no one around, if uniqueness does, in fact, exist,
what is its probative value if firearms examiners cannot discern it? But to discern
studies are rife with fallacies of presumption (premises that presume what they purport to prove),
[including suppressed evidence (missing critical factors)] and false dichotomy (also known as
“either-or fallacy,” “false dilemma,” “false choice,” inter alia, where options are presented as
nonjointly exhaustive alternatives when, in fact, alternative options are available and reasonable).
Id. at 4.
8
. Cole, supra note 1, at 241.
9
. See Saks & Koehler, Individualization Fallacy, supra note 3, at 206–07.
uniqueness at some level, it is axiomatic that two conditions must exist: (1) some
criteria, indicia, or parameters of detection for uniqueness, and (2) rules of
application for those indicia to discern “same” from “different.”
10
An exhaustive
review of the domain literature reveals no such criteria. Thus, there is no apparent
official or scientifically acceptable protocol for distinguishing “same” from
“different.” The only existing formal guidance for firearms identification practice
is the AFTE Theory of Identification, a guideline produced by the Association of
Firearms and Toolmarks Examiners (AFTE), the principal trade association for
F/TM examiners. Inasmuch as both members and nonmembers use the AFTE
Theory of Identification as the guideline for their examination process, it is
important to understand its assertions, technical basis, and applicability.
II. AFTE THEORY OF IDENTIFICATION
The AFTE Theory of Identification is not a protocol, standardized procedure, or a
proper scientific theory
11
but, rather, is a subjective guideline promulgated by a
trade organization and not a scientific body.
12
For numerous reasons, “firearms
identification,” as it is called in the trade, is not a true science, despite numerous
claims in the AFTE literature.
13
Regardless of whether a firearms examiner is an
official member of AFTE, the AFTE Theory of Identification is virtually the sole
conceptual basis and guidance for the practice of firearms identification. According
to the U.S. Department of Justice website,
14
In 1985, the Criteria for Identification Committee formalized the
AFTE Theory of Identification as it relates to toolmarks. The theory
10
. Cole, supra note 1, at 244–45.
11
. There are numerous reasons the “Theory of Identification” cannot be considered a proper
scientific theory. For one, all proper scientific theories must be falsifiable (refutable), the quality or
characteristic of being testable by empirical experiment. “All intergalactic aliens are purple” is an
interesting proposition, but it is not falsifiable and, thus, could not be considered proper scientific
theory. The AFTE Theory is also missing scientific indicia of repeatability and reproducibility
because there is no protocol, as discussed in this paragraph.
12
. See Dougherty v. Haag, No. 05CC06993, at 1 (Cal. Super. Ct. Dec. 6, 2006), where AFTE
confirmed its status as a trade association.
13
. See, e.g., Richard Grzybowski et al., Firearm/Toolmark Identification: Passing the Reliability
Test Under Federal and State Evidentiary Standards, 35 AFTE J. 209, 211–13 (2003). In another,
somewhat humorous, fallacy of logic, in claiming that F/TM practice is a science because it
purportedly uses the scientific method, the authors analogize F/TM practice to changing a light bulb,
claiming that both use the scientific method, implying that, therefore, F/TM practice is a science.
Id. at 235–38.
14
. AFTE Theory of Identification, NAT’L INST. OF JUSTICE, http://www.ojp.usdoj.gov/nij/
training/firearms-training/module13/fir_m13_t05_07.htm (last visited Jan. 12, 2013).
articulates three principles that provide the conceptual basis for
comparing toolmarks for the purpose of identifying them as having a
common source.
The three principles of the AFTE Theory of Identification as it
Relates to Toolmarks [sic]:
1. The theory of identification as it pertains to toolmarks
enables opinions of common origin to be made when the
unique surface contours of two toolmarks are in sufficient
agreement.
2. This sufficient agreement is related to the significant
duplication of random toolmarks as evidenced by the
correspondence of a pattern or combination of patterns of
surface contours. Significance is determined by the
comparative examination of two or more sets of surface
contour patterns comprised of individual peaks, ridges,
and furrows. Specifically, the relative height or depth,
width, curvature, and spatial relationship of the individual
peaks. Ridges and furrows within one set of surface
contours are defined and compared to the corresponding
features in the second set of contours. Agreement is
significant when it exceeds the best agreement
demonstrated between two toolmarks known to have been
produced by different tools and is consistent with
agreement demonstrated by toolmarks known to have been
produced by the same tool. The statement that sufficient
agreement exists between two toolmarks means that the
likelihood another tool could have made the mark can be
considered a practical impossibility.
3. The current interpretation of
individualization/identification is subjective in nature,
founded on scientific principles and based on the
examiner’s training and experience.
III. DEFICIENCIES IN BOTH THEORY AND PRACTICE
Parsing the “theory” of identification reveals some glaring deficiencies. It is
inherently vague and tautological, with circular reasoning. An examiner is
permitted to opine an identification when there is sufficient agreement, and
sufficient agreement is defined as enough agreement for an identification. From a
scientific logic perspective, in allowing opinions of individualization, the AFTE
Theory also facilitates classic fallacies of presumption (premises presuming what
they purport to prove or have yet to be established): petitio principii (that is,
begging the question by assuming the initial claim) and suppressed evidence (that
is, presuming no important piece of evidence (input parameter) has been
overlooked by the premise or study). From both legal and scientific perspectives, it
comprises ipse dixit (unproven assertion). In evaluating the AFTE Theory of
Identification that purportedly “enables opinions of common origin,” it should be
borne in mind that there exist no indicia for access to ground truth or other objective
benchmark by which to judge the accuracy of opinions of common origin (claimed
matches). The petitio principii occurs with two obvious questions: (1) how the
examiner knows that the 2-dimensional surface contours are unique,
15
and (2) how
the examiner knows that other toolmarks that the examiner has never examined (or
even ones that he has) are not in even better agreement, a circumstance strikingly
demonstrated in Trotter v. Missouri.
16
Importantly, and contrary to most other forensic practices, the pattern-
matching practice of firearms identification does not use the single dissimilarity
rule whereby a questioned item is immediately eliminated during forensic
comparisons when a single substantive difference in any of the measured or
observed characteristics is detected between the questioned and known items of
evidence. In essence, the task of a firearms examiner is to find a limited number of
surface contours that are in agreement and to rationalize away those that are not. In
the first author’s experience, most purported identifications are based on a
relatively limited subset of the object’s characteristics, with the majority of
characteristics observed not in sufficient agreement.
17
The Theory offers no
guidance about what indicia are to be used in judging which characteristics are
useful for comparisons, nor, as previously noted, does it offer any rules for
15
. Examiners frequently claim that the patterns they observe in the field of view of a comparison
microscope are 3-dimensional, but a microscopic image is realistically 2-dimensional. Newer
microscope technology that enables quantitative 3D imaging of surfaces to be generated by
computer-aided reconstruction makes this claim even more suspect.
16
. Trotter v. Missouri, 736 S.W.2d 536, 537–38 (Mo. App. 1987).
17
. See Jerry Miller & Michael Neel, Criteria for Identification of Toolmarks Part III: Supporting
the Conclusion, 36 AFTE J. 7, 9 (2004), where in this particular study, 52% of matching striations
were observed in known non-match samples (KNM), and a maximum of 86% matching striations
were observed in known match samples (KM). Even higher percentages of agreement have been
encountered by one author in known non-matches. Id.
discerning “same” from “different.” Such vague guidance requires completely
subjective interpretation based on training and experience, as well as seemingly
superhuman recollection of nondescript patterns of spatial relationships, physical
characteristics exhibited (for example, striation width), and quality, of the most
elementary and nondescript geometric form (lines) comprising the examiner’s
“training and experience.” Inevitably, this leads to differences of opinion based on
variability of examiners’ experiences, skills, and even geographic locations, among
other confounding variables. Epistemic variability attributable to lack of an
articulated protocol
18
is so problematic that even the same examiner may not agree
with himself between and among temporally remote examinations for provenance.
For example, one examiner changed his opinion of the very same evidence at some
period of time after he originally opined a specific source attribution (declared “a
match”).
19
Another example clearly shows the effects of having no objective
criteria: even when different examiners correctly conclude paired test samples to
be of common origin in proficiency tests and purported validation studies, there is
limited or no accord among respondents about exactly which characteristics
comprised their “matches” (Type III error).
20
The AFTE Theory of Identification
implicitly requires examiners to accept the asserted but unfounded premise of
18
. Unless otherwise noted, use of the term protocol in this essay refers to a scientifically
acceptable protocol. Virtually all federal, state, and local crime laboratories have documents self-
described as protocols, but none reviewed by the first author constitute scientifically acceptable
protocols, and even those not reviewed by the authors could not constitute scientifically acceptable
protocols inasmuch as they lack requirements of the scientific method.
19
. Trotter, 736 S.W.2d at 538. Initially the police thought that [the police officer victim] was
killed with his own revolver, since [it] was missing and [the same caliber] slug was removed from
his body. This theory later was abandoned when [a suspect] was apprehended after [another
shooting]. Ballistic tests were performed [on his recovered weapon by a firearms examiner who
testified at defendant’s trial] that it was his opinion that the [suspect’s] revolver . . . fired the
[particular] slug removed from [the deceased officer’s] body. . . .
Sometime after the trial . . . the [deceased officer’s] gun was recovered and examined by [the
same forensic firearms examiner who originally claimed a “match” between the recovered slug
from the deceased police officer and the original suspect’s weapon. The examiner subsequently]
testified at [the trial of a second (different) defendant] that he is now of the opinion that it was
[the deceased officer’s] gun, and not the [original suspect’s] gun [which he originally declared
a “match” at the trial of the first defendant], which caused the death of [the deceased officer].
Id. (emphasis added). This is considered an example of blocking a confounding variable because it
“blocks” (as rationalization and argument) differences of opinion between and among examiners
since the very same examiner rendered two mutually exclusive opinions.
20
. Miller & Neel, supra note 17, at 9. Respondents report the same conclusion (identification)
but differ widely about identifying which striations were suitable for “matching” (Type III error).
Id.
discernible uniqueness.
21
Although most AFTE members are not certified by AFTE
inasmuch as certification is a voluntary program completed by a minority of the
membership, acceptance of the Theory or similar guideline with underlying
premises of discernible uniqueness and repeatability is assuredly a sine qua non
condition for crime laboratory certification as a firearm-toolmark examiner.
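The examiner-dependence of the “sufficient agreement” criterion can be made concrete with a short sketch. It assumes, purely for illustration, that agreement can be reduced to a single number (the fraction of striations judged to correspond); the AFTE Theory defines no such score, and the function name and all values below are hypothetical.

```python
def declares_identification(agreement: float, best_known_nonmatch: float) -> bool:
    """Hypothetical reading of principle 2 of the AFTE Theory: agreement is
    'sufficient' when it exceeds the best agreement this particular examiner
    recalls from toolmarks known to come from different tools."""
    return agreement > best_known_nonmatch


# Hypothetical values: fraction of striations judged to agree (0.0 to 1.0).
observed_agreement = 0.60

# Each examiner's threshold is set by personal experience, not by a shared benchmark.
examiner_a_best_nonmatch = 0.52  # assumed: closest known non-match A has seen
examiner_b_best_nonmatch = 0.70  # assumed: B has seen a closer known non-match

call_a = declares_identification(observed_agreement, examiner_a_best_nonmatch)
call_b = declares_identification(observed_agreement, examiner_b_best_nonmatch)
print(call_a, call_b)
```

Under these assumed values, the same pair of toolmarks is an identification for one examiner and not for the other, because the threshold is each examiner's own recollection rather than an objective criterion.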
IV. MISSING CORNERSTONES OF THE SCIENTIFIC METHOD
Because the “theory” is comprised of such vague and subjective terms, with no
underlying protocol, it does not incorporate, or even allow for, two critical
cornerstones of true scientific endeavor: repeatability and reproducibility.
“Repeatability” is the characteristic of experimentation that permits an
experimenter to replicate results of his or her prior experiments; “reproducibility”
is a property that allows an experimenter to replicate the results of another’s
experiment.
For purposes of both repeatability and reproducibility, the outcome of
empirical experiments to validate an inductive hypothesis cannot be critically
dependent upon an experimenter’s training and experience, consistently invoked as
the only basis for F/TM practice and expertise. Although an acceptable basis for
the low threshold of admissibility as an “expert” in a court of law (“knowledge
beyond the ken of the average juror”), training and experience are patently
unacceptable in true scientific endeavor as bases for “proof” for inductive
hypothesis testing and validation. Examiners’ claims of being able to recognize
uniqueness when they see it, implicit in opining “identification” subjectively, are
nonsensical on that basis. The crux of rejection by firearms examiners of criticism
from the mainstream scientific community is that none other than “qualified”
practitioners are trained to recognize uniqueness in firearm “signatures.”
22
Paradoxically, and perhaps without realizing it, practitioners reject opinions of
members of the true scientific community because mainstream scientists are not
trained to recognize the very phenomenon that has never been established to exist,
as confirmed by two separate committees of the NRC, generally considered by the
21
. The subjective guideline known as the “AFTE Theory of Identification” will be discussed in
more detail in upcoming paragraphs. In short, the theory claims to allow an examiner to declare
common origin (individualization or specific source attribution) if the two compared items of
evidence exhibit “sufficient agreement” (for that particular examiner) that “exceeds the best known
non-match” (in that particular examiner’s experience). This creates an assumption as a logical
necessity that all firearms are discernibly unique in observable characteristics.
22
. The purported “signatures” of a firearm are comprised of the striations or impressions
imparted to a cartridge case and bullet by tribological processes during cycling of a cartridge through
a firearm. Tribology is the science and engineering of interacting surfaces in relative motion.
true scientific community, jurists, and legislators, to be the most prestigious voice
of the relevant scientific community!
23
In judicial proceedings, proponents of the status quo in firearms identification
do not present any body of data, proof, or evidence in scientific argument to support
the premise of discernible uniqueness, but rather typically resort to ad hominem
attack of scientific expert witnesses who challenge the foundations of the practice,
including distinguished members of the NRC-National Academy of Sciences
(NAS). Instead of providing substantiating data and studies validating the operating
premise of discernible uniqueness allegedly underlying the practice, proponents in
essence attempt to shoot the messengers of acceptable scientific methodology. The
attacks on scholarly critics, and defensive responses to the NRC findings, include
claims that they were not qualified firearms examiners or that (the scientific
community as represented by the NRC-NAS and other members of the true
scientific community) got it wrong because they did not have a qualified F/TM
examiner on the Committee, respectively. Virtually identical claims were raised in
responses to scientific challenges to the now-defunct practice of comparative bullet
lead analysis (CBLA): the (witness, critic, inter alia) is not a qualified (CBLA-
firearms) examiner. Arguments against admissibility of a metallurgist-materials
scientist as an expert witness in the tribological interactions of bullets and cartridge
cases with firearm components during functioning discount the opinions of those
trained to understand the materials physics that underlies the formation of the very
characteristics used by forensic examiners for their pattern-matching practice.
Regrettably, practitioners routinely attempt to convince judges that metallurgy-
materials science and statistics have nothing to do with firearms identification
practice. That position is irrational in that it ignores the specific scientific discipline
that focuses on the forming, shaping, finishing, surface engineering, functioning,
and tribological interactions of-with the firearm components. Such processes are
directly responsible for the material responses to stresses and strains that eventually
create the striations and impressions on bullets and cartridge cases.
Claims that the field of statistics is irrelevant are also irrational because the
ultimate inferences proffered by expert witnesses, that a particular bullet or
cartridge case(s), or both, was fired from (or in) a specific firearm “to a practical
certainty,” are inherently probabilistic. When confronted with findings by the NRC
23
. COMM. ON IDENTIFYING THE NEEDS OF THE FORENSIC SCI. CMTY., NAT’L RESEARCH COUNCIL
OF THE NAT’L ACADS., STRENGTHENING FORENSIC SCIENCE IN THE UNITED STATES: A PATH
FORWARD 154 (2009) (citing COMM. TO ASSESS THE FEASIBILITY, ACCURACY, & TECHNICAL
CAPABILITY OF A NAT’L BALLISTICS DATABASE, NAT’L RESEARCH COUNCIL OF THE NAT’L ACADS.,
BALLISTIC IMAGING 3 (2008) [hereinafter NRC, BALLISTIC IMAGING REPORT]).
that “[c]onclusions drawn in firearms identification should not be made to imply
the presence of a firm statistical basis when none has been demonstrated,”
24
and in
consequent arguments by practitioners against statistical relevance and, thus,
admissibility of statisticians as expert witnesses, responses have included that, in
their process deriving opinions of specific source attribution, firearms examiners
“don’t use statistical numbers” (notwithstanding that an individualization or
specific source attribution implies a probability of 1).
25
The more general attack,
however, ignores stare decisis, specifically the ruling of U.S. v. Porter, inter alia,
and implies that mainstream scientists are not qualified to recognize either
discernible uniqueness in firearms or scientifically, logically unacceptable practice
and inference.
26
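The point that a categorical source attribution is a probabilistic claim can be illustrated with Bayes' rule. The sketch below is not drawn from the cited sources; the function name, the prior, and the coincidental-match rates are hypothetical values chosen only to show the arithmetic.

```python
def posterior_same_source(prior: float,
                          p_match_if_same: float,
                          p_match_if_different: float) -> float:
    """Bayes' rule: probability that two items share a source, given that
    an examiner has declared a match."""
    numerator = prior * p_match_if_same
    denominator = numerator + (1.0 - prior) * p_match_if_different
    return numerator / denominator


# Assumed inputs: even prior odds, and a same-source pair always "matches".
# Case 1 assumes a different firearm can never produce a declared match.
certain = posterior_same_source(0.5, 1.0, 0.0)

# Case 2 assumes a 1% chance that a different firearm yields a declared match.
hedged = posterior_same_source(0.5, 1.0, 0.01)

print(certain, hedged)
```

Only when the coincidental-match probability is assumed to be exactly zero does the posterior reach 1; that assumption is the untested premise of discernible uniqueness restated in numerical form.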
Apparently as a result of the NRC committee reports, there have been some
refinements to opinion statements in several forensic practices, but only a minuscule
and diaphanous refinement in the domain of firearms identification: the expression
of “absolute certainty” has most recently been replaced by “to a practical certainty”
or “to a reasonable degree of ballistic certainty.” Not surprisingly given the
nonscientific cultural ethos, this attempt to accommodate the NRC position is a red
herring and woefully misses the point that the inherently probabilistic inferences of
individualization are without scientific foundation and essentially intuited. From a
scientific perspective, this response is a cosmetic change that does nothing to
remedy the underlying scientific shortcomings of F/TM practice. In a recent
evidentiary hearing for a criminal trial, a federal judge demonstrated complete
understanding of the probabilistic implications of such opinions when he asked the
prosecutor, with the firearms examiner sitting at her table for consultation, if she
had any problems with his (implied inclination to rule) limiting the F/TM
examiner’s opinion to “consistent with.” After a brief consultation with the firearms
examiner, who demonstrated a lack of understanding that individualizations are
24
. See NRC, BALLISTIC IMAGING REPORT, supra note 23, at 82 (emphasis omitted).
25
. Post Conviction Testimony of Lyndon Ray Watkins, Sr., at 97–98, In re Rattler, No. 83403C
(Md. Cir. Ct. Oct. 22, 1998) (where witness “confirms” prosecutor’s leading question that an
examiner’s opinion of individualization does not “use statistics.”).
26
. United States v. Porter, 618 A.2d 629, 634 (D.C. 1992) (defining the relevant scientific
community as individuals “whose scientific background and training are sufficient to allow them to
comprehend and understand the process and form a judgment about it”). This ruling expressly
expanded the “relevant community” consideration of Frye v. United States, 293 F. 1013 (D.C. Cir.
1923) from the narrow inclusion of only the insular community of F/TM examiners to include most
mainstream scientists and engineers. Id. at 633. Heretofore, F/TM examiners have defended against
challenges by claiming that the opinions of scientists, engineers, inter alia, should be excluded
because they were not F/TM examiners and, thus, did not have the “training and experience” to
know it when they see it. Grzybowski et al., supra note 13, at 214.
inherently probabilistic and that consequent expressions of certainty are without
scientific foundation, the prosecutor proposed to the court that the Judge’s ruling
be amended to allow “consistent with being fired from the same weapon and
inconsistent with being fired from any other weapon.” The Judge immediately
responded that the proposed modification “undercuts the consistency” statement,
circumventing the scientifically acceptable limitation on the expression of
certainty, thus defeating the proposed intention underlying his imminent ruling. The
Judge’s Order subsequently limited the firearms examiner’s opinion at trial to
“consistent with.”
27
State court judges most frequently dismiss the scientific and legal implications
from respected voices of both the scientific and legal communities by either
declining to grant evidentiary admissibility hearings because “the [NRC-]NAS
reports don’t signal enough of a change in acceptance within the scientific
community”
28
or otherwise resorting to laissez-faire judging in cases where
hearings were granted. Some legal advocates have argued, much to the chagrin of
the Committee Chairman (Honorable Harry Edwards) of the ad hoc 2009 NRC
Strengthening Forensic Science Committee, and others, that the NRC reports were
not intended to affect admissibility.
29
However, some forward-thinking courts
27
. Transcript of Proceedings at 281, United States v. Jackson, 1:11-CR-411-WSD (N.D. Ga.
July 25, 2012) (disallowing expression of source attribution “to a practical certainty” and limiting to
“consistent with.”). Judge Duffey is the former U.S. Attorney for the Northern District of Georgia,
and displayed unusual acumen in understanding probabilistic implications of individualizations and
ipse dixit expressions of certainty.
28
. People v. Ambriz, No. BA376570, (L.A. Super. Ct. 2012) (ruling that the firearms examiner’s
opinion would be “limited”: she would not be allowed to call F/TM practice a “science” but must
append the conclusion statement with “in my opinion” such that the final opinion permitted by the
Court would be “in my opinion, the [evidence] was fired in the [specific weapon]”).
29
. See Hon. Harry T. Edwards, The National Academy of Sciences Report on Forensic Sciences:
What it Means for the Bench and Bar, Presentation at Superior Court of the District of Columbia
(May 6, 2010) (emphasis added) (citations omitted) (quoting Government’s Opposition to
Defendant’s Motion to Exclude Expert Testimony Concerning Latent Fingerprints Evidence at 3,
United States v. Faison, No. 2008-CF2-16636 (D.C. Super. Ct. 2010)), available at
http://www.cadc.uscourts.gov/internet/home.nsf/AttachmentsByTitle/NAS+Report+on+Forensic+Science/$FILE/Edwards%2C+The+NAS+Report+on+Forensic+Science.pdf.
I recently had an opportunity to read several briefs filed by various U.S. Attorneys’ offices
in which my name has been invoked in support of the Government’s assertion that the
Committee’s findings should not be taken into account in judicial assessments of the
admissibility of certain forensic evidence. One brief, for example, asserts:
In fact, the Honorable Harry T. Edwards, Co-Chair for the NRC Forensic Science
Report, has stated on the public record that the report is not intended to affect the
admissibility of any forensic evidence.
(almost all Federal) have begun to recognize the absence of indicia of practice
reliability and have excluded or limited opinions.
30
To this day, opinions of scientists, academicians, and scholars continue to be
dismissed by both firearms examiners and many judges because they have
purportedly never conducted F/TM comparisons and have not been trained to
recognize uniqueness. In the true scientific community, training and experience are
unacceptable as proof of the validity of an inductive hypothesis or even that “training
and experience” can allow an observer to discern uniqueness with no objective
parameters of detection, rules of application, body of data, and properly established
probabilistic foundation for expressions of inferential certainty incorporating
significance levels and confidence intervals. From the scientific perspective,
qualifications based on impressive numbers of cases, samples, or years of
experience are largely irrelevant to the tasks at hand: establishing the validity of the
hypothesis that F/TM examiners can accurately opine provenance of evidentiary
samples or that discernible uniqueness exists for firearms identification.
V. INFERENTIAL LOGIC PROCESSES
Logical arguments are typically categorized as either deductive or inductive.
When the universe of samples in a possible sample pool is available for testing, or
when the premise(s) of an argument characterizing the entire population is accepted
or established as fact(s), deductive inference is appropriate. Say, for example, that
it is known that all Chevy Novas ever made by General Motors are blue, and a
Chevy Nova emblem is recovered from the scene of a bombing or other catastrophic
event that destroyed the car. It is logically acceptable to deduce that the vehicle
to which the emblem was affixed was blue. However, deduction is generally not
a plausible process in forensic firearms identification practice because the possible
sample universe is almost never available for examination and comparison,
This is a blatant misstatement of the truth. I have never said that the Committee’s Report
is “not intended to affect the admissibility of forensic evidence,”. . . . To the degree that I
have commented on the effect of the Report on admissibility determinations, I have said
something quite close to the opposite of what these briefs assert. Id. at 6.
30
. See United States v. Jackson, 1:11-CR-411-WSD (N.D. Ga. July 27, 2012); United States v.
St. Gerard, U.S. Army Trial Judiciary, Germany (5th Cir. June 7, 2010) (where Judge excluded
statement of “practical impossibility for the cartridge case to have been fired by any [other]
weapon”); United States v. Alls, No. CR2-08-223(1), (S.D. Ohio Dec. 7, 2009) (examiner may
testify as opinion witness, but may not testify to her conclusion that firearm identification can be
made to exclusion of all other firearms); United States v. Monteiro, 407 F. Supp. 2d 351, 372 (D.
Mass. 2006) (finding that “there is no reliable . . . scientific methodology which will currently permit
the expert to testify that . . . [a casing and a particular firearm are] a ‘match’ to an absolute certainty,
or to an arbitrary degree of statistical certainty.”).
requiring inference by inductive process. But that is a 900-pound gorilla in the
courtroom for inferences of specific source attribution.
The scientific community has long accepted the tenet that it is impossible to
inductively “prove” a hypothesis by accumulating positive instances from a
population with any combination of samples. In short, an experimental hypothesis
for inductive process cannot be proven by simple (or sample) enumeration.
Regardless of sample selection, there always exists the possibility of inductive
hypothesis falsification. (If, however, all sample combinations in the possible
sample pool are tested, the appropriate process of inference then becomes
deductive.)
Anecdotal training and experience has also been invoked by F/TM practitioners
and advocates to argue that so many thousands of firearms have been
examined by a particular examiner or by the community of firearms examiners and
that there has “never been found two that were indistinguishable” (ignoring other
glaring flaws in that assertion and reasoning, including continuity of
communication pathways and subjective nature of verification) and that the
confluence of similar assertions by other F/TM examiners combine to prove
uniqueness as a valid hypothesis-premise based on putative infrequency.
Notwithstanding that training and experience is an unacceptable basis for scientific
proof, it is undoubtedly the most frequently invoked basis for judicial acceptance
of opinions of individualization.
In their paper discussing the flaws in inferring uniqueness from infrequency,
Saks and Koehler provide an enlightening example of the fallacy of such argument
as counterintuitive as the well-known (in the statistical community) birthday
problem: assume a population of 100,000 guns in a small city and that there are 100
pairs of indistinguishable firearms in that population of guns. Assume also that the
police department has 100 F/TM examiners, and that each conducts 10 pairwise
examinations every day for 10 years. After the 3,650,000 comparisons are
conducted over the 10-year period, there remains a 93% chance that none of the
100 indistinguishable firearm pairs were ever compared.
31
This is a striking and
counterintuitive example of the folly of individual, or even collective, training and
experience as basis for validation of the premise of discernible uniqueness, even if
cognitive retention and subsequent recollection of the spatial relationships
(patterns) of the tens of millions of nondescript lines on the many thousands or
millions of specimens over a lengthy period of time was humanly possible. It is
simply not possible to prove an inductive hypothesis by simple (or sample)
31
. Saks & Koehler, Individualization Fallacy, supra note 3, at 213.
enumeration from a population.
32
This fact alone demonstrates the critical need for
a proper statistical approach to characterize uncertainty related to the
inductive testing and, more directly relevant to judicial proceedings, to expressions
of certainty proffered in connection with expert testimony.
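The arithmetic behind the Saks and Koehler example can be checked directly. The following Python sketch is illustrative only and is not drawn from their paper: it assumes, for simplicity, that every comparison selects a pair of firearms uniformly at random from the population and independently of all prior comparisons. The function and variable names are hypothetical.

```python
from math import comb, exp, log1p


def prob_no_matching_pair_compared(population, indistinguishable_pairs, comparisons):
    """Probability that none of the indistinguishable pairs is ever compared.

    Assumption (not from the source): each comparison draws one pair of guns
    uniformly at random from all possible pairs, independently of the others.
    """
    total_pairs = comb(population, 2)            # all possible gun pairs
    p_hit = indistinguishable_pairs / total_pairs  # chance one comparison hits such a pair
    # (1 - p_hit) ** comparisons, evaluated in log space for numerical stability
    return exp(comparisons * log1p(-p_hit))


# 100 examiners x 10 comparisons per day x 365 days x 10 years
comparisons = 100 * 10 * 365 * 10
p = prob_no_matching_pair_compared(100_000, 100, comparisons)
print(f"{comparisons:,} comparisons; P(no indistinguishable pair compared) = {p:.3f}")
```

Under these assumptions the expected number of comparisons that land on an indistinguishable pair is only about 0.07, which is why the probability of never encountering one remains near 93%.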
VI. PROFUSION OF “VALIDATION STUDIES”
Numerous self-proclaimed validation studies comprising sample enumeration,
and near 100-year acceptance and admissibility by U.S. courts of firearms
identification practice, continue to be argued as ample proof of efficacious practice,
of the underlying premise of discernible uniqueness and, thus, of the ability of
forensic F/TM examiners to render opinions of individualization.
33
Similar arguments were invoked during the challenge to the now-defunct
forensic practice of comparative bullet lead analysis (CBLA) up to the time of its
ultimate demise in 2005. The practice of comparing bullet compositions for specific
source attributions had been accepted by courts for almost four decades. The CBLA
literature, similar to that of firearms identification, had also
encompassed an abundance of purported validation studies claiming support for the
validity of specific source attributions and consequent probative value. One
common denominator to CBLA and F/TM is that not a single study tested what are
the most seminal underlying premises for putative probative value: ostensible low
relative frequency and the implication that a perceived low relative frequency of
occurrence implies uniqueness, of either product source, per se, or availability of
product source in local retail markets.
34
Unfortunately, forensic firearms
identification is exceedingly more subjective than was CBLA. Except for the
32
. A similar example of such counterintuitive circumstance relating to perceived infrequency is
the statistically famous birthday problem (which has apparently been the subject of numerous bar
bets): it takes only 23 people at a party for there to exist a 50% chance that at least two of those
present share a birthday; the chance rises to almost 100% with 60 people present.
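A brief Python sketch, offered only as an illustration, separates two questions that are easily conflated in statements of the birthday problem: the chance that any two people present share a birthday, and the chance that someone present shares a birthday with one particular person. It assumes 365 equally likely birthdays and independence among those present; the function names are hypothetical.

```python
def p_any_shared_birthday(n, days=365):
    """Chance that at least two of n people share a birthday (classic problem)."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def p_someone_matches_you(n, days=365):
    """Chance that at least one of the other n - 1 people shares YOUR birthday."""
    return 1.0 - ((days - 1) / days) ** (n - 1)


for n in (23, 60):
    print(n, round(p_any_shared_birthday(n), 4), round(p_someone_matches_you(n), 4))
```

With 23 people the first probability is roughly 51%, while the second is under 6%; with 60 people the first exceeds 99%.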
33
. With the realization that the courtroom is not a laboratory, one scholar discusses the practice
of invoking lengthy admissibility and extensive casework as “proof” of practice validity as “implicit
testing.” Simon A. Cole, “Implicit Testing”: Can Casework Validate Forensic Techniques?, 46
JURIMETRICS J. 117, 126–28 (2006).
34
. Saks & Koehler, Individualization Fallacy, supra note 3, at 213, demonstrate one of the
fallacies of inferring uniqueness from relative frequency in a population by their example of 100
pairs of guns, each pair hypothetically discernibly indistinguishable from each other, in a population
of 100,000 guns. If a police department had 100 firearms examiners, and each examiner examined
10 pairs of guns per day every day for the next 10 years, there remains a 93% chance that none of the 100
pairs were ever examined. Id. This is one reason why such purported “validity testing,” even
comprising an accumulation of positive instances from a population (called sample enumeration) is
inadequate as scientific “proof” of an hypothesis as a universal assumption or law.
ultimate inference of “same source,” CBLA was almost entirely an analytically
sophisticated, quantitative, and objective forensic practice; F/TM is an almost entirely
subjective, qualitative practice.
Because of its almost entirely subjective nature, scholarly critics have cynically
characterized the basis for firearms identification (discernible uniqueness) as “I
know it when I see it.” Despite its fatal flaws, and as previously alluded to, CBLA did
have a single dissimilarity rule where, if an examiner encountered dissimilarity of
characteristics during comparison examinations, the questioned sample was
immediately excluded; not so in F/TM practice. It is the practice of firearms
examiners to find a few similarities on which to base an association and to dismiss,
or rationalize away, the frequent and sometimes overwhelming dissimilarities
present. Among those publications that hint at the nature and scope of the problem,
one found up to 52% matching lines in a known non-match
35
and another only 21–24% (steel-jacketed bullets) and 36–38% (non-jacketed bullets) concordance on
bullets fired from the same gun.
36
It has been observed that there are typically between 2 and
3 times more matching striations in known non-matches (fired in different guns)
than in those fired in the same gun.
37
VII. PROFICIENCY TESTS,
VALIDATION STUDIES, AND RATES OF ERROR
In addition to purported validation studies, proficiency tests are frequently
offered to courts as proof of low rates of misidentifications and practice error. For
decades, examiners typically claimed infallibility in practice: zero rate of error
based on scoring “100% on all proficiency tests.” Eventually realizing the
indefensible nature of such a claim because of challenges from the mainstream
scientific community given that there are finite rates of error in all human
endeavors, most now accede to domain rates of error “between 0 and 1%” or “less
than one percent.” However, to this day, there has been no showing of any
correlation between purported validation studies or proficiency test results and rates
of Type I error (false positives or misattributions) in actual casework.
Proficiency tests as they are structured in the forensic field also suffer from
numerous flaws, the most salient of which is that they neither mimic real-world
scenarios nor do they test, in the words of Justice Blackmun, the ultimate task at
35
. Miller & Neel, supra note 17, at 9.
36
. Alfred A. Biasotti, A Statistical Study of the Individual Characteristics of Fired Bullets, 4 J.
FORENSIC SCI. 34, 37 (1959).
37
. Miller & Neel, supra note 17, at 9.
hand,
38
the inference of individualization, in large part because of the difference
between the logic processes of deduction and empirical induction.
In court testimonies, examiners invariably claim zero failures (100% accuracy)
in proficiency tests. Even ignoring other significant defects in the proficiency
testing process,
39
a proficiency test typically presents the respondent (examiner
tested) with one or several questioned bullets fired from a “known” (control)
firearm, and then a few reference bullets or firearms representing possible sources.
In proficiency testing, a respondent can individualize or infer the source by deduction,
because the entire possible sample pool is almost always provided for comparisons.
In real-world scenarios, the entire pool of possible samples is not reasonably
available for examination and comparison; thus, the examiner must use an inductive
logic process to infer source, which requires proper statistical foundation for
probabilistic characterization and expression of certainty for a given level of
confidence. Without the latter, any expression of certainty as typically proffered in
court testimony, such as “to a reasonable degree of ballistic certainty,” is nothing
more than speculation and is without scientific foundation. As noted by a renowned
professor of physics, “Any measurement that you make without knowledge of its
uncertainty is completely meaningless.”
40
VIII. WHAT DO CURRENT FORENSIC PROFICIENCY TESTS
AND PURPORTED VALIDATION STUDIES ACTUALLY MEASURE?
The myriad flaws and deficiencies in proficiency tests and
purported validation studies in the field of F/TM practice are beyond the scope of
this paper to enumerate. In general, however, such comparative observational
studies are known in the true scientific community to have the least predictive
38
. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 590 (1993). See, e.g., DAVID L. FAIGMAN
ET AL., SCIENCE IN THE LAW: STANDARDS, STATISTICS AND RESEARCH ISSUES 1065 (2002); D.
Michael Risinger, Defining the “Task at Hand”: Non-Science Forensic Science After Kumho Tire
Co. v. Carmichael, 57 WASH. & LEE L. REV. 767, 775–76 (2000).
39
. To include that they are neither double-blind nor blind, that respondents have commented on
public record that they are insultingly and laughably easy, that “inconclusives” are not counted as
incorrect responses, that the rate of “inconclusive” rises dramatically (quadrupled in at least one
study) during proficiency testing as compared to casework, that there is no control over collaborative
efforts, that samples are provided in “pristine condition” (not damaged as in “real-world” scenarios),
that test samples are preexamined to ensure clarity of striations, inter alia.
40
. WALTER LEWIN WITH WARREN GOLDSTEIN, FOR THE LOVE OF PHYSICS: FROM THE END OF
THE RAINBOW TO THE EDGE OF TIME - A JOURNEY THROUGH THE WONDERS OF PHYSICS ix–x
(2011). Walter Lewin has been an internationally acclaimed and extremely popular professor of physics at
the Massachusetts Institute of Technology since 1966. It is claimed that over a million people yearly
watch his lectures via various media. Id. at 301.
power to a population; many have no external validity (also known as
“generalizability”) at all. Quite often, the only predictive power is to the particular
subset from which the observations were made. Such is the case with virtually all
of the current studies in the professional literature of F/TM examiners, rendering
them generally irrelevant to any particular criminal trial.
41
Depending on how an experimenter frames the null and alternative hypotheses,
if at all, validation studies in the domain of F/TM should be designed to address the
“ultimate task at hand” for triers of both law and fact: that is, whether forensic
practitioners can accurately and reliably opine specific source attributions
inductively from a very large possible sample pool of unknown size when the entire
pool cannot be presented to respondents (examiners) contemporaneously. The task
at hand for a particular judge is not whether a firearms examiner can accurately and
reliably deduce source attribution when presented with all samples in an irrelevant
possible sample pool, as typically represented by proficiency tests and current
purported validation studies in the domain literature. In actual casework, forensic
examiners are asked to compare crime scene bullets or cartridge cases, or both, with
a single, or several, weapon(s) recovered from a suspect(s), presenting a strongly
biasing context, in essence implying that “we want you to confirm what we already
know.”
42
Based on currently existing domain epistemology, the ability of firearms
examiners to make correct inferences is likely more correlated to the effectiveness
of the investigating officer-detective than to forensic technique. In other words, it
is quite possible that correct results are more correlated to the fact that investigators
“got it right” than from efficacious forensic practice, a Type III forensic error
(obtaining the right answer but for the wrong reason).
The task at hand for validation studies should be to attempt to falsify two
hypotheses: (1) that uniqueness exists, and (2) that a properly trained human
observer can discern uniqueness. Few observers, if any, would refute the ability of
properly trained forensic examiners to opine possible associations of bullets or
cartridge cases, or both, with the specific weapon from which they were fired when
presented with all possible samples for direct comparison. A colorful example
might be where one octuplet is cloned and a respondent is asked which of the
octuplets was cloned. It is quite likely that there would be near-zero rates of
41
. Several practitioners have recognized this fact but most still do not. See Alfred A. Biasotti &
John Murdock, “Criteria for Identification” or “State of the Art” of Firearm and Toolmark
Identification, 16 AFTE J. 16, 19 (1984), where the authors note that consecutively manufactured
barrel studies are “subjective evaluations” that are “therefore only of value to the examiner who
conducted the study” or other people in the same laboratory.
42
. Lisa J. Steele, All We Want You to Do Is Confirm What We Already Know: A Daubert
Challenge to Firearms Identifications, 38 CRIM. L. BULL. 465, 476–78 (2002).
misidentification if all the octuplets were presented contemporaneously for
comparison with the clone. The rate of error would likely increase if, at
random, octuplet #3 were presented for comparison the first month, then #7
a month or so later, and so on.
As it turns out, careful analysis for both internal and external validity
43
of the
various putative validation studies that currently exist reveals them to be nothing
more than very limited proficiency tests of the participating examiners
(respondents), in large part because of fallacies of suppressed evidence, false
dichotomies, petitio principii (begging the question), and other fallacies of
presumption, in addition to the fact that they do not circumstantially mirror
casework.
44
The studies lack external validity and, accordingly, scientific value for
validating the underlying premise of uniqueness, for characterizing rates
of misattributions in actual casework, or for establishing probative value of claimed
matches.
IX. SCIENTIFICALLY ACCEPTABLE EXPERIMENTAL
METHODOLOGY
It is estimated that about 300 million firearms are owned in the United States.
45
Realistically, not all firearms in the possible sample pool are available to forensic
examiners for direct comparisons. Despite its probabilistic nature and lack of
absolute certainty as a logical necessity (as exists in deductive logic process),
inductive inference, with a properly founded uncertainty statement, is the only
plausible and scientifically defensible approach.
For decades, F/TM examiners, and others including fingerprint examiners,
claimed infallibility in both methodology and practitioner execution. All experiments
have some degree of error, and the two principal categories of error are
43
. Internal validity is the trait of an experiment without which the data would be uninterpretable.
External validity is the property of an experiment that allows extrapolation beyond the sampling
constraints of the experimental conditions.
44
. Fallacies of presumption are those that contain premises that presume what they purport to
prove, and include the fallacies of suppressed evidence (omitting critical evidence) and false
dichotomies (explicitly or implicitly presenting an argument with only two alternatives).
45
. Estimates from sources best positioned for such estimates: the National Rifle Association
(NRA), Alcohol Tobacco & Firearms (ATF), National Academy of Sciences (NAS), and Federal
Bureau of Investigation (FBI). 2013 NRA-ILA Firearm Fact Card, NAT’L RIFLE ASS’N OF AM., INST.
LEGISLATIVE ACTION (Jan. 8, 2013), http://www.nraila.org/news-issues/fact-sheets/2013/2013
firearmsfactcard.aspx. FBI firearms examiner John B. H. Webb made a similar estimate in recent
testimony in United States v. Jackson, 1:11-cr-411-WSD (N.D. Ga. July 25, 2012). See also COMM.
TO IMPROVE RES. & DATA ON FIREARMS, NAT’L RESEARCH COUNCIL OF THE NAT’L ACADS.,
FIREARMS AND VIOLENCE: A CRITICAL REVIEW 56–57 (Charles F. Wellford et al. eds., 2004).
generally considered to be process (systemic) and measurement. Variability in
virtually all processes and measurement is a fact of life, and one of the most critical
tasks in true scientific empirical activity is taming uncertainty. Statistical methods
are designed to address variability, and they allow us to understand implications of
experimental activity in the presence of variability. Such methods should be
incorporated in all experimental inductive inference derived from observational
studies performed by forensic examiners intended for judicial application.
Unfortunately, no such methods currently exist in F/TM practice. Opinions are merely intuited and
expressed “to a reasonable degree of practical certainty,” “to a reasonable degree
of ballistic certainty,” or other (subjective) expressions of certainty with no
established basis in reality.
A classic example of the need to establish proper statistical methods to sample
an unknown population for inductive inference is typically presented to
metallurgists, materials scientists and statisticians early in their professional
education when confronted with quality control testing for single-use items such as
camera flash bulbs. For example, Eastman Kodak could not subject entire
production lots (sample pool) to destructive testing if they expected to have any
marketable flash bulbs. Sampling regimens based on proper design of experiments
were required to produce scientifically acceptable uncertainty characterizations for
inferences relating to measured variables such as rate of defects, rate of failures,
inter alia, to be associated with a population that could not be examined or tested
in its entirety. This is a well-established, scientifically defensible, and indeed a
standard, approach to characterizing uncertainty in inductive inference relating to
experimental outcomes in circumstances where the entire sample pool cannot be
tested.
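The logic of inference from a sample to an untested population can be illustrated with the simplest case, in which no defects are observed in the sample. The Python sketch below is an illustration under stated assumptions and not a description of any actual Kodak protocol: it treats the sample as a simple random draw from a very large lot (a binomial model) and computes the exact one-sided upper confidence bound on the defect rate, 1 - alpha^(1/n), when zero defects are found in n items. The function names, the 95% confidence level, and the 1% target rate are hypothetical choices.

```python
from math import ceil, log


def upper_bound_zero_defects(n, confidence=0.95):
    """Exact one-sided upper confidence bound on the defect rate when
    0 defects are seen in n randomly sampled items (binomial model assumed)."""
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


def sample_size_for_bound(max_rate, confidence=0.95):
    """Smallest defect-free sample size whose upper bound falls below max_rate."""
    alpha = 1.0 - confidence
    return ceil(log(alpha) / log(1.0 - max_rate))


print(f"0 defects in 100 items: rate < {upper_bound_zero_defects(100):.4f} at 95% confidence")
print(f"items needed to support 'rate < 1%': {sample_size_for_bound(0.01)}")
```

Under these assumptions, even a flawless result on 100 sampled items supports only the statement that the defect rate is below about 3% at 95% confidence, and roughly 299 defect-free items would be needed to support a claim of "less than 1%." An observed rate of zero in a sample is therefore not the same as a demonstrated rate of zero in the population.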
So how should F/TM examiners begin to frame experiments for scientific
acceptability? Currently, the most formidable barrier is that, because discernible
uniqueness has not been scientifically or forensically established, there exists no
articulated protocol providing the foundation for assessing repeatability and
reproducibility of experimental results in either purported validation studies or in
actual casework. There exist neither parameters of detection nor any rules of
application to discern “same” from “different.” Without access to ground truth
(knowledge of whether a particular firearm is, in fact, the firing platform for
particular bullets or cartridge cases), disagreements are nothing more than
differences of opinion. However, if an articulated protocol someday becomes
reality, certain basic questions should be incorporated into the design of any
proposed firearms identification experiments. Judges should consider the same
questions when evaluating the validation studies offered as support for courtroom
opinions. These are:
(1) What is the hypothesis to be tested in each study?
(2) What is the population of interest (target population)?
(3) What is the sample population?
(4) Does the sample population coincide with the target population?
(5) Are the study inferences within acceptable limits of the experimental
(sample) frame?
(6) Are the opinions offered in judicial proceedings supported by the claimed
“validation studies”? In other words, does the purpose for which the
“validation studies” are offered to the court, express or implied, exceed the
limitations of the study(ies)?
First, experimenters should define and state their objective(s) with properly
articulated null and alternative, or minimally working, hypotheses. Many, if not
most, of the F/TM studies do not present an articulated hypothesis or purpose,
resulting in a vague framework in which a judge might attempt to assess exactly
what was measured by the experiment, whether the experiment(s) is “substantially
similar” in all relevant aspects to the case at bar,
46
and whether the purpose for
which the studies are presented to the court is supported by the limitations of the
study(ies). Second, the target and sample populations must be well defined. Once
the target population is defined, a sample frame can be established. In every study
evaluated by the first author and discussed, in part, in an upcoming paper, the
sampled population does not coincide with the target population as implied to
courts.
47
Accordingly, even if this aspect were the only flaw in experimental design
of “validation study” methodology, the “validation study” sample populations are
irrelevant to any particular judicial proceeding for numerous metallurgical and
statistical reasons. Thus, most inferences proffered to courts wherein purported
validation studies are presented as establishing a putative foundation for opinion
exceed scientifically acceptable limitations of the experimental framework for the
studies offered as support, largely because experimenters, expert witnesses, or both,
did not ensure that the sample population coincided with the target population.
Notwithstanding that there is virtually no external validity (generalizability) in
the individual, or even collective, claimed validation studies in the field, another
46
. See Burriss v. Texaco, Inc., 361 F.2d 169, 175 (4th Cir. 1966). Cf. Beasley v. Ford Motor
Co., 117 S.E.2d 863, 865 (S.C. 1963) (stating that in order “for an experiment to be admissible the
conditions of it must be similar . . . to the facts under investigation”).
47
. See Spiegelman & Tobin, supra note 7, at 67.
Tobin & Blau
142 53 JURIMETRICS
vast, but subtle, disconnect also exists. Even if an experiment(s) exhibits internal
validity, it may be irrelevant as support for the ability of a firearms examiner to
opine an individualization and for any (unfounded) expression of certainty in the
particular case at bar. Even when the conclusions of a study are properly within the
limitations of its underlying experimental methodology, per se, they are most often
extrapolated and presented to courts as proof, explicitly or implicitly, of universal
efficacious practice for the particular firearm (and other independent variables) in
the case at bar. This insidious extrapolation to other situations is patently
objectionable as pathological science, yet goes virtually undetected by both courts
and practitioners alike.
X. SHORTCOMINGS IN “10-GUN” VALIDATION STUDIES
It is generally accepted within both the firearm identification and metallurgical-tribological
communities that the highest risk of subclass carryover[48] of
characteristics (primarily striations or striae) used by firearms examiners for
comparisons is between and among products from the same manufacturing batch,
with the highest risk being of those consecutively manufactured. For this reason,
some of the purported validation studies involve what are known colloquially as
“10-gun studies.” The thinking is that if F/TM examiners can correctly match, with
low rates of Types I and II errors (false positives and false negatives, respectively),
bullets or cartridge cases, or both, fired from various known test guns that were
consecutively manufactured, these outcomes somehow prove the hypothesis of
discernible uniqueness and consequently “prove” that firearms examiners have low
rates of misattributions. However, even prominent practitioners have recognized
the limited value of such 10-gun (purported validation) studies in at least one paper
by F/TM authors who are respected in their community.[49] Notwithstanding, the
various 10-gun studies continue to be presented to courts as proof of infallibility or
otherwise low rates of practice error, without recognition that the studies are largely
irrelevant to judicial proceedings. Additionally, calculations reveal that the 95%
confidence interval for a 10-gun sample pool with 10 correct responses is 72% to
100%. Thus, a perfect score in such a study cannot exclude a true attribution error
rate as high as 28%.
48. The three categories of characteristics used by F/TM examiners for comparison are class,
subclass, and purportedly “individual.” Class characteristics are those that belong to the largest
sample pool or population of possible samples (firearms, in this case), such as caliber, barrel rifling
(number of lands and grooves, direction of twist), inter alia. Subclass characteristics are those of a
subset of the class and are generally derived from fabrication; for example, all barrels produced by
the same broach, breech faces produced by the same forging die or shaped by the same end mill.
“Subclass carryover” is the situation where characteristics imparted by forced contact with a
manufacturing tool, for example, striations that derive from the fabrication process and belong to a
large group (batch) of weapons, and are not purportedly “individual” characteristics, are present
during examination by the forensic examiner.
49. See Biasotti & Murdock, supra note 41, at 19, where they state, with respect to tests of
consecutively manufactured firearms, that “[s]uch studies are subjective evaluations based on
criteria of identification which cannot readily be articulated or communicated to other examiners
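The interval quoted above can be checked with a short calculation. The article does not state which interval method was used. The sketch below assumes the Wilson score interval, which for 10 correct responses out of 10 yields a lower bound of about 72%, consistent with the figure in the text. The exact (Clopper-Pearson) lower bound, computed for comparison, is somewhat lower, at about 69%. The function names are illustrative and do not come from the article.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95):
    """Two-sided Wilson score interval for a binomial proportion.

    Assumption: the article does not name its method; Wilson is used here
    because it reproduces the 72% lower bound quoted in the text.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials
                          + z * z / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def exact_lower_bound_all_correct(trials: int, confidence: float = 0.95) -> float:
    """Clopper-Pearson lower bound when every trial is a success.

    With zero observed errors the bound solves p**trials = alpha / 2.
    """
    alpha = 1 - confidence
    return (alpha / 2) ** (1 / trials)


lower, upper = wilson_interval(10, 10)
print(f"Wilson 95% interval: {lower:.1%} to {upper:.1%}")
print(f"Largest error rate not excluded: {1 - lower:.1%}")
print(f"Exact (Clopper-Pearson) lower bound: {exact_lower_bound_all_correct(10):.1%}")
```

Under either method, a perfect score on ten comparisons leaves a wide range of true error rates statistically plausible, which is the point the text makes about small sample pools.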
In response to the call from the legal community for more reliable foundations
for the practice, numerous validation studies have proliferated from the firearms
identification field, some including large numbers of practitioners in the
international forensic community. However large and diverse the population of
study participants was, the expanded studies do not rectify scientific flaws in
hypothesis testing in firearms identification practice and do nothing to establish the
validity of the underlying premise of discernible uniqueness or of the ultimate
inference of individualization. They also do not establish credible rates of practice
error in real-world casework. Arguably, the studies may be considered effective
support for F/TM examiners’ individual abilities (proficiencies) to discern
differences between and among firearms when the entire pool of possible samples
is presented simultaneously, as it is in proficiency tests, but, as discussed,
proficiency tests do not mimic real-world scenarios, are easily “maxed” or defeated,
or both, and do not measure what practitioners typically believe and represent that
they measure.
49 (continued). except through photography. The information gained from such studies is therefore only of value to
the examiner who conducted the study; or to the examiners trained or supervised by that examiner.”
In the now-defunct CBLA practice, proficiency tests presented a few
questioned bullets along with some known reference samples, and examiners were
asked which questioned bullets came from which known bullet sources. Proficiency
tests in the field of comparative bullet lead analysis basically tested the Federal
Bureau of Investigation (FBI) examiner’s ability to quantify compositional analytes
(chemical elements), and to compare questioned samples with an extremely limited
pool of possible samples by deductive inference. These proficiency tests never
tested the two most fundamentally critical issues for forensic application: (a) the
validity and reliability of specific source attributions in the general population
(universe) of bullets by inductive inference, and (b) the probative value of such
claimed matches, the ultimate tasks at hand for forensic practice. What they did test
was examiners’ efficiency in accurately measuring the analyte compositions (the
parameters of detection) and in comparing them with those from
contemporaneously presented samples. During challenges to the practice, the FBI
Laboratory’s measurement capabilities were never called into question, yet that is
what practitioners vehemently (and misguidedly) presented as practice error for the
task at hand (the ultimate inference). In fact, forensic practitioners were generally
quite proficient at measuring (although not necessarily interpreting) the elemental
constituents in bullet lead and were probably among the best in the nation, if not
the world, at doing so. The problem was that such exquisite measurements with
quite sophisticated analytical instrumentation (including a nuclear reactor) were
meaningless when it came to the ultimate task at hand (representing probative value
for a claimed match). For almost 35 years, proponents of CBLA practice and courts
alike had been seduced by the sophistication of the analytical instrumentation in its
impressive ability to measure compositional constituents to the parts per million.
Unfortunately, the resulting data were meaningless in the courtroom because no
one had ever established the probative value of such sophisticated measurements
until researchers studied random match possibilities in retail outlets and published
their findings in 2005. Those results dealt a devastating blow to the forensic utility
of CBLA.[50] Practice advocates had never studied product-marketing distribution.
They had merely assumed compositional uniqueness of sources (molten batches),
and then assumed uniform geographical product distribution, ignoring the
possibility, later confirmed by researchers, that retail distribution was inhomogeneous
and that sources were not discernibly unique. The researchers found surprising
concentrations of indistinguishable product in local and regional areas such that
many consumers could not have purchased different product compositions even had
they wanted to do so. Analytically indistinguishable product concentrated in a local
area can be damaging to probative utility. It was incumbent on the proponent of the
evidence to demonstrate forensic value, but without any comprehensive or
meaningful study of discernible uniqueness of bullet compositions or of the
availability of same-composition bullets in retail markets (distribution), probative
value of CBLA had merely been intuited. Current firearms identification practice
suffers from the same deficiencies as CBLA, but it is the contention of this paper
that these deficiencies have been unrecognized, ignored, or inadequately
addressed.
50. Simon A. Cole et al., A Retail Sampling Approach to Assess Impact of Geographic
Concentrations on Probative Value of Comparative Bullet Lead Analysis, 4 LAW, PROBABILITY &
RISK 199, 202 (2005).
Despite decades of law enforcement use and widespread acceptance of AFTE
guidelines-based findings in legal proceedings, the forensic practice of comparing
tribological markings on the surfaces of bullets, shell casings, and firearms, and
consequently opining source attributions based on those comparisons, fails to meet
key criteria consistent with mainstream scientific methods by which comprehensive
and meaningful research, with rigorous hypothesis testing, is conducted. The
subjective judgments are all too often afforded high credibility in judicial
proceedings, largely attributable to nearly 100 years of favorable admissibility
rulings, despite inadequate qualification of their physical bases and inadequate use
of sound methods that include qualification of applicability.
As implied in both NRC reports, it will take years of rigorous scientific study
and experimentation to properly and adequately address the critical underlying
issues of discernible uniqueness, design and conduct of experiments (including
sampling issues), statistical significance of results, and quantitative versus
qualitative expressions of findings, among other considerations. The efforts will
assuredly require quite significant funding. From the authors’ and colleagues’
perspective, few, if any, of the current “validation studies” would be useful or
contribute to establishing a proper foundation for effective, and scientifically
acceptable, forensic practice. Much like the belief in CBLA that “same composition
= same source,” the longstanding belief by F/TM practitioners that “sufficient
agreement = same gun” in firearms identification is similarly intuited because
there have never been any comprehensive and meaningful studies establishing the
underlying belief (premise) of discernible uniqueness. The ultimate task at hand for
CBLA was whether a purported “match” of compositions between questioned
(crime scene) and known (suspect-defendant’s) bullets implied “same source”
(batch) of molten lead with significant probative value. The ultimate task at hand
for courts evaluating firearms identification testimony is whether an examiner
testifying that there exists “sufficient agreement” of some limited, unspecified and
unquantified, number of similarities means that the bullets or cartridge cases were
cycled or fired through the same gun. No correlation between a finding of similarity
and individualization has ever been scientifically established in either field. None
of the existing examiner proficiency tests or purported validation studies has ever
tested (1) the validity of the underlying premise of discernible uniqueness, (2)
prevalence of putative sources or even possible samples with conflating
characteristics, for probative value, or (3) rates of practice error, in either field of
forensic endeavor. In the interim, while awaiting comprehensive and meaningful
studies to establish a proper foundation for expert opinions, the only scientifically
acceptable solution to curb the excesses of current F/TM testimonies is to limit
expert opinions such that triers of fact and law understand existing limitations of
forensic practice. Currently, the strongest opinion that is scientifically defensible is
that, in the opinion of the examiner (the first caveat), the characteristics exhibited
by the evidence are consistent with the evidence having been fired from a particular
firearm or that the firearm could not be excluded as the firing platform (source).