Stephen W. Porges, PhD,
Chair of Review Committee
Date: March 28-29, 1996
Location: JHU/APL Facility, Laurel, MD
Summary:
A team of distinguished scientists met to evaluate the research
conducted by JHU/APL on automated scoring of polygraph examinations.
The committee chaired by Dr. Stephen W. Porges included Dr. Raymond
Johnson, Dr. John Kircher, Dr. John Stern, and Dr. John Thornton.
Dr. Porges is a former President of the Society for Psychophysiological
Research, a Professor of Human Development and Psychology at the
University of Maryland, and an active researcher for the past 30
years in the area of autonomic psychophysiology, physiology, and
signal processing. Dr. Porges has chaired and participated on
review committees for several government agencies including the National
Institutes of Health. Dr. Johnson is an Associate Professor of Psychology
at Queens University and has been an active researcher for over
20 years in the area of forensic psychophysiology. Dr. John Kircher
is an Associate Professor of Educational Psychology at the University
of Utah and has been an active researcher for about 20 years in
the area of forensic psychophysiology with special expertise in
statistical models and methods for detecting deception from autonomic responses.
Dr. John Stern is the former chairman of the Department of Psychology
at Washington University, a former President of the Society for
Psychophysiological Research, a pioneer in psychophysiological methodology
and application, and is currently involved in evaluating the use of
eye movements in the detection of deception. Dr. John Thornton is
a biostatistician and a former department chairman of biostatistics
at a major medical school. The team had collective research experience
covering all areas of statistics, signal processing, and psychophysiology
relevant to the JHU/APL project. Based on scientific peer-review
criteria, the evaluation team viewed the JHU/APL effort as inadequate
in the application of scientific technology. The committee was
perplexed by the atheoretical and descriptive approach conducted
by the research team, as well as their lack of understanding of
physiology, psychophysiology, signal processing strategies, physiological
monitoring hardware, and statistical theory. Specific issues will
be discussed below.
SOW
History: JHU/APL was given the task to develop an algorithm
that discriminates between deceptive and non-deceptive tests in the data
set provided. The data set consisted of polygraph tests on which
two expert polygraphers had agreement. The JHU/APL team developed
an algorithm that works well with the data set provided. The algorithm
was able to replicate with a high degree of accuracy the decisions
of the expert polygraphers. Thus, based upon the initial description
of task, JHU/APL has completed the task. However, from a scientific
point of view, especially in terms of test development and application
of the algorithm to the field, the approach was faulty from the
start. Questions such as whether the algorithm works, under what
circumstances does the algorithm work, and whether the algorithm
works better than other algorithms have not been addressed. From
a scientific approach, these questions are not appropriate, because
the strategy used to develop the algorithms represents an atheoretical
and naive approach to psychophysiological signal processing and
test development.
Basic Problems with the project
1.
Generalizability of the JHU/APL algorithm. JHU/APL approached
the question as an empirical task of developing an algorithm that
distinguished the polygraph tracings designated as deceptive
from those designated as not deceptive. This empirical direction
is frequently articulated by the JHU/APL team as the power
or the strength of the computers and their ability to
extract the most meaningful patterns. Unfortunately, no approach
is truly empirical. All approaches to statistical analysis are limited
to the quality of the data input and subsequently dependent upon
sampling theory. The true goal is not to separate the given
data set into two groups, but to develop an algorithm that may be
generalized and used in field applications. Thus, we
must shift from descriptive atheoretical empirical descriptions,
to inferential statistics and their associated distributions of
estimators and of course, the probabilities of these estimators.
In the polygrapher's world, this would be stated in terms of
the probability that an individual is innocent or guilty.
Methodologically, the JHU/APL algorithm can be used with the data
set from which it was developed, but generalization of the probabilities
based on the development sample to field tests is not statistically
appropriate. Basically, given the methods used to develop the
algorithm, there is no scientific basis for the algorithm to be
generalized to field application. Thus, the marketing of the Polyscore
algorithm as providing a probability of deception in the field is
unwarranted and professionally irresponsible.
2.
Hardware used to collect the data for the development of the JHU/APL
algorithms.
The committee was critical of the Axciton computerized polygraph
system that was used to collect the data. To the committee the technology
of this system appears to represent 40-year-old hardware. Although
the outputs of the amplifiers are computerized, the amplifiers and
the interface with the subject reflect an old technology. JHU/APL
was unaware of the specifications of the amplifiers of Axciton or
the potential confounding effects of the time constants and filters
designed into the amplifiers on raw data. JHU/APL had no understanding
or knowledge of the transfer functions of the amplifiers. This appeared
to the review committee as inappropriate, since JHU/APL had received
a Federal contract to generate a transformation function to statistically
adjust data from another polygraph manufacturer to be similar to
the Axciton data. No appropriate answer was obtained regarding the
decision to select the system. Nor was there an adequate reason
to recommend this system to polygraphers in government agencies.
Apparently, because the JHU/APL algorithms were developed on data
generated from the system, the current marketing of the algorithm
forces users to purchase the Axciton.
It is important
to acknowledge that all AC amplifiers (i.e., self-centering) filter
data, especially lower frequency input. If the slow trend in GSR
is important or if the latency to peak or recovery of GSR is important,
this information might be lost or adulterated by the frequency response
characteristics of the amplifiers. From an engineering point of
view, this reflects poor design specifications. However, since JHU/APL
does not have an experienced research physiologist or psychophysiologist
or biomedical engineer on their team, the question of attenuation
of input signal or corruption of signal by amplifiers and filters
was not in their realm of expertise and, thus, not considered in
conducting their task.
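The attenuation of slow activity can be illustrated with a short
calculation. The sketch below assumes an idealized first-order
(single time constant) high-pass stage. The 1-second time constant
and the two test frequencies are hypothetical values chosen for
illustration; they are not Axciton specifications, which are not
given in this report.

```python
import math

def high_pass_gain(frequency_hz, time_constant_s):
    """Amplitude gain of an ideal first-order RC high-pass filter.

    gain = (w * tau) / sqrt(1 + (w * tau)^2), with w = 2 * pi * f.
    """
    w_tau = 2.0 * math.pi * frequency_hz * time_constant_s
    return w_tau / math.sqrt(1.0 + w_tau ** 2)

# Hypothetical amplifier time constant (seconds).
TIME_CONSTANT_S = 1.0

# A slow drift (one cycle per 100 s) versus a faster component (1 Hz).
slow_gain = high_pass_gain(0.01, TIME_CONSTANT_S)
fast_gain = high_pass_gain(1.0, TIME_CONSTANT_S)

print(f"gain at 0.01 Hz: {slow_gain:.3f}")
print(f"gain at 1.00 Hz: {fast_gain:.3f}")
```

Under these assumptions the slow component is passed at a small
fraction of its original amplitude while the faster component is
nearly untouched, which is why any feature that depends on slow
trends cannot be interpreted without knowing the amplifier's
frequency response.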
3.
Software. The software incorporates several stages
of smoothing, filtering, and signal extraction. Each of the
stages represents a decision to extract specific features from the
data. Although the JHU/APL team argues that the approach is empirical
and the computer selects the features, this argument
is not totally accurate. Decisions were made by the team and each
decision functioned to filter the entire empirical data
set. The transfer function of each of these steps needs to be empirically
presented. There is no doubt that the several iterations of the algorithm
were accomplished via hard work; however, the decision sequences
and the analyses supporting these decisions were not available for
review. The development of an algorithm is not an empirical exercise
as suggested by JHU/APL, but rather it must be based upon statistical
theory and requires clear statements of the assumptions and consequences
for each transformation and each filter.
4.
Physiological systems. Since the 1880s, researchers
have been quantifying electrodermal activity in response to a variety
of stimuli including words and thoughts
and visualizations. Even Carl Jung looked at GSR to
words. There is a large literature in psychophysiology, physiology
and biomedical engineering regarding the quantification and feature
detection of physiological data. JHU/APL appeared to be totally
ignorant of this literature on the special problems of signal processing
and filtering of physiological signals.
5.
Summary of scientific effort. This is not a scientific
project. The project represents a contractual relationship between
Federal agencies and JHU/APL to distinguish two groups in an existing
data base with a computer algorithm. If we accept that as the task,
the contract has been fulfilled. However, if we were to evaluate
this project in terms of scientific quality, and hold this project
to the criteria of peer review by psychophysiologists, signal processors,
statistical modelers, or psychometricians, our evaluation would
be negative. For example, the methodology does not meet the standards
for publication in a quality peer-reviewed journal.
Factors contributing to the inadequacy of the project
1.
Functioned in scientific vacuum.
The project has functioned as if it were in a scientific
vacuum and has not incorporated knowledge of several related disciplines.
For example, specific features of GSR have been known for over 30
years to be sensitive to psychological phenomena. For over a decade
research in forensic psychophysiology has documented that several
of these features are related to deception. These features are easily
identified and quantified and should, at least, be entered into the
model to determine whether they discriminate better than the empirically
derived variables. Empirically derived variables may, at times,
miss theoretically important or relevant characteristics. Moreover,
as stated above, there are no truly empirically derived
variables, because decision rules have been made regarding the
collection of the raw data (e.g., digitizing rates, transfer functions
of the amplifiers, etc.) and the processing of the data (e.g., numbers
of samples to look at, features to detect such as slope or peak
or level, etc.).
The evaluation
committee was surprised there was no input into this project by
individuals who were expert in the problems of signal processing
of physiological signals. Physiological signals present special
problems, because of their unique statistical characteristics such
as nonstationarity, dynamic changes, physiological state changes, and
latency characteristics. Once this is understood, it is clear that
transfer functions of the amplifiers and software filters would
impact on true physiological signals, and this impact may have a
predictable pattern. The JHU/APL team was ignorant of these issues
and argued adamantly that these problems were irrelevant. The evaluation
team viewed the JHU/APL team as inappropriate and unprepared to
deal with the research problem. The credentials of the JHU/APL team
do not fit the skills necessary for a successful treatment of the
task.
The JHU/APL
team argued irrelevant points such as relating sending a missile
into space (a task to which JHU/APL contributed) to developing
an acceptable algorithm to deal with psychophysiological data.
In addition, several government employees have argued that the how
and why of the development of the algorithm do not matter.
Rather, it is whether the algorithm works or does not work. This
is not an appropriate question, because the ability of the system
to work is dependent upon the physical and sampling characteristics
of the data. The effects of the amplifiers, filters, and sampling
procedures have clearly adulterated the data. Thus, even with
the best statistical techniques, the findings would be limited.
However, the statistical techniques employed suffer from a lack
of understanding of statistical theory and methodologies that would
allow one to generalize findings from a large sample to the field
application of a single client.
Finally,
there is no documented scientific product that can be scientifically
evaluated. In science this occurs in terms of peer reviewed publications.
How would this project fare in the Journal of Applied Physiology,
or an IEEE journal or a signal processing journal or a psychophysiology
journal? How would peers evaluate the product? The answer to
these questions is clear: the methodology would be viewed as inadequate.
An interesting
example of this criticism occurred at the 1994 meeting of the Society
for Psychophysiological Research. This meeting occurred while I
was President of the Society. As President of the Society for Psychophysiological
Research, I was involved in the development of the program for the annual
meeting this past October. My charge to the program chairman was
to stimulate submissions from several affinity groups including
forensic psychophysiology. One symposium focusing on forensic psychophysiology
was accepted. I attended this symposium and was appalled by the
attitudes expressed and the scientific bases of the JHU/APL presentation.
My view was shared by many of the Society members who witnessed
the talk. Following the meeting I received a copy of an email from
a member who is conducting polygraph research in Europe. His message
included the following statements:
To all
concerned scholars and scientists:
During
the last meeting of the Society for Psychophysiology Research,
held in Atlanta, Ga., October 5-9, a symposium was scheduled on
the topic of Decision-Making Algorithms in Forensic Psychophysiology.
The reason to address you is my concern about the scientific
contribution by Dale E. Olsen, John C. Harris, and Wendy W. Chiu
(Johns Hopkins University), entitled "The Development of
a Physiological Detection of Deception Scoring Algorithm"
and, to a lesser extent, the contribution by Patrick F. Castelaz,
(Loma Linda University), and John E. Angus (Claremont Graduate
School), entitled "Deception Classification via Nonconventional
Processing Techniques."
I am sending
a copy of this letter to the Board Of Directors of the Society
for Psychophysiological Research, and I will ask them never to
invite non-scientific researchers like these again for
a presentation at any conference organized by a Society that fosters
research relating psychology and physiology.
There are several
issues associated with the above statements. First, the Government
contracts are being conducted by researchers that are not respected
by scientists conducting psychophysiological research. Second,
the research was viewed as severely flawed and of no scientific
value. Third, it not only presents the contract researcher poorly
but also reflects negatively on the sponsor. Basically, legitimate
and established researchers view the current Government sponsored
polygraphy research as wasteful and poorly conceived. Additionally,
credible researchers who want to conduct scientific investigation
of polygraphy do not have access to the minimal funding they require
to complete well-designed research that would pass the criteria
of peer review.
2.
The fallacy of efficacy research. The JHU/APL contract
is an example of efficacy research. Efficacy research is short-sighted
and often useless when new methodologies are developed. Efficacy
research is focused on proving that existing technologies work and
not at understanding why they work. The issue is critical in the
area of polygraphy. Most researchers as well as the public believe
that polygraphy works with a hit rate greater than chance.
The polygraphy community believes that hit rate approaches 100%,
while the scientific community believes that this is a vast
overestimation. There are two approaches to dealing with this inconsistency
in objective effectiveness. One is to evaluate the effectiveness
or improve effectiveness by identifying strategies of experts.
The other is to attempt to understand the mechanisms and processes
mediating the features of physiological activity related to deception.
The first assumes that polygraphy as currently configured works
and that the primary research goal is to streamline
the scoring procedure to be more reliable. It is assumed that if
reliability is enhanced accuracy will be enhanced as well. To deal
with the reliability issues, funds have been directed at computerized
scoring to reduce scorer bias. The JHU/APL contract reflects
this approach. The second assumes that better signal processing
can be used to extract different physiological response variables,
better paradigms can be developed to isolate physiological responses,
and better decision making algorithms can be developed. The latter
is a scientific approach with a goal of improved detection instruments
via understanding the underlying mechanisms; the former is the approach
in the industry with an unmodifiable and potentially obsolete product
attempting to justify its utility.
The two approaches
cannot co-exist. Their philosophical bases are in direct contradiction.
Moreover, it is clear on which side of this philosophical argument
Government agencies are. Although Government funding has stimulated
novel evoked potential and eye movement research, it has directly
impeded research with the more traditional autonomic
nervous system variables. The Government has created two bottlenecks
in the polygraph research program: 1) the dependence on the Axciton
system; and 2) the development of algorithms by the Applied Physics
Laboratory at Johns Hopkins. These two research decisions
result in current efficacy research funding being directed at demonstrating
their utility. For example, although Axciton appears to be a
defective device in the transformation of physiological signals,
the Government has allocated funds to translate these
signals into a data base similar to that of other traditional polygraphs.
Unfortunately, the traditional polygraphs also adulterate the physiological
signals. Thus, the efficacy approach will result in wasted funds
because the amplifiers distort the underlying electrophysiological
and mechano-physiological signals. The obvious solution would be
to use amplifiers and computer algorithms that accurately report
and represent the physiological signals. This is a simple exercise that
could be evaluated by competent scientists.
Supporting evaluations by Drs. Johnson, Kircher, and Stern
Dr.
Raymond Johnson
Based on
what I learned at the site visit, I had a number of concerns about
both the results and the manner in which the contract was executed.
To begin, the contractors made no attempt to take advantage of nearly
100 years of research either on the nature of the physiological
measures that form the basis of polygraphy or on the methods that
have been developed to quantify these signals. While the contractors
were able to generate a measure for separating the deception indicated
(DI) and no deception indicated (NDI) populations, it is very likely
that they could have done a better job (see below) and spent considerably
less time and money had they used the existing information.
While one
might argue the above criticisms are irrelevant since they did derive
a discriminator, the main point is that they never tested their
method to determine if it works, and if it does, they did not determine
the extent to which their algorithm works in different situations.
That is, at present, there is not one single piece of evidence
to confirm that this algorithm will provide an accurate indication
of DI or NDI in a person outside the original sample used to make
the model. This is another instance of the contractors ignoring
a long history of existing knowledge, in this case, research design
methods. Had they used any of a series of standard practices, they
would have built their model on a random selection of one-half of
the data, leaving the other half for testing the efficacy of the
finished model. Perhaps the most appalling aspect of their presentation
was the fact that they felt absolutely no need to test their algorithm
before putting their product on the market. In addition, they
sell this as a tested product and I saw no indication from any of
the polygraphy community that was present that they understood that
this was an entirely untested product.
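The split-half practice described above can be sketched as follows.
This is a minimal illustration on synthetic noise, not the JHU/APL
data or algorithm. The case count, the feature count, the seeds, and
the single-feature threshold rule are all hypothetical and were
chosen only to show why a model must be tested on cases it was not
built from.

```python
import random

def split_half(cases, seed=0):
    """Randomly assign half of the cases to development, half to testing."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

def accuracy(cases, feature_index, threshold=0.0):
    """Share of cases where (feature > threshold) agrees with the label."""
    hits = sum((features[feature_index] > threshold) == label
               for features, label in cases)
    return hits / len(cases)

def best_feature(cases, n_features):
    """Pick the feature that best separates the given cases."""
    return max(range(n_features), key=lambda j: accuracy(cases, j))

rng = random.Random(1)
n_cases, n_features = 60, 500

# Pure noise: no feature carries any information about the label.
cases = [([rng.gauss(0.0, 1.0) for _ in range(n_features)],
          rng.random() < 0.5)
         for _ in range(n_cases)]

development, holdout = split_half(cases)
chosen = best_feature(development, n_features)   # uses development half only
dev_accuracy = accuracy(development, chosen)
holdout_accuracy = accuracy(holdout, chosen)

print(f"development accuracy: {dev_accuracy:.2f}")
print(f"holdout accuracy:     {holdout_accuracy:.2f}")
```

Because the best of many candidate features is selected on the
development half, its accuracy there will typically look well above
chance even though the data are noise, while its accuracy on the
untouched half should be expected to fall back toward 50%. Only the
holdout figure says anything about performance on new cases.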
Along with
a complete failure to test/validate the algorithm, the contractors
also failed entirely to characterize their sample and thereby delineate
properly for whom and what crimes the system might work. Upon questioning,
they had little or no idea about what mix of subjects (e.g., sex,
race, age) or crimes was used to develop the algorithm. I feel that,
without proper studies, it is impossible to know if the algorithm
(even assuming it works and is valid) can be applied at all, or
in the same way, across the entire range of possible subjects and
crimes. I believe that the contractors, in ignoring this point,
have made a serious error that can only be exacerbated by the atheoretical
manner in which they derived their algorithm. That is, when they
chose to base their model strictly on these particular
cases, they tied their algorithm very tightly to the particular
characteristics of their sample. I would, therefore, not expect
the algorithm to generalize or work well on persons or crimes not
well represented in the original sample.
The idea
that their algorithm does not easily generalize to other situations
is supported by their statement that they would like to continue
building their sample, doubling it if possible. This suggests
that the weights, and perhaps the parameters, are still drifting.
This reveals a number of things about their algorithm. First, it
suggests that there is a considerable amount of variability that
remains, raising doubts about using it to make important decisions
about individuals' futures. Second, it suggests they have
overmodeled their sample, reducing its generalizability to cases
outside the sample. Third, it suggests that their model does not
capture well the essential differences between DI and NDI responses.
Had this been the case, they should have been able to derive their
model from a smaller, rather than a larger, sample. This is particularly
important because the model is to be applied to individuals, rather
than groups, in order to make a vitally important decision.
Failure to
appreciate the fact that the polygraph itself affects the character
as well as the quality of the data is yet another example of the
contractors' failure to obtain even a rudimentary knowledge of the
subject they were working on. Even a cursory glance at any basic
text on psychophysiological methods would have revealed the importance
of filters and their interaction with all physiological signals.
Consequently, one of their features, the one based on
the Autonomic GSR signal, is almost certainly a measure of the time
constant of the amplifier, rather than of a signal from the subject.
Hence, the algorithm will not work properly if a different manufacturer
uses a different time constant, or if the operator makes a change
to the amplifier settings. Note that the algorithm could be adjusted
to compensate for such changes in the time constant. However, without
adequate knowledge that this part of the algorithm reflects amplifier
characteristics and not subject responses, the contractors had no
idea that such changes in the algorithm would be required. Thus,
the accuracy of the algorithm is affected in ways the contractors
did not understand, or warn about, because of their ignorance of
basic information on the topic they were working in.
In the future,
I feel certain that the contractors will return to government agencies
to obtain additional money for developing their algorithm. Indeed,
they can argue that they did as they said and fulfilled the contract,
whether or not it was done as well, as expeditiously, or as cheaply as
they or someone else might have done had they first developed an
expertise in psychophysiology. Therefore, it is hard for me to imagine
that these investigators will not receive another government contract
to either validate their method, or extend it to other formats,
or both.
Thus, there
are lessons to be learned from the way in which the previous contract
was written and administered. First, any future contracts (with
these or other investigators) must include a provision that any
algorithm for separating DI and NDI persons be TESTED by scientifically
accepted methods as the condition for fulfilling the contract. Second,
the contractors should submit their research/experimental plan for
review by a panel of scientists (as is done in all cases for non-military
grants and contracts) employed by the funding agency. The funding
agency may also want to consider using the panel to periodically
assess the progress on the contracts so that any problems can be
corrected as they occur, rather than after the contract has been
completed. Implementing these procedures will dramatically reduce
contract problems.
In sum,
the contractors have developed, delivered and sold an algorithm
to separate DI from NDI subjects that has no demonstrated validity.
I would recommend that all use of this algorithm cease until its
accuracy and validity can be demonstrated. Further, I would not
give these contractors another contract without first assuring that
there will be a considerably greater degree of oversight from scientists
knowledgeable in the field of psychophysiology.
Dr.
John C. Kircher
1.
Overall Impressions. I
was appalled by the procedures used at the APL to develop computer
programs for analyzing polygraph charts. It is difficult for me
to understand why the government would (i) fund efforts to develop
algorithms to analyze signals from hardware that is 40 years out
of date, (ii) fund investigators who have no formal training in the
relevant scientific disciplines (physiology, psychology, and psychophysiology)
and clearly made no attempt to study these areas on their own, and
(iii) fund investigators who, even in their own areas of expertise
(statistics, signal processing) show minimal grasp of basic principles
of classical and modern measurement theory, effects of multiple
nonlinear transformations on their signals, and issues of research
design, generalizability, and shrinkage.
2.
Alternative Analytic Techniques. The JHU/APL team did
not consider, nor do they appear to be aware of alternative analytic
methods such as those developed by David C. Raskin and me, which
are described in detail in several scientific articles and book
chapters.
3.
Other Measures from the Existing Data Base. I am not
sure, but I do not believe that any traditional measures of electrodermal
and cardiovascular activity were among the 9,911 features these investigators
report having examined (I believe with various types of signal transformations,
the actual number of features examined by these investigators is
many times greater than 10,000). Traditional measures would include
features such as peak amplitude, rise time, half-recovery time,
and onset latency (e.g., Coles, Donchin, & Porges, 1988; Greenfield
& Sternbach, 1972; Martin & Venables, 1980; Fowles, Christie,
Edelberg, Grings, Lykken, & Venables, 1981). It might be worth
taking a look at a few traditional measures and correlating them
with the features generated by the investigators at APL. It might
give us some idea of what they are measuring.
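As a minimal sketch of how such traditional measures might be
computed from a sampled response: the function name, the onset
threshold, and the synthetic triangular response below are
assumptions made for illustration, and published scoring
conventions differ in detail.

```python
def response_features(signal, dt, stimulus_index, onset_threshold):
    """Peak amplitude, onset latency, rise time, and half-recovery time.

    All values are measured relative to the level at stimulus onset.
    Returns None if the response never exceeds the onset threshold.
    """
    baseline = signal[stimulus_index]
    post = [value - baseline for value in signal[stimulus_index:]]

    peak_index = max(range(len(post)), key=lambda i: post[i])
    peak_amplitude = post[peak_index]

    onset_index = next(
        (i for i, value in enumerate(post) if value > onset_threshold), None)
    if onset_index is None:
        return None

    half = peak_amplitude / 2.0
    recovery_index = next(
        (i for i in range(peak_index, len(post)) if post[i] <= half), None)

    return {
        "peak_amplitude": peak_amplitude,
        "onset_latency": onset_index * dt,
        "rise_time": (peak_index - onset_index) * dt,
        "half_recovery_time": (None if recovery_index is None
                               else (recovery_index - peak_index) * dt),
    }

# Synthetic response in arbitrary units, sampled every 0.1 s:
# flat for 2 s, linear rise for 2 s, then a slower linear recovery.
signal = ([0.0] * 20
          + [float(i + 1) for i in range(20)]
          + [20.0 - 0.5 * (i + 1) for i in range(40)])

features = response_features(signal, dt=0.1, stimulus_index=0,
                             onset_threshold=0.5)
for name, value in features.items():
    print(name, value)
```

Measures of this kind could then be correlated with the features
selected by the APL algorithm to get some idea of what those
features are measuring.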
4.
Hardware. The major problem with Axciton hardware is
the measurement of skin resistance from stainless steel plates.
Even when properly recorded, SR shows baseline drift that is often
greater than the magnitude of the physiological responses. As a
consequence, on traditional polygraphs the examiner must frequently
re-center the pen. Axciton's solution to this problem is to
filter the signal to reduce baseline drift. However, this filter
distorts the signal and moves us one step further from the subject's
cognitive and emotional response to the stimulus. To measure
SRRs, the Axciton uses stainless steel plates. These plates
are subject to polarization, bias, and movement artifacts.
I do not
know if the Lafayette system uses stainless-steel electrodes, but
it does record skin conductance. I doubt that the Lafayette hardware
filters the SC signal in the same manner as the Axciton, if at
all. If the Lafayette uses Ag-AgCl electrodes with electrode paste
and records with a constant 0.5 V circuit as we do, there would be
less polarization, bias, and minimal effects on sweat gland activity.
I cannot imagine how the Polyscore algorithm, which was developed
with data from the Axciton, could possibly correct for the differences
between the filtered SR signals from the Axciton and the SC signals
from Lafayette.
In the copies
of transparencies provided by the APL and at one point during the
meeting, the investigators noted that SR signals that exceeded the
12-bit resolution of the Axciton resulted in flat lines at the maximum
(clipping of the waveform). The investigators then noted that this
problem had been corrected. At what point was this problem corrected?
Can it be assumed that many of the cases in their training set were
collected while this was a problem? Were these strong SRRs
treated as missing values (which would bias results), or were they
measured as they appeared in the data files? The occurrence of artificial
ceilings in SRRs would seriously affect the diagnosticity
of various percentile measures of GSR, their
selection for the logistic regression model, and the weights given them
by the model.
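The effect of such a ceiling on percentile-type measures can be
shown with a small example. The sample values, the nearest-rank
percentile definition, and the helper names are hypothetical; the
actual Polyscore percentile features are not specified here.

```python
def clip_to_adc(samples, bits=12):
    """Limit samples to the range of an unsigned converter of the given width."""
    ceiling = 2 ** bits - 1
    return [min(max(sample, 0), ceiling) for sample in samples]

def percentile(samples, percent):
    """Nearest-rank percentile, using integer arithmetic for the rank."""
    ordered = sorted(samples)
    rank = max(1, -(-percent * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]

# Hypothetical strong response, in converter counts, that exceeds 12 bits.
true_response = list(range(3000, 5000, 100))      # 3000, 3100, ..., 4900
recorded = clip_to_adc(true_response)             # flat-lined at 4095

true_spread = percentile(true_response, 90) - percentile(true_response, 50)
recorded_spread = percentile(recorded, 90) - percentile(recorded, 50)

print("true 90th-50th spread:    ", true_spread)      # 800
print("recorded 90th-50th spread:", recorded_spread)  # 195
```

In this example the median is untouched while the upper percentile
is pinned to the ceiling, so a difference between percentiles
understates exactly the strongest responses. Any weights estimated
from such cases would inherit that distortion.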
5.
Algorithm Development: Signal and Feature Transformations.
The investigators perform numerous
nonlinear transformations of signals prior to extracting features
from them. The Axciton hardware filters low frequencies from the
SR channel, and the Polyscore then detrends the filtered SR data.
The signal from which the GSR features are extracted
is a nonlinear transformation of another nonlinear transformation
of skin resistance. Even if the investigators measured something
as simple as the amplitude of this signal, one wonders how that
measurement relates to actual changes in skin resistance. The transformations
do not stop there. They measure response magnitude in terms of pseudo-percentiles
and in some cases differences between pseudo-percentiles. While
these measures are functionally related to the subject's responses
to test questions, it is difficult to see how, and one wonders about
the effects of variations in gain settings and even the number of
questions on the obtained measures of electrodermal responses.
The cardio
channel is detrended and then partitioned into high and low frequency
components. I am concerned about the effect of detrending on measures
extracted from these signals as well. The justification for detrending
cardio signals is to correct for leaks in the pneumatic system. (The
same justification is given for respiration channels since these
are recorded from pneumatic bellows.) It seems to me that the
solution to leaks in the recording system is to find out where there
are leaks and fix them. I believe it is a mistake to transform all
signals to correct for the possibility that there might
be a leak in the recording system. Again, with each transformation,
the measurements we use to discriminate between truthful and deceptive
subjects are one step further removed from the psychological processes
that give rise to the changes in physiology we wish to assess.
Respiration
signals are detrended and baselined (pp. 13-15 of
the Algorithm Overview handout). I do not see how the respiration
signal would change if only the baselining were done. Ultimately,
the points of maximum expiration are aligned. I also fail to
see any possible advantage in baselining respirations. I think it
must distort the signal in some nonlinear manner to fit the constraint
that all baseline values fall on a straight line. And why should
it matter to the algorithm if the baseline points are aligned? There
might be some advantage in applying this transformation if the baselined
respirations were to be presented to polygraph examiners for visual
inspection. However, I asked about this, and they do not show the
baselined respirations to the polygraph examiner. I believe the
transformations on respiration channels have only adverse effects
on measurements from these channels because they distort the signals
and do so unnecessarily since no visual analysis is ever performed
on the transformed data.
6.
Algorithm Development: Feature Extraction. APL approached
their task as though no one had ever before quantified respiration,
electrodermal, or cardiovascular activity and used them to draw
inferences about psychological states or processes. The measures
developed by APL bear little if any resemblance to the types of
measurements made by psychophysiologists for many decades. They
did not study this literature. For example, they did not know that
the reductions in cardio pulse amplitude in their model could reflect
an increase in blood pressure or a decrease in blood pressure depending
on the pressure of the occluding cuff. Are we to believe that these
investigators, with no training in psychophysiology, have identified
new ways of characterizing psychophysiological events that are generally
superior to the accumulated wisdom of research scientists who have
been investigating these phenomena for over 100 years? Standards
have been developed by the scientific community for measuring physiological
activity, and these standards were completely ignored. Massive computation
is no substitute for careful study of the scientific literature
and the application of well established psychophysiological and
psychometric principles.
One potential
advantage in using a computer to quantify physiological reactions
is that it can help to train polygraph examiners to be better chart
interpreters. With all the nonlinear transformations of signals
and features made by Polyscore and the nonstandard types of features
used by this algorithm, there is no way the measurements made by
Polyscore could possibly be used to train polygraph examiners to
be better numerical scorers.
The investigators
at APL have argued that an advantage in using the computer is that
it evaluates aspects of the charts that the human interpreter could
not possibly see. The only possible way additional complexity
could be considered advantageous is if it actually improves diagnostic
accuracy, and there is no evidence of this. Otherwise, it is nothing
but smoke and mirrors. If the algorithm is a complete mystery to
the best scientific minds in the field, it could not possibly be
viewed as anything but a black box to the field polygraph examiners
who use it.
7.
Algorithm Development: Feature Selection. Stepwise
logistic regression was used with about 33 blocks of 300 different
variables to select the features for the model. Stepwise regression
procedures are notoriously unreliable; they capitalize on chance
and yield highly inflated estimates of discrimination (Hocking,
1983). The problem is compounded by the enormous number of variables
provided to the stepwise variable selection algorithm. Under these
conditions, random numbers would yield perfect or near perfect discrimination
between groups, which is precisely what these investigators report.
Their approach borders on the absurd. It is the most extreme example
of overfitting a training set that I have ever seen or could possibly
imagine. I do not understand how a professional statistician could
even consider developing a prediction equation with more than 10
variables per subject. Texts on multivariate statistics recommend
10-20 subjects per variable and certainly no fewer than five subjects
per variable. Not only did these investigators fail to follow these
recommendations, they missed the boat by two orders of magnitude.
On purely statistical grounds, any claim that the results obtained
in their training set will generalize to new cases is completely
without merit.
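
The point about capitalizing on chance can be sketched with a small
simulation. This is not the APL procedure: it swaps stepwise logistic
regression for a simpler ranking of features by group mean difference,
and the sample sizes, feature count, and function name are illustrative
assumptions. Every feature is pure random noise.

```python
import random
import statistics


def simulate_noise_selection(n_train=40, n_test=2000, n_features=300,
                             n_keep=20, seed=0):
    """Select 'predictors' from pure noise; return (training, holdout) accuracy."""
    rng = random.Random(seed)

    def make_cases(n):
        # Labels alternate truthful (0) / deceptive (1); features carry no signal.
        labels = [i % 2 for i in range(n)]
        rows = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]
        return rows, labels

    train_x, train_y = make_cases(n_train)
    test_x, test_y = make_cases(n_test)

    def group_mean_difference(j):
        ones = [row[j] for row, y in zip(train_x, train_y) if y == 1]
        zeros = [row[j] for row, y in zip(train_x, train_y) if y == 0]
        return statistics.fmean(ones) - statistics.fmean(zeros)

    # Keep the features that happen to separate the training groups best.
    ranked = sorted(((group_mean_difference(j), j) for j in range(n_features)),
                    key=lambda pair: abs(pair[0]), reverse=True)
    selected = ranked[:n_keep]

    def score(row):
        # Unit-weight sum, signed by the direction seen in the training set.
        return sum((1.0 if diff > 0 else -1.0) * row[j] for diff, j in selected)

    def accuracy(rows, labels):
        hits = sum((score(row) > 0) == (y == 1) for row, y in zip(rows, labels))
        return hits / len(labels)

    return accuracy(train_x, train_y), accuracy(test_x, test_y)


train_acc, holdout_acc = simulate_noise_selection()
print(f"training accuracy: {train_acc:.2f}, holdout accuracy: {holdout_acc:.2f}")
```

Under these assumptions the selected noise features classify the
training cases well, while accuracy on fresh cases stays near one half,
which is why accuracy reported on the training set alone says little
about generalization.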
From a
statistical standpoint, there's no reason to believe that the
model developed at APL has any power whatsoever to discriminate
between truthful and deceptive subjects. Other indications that
their model has limited generalizability include intimations by
the investigators (no data) that the model for ZOC tests would (or
does) not work well for MGQTs or in laboratory mock crime experiments.
There's no theoretical reason or empirical research to suggest
that the patterns of physiological changes associated with truthfulness
and deception should differ in ZOC and MGQ tests.
I think the
use of logistic regression for computerized polygraphy is appropriate.
It is a non-parametric method that is a bit more conservative than
discriminant analysis. The relative power of these two methods for
modeling differences between truthful and deceptive subjects depends
on the extent to which the assumptions that underlie the two techniques
are reasonable. Under conditions of multivariate normality, discriminant
analysis is likely to be more powerful. However, I take issue with
the claim by these authors that the probabilities of truthfulness/deception
from discriminant analysis are unreasonable (see Algorithm
Overview Page entitled Comparing Logistic Linear Regression).
For our presentation at the 1994 meeting of SPR in Atlanta, we created
a discriminant function and a logistic regression model composed
of the same set of measures from the same set of laboratory and
confirmed field cases (89 deceptive, 74 truthful subjects). The
correlation between the probabilities generated by the two models
exceeded .999. If discriminant analysis produces unreasonable
probabilities, then the same must be said about logistic regression.
8.
Algorithm Development: Case Selection for Model Development.
The criteria for selecting cases for the training set differed
for individuals classified as deceptive or truthful. Criterion deceptive
subjects had confessed, and there was independent corroborating
evidence that the confession was correct. We do not know the percentage
of test questions within a test to which a subject confessed. Confessions
about all issues covered even in ZOC tests are rare. What became
of the data for questions within a test for which there were no confessions?
Apparently Mike Capps and someone else decided if there was corroborating
evidence. Was there independent corroborating evidence for every
deceptive case (or relevant question) in the training set? What
was the reliability of these judgments, and why was the corroborating
evidence criterion not stated in the handout (page 6) or in any
other descriptions of algorithm development supplied by the APL?
Apparently,
a guilty plea by a suspect was also used as a basis for assignment
to the deceptive group and perhaps even for clearing
an innocent suspect. There are serious problems with this criterion,
since people enter guilty pleas for many reasons, only one of which
is that they committed a crime and lied about it on their polygraph
test. Important details about the reliability of the selection criteria
for deceptive subjects were not presented, and I have grave concerns
about the use of guilty pleas as a criterion for case selection.
From the available information, it is not possible to estimate the
proportion of subjects, or the proportion of answers to individual
questions within tests, that were considered deceptive but were actually
truthful. In all likelihood, the algorithm
was trained on a contaminated sample of deceptive cases.
The manner
in which criterion truthful subjects were selected for the training
set was even more problematic. The vast majority of this sample
was composed of cases in which the original examiner and two independent
evaluators agreed that the subject was truthful. I suspect that
this sample contains few false negative errors, too few in fact.
There is a lot of research which shows that a large percentage of
truthful subjects produce ambiguous charts, especially in field
tests. The outcome of a polygraph test is ambiguous when the subject
responds similarly to control and relevant questions. These cases
were systematically excluded from the sample of truthful cases since
only clearly truthful charts are likely to result in complete agreement
among polygraph examiners. It is not difficult to see why the algorithm
was able to discriminate this homogeneous sample of clearly truthful
charts from cases in the deceptive group. Inclusion of only clearly
truthful charts biases estimates of the accuracy of outcomes from
Polyscore, not only for truthful subjects but also for deceptive
subjects. More importantly, it is likely to have undesirable effects
on the selection of variables for the model. Variables that could
be of great value in increasing the statistical distance between
ambiguous truthful charts and deceptive ones would not be selected
for the model because the stepwise algorithm does not know that
these cases even exist.
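
The selection effect described above can be sketched with a toy
simulation. All numbers here are assumptions chosen for illustration
(the distribution of truthful subjects' reactivity, the noise in each
reading, and the cutoff for a "clearly truthful" chart), and the
function name is hypothetical. Nothing is estimated from the APL data.

```python
import random


def truthful_accuracy(n_cases=5000, reading_noise=0.5, seed=1):
    """Return algorithm accuracy on (all truthful cases, unanimously scored ones)."""
    rng = random.Random(seed)
    all_hits = all_n = kept_hits = kept_n = 0
    for _ in range(n_cases):
        # Latent relevant-minus-control reactivity of a truthful subject (assumed).
        latent = rng.gauss(-1.0, 1.0)
        # Three examiners each read the chart with independent error.
        readings = [latent + rng.gauss(0.0, reading_noise) for _ in range(3)]
        # The algorithm's own noisy score; a negative score means "truthful".
        correct = (latent + rng.gauss(0.0, reading_noise)) < 0
        all_n += 1
        all_hits += correct
        # A case enters the sample only if every examiner sees a clearly truthful chart.
        if all(reading < -0.5 for reading in readings):
            kept_n += 1
            kept_hits += correct
    return all_hits / all_n, kept_hits / kept_n


accuracy_all, accuracy_selected = truthful_accuracy()
print(f"all truthful cases: {accuracy_all:.2f}, selected cases: {accuracy_selected:.2f}")
```

Because the ambiguous charts never reach the selected sample, accuracy
measured on that sample overstates accuracy on truthful subjects as a
whole.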
The investigators
went to great pains to state that their goal was not to determine the
accuracy of polygraph examinations. However, it is not possible
to separate the issue of the accuracy of polygraph examinations
from the task of developing an algorithm that has utility for field
polygraphy. By systematically excluding inconclusive cases from
the training set, they misrepresented the qualitative and quantitative
nature of psychophysiological differences between the target populations
of truthful and deceptive subjects. These populations are considerably
more heterogeneous than the samples of cases included in the training
set. Consequently, their model was not optimized to distinguish
between truthful and deceptive field suspects, the observed differences
between the groups are biased, and they fool themselves and their
customers about the value of the algorithm for field applications.
9.
Recommendations and Future Directions.
a. Stop funding the research at the APL.
The research at APL has been an incredible waste of time and money.
They completely ignored scientific standards for psychophysiological
measurements and statistical analysis; they adulterated methods
David Raskin and I pioneered almost 20 years ago; and they promote
the algorithm for field use although it has never been adequately tested.
One would think that a professional statistician with plans to generate
thousands of new and untested measures would think to hold out a
few cases in order to cross-validate the model. The only test of
the validity of the algorithm was completed at DODPI shortly before
the meeting in March. There were only four confirmed truthful subjects
in this cross-validation study, and Polyscore misclassified one
of them (75% correct). Statistically, this is no better than chance
accuracy on truthful subjects. Whatever limited validity data exist
speak more to the power of the control question test than to the quality
of the research. Accuracy on deceptive subjects was much better
(92% I think). However, with little information about the accuracy
on truthful subjects, this figure is difficult to interpret. If
the algorithm is biased against truthful suspects and yields a high
rate of false positives, one would expect highly accurate decisions
on deceptive subjects.
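
The remark that three correct out of four is statistically no better
than chance can be checked with an exact binomial tail probability. The
sketch below assumes that "chance" means a 50% hit rate on each
truthful subject and that inconclusive outcomes are ignored. The helper
name is hypothetical.

```python
from math import comb


def binomial_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))


# Three of four truthful subjects classified correctly, tested against guessing.
p_value = binomial_tail(3, 4)
print(p_value)  # 0.3125, i.e. 5/16
```

A guesser would do at least this well on four subjects about 31% of the
time, far above any conventional significance level.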
A far better
use of funds would be to: (i) announce your interest in the development
of computerized polygraph systems and solicit a range of proposals
from the research community; or (ii) organize a meeting of a select
group of psychophysiologists with some knowledge and interest in
polygraphy. (It wouldn't hurt to invite a few outspoken members
from the user groups whose task would be to keep the feet of ivory
tower types firmly planted on the ground.) The purpose of the meeting
would be to develop the research plan for further development of
computer algorithms for the detection of deception. The research
program would be characterized by a coherent agenda, open discussion,
consensus, and collaborative efforts by knowledgeable investigators.
I am certain that the result would look nothing like the APL/Axciton
system you have now.
b.
Model the decision making of human experts who make accurate
decisions yet tend to extract diagnostic information from the polygraph
charts that is not included in the computer model.
c.
Investigate new sensors. Autonomic measures that show
considerable promise include skin potential, components of variance
in heart period (vagal tone), and blood pressure. Pilot data from
our lab and John Podlesny's research with the Finapres blood
pressure monitor suggest that measures derived from these channels
are diagnostic. Moreover, they are more likely than additional characterizations
of conventional polygraph channels to make independent contributions
to existing combinations of weighted physiological measures.
John
A. Stern
1. This
is a statistical Tour de Force - or is it a Tour de Farce??
I feel strongly that it is the latter. Their attempt at
model building deals purely with the development of
a statistical model, one that is purely descriptive in nature.
2. The effort
to develop an algorithm that discriminates between deceptive and
non-deceptive is laudable; however, the effort is completely atheoretical,
nor did it have the benefit of skilled polygraphers in suggesting
variables that might (or might not) be entered into the discriminative
equation.
3. The need
for 500+ examinations to develop a predictive equation
suggests a lot of error variance in the data base. The suggestion
to develop an equation based on a smaller subset of the data and
applying to the remaining data set was considered inappropriate
by them. They must have had second thoughts about this! We did receive
a copy of a cross-validation report in which the data
set (624 cases) was divided into four subsets. Using their modeling
procedures models were built for each database and they now claim
only a 3% accuracy degradation when the model developed on one data
base is applied to the other three sets. They go on to say, "It
also shows that although some models had no terms (features) in
common, there is a strong correlation for some features and allowing
LOGIT to choose among them on the basis of the data base results
in equivalently performing models." How is that for gobbledygook
and scoring a touchdown for the other side? One must now ask
how many samples are necessary for the development of a model
and which is the most appropriate model. If all four worked equally
well one must have doubts about their procedures!!!
At the meeting
they were not satisfied with the number of cases in their data bank
and felt they needed more. One wonders how many more cases will
be required before they are satisfied that the equation is reliable
and valid and field applicable - although it is being marketed and
applied in field investigations. We wonder about the ethics of marketing
an unvalidated product. We believe it to be highly unethical -
regardless of the outcome of a validation study.
4. The models
developed are highly restrictive. New models have to be developed
for Zone scoring and the MGQT procedures. It would seem to me that
a more general model would be desirable - if it is possible.
5. To my
surprise only polygraph data and no personal data - such as gender,
age, race, etc. enter into the predictive equation.
6. Though
not, or only poorly, articulated, it appears that the features
or factors cannot be described in terms of what component
of the response in question is evaluated. With all their statistical
manipulations it apparently becomes difficult to decipher the algorithm.
The features used in the algorithm are purely empirically derived
and the authors could not or would not specify what each feature
measured nor identify weights associated with each feature. We were
shown one of the Polyscore reports and I noticed that in the particular
report 80% of the weighting was attributed to electrodermal variables.
Ideally one should specify the decision rules identified in the
algorithm and then evaluate these rules empirically.
7. Their
response to questions was, at best, evasive and sometimes offensive.
It appears that there is little in the way of documentation concerning
their research efforts. They alluded to analyses done
earlier but there's no paper trail allowing for an evaluation
of what they have actually done over the years.
8. The product
Polyscore that has been developed is not generalizable and is
atheoretical and aphysiological.
9. The only
documentation available to us was a second draft of a paper they
hope to publish. It contains little that is not found in their User's
Guide to the Polyscore system. It is marred by inaccuracies and
would not pass muster with reasonable editors.
10. They
are woefully ignorant about: a. the polygraph used to collect the
data used by them. For example, they had no idea about the filter
characteristics of the amplifiers used. b. facts associated
with recording techniques. For example, the nature and meaning of
measures derived from the blood pressure recording procedure. c.
the simple fact that speech affects the recording of respiratory
activity and that such activity needs to be taken into consideration
when scoring the data. d. that there are significant differences
in the resting levels of skin conductance as a function of skin
color and probably age.
11. With
all the hype about the reliability and accuracy of data analysis
the question about its relative accuracy is still an unknown.
12. One recommendation
agreed to by the panel was the need for a broad-based look at the
polygraph, the tool used to acquire the information. The current
effort focused only on data analytic procedures.
What is needed
is a closer look at: sensors, amplifiers, A/D converters, data loggers,
and data analysis techniques. The polygraph, as currently configured,
has not changed much from the days of the old Keeler Polygraph.
We think that current technology should allow for the utilization
of better amplifiers as well as broadening the array of physiological
measures that might be used.
|