Psychology 121, Lecture 8
by Hal S. Kopeikin, Ph.D. © 1998
Overview
-
When tests are used to make decisions, validity can be measured by
the quality of the decisions they dictate. We'll begin by looking at validity
from the perspective of decision making.
-
Base rates (the frequency of something in a population) have interesting
implications for decision-making accuracy. We'll explore them next.
-
The conditions of administration may influence reliability and validity.
Relevant dimensions are examined.
-
Finally, we'll introduce interviews and ways in which they vary.
Judging tests as aids in decision-making
Since the purpose of testing is usually to help make decisions, one logical
way of evaluating a test is determining how much better the decisions made
with it are vs. those made without it. Typically, this involves dividing
test scores into predicted 'success' or 'failure,' and categorizing the
behavior one is trying to predict in terms of 'success' or 'failure.' You
should be aware that the terms 'success' and 'failure' are a bit arbitrary.
For example, when one is predicting suicide, a 'success' might be an actual
suicide, although this is not a good thing.
-
Cutting Score: The cutting score is the test score used to divide
the population of test scores into predicted successes or failures.
-
Hits and Misses: Correct predictions are called Hits while incorrect
predictions are called Misses.
A finer classification of hits and misses and our predictions is as follows:
Positive predictions imply someone will succeed, has a characteristic,
fits in a group, etc.
Negative predictions imply (s)he does not.
True positives: Hits where success is predicted and in fact occurs.
True negatives: Hits where failure is predicted and in fact occurs.
False positives: Misses where success is predicted but where failure
actually occurs.
False negatives: Misses where failure is predicted but where success
actually occurs.
Notice that 'true' and 'false' refer to hits and misses while 'positive'
and 'negative' refer to successes and failures.
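The four categories can be sketched in a few lines of Python. This is only an illustration: the scores and outcomes are invented, and it assumes that a score at or above the cutting score counts as a positive prediction (some tests run the other way).

```python
from collections import Counter

def classify(score, outcome, cutting_score):
    """Label one case as a true/false positive/negative.

    score: the test score.
    outcome: True if 'success' actually occurred.
    Assumed convention: score >= cutting_score is a positive prediction.
    """
    predicted = score >= cutting_score
    if predicted and outcome:
        return "true positive"
    if predicted and not outcome:
        return "false positive"
    if not predicted and not outcome:
        return "true negative"
    return "false negative"

# Hypothetical (score, actual outcome) pairs, invented for illustration.
cases = [(15, True), (12, False), (7, False), (9, True), (4, False), (11, True)]

tally = Counter(classify(s, o, cutting_score=10) for s, o in cases)
print(dict(tally))
```

With these made-up cases and a cutting score of 10, there are two true positives, one false positive, two true negatives, and one false negative.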
Ratios expressing Decision Making Accuracy
We can express the accuracy of a particular test with several ratios, each
of which emphasizes a particular facet of accuracy.
Total Hit Rate = (True Positives + True Negatives) / All
results (True Positives + True Negatives + False Positives + False Negatives)
This is a global measure of the decision making accuracy of
the test. It counts all errors and all correct decisions equally.
Positive Hit Rate = True Positives / (True Positives + False Positives)
This answers the question, 'How accurate are positive predictions?'
It is a good measure when false positives are particularly worrisome. For
instance, you really don't want a child molester to falsely pass your test
for becoming a baby-sitter.
Negative Hit Rate = True Negatives / (True Negatives + False Negatives)
This answers the question, 'How accurate are negative predictions?'
In this case, the rate goes down as the number of false negatives goes up.
This measure might be very important for a test which measures likelihood
of suicide. In this case, you do not want to get any false negatives which
wrongly predict that the person will not commit suicide.
Sensitivity = True Positives / (True Positives + False Negatives)
This ratio answers the question, 'How good is the test at categorizing
those who actually succeed?' In this and the next ratio, you start with
the results and then look back to see how many the test got right. In this
case you examine positives ("successes").
Specificity = True Negatives / (True Negatives + False Positives)
This ratio answers the question, 'How good is the test at categorizing
those who actually fail?' In this case you examine negatives ("failures").
Data Illustrating Concurrent Validity of Depression Measures As Functions
of Cutting Scores
from Rapp SR; Parisi SA; Walsh DA; Wallace CE. Detecting depression
in elderly medical inpatients. Journal of Consulting and Clinical Psychology,
1988 Aug, 56(4):509-13.
Physician Detection of Depression

RAW NUMBERS
                                  CRITERION MEASURE
                           Depressed   Not Depressed   TOTAL
Estimate   Depressed            2             6           8
           Not Depressed       21           121         142
           TOTAL               23           127         150

PERCENTAGES
                                  CRITERION MEASURE
                           Depressed   Not Depressed   TOTAL
Estimate   Depressed            1%            4%          5%
           Not Depressed       14%           81%         95%
           TOTAL               15%           85%        100%

DEFINITION OF TERMS    PREDICTOR    CRITERION
-------------------    ---------    ---------
TRUE POSITIVE          YES          YES
FALSE POSITIVE         YES          NO
TRUE NEGATIVE          NO           NO
FALSE NEGATIVE         NO           YES
-
TOTAL HIT RATE = (TRUE POSITIVE + TRUE NEGATIVE)/(all subjects)
=82%
SENSITIVITY = (TRUE POSITIVE)/(TRUE POSITIVE + FALSE NEGATIVE)
= 9% (all of these are really depressed)
SPECIFICITY = (TRUE NEGATIVE)/(TRUE NEGATIVE + FALSE POSITIVE)
= 95% (none of these are really depressed)
POSITIVE HIT RATE = (TRUE POSITIVE)/(TRUE POSITIVE + FALSE POSITIVE)
= 25% (all of these are predicted depressed)
NEGATIVE HIT RATE = (TRUE NEGATIVE)/(TRUE NEGATIVE + FALSE NEGATIVE)
= 85% (none of these are predicted depressed)
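As a check on the arithmetic, the five ratios can be recomputed from the raw numbers in the physician-detection table. The function name `accuracy_ratios` is just a label chosen for this sketch.

```python
def accuracy_ratios(tp, fp, tn, fn):
    """Compute the five decision-accuracy ratios from the four cell counts."""
    total = tp + fp + tn + fn
    return {
        "total hit rate": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive hit rate": tp / (tp + fp),
        "negative hit rate": tn / (tn + fn),
    }

# Cell counts from the RAW NUMBERS table:
# physicians said "depressed" for 2 depressed and 6 non-depressed patients,
# and "not depressed" for 21 depressed and 121 non-depressed patients.
rates = accuracy_ratios(tp=2, fp=6, tn=121, fn=21)

for name, value in rates.items():
    print(f"{name}: {value:.0%}")
```

Rounded to whole percentages, the results are 82%, 9%, 95%, 25%, and 85%, matching the figures listed above. Notice how a respectable total hit rate hides the fact that physicians caught only 2 of the 23 depressed patients.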
Effects of Cutting Scores
You should be aware that some cutting scores will be optimal for some rates
but not for other rates. There are very few cases in which all the rates
are optimized by the same cutting score. Adjusting cutting scores
will reduce some errors while raising others. For example, raising the cutting
score will usually reduce sensitivity while raising specificity.
Beck Depression Inventory as an estimate of Depression

Cutting Score   Sensitivity   Specificity   Positive Hit Rate   Negative Hit Rate
-------------   -----------   -----------   -----------------   -----------------
      8            100%           50%              26%                100%
      9             91%           60%              29%                 97%
     10             83%           65%              30%                 95%
     11             78%           72%              33%                 95%
     12             74%           76%              36%                 94%
     13             70%           82%              41%                 94%
     14             70%           83%              43%                 94%
     15             70%           84%              44%                 94%
     16             70%           87%              49%                 94%
     17             65%           90%              54%                 93%
     18             65%           90%              54%                 93%

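The same trade-off can be reproduced with a small sketch. The scores and diagnoses below are invented for illustration (they are not the Beck Depression Inventory data), and the sketch assumes that a score at or above the cutting score is a positive prediction.

```python
def sensitivity_specificity(scores, depressed, cutting_score):
    """Return (sensitivity, specificity) for one cutting score.

    Assumed convention: score >= cutting_score predicts 'depressed'.
    """
    pairs = list(zip(scores, depressed))
    tp = sum(1 for s, d in pairs if s >= cutting_score and d)
    fn = sum(1 for s, d in pairs if s < cutting_score and d)
    tn = sum(1 for s, d in pairs if s < cutting_score and not d)
    fp = sum(1 for s, d in pairs if s >= cutting_score and not d)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical test scores and criterion diagnoses, invented for illustration.
scores = [5, 7, 9, 10, 12, 14, 16, 18, 20, 22]
depressed = [False, False, False, True, False, False, True, True, False, True]

results = []
for cut in (8, 12, 16, 20):
    sens, spec = sensitivity_specificity(scores, depressed, cut)
    results.append((cut, sens, spec))
    print(f"cutting score {cut}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

As the cutting score rises in this made-up sample, sensitivity never goes up and specificity never goes down, which is the pattern visible in the table.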
Base Rates and Criterion-Related validity
The base rate is the frequency of a behavior or characteristic in a particular
population. For instance, if 90% of the class gets passing grades then
that would be the base rate for the class. And if 1/125 of Americans die
in car accidents, then that would be the base rate of that behavior.
Base rates can be used to make predictions, and when the behavior is
very common or very rare, the total hit rate of predictions based on the
base rate can be better than that of many psychology tests. Nevertheless, psychology
tests are still used to predict very common and very rare behavior because
they make different kinds of errors than base rates. You could develop
a test which has a better false negative rate than a prediction based on
base rates. This test would be useful if you were interested in catching
all possible suicides. (Tests tend to overpredict the rare and underpredict
the usual.) Test-based predictions are likely to be best overall when base
rates approach 0.5.
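A rough sketch of this comparison follows. It uses the sensitivity (70%) and specificity (82%) listed for the Beck Depression Inventory at a cutting score of 13, and it makes the simplifying assumption that those two values stay the same whatever the base rate. The "base-rate prediction" here means always predicting the more common outcome.

```python
def hit_rate_with_test(base_rate, sensitivity, specificity):
    """Total hit rate of a test, assuming sensitivity and specificity
    do not change with the base rate (a simplifying assumption)."""
    return sensitivity * base_rate + specificity * (1 - base_rate)

def hit_rate_from_base_rate(base_rate):
    """Total hit rate from always predicting the more common outcome."""
    return max(base_rate, 1 - base_rate)

# Values from the BDI table at a cutting score of 13.
SENSITIVITY = 0.70
SPECIFICITY = 0.82

for p in (0.05, 0.15, 0.50):
    with_test = hit_rate_with_test(p, SENSITIVITY, SPECIFICITY)
    without = hit_rate_from_base_rate(p)
    print(f"base rate {p:.0%}: test {with_test:.1%}, base rate alone {without:.1%}")
```

Under these assumptions, the base rate alone wins on total hit rate when depression is rare (5% or 15%), while the test wins when the base rate is 50%. Remember that the base-rate prediction achieves this only by never predicting the rare outcome, so it misses every actual case.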
Test Administration
Administration is Standardized
-
by maintaining consistent test-giving procedures such as instructions,
prompts, time-limits
-
to minimize variability in test scores related to differences in administrative
procedures
Examiner Effects and Examiner-Subject Interactions
-
Although testing procedures can be standardized, differences between examiners
seem inescapable. The impact of those differences may depend on characteristics
of subjects.
Race
-
Maximal performance tests (e.g., I.Q., aptitude, achievement), other than
classroom tests, have traditionally been administered by professionals.
Historically in the U.S., this has meant that most administrators have
been white and at least middle class.
-
This "standardization of examiners" might minimize error.
-
However, examiner effects could be moderated by test-taker characteristics.
This might create bias, i.e., stable errors. For example, concerns about
the reactions of black children to white examiners have been persistently
raised.
-
Most research shows little or no effects of racial match between test-administrators
and test-takers. Literature reviews have concluded that effects of administrators'
race are minimal.
Rapport
-
Effects of the examiner-subject relationship are frequently found, and
sometimes large.
-
Pre-existing examiner-subject relationships may magnify these effects.
-
While extremes in rapport obviously matter, how does this relate to professional
testing? In less than extreme cases (the usual scenario in professional
settings), the effect is relatively minor.
Expectancy Effects
-
Examiners may subtly communicate performance expectations to subjects, who
may oblige them. This is especially the case in one-on-one scenarios.
-
Scoring too could be affected by examiner biases.
-
Research suggests such effects are usually weak and inconsistent with trained
examiners.
Response Incentives and Reinforcements
-
Effort affects most maximal performance measures, so incentives and rewards
do so indirectly
-
Praise, tokens, and candy have been studied; their effects are complex and
related to subject variables
-
Typical performance measures can be influenced by such variables too
-
How should these effects be managed? Consistency? Optimization for particular
test-takers? This is a very complicated question.
Subject Variables: Anxiety, illness, hormones
-
The Yerkes-Dodson curve shows an inverted-U relationship between anxiety and test
performance. Some forms of administration affect anxiety levels.
-
Extremes in physical health certainly influence test performance. Lesser
illnesses have little effect.
-
Hormonal effects are complex, theoretically interesting but typically small
in magnitude.
Interviews
-
You don't need to worry too much about this chapter. It goes into more
depth than you need for this course. Pay more attention to my notes
for this topic.
-
An interview is a conversation with a purpose.
General Characteristics of Interviews
Structure.
-
Structured interviews are standardized, much like tests. They have pre-designated
questions, sequence, even scoring. The interviewer is directive, steering
the conversations in predetermined directions. Standardization facilitates
comparison of individuals by providing a common focus and metric. Research
suggests structure can improve reliability and validity.
-
Unstructured interviews have a free flowing, spontaneous quality. The interviewer
is relatively nondirective, following the interviewee's lead. Open-ended
questions are the norm and "scoring" is informal. Such interviews can be
idiographic (a picture of the individual can be painted), sensitive to
individual uniqueness, and flexible.
Rapport
Interviewees are typically more disclosing, honest, and responsive
when their relationship with the interviewer is warm and comfortable. Interviewers
will therefore make attempts to be overtly pleasant, respectful, interested,
and nonjudgmental. They will facilitate communication with appropriate
responses and nonverbal behavior. The major exception to this is the stress
interview, where examiners intentionally create a challenging or threatening
interpersonal context to assess the interviewee's reactions to such situations.
Interactive
Interviews are interactions between two people, affected by
both. Interview outcomes thus depend on characteristics of both participants
and their interaction. Influence is reciprocal. Interviews are adaptive,
so responses determine the direction of subsequent explorations. Until
recently only interviews had this interactive component. Now computers
can also tailor questions as you go along.