Psychology 121, Lecture 7
by Hal S. Kopeikin, Ph.D. © 2000
Announcements
Mid-terms were scored and the grades have been posted. The grades are approximate.
If you have questions about your exam, use office hours to discuss it.
Reviewing the test might highlight what you learned and what you need to
study.
See the class web page for other details and announcements.
Test Construction & Selection
Today we will be looking at ways of building and selecting
tests. We will look briefly at how to build tests, but our focus will be
on evaluating and selecting them.
Tests are no better than the items composing them. Of course, they
could be worse (imagine a midterm with nothing but great questions about
reliability).
Item formats
The two basic types of items are
-
forced-choice and
-
free-response.
You should note that good items on a test are a necessary but not
sufficient condition for a good test. A test could have excellent questions
but low content validity.
Forced choice items
-
Forced choice items are answered by selecting from alternatives.
-
Examples: True/False, Multiple Choice, Matching, Likert Scales,
Category Scales (e.g., on a scale from 1 to 10), Checklists (e.g., check
the adjectives that apply to you), Q-Sorts.
Advantages of forced choice items
-
Administration is quick, easy, and reliable. This makes validity easier
to achieve.
-
Permits rapid sampling of broad content domains (often supporting content
validity). Quick items allow more items, and more items mean more reliability,
thus potentially more validity.
-
Method variance is relatively less of a bias here than writing skill is
on written items. This strength is often overlooked. People often think
that there is a bias in forced choice questions because some people are
better at taking these types of tests than others. For example, some people
know that 'c' is the most common answer, that longer answers are more frequently
right than short answers, etc. Although this does produce a bias,
it is generally smaller than the bias encountered with longer
written answers.
-
Some people think forced choice items do not afford partial credit, but
this is not true across items. If you can eliminate two out of four answers
on a number of multiple choice questions, your score will be higher than
that of those who can only eliminate one answer. In this way, you will be credited
for your partial knowledge.
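The partial-credit point above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming four-alternative items and a test taker who guesses uniformly among whatever alternatives he or she cannot eliminate. The function name and the 20-item test are made up for illustration.

```python
from fractions import Fraction

def expected_score(n_items, n_options, n_eliminated):
    """Expected number correct when guessing uniformly among the
    alternatives that remain after eliminating n_eliminated wrong ones.
    (Hypothetical helper; the uniform-guessing assumption is ours.)"""
    remaining = n_options - n_eliminated
    if remaining < 1:
        raise ValueError("cannot eliminate every alternative")
    return n_items * Fraction(1, remaining)

# Compare test takers who can rule out 0, 1, or 2 wrong answers per item.
for eliminated in (0, 1, 2):
    score = expected_score(20, 4, eliminated)
    print(f"eliminate {eliminated}: expect {float(score):.2f} of 20 correct")
```

Under these assumptions, the person who can eliminate two alternatives expects 10 of 20 correct, versus about 6.67 for the person who can eliminate only one.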
Disadvantages of forced choice items
-
It is easier, and common, to write items measuring (& inspiring) superficial,
rote learning than items encouraging and testing integration and application
of knowledge. Forced choice questions can demand critical thinking, analysis,
etc., but in practice they often fail to.
-
The bias that occurs in essay questions is arguably not really a bias, since
one of the skills you should rightly be tested on is your writing ability.
But forced-choice test-taking ability is not usually a relevant skill;
hence, to the extent that this ability affects the results, it is clearly
a bias.
Comments
-
Most people do not correct for guessing on forced choice tests although
it is done on some high-profile tests (SAT, GRE, etc.). The jury is out
on whether it really affects anything. Most psychometricians don't think
it matters.
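The lecture does not say which correction is meant. As an illustration, here is a minimal sketch of one commonly used correction for guessing: corrected score = right - wrong / (k - 1), where k is the number of alternatives per item and omitted items count as neither right nor wrong. The function name is hypothetical.

```python
def corrected_score(n_right, n_wrong, n_options):
    """Classic correction for guessing: R - W / (k - 1).
    Omitted items are not counted as wrong. (Illustrative helper;
    the formula is an assumption, not taken from the lecture.)"""
    if n_options < 2:
        raise ValueError("items need at least two alternatives")
    return n_right - n_wrong / (n_options - 1)

# A pure guesser on 40 four-alternative items expects 10 right and 30 wrong,
# so the correction brings the expected score down to zero.
print(corrected_score(10, 30, 4))

# A knowledgeable test taker loses only a little.
print(corrected_score(30, 8, 5))
```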
Free response items
Examples: Fill in the blank, short answer, essay. These questions
allow you to create the answer rather than choose one.
Advantages
-
With these questions you are doing recall rather than recognition. This
usually requires deeper storage and the ability to retrieve information.
This is usually what is required in the real world and hence this is a
strength of this kind of question.
-
It is easier to write items assessing (& encouraging) organization
and integration of knowledge.
-
Although fill in the blank questions are free response items, they are not
heavily influenced by writing ability, so there is little bias as a result
of strong or weak writing.
-
Short answers allow fairly broad coverage (& more items) of the relevant
material, so high content validity is fairly easy to achieve. Essay questions
take much longer to answer, so it is harder to cover all
the material.
Disadvantages
-
It is much harder to score these questions reliably (especially essays, somewhat
for short answers, less for fill in the blanks).
-
Writing skill is usually a bias. This bias may have a dramatic effect in
essay questions (less for fill in the blanks).
-
These items are time consuming to administer & score, so content validity &
reliability are threatened (especially with essays).
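The link between number of items and reliability, mentioned for both item formats above, can be made concrete. The lecture gives no formula; the sketch below assumes the Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened or shortened with comparable items. The starting reliability of .60 is made up.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor,
    assuming the added (or removed) items are comparable to the existing ones.
    (Assumed formula: k*r / (1 + (k - 1)*r); not stated in the lecture.)"""
    k, r = length_factor, reliability
    return k * r / (1 + (k - 1) * r)

# Doubling a test with reliability .60 versus cutting it in half.
print(round(spearman_brown(0.60, 2), 3))
print(round(spearman_brown(0.60, 0.5), 3))
```

Under this assumption, doubling the test raises predicted reliability to .75, while halving it drops predicted reliability to about .43. This is one way to see why slow-to-answer formats that permit few items put reliability at risk.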
Item Analyses
-
Difficulty
BEWARE: scales are usually named so that a high score indicates more of
the scale's name. Difficulty violates this convention. Difficulty = % who
get the item correct, so it probably should be called "easiness" instead. Statistically,
scores will have maximum variance when difficulty is halfway between
100% and chance. For a four-alternative multiple choice item, chance is 25%,
so that point is 62.5%. Thus, reliability and validity
are potentially maximized around that difficulty.
-
Distractors
Distractors are the "incorrect" alternatives. Theoretically, more are
better, as long as they are good (good distractors are attractive if you
don't know the answer, but not if you do). In practice, it is hard to find
more than 3 or 4 good distractors, so most multiple choice items have that
many. Distractors are usually analyzed by looking at the number of people
endorsing them; if that number is near zero, the alternative is not doing much
distracting. With a good distractor
there will be a negative correlation between the amount of knowledge a
person has and the frequency with which he/she picks the distractor.
-
Discriminability
High discriminability is good. Discriminability measures whether those
"high" on the rest of the test score more "highly" on the item in question.
For instance, in determining whether an essay question is good we would
like to know if there is a strong correlation between an individual's score
on the essay and his/her score on the forced choice questions on the test.
There are two common ways of measuring discriminability: (1) correlation
between that item and score on the rest of the test (as in the example
above), and (2) difference in the proportion correct for those in the top-
and bottom-thirds on the rest of the test.
Item characteristic curves also reveal performance differences on the
item as a function of general level on the test (see pp 165-167, Figures
6-3, 6-4, 6-5, 6-6, 6-7).
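The item statistics described above can be computed from a table of right/wrong scores. The following is a minimal sketch, not a standard package: the function names and the tiny data set (6 test takers, 4 items) are made up. It computes difficulty as the proportion correct, and discriminability both ways: the item-rest correlation and the top-third minus bottom-third difference.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5

def upper_lower_difference(item, rest):
    """Proportion correct in the top third minus the bottom third,
    where thirds are formed from scores on the rest of the test."""
    k = len(item) // 3
    if k == 0:
        return float("nan")
    order = sorted(range(len(item)), key=lambda i: rest[i])
    bottom = [item[i] for i in order[:k]]
    top = [item[i] for i in order[-k:]]
    return mean(top) - mean(bottom)

def item_analysis(responses):
    """responses: one row per test taker, each a list of 0/1 item scores."""
    results = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # score without item j
        results.append({
            "difficulty": mean(item),  # proportion correct ("easiness")
            "item_rest_r": pearson(item, rest),
            "upper_lower": upper_lower_difference(item, rest),
        })
    return results

# Made-up data: 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]
for number, stats in enumerate(item_analysis(responses), start=1):
    print(number, {name: round(value, 2) for name, value in stats.items()})
```

Each item is correlated with the score on the rest of the test, not the total score, so that the item does not inflate its own discriminability.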
Item Response Theory
With testing by computer becoming more feasible, there is a
growing interest in attempting to improve testing by assessing the probability
of getting an item correct, given a certain level of ability. This is done
by tailoring the difficulty of later questions to performance on earlier
questions. Ideally, each test taker receives more items that have a .5 difficulty for people
at his/her ability level. This can maximize variance, enhancing reliability
and validity.
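The tailoring idea can be sketched in a few lines. The lecture does not specify a model, so this sketch assumes a one-parameter logistic (Rasch-type) model, in which the probability of a correct answer depends only on ability minus item difficulty. Note that in this model a higher difficulty value means a harder item, unlike the percent-correct "difficulty" above. The function names, the item bank, and the ability estimate are all hypothetical.

```python
import math

def p_correct(ability, item_difficulty):
    """Assumed one-parameter logistic model: probability of a correct
    answer given ability and item difficulty (same scale for both)."""
    return 1.0 / (1.0 + math.exp(-(ability - item_difficulty)))

def pick_next_item(ability, bank, used):
    """Choose the unused item whose predicted probability of success
    is closest to .5 for a test taker at the given ability estimate."""
    candidates = [i for i in range(len(bank)) if i not in used]
    if not candidates:
        raise ValueError("item bank exhausted")
    return min(candidates, key=lambda i: abs(p_correct(ability, bank[i]) - 0.5))

# Made-up bank of item difficulties; item 2 has already been administered.
bank = [-2.0, -1.0, 0.0, 1.0, 2.0]
next_item = pick_next_item(0.8, bank, used={2})
print(next_item, round(p_correct(0.8, bank[next_item]), 2))
```

A real adaptive test would also re-estimate ability after each response; that step is omitted here.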
Selecting Tests
Professional standards set minimum expectations for published tests. Included
in these are stipulations regarding the information included in the test manual.
For example, basic information on reliability, validity, & norms is required.
The text lists seven references. You should read about all of them but
be particularly familiar with these two:
-
Tests in Print III-- This provides basic information on most published
tests. But it is more descriptive than evaluative.
-
A more critical, in-depth evaluation of tests is found
in the Mental Measurements Yearbook.
This book has critical reviews of most major tests.