Evaluating What a Test Really Measures
Validity: Does the test measure what it claims to measure?
Types of Validity
Content
Face
Criterion-Related (concurrent or predictive)
Construct
Content Validity
Adequately sampling the domain.
In personnel selection: job-related.
Determined by expert judgements.
Intended outcomes
Face Validity
The items look like they reflect whatever is being measured.
Uses experts to evaluate.
Chapter 8
Using Tests to Make Decisions:
Criterion-Related Validity
What is a criterion?
This is the standard by which your measure is being judged or evaluated.
Criterion-Related Validity
Predictive validity: correlating test scores with future behavior,
measured after examinees have had a chance to exhibit the predicted
behavior; e.g., success on the job.
Concurrent validity: correlating test scores with an independent,
currently available measure of the same trait that the test is designed
to measure.
E.g., teachers' ratings of reading ability validated by correlating with reading test scores.
Or being able to distinguish between groups known to be different; i.e., significantly different mean scores on the test.
In both predictive and concurrent validity, we validate by comparing scores with a criterion (the standard by which your measure is being judged or evaluated).
Selecting a Criterion
Objective criteria: observable and measurable; e.g., sales
figures, number of accidents, etc.
Subjective criteria: based on a person's judgment; e.g., employee
job ratings.
CRITERION MEASUREMENTS MUST THEMSELVES BE VALID!
Usually use content validity; e.g., supervisors' determination of
what job characteristics are important.
BOTH PREDICTOR AND CRITERION MEASURES MUST BE RELIABLE FIRST!
E.g., inter-rater reliability of the criterion measure.
Reliability estimates of predictors can be obtained by one of the
4 methods covered in Chapter 6.
Validity vs. Reliability
An unreliable test cannot be valid.
The validity of a test is limited by its reliability: the validity
coefficient cannot exceed the square root of the reliability coefficient,
so an unreliable test can't correlate highly with any other measure.
Correlation Between Predictor and Criterion
Coefficient of determination: r² tells us how much covariation
exists between predictor and criterion; e.g., if r = .7, then 49% of the
variance is common to both.
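The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the five pairs of scores are hypothetical, not data from the course.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def coefficient_of_determination(r):
    """Proportion of variance shared by predictor and criterion (r squared)."""
    return r ** 2

# The example from the notes: r = .7 gives 49% shared variance.
print(f"r = .70 -> shared variance = {coefficient_of_determination(0.7):.0%}")

# Hypothetical predictor (test) and criterion (rating) scores for five people.
test_scores = [1, 2, 3, 4, 5]
job_ratings = [2, 1, 4, 3, 5]
r = pearson_r(test_scores, job_ratings)
print(f"r = {r:.2f}, r squared = {coefficient_of_determination(r):.2f}")
```

The correlation is computed by hand here so the sketch does not depend on any statistics library.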
Using Validity Information To Make Predictions
Decide what is success on the criterion; e.g., job performance
6 months after hire.
Determine what minimum predictor score (cut score) will predict
success on the job.
Outcomes of Prediction
Hits: a) True positives - predicted to succeed and did.
b) True negatives - predicted to fail and did.
Misses: a) False positives - predicted to succeed and didn't.
b) False negatives - predicted to fail and would have succeeded.
WE WANT TO MAXIMIZE TRUE HITS AND MINIMIZE MISSES!
The predictive validity correlation determines the accuracy of prediction:
HIGHER r = MORE ACCURATE PREDICTION
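The four prediction outcomes can be tallied with a short Python sketch. It assumes a validation study in which the criterion outcome is known for everyone, including people who scored below the cut score. The scores, outcomes, cut score of 60, and function name are all hypothetical.

```python
def classify_outcomes(predictor_scores, succeeded, cut_score):
    """Count hits and misses for a given cut score.

    A person is predicted to succeed when their score is at or above
    the cut score (an assumption of this sketch).
    """
    counts = {
        "true_positive": 0,   # predicted to succeed and did
        "true_negative": 0,   # predicted to fail and did
        "false_positive": 0,  # predicted to succeed and didn't
        "false_negative": 0,  # predicted to fail but succeeded
    }
    for score, success in zip(predictor_scores, succeeded):
        predicted_success = score >= cut_score
        if predicted_success and success:
            counts["true_positive"] += 1
        elif not predicted_success and not success:
            counts["true_negative"] += 1
        elif predicted_success and not success:
            counts["false_positive"] += 1
        else:
            counts["false_negative"] += 1
    return counts

# Hypothetical validation sample of eight people.
scores = [55, 62, 70, 48, 81, 66, 59, 74]
succeeded = [False, True, True, False, True, False, True, True]

outcomes = classify_outcomes(scores, succeeded, cut_score=60)
hits = outcomes["true_positive"] + outcomes["true_negative"]
misses = outcomes["false_positive"] + outcomes["false_negative"]
print(outcomes)
print(f"hits = {hits}, misses = {misses}, hit rate = {hits / len(scores):.2f}")
```

Rerunning the function with different cut scores shows how moving the cut score trades false positives against false negatives.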
Chapter 9
Construct Validity
What is a construct?
An imaginary trait or disposition inferred from observations of
specific instances of behavior that have something in common.
E.g., assertiveness, OCD, etc.
Use indirect measures of the construct, e.g., a scale which contains
examples of behaviors that we consider evidence of the construct.
But how can we validate that scale?
Construct Validity
Comparing high vs. low scoring people on behavior implied by the
construct.
Or by comparing groups known to differ on the construct; e.g., KKK
members vs. NAACP members on Attitudes Toward Blacks scale.
Unidimensionality of the construct being measured; i.e., homogeneity
of items.
ONLY ONE CONSTRUCT CAN BE MEASURED VALIDLY BY ONE SCALE!
Construct validity requires homogeneous items (high internal consistency
reliability); therefore the scale must be unidimensional!
Convergent Validity
Convergent validity: agreement among ratings, scales, or measurements
gathered independently of one another, where the measures should be
theoretically related.
Discriminant Validity
Discriminant validity: the lack of a relationship among measures
that theoretically should not be related.
Multitrait-Multimethod Design
Searching for convergence across different measures of the same thing
and for divergence between measures of different things.
E.g., a scale of intelligence should correlate with a measure of
verbal ability but not with assertiveness.
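The intelligence example can be illustrated with a simplified Python sketch. It is not a full multitrait-multimethod matrix, which would cross several traits with several methods. It only checks one convergent and one discriminant correlation. The scores for the six examinees are hypothetical and were chosen to make the pattern obvious.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for six examinees (illustrative only).
intelligence = [1, 2, 3, 4, 5, 6]
verbal_ability = [2, 1, 4, 3, 6, 5]
assertiveness = [3, 1, 2, 2, 1, 3]

# Convergent evidence: theoretically related measures should correlate.
convergent_r = pearson_r(intelligence, verbal_ability)
# Discriminant evidence: theoretically unrelated measures should not.
discriminant_r = pearson_r(intelligence, assertiveness)

print(f"intelligence vs. verbal ability: r = {convergent_r:.2f}")
print(f"intelligence vs. assertiveness:  r = {discriminant_r:.2f}")
```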
Chapter 10
Developing Psychological Tests
Developing a test plan
Defining the construct
Choosing the test format
Specifying admin and scoring methods
Developing the test itself
Defining the construct
Operationalizing the construct in terms of observable behaviors.
Job analysis in terms of the knowledge, skills, abilities, and other
characteristics (KSAOs): what it takes to succeed.
Learning objectives
Choosing the test format and Composing the test items
Objective items
1. Multiple choice
2. True/False
3. Forced choice
Subjective items
1. Essay
2. Interview
3. Projective
4. Sentence completion
Specifying Scoring Methods
Cumulative
Categorical
Ipsative
Types of Response Bias
Response sets
Social desirability
Acquiescence
Random responding
Faking
Writing Good Items
Follow the test plan.
Base each item on a learning objective.
Items should not be answerable from a student's general knowledge.
Write each item in a clear, direct manner.
Use appropriate language.
Make all items independent.
Have an expert review the items.
Multiple Choice Items
Avoid negatives
All choices should be similar in length and style.
Only one answer correct or best.
Avoid overlapping choices.
Avoid "all of the above" and "none of the above."
Comparing objective vs. subjective formats
Objective items provide better content validity.
Objective items are more difficult to construct.
Objective items are easier to score, and can be scored more accurately.
Subjective items are easier and quicker to write and can assess higher-order
skills, but they are harder to grade and less valid.
Writing Admin Instructions
Setting
Specific requirements
Time limits
Admin script
Chapter 11
Piloting and Revising Tests
The Pilot Test
A scientific investigation of the new test's reliability and validity.
Using a sample of the people for whom the test is intended.
Quantitative Item Analysis
How valid is the item itself?
Item difficulty: p should be between .2 and .8.
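Item discrimination: D = U - L, where U and L are the proportions of the
upper- and lower-scoring groups who answered the item correctly.
D can reach its maximum only when p = .5.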
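Item difficulty and discrimination can be computed with a small Python sketch. The notes do not say how the upper and lower groups are formed. This sketch assumes the top three and bottom three examinees by total score. The ten examinees and their scores are hypothetical.

```python
def item_difficulty(item_responses):
    """p: proportion of examinees who answered the item correctly (1 = correct)."""
    return sum(item_responses) / len(item_responses)

def item_discrimination(item_responses, total_scores, group_size):
    """D = U - L, using the top and bottom `group_size` examinees by total score."""
    # Indices of examinees ordered from highest to lowest total score.
    order = sorted(range(len(total_scores)),
                   key=lambda i: total_scores[i], reverse=True)
    upper = order[:group_size]
    lower = order[-group_size:]
    u = sum(item_responses[i] for i in upper) / group_size
    l = sum(item_responses[i] for i in lower) / group_size
    return u - l

# Hypothetical pilot data: total test scores and responses to one item.
total_scores = [19, 18, 17, 15, 14, 12, 10, 9, 7, 5]
item = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]

p = item_difficulty(item)
d = item_discrimination(item, total_scores, group_size=3)
print(f"p = {p:.2f} (within .2 to .8: {0.2 <= p <= 0.8})")
print(f"D = {d:.2f}")
```

With equal-sized halves as the groups, D can reach 1.0 only if everyone in the upper half passes and everyone in the lower half fails, which forces p = .5.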
Inter-Item Correlations
A measure of internal consistency or homogeneity.
Items that don't correlate with others may be measuring other constructs.
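One way to spot such items is to average each item's correlations with all the other items. The Python sketch below does this on hypothetical responses from eight examinees. The .10 flagging threshold is an arbitrary assumption for illustration, not a rule from the notes.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_inter_item_correlations(items):
    """For each item, average its correlation with every other item."""
    result = {}
    for name, responses in items.items():
        others = [pearson_r(responses, other_responses)
                  for other_name, other_responses in items.items()
                  if other_name != name]
        result[name] = sum(others) / len(others)
    return result

# Hypothetical right/wrong responses (1 = correct) from eight examinees.
items = {
    "item1": [1, 1, 1, 1, 0, 0, 0, 0],
    "item2": [1, 1, 1, 0, 1, 0, 0, 0],
    "item3": [1, 0, 0, 1, 1, 0, 0, 1],
}

averages = mean_inter_item_correlations(items)
THRESHOLD = 0.10  # arbitrary cutoff chosen for this illustration
flagged = [name for name, avg in averages.items() if avg < THRESHOLD]
print(averages)
print("items to review:", flagged)
```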
Item Characteristic Curves
Relates the performance of each item to the testee's ability on the
construct being measured, i.e., his/her score on the test.
Item characteristic curve (ICC): a graph of the probability of answering
an item correctly as a function of the level of ability on the construct
being measured.
Measures p and D.
Item characteristic curve (ICC): the greater the slope, the
greater the discrimination.
The lower the height of the curve, the more difficult the item.
Different curves for different groups of testees can indicate bias.
Concerns for each item
Difficulty: what % of testees got it correct.
Discrimination: how well it discriminates between high and
low scorers.
Validity: how well it correlates with test score.
Validation and Cross-Validation
Try out on a sample different from the pilot test.
Differential Validity
Different validity coefficients for different subgroups
(e.g., men vs. women) are O.K.
Unfair discrimination means that persons with equal chances of success
on the job have unequal probabilities of being hired for the job.
Developing Cut Scores
Minimum score for acceptance or hiring.
May use a panel of experts.
Or use the actual correlation of the predictor test with success on the job
or in college.
Developing Norms
Administer the test to a large random sample from the population.
Sometimes, subgroup norms also.
Chapter 12
Survey Research vs. Tests
Tests measure individual behavior.
Surveys measure group behavior (thoughts, feelings, attitudes, actions,
etc.)
Causal vs. Correlational Methods
Experimental research techniques: examining IVs' effects on a DV while
controlling for RVs; the only way to determine cause-and-effect relationships.
Descriptive (correlational) research techniques: simply looking at how the
frequency of one behavior relates to the occurrence of another; e.g.,
suicide rates vary with amount of country music played.
The Survey Method
Clear objectives.
Clear and unbiased questions.
Administered to a representative sample taken from a population.
Answers analyzed to answer the objectives.
Unbiased, objective reporting.
Reliable and valid.
Types of Surveys
Self-administered; e.g., printed, mailed
Personal interviews:
Face-to-face
Telephone
Developing Survey Questions
Open-ended questions
Closed-ended questions, including multiple-choice (Likert), ranking,
and rating questions.
Rules for Writing Questions
Clear and unambiguous (Check your sex.)
Use appropriate rating scales and response options.
Include appropriate categorical alternatives (including "other").
No double-barreled questions.
Appropriate reading level (4th grade).
No leading or loaded questions.
Sources of Error
Questions
Sampling
Types of Samples
Probability (random): each member has an equal chance of being chosen.
Simple random sampling
Systematic sampling
Stratified random sampling
Cluster sampling
Nonprobability (convenience) sampling.
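The four probability designs listed above can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the population of 100 numbered people, the strata and cluster names, the sample sizes, and the seed are all hypothetical.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has the same chance of selection."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Pick a random start, then take every k-th member."""
    step = len(population) // n
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(n)]

def stratified_random_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample from within each stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep everyone in them."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)  # arbitrary seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 people

# Hypothetical strata and clusters built from the same population.
strata = {"first_year": population[:50], "senior": population[50:]}
clusters = {
    "class_A": population[0:25],
    "class_B": population[25:50],
    "class_C": population[50:75],
    "class_D": population[75:100],
}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_random_sample(strata, 5, rng)
clustered = cluster_sample(clusters, 2, rng)
print(srs, systematic, stratified, clustered, sep="\n")
```

A convenience sample, by contrast, would simply take whoever is easiest to reach, so no random draw is involved.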