Six degrees of integration: an agenda for joined-up assessment Dylan Wiliam Annual Conference of...


Six degrees of integration: an agenda for joined-up assessment

Dylan Wiliam

www.dylanwiliam.net

Annual Conference of the Chartered Institute of Educational Assessors, London: 23 April 2008

Overview: Six degrees of integration

Function: Formative versus summative

Quality: Validity versus reliability

Format: Multiple-choice versus constructed response

Scope: Continuous versus one-off

Authority: Teacher-produced versus expert-produced

Locus: School-based versus externally marked

Function | Quality | Format | Scope | Authority | Locus

A statement of the blindingly obvious

You can’t work out how good something is until you know what it’s intended to do…

Function, then quality

Formative and summative

Descriptions of
  Instruments
  Purposes
  Functions

An assessment functions formatively when evidence about student achievement elicited by the assessment is interpreted and used to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions that would have been taken in the absence of that evidence.

Gresham’s law and assessment

Usually (incorrectly) stated as “Bad money drives out good”

“The essential condition for Gresham's Law to operate is that there must be two (or more) kinds of money which are of equivalent value for some purposes and of different value for others” (Mundell, 1998)

The parallel for assessment: Summative drives out formative

The most that summative assessment (more properly, assessment designed to serve a summative function) can do is keep out of the way

Function | Quality | Format | Scope | Authority | Locus

Reliability

Reliability is a measure of the stability of assessment outcomes under changes in things that (we think) shouldn’t make a difference, such as
  marker/rater
  occasion
  item selection

Test length and reliability

            From
To      0.70   0.75   0.80   0.85   0.90   0.95
0.70    1.0
0.75    1.3    1.0
0.80    1.7    1.3    1.0
0.85    2.4    1.9    1.4    1.0
0.90    3.9    3.0    2.3    1.6    1.0
0.95    8.1    6.3    4.8    3.4    2.1    1.0

Just about the only way to increase the reliability of a test is to make it longer, or narrower (which amounts to the same thing).
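The figures in the table are consistent with the Spearman-Brown prophecy formula. The slide does not name that formula, so treating it as the underlying rule is an assumption. A minimal sketch in Python, with `lengthening_factor` and `lengthened_reliability` as hypothetical helper names:

```python
def lengthening_factor(r_from: float, r_to: float) -> float:
    """Factor by which a test must be lengthened (with comparable items)
    to move its reliability from r_from to r_to, per Spearman-Brown."""
    if not (0 < r_from < 1 and 0 < r_to < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (r_to * (1 - r_from)) / (r_from * (1 - r_to))


def lengthened_reliability(r_from: float, factor: float) -> float:
    """Predicted reliability after lengthening a test by `factor`."""
    return (factor * r_from) / (1 + (factor - 1) * r_from)


if __name__ == "__main__":
    # A few (from, to) pairs taken from the reliability values on the slide.
    for r_from, r_to in [(0.70, 0.95), (0.75, 0.90), (0.85, 0.95)]:
        print(r_from, r_to, round(lengthening_factor(r_from, r_to), 1))
```

Under this assumption, raising reliability from 0.70 to 0.95 needs a test roughly eight times as long, which is why gains in reliability become so expensive near the top of the scale.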

Reliability is not what we really want

Take a test which is known to have a reliability of around 0.90 for a particular group of students.
  Administer the test to the group of students and score it
  Give each student a random script rather than their own
  Record the scores assigned to each student

What is the reliability of the scores assigned in this way?
  A. 0.10
  B. 0.30
  C. 0.50
  D. 0.70
  E. 0.90
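One way to explore this question is to simulate it. The sketch below assumes that reliability is estimated with Cronbach's alpha, an internal-consistency coefficient that the slide does not name, and it uses invented item data; `cronbach_alpha` and `simulate_scripts` are hypothetical helpers. Under that assumption the coefficient depends only on the set of scripts, not on which student each script is attributed to.

```python
import random
from statistics import pvariance


def cronbach_alpha(scripts):
    """Cronbach's alpha for a list of scripts (each a list of item scores)."""
    k = len(scripts[0])
    item_vars = [pvariance([s[i] for s in scripts]) for i in range(k)]
    total_var = pvariance([sum(s) for s in scripts])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def simulate_scripts(n_students=500, n_items=40, seed=1):
    """Invented data: each item score is the student's ability plus noise."""
    rng = random.Random(seed)
    scripts = []
    for _ in range(n_students):
        ability = rng.gauss(0, 1)
        scripts.append([ability + rng.gauss(0, 2) for _ in range(n_items)])
    return scripts


scripts = simulate_scripts()
alpha_own = cronbach_alpha(scripts)

# Give every student a random script instead of their own.
shuffled = scripts[:]
random.Random(2).shuffle(shuffled)
alpha_random = cronbach_alpha(shuffled)

print(round(alpha_own, 3), round(alpha_random, 3))
```

The two coefficients agree, even though the reassigned scores carry no information about the students who receive them. An estimate based on retesting the same students would behave differently.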

Reliability v consistency

Classical measures of reliability
  are meaningful only for groups
  are designed for continuous measures

Marks versus grades
  Scores suffer from spurious accuracy
  Grades suffer from spurious precision

Classification consistency
  A more technically appropriate measure of the reliability of assessment
  Closer to the intuitive meaning of reliability

Reliability & classification consistency

Classification consistency of National Curriculum Assessment in England

                    Reliability
Key stage  Levels   0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95
KS1        3        73%   75%   77%   79%   81%   83%   86%   90%
KS2        5        56%   58%   60%   64%   68%   73%   77%   84%
KS3        8        45%   47%   50%   54%   57%   62%   68%   76%
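The link between reliability, number of levels and classification consistency can be explored with a small simulation. This is a sketch under invented assumptions (normally distributed true scores, equal-width level bands on a hypothetical scale, two parallel administrations), so it illustrates the trend rather than reproducing the figures in the table.

```python
import random


def classification_consistency(reliability, n_levels, n_students=20000, seed=0):
    """Share of simulated students placed in the same level on two
    parallel administrations of a test with the given reliability."""
    rng = random.Random(seed)
    # With true-score sd fixed at 1, reliability = 1 / (1 + error variance).
    error_sd = ((1 - reliability) / reliability) ** 0.5
    # Hypothetical cut scores: equal-width bands between -2 and +2 sd.
    cuts = [-2 + 4 * i / n_levels for i in range(1, n_levels)]

    def level(score):
        # Level index = number of cut scores the observed score exceeds.
        return sum(score > c for c in cuts)

    same = 0
    for _ in range(n_students):
        true = rng.gauss(0, 1)
        first = true + rng.gauss(0, error_sd)
        second = true + rng.gauss(0, error_sd)
        same += level(first) == level(second)
    return same / n_students


for levels in (3, 5, 8):
    row = [classification_consistency(r, levels) for r in (0.60, 0.80, 0.95)]
    print(levels, [f"{p:.0%}" for p in row])
```

In this sketch consistency rises with reliability and falls as the number of levels grows, the same two trends that the table shows.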

Validity

Traditional definition: a property of assessments
  A test is valid to the extent that it assesses what it purports to assess
  Key properties (content validity)
    Relevance
    Representativeness

Fallacies
  Two tests with the same name assess the same thing
  Two tests with different names assess different things
  A test valid for one group is valid for all groups

Trinitarian doctrines of validity

Content validity

Criterion-related validity
  Concurrent validity
  Predictive validity

Construct validity

Validity

Validity is a property of inferences, not of assessments

“One validates, not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971; emphasis in original)

The phrase “A valid test” is therefore a category error (like “A happy rock”)
  No such thing as a valid (or indeed invalid) assessment
  No such thing as a biased assessment

Reliability is a pre-requisite for validity
  Talking about “reliability and validity” is like talking about “swallows and birds”
  Validity includes reliability

Modern conceptions of validity

Validity subsumes all aspects of assessment quality
  Reliability
  Representativeness (content coverage)
  Relevance
  Predictiveness

But not impact (Popham: right concern, wrong concept)

“Validity is an integrative evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989 p. 13)

Consequential validity? No such thing!

“As has been stressed several times already, it is not that adverse social consequences of test use render the use invalid, but, rather, that adverse social consequences should not be attributable to any source of test invalidity such as construct-irrelevant variance. If the adverse social consequences are empirically traceable to sources of test invalidity, then the validity of the test use is jeopardized. If the social consequences cannot be so traced – or if the validation process can discount sources of test invalidity as the likely determinants, or at least render them less plausible – then the validity of the test use is not overturned. Adverse social consequences associated with valid test interpretation and use may implicate the attributes validly assessed, to be sure, as they function under the existing social conditions of the applied setting, but they are not in themselves indicative of invalidity.” (Messick, 1989, pp. 88-89)

Threats to validity

Inadequate reliability

Construct-irrelevant variance
  Differences in scores are caused, in part, by differences not relevant to the construct of interest
  The assessment assesses things it shouldn’t
  The assessment is “too big”

Construct under-representation
  Differences in the construct are not reflected in scores
  The assessment doesn’t assess things it should
  The assessment is “too small”

With clear construct definition all of these are technical, not value, issues
But they interact strongly…

School effectiveness

Do differences in exam results support inferences about school quality?

Key issues:
  Value-added
  Sensitivity to instruction
    Learning is slower than generally assumed
    Sensitivity to instruction of tests is exacerbated by test-construction procedures

Result: invalid attributions about the effects of schooling

Learning is hard and slow…

[Figure: facility (0.00 to 1.00) plotted against age (6 to 12 years) for the item 860 + 570 = ? Source: Leverhulme Numeracy Research Programme]

Why does this matter?

In England, school-level effects account for only 7% of the variability in GCSE scores.

In terms of value-added, there is no statistically significant difference between the middle 80 percent of English secondary schools

Correlation between teacher quality and student progress is low:
  Average cohort progress: 0.3 sd per year
  Good teachers (+1 sd) produce 0.4 sd per year
  Poor teachers (-1 sd) produce 0.2 sd per year


So…

Although teacher quality is the single most important determinant of student progress…

…the effect is small compared to the accumulated achievement over the course of a learner’s education…

…so that inferences that school outcomes are indications of the contributions made by the school are unlikely to be valid.
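The scale of this argument can be checked with the per-year figures quoted earlier in the talk. This is a rough sketch that assumes, purely for illustration, eleven years of schooling at a constant average rate of progress; neither assumption comes from the talk.

```python
AVERAGE_PROGRESS = 0.3   # sd per year, average cohort (figure from the talk)
GOOD_TEACHER = 0.4       # sd per year with a +1 sd teacher (figure from the talk)
POOR_TEACHER = 0.2       # sd per year with a -1 sd teacher (figure from the talk)
YEARS_OF_SCHOOLING = 11  # assumption, for illustration only

# Achievement accumulated over the whole of schooling at the average rate.
accumulated = AVERAGE_PROGRESS * YEARS_OF_SCHOOLING

# Difference made by one year with a good rather than a poor teacher.
one_year_gap = GOOD_TEACHER - POOR_TEACHER

print(f"Accumulated achievement: {accumulated:.1f} sd")
print(f"One-year gap, good versus poor teacher: {one_year_gap:.1f} sd")
print(f"Gap as a share of accumulated achievement: {one_year_gap / accumulated:.0%}")
```

Under these assumptions, the difference one year of teaching can make is a small fraction of what a learner brings from all the preceding years.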

Function | Quality | Format | Scope | Authority | Locus

Item formats

“No assessment technique has been rubbished quite like multiple choice, unless it be graphology” (Wood, 1991, p. 32)

Myths about multiple-choice items
  They are biased against females
  They assess only candidates’ ability to spot or guess
  They test only lower-order skills

Comparing like with like…

Constructed-response items
  Can be improved through guidance to markers
  Can be developed cheaply, but are expensive to score
  For a one-hour year-cohort assessment in England
    Development: £5 000
    Scoring: £1 000 000

Multiple-choice items
  Cannot be improved through guidance to markers
  Are expensive to develop, but are cheap to score
  For a one-hour year-cohort assessment in England
    Development: £1 000 000?
    Scoring: £5 000

Mathematics 1

What is the median for the following data set?

38 74 22 44 96 22 19 53

A. 22
B. 38 and 44
C. 41
D. 46
E. 77
F. This data set has no median
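A short check of this item, using Python's standard `statistics` module. The mapping from each distractor to a possible misconception is an interpretation, not something stated on the slide.

```python
from statistics import mean, median, mode

data = [38, 74, 22, 44, 96, 22, 19, 53]
ordered = sorted(data)

# The two central values of the sorted list (there are eight values).
middle_pair = ordered[len(ordered) // 2 - 1 : len(ordered) // 2 + 1]

print("median:", median(data))          # mean of the two middle values
print("mode:", mode(data))              # most frequent value (option A?)
print("middle pair:", middle_pair)      # middle values left unaveraged (option B?)
print("mean:", mean(data))              # arithmetic mean (option D?)
print("range:", max(data) - min(data))  # largest minus smallest (option E?)
```

If this reading is right, each wrong answer points to a specific confusion, which is what makes the item useful diagnostically.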

Mathematics 2

What can you say about the means of the following two data sets?

Set 1: 10 12 13 15

Set 2: 10 12 13 15 0

A. The two sets have the same mean.

B. The two sets have different means.

C. It depends on whether you choose to count the zero.
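The two means can be checked directly. The sketch below uses Python's standard `statistics` module; the zero is a data point like any other, so it changes the count but not the total.

```python
from statistics import mean

set_1 = [10, 12, 13, 15]
set_2 = [10, 12, 13, 15, 0]

# Same total, different number of values, so the means differ.
print(sum(set_1), len(set_1), mean(set_1))
print(sum(set_2), len(set_2), mean(set_2))
```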

Mathematics 3

Which of the shapes below contains a dotted line that is also a diagonal?


Wilson & Draney, 2004

Science

The ball sitting on the table is not moving. It is not moving because:

A. no forces are pushing or pulling on the ball.

B. gravity is pulling down, but the table is in the way.

C. the table pushes up with the same force that gravity pulls down.

D. gravity is holding it onto the table.

E. there is a force inside the ball keeping it from rolling off the table.

Science 2

You look outside and notice a very gentle rain. Suddenly, it starts raining harder. What happened?

A. A cloud bumped into the cloud that was only making a little rain.

B. A bigger hole opened in the cloud, releasing more rain.

C. A different cloud, with more rain, moved into the area.

D. The wind started to push more water out of the clouds.

Science 3

Jenna put a glass of cold water outside on a warm day. After a while, she could see small droplets on the outside of the glass. Why was this?

A. The air molecules around the glass condensed to form droplets of liquid

B. The water vapor in the air near the cold glass condensed to form droplets of liquid water

C. Water soaked through invisible holes in the glass to form droplets of water on the outside of the glass

D. The cold glass causes oxygen in the air to become water

Science 4

How could you increase the temperature of boiling water?

A. Add more heat.

B. Stir it constantly.

C. Add more water.

D. You can’t increase the temperature of boiling water.

Science 5

What can we do to preserve the ozone layer?

A. Reduce the amount of carbon dioxide produced by cars and factories

B. Reduce the greenhouse effect

C. Stop cutting down the rainforests

D. Limit the numbers of cars that can be used when the level of ozone is high

E. Properly dispose of air-conditioners and fridges

English

Where would be the best place to begin a new paragraph?

No rules are carved in stone dictating how long a paragraph should be. However, for argumentative essays, a good rule of thumb is that, if your paragraph is shorter than five or six good, substantial sentences, then you should reexamine it to make sure that you've developed the ideas fully. [A] Do not look at that rule of thumb, however, as hard and fast. It is simply a general guideline that may not fit some paragraphs. [B] A paragraph should be long enough to do justice to the main idea of the paragraph. Sometimes a paragraph may be short; sometimes it will be long. [C] On the other hand, if your paragraph runs on to a page or longer, you should probably reexamine its coherence to make sure that you are sticking to only one main topic. Perhaps you can find subtopics that merit their own paragraphs. [D] Think more about the unity, coherence, and development of a paragraph than the basic length. [E] If you are worried that a paragraph is too short, then it probably lacks sufficient development. If you are worried that a paragraph is too long, then you may have rambled on to topics other than the one stated in your topic sentence.

English 2

In a piece of persuasive writing, which of these would be the best thesis statement?

A. The typical TV show has 9 violent incidents
B. There is a lot of violence on TV
C. The amount of violence on TV should be reduced
D. Some programs are more violent than others
E. Violence is included in programs to boost ratings
F. Violence on TV is interesting
G. I don’t like the violence on TV
H. The essay I am going to write is about violence on TV

History

Why are historians concerned with bias when analyzing sources?

A. People can never be trusted to tell the truth
B. People deliberately leave out important details
C. People are only able to provide meaningful information if they experienced an event firsthand
D. People interpret the same event in different ways, according to their experience
E. People are unaware of the motivations for their actions
F. People get confused about sequences of events

Function | Quality | Format | Scope | Authority | Locus

“All the women are strong, all the men are good-looking, and all the children are above average.” Garrison Keillor

The Lake Wobegon effect revisited


Effects of narrow assessment
  Incentives to teach to the test
  Focus on some subjects at the expense of others
  Focus on some aspects of a subject at the expense of others
  Focus on some students at the expense of others (“bubble” students)

Consequences
  Learning that is
    Narrow
    Shallow
    Transient

Function | Quality | Format | Scope | Authority | Locus

Authority

Reliability requires random sampling from the domain of interest

Increasing reliability requires increasing the size of the sample

Using teacher assessment in certification is attractive:
  Increases reliability (increased test time)
  Increases validity (addresses aspects of construct under-representation)

But problematic
  Lack of trust (“Fox guarding the hen house”)
  Problems of biased inferences (construct-irrelevant variance)
  Can introduce new kinds of construct under-representation

Function | Quality | Format | Scope | Authority | Locus

Locus

Using external markers to mark student assessments involves spending more money in order to deny teachers professional learning opportunities

Getting teachers involved in “common assessment”
  Is not assessment for learning, nor formative assessment
  But it is valuable, perhaps even essential, professional development


Final reflections

The challenge

To design an assessment system that is:
  Distributed: so that evidence collection is not undertaken entirely at the end
  Synoptic: so that learning has to accumulate
  Extensive: so that all important aspects are covered (breadth and depth)
  Manageable: so that costs are proportionate to benefits
  Trusted: so that stakeholders have faith in the outcomes

Constraints and affordances

Beliefs about what constitutes learning;

Beliefs in the reliability and validity of the results of various tools;

A preference for and trust in numerical data, with bias towards a single number;

Trust in the judgments and integrity of the teaching profession;

Belief in the value of competition between students;

Belief in the value of competition between schools;

Belief that test results measure school effectiveness;

Fear of national economic decline and education’s role in this;

Belief that the key to schools’ effectiveness is strong top-down management.

The minimal take-aways…

No such thing as a summative assessment

No such thing as a reliable test

No such thing as a valid test

No such thing as a biased test

“Validity including reliability”