Assessment can sometimes feel like a tedious chore. Teaching should be about inspiring young minds, not driving them through a series of stultifying tests. And when assessment is done badly, it can get in the way of good teaching, and even distort it. But done well, assessment can give pupils better information about their performance and teachers better information about whether their methods are working. Assessment is just a type of measurement, and we know from other walks of life that accurate measurement can be transformative. Improvements in measurement brought about by microscopes and stethoscopes in the 19th century led to improvements in healthcare itself, while improvements in how we measure time led to improvements in navigation and ultimately helped to support the development of satellite navigation systems. Two vital concepts for understanding assessment are validity and reliability, and the rest of this article will give a brief introduction to both.
Perhaps the most important concept in assessment is validity. Daniel Koretz, professor of assessment at Harvard University, says that ‘validity is the single most important criterion for evaluating achievement testing’ (Koretz, 2008). Dylan Wiliam agrees, saying that it is ‘the central concept in assessment’, and defines it as follows:
The really important idea here is that we are hardly ever interested in how well a student did on a particular assessment. What we are interested in is what we can say, from that evidence, about what the student can do in other situations, at other times, in other contexts. Some conclusions are warranted on the basis of the results of the assessment, and others are not. The process of establishing which kinds of conclusions are warranted and which are not is called validation. (Wiliam, 2014)
The startling implication of this is that the actual result on an assessment doesn’t matter. What matters is the inferences that we can make from that result. Ultimately, sixth form heads and university admissions officers don’t really care that a pupil sat in a hall on a day in June and answered a particular set of questions correctly or incorrectly. What they are really interested in is what this performance on the day in June can tell them about what that pupil will be able to do later. If the pupil gets a certain mark or grade on the exam, does that mean that they can start studying an A-level in that subject three months later? Does it mean that they can start a university course in a related subject? Does it mean that they can start the course without any help, or will they need some kind of extra support? It is important to note here that validity refers not to a test or assessment itself, but to these inferences.
Tests themselves are not valid or invalid. Rather, it is an inference based on test scores that is valid or not. A given test might provide good support for one inference, but weak support for another. For example, a well-designed end of course exam in statistics might provide good support for inferences about students’ mastery of basic statistics, but very weak support for conclusions about mastery of mathematics more broadly. (Koretz, 2008, p. 31)
Validity matters for classroom assessment too. When a pupil gets a question wrong in class, or is unable to provide a response, what inferences are you justified in making? Often, we infer from a wrong answer or no answer that a pupil hasn’t understood the concept being tested. But sometimes that inference is not justified. Imagine that you’ve taught pupils a unit on weights and measures, and at the end of the unit you give them a word problem, asking them to use their knowledge of weights and measures in context. It is possible that they will struggle not because they don’t understand weights and measures, but because they are weak readers, or because the story the problem is embedded in contains unfamiliar concepts.
In practice, this happens quite a lot with word problems in maths. So it is vital that we keep asking ourselves why we are assessing: what is the purpose of the questions we are asking, and what are we trying to find out? If we want to know how well our pupils can apply weights and measures in a real-world context, then we should embed the problem in a real-world context. But if we want to find out whether they have understood weights and measures, we might design a narrower question. In the classroom, the advantage of narrower questions is that they make it easier for us to make inferences about what we and the pupils should do next. The advantage of broader questions is that they allow us to make inferences about what a pupil can do in real-world contexts. As we've seen, it isn't that one individual question or assessment is more or less valid. Validity isn't a property of a test; it's a property of the inference. So both the narrow and the broad question can be used to make valid inferences, but they can also be used to make invalid ones. The problem often isn't with the question or the assessment, but with the way in which we interpret and use the results.
A second vital assessment concept is reliability, which is about the consistency of assessment. For an analogy, think of a set of kitchen scales. If you weigh a kilogram bag of sugar on the scales 10 times, then you would expect that each time you would get a reading of one kilogram, or very close to one kilogram. For an educational assessment to be reliable, something similar needs to happen. If a pupil were to take different versions of the same test, they should get approximately the same mark. If they were to take the test at different times of day, they should get approximately the same mark. And if a pupil’s answer paper were submitted to ten different markers, it should return each time with approximately the same mark. In practice, neither kitchen scales nor exams are perfectly reliable. There is always an element of error. What matters is how big that error is and how it affects the inferences we want to make.
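The kitchen-scales analogy can be made concrete with a short simulation. The sketch below is purely illustrative (the size of the error is an invented assumption, not a figure from the article): it 'weighs' a one-kilogram bag ten times on a scale with a small random error, and reports how consistent the readings are.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

TRUE_WEIGHT_KG = 1.0
ERROR_SD = 0.02  # hypothetical scale error: standard deviation of 20 g

# Ten repeated weighings of the same bag, each with its own random error
readings = [TRUE_WEIGHT_KG + random.gauss(0, ERROR_SD) for _ in range(10)]

mean_reading = sum(readings) / len(readings)
spread = max(readings) - min(readings)

print(f"mean reading: {mean_reading:.3f} kg")
print(f"spread of readings: {spread:.3f} kg")
```

No single reading is exactly one kilogram, but a reliable instrument keeps the spread small relative to the inferences we want to make; an unreliable one (a larger `ERROR_SD`) would not.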
One factor that can have a big impact on reliability is agreement between markers. Closed questions are often very easy to mark because it’s clear what the right and wrong answers are. This is why maths teachers can often whiz through an exam paper ticking and crossing lots of questions, and still be sure that other markers would agree with them. English teachers, by contrast, often take much longer to pore over essays and mark schemes, and even then they can’t be certain that another marker would agree with them. There are ways around this: comparative judgement, for example, is a new method of assessment that provides better reliability for open tasks like essays.
Reliability is particularly important to consider when it comes to measuring progress. When we measure progress, we are often looking at the difference in performance between one test and the next. So we have two sets of measurement error to deal with: the error on the first test and the error on the second. A pupil who looks like they’ve made a big improvement from one test to the next might just have had a particularly bad day and a harsh marker on the first test, and a good day and a generous marker on the second. Understanding reliability helps us to understand whether pupils really have made progress or not.
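To see how two layers of error can masquerade as progress, here is a small illustrative simulation (all the numbers are invented assumptions): a pupil whose true attainment has not changed sits two tests, each marked with some random error, and the difference between the two scores is reported as their 'progress'.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

TRUE_SCORE = 60  # the pupil's (unchanged) underlying attainment, out of 100
ERROR_SD = 5     # hypothetical test-and-marker error (standard deviation)

def observed_score(true_score: float) -> float:
    """One noisy measurement: true attainment plus random error."""
    return true_score + random.gauss(0, ERROR_SD)

# Simulate many pupils sitting test 1 and then test 2, with no real change
gains = [observed_score(TRUE_SCORE) - observed_score(TRUE_SCORE)
         for _ in range(10_000)]

big_apparent_gains = sum(1 for g in gains if g >= 10)
print(f"pupils showing a 'gain' of 10+ marks: {big_apparent_gains} of 10,000")
```

Even though no pupil in this sketch has improved at all, a noticeable minority appear to have gained ten or more marks, because the errors on the two tests compound when we take the difference.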
Check out Issue 1 of Impact, dedicated to assessment: impact.chartered.college/issue/issue-1-assessment.
Daisy Christodoulou is the Director of Education at No More Marking, a provider of online comparative judgement. Before that, she was Head of Assessment at Ark Schools, a network of 35 academy schools. She has taught English in two London comprehensives and has been part of government commissions on the future of teacher training and assessment. Daisy is the author of Seven Myths about Education and Making Good Progress? The Future of Assessment for Learning, as well as the influential blog: thewingtoheaven.wordpress.com. You can also find her on Twitter @daisychristo.