One of the most difficult and time-consuming tasks facing teachers is the assessment of extended pieces of writing. This article will outline some of the challenges and trade-offs currently involved in the assessment of writing, and then consider whether comparative judgement offers an improvement on traditional methods.

In order to see why assessing writing is so tricky, we first need to consider two fundamental assessment concepts: validity and reliability. Take validity first. Whenever we look at a student’s test results, we need to know if it is valid for us to make certain inferences from those results. For example, a teacher might want to make inferences about a student’s particular strengths and weaknesses, about whether they need extra help in writing, or about their overall performance in comparison to their peers. There are inferences that employers and universities might like to make too: can they use the results of a writing assessment to support an inference about whether that student is capable of starting a three-year degree course that will involve writing lots of essays, or of doing a job which requires writing lots of emails and reports under significant time pressure?

Some of the inferences we want to make concern a student’s performance in the future – sometimes the distant future. This is particularly the case with the inferences being made from national tests, where, as we have seen, employers and universities want to know how students will cope with the tasks they will face in the workplace or library. Given this, one might argue that the most valid assessment tasks are those that most closely resemble the tasks students will need to undertake in the real world. If we follow this argument, writing tasks in assessments should have a plausible real-world purpose and audience. We might even argue that students should produce such work in less formal and more collaborative conditions. After all, in the real world, nobody bans you from using a dictionary or an online spell-checker, or asking a colleague for help.

Assessment reliability

However, before we advance too far down this path, we have to consider the other important assessment concept: that of reliability. Reliability refers to the consistency of measurement. Perhaps the simplest analogy is with a set of kitchen scales. If you have a one-kilogram bag of flour and you measure it several times on the same set of scales, then a reliable set of scales should return approximately the same reading each time. Similarly, a reliable assessment should result in a student getting a similar grade if they were to sit a different version of the same assessment, or if their assessment were marked by a different examiner. Reliability can sound like a pedantic requirement, but it is in fact a prerequisite of validity. If an assessment is very unreliable, we cannot be sure what it is telling us, making any attempt to create a valid inference extremely difficult.

For example, imagine getting the results of an assessment that shows a student has improved by three grades since their last assessment. Typically, we would infer from that result that they had made big improvements. But what if we learnt that the assessment they had taken was only accurate to plus or minus three grades? That inference would no longer be valid. No assessment – or indeed, set of scales – is perfectly reliable, and all measurements have error, but we need to be able to measure the error in order to see how accurate the assessment is, and to help us make valid inferences.
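The point about measurement error can be made concrete with a small sketch. It assumes, purely for illustration, that grades are numbers and that the error is a simple plus-or-minus margin around the observed grade; the function names and the example grades are hypothetical.

```python
def score_band(observed, margin):
    """Return the (low, high) range of grades consistent with an observed grade."""
    return observed - margin, observed + margin


def bands_overlap(first, second, margin):
    """True if two observed grades could reflect the same underlying performance."""
    low_a, high_a = score_band(first, margin)
    low_b, high_b = score_band(second, margin)
    return low_a <= high_b and low_b <= high_a


# A three-grade improvement measured to plus or minus three grades:
print(bands_overlap(4, 7, margin=3))  # True: the ranges overlap
# The same improvement measured to plus or minus one grade:
print(bands_overlap(4, 7, margin=1))  # False: the ranges are distinct
```

When the ranges overlap, the apparent improvement cannot be separated from measurement error, which is why the inference of 'big improvements' no longer holds.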

Many of the changes discussed earlier that seemed to make an assessment more valid can actually increase unreliability. Assessments that require students to write long and extended answers may seem more valid than short-answer questions, in that they mimic the tasks students are expected to do in the real world. But long and extended written tasks are hard to mark reliably (Tisi, 2013). Markers often disagree about the grade an answer should be awarded, and ensuring marking consistency is a costly and time-consuming process. Disagreement between markers is not the only source of unreliability introduced by extended written answers. Because extended written answers take time to write, only a few can be included in an exam. And the fewer questions there are on an exam paper, the more likely it is that a pupil will be ‘lucky’ or ‘unlucky’ with the selection of questions.
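The effect of question sampling can be illustrated with a rough simulation. This is a deliberately simplified sketch, not a model of any real exam: it assumes a hypothetical student who answers each randomly selected question well with a fixed probability of 0.6, independently of the others, and it measures how much that student's percentage score varies from one simulated paper to the next.

```python
import random
import statistics


def simulate_score_spread(n_questions, p_success=0.6, n_papers=5000, seed=0):
    """Standard deviation of a student's percentage score across simulated papers.

    Each paper contains n_questions drawn at random; the student answers each
    one well with probability p_success (a hypothetical figure).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_papers):
        correct = sum(rng.random() < p_success for _ in range(n_questions))
        scores.append(100 * correct / n_questions)
    return statistics.pstdev(scores)


# Fewer questions per paper means a wider spread of scores for the same student.
for n in (2, 5, 20):
    print(n, "questions:", round(simulate_score_spread(n), 1))
```

Under these assumptions, the spread of scores shrinks as the number of questions grows, so a paper with only two extended tasks leaves far more room for a 'lucky' or 'unlucky' draw than one with twenty shorter questions.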

Similarly, allowing students to seek help from outside sources introduces an element of unreliability, in that it is hard to be sure that all students have had exactly the same amount of help from people with similar qualifications. In the US, evaluations of such authentic ‘portfolio assessments’ have revealed significant problems: one major study concluded that such assessments were so unreliable that they ‘preclude most of the intended uses of the scores’ and ‘fail to discern real differences in quality’ (Koretz, 1998).

One possible response to this problem is to use more short-answer and constructed response questions in the assessment of writing. Scores from such tasks are more reliable than those from extended writing tasks, and such tasks are quicker and cheaper to mark as well. The standard objection to such questions is that while they are extremely reliable, they do not allow us to make valid inferences about things that we are actually interested in. Reliability is a necessary condition for validity, but it is not a sufficient one. We could ask students to throw basketballs at a hoop and count the number of successes: we could mark that assessment reliably, but we would not be able to make any valid inferences about a student’s writing skill from it. That is, of course, a caricature. In practice, well-designed short-answer questions are more sophisticated than is commonly assumed. They often correlate well with measures of broader writing skill, and it is possible to make some valid inferences about broader writing skill from them (Hirsch, 2010). However, very few people would want to see such questions entirely replace extended writing, because the risk would be that if extended writing were no longer assessed, it would not be taught either. In a high-stakes assessment system such as England’s, instruction would come to focus on the types of questions that would appear on the test, and it would also be likely, therefore, that the correlation between performance on short-answer questions and on extended writing tasks would start to break down.

A method of reliably assessing extended writing is therefore very important, for teaching and learning as much as for reliable assessment. In order to consider how we might arrive at this, let us first see why it is so difficult to mark extended writing at the moment. In two important ways, the current methods are problematic. First, traditional writing assessment often depends on absolute judgements. Markers look at a piece of writing and attempt to decide which grade is the best fit for it. This may feel like the obvious thing to do, but in fact humans are very bad at making such absolute judgements. This is not just true of marking essays, either, but of all kinds of absolute judgement. For example, if you are given a shade of blue and asked to identify how dark a shade it is on a scale of 1 to 10, or given a line and asked to identify the exact length of it, you will probably struggle to be successful. However, if you are given two shades of blue and asked to find the darker one, or two lines, and asked to find the longer one, you will find that much easier. Absolute judgement is hard; comparative judgement is much easier, but traditional essay marking works mainly on the absolute model (Laming, 2003).

Second, traditional writing assessment often depends on the use of prose descriptions of performance, such as those found in mark schemes or exam rubrics. The idea is that markers can use these descriptions to guide their judgements. For example, one exam board, AQA, describes the top band for writing in the following way (AQA, 2017):

  • Writing is compelling, incorporating a range of convincing and complex ideas
  • Varied and inventive use of structural features.

The next band down is described as follows:

  • Writing is highly engaging, with a range of developed complex ideas
  • Varied and effective structural features.

Already it is not hard to see the kinds of problems such descriptors can cause. What is the difference between ‘compelling’ and ‘highly engaging’? Or between ‘effective’ use of structural features and ‘inventive’ use? Such descriptors cause as many disagreements as they resolve. A great deal of practical and philosophical research shows that prose descriptors are capable of being interpreted in a number of different ways, and as such do not help improve marking reliability (Polanyi, 1962). As Alison Wolf (1998) says, ‘one cannot, either in principle or in theory, develop written descriptors so tight that they can be applied reliably, by multiple assessors, to multiple assessment situations.’

Comparative judgement

Despite these research findings, the most recent iteration of England’s national assessment of primary writing involved a set of ‘secure-fit’ descriptors that were supposed to be more precise than previous versions, and as such to improve reliability. In practice, this has not happened: the descriptors have been applied inconsistently across schools and regions, making it hard to derive any valid inferences from the results (Allen, 2016). And, just as we saw that the overuse of short-answer questions can lead to the distortion of teaching and learning, the ‘secure-fit’ descriptors have led to similar distortions, as teachers encourage students to shoehorn in certain techniques that are ‘essential’ for the achievement of a certain grade (Tidd, 2016). This problem has also been anticipated by research showing that more specific mark schemes often lead to stereotyped responses (Wiliam, 1994). In addition to these problems with reliability and validity, using absolute judgement and prose descriptors is also very inefficient, particularly if the prose descriptors consist of several different statements that a student has to meet, and which the teacher has to check off and record.

Comparative judgement offers a way of assessing writing which, as its name suggests, does not involve difficult absolute judgements, and which also reduces reliance on prose descriptors. Instead of markers grading one essay at a time, comparative judgement requires the marker to look at a pair of essays, and to judge which one is better. The judgement they make is a holistic one about the overall quality of the writing. It does not need to be guided by a rubric, and can be completed fairly quickly. If each marker makes a series of such judgements, it is possible for an algorithm to combine all the judgements and use them to construct a measurement scale (Pollitt, 2012). This algorithm is not new: it was developed in the 1920s by Louis Thurstone (Thurstone, 1927). In the last few years, the existence of online comparative judgement software has made it easy and quick for teachers to experiment with such a method of assessment.
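To give a sense of how a set of paired judgements can be turned into a scale, here is a minimal sketch. It is not the algorithm used by any particular software, and it is not Thurstone's original formulation: it swaps in the Bradley-Terry model, a closely related logistic model for paired comparisons, fitted with a simple iterative update. The function name, the script labels and the judgement data are all hypothetical.

```python
import math
from collections import defaultdict


def fit_bradley_terry(judgements, n_iter=200):
    """Estimate a quality score per script from (winner, loser) pairs.

    Returns a dict mapping script id -> log-strength (higher = judged better).
    Assumes every script has at least one win and one loss.
    """
    wins = defaultdict(int)         # total wins per script
    losses = defaultdict(int)       # total losses per script
    pair_counts = defaultdict(int)  # comparisons per unordered pair of scripts
    scripts = set()
    for winner, loser in judgements:
        wins[winner] += 1
        losses[loser] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    for s in scripts:
        if wins[s] == 0 or losses[s] == 0:
            raise ValueError(f"script {s!r} needs at least one win and one loss")

    strength = {s: 1.0 for s in scripts}
    for _ in range(n_iter):
        updated = {}
        for s in scripts:
            denom = 0.0
            for pair, n in pair_counts.items():
                if s in pair:
                    (other,) = pair - {s}
                    denom += n / (strength[s] + strength[other])
            updated[s] = wins[s] / denom
        # Rescale so the strengths keep a fixed average of 1.
        total = sum(updated.values())
        strength = {s: v * len(scripts) / total for s, v in updated.items()}
    return {s: math.log(v) for s, v in strength.items()}


# Hypothetical judgements: each pair of four scripts is compared twice.
judgements = [
    ("A", "B"), ("B", "A"), ("A", "C"), ("A", "C"), ("A", "D"), ("A", "D"),
    ("B", "C"), ("B", "C"), ("B", "D"), ("D", "B"), ("C", "D"), ("C", "D"),
]
scores = fit_bradley_terry(judgements)
print(sorted(scores, key=scores.get, reverse=True))
```

In this balanced example the resulting order simply follows the number of wins (A, then B, C, D). The value of the model lies in the unbalanced case, where scripts are compared different numbers of times and against opponents of different quality, and in producing distances between scripts rather than a rank order alone.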

Over the last two years, first in my role as head of assessment at Ark Schools, and now as director of education at No More Marking (a provider of online comparative judgement software), I have helped to run a number of trials involving the use of comparative judgement to assess primary and secondary writing.

After a number of promising small-scale trials that delivered highly reliable results, in early 2017 we ran a bigger trial involving 8,512 Year 6 students from 199 primary schools in England. From those schools, 1,649 teachers took part in the judging. The moderation scripts were split into five moderation pots, and the Rasch separation ratio reliability of the judging in these five pots varied from 0.84 to 0.88. The median time taken to make a decision was 38 seconds (No More Marking, 2017). In an average-sized primary school with 60 students in a year group and 12 teachers in the school, this required a total judging time per teacher of just over 30 minutes. Both the reliability and the efficiency of the comparative judgement approach therefore compared favourably with the more traditional approaches.
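The judging-time figure can be checked with some simple arithmetic. The number of judgements each school was asked to make is not stated above, so the sketch below assumes a total of 10 judgements for every script submitted; that figure is an assumption chosen for illustration, although it does reproduce the 'just over 30 minutes' quoted.

```python
def judging_minutes_per_teacher(n_scripts, n_teachers, seconds_per_judgement,
                                total_judgements_per_script):
    """Minutes of judging per teacher, if the judgements are shared out equally."""
    total_judgements = n_scripts * total_judgements_per_script
    judgements_per_teacher = total_judgements / n_teachers
    return judgements_per_teacher * seconds_per_judgement / 60


# 60 scripts, 12 teachers, 38 seconds per judgement (figures from the text);
# 10 judgements per script is an assumption, not a figure from the trial.
minutes = judging_minutes_per_teacher(60, 12, 38, total_judgements_per_script=10)
print(round(minutes, 1))  # 31.7
```

On that assumption each teacher makes 50 judgements, and 50 judgements at 38 seconds each comes to a little under 32 minutes.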

The immediate short-term gains from comparative judgement are the reliability and efficiency. But in the medium term, it is possible that such an approach could improve the validity of the inferences we want to make, expand the types of inferences we are able to make, and indeed improve our understanding of what good writing is. Throughout history, improvements in measurement have led to improvements and innovations in the construct being measured. If teachers have a more reliable way of measuring their students’ improvement in writing, then it becomes easier for them to identify the teaching practices and interventions that work, and those that don’t. Comparative judgement therefore offers a promising way of improving both the assessment and the teaching of writing.

References

Allen R (2016) Consistency in Key Stage 2 writing across local authorities appears to be poor. Education Datalab. Available at: https://educationdatalab.org.uk/2016/09/consistency-in-key-stage-2-writing-across-local-authorities-appears-to-be-poor/ (accessed 24 August 2017).

AQA (2017) GCSE English Language 8700, paper 2 mark scheme. Available at: http://filestore.aqa.org.uk/resources/english/AQA-87002-SMS.pdf (accessed 24 August 2017).

Hirsch Jr ED (2010) The Schools We Need and Why We Don’t Have Them. New York: Doubleday, pp.176-214.

Koretz D (1998) Large-scale portfolio assessments in the US: Evidence pertaining to the quality of measurement. Assessment in Education: Principles, Policy & Practice 5(3): 309-334.

Laming D (2003) Human Judgment: The Eye of the Beholder. Andover: Cengage Learning EMEA.

No More Marking (2017) Sharing standards 2016-17. Available at: http://bit.ly/2wAi9x7 (accessed 24 August 2017).

Polanyi M (1962) Personal Knowledge. London: Routledge.

Pollitt A (2012) Comparative judgement for assessment. International Journal of Technology and Design Education 22(2): 157-170.

Sadler DR (1987) Specifying and promulgating achievement standards. Oxford Review of Education 13: 191-209.

Tidd M (2016) Instead of discussing great teaching and learning, we’re looking for loopholes to hoodwink the moderators. Times Educational Supplement. Available at: http://bit.ly/2vhIZ8X (accessed 24 August 2017).

Tisi J et al. (2013) A Review of Literature on Marking Reliability Research (Report for Ofqual). Slough: NFER, pp.65-67.

Thurstone LL (1927) A law of comparative judgment. Psychological Review 34(4): 273.

Wolf A (1998) Portfolio assessment as national policy: the National Council for Vocational Qualifications and its quest for a pedagogical revolution. Assessment in Education: Principles, Policy & Practice 5(3): 413-445.

Wiliam D (1994) Assessing authentic tasks: alternatives to mark-schemes. Nordic Studies in Mathematics Education 2(1): 48-68.

Ward H (2017) More Sats ‘chaos’ as two thirds of moderators fail to assess pupils’ work correctly. Times Educational Supplement. Available at: http://bit.ly/2wppwHg (accessed 24 August 2017).