In exploring what constitutes effective teaching and learning in classrooms and schools, this article centres on the important role of evaluation in evidence-based practice as a means of establishing ‘what works’. Professor Rob Coe, Director of the Centre for Evaluation and Monitoring, Durham University, looks at ways of establishing high-quality, local strategies to evaluate interventions, so that educators have clear evidence on the efficacy of their innovative practices. Coe argues that three key questions need to be addressed in designing good evaluation: What exactly are you evaluating? What counts as success? What would have happened otherwise? Coe cites successful examples of teachers using effective evaluation strategies to inform better decision-making about what has worked for them, how to improve what may be capable of being made to work, and what to stop doing. He concludes that strong, real-time, local evaluation may offer the best route for converting evidence-based practice into authentic improvement.

Evidence-based practice in education has come a long way in the last 20 years. We know quite a bit now about which interventions and practices are effective in enhancing learning. Sources like the Education Endowment Foundation/Sutton Trust Teaching and Learning Toolkit (Higgins et al., 2016) and the work of teams led by researchers such as Hattie (2008), Slavin (2002) and Marzano (2003) have done a good job of synthesising and summarising the best available evidence in practitioner-relevant forms. But we also know that when we try to implement those interventions at scale and in real settings, sometimes we don’t see the results that were found in the best research studies. The evidence can give us a general steer – a ‘best bet’ – but it cannot give us a recipe, template or guarantee. We might say that the evidence generates a theoretical framework: a set of general principles and mechanisms rather than a specific implementation or formula. If you want to improve something in a real school, it is probably helpful if your decision-making is informed by a deep understanding of relevant, high-quality educational research; but it is certainly not enough.

One implication is that we need to increase the breadth and depth of evidence-informed practical wisdom in schools: wisdom because it is more than just a tick-list knowledge of what works, but in fact a deep, integrated understanding of how, why and when; and practical because research seldom tells us directly what to do in a classroom or school, so the translation into effective practice rests, in large part, with teachers or school leaders. To build this practical wisdom we need to develop programmes of substantial training for significant numbers of teachers. And that raises a whole host of challenging questions: What organisations or individuals have the knowledge and capacity to deliver that kind of training? How can access to such training be enabled for large groups of teachers? How can this be added to the requirements of working in schools, without either exacerbating the workload crisis or pushing out something else that is equally important? How will we fund all this training?

These are important questions; they are extremely difficult questions; and they are likely to be at the centre of the agenda of the Chartered College of Teaching over the next months and years. But they are not my focus here. Instead, I want to focus on another less obvious implication: the need for high-quality, local evaluation.

If ‘what works’ doesn’t always work, how do we make good choices? It may be that we are not even very good at estimating, given a description of an intervention and a particular context, what the effect will be. My guess is that a small percentage of teachers and a small percentage of researchers could predict with moderate correlations which interventions will work. For the majority of both groups, however, I would not be surprised if the correlations were close to zero. To the best of my knowledge, no one has ever tried to test this, so for now my guess is not evidence-based.

If I am right, though, the up-front decision about which ‘best bet’ to invest in may be less important than the multiple, real-time, feedback-informed decisions about exactly how to implement it. The key here is feedback-informed: if the feedback is good and capable of being acted on, then doing ‘evidence-based practice’ becomes more of a learning process than an implementation process. This idea is developed, with useful examples, by Bryk and colleagues in ‘Learning to Improve’ (Bryk et al., 2015).

For feedback to be good, it needs to give accurate and timely information about the impact the intervention is having (or had) on the intended outcome, in our context. Ideally, it should also tell us something about what was actually done, since this is almost never completely faithful to the intended intervention. If this feedback is combined with a strong theoretical knowledge about the likely how and why – for example, understanding how people learn, what pedagogies are effective, and how schools and classrooms operate – then it becomes a basis for improving decisions and processes. In other words, we need high-quality, real-time, local evaluation.

When people think about evaluation, they are seldom against it. They may question whether its value justifies the effort it takes, but most people would agree that it is good to evaluate what you do and, particularly if you have made a change that requires significant investment of time or money, to evaluate the impact of an intervention. The problem is that good evaluation is hard to do, and what mostly passes for evaluation is not very good. Poor evaluation can be worse than none, since it gives wrong information and a false sense of confidence (Coe et al., 2013).

In designing a good evaluation we must answer three essential questions:

  • What exactly are you evaluating?
  • What counts as success?
  • What would have happened otherwise?

What exactly are you evaluating?

Traditionally, we evaluate an intervention: a specified change or new programme. However, sometimes it is not clear exactly what is being evaluated. For example, suppose a school leadership team decides it wants to introduce ‘comment-only’ marking, but teachers are unconvinced and the new policy is only loosely followed in practice. Are we then evaluating the intended policy change or what actually happened? It may seem obvious that it should be the latter, but robust ‘intention-to-treat’ evaluation designs require us to evaluate the former. Ideally of course we should avoid this problem by making sure the intended change is one that will be comprehensively and faithfully implemented – usually by careful piloting. The key point here is that the intervention being evaluated must be manipulable: something we can change deliberately. If we don’t know how to get all teachers to do comment-only marking, we can’t evaluate comment-only marking. We might be able to evaluate ‘an attempt to introduce comment-only marking (that did not achieve full and faithful implementation)’ since we probably can introduce ‘an attempt’, but this is unlikely to be very useful. A further requirement is that a manipulable intervention must not be too limited in the time or place it could be implemented. There is little value in finding the impact of something we could not in principle replicate.

In traditional evaluation thinking, we often compare the intervention to ‘business as usual’. If that is the comparator, then it is important that we can describe in detail what this means, as well as being able to describe the intervention. A comparison requires that we know enough detail about both of the things we compare.

What counts as success?

All evaluation must have some kind of outcome measure against which to evidence success. Sometimes we may want to have more than one outcome, but best practice is to identify a single primary outcome before the intervention starts. This avoids the bias that would ensue if we collected a large number of outcomes and did not decide which ones were important until we saw the results.
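The size of this bias can be illustrated with a little arithmetic. The sketch below is a simplification under stated assumptions: each outcome is tested independently at the conventional 5% significance level, and the intervention truly has no effect on any of them. The function name and these assumptions are illustrative only.

```python
def chance_of_spurious_success(n_outcomes: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent outcomes appears
    'significant' by chance alone when the intervention has no real effect.

    Assumes independent tests, each at significance level alpha.
    """
    if n_outcomes < 0:
        raise ValueError("n_outcomes must be zero or more")
    # Chance that every outcome correctly shows nothing is (1 - alpha) ** n,
    # so the chance that at least one misleads us is the complement.
    return 1.0 - (1.0 - alpha) ** n_outcomes


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(chance_of_spurious_success(k), 3))
```

Under these assumptions the chance of finding at least one apparent ‘success’ rises from 5% with a single pre-specified outcome to roughly 23% with five outcomes and roughly 64% with twenty, which is why choosing the outcome after seeing the results is so misleading.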

Often the primary outcome will be a measure of attainment, derived from a robust assessment process. However, it could be an attitude or behaviour, for example. The most important thing about the outcome measure is that it must reflect the values and aims of the intervention. For example, we might introduce a programme of training in assertive discipline with the aim of improving students’ behaviour. A key question would be, ‘If there is an improvement in behaviour, but no change in attainment, would we see that as a success?’ If the answer is yes, then the primary outcome should be behaviour; but if success means improving attainment, then attainment must be the primary outcome. Either way, we might well include the other as a secondary outcome.

Success criteria do not have to be assessments, or even measures as such, but a good understanding of assessment is crucial to selecting, using and interpreting robust indicators of impact.

What would have happened otherwise?

This is known as the counterfactual. It is a vital question, but in practice elusive: we can never know what would have happened otherwise. But underestimating the effects of things other than the intended intervention is a major reason why poor evaluations get it wrong. Human beings seem to have a natural tendency to create and believe causal stories when what they actually see is just association and a plausible mechanism. So we accept attributions and believe them strongly, even when they are not supported by evidence. Hence it may be that the value of an evidence-based approach is more about being aware of our assumptions and critical of plausible stories than being knowledgeable about research findings.

Evaluators have developed a range of ingenious ways of estimating the counterfactual. The classic is random allocation: individuals or groups are not allowed to choose which treatment they get, but are allocated at random. If the numbers are large enough we get a statistical guarantee of initial equivalence. We do not have to have a ‘no treatment’ control group; each can get a different intervention, deemed to have equal value to the recipients, but perhaps differentially focused on the desired outcome, or we may achieve this by offering some compensation, or through simply comparing two plausible and realistic approaches between which we want to make an informed choice. In a waiting list design, both groups get the intervention but random allocation simply determines when. The range of options for random allocation is enormous; selecting an optimal design requires a good deal of expertise, and with the right expertise and ingenuity it is usually possible to come up with a strong design.
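To make the waiting list design concrete, here is a minimal sketch of random allocation for a single school. The class labels, the group names and the seed value are hypothetical; a real evaluation would also need to decide the unit of allocation (pupils, classes or teachers) and record the allocation before any outcome data are seen.

```python
import random


def allocate_at_random(units, seed):
    """Split units into two groups of (near) equal size by chance alone.

    Both groups receive the intervention; allocation only decides when.
    A fixed seed means the allocation can be reproduced and audited.
    """
    rng = random.Random(seed)
    shuffled = list(units)      # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "start_now": sorted(shuffled[:half]),
        "start_later": sorted(shuffled[half:]),
    }


if __name__ == "__main__":
    classes = [f"class_{i}" for i in range(1, 9)]   # hypothetical labels
    groups = allocate_at_random(classes, seed=2024)
    for name, members in groups.items():
        print(name, members)
```

The important design choice is that nobody chooses which group a class falls into: the shuffle does. With only a handful of units the groups may still differ by chance, which is why the equivalence is only statistically assured when numbers are large.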

However, if we want to evaluate an intervention in a single school, random allocation may not always be possible and a weaker design must be used. Most nonrandomised designs depend on matching individuals in the two groups on characteristics such as pre-test scores. If we can match on a number of high-quality measures, and if the groups are initially similar on these measures, and we know enough about how the groups were formed, then these designs can be quite strong. For example, we might compare different tutor groups that have been allocated by a pseudo-random process. Or we might compare this year’s cohort with last year’s, provided we use the same baseline and outcome measures at the same times and are confident that nothing else has changed between the two years. Another strong design uses a cut-off on the baseline to allocate treatments – for example, if all those below a certain score get a catch-up programme. A specific form of analysis uses ‘regression discontinuity’ to estimate unbiased causal effects.
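The cut-off design can be sketched in a few lines. This is a deliberately simplified illustration, not a full regression discontinuity analysis: it fits a separate straight line on each side of the cut-off and reads off the gap between them at the cut-off itself. The function names, the cut-off of 50 and the noise-free data are all invented for the example; real data are noisy, and a proper analysis would also report uncertainty and check that a straight line is a reasonable fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def discontinuity_at_cutoff(baseline, outcome, cutoff):
    """Estimate the jump in outcome at the cut-off.

    Pupils scoring below the cut-off received the catch-up programme;
    those at or above it did not. Baseline scores are centred on the
    cut-off, so each fitted intercept is the predicted outcome there.
    """
    below = [(b - cutoff, o) for b, o in zip(baseline, outcome) if b < cutoff]
    above = [(b - cutoff, o) for b, o in zip(baseline, outcome) if b >= cutoff]
    _, treated_at_cutoff = fit_line(*zip(*below))
    _, untreated_at_cutoff = fit_line(*zip(*above))
    return treated_at_cutoff - untreated_at_cutoff


if __name__ == "__main__":
    # Invented, noise-free data: outcome rises with baseline, and pupils
    # below the cut-off gain an extra 4 points from the programme.
    baseline = list(range(30, 70))
    outcome = [20 + 0.8 * b + (4 if b < 50 else 0) for b in baseline]
    print(round(discontinuity_at_cutoff(baseline, outcome, cutoff=50), 2))
```

Because the invented data contain a built-in gain of 4 points and no noise, the estimate recovers that gain; the point of the sketch is only to show where the estimate comes from, namely the gap between the two fitted lines at the cut-off.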

This introduction to the why, what and how of evaluation is necessarily brief. There is a vast and specialist field of expertise in evaluation that can be studied and practised at length. It is certainly challenging for full-time teachers to learn enough about evaluation and to find the time and resources to do it well. However, there are good examples of teachers using these approaches to inform better decision-making about what has worked for them, how to improve what may be capable of being made to work, and what to stop doing. If doing ‘what works’ were simple, there would be no need to evaluate what was already proven. But in a world where nothing is really proven – and certainly nothing simple – high-quality, real-time, local evaluation may well be our best chance of converting evidence-based practice into real improvement.



Bryk AS, Gomez LM, Grunow A and LeMahieu PG (2015) Learning to Improve: How America’s Schools Can Get Better at Getting Better. Cambridge, MA: Harvard Education Press.

Coe R, Kime S, Nevill C and Coleman R (2013) The DIY evaluation guide. London: Education Endowment Foundation. Available at:

Hattie J (2008) Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge Best Evidence Encyclopedia. Available at:

Higgins S, Katsipataki M, Villanueva-Aguilera ABV, Coleman R, Henderson P, Major LE, Coe R and Mason D (2016) The Sutton Trust-Education Endowment Foundation Teaching and Learning Toolkit. December 2016. London: Education Endowment Foundation. Available at:

Marzano RJ (2003) What Works in Schools: Translating Research into Action. Alexandria, VA: ASCD.

Slavin RE (2002) Evidence-based educational policies: Transforming educational practice and research. Educational Researcher 31(7): 15-21.