One of the proposals in the Department for Education’s (DfE) recent consultation on primary assessment is to introduce a school-entry assessment to act as a baseline for measuring progress across the primary phase and use this as part of the accountability system. This would enable progress across seven years of schooling to be taken into account, rather than across four years as is currently the case.

Schools are likely to welcome the principle behind this proposal. In a National Foundation for Educational Research (NFER) teacher omnibus survey, almost 90 per cent of a representative sample of primary senior leaders and classroom teachers indicated that measuring progress, rather than absolute attainment, was the better way to assess the contribution of schools (NFER, 2017). However, any assessment of very young children is likely to be controversial and needs careful consideration.

The reintroduction of baseline testing proposed by the DfE is intended not as a starting point for measuring individual performance over time but for measuring the performance of a whole cohort. This article does not address whether this is the best way to judge schools; rather, it asks whether it is possible to design an assessment that is fit for this intended purpose.

Issues of validity and reliability arise in relation to any assessment, not just assessments of young children. The validity of any assessment is linked to the purpose for which the results are to be used. It represents the extent to which the score or outcomes of the assessment allow valid inferences to be made about the skills, understanding and knowledge of the child. In research terms, validation is a process that takes place during the development of an assessment and continues during its use; threats to the validity of the assessment are identified and evidence is sought to demonstrate that the threats have been avoided or minimised. An assessment is reliable if the outcomes produced can be shown to be accurate and consistent (for example, the outcome for each child would be the same or very similar if the assessment were repeated or carried out by a different practitioner).

The majority of teachers already assess children upon entry to school. In a survey of more than 350 schools carried out in 2014, before the introduction of the DfE’s approved Reception baseline schemes, all the schools reported carrying out some form of entry assessment of children (Lynch et al., 2015). In a recent NFER survey (NFER, 2017), around 80 per cent of responding schools assessed children during the first term of the Reception class; of those, more than 75 per cent reported that the assessments included a combination of observations during classroom activities and one-to-one assessments of each child. The debate is over whether baseline assessments should be based on observing children in classroom activities (observational assessments) or on one-to-one assessments of children carrying out standardised tasks (standardised assessments). Observational assessment is often perceived as the less reliable of the two, and standardised assessment as the less valid. Both forms of assessment have their advantages and disadvantages, but the issue is which of these approaches is the more appropriate to act as a baseline for measuring progress.

In designing any type of assessment there are inevitable trade-offs. Standardised assessments reduce some threats to validity and increase others; the same is true of teacher observations. Although it is often claimed that observational assessments during normal classroom activities are more valid due to their authenticity, a serious threat to the validity of such observational judgements is that there is too much variety in the way the assessments are conducted and scored. If judgements of children are made based on observations in different circumstances, the danger is that the outcome may reflect the context of the assessment rather than the abilities of the child. Some circumstances may offer more support or contextual clues than others, making the task being assessed more (or less) accessible for the child. Controlling the circumstances in which the observations take place in order to make the judgements fairer can reduce this particular threat to validity, but may be very time-consuming to set up and result in something akin to a standardised assessment.

Many critics of using standardised assessments of young children point to the lack of authenticity or the adverse impact of unfamiliar environments or tasks. However, these threats to validity can be minimised by using familiar practical resources and including some familiarisation or introductory activities. It is of course vital that any assessments of young children are done sensitively by trained professionals, and with sensible and age-appropriate administration instructions and guidance. Most young children in Reception soon experience a very common type of one-to-one ‘formal assessment’ – a familiar adult listening to them read (or assessing their familiarity with books). Therefore a standardised baseline assessment may appear no more strange (or possibly less strange) than other activities they are asked to do in school that they have never done before – lining up, assemblies, doing the register, etc.

A key advantage of standardised, task-based assessments is that every child is given the same opportunity to demonstrate what they know, what they understand and what they can do. The tasks, the resources and the way they are administered are the same for every child, reducing sources of irrelevant variation in scores. Further, the clearly defined yes/no criteria make it easy for teachers to reach consistent judgements. However, it is often assumed that because the tasks are standardised, their use will be automated/sterile, with the children being assessed becoming upset by tasks they are unfamiliar with or are unable to complete. This underestimates the professionalism both of the assessment developers and the practitioners carrying out these assessments.

Child support

Firstly, well-designed assessments will provide guidance on how to introduce the assessments to the children and will employ discontinuation or routing rules to ensure children are not faced with tasks that are much too difficult for them. In the NFER Reception baseline assessment (RBA) we also suggested that if a child was unable to complete a particular task or set of tasks, before discontinuing the assessment or moving on to a different section, teachers might like to provide some scaffolding or modelling to support the child to complete some part of the task. Although a child given such support would not be scored as being able to do the task, helping the child to complete it would ensure a more positive experience for the child and the possibility of some benefit from the modelling/instruction given by the teacher.
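The interaction between a discontinuation rule and teacher support can be sketched in a few lines of Python. This is an illustration only, not the NFER RBA procedure: the `Response` class, the `administer` function and the threshold of three consecutive unscored items are all hypothetical, chosen simply to show that a supported completion earns no credit and that the section stops before the child faces a run of tasks that are much too difficult.

```python
from dataclasses import dataclass


@dataclass
class Response:
    correct: bool            # child met the yes/no criterion for the task
    supported: bool = False  # teacher gave scaffolding or modelling


def administer(responses, stop_after=3):
    """Score one section, stopping after `stop_after` consecutive unscored items.

    `stop_after` is a hypothetical threshold, not a published rule.
    Returns (score, number_of_items_presented).
    """
    score = 0
    misses = 0
    presented = 0
    for r in responses:
        presented += 1
        # A supported completion gives a positive experience but earns no credit.
        if r.correct and not r.supported:
            score += 1
            misses = 0
        else:
            misses += 1
            if misses >= stop_after:
                break  # discontinue: later items are likely to be too hard
    return score, presented


# Illustrative responses for one child (invented data).
section = [
    Response(True), Response(True), Response(False),
    Response(True, supported=True), Response(False),
    Response(True), Response(True),
]
print(administer(section))  # (2, 5): two items scored, section stopped after five
```

In this sketch the fourth task is completed with support, so it counts towards the run of unscored items rather than towards the score, and the last two tasks are never presented.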

Other issues sometimes cited as possible threats to the validity of baseline assessments are external factors such as the mood of the child or how tired or hungry the child is. Some of these factors also affect children in older age groups, and other threats to validity (e.g. motivation/self-esteem) can appear with older children that may not affect young children at all, or not to the same extent. In many cases, young children will not even realise they are being assessed. One advantage of the proposed baseline assessment (if it follows the previous DfE model) is that teachers can choose when and where they administer the assessment; it does not have to be carried out on a particular day or at a particular time. With any assessment of young children, guidance/training should be provided about when to assess each child, ensuring they are sufficiently settled in school, choosing an optimal time of day, etc. There should also be guidance as to how to split the assessment over two or more sessions where appropriate and when to abandon/curtail an assessment using routing rules or professional judgement.

It is often assumed that because children’s performance can vary from day to day, it is not possible to get an accurate picture of their abilities from a one-off assessment. In other words, the results of formal assessments of children will not be reliable. However, research evidence demonstrates that high levels of test-retest reliability (over the assessment as a whole) can be achieved with young children; they tend to achieve very similar scores when assessed on two separate occasions. On the NFER RBA, an overall test-retest correlation of 0.96 was achieved when the same children were assessed a second time within a week of the first administration. Although the performance of a child on individual task elements may differ slightly from day to day, this evidence suggests that a reasonably accurate overall picture of what an individual child knows, understands and can do can be derived from the assessment as a whole.
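For readers unfamiliar with the statistic, a test-retest correlation is conventionally the Pearson correlation between the total scores from two administrations. The sketch below assumes that convention; the function name `pearson` and the scores are invented for illustration and are not NFER data or the method used to obtain the 0.96 figure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented total scores: six children assessed twice, about a week apart.
first = [12, 18, 25, 31, 36, 40]
second = [14, 17, 26, 30, 38, 39]
print(round(pearson(first, second), 3))
```

A value close to 1 means the children are ranked and spaced almost identically on both occasions, even though individual scores move by a point or two.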

Further, it is important to note that the DfE’s proposed baseline measure will be at the cohort level, so individual variability will be ‘smoothed out’ over the cohort as a whole. In other words, some children may perform slightly better than expected on the assessment and some may perform slightly worse, but the aggregated outcome will be reasonably representative of the intake as a whole. Of course, gaming by schools could distort the extent to which the baseline is representative of the cohort, but this distortion would occur whatever the type of assessment.
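The 'smoothing out' can be made concrete with the standard error of a mean. The sketch below assumes that each child's score carries independent random error with the same standard deviation; the figure of 3 score points and the function name are hypothetical. Under that assumption the uncertainty in the cohort average shrinks with the square root of the cohort size.

```python
from math import sqrt


def standard_error_of_mean(child_sd, cohort_size):
    """Standard error of a cohort mean when each child's score carries
    independent random error with standard deviation `child_sd`."""
    if cohort_size < 1:
        raise ValueError("cohort_size must be at least 1")
    return child_sd / sqrt(cohort_size)


# Hypothetical figure: 3 score points of day-to-day noise per child.
for n in (1, 30, 60):
    print(n, round(standard_error_of_mean(3.0, n), 2))
# 1 3.0
# 30 0.55
# 60 0.39
```

The calculation covers random variation only. A systematic distortion such as gaming shifts every score in the same direction and does not average out, whatever the size of the cohort.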

Observational assessments are often perceived as more reliable than standardised assessments because judgements can be made from observations in more than one context over a period of time. However, in practical terms it is very difficult to carry out multiple observations against every criterion if the assessment window is relatively short. Also, the criteria used in such assessments are often open to some level of subjective interpretation, with teachers having to make ‘best-fit’ judgements against a number of statements. Training can help to moderate judgements and build consistency within schools. But, as with any system based on subjective judgements, localism can occur and it is difficult to develop and maintain consistency across different schools. It is also the case that, because the observations may give contradictory evidence, teachers may err on the side of caution, which, when multiplied over the assessment as a whole, may give an inaccurate picture of the child.

If we return to the purpose for which baseline assessments will be used, the point of the baseline is not to assess children’s potential, as some have suggested, but to assess their starting points or rather the starting point of the cohort. As such, the outcome of the baseline assessment will reflect the social and personal characteristics of the intake. Socioeconomic factors will impact on the results of the baseline assessments (as they do with every other assessment carried out in schools) and this will be reflected in the cohort baseline. The reason for measuring children’s starting points that reflect such characteristics is to then credit schools for the value they add by comparing the progress that schools with similar intakes make between school entry and the end of Key Stage 2.

We assume that much work will have to be done by the DfE in deciding how schools will be compared, particularly where cohorts have large proportions of children learning English as an additional language. These children have the potential to achieve far greater progress over the primary phase because their starting points will reflect unfamiliarity with the language of the assessment. As a result, once their familiarity with the English language matches that of their peers, their literacy/numeracy skills may progress much faster than those of children whose first language is English.

All assessment data should be treated with some caution, but if schools are to be held to account we have to decide if measuring from school entry to the end of Key Stage 2 is better than the current system. And if the accountability system is to use a baseline measure, it is important that standardised assessments are carefully developed and administered so that threats to validity and reliability can be minimised and an appropriate cohort baseline can be derived.



Lynch S, Bamforth H and Sims D (2015) Reception Baseline Research: Views of Teachers, School Leaders, Parents and Carers (Department for Education research report 409). London: Department for Education.

NFER (2017) In: Kirkup C (2017) Department for Education primary assessment in England government consultation: NFER response, 21 June 2017. Available at: (accessed 21 August 2017).