About the Data Catalogue

What it is, and how to use it.

What is the EEF Data Catalogue?

This Data Catalogue is designed to facilitate access to data from the growing number of EEF-funded evaluations for secondary research analysis. Since 2011, the EEF has funded hundreds of independent evaluations of different projects to find out what works in improving children and young people's attainment outcomes. Most of the evaluation data comes from Randomised Controlled Trials (RCTs) and has been archived.

How is the data stored?

Typically, the archived evaluation data comprises the data submitted by the independent evaluators and a standardised transformation of that data across projects (examples provided below). The standardised version has been created to facilitate analyses involving more than one project and to aid data manipulation and interpretation.

The archive contains the Pupil Matching Reference (PMR) to enable linkage, for example, with the National Pupil Database (NPD).

A guide to the filename conventions

The filenames follow a convention illustrated by the examples below.

In the case of the Accelerated Reader project, AcceleratedReading14_SubmittedData_Reformatted is the data submitted by the evaluator in the format specified in the archive submission specification. It is the data the evaluator used for the analysis in the evaluation report. AcceleratedReading14_StandardisedFormat is a standardised transformation of the data submitted by the evaluator to fit a common specification across projects.

Example archive data - Accelerated Reader (efficacy trial)

AcceleratedReading14_StandardisedFormat
AcceleratedReading14_SubmittedData_Reformatted

In the case of the Rapid Phonics project, two files were submitted by the evaluator. RapidPhonics13_SubmittedData_OriginalFormat is the project data in the format used by the researcher for their analysis. RapidPhonics13_SubmittedData_Reformatted has been transformed by the evaluator to fit the archive submission specification.

Example archive data - Rapid Phonics (efficacy trial)

RapidPhonics13_SubmittedData_OriginalFormat
RapidPhonics13_SubmittedData_Reformatted
RapidPhonics13_StandardisedFormat

With the Families and Schools Together (FAST) project, survey data has also been archived by the evaluator. Any datasets beyond the scope of the submission specification are labelled OtherSubmittedData.

Example archive data - Families and Schools Together (FAST, effectiveness trial)

FAST141_SubmittedData_Reformatted
FAST141_OtherSubmittedData_BaselineSurvey
FAST141_OtherSubmittedData_EndpointSurvey
FAST141_OtherSubmittedData_MidpointSurvey
FAST141_OtherSubmittedData_NonAttainmentOutcomes
FAST141_StandardisedFormat

When accessing data from many projects, researchers may prefer to use the standardised version of the data. If the research focuses on a single project, the original submission format may be preferable.
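The naming pattern in the examples above can be split mechanically into its parts. The following is a minimal sketch only: the pattern is inferred from the example filenames rather than taken from a published specification, the names `ArchiveName` and `parse_archive_name` are hypothetical, and the meaning of the numeric suffix (14, 13, 141) is not stated on this page, so it is kept as an opaque string.

```python
import re
from typing import NamedTuple, Optional


class ArchiveName(NamedTuple):
    """Parts of an archive filename (hypothetical structure for illustration)."""
    project: str           # e.g. "RapidPhonics"
    suffix: str            # numeric suffix, e.g. "13"; meaning not stated here
    category: str          # e.g. "SubmittedData", "StandardisedFormat"
    detail: Optional[str]  # e.g. "Reformatted", "BaselineSurvey", or None


# Pattern inferred from the example filenames: <Project><digits>_<Category>[_<Detail>]
_PATTERN = re.compile(
    r"^(?P<project>[A-Za-z]+)(?P<suffix>\d+)"
    r"_(?P<category>[A-Za-z]+)(?:_(?P<detail>\w+))?$"
)


def parse_archive_name(name: str) -> ArchiveName:
    """Split a filename into its parts, or raise ValueError if it does not fit."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"Filename does not follow the assumed convention: {name!r}")
    return ArchiveName(
        match["project"], match["suffix"], match["category"], match["detail"]
    )


def standardised_only(names: list[str]) -> list[str]:
    """Keep only the standardised files, e.g. for analyses across projects."""
    return [n for n in names if parse_archive_name(n).category == "StandardisedFormat"]
```

For example, `parse_archive_name("FAST141_OtherSubmittedData_BaselineSurvey")` separates the project name `FAST`, the suffix `141`, the category `OtherSubmittedData` and the detail `BaselineSurvey`.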

How to search the catalogue

The title and description of each project are searchable. For example, to view all the projects evaluated by a particular organisation, search for its name in the free text search box. To return only exact matches, enclose the search phrase in quotes. Clicking the search button with an empty free text search box returns all projects. Search results can be refined or filtered using the checkboxes on the left-hand side of the screen.

Quantities and dates can also be used to find projects using the free text search box. Examples of the syntax for a set of search phrases are shown below:

Example Search Terms

Exact Phrase Search
- Search Query: "Behavioural Insights Team"
- URL example: Search for "Behavioural Insights Team"
Boolean Operators
- Search Query: rs: summary.programmeDeveloperDeliveryTeam:"Behavioural Insights Team" OR summary.projectEvaluator:"University of Manchester"
- URL example: Search with Boolean Operators
Search by Date (free text)
- Search Query: September 2016
- URL example: Search by Date
Search by Number (free text)
- Search Query: 200
- URL example: Search by Number

Using Specific Field Constraints

Report Publication Date
- Field: coverage.reportPublicationDate
- Search Query: rs: coverage.reportPublicationDate:"July 2016"
- URL example: Search by Report Publication Date
Intervention Start Date
- Field: coverage.interventionStartDate
- Search Query: rs: coverage.interventionStartDate:"September 2018"
- URL example: Search by Intervention Start Date
Intervention End Date
- Field: coverage.interventionEndDate
- Search Query: rs: coverage.interventionEndDate:"July 2017"
- URL example: Search by Intervention End Date
Programme Developer/Delivery Team
- Field: summary.programmeDeveloperDeliveryTeam
- Search Query: rs: summary.programmeDeveloperDeliveryTeam:"ARK"
- URL example: Search by Programme Developer/Delivery Team
Project Evaluator
- Field: summary.projectEvaluator
- Search Query: rs: summary.projectEvaluator:"RAND"
- URL example: Search by Project Evaluator
Pupils
- Field: evaluationDetails.pupils
- Search Query: rs: evaluationDetails.pupils:5462
- URL example: Search by Pupils
Schools
- Field: evaluationDetails.schools
- Search Query: rs: evaluationDetails.schools:197
- URL example: Search by Schools
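The field-constrained queries listed above share one shape: the `rs:` prefix, then `field:value` terms, with text values in quotes and numbers left bare. The sketch below assembles such query strings for pasting into the search box. It is inferred from the examples on this page only; the helper names `field_term` and `any_of` are hypothetical, and `OR` is the only Boolean operator shown here, so no others are assumed.

```python
from typing import Union


def field_term(field: str, value: Union[str, int]) -> str:
    """Format one field constraint, quoting text values and leaving numbers bare."""
    if isinstance(value, int):
        return f"{field}:{value}"
    return f'{field}:"{value}"'


def any_of(*terms: str) -> str:
    """Combine one or more field constraints with OR, behind the rs: prefix."""
    if not terms:
        raise ValueError("At least one term is required")
    return "rs: " + " OR ".join(terms)


# Reproduces the Boolean Operators example from the list above.
query = any_of(
    field_term("summary.programmeDeveloperDeliveryTeam", "Behavioural Insights Team"),
    field_term("summary.projectEvaluator", "University of Manchester"),
)
print(query)
```

A single numeric constraint is built the same way, for example `any_of(field_term("evaluationDetails.schools", 197))`.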

Synthetic data assets

The data catalogue includes a set of synthetic datasets created using the archived data from evaluations. Synthetic data is artificially generated data that mimics the characteristics of real data but does not contain any personally identifiable information or sensitive data.

Key characteristics

The available datasets are low fidelity, meaning they do not replicate the relationships between variables. While this limits what the data can be used for, it has advantages regarding data protection and accessibility for researchers.
The synthetic data is generated from summary statistics: the mean and standard deviation for numerical data, and the distributions of categorical data items.
If summary statistics contained extreme or sensitive values (for example, a very rare category), these were removed or appropriately modified before data generation, in accordance with disclosure procedures from the Office for National Statistics (ONS), which oversees the secure storage of the original data.
The synthetic data files are based on the archive output format that includes both values and labels for those values. An example of these fields is Treatment_Allocation and Treatment_Allocation_Desc, which include coded values and corresponding descriptions respectively. In the original data, these pairs align, but with the synthetic data, each field is generated independently, and no relationship among variables is preserved, so they may not align.
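The characteristics above can be illustrated with a small sketch. This is not the code used to create the published datasets. It assumes numerical fields are drawn from a normal distribution with the given mean and standard deviation (the page names the statistics but not the sampling distribution), and the codes, labels and proportions below are made up for illustration.

```python
import random


def synthesise_numeric(mean: float, sd: float, n: int, rng: random.Random) -> list[float]:
    """Draw n values from a normal distribution (an assumed choice of distribution)."""
    return [rng.gauss(mean, sd) for _ in range(n)]


def synthesise_categorical(distribution: dict, n: int, rng: random.Random) -> list:
    """Draw n categories according to their proportions in the summary statistics."""
    categories = list(distribution)
    weights = list(distribution.values())
    return rng.choices(categories, weights=weights, k=n)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 8

# Each field is generated on its own, so no relationship between fields survives.
codes = synthesise_categorical({0: 0.5, 1: 0.5}, n_rows, rng)
labels = synthesise_categorical({"Control": 0.5, "Treatment": 0.5}, n_rows, rng)
scores = synthesise_numeric(mean=100.0, sd=15.0, n=n_rows, rng=rng)

for code, label, score in zip(codes, labels, scores):
    # A code and its description can disagree because they were sampled independently.
    print(code, label, round(score, 1))
```

Because `codes` and `labels` are sampled separately, a row can pair one code with the description that belongs to a different code, which is the misalignment described for Treatment_Allocation and Treatment_Allocation_Desc.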

Benefits for researchers

Because the archive is a collection of datasets from separate evaluation projects, the availability and coverage of variables can vary among projects. These low-fidelity synthetic datasets are intended to let researchers explore a project's data and judge whether it is likely to suit their purposes. They also support the development of code outside the Secure Research Service environment.

Accessing the data and code

Synthetic datasets can be found under the Resources tab on each project page. Researchers can also access the open-source code used to create these low-fidelity datasets on GitHub: BIT-EEF-behavioural_synthetic_library.

This work was undertaken in the Office for National Statistics Secure Research Service using data from the ONS and other owners and does not imply the endorsement of the ONS or other data owners.

We welcome your feedback

To support ongoing improvements to this data catalogue, we invite you to share your views and suggestions via our feedback form. You can access the form by clicking here.
