Guide 4: NPD demographic and SEND data

Scripts needed for this guide

Introduction

In the previous guide, we created our initial cohort spine, whether using an NPD or HES inception. We will now extract demographic and SEND data. We are doing this across three different scripts because we are taking data at different time points/using different summaries (Table 1). Our choices here, including time frame, are more or less arbitrary and are chosen for the purposes of illustrating different ways of summarising and extracting this kind of data.

Table 1. Demographic data to be extracted. FSM free school meals; IDACI income deprivation affecting children index; SEND special educational needs and disability. * This is not necessarily the case. See text that follows.
Variable	Nature	Time point
Gender	Time invariant*	Modal value across reception to year 7
Ethnicity	Time invariant*	Modal value across reception to year 7
First language group major	Time invariant*	Modal value across reception to year 7
Year of birth	Time invariant	Modal value across reception to year 7
Month of birth	Time invariant	Modal value across reception to year 7
IDACI rank (tenth) - year 7	Time variant	IDACI rank in year 7
FSM - year 7	Time variant	Whether had FSM in year 7
Ever FSM	Time variant	Whether ever had FSM in reception to year 7
Ever SEND	Time variant	Whether ever had SEND in reception year 7
Highest level of SEND	Time variant	Highest level of SEND reception to year 7

Time invariant variables

Time invariant variables are those that are fixed, or considered fixed, in analyses. A study participant in theory only ever has one value on these, regardless of when we observe them. Therefore, values on these variables should always be the same across a child’s records throughout ECHILD. However, as often happens with administrative data, we see that records belonging to the same PMR/TPID over time may have conflicting values. Leaving aside the possibility of linkage error, this could be a result of data entry error (or they are not really time invariant—see below). Where records conflict, we therefore need to pick just one. In our example, we are treating gender, ethnicity, language group, year of birth and month birth as time invariant and select the modal (i.e., most frequently occurring) value from each child’s records across reception to year 7.

When taking the modal value, we are assigning equal weight to each record. In other words, each record is equally believable. If you have reason to believe that some records carry more weight, then you could employ a different method, for example by selecting the most recent value in situations where you believe that the most recent record is accurate. This assigns 100% weight to the most recent record and 0% to all previous records, regardless of their value and how many times they occur. We will show you how you can modify the code in our script to do this. In theory, you could employ more complex weights, though whether this is justified (i.e., whether it ever makes a difference to your analyses) should be borne in mind in determining whether the extra effort is worth it.

Another factor to consider is whether the variable is truly time invariant. We have noted gender, ethnicity and language with an asterisk in Table 1 because, in reality, how people identify in terms of their gender, ethnicity and language can change over their life course. This could then be reflected in the data. When working with administrative data, this becomes difficult when recognising that data quality for variables such as ethnicity for some groups, e.g., non-White children, is likely to be poorer (see the reports by Wickersham et al and Razieh et al). Additionally, it is not always clear who is reporting and recording the values in question, nor exactly what they think they are recording (consider the difference between gender and sex). As such, discerning whether a conflicting ethnicity, language or gender value is a true change in identity, correction of earlier records, data entry error, linkage error or something else, is very complex. You always have the option of sensitivity analyses, which may be warranted where the amount of conflicting information is high and/or where these variables are a particular focus of your study.

Time variant variables

Many variables are time variant, such as whether a child is getting free school meals (FSM) and their income deprivation affecting children index (IDACI) rank. This is because a child is not always necessarily eligible for FSM (this depends on their parents getting means-tested benefits) and because a child may move between areas, into areas of more or less deprivation.

Sometimes we are only interested in values in a given year, for example, at baseline. In this example, we will extract FSM and IDACI rank (available in tenths in ECHILD) at year 7. In other instances, we may wish to know whether a child ever experienced something, such as whether they ever had FSM, or the highest level of something they experienced, such as the highest level of SEND provision they ever received. We therefore also extract variables that indicate whether a child ever had FSM, whether they ever had SEND provision and their highest level of SEND provision from reception to year 7.

Modal values from reception to year 7

Data extraction

Our first script, 02_npd_demographic_modals.r, begins in a similar way to how we identified our NPD cohort in the previous guide: we define some year variables we are interested in, we extract the relevant census table names and then we run a function that creates the SQL query for each table, combining results into one data table. We then run some queries for the alternative provision and pupil referral unit censuses separately.

At this stage it is worth noting that the years we define here are 2008 to 2017 (Code Snippet 1).

Code Snippet 1 (R - 02_npd_demographic_modals.r)

---
years <- 2008:2017
---

These are the years that CONTAIN the reception to year 7 years for ALL THREE birth cohorts (born academic years 2002/3 to 2004/5), not the years that ARE the reception to year 7 years for each (e.g., the 2002/3 cohort will be in year 7 in 2014/15). However, this is not a problem, because we deal with year selection when constructing our SQL query (Code Snippet 2). Here we use the IN operator within our WHERE statement to select the relevant years from NCYearActual.

Code Snippet 2 (R - 02_npd_demographic_modals.r)

---
paste0(
   "SELECT ",
   pupil_column , ", ",
   gender_column, ", ",
   eth_column, ", ",
   language_column, ", ",
   year_birth_column, ", ",
   month_birth_column,
   " FROM ", table_name, 
   " WHERE ", ncyearactualcolumn, " IN ('R', '1', '2', '3', '4', '5', '6', '7') OR ",
   " (", ncyearactualcolumn, " = 'X' AND ", age_column, " <= '11')")
)
---

We also define an object called skips (Code Snippet 3), which we use to tell our function to skip certain censuses (Code Snippet 4). This is necessary because ethnicity is not contained in the summer census from 2011, nor the autumn census from 2012. In Code Snippet 3 we use the collapse argument within paste0() to create a string that serves as our regular expression in grepl() in the next line. Run the paste0() functions separately if you are not sure how these are working together.

Code Snippet 3 (R - 02_npd_demographic_modals.r)

---
skipstr <- paste0(paste0("Autumn_Census_", 2012:2017, collapse = "|"), "|",
                  paste0("Summer_Census_", 2011:2017, collapse = "|"))
skips <- tables$TABLE_NAME[grepl(skipstr, tables$TABLE_NAME)]
---

Code Snippet 4 (R - 02_npd_demographic_modals.r)

---
if (!(table_name %in% skips)) {
  
  # Query with ethnic group major here
  
} else {
  
  # Query without ethnic group major here
  
  temp[, ethnicgroupmajor := NA]
  
  col_order <- c(pupil_column,
                 gender_column,
                 "ethnicgroupmajor",
                 language_column,
                 year_birth_column,
                 month_birth_column)
  
  temp <- temp[, ..col_order]
  
}
---

Data cleaning

Before we actually select our modal value, we should clean the data. The script carries out the steps shown in Table 2.

Table 2. Data cleaning carried out prior to identifying modal values
Variable	Cleaning
Gender	Sets values `0` and `9` to `NA.` Then creates a variable called female, where `1` indicates female, and `0` male. This is a practice for binary variables that we recommend generally as there can never be confusion as to the coding. Binary variables coded as `1` and `2` should be avoided as this then requires one to consult a data dictionary. Then we drop the now-redundant gender variable.
Ethnicity	Ethnic groups are combined into the five major ethnic groups.
Language	Languages are combined into English or not English.
Year and month of birth	None.

Getting modal values

In order to get the modal value, we use mode_fun(), defined in 00b_prelim.r, which finds the modal element of a vector or, where the vector is multimodal (i.e., there are at least two options that occur equally with the highest frequency), we pick one at random using a single random draw from a multinomial distribution (Code Snippet 5).

Code Snippet 5 (R - 00b_prelim.r)

---
mode_fun <- function(x) {
  v <- x[!is.na(x)]
  ux <- unique(v)
  tab <- tabulate(match(v, ux))
  md <- ux[tab == max(tab)]
  if (length(md) == 1) {
    return(md)
  } else {
    return(md[which(rmultinom(1, 1, rep(1/length(md), length(md))) == 1)])
  }
}
---

There are two alternatives in Code Snippet 6: (1) try to find a modal value and if multimodal, return the last value; (2) always return the last value, regardless of the mode. If using these, you will need to go back and edit the script that extracts data to also include year and census, so that you can ensure the data are ordered properly.

Code Snippet 6 (R)

---
# Try to find a mode; if multimodal, return the last value
mode_fun <- function(v) {
  v <- v[!is.na(v)]
  ux <- unique(v)
  tab <- tabulate(match(v, ux))
  md <- ux[tab == max(tab)]
  if (length(md) == 1) {
    return(md)
  } else {
    return(v[length(v)])
  }
}

# Always return the last value
get_last_value <- function(v) { return(v[length(v)]) }
---

We use mode_fun() (Code Snippet 5) to get the modal value of each of the time invariant demographic variables by PMR, i.e., using all of each child’s records from reception to year 7 (Code Snippet 7). We first set a seed because we use a random process (random draw from a multinomial distribution) in mode_fun().

Code Snippet 7 (R - 02_npd_demographic_modals.r)

---
set.seed(100)
demo_modals[, female := mode_fun(female), by = PupilMatchingRefAnonymous]
demo_modals[, ethnicgroup := mode_fun(ethnicgroup), by = PupilMatchingRefAnonymous]
demo_modals[, languagegroup := mode_fun(languagegroup), by = PupilMatchingRefAnonymous]
demo_modals[, yearofbirth := mode_fun(yearofbirth), by = PupilMatchingRefAnonymous]
demo_modals[, monthofbirth := mode_fun(monthofbirth), by = PupilMatchingRefAnonymous]
---

The result

After deduplication, we end up with a data table that contains PMR, year of birth, month of birth, female, ethnic group and language group. We will save this as its own object before moving onto the next script to extract data at year 7.

Values at year 7

Script 03_npd_demographic_year7.r is significantly simpler, as we are only dealing with one time point. Again, we extract data from the censuses, alternative provision and pupil referral units and deduplicate. We then save our resulting object, which contains PMR, FSM and IDACI data at year 7.

“Ever” values across reception to year 7

Script 04_npd_demographic_ever.r is a little more complex. We first extract data. We then identify which children are enrolled in special schools (which we flag using a variable called specialist_school). To do this, we use the Get Information About Schools database and link the relevant information using the school unique reference number (URN) (Code Snippet 8). [Note: this step is still under construction.]

Code Snippet 8 (R - 04_npd_demographic_ever.r)

---
# Under construction - code omitted
---

Next, we clean our SEND variable to combine the old Action and Action Plus values with the new Support value. Support replaced Action/Action Plus from academic year 2014/15. Similarly, our SEND variable combines the old statements of SEND with the new Education, Health & Care Plan (which were phased in from 2014/15). We also add children enrolled in alternative provision or pupil referrals units to our specialist_school flag, as children only in alternative provision will not be captured in the Get Information About Schools database (Code Snippet 9).

Code Snippet 9 (R - 04_npd_demographic_ever.r)

---
demographics_ever[senprovision %in% c("A", "P", "K"), senprovision := "support"]
demographics_ever[senprovision %in% c("E", "e", "S"), senprovision := "sehcp"]
demographics_ever[senprovision == "N", senprovision := NA]
demographics_ever[census %in% c("AP", "PRU"), specialist_school := T]
---

Finally, we define fsm_ever_primary by taking the maximum value of fsmeligible across all records (Code Snippet 10). If there is ever a 1 (the maximum of 0 and 1), then this operation will return a 1 by child.

Code Snippet 9 (R - 04_npd_demographic_ever.r)

---
demographics_ever[, fsm_ever_primary := max(fsmeligible), by = PupilMatchingRefAnonymous]
---

The result

At the end of this script, we retain just four variables, PupilMatchingRefAnonymous, fsm_ever_primary, senprovision and specialist_school. Note that we so far have not defined an “ever SEN” variable. We will do this in Guide 11. For now, we just save the resulting object as its own file and clear the workspace.

Coming up…

In total, we have therefore created three separate datasets containing the different types of demographic information that we will later join onto our spine. In the next Guide, we will think about why we might now also want to extract demographic data from HES and how you can adapt code provided in this Guide to do so.