Guide 4: NPD demographic and SEND data

Scripts needed for this guide

Introduction

In the previous guide, we created our initial cohort spine, whether using an NPD or HES inception. We will now extract demographic and SEND data. We are doing this across three different scripts because we are taking data at different time points/using different summaries (Table 1). Our choices here, including time frame, are more or less arbitrary and are chosen for the purposes of illustrating different ways of summarising and extracting this kind of data.

Table 1. Demographic data to be extracted. FSM free school meals; IDACI income deprivation affecting children index; SEND special educational needs and disability. * This is not necessarily the case. See text that follows.
Variable Nature Time point
Gender Time invariant* Modal value across reception to year 7
Ethnicity Time invariant* Modal value across reception to year 7
First language group major Time invariant* Modal value across reception to year 7
Year of birth Time invariant Modal value across reception to year 7
Month of birth Time invariant Modal value across reception to year 7
IDACI rank (tenth) - year 7 Time variant IDACI rank in year 7
FSM - year 7 Time variant Whether had FSM in year 7
Ever FSM Time variant Whether ever had FSM in reception to year 7
Ever SEND Time variant Whether ever had SEND in reception year 7
Highest level of SEND Time variant Highest level of SEND reception to year 7

Time invariant variables

Time invariant variables are those that are fixed, or considered fixed, in analyses. A study participant in theory only ever has one value on these, regardless of when we observe them. Therefore, values on these variables should always be the same across a child’s records throughout ECHILD. However, as often happens with administrative data, we see that records belonging to the same PMR/TPID over time may have conflicting values. Leaving aside the possibility of linkage error (and the possibility that these are not really time invariant—see below), this could be a result of data entry error. Where records conflict, we therefore need to pick just one. In our example, we are treating gender, ethnicity, language group, year of birth and month birth as time invariant and select the modal (i.e., most frequently occurring) value from each child’s records across reception to year 7.

When taking the modal value, we are assigning equal weight to each record. In other words, each record is equally believable. If you have reason to believe that some records carry more weight, then you could employ a different method, for example by selecting the most recent value in situations where you believe that the most recent record is accurate. This assigns 100% weight to the most recent record and 0% to all previous records, regardless of their value and how many times they occur. We will show you how you can easily modify the code in our script to do this. In theory, you could employ more complex weights, though whether this is justified (i.e., whether it ever makes a difference to your analyses) should be borne in mind in determining whether the extra effort is worth it.

Another factor to bear in mind is whether your variable is truly time invariant. We have noted gender, ethnicity and language with an asterisk in Table 1 because, in reality, how people identify in terms of their gender, ethnicity and language can change over their life course. This could then be reflected in the data. When working with administrative data, this becomes difficult when recognising that data quality for variables such as ethnicity for some groups (e.g., non-White children) is likely to be poorer and that it is not always clear who is reporting and recording the values in question, nor exactly what they think they are recording (consider, e.g., the difference between gender and sex). As such, discerning whether a conflicting ethnicity, language or gender value is a true change in identity, correction of earlier records, data entry error, linkage error or something else, is very complex. You always have the option of sensitivity analyses, which may be warranted where the amount of conflicting information is high and/or where these variables are a particular focus of your study.

Time variant values

Many variables are time variant, such as whether a child is getting free school meals (FSM) and their income deprivation affecting children index (IDACI) rank. This is because a child is not always necessarily eligible for FSM (this depends on their parents getting means-tested benefits) and because a child may move between areas, into areas of more or less deprivation.

Sometimes we are only interested in values in a given year, for example, at baseline. In this example, we will extract FSM and IDACI rank (available in tenths in ECHILD) at year 7. In other instances, we may wish to know whether a child ever experienced something, such as whether they ever had FSM, or the highest level of something they experienced, such as the highest level of SEND provision they ever received. We therefore also extract variables that indicate whether a child ever had FSM, whether they ever had SEND provision and their highest level of SEND provision from reception to year 7.

Getting modal values

In order to get the modal value, we use mode_fun(), defined in 00b_prelim.r, which finds the modal element of a vector or, where the vector is multimodal (i.e., there are at least two options that occur equally with the highest frequency), we pick one at random using a single random draw from a multinomial distribution (Code Snippet 5).

Code Snippet 5 (R - 00b_prelim.r)

---
mode_fun <- function(x) {
  v <- x[!is.na(x)]
  ux <- unique(v)
  tab <- tabulate(match(v, ux))
  md <- ux[tab == max(tab)]
  if (length(md) == 1) {
    return(md)
  } else {
    return(md[which(rmultinom(1, 1, rep(1/length(md), length(md))) == 1)])
  }
}
---

There are two alternatives in Code Snippet 6: (1) try to find a modal value and if multimodal, return the last value; (2) always return the last value, regardless of the mode. If using these, you will need to go back and edit the script that extracts data to also include year and census, so that you can ensure the data are ordered properly.

Code Snippet 6 (R)

---
# Try to find a mode; if multimodal, return the last value
mode_fun <- function(v) {
  v <- v[!is.na(v)]
  ux <- unique(v)
  tab <- tabulate(match(v, ux))
  md <- ux[tab == max(tab)]
  if (length(md) == 1) {
    return(md)
  } else {
    return(v[length(v)])
  }
}

# Always return the last value
get_last_value <- function(v) { return(v[length(v)]) }
---

We use mode_fun() (Code Snippet 5) to get the modal value of each of the time invariant demographic variables by PMR, i.e., using all of each child’s records from reception to year 7 (Code Snippet 7). We first set a seed because we use a random process (random draw from a multinomial distribution) in mode_fun().

Code Snippet 7 (R - 02_npd_demographic_modals.r)

---
set.seed(100)
demo_modals[, female := mode_fun(female), by = PupilMatchingRefAnonymous]
demo_modals[, ethnicgroup := mode_fun(ethnicgroup), by = PupilMatchingRefAnonymous]
demo_modals[, languagegroup := mode_fun(languagegroup), by = PupilMatchingRefAnonymous]
demo_modals[, yearofbirth := mode_fun(yearofbirth), by = PupilMatchingRefAnonymous]
demo_modals[, monthofbirth := mode_fun(monthofbirth), by = PupilMatchingRefAnonymous]
---

The result

After deduplication, we end up with a data table that contains PMR, year of birth, month of birth, female, ethnic group and language group. We will save this as its own object before moving onto the next script to extract data at year 7.

Values at year 7

Script 03_npd_demographic_year7.r is significantly simpler, as we are only dealing with one time point. Again, we extract data from the censuses, alternative provision and pupil referral units and deduplicate. We then save our resulting object, which contains PMR, FSM and IDACI data at year 7.

“Ever” values across reception to year 7

Script 04_npd_demographic_ever.r is a little more complex. We first extract data. We then identify which children are enrolled in special schools (which we flag using a variable called specialist_school). To do this, we use the Get Information About Schools database and link the relevant information using the school unique reference number (URN) (Code Snippet 8). [Note: this step is still under construction.]

Code Snippet 8 (R - 04_npd_demographic_ever.r)

---
# Under construction - code omitted
---

Next, we clean our SEND variable to combine the old Action and Action Plus values with the new Support value. Support replaced Action/Action Plus from academic year 2014/15. Similarly, our SEND variable combines the old statements of SEND with the new Education, Health & Care Plan (which were phased in from 2014/15). We also add children enrolled in alternative provision or pupil referrals units to our specialist_school flag, as children only in alternative provision will not be captured in the Get Information About Schools database (Code Snippet 9).

Code Snippet 9 (R - 04_npd_demographic_ever.r)

---
demographics_ever[senprovision %in% c("A", "P", "K"), senprovision := "support"]
demographics_ever[senprovision %in% c("E", "e", "S"), senprovision := "sehcp"]
demographics_ever[senprovision == "N", senprovision := NA]
demographics_ever[census %in% c("AP", "PRU"), specialist_school := T]
---

Finally, we define fsm_ever_primary by taking the maximum value of fsmeligible across all records (Code Snippet 10). If there is ever a 1 (the maximum of 0 and 1), then this operation will return a 1 by child.

Code Snippet 9 (R - 04_npd_demographic_ever.r)

---
demographics_ever[, fsm_ever_primary := max(fsmeligible), by = PupilMatchingRefAnonymous]
---

The result

At the end of this script, we retain just four variables, PupilMatchingRefAnonymous, fsm_ever_primary, senprovision and specialist_school. Note that we so far have not defined an “ever SEN” variable. We will do this in Guide 11. For now, we just save the resulting object as its own file and clear the workspace.

Coming up…

In total, we have therefore created three separate datasets containing the different types of demographic information that we will later join onto our spine. In the next Guide, we will think about why we might now also want to extract demographic data from HES and how you can adapt code provided in this Guide to do so.