Appendix: a brief introduction to data.table

Script needed for this guide

data.table.demo.r

Introduction

Throughout these guides, we use the R package data.table. This package makes working with large data frames much easier. In this guide, we introduce data.table. While by no means comprehensive (we do not make full use of data.table’s functionality), it should give you just enough to get you going with the guides if you are unfamiliar with the package.

Some pretend data

We are going to simulate some data, rather than use real data. This is to keep things simple as we illustrate some points. The first part of the script data.table.demo.r simulates a dataset with one million rows and a set of columns. While these represent the sorts of data you will encounter in ECHILD, their content is entirely randomly generated with no relationships specified among the variables.

A `data.table` is just a fancy `data.frame`

The first thing to note is that data.table introduces a new object class, the data.table. A data.table is nothing other than a fancy data.frame: it does what a data.frame can do, and then some. In the words of its creators, data.table “provides a high-performance version of base R’s data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.” There are some key differences that make data.table syntax much tidier than base R and that make certain operations much easier. Another key advantage is that working with data.table is often much quicker than with the base R data.frame.

Indexing a `data.table`

In base R, we use square brackets to index an object such as a data.frame, supplying a value to the i (row) and j (column) coordinates. For example, we might supply integers to select a specified cell or we supply a vector to subset to certain rows. We can do the same thing with a data.table (Code Snippet 1). Note how the syntax for data.table is much simpler. In base R, we must include a comma in the square brackets, even if we leave the j coordinate blank (i.e., to return all columns). A key difference is that, with data.table, we do not need to explicitly supply the j coordinate and, if we are just indexing on i (row), we can omit the column.

Code Snippet 1 (R - data.table.demo.r)

---
# Return the 1st row and 4th column
dt_dataframe[1, 4]
dt_datatable[1, 4]

# Return just the 1st row
dt_dataframe[1, ]
dt_datatable[1, ]
dt_datatable[1]

# Return just the 4th column
dt_dataframe[, 4]
dt_datatable[, 4]

# Return all columns for a specific group of children (in this case, all girls)
dt_dataframe[dt_dataframe$female == 1, ]
dt_datatable[female == 1]

# Return all rows for specified columns
dt_dataframe[, c("tokenpersonid", "age_yr")]
dt_datatable[, c("tokenpersonid", "age_yr")]
dt_datatable[, .(tokenpersonid, age_yr)]
---

Note how, as well, the output is more concise when using data.table (Code Snippet 2).

Code Snippet 2 (R output)

---
> dt_dataframe[, 4]
   [1] "neurological" "respiratory"  "respiratory"  "infection"    "respiratory"  "infection"    "infection"    "neurological" "mental"      
  [10] "respiratory"  "infection"    "infection"    "infection"    "mental"       "respiratory"  "infection"    "infection"    "infection"   
  [19] "neurological" "neurological" "infection"    "cardiac"      "infection"    "infection"    "respiratory"  "respiratory"  "cardiac"     
  [28] "infection"    "mental"       "infection"    "neurological" "respiratory"  "infection"    "infection"    "infection"    "mental"      
  [37] "infection"    "infection"    "respiratory"  "respiratory"  "respiratory"  "neurological" "mental"       "cardiac"      "mental"      
  [46] "neurological" "cardiac"      "infection"    "respiratory"  "respiratory"  "infection"    "neurological" "mental"       "neurological"
  [55] "respiratory"  "neurological" "mental"       "respiratory"  "cardiac"      "cardiac"      "neurological" "cardiac"      "mental"      
  [64] "cardiac"      "infection"    "infection"    "respiratory"  "neurological" "infection"    "cardiac"      "respiratory"  "mental"      
  [73] "cardiac"      "cardiac"      "cardiac"      "infection"    "mental"       "mental"       "respiratory"  "cardiac"      "neurological"
  [82] "cardiac"      "neurological" "respiratory"  "mental"       "cardiac"      "neurological" "infection"    "infection"    "mental"      
  [91] "infection"    "mental"       "neurological" "neurological" "respiratory"  "infection"    "cardiac"      "mental"       "neurological"
 [100] "respiratory"
[ reached getOption("max.print") -- omitted 999900 entries ]


> dt_datatable[, 4]
            chc_group
               <char>
      1: neurological
      2:  respiratory
      3:  respiratory
      4:    infection
      5:  respiratory
     ---             
 999996:       mental
 999997:       mental
 999998:  respiratory
 999999:    infection
1000000:       mental
---

Ordering

Ordering a data.table is also rather easier and quicker than ordering a data.frame (Code Snippet 3). In this example, ordering the data.frame took about 1.1 seconds compared to 0.05 for data.table.

Code Snippet 3 (R - data.table.demo.r)

---
dt_dataframe <- dt_dataframe[order(dt_dataframe$chc_group, dt_dataframe$epistart), ]
dt_datatable <- dt_datatable[order(chc_group, epistart)]
---

Creating variables explicitly

To add a new variable to a base R data.frame, we use code as in the first line of Code Snippet 4. While we could do this to a data.table, data.table’s own way is slightly different, as shown in the second line of Code Snippet 4. Note how we can perform operations on other columns within the same data.table easily with reference to their column name.

Code Snippet 4 (R - data.table.demo.r)

---
dt_dataframe$age_yr_sq <- dt_dataframe$age_yr ^ 2
dt_datatable[, age_yr_sq := age_yr ^ 2]
---

Creating variables in a loop

In Code Snippet 4, we created the new column explicitly. Occasionally, however, in the guides we want to create a group of variables in a loop, where the new column is created dynamically (for example, when dealing with dual enrolment in guide 8). Study Code Snippet 5. Here we create variables that give the date that is the episode start date plus 365 days, 730 days, and so on until 3,650 days (i.e., roughly 10 years).

Code Snippet 5 (R - data.table.demo.r)

---
periods <- seq(365, 365 * 10, by = 365)

for (period in periods) {
  new_var <- paste0("epistart_plus_", period, "_days")
  dt_datatable[, (new_var) := epistart + period]
}

rm(period, periods, new_var)
---

Reshaping

Reshaping in base R is notoriously difficult. The syntax for the base R reshape() function is not at all intuitive. Fortunately, data.table comes with its own reshaping functions, melt() and dcast(), which are illustrated in Code Snippet 6, where we reshape the episode start date and the new dates we created in Code Snippet 5 into long format, and then back into wide.

Code Snippet 6 (R - data.table.demo.r)

---
# Reshape to long
epistart_vars <- names(dt_datatable)[grepl("epistart", names(dt_datatable))]

dt_datatable_long <-
  melt(
    dt_datatable[, c("tokenpersonid", epistart_vars), with = F],
    id.vars = "tokenpersonid",
    value.name = "date"
  )

dt_datatable_long <- dt_datatable_long[order(tokenpersonid)]

# Reshape (back) to wide
dt_datatable_wide <-
  dcast(
    dt_datatable_long,
    formula = tokenpersonid ~ variable
  )
---

Further help

Hopefully between this guide and the others, you now have enough to get you going. If you have any questions, feel free to post a question in the ECHILD discussion forum. There is, of course, also a wealth of information on data.table on the internet that you can search and consult.