Appendix: a brief introduction to data.table
Script needed for this guide
Introduction
Throughout these guides, we use the R package data.table. This package makes working with large data frames much easier. In this guide, we introduce data.table
. While by no means comprehensive (we do not make full use of data.table
’s functionality), it should give you just enough to get you going with the guides if you are unfamiliar with the package. The second half of this guide also provides some hints and tips for writing scripts and managing memory.
Some pretend data
We are going to simulate some data, rather than use real data. This is to keep things simple as we illustrate some points. The first part of the script data.table.demo.r simulates a dataset with one million rows and a set of columns. While these represent the sorts of data you will encounter in ECHILD, their content is entirely randomly generated with no relationships specified among the variables.
A data.table
is just a fancy data.frame
The first thing to note is that data.table
introduces a new object class, the data.table
. A data.table
is nothing other than a fancy data.frame
: it does what a data.frame
can do, and then some. In the words of its creators, data.table
“provides a high-performance version of base R’s data.frame
with syntax and feature enhancements for ease of use, convenience and programming speed.” There are some key differences that make data.table
syntax much tidier than base R and that make certain operations much easier. Another key advantage is that working with data.table
is often much quicker than with the base R data.frame
.
Indexing a data.table
In base R, we use square brackets to index an object such as a data.frame
, supplying a value to the i
(row) and j
(column) coordinates. For example, we might supply integers to select a specified cell or we supply a vector to subset to certain rows. We can do the same thing with a data.table
(Code Snippet 1). Note how the syntax for data.table
is much simpler. In base R, we must include a comma in the square brackets, even if we leave the j
coordinate blank (i.e., to return all columns). A key difference is that, with data.table
, we do not need to explicitly supply the j
coordinate and, if we are just indexing on i
(row), we can omit the column.
Code Snippet 1 (R - data.table.demo.r)
---
# Return the 1st row and 4th column
1, 4]
dt_dataframe[1, 4]
dt_datatable[
# Return just the 1st row
1, ]
dt_dataframe[1, ]
dt_datatable[1]
dt_datatable[
# Return just the 4th column
4]
dt_dataframe[, 4]
dt_datatable[,
# Return all columns for a specific group of children (in this case, all girls)
$female == 1, ]
dt_dataframe[dt_dataframe== 1]
dt_datatable[female
# Return all rows for specified columns
c("tokenpersonid", "age_yr")]
dt_dataframe[, c("tokenpersonid", "age_yr")]
dt_datatable[,
dt_datatable[, .(tokenpersonid, age_yr)]---
Note how, as well, the output is more concise when using data.table
(Code Snippet 2).
Code Snippet 2 (R output)
---
> dt_dataframe[, 4]
1] "neurological" "respiratory" "respiratory" "infection" "respiratory" "infection" "infection" "neurological" "mental"
[10] "respiratory" "infection" "infection" "infection" "mental" "respiratory" "infection" "infection" "infection"
[19] "neurological" "neurological" "infection" "cardiac" "infection" "infection" "respiratory" "respiratory" "cardiac"
[28] "infection" "mental" "infection" "neurological" "respiratory" "infection" "infection" "infection" "mental"
[37] "infection" "infection" "respiratory" "respiratory" "respiratory" "neurological" "mental" "cardiac" "mental"
[46] "neurological" "cardiac" "infection" "respiratory" "respiratory" "infection" "neurological" "mental" "neurological"
[55] "respiratory" "neurological" "mental" "respiratory" "cardiac" "cardiac" "neurological" "cardiac" "mental"
[64] "cardiac" "infection" "infection" "respiratory" "neurological" "infection" "cardiac" "respiratory" "mental"
[73] "cardiac" "cardiac" "cardiac" "infection" "mental" "mental" "respiratory" "cardiac" "neurological"
[82] "cardiac" "neurological" "respiratory" "mental" "cardiac" "neurological" "infection" "infection" "mental"
[91] "infection" "mental" "neurological" "neurological" "respiratory" "infection" "cardiac" "mental" "neurological"
[100] "respiratory"
[getOption("max.print") -- omitted 999900 entries ]
[ reached
> dt_datatable[, 4]
chc_group<char>
1: neurological
2: respiratory
3: respiratory
4: infection
5: respiratory
---
999996: mental
999997: mental
999998: respiratory
999999: infection
1000000: mental
---
Ordering
Ordering a data.table
is also rather easier and quicker than ordering a data.frame
(Code Snippet 3). In this example, ordering the data.frame
took about 1.1 seconds compared to 0.05 for data.table
.
Code Snippet 3 (R - data.table.demo.r)
---
<- dt_dataframe[order(dt_dataframe$chc_group, dt_dataframe$epistart), ]
dt_dataframe <- dt_datatable[order(chc_group, epistart)]
dt_datatable ---
Creating variables explicitly
To add a new variable to a base R data.frame
, we use code as in the first line of Code Snippet 4. While we could do this to a data.table
, data.table
’s own way is slightly different, as shown in the second line of Code Snippet 4. Note how we can perform operations on other columns within the same data.table
easily with reference to their column name.
Code Snippet 4 (R - data.table.demo.r)
---
$age_yr_sq <- dt_dataframe$age_yr ^ 2
dt_dataframe:= age_yr ^ 2]
dt_datatable[, age_yr_sq ---
Creating variables in a loop
In Code Snippet 4, we created the new column explicitly. Occasionally, however, in the guides we want to create a group of variables in a loop, where the new column is created dynamically (for example, when dealing with dual enrolment in guide 8). Study Code Snippet 5. Here we create variables that give the date that is the episode start date plus 365 days, 730 days, and so on until 3,650 days (i.e., roughly 10 years).
Code Snippet 5 (R - data.table.demo.r)
---
<- seq(365, 365 * 10, by = 365)
periods
for (period in periods) {
<- paste0("epistart_plus_", period, "_days")
new_var := epistart + period]
dt_datatable[, (new_var)
}
rm(period, periods, new_var)
---
Reshaping
Reshaping in base R is notoriously difficult. The syntax for the base R reshape()
function is not at all intuitive. Fortunately, data.table
comes with its own reshaping functions, melt()
and dcast()
, which are illustrated in Code Snippet 6, where we reshape the episode start date and the new dates we created in Code Snippet 5 into long format, and then back into wide.
Code Snippet 6 (R - data.table.demo.r)
---
# Reshape to long
<- names(dt_datatable)[grepl("epistart", names(dt_datatable))]
epistart_vars
<-
dt_datatable_long melt(
c("tokenpersonid", epistart_vars), with = F],
dt_datatable[, id.vars = "tokenpersonid",
value.name = "date"
)
<- dt_datatable_long[order(tokenpersonid)]
dt_datatable_long
# Reshape (back) to wide
<-
dt_datatable_wide dcast(
dt_datatable_long,formula = tokenpersonid ~ variable
)---
Further help
Hopefully between this guide and the others, you now have enough to get you going. If you have any questions, feel free to post a question in the ECHILD discussion forum. There is, of course, also a wealth of information on data.table
on the internet that you can search and consult.