Lists from the start
Again: don’t ever create d1, d2, d3 in the first place. Just create a list d with 3 elements.
Reading multiple files into a list of data frames
This is done pretty easily when reading in files. Maybe you’ve got files data1.csv, data2.csv, ... in a directory. Your goal is a list of data.frames called mydata. The first thing you need is a vector with all the file names. You can construct this with paste (e.g., myfiles = paste0("data", 1:5, ".csv")), but it’s probably easier to use list.files to grab all the appropriate files: myfiles <- list.files(pattern = "\\.csv$"). Note that pattern is a regular expression, not a shell glob, so the dot is escaped and the match is anchored to the end of the name.
At this point, most R beginners will use a for loop, and there’s nothing wrong with that; it works.
mydata <- list()
for (i in seq_along(myfiles)) {
  mydata[[i]] <- read.csv(file = myfiles[i])
}
A more R-native way to do it is with lapply:
mydata <- lapply(myfiles, read.csv)
Either way, it’s handy to name the list elements to match the files:
names(mydata) <- gsub("\\.csv$", "", myfiles)
# or, if you prefer the consistent syntax of stringr
# (the dot is escaped because the pattern is a regular expression)
names(mydata) <- stringr::str_replace(myfiles, pattern = "\\.csv$", replacement = "")
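To make the whole read-and-name workflow concrete, here is a minimal self-contained sketch. The temporary directory, the file contents, and the column names id and value are all made up for illustration; only the list.files, lapply, and naming steps come from the text above. It uses tools::file_path_sans_ext as another base-R way to drop the extension.

```r
# Hypothetical setup: write three small CSV files into a temporary directory
tmp_dir <- file.path(tempdir(), "list_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = 1:4, value = i * (1:4)),
            file = file.path(tmp_dir, paste0("data", i, ".csv")),
            row.names = FALSE)
}

# Collect the file paths; pattern is a regular expression
myfiles <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into one element of a list
mydata <- lapply(myfiles, read.csv)

# Name the elements after the files, dropping directory and extension
names(mydata) <- tools::file_path_sans_ext(basename(myfiles))

# Elements can now be accessed by name
mydata$data2
```

Because full.names = TRUE returns full paths, basename() is needed before stripping the extension; if you read from the working directory as in the answer, that step is unnecessary.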
Splitting a data frame into a list of data frames
This is super-easy: the base function split() does it for you. You can split by a column (or columns) of the data, or by anything else you want:
mt_list = split(mtcars, f = mtcars$cyl)
# This gives a list of three data frames, one for each value of cyl
This is also a nice way to break a data frame into pieces for cross-validation. Maybe you want to split mtcars into training, test, and validation pieces.
groups = sample(c("train", "test", "validate"),
                size = nrow(mtcars), replace = TRUE)
mtsplit = split(mtcars, f = groups)
# and mtsplit has appropriate names already!
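Once the pieces are in a list, the same operation can be applied to each of them in one call. The sketch below is only an illustration: the model mpg ~ wt is an arbitrary choice, not something from the answer.

```r
# Split mtcars by number of cylinders, as above
mt_list <- split(mtcars, f = mtcars$cyl)

# Row counts per piece: 11 rows for cyl 4, 7 for cyl 6, 14 for cyl 8
piece_sizes <- sapply(mt_list, nrow)

# Fit the same (arbitrarily chosen) model to every piece
fits <- lapply(mt_list, function(d) lm(mpg ~ wt, data = d))

# Pull one coefficient out of each fit; the result is named by cyl value
slopes <- sapply(fits, function(m) coef(m)[["wt"]])
slopes
```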
Simulating a list of data frames
Maybe you’re simulating data, something like this:
my.sim.data = data.frame(x = rnorm(50), y = rnorm(50))
But who does only one simulation? You want to do this 100 times, 1000 times, more! But you don’t want 10,000 data frames in your workspace. Use replicate and put them in a list:
sim_list = replicate(n = 10,
                     expr = {data.frame(x = rnorm(50), y = rnorm(50))},
                     simplify = FALSE)
In this case especially, you should also consider whether you really need separate data frames, or whether a single data frame with a “group” column would work just as well. Using data.table, dplyr, or plyr, it’s quite easy to do things “by group” to a data frame.
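As a rough sketch of the single-data-frame alternative, the simulated list can be stacked and tagged with a group column. This version sticks to base R rather than the packages named above, and the column name sim and the choice of set.seed(1) are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed so the example is reproducible

# Ten simulated data frames in a list
sim_list <- replicate(n = 10,
                      expr = data.frame(x = rnorm(50), y = rnorm(50)),
                      simplify = FALSE)

# Stack them into one data frame and record which simulation each row came from
sim_df <- do.call(rbind, sim_list)
sim_df$sim <- rep(seq_along(sim_list), times = sapply(sim_list, nrow))

# A "by group" summary: mean of x and y within each simulation
sim_means <- aggregate(cbind(x, y) ~ sim, data = sim_df, FUN = mean)
head(sim_means)
```

Whether the list or the long data frame is more convenient depends on what comes next: per-element model fitting often reads well with lapply, while grouped summaries are usually simpler on one data frame.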
I didn’t put my data in a list 🙁 I will next time, but what can I do now?
If you have data frames named in a pattern, e.g., df1, df2, df3, and you want them in a list, you can get them if you can write a regular expression to match the names. Something like
df_list = lapply(ls(pattern = "df[0-9]"), get)
You should start off by double-checking just the ls part to make sure you’re getting the right variables. And next time use lists from the start.
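A variation on the same idea, sketched here with three made-up data frames, is to use mget instead of lapply with get. It does the same lookup but returns a list that is already named after the variables. The stricter pattern "^df[0-9]+$" is a choice for this example so that names like df_list are not picked up.

```r
# Hypothetical stand-ins for data frames that already exist in the workspace
df1 <- data.frame(a = 1:2)
df2 <- data.frame(a = 3:4)
df3 <- data.frame(a = 5:6)

# Step 1: check which variable names the pattern matches
df_names <- ls(pattern = "^df[0-9]+$")
df_names

# Step 2: fetch them all at once; mget() keeps the variable names
df_list <- mget(df_names)
names(df_list)
```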
Why put the data in a list?
Put similar data in lists because you probably want to do similar things to each data.frame, and functions like lapply, sapply, do.call, and the plyr l*ply functions make it really easy to do that. Examples of people easily doing things with lists are all over SO.
A common task is combining them. If you want to stack them on top of each other, you could use rbind for a pair of them, and do.call with rbind, or (for speed) dplyr::bind_rows, to put a whole list together. (Similarly, use cbind or dplyr::bind_cols for columns.) To merge (join) a list of data frames, you can see these answers.
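Both kinds of combining can be sketched in base R. The data below is invented for the example, and Reduce with merge is just one common base-R way to join a list; it is not necessarily the approach the linked answers recommend.

```r
# Stacking: pieces with the same columns (made-up data)
pieces <- list(
  data.frame(id = 1:2, x = c(10, 20)),
  data.frame(id = 3:4, x = c(30, 40)),
  data.frame(id = 5:6, x = c(50, 60))
)
# do.call() passes every list element as an argument to rbind()
stacked <- do.call(rbind, pieces)

# Merging: tables that share a key column (made-up data)
tables <- list(
  data.frame(id = 1:3, height = c(150, 160, 170)),
  data.frame(id = 1:3, weight = c(50, 60, 70)),
  data.frame(id = 1:3, age = c(30, 40, 50))
)
# Reduce() merges the first two, then merges the result with the third, and so on
merged <- Reduce(function(a, b) merge(a, b, by = "id"), tables)
merged
```

Neither call changes if the list grows from three elements to thirty.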
Think of scalability. If you really only need three variables, it’s fine to use d1, d2, d3. But then if it turns out you really need 6, that’s a lot more typing. And next time, when you need 10 or 20, you find yourself copying and pasting lines of code, maybe using find/replace to change d14 to d15, and you’re thinking this isn’t how programming should be. If you use a list, the difference between 3 cases, 30 cases, and 300 cases is at most one line of code, and no change at all if your number of cases is automatically detected by, e.g., how many .csv files are in your directory.
Even if you use a lowly for loop, it’s much easier to loop over the elements of a list than it is to construct variable names with paste and access the objects with get.
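A small side-by-side sketch of the two loop styles, using made-up data frames with a single column v:

```r
# With a list: index the list directly
d <- list(data.frame(v = 1:3), data.frame(v = 4:6), data.frame(v = 7:9))
totals_list <- numeric(length(d))
for (i in seq_along(d)) {
  totals_list[i] <- sum(d[[i]]$v)
}

# With separate variables: build each name as a string, then look it up
d1 <- data.frame(v = 1:3)
d2 <- data.frame(v = 4:6)
d3 <- data.frame(v = 7:9)
totals_get <- numeric(3)  # the count has to be hard-coded or tracked separately
for (i in 1:3) {
  totals_get[i] <- sum(get(paste0("d", i))$v)
}

# Both give the same totals: 6, 15, 24
identical(totals_list, totals_get)
```

The list version also knows its own length, so nothing needs updating when another data frame is added.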
You can name the elements of a list, in case you want to use something other than numeric indices to access your data frames (and you can use both, this isn’t an XOR choice).
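For instance, in this small sketch (the names train and test and the row counts are arbitrary), the same element is reachable by name or by position:

```r
# A named list of two data frames carved out of mtcars
d <- list(train = head(mtcars, 20), test = tail(mtcars, 12))

# Access by name, by position, or with $; all refer to the same element
by_name  <- d[["train"]]
by_index <- d[[1]]
by_dollar <- d$train

identical(by_name, by_index)
names(d)
```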
Overall, using lists will lead you to write cleaner, easier-to-read code, which will result in fewer bugs and less confusion.