Multiple Dataframes within Lists from the start

Lists from the start

Again: don’t ever create d1, d2, d3 in the first place. Just create a list d with 3 elements.
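A minimal sketch of what that looks like, using three small made-up data frames (the contents are placeholders, not from any real data set):

```r
# Build the list directly instead of creating d1, d2, d3
d <- list(
  data.frame(x = 1:3, y = letters[1:3]),
  data.frame(x = 4:6, y = letters[4:6]),
  data.frame(x = 7:9, y = letters[7:9])
)

length(d)   # number of data frames in the list
d[[2]]      # the second data frame
```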

Reading multiple files into a list of data frames

This is done pretty easily when reading in files. Maybe you’ve got files data1.csv, data2.csv, ... in a directory. Your goal is a list of data.frames called mydata. The first thing you need is a vector with all the file names. You can construct this with paste0 (e.g., myfiles = paste0("data", 1:5, ".csv")), but it’s probably easier to use list.files to grab all the appropriate files: myfiles <- list.files(pattern = "\\.csv$"). Note that pattern is a regular expression, not a shell glob, so "*.csv" does not mean what it would on the command line.

At this point, most R beginners will use a for loop, and there’s nothing wrong with that. It works.

mydata <- list()
for (i in seq_along(myfiles)) {
    mydata[[i]] <- read.csv(file = myfiles[i])
}

A more R-native way to do it is with lapply:

mydata <- lapply(myfiles, read.csv)

Either way, it’s handy to name the list elements to match the files:

# "\\.csv$" matches a literal ".csv" only at the end of the file name
names(mydata) <- sub("\\.csv$", "", myfiles)
# or, if you prefer the consistent syntax of stringr
names(mydata) <- stringr::str_replace(myfiles, pattern = "\\.csv$", replacement = "")
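To try the whole workflow end to end, here is a self-contained sketch. The temporary directory, the file contents, and the name csv_dir are made up for illustration. It also uses tools::file_path_sans_ext as another way to drop the extension.

```r
# Write a few small CSV files to a temporary directory (illustrative data only)
csv_dir <- file.path(tempdir(), "list_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = 1:2, value = i * c(10, 20)),
            file = file.path(csv_dir, paste0("data", i, ".csv")),
            row.names = FALSE)
}

# "\\.csv$" is a regular expression matching names that end in ".csv"
myfiles <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into one list of data frames
mydata <- lapply(myfiles, read.csv)

# Name each element after its file, without directory or extension
names(mydata) <- tools::file_path_sans_ext(basename(myfiles))

names(mydata)
```

Using full.names = TRUE means the files can be read no matter what the working directory is.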

Splitting a data frame into a list of data frames

This is super easy: the base function split() does it for you. You can split by a column (or columns) of the data, or by anything else you want:

mt_list = split(mtcars, f = mtcars$cyl)
# This gives a list of three data frames, one for each value of cyl
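As a quick check of what split() returns, and a sketch of splitting by more than one column (the choice of am as the second column is only an example):

```r
mt_list <- split(mtcars, f = mtcars$cyl)
names(mt_list)          # "4" "6" "8"
sapply(mt_list, nrow)   # rows per piece: 11, 7, 14

# Splitting by two columns: pass a list of grouping vectors
mt_list2 <- split(mtcars, f = list(mtcars$cyl, mtcars$am))
names(mt_list2)         # names combine both values, e.g. "4.0", "4.1"
```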

This is also a nice way to break a data frame into pieces for cross-validation. Maybe you want to split mtcars into training, test, and validation pieces.

groups = sample(c("train", "test", "validate"),
                size = nrow(mtcars), replace = TRUE)
mtsplit = split(mtcars, f = groups)
# and mtsplit has appropriate names already!
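With independent sampling like this, the group sizes are themselves random, so one piece can end up much smaller than the others. If roughly equal sizes matter to you (an assumption about your use case), one option is to shuffle a balanced vector of labels instead:

```r
set.seed(42)  # only so the example is reproducible

# Repeat the labels to cover every row, then shuffle their order
labels <- rep(c("train", "test", "validate"), length.out = nrow(mtcars))
groups <- sample(labels)

mtsplit <- split(mtcars, f = groups)
sapply(mtsplit, nrow)   # sizes differ by at most one row
```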

Simulating a list of data frames

Maybe you’re simulating data, something like this:

my.sim.data = data.frame(x = rnorm(50), y = rnorm(50))

But who does only one simulation? You want to do this 100 times, 1000 times, more! But you don’t want 10,000 data frames in your workspace. Use replicate and put them in a list:

sim_list = replicate(n = 10,
                     expr = {data.frame(x = rnorm(50), y = rnorm(50))},
                     simplify = FALSE)  # spell out FALSE; F is an ordinary variable that can be reassigned

In this case especially, you should also consider whether you really need separate data frames, or whether a single data frame with a “group” column would work just as well. Using data.table, dplyr, or plyr, it’s quite easy to do things “by group” to a data frame.
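A sketch of the single-data-frame alternative. To keep it dependency-free, it uses base R’s aggregate for the “by group” step rather than the packages mentioned above. The column name sim and the sizes are illustrative choices.

```r
set.seed(1)  # for reproducibility of the example only

n_sims <- 10
n_obs  <- 50

# One long data frame; the "sim" column records which simulation a row belongs to
sim_df <- data.frame(
  sim = rep(seq_len(n_sims), each = n_obs),
  x   = rnorm(n_sims * n_obs),
  y   = rnorm(n_sims * n_obs)
)

# A "by group" summary in base R: mean of x and y within each simulation
sim_means <- aggregate(cbind(x, y) ~ sim, data = sim_df, FUN = mean)
head(sim_means)
```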

I didn’t put my data in a list 🙁 I will next time, but what can I do now?

If you have data frames named in a pattern, e.g., df1, df2, df3, and you want them in a list, you can get them if you can write a regular expression to match the names. Something like

df_list = lapply(ls(pattern = "df[0-9]"), get)

Start by double-checking just the ls part to make sure you’re getting the right variables. And next time, use lists from the start.
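A variant worth considering is mget, which returns a list that is already named after the variables. The sketch below uses three stand-in data frames and a stricter, anchored pattern. The explicit env variable is an addition here so the example behaves the same whether it runs at the top level or inside a function.

```r
# Stand-ins for data frames that already exist in your workspace
df1 <- data.frame(a = 1)
df2 <- data.frame(a = 2)
df3 <- data.frame(a = 3)

# The environment to search; at the top level of a script this is the global environment
env <- environment()

# Step 1: check which names match. The anchors ^ and $ keep out names like "mydf1"
df_names <- ls(envir = env, pattern = "^df[0-9]+$")
df_names

# Step 2: collect the objects; mget() returns a list named after the variables
df_list <- mget(df_names, envir = env)
names(df_list)
```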

Why put the data in a list?

Put similar data in lists because you probably want to do similar things to each data.frame, and functions like lapply, sapply, do.call, and the plyr l*ply functions make it really easy to do that. Examples of people easily doing things with lists are all over Stack Overflow.
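For instance, here is a small sketch of doing the same thing to every data frame in a list. The model mpg ~ wt is just an arbitrary example.

```r
mt_list <- split(mtcars, f = mtcars$cyl)

# lapply returns a list: here, one linear model per data frame
models <- lapply(mt_list, function(d) lm(mpg ~ wt, data = d))

# sapply simplifies to a vector where it can: the wt slope from each model
slopes <- sapply(models, function(m) coef(m)[["wt"]])
slopes
```

The list names carry through, so slopes is labelled by the number of cylinders.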

A couple of common tasks involve combining them. If you want to stack them on top of each other, you could use rbind for a pair of them; for a whole list, use do.call with rbind or (for speed) dplyr::bind_rows to put them together. (Similarly, use cbind or dplyr::bind_cols for columns.) To merge (join) a list of data frames, you can see these answers.
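A base R sketch of both tasks. The merge step uses Reduce with merge, which is one common approach and not necessarily the one the linked answers recommend. The to_merge data frames and the id key are made up for illustration.

```r
mt_list <- split(mtcars, f = mtcars$cyl)

# Stack all pieces back into one data frame
stacked <- do.call(rbind, mt_list)
nrow(stacked)   # same number of rows as mtcars

# Merge (join) a list of data frames on a shared key, one pair at a time
to_merge <- list(
  data.frame(id = 1:3, a = c(10, 20, 30)),
  data.frame(id = 1:3, b = c("x", "y", "z")),
  data.frame(id = 1:3, c = c(TRUE, FALSE, TRUE))
)
merged <- Reduce(function(x, y) merge(x, y, by = "id"), to_merge)
merged
```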

Think of scalability. If you really only need three variables, it’s fine to use d1, d2, d3. But then if it turns out you really need 6, that’s a lot more typing. And next time, when you need 10 or 20, you find yourself copying and pasting lines of code, maybe using find/replace to change d14 to d15, and you’re thinking this isn’t how programming should be. If you use a list, the difference between 3 cases, 30 cases, and 300 cases is at most one line of code—no change at all if your number of cases is automatically detected by, e.g., how many .csv files are in your directory.

Even if you use a lowly for loop, it’s much easier to loop over the elements of a list than it is to construct variable names with paste and access the objects with get.
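To make the comparison concrete, here is the same loop written both ways on made-up data:

```r
d <- list(data.frame(x = 1:3), data.frame(x = 4:6), data.frame(x = 7:9))

# Looping over a list: the data frame is right there
totals <- numeric(length(d))
for (i in seq_along(d)) {
  totals[i] <- sum(d[[i]]$x)
}
totals

# The separate-variable version needs paste0() and get() to do the same thing
d1 <- d[[1]]; d2 <- d[[2]]; d3 <- d[[3]]
totals_get <- numeric(3)            # the count is hard-coded
for (i in 1:3) {
  totals_get[i] <- sum(get(paste0("d", i))$x)
}
```

The list version adapts automatically if d grows, while the get version has to be edited by hand.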

You can name the elements of a list, in case you want to use something other than numeric indices to access your data frames (and you can use both; this isn’t an XOR choice).
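A short illustration, with hypothetical element names control and treatment:

```r
d <- list(
  control   = data.frame(x = 1:3),
  treatment = data.frame(x = 4:6)
)

d[["treatment"]]   # access by name
d$treatment        # shorthand for the same thing
d[[2]]             # access by position still works
```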

Overall, using lists will lead you to write cleaner, easier-to-read code, which will result in fewer bugs and less confusion.