I’ve covered a few topics in the past including the plyr package, which is kind of like “GROUP BY” for R, and the merge function for merging datasets. I only recently found the sqldf package for R, and it’s already one of the most useful packages I’ve ever installed. The main function in the package is sqldf(), which takes a quoted string as an argument. You can treat data frames as tables as if they were in a relational database. You can use some of the finer aspects of SQL like the INNER JOIN or the subquery, which are extremely difficult operations to mimic using standard R programming. While this isn’t an SQL tutorial, try out some of these commands to see what sqldf can do for you. Read more about the sqldf package here.
> # install the package > install.packages("sqldf") > > #load it > library(sqldf) > > # set the random seed > set.seed(42) > > #generate some data > df1 = data.frame(id=1:10,class=rep(c("case","ctrl"),5)) > df2 = data.frame(id=1:10,cov=round(runif(10)*10,1)) > > #look at the data > df1 id class 1 1 case 2 2 ctrl 3 3 case 4 4 ctrl 5 5 case 6 6 ctrl 7 7 case 8 8 ctrl 9 9 case 10 10 ctrl > df2 id cov 1 1 9.1 2 2 9.4 3 3 2.9 4 4 8.3 5 5 6.4 6 6 5.2 7 7 7.4 8 8 1.3 9 9 6.6 10 10 7.1 > > # do an inner join > sqldf("select * from df1 join df2 on df1.id=df2.id") id class id cov 1 1 case 1 9.1 2 2 ctrl 2 9.4 3 3 case 3 2.9 4 4 ctrl 4 8.3 5 5 case 5 6.4 6 6 ctrl 6 5.2 7 7 case 7 7.4 8 8 ctrl 8 1.3 9 9 case 9 6.6 10 10 ctrl 10 7.1 > > # where clauses > sqldf("select * from df1 join df2 on df1.id=df2.id where class='case'") id class id cov 1 1 case 1 9.1 2 3 case 3 2.9 3 5 case 5 6.4 4 7 case 7 7.4 5 9 case 9 6.6 > > # lots of sql fun > sqldf("select df1.id, df2.cov as covariate from df1 join df2 on df1.id=df2.id where class='case' and cov>3 order by cov") id covariate 1 5 6.4 2 9 6.6 3 7 7.4 4 1 9.1