How to join data frames in R (inner, outer, left, right)?

 

Given two data frames

df1 = data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3)))
df2 = data.frame(CustomerId=c(2,4,6),State=c(rep("Alabama",2),rep("Ohio",1)))

> df1
  CustomerId Product
           1 Toaster
           2 Toaster
           3 Toaster
           4   Radio
           5   Radio
           6   Radio

> df2
  CustomerId   State
           2 Alabama
           4 Alabama
           6    Ohio

How can I do database style, i.e., sql style, joins? That is, how do I get:

  • An inner join of df1 and df1
  • An outer join of df1 and df2
  • A left outer join of df1 and df2
  • A right outer join of df1 and df2

P.S. IKT-JARQ (I Know This – Just Adding R Questions)

Extra credit:

How can I do a sql style select statement?

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by="CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x=TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y=TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass “CustomerId” to R as the matching variable. I think it’s almost always best to explicitly state the identifiers on which you want to merge; it’s safer if the input data.frames change unexpectedly and easier to read later on.

I would recommend checking out Gabor Grothendieck’s sqldf package, which allows you to express these operations in SQL.

library(sqldf)

## inner join
df3 <- sqldf("SELECT CustomerId, Product, State 
       FROM df1 JOIN df2 USING(CustomerID)")

## left join (substitute 'right' for right join)
df4 <- sqldf("SELECT CustomerId, Product, State 
       FROM df1 LEFT JOIN df2 USING(CustomerID)")

I find the SQL syntax to be simpler and more natural than its R equivalent (but this may just reflect my RDBMS bias).

See Gabor’s sqldf Google Code page for more information