Pivot Tables in R with dplyr

[Total: 4 Average: 4/5]

The prolific Hadley Wickham did it again! This time he came up, together with Romain Francois, with an amazing library for data manipulation that turns the task of making Pivot Tables in R a real breeze. Enter dplyr. Along the lines of ggplot2, also from the same main author, dplyr implements a grammar of data manipulation and also introduces a new syntax using “pipe” operators.

What is a “pipe” operator? Well, the best way to learn about it is to use it. Ready to experiment with dplyr? Let’s get started.

1. Get the sample data file

We will be using the same Sales data file from my previous posts Pivot Tables in R with melt and cast and Pivot Tables in R with aggregate. Please refer to those posts for detailed instructions on how to download and prepare the data file for use in R.

2. Load the data file into R

This can be done simply with the command:

1	data <– read.csv(“sales_data.csv”, header=TRUE)

Here the first 6 rows from the resulting data frame:

> head(data)

Country Salesperson Order.Date OrderID Order.Amount

1 UK Suyama 37812 10249 1863.40

2 USA Peacock 37813 10252 3597.90

3 USA Peacock 37814 10250 1552.60

4 UK Dodsworth 37817 10255 2490.50

5 USA Leverling 37817 10251 654.06

6 UK Buchanan 37818 10248 440.00

3. Install and load the dplyr library

dplyr is available directly from CRAN (the latest version is 0.2 at the time of this writing). It can be installed and made available to R with the following commands:

1 2	> install.packages(“dplyr”) > library(dplyr)

After having loaded the dplyr library, you may get some warnings that some objects are masked by other packages. This is ok and it is not to worry about at this time.

4. Getting familiar with dplyr

dplyr comes with a set of functions that can be used to perform the most common manipulation on data. They all work on data frames and table data frames, a new “smarter” version of data frames supported by dplyr.

You can convert a data frame to a table data frame through the tbl_df() function:

data_tdf <– tbl_df(data)

> data_tdf

Source: local data frame [799 x 5]

Country Salesperson Order.Date OrderID Order.Amount

1 UK Suyama 37812 10249 1863.40

2 USA Peacock 37813 10252 3597.90

3 USA Peacock 37814 10250 1552.60

4 UK Dodsworth 37817 10255 2490.50

5 USA Leverling 37817 10251 654.06

6 UK Buchanan 37818 10248 440.00

7 USA Leverling 37818 10253 1444.80

8 USA Leverling 37819 10256 517.80

9 USA Peacock 37824 10257 1119.90

10 UK Buchanan 37825 10254 556.62

.. ... ... ... ... ...

While converting to a table data frame is strictly not necessary to use dplyr, the new format provides a smarter way to display large data sets with many variables (columns). Since our data set is pretty limited, we will keep working with our data in a data frame format. Note however that the output of most dplyr functions will be a table data frame, then converted back transparently to a data frame.

5. Data manipulation verbs

dplyr introduces five data manipulation verbs, namely filter(), arrange(), select(), mutate() and summarise(). When it comes to the task of producing Pivot Tables, summarise() is our working horse.

Note to my american friends: you can use summarize() in place of summarise() and it will work as well!

summarise() takes multiple rows of a data frame and summarises (collapses) them into a single one by applying a function, for examplesum() to sum their values or mean() to calculate the average.

summarise() alone is not very useful though because it does not allow any grouping of the results before applying the summarizing function to the data. Since Pivot Tables are obtained by first grouping the rows according to the value of a variable (column) and then applying a summarizing function to each group, we need a way to group rows in dplyr first. Enter group_by().

6. Group_by()

group_by() enables data manipulation verbs to be applied to each subgroup of data, bringing then back the result of each group in a single data frame.

Back to our sample data, we want to obtain the total amount each Sales Person has sold. Using dplyr, first we group the data by Salesperson with group_by(), then apply summarise() to each group to find the total sum.

Since group_by() should be called first and the results passed to summarise(), we end up with the following fully working but quite convoluted syntax:

> summarise(group_by(data, Salesperson), Order.Amount=sum(Order.Amount))

Source: local data frame [9 x 2]

Salesperson Order.Amount

1 Buchanan 68792.25

2 Callahan 123032.67

3 Davolio 182500.09

4 Dodsworth 75048.04

5 Fuller 162503.78

6 King 116962.99

7 Leverling 201196.27

8 Peacock 225763.68

9 Suyama 72527.63

While the results are exactly as expected (please compare with the previous articles), the nested functions we had to use are pretty hard to read. Not surprisingly, the authors of dplyr have come up with a brilliant idea to simplify this complexity and make chained data operation extremely easy.

7. The pipe operator

What we are doing with the previous function calls is basically a chain of operations like this:

take data
group data by Salesperson using group_by()
take the result from the previous step and calculate the total for each group (aka Salesperson) it using summarise()

In other words, data goes into group_by() and the result of group_by() goes into summarise() producing the final pivot table.

The “pipe” operator introduced by dplyr does exactly this. It sends a piece of data as input to a function and then allows the output from the function to go into another function and so on. Using the pipe operator, we can produce the exact same pivot tables with the following much more readable syntax:

> data %>% group_by(Salesperson) %>% summarise(Order.Amount=sum(Order.Amount))

Source: local data frame [9 x 2]

Salesperson Order.Amount

1 Buchanan 68792.25

2 Callahan 123032.67

3 Davolio 182500.09

4 Dodsworth 75048.04

5 Fuller 162503.78

6 King 116962.99

7 Leverling 201196.27

8 Peacock 225763.68

9 Suyama 72527.63

Where %>% is the pipe operator. Basically we have taken our data frame data, we have sent it to group_by() telling it to group by Salesperson, then we have taken the output from group_by() and sent it to summarise() telling it to summarise each group by calculating the total of Order. Amount within each group. Isn’t it great?

No unnecessary repetition of data, no need to use temporary variable to store the results, no need to nest functions. Everything happens automatically behind the scenes and is amazingly fast on large data sets thanks to the use of optimized C/C++ code. Great job Handy and Romain!!!

8. Filtering

How about doing some filtering too? Say that, like we did in a previous article, we want to product a Pivot Tables only for the Sales person in USA. We can achieve this easily with dplyr using the filter() verb. We filter the data before grouping and summarising them. Here is the syntax using again the pipe operator:

> data %>% filter(Country==“USA”) %>% group_by(Salesperson) %>% summarise(Order.Amount=sum(Order.Amount))

Source: local data frame [5 x 2]

Salesperson Order.Amount

1 Callahan 123032.7

2 Davolio 182500.1

3 Fuller 162503.8

4 Leverling 201196.3

5 Peacock 225763.7

Fast, clean and easy to read!

9. Unleashing the power of dplyr

dplyr is particularly powerful also because it also allows multiple operations to be performed at once on the same data.

For the sake of this example, let’s add to our data set a fictitious Quantity column, representing the quantity sold for each order. We do this by generating a normally distributed Unit Sales Price with mean = 10 and standard deviation = 1, then calculating the Quantity as Order.Amount / Unit.Sales.Price.

Even if calculating the Quantity requires the Unit Sales Price to be defined first, with dplyr we can do everything at once, that is in one single pass, using mutate(). Here is the syntax and the outcome:

> data <– data %>% mutate(Unit.Sales.Price = rnorm(n=n(), mean=10, sd=1), Quantity = Order.Amount / Unit.Sales.Price)

> head(data)

Country Salesperson Order.Date OrderID Order.Amount Quantity

1 UK Suyama 37812 10249 1863.40 201.77949

2 USA Peacock 37813 10252 3597.90 307.72409

3 USA Peacock 37814 10250 1552.60 173.32053

4 UK Dodsworth 37817 10255 2490.50 266.73936

5 USA Leverling 37817 10251 654.06 67.22134

6 UK Buchanan 37818 10248 440.00 40.88416

Unit.Sales.Price

1 9.234834

2 11.691967

3 8.957970

4 9.336830

5 9.729946

6 10.762114

Incidentally, note that the visualization of our modified data set wraps-around. Not very user friendly. Converting it to a table data frame makes its visualization nicer:

> tbl_df(data)

Source: local data frame [799 x 7]

Country Salesperson Order.Date OrderID Order.Amount Quantity

1 UK Suyama 37812 10249 1863.40 201.77949

2 USA Peacock 37813 10252 3597.90 307.72409

3 USA Peacock 37814 10250 1552.60 173.32053

4 UK Dodsworth 37817 10255 2490.50 266.73936

5 USA Leverling 37817 10251 654.06 67.22134

6 UK Buchanan 37818 10248 440.00 40.88416

7 USA Leverling 37818 10253 1444.80 142.07045

8 USA Leverling 37819 10256 517.80 56.71080

9 USA Peacock 37824 10257 1119.90 115.06074

10 UK Buchanan 37825 10254 556.62 60.60626

.. ... ... ... ... ... ...

Variables not shown: Unit.Sales.Price (dbl)

Not only the new class table data frame shows the dimensions of the data frame as [799 x 7], but it prints nicely without wrapping around and specifying that one variable, Unite.Sales.Price, was not displayed.

Ok, now that we have our Quantities in place, we want to produce a Pivot Tables that has, for each sales person, the total amount sold and the average sold quantity. This is very easy too to achieve with dplyr. Here is the syntax:

> data %>% group_by(Salesperson) %>% summarise(Total.Order.Amount=sum(Order.Amount), Average.Quantity=mean(Quantity))

Source: local data frame [9 x 3]

Salesperson Total.Order.Amount Average.Quantity

1 Buchanan 68792.25 167.7801

2 Callahan 123032.67 125.1866

3 Davolio 182500.09 156.8355

4 Dodsworth 75048.04 188.9096

5 Fuller 162503.78 176.6365

6 King 116962.99 183.9184

7 Leverling 201196.27 159.4946

8 Peacock 225763.68 150.8685

9 Suyama 72527.63 112.2168

In fact summarise() can process multiple variables at once, each one with a different summarisation function, or apply different summarisation functions to the same variable.

Say, for example, we want to calculate for each sales person the total amount sold AND the maximum order size. This is very simple with dplyr compared to Excel, where you can only calculate one summary per column (unless you duplicate a column):

> data %>% group_by(Salesperson) %>% summarise(Total.Order.Amount=sum(Order.Amount), Max.Order.Amount=max(Order.Amount))

Source: local data frame [9 x 3]

Salesperson Total.Order.Amount Max.Order.Amount

1 Buchanan 68792.25 9210.90

2 Callahan 123032.67 4825.00

3 Davolio 182500.09 15810.00

4 Dodsworth 75048.04 11380.00

5 Fuller 162503.78 16387.50

6 King 116962.99 12615.05

7 Leverling 201196.27 10952.84

8 Peacock 225763.68 11188.40

9 Suyama 72527.63 4707.54

Well done! This is it for today. I plan to look again at dplyr and its handy pipe syntax in other upcoming posts. As always, your comments and remarks are more than welcome.