Reshaping by pivoting DataFrame objects
Data is often stored in CSV files or databases in so-called “stacked” or “record” format:
In [1450]: df Out[1450]: date variable value 0 2000-01-03 00:00:00 A 0.469112 1 2000-01-04 00:00:00 A -0.282863 2 2000-01-05 00:00:00 A -1.509059 3 2000-01-03 00:00:00 B -1.135632 4 2000-01-04 00:00:00 B 1.212112 5 2000-01-05 00:00:00 B -0.173215 6 2000-01-03 00:00:00 C 0.119209 7 2000-01-04 00:00:00 C -1.044236 8 2000-01-05 00:00:00 C -0.861849 9 2000-01-03 00:00:00 D -2.104569 10 2000-01-04 00:00:00 D -0.494929 11 2000-01-05 00:00:00 D 1.071804
For the curious here is how the above DataFrame was created:
import pandas.util.testing as tm; tm.N = 3 def unpivot(frame): N, K = frame.shape data = {'value' : frame.values.ravel('F'), 'variable' : np.asarray(frame.columns).repeat(N), 'date' : np.tile(np.asarray(frame.index), K)} return DataFrame(data, columns=['date', 'variable', 'value']) df = unpivot(tm.makeTimeDataFrame())
To select out everything for variable A we could do:
In [1451]: df[df['variable'] == 'A'] Out[1451]: date variable value 0 2000-01-03 00:00:00 A 0.469112 1 2000-01-04 00:00:00 A -0.282863 2 2000-01-05 00:00:00 A -1.509059
But suppose we wish to do time series operations with the variables. A better representation would be where the columns are the unique variables and anindex of dates identifies individual observations. To reshape the data into this form, use the pivot function:
In [1452]: df.pivot(index='date', columns='variable', values='value') Out[1452]: variable A B C D date 2000-01-03 0.469112 -1.135632 0.119209 -2.104569 2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 2000-01-05 -1.509059 -0.173215 -0.861849 1.071804
If the values argument is omitted, and the input DataFrame has more than one column of values which are not used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical columns whose topmost level indicates the respective value column:
In [1453]: df['value2'] = df['value'] * 2 In [1454]: pivoted = df.pivot('date', 'variable') In [1455]: pivoted Out[1455]: value value2 \ variable A B C D A B C date 2000-01-03 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265 0.238417 2000-01-04 -0.282863 1.212112 -1.044236 -0.494929 -0.565727 2.424224 -2.088472 2000-01-05 -1.509059 -0.173215 -0.861849 1.071804 -3.018117 -0.346429 -1.723698 variable D date 2000-01-03 -4.209138 2000-01-04 -0.989859 2000-01-05 2.143608
You of course can then select subsets from the pivoted DataFrame:
In [1456]: pivoted['value2'] Out[1456]: variable A B C D date 2000-01-03 0.938225 -2.271265 0.238417 -4.209138 2000-01-04 -0.565727 2.424224 -2.088472 -0.989859 2000-01-05 -3.018117 -0.346429 -1.723698 2.143608
Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.
Reshaping by stacking and unstacking
Closely related to the pivot function are the related stack and unstack functions currently available on Series and DataFrame. These functions are designed to work together withMultiIndex objects (see the section on hierarchical indexing). Here are essentially what these functions do:
- stack: “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.
- unstack: inverse operation from stack: “pivot” a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.
The clearest way to explain is by example. Let’s take a prior example data set from the hierarchical indexing section:
In [1457]: tuples = zip(*[['bar', 'bar', 'baz', 'baz', ......: 'foo', 'foo', 'qux', 'qux'], ......: ['one', 'two', 'one', 'two', ......: 'one', 'two', 'one', 'two']]) ......: In [1458]: index = MultiIndex.from_tuples(tuples, names=['first', 'second']) In [1459]: df = DataFrame(randn(8, 2), index=index, columns=['A', 'B']) In [1460]: df2 = df[:4] In [1461]: df2 Out[1461]: A B first second bar one 0.721555 -0.706771 two -1.039575 0.271860 baz one -0.424972 0.567020 two 0.276232 -1.087401
The stack function “compresses” a level in the DataFrame’s columns to produce either:
- A Series, in the case of a simple column Index
- A DataFrame, in the case of a MultiIndex in the columns
If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:
In [1462]: stacked = df2.stack() In [1463]: stacked Out[1463]: first second bar one A 0.721555 B -0.706771 two A -1.039575 B 0.271860 baz one A -0.424972 B 0.567020 two A 0.276232 B -1.087401 dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack is unstack, which by default unstacks the last level:
In [1464]: stacked.unstack() Out[1464]: A B first second bar one 0.721555 -0.706771 two -1.039575 0.271860 baz one -0.424972 0.567020 two 0.276232 -1.087401 In [1465]: stacked.unstack(1) Out[1465]: second one two first bar A 0.721555 -1.039575 B -0.706771 0.271860 baz A -0.424972 0.276232 B 0.567020 -1.087401 In [1466]: stacked.unstack(0) Out[1466]: first bar baz second one A 0.721555 -0.424972 B -0.706771 0.567020 two A -1.039575 0.276232 B 0.271860 -1.087401
If the indexes have names, you can use the level names instead of specifying the level numbers:
In [1467]: stacked.unstack('second') Out[1467]: second one two first bar A 0.721555 -1.039575 B -0.706771 0.271860 baz A -0.424972 0.276232 B 0.567020 -1.087401
You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is as if each level in the list were processed individually.
These functions are intelligent about handling missing data and do not expect each subgroup within the hierarchical index to have the same set of labels. They also can handle the index being unsorted (but you can make it sorted by calling sortlevel, of course). Here is a more complex example:
In [1468]: columns = MultiIndex.from_tuples([('A', 'cat'), ('B', 'dog'), ......: ('B', 'cat'), ('A', 'dog')], ......: names=['exp', 'animal']) ......: In [1469]: df = DataFrame(randn(8, 4), index=index, columns=columns) In [1470]: df2 = df.ix[[0, 1, 2, 4, 5, 7]] In [1471]: df2 Out[1471]: exp A B A animal cat dog cat dog first second bar one -0.370647 -1.157892 -1.344312 0.844885 two 1.075770 -0.109050 1.643563 -1.469388 baz one 0.357021 -0.674600 -1.776904 -0.968914 foo one -0.013960 -0.362543 -0.006154 -0.923061 two 0.895717 0.805244 -1.206412 2.565646 qux two 0.410835 0.813850 0.132003 -0.827317
As mentioned above, stack can be called with a level argument to select which level in the columns to stack:
In [1472]: df2.stack('exp') Out[1472]: animal cat dog first second exp bar one A -0.370647 0.844885 B -1.344312 -1.157892 two A 1.075770 -1.469388 B 1.643563 -0.109050 baz one A 0.357021 -0.968914 B -1.776904 -0.674600 foo one A -0.013960 -0.923061 B -0.006154 -0.362543 two A 0.895717 2.565646 B -1.206412 0.805244 qux two A 0.410835 -0.827317 B 0.132003 0.813850 In [1473]: df2.stack('animal') Out[1473]: exp A B first second animal bar one cat -0.370647 -1.344312 dog 0.844885 -1.157892 two cat 1.075770 1.643563 dog -1.469388 -0.109050 baz one cat 0.357021 -1.776904 dog -0.968914 -0.674600 foo one cat -0.013960 -0.006154 dog -0.923061 -0.362543 two cat 0.895717 -1.206412 dog 2.565646 0.805244 qux two cat 0.410835 0.132003 dog -0.827317 0.813850
Unstacking when the columns are a MultiIndex is also careful about doing the right thing:
In [1474]: df[:3].unstack(0) Out[1474]: exp A B A animal cat dog cat dog first bar baz bar baz bar baz bar baz second one -0.370647 0.357021 -1.157892 -0.6746 -1.344312 -1.776904 0.844885 -0.968914 two 1.075770 NaN -0.109050 NaN 1.643563 NaN -1.469388 NaN In [1475]: df2.unstack(1) Out[1475]: exp A B A animal cat dog cat dog second one two one two one two one two first bar -0.370647 1.075770 -1.157892 -0.109050 -1.344312 1.643563 0.844885 -1.469388 baz 0.357021 NaN -0.674600 NaN -1.776904 NaN -0.968914 NaN foo -0.013960 0.895717 -0.362543 0.805244 -0.006154 -1.206412 -0.923061 2.565646 qux NaN 0.410835 NaN 0.813850 NaN 0.132003 NaN -0.827317
Reshaping by Melt
The melt function found in pandas.core.reshape is useful to massage a DataFrame into a format where one or more columns are identifier variables, while all other columns, considered measured variables, are “pivoted” to the row axis, leaving just two non-identifier columns, “variable” and “value”.
For instance,
In [1476]: cheese = DataFrame({'first' : ['John', 'Mary'], ......: 'last' : ['Doe', 'Bo'], ......: 'height' : [5.5, 6.0], ......: 'weight' : [130, 150]}) ......: In [1477]: cheese Out[1477]: first height last weight 0 John 5.5 Doe 130 1 Mary 6.0 Bo 150 In [1478]: melt(cheese, id_vars=['first', 'last']) Out[1478]: first last variable value 0 John Doe height 5.5 1 Mary Bo height 6.0 2 John Doe weight 130.0 3 Mary Bo weight 150.0
Combining with stats and GroupBy
It should be no shock that combining pivot / stack / unstack with GroupBy and the basic Series and DataFrame statistical functions can produce some very expressive and fast data manipulations.
In [1479]: df Out[1479]: exp A B A animal cat dog cat dog first second bar one -0.370647 -1.157892 -1.344312 0.844885 two 1.075770 -0.109050 1.643563 -1.469388 baz one 0.357021 -0.674600 -1.776904 -0.968914 two -1.294524 0.413738 0.276662 -0.472035 foo one -0.013960 -0.362543 -0.006154 -0.923061 two 0.895717 0.805244 -1.206412 2.565646 qux one 1.431256 1.340309 -1.170299 -0.226169 two 0.410835 0.813850 0.132003 -0.827317 In [1480]: df.stack().mean(1).unstack() Out[1480]: animal cat dog first second bar one -0.857479 -0.156504 two 1.359666 -0.789219 baz one -0.709942 -0.821757 two -0.508931 -0.029148 foo one -0.010057 -0.642802 two -0.155347 1.685445 qux one 0.130479 0.557070 two 0.271419 -0.006733 # same result, another way In [1481]: df.groupby(level=1, axis=1).mean() Out[1481]: animal cat dog first second bar one -0.857479 -0.156504 two 1.359666 -0.789219 baz one -0.709942 -0.821757 two -0.508931 -0.029148 foo one -0.010057 -0.642802 two -0.155347 1.685445 qux one 0.130479 0.557070 two 0.271419 -0.006733 In [1482]: df.stack().groupby(level=1).mean() Out[1482]: exp A B second one 0.016301 -0.644049 two 0.110588 0.346200 In [1483]: df.mean().unstack(0) Out[1483]: exp A B animal cat 0.311433 -0.431481 dog -0.184544 0.133632
Pivot tables and cross-tabulations
The function pandas.pivot_table can be used to create spreadsheet-style pivot tables. It takes a number of arguments
- data: A DataFrame object
- values: a column or a list of columns to aggregate
- rows: list of columns to group by on the table rows
- cols: list of columns to group by on the table columns
- aggfunc: function to use for aggregation, defaulting to numpy.mean
Consider a data set like this:
In [1484]: df = DataFrame({'A' : ['one', 'one', 'two', 'three'] * 6, ......: 'B' : ['A', 'B', 'C'] * 8, ......: 'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 4, ......: 'D' : np.random.randn(24), ......: 'E' : np.random.randn(24)}) ......: In [1485]: df Out[1485]: A B C D E 0 one A foo -0.076467 0.959726 1 one B foo -1.187678 -1.110336 2 two C foo 1.130127 -0.619976 3 three A bar -1.436737 0.149748 4 one B bar -1.413681 -0.732339 5 one C bar 1.607920 0.687738 6 two A foo 1.024180 0.176444 7 three B foo 0.569605 0.403310 8 one C foo 0.875906 -0.154951 9 one A bar -2.211372 0.301624 10 two B bar 0.974466 -2.179861 11 three C bar -2.006747 -1.369849 12 one A foo -0.410001 -0.954208 13 one B foo -0.078638 1.462696 14 two C foo 0.545952 -1.743161 15 three A bar -1.219217 -0.826591 16 one B bar -1.226825 -0.345352 17 one C bar 0.769804 1.314232 18 two A foo -1.281247 0.690579 19 three B foo -0.727707 0.995761 20 one C foo -0.121306 2.396780 21 one A bar -0.097883 0.014871 22 two B bar 0.695775 3.357427 23 three C bar 0.341734 -0.317441
We can produce pivot tables from this data very easily:
In [1486]: pivot_table(df, values='D', rows=['A', 'B'], cols=['C']) Out[1486]: C bar foo A B one A -1.154627 -0.243234 B -1.320253 -0.633158 C 1.188862 0.377300 three A -1.327977 NaN B NaN -0.079051 C -0.832506 NaN two A NaN -0.128534 B 0.835120 NaN C NaN 0.838040 In [1487]: pivot_table(df, values='D', rows=['B'], cols=['A', 'C'], aggfunc=np.sum) Out[1487]: A one three two C bar foo bar foo bar foo B A -2.309255 -0.486468 -2.655954 NaN NaN -0.257067 B -2.640506 -1.266315 NaN -0.158102 1.670241 NaN C 2.377724 0.754600 -1.665013 NaN NaN 1.676079 In [1488]: pivot_table(df, values=['D','E'], rows=['B'], cols=['A', 'C'], aggfunc=np.sum) Out[1488]: D E \ A one three two one C bar foo bar foo bar foo bar foo B A -2.309255 -0.486468 -2.655954 NaN NaN -0.257067 0.316495 0.005518 B -2.640506 -1.266315 NaN -0.158102 1.670241 NaN -1.077692 0.352360 C 2.377724 0.754600 -1.665013 NaN NaN 1.676079 2.001971 2.241830 A three two C bar foo bar foo B A -0.676843 NaN NaN 0.867024 B NaN 1.39907 1.177566 NaN C -1.687290 NaN NaN -2.363137
The result object is a DataFrame having potentially hierarchical indexes on the rows and columns. If the values column name is not given, the pivot table will include all of the data that can be aggregated in an additional level of hierarchy in the columns:
In [1489]: pivot_table(df, rows=['A', 'B'], cols=['C']) Out[1489]: D E C bar foo bar foo A B one A -1.154627 -0.243234 0.158248 0.002759 B -1.320253 -0.633158 -0.538846 0.176180 C 1.188862 0.377300 1.000985 1.120915 three A -1.327977 NaN -0.338421 NaN B NaN -0.079051 NaN 0.699535 C -0.832506 NaN -0.843645 NaN two A NaN -0.128534 NaN 0.433512 B 0.835120 NaN 0.588783 NaN C NaN 0.838040 NaN -1.181568
You can render a nice output of the table omitting the missing values by calling to_string if you wish:
In [1490]: table = pivot_table(df, rows=['A', 'B'], cols=['C']) In [1491]: print table.to_string(na_rep='') D E C bar foo bar foo A B one A -1.154627 -0.243234 0.158248 0.002759 B -1.320253 -0.633158 -0.538846 0.176180 C 1.188862 0.377300 1.000985 1.120915 three A -1.327977 -0.338421 B -0.079051 0.699535 C -0.832506 -0.843645 two A -0.128534 0.433512 B 0.835120 0.588783 C 0.838040 -1.181568
Note that pivot_table is also available as an instance method on DataFrame.
Cross tabulations
Use the crosstab function to compute a cross-tabulation of two (or more) factors. By defaultcrosstab computes a frequency table of the factors unless an array of values and an aggregation function are passed.
It takes a number of arguments
- rows: array-like, values to group by in the rows
- cols: array-like, values to group by in the columns
- values: array-like, optional, array of values to aggregate according to the factors
- aggfunc: function, optional, If no values array is passed, computes a frequency table
- rownames: sequence, default None, must match number of row arrays passed
- colnames: sequence, default None, if passed, must match number of column arrays passed
- margins: boolean, default False, Add row/column margins (subtotals)
Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified
For example:
In [1492]: foo, bar, dull, shiny, one, two = 'foo', 'bar', 'dull', 'shiny', 'one', 'two' In [1493]: a = np.array([foo, foo, bar, bar, foo, foo], dtype=object) In [1494]: b = np.array([one, one, two, one, two, one], dtype=object) In [1495]: c = np.array([dull, dull, shiny, dull, dull, shiny], dtype=object) In [1496]: crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c']) Out[1496]: b one two c dull shiny dull shiny a bar 1 0 0 1 foo 2 1 1 0
Adding margins (partial aggregates)
If you pass margins=True to pivot_table, special All columns and rows will be added with partial group aggregates across the categories on the rows and columns:
In [1497]: df.pivot_table(rows=['A', 'B'], cols='C', margins=True, aggfunc=np.std) Out[1497]: D E C bar foo All bar foo All A B one A 1.494463 0.235844 1.019752 0.202765 1.353355 0.795165 B 0.132127 0.784210 0.606779 0.273641 1.819408 1.139647 C 0.592638 0.705136 0.708771 0.442998 1.804346 1.074910 three A 0.153810 NaN 0.153810 0.690376 NaN 0.690376 B NaN 0.917338 0.917338 NaN 0.418926 0.418926 C 1.660627 NaN 1.660627 0.744165 NaN 0.744165 two A NaN 1.630183 1.630183 NaN 0.363548 0.363548 B 0.197065 NaN 0.197065 3.915454 NaN 3.915454 C NaN 0.413074 0.413074 NaN 0.794212 0.794212 All 1.294620 0.824989 1.064129 1.403041 1.188419 1.248988
Tiling
The cut function computes groupings for the values of the input array and is often used to transform continuous variables to discrete or categorical variables:
In [1498]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60]) In [1499]: cut(ages, bins=3) Out[1499]: Categorical: array([(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60], (43.333, 60]], dtype=object) Levels (3): Index([(9.95, 26.667], (26.667, 43.333], (43.333, 60]], dtype=object)
If the bins keyword is an integer, then equal-width bins are formed. Alternatively we can specify custom bin-edges:
In [1500]: cut(ages, bins=[0, 18, 35, 70]) Out[1500]: Categorical: array([(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]], dtype=object) Levels (3): Index([(0, 18], (18, 35], (35, 70]], dtype=object)