Chapter 4 Data Shaping using Reshape(2) and Stats Packages

Author: Aira Contreras

4.1 Introduction

Research data sets can be overwrought with information that may not be of interest to the researcher; on one hand it could be that the researcher is using data sets from experiments or studies that were not conducted directly by them and therefore needs to carefully choose variables, or perhaps there were too many variables collected during the study and a particular analysis only requires a select few. It is also possible that the analysis software is only able to process the data if it is fed into the input in a certain manner. Whatever the limitations are, data shaping is a key step in data analysis and there are many tools available to assist with this task. R programmers have acknowledged the need for CRAN approved packages that aid with this endeavor and since 2003 have created packages such as Reshape to help others. This section will focus on the functionalities of the Reshape2 package distributed in 2012 by Hadley Wickham and the Stats 2001 by the R Core Team.

4.2 Different Data Inputs

Data input formats, for the purposes of this exercise, exist mainly in 2 forms: wide format or long format. It is important to understand what format the data is in prior to re-shaping or performing an analysis. This will allow you to correctly reshape the data prior to analysis or input into the software correctly. The next subsections describe the differences between the 2 formats and provide a visual example of the formats.

4.2.1 Long Format

Data is described to be in long format when each row represents 1 time point or variable. For example, if an experiment collected information on state population, state mortality, and state income then the output would be the state (e.g. California) and the variable (e.g. Income) in the first row, the state (e.g. California) and and the variable (e.g. Mortality) in the next row, and so forth until a full data output is created.

#install.packages("reshape2")
#install.packages("dplyr")
library(reshape2)
library(dplyr)

SubjectID<-1:50
df<- data.frame(SubjectID,state.x77)
example<-melt(df,id="SubjectID")
ex<-example%>%filter(SubjectID==1)
ex
##   SubjectID   variable    value
## 1         1 Population  3615.00
## 2         1     Income  3624.00
## 3         1 Illiteracy     2.10
## 4         1   Life.Exp    69.05
## 5         1     Murder    15.10
## 6         1    HS.Grad    41.30
## 7         1      Frost    20.00
## 8         1       Area 50708.00

This example demonstrates that all of the rows, which measure a distinct variable with its corresponding value, belong to 1 subject (i.e. 1). This format is useful for analysis of repeated measures experiments in which the same subject may be asked about different benchmarks during the course of the experiment.

4.2.2 Wide Format

Data is described to be in wide format when each row represents all of the individual variable responses for 1 subjected, separated by column. For example, if an experiment collected information on subjected responses at different intervals then the output would be the subject (e.g. 1) and the responses at each interval in separate columns (e.g. inter1, inter2, inter3, etc.) in the first row, the next subject (e.g. 2) and the responses at each interval in separate columns (e.g. inter1, inter2, etc.) in the next row and so fhorth until a full data output is created.

wide <- reshape(Indometh, v.names = "conc", idvar = "Subject",
                timevar = "time", direction = "wide")
wide[,1:8]
##    Subject conc.0.25 conc.0.5 conc.0.75 conc.1 conc.1.25 conc.2 conc.3
## 1        1      1.50     0.94      0.78   0.48      0.37   0.19   0.12
## 12       2      2.03     1.63      0.71   0.70      0.64   0.36   0.32
## 23       3      2.72     1.49      1.16   0.80      0.80   0.39   0.22
## 34       4      1.85     1.39      1.02   0.89      0.59   0.40   0.16
## 45       5      2.05     1.04      0.81   0.39      0.30   0.23   0.13
## 56       6      2.31     1.44      1.03   0.84      0.64   0.42   0.24

This example demonstrates that all of the rows show a distinct subject with the distinct variable measures in each individual column.

4.3 Reshape2 Package Functions

The Reshape2 package contains various functions that helps users place their data into either the long format or the wide format depending on what is needed by the software being used for analysis. Though there are many functions availalble and it is encouraged that all functions are reviewed (you can do so here: https://cran.r-project.org/web/packages/reshape2/reshape2.pdf), the primary focus of this section will be on the use of the melt() and cast() functions. For proper use and data reshaping, the functions are ideally used in tandem, though as shown above, they can be used separately (i.e. only the melt function used for the separation of some data from others).

4.3.1 melt()

As it implies, the melt function strips away variables and/or structures from existing data sets and prepares the data for the cast function (or for analysis). Depending on the type of data that you may have (array, data frame, matrix), melt() should be able to accommodate the breakdown or reshaping of the data. A check for the use of the function called melt_check() is purported to exist, but the documentation on how to use this function is bare and I currently cannot make it run.

4.3.1.1 melt()/melt.default()/melt.data.frame()

The melt function requires the following inputs at minimum to work melt(data,Subject Name Code Column, Variable Column Names in List Form). As is, the function does not remove any missing values (na.rm = FALSE), so if it is necessary to remove missing values from the data set, this should be done prior to the use of melt() or na.rm should be set as equal to TRUE.

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)

head(aqm)
##   month day variable value
## 1     5   1    ozone    41
## 2     5   2    ozone    36
## 3     5   3    ozone    12
## 4     5   4    ozone    18
## 6     5   6    ozone    28
## 7     5   7    ozone    23

The air quality data set contained 4 measured variables (Ozone, Solar.R, Wind, and Temp) over 2 time metrics (Month, Day). Using melt(), the data set was reshaped in aqm to show Month and Day as the independent variables, with the dependent variables Ozone, Solar R., Wind, and Temp shown as the dependent variables with their corresponding measurments. Note that because we did not specify the measured variable, all of the variables not designated as independent variables were designated as measured variables. In order to select specific measured/dependent variables, it is necessary to specify them as shown in the example below.

aqm2 <- melt(airquality, id=c("month", "day"),measure.vars = "ozone", na.rm=TRUE)

head(aqm2)
##   month day variable value
## 1     5   1    ozone    41
## 2     5   2    ozone    36
## 3     5   3    ozone    12
## 4     5   4    ozone    18
## 6     5   6    ozone    28
## 7     5   7    ozone    23

Here, again, the columns are the same as the aqm data set, however, because we specify ozone as the measured variable, only ozone will appear in the data set aqm2.

4.3.1.2 melt.array()

If the data you are working with is presented in an array as opposed to a data frame or matrix (meaning that there are a fixed number of values of a sinlge type) then you can use the melt.array() feature.

a <- array(c(1:23, NA), c(2,3,4))

a[,,1:2]
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

The above is an example of a random array that was created with 3 dimensions (row, column, depth).

melt(a)[1:10,]
##    Var1 Var2 Var3 value
## 1     1    1    1     1
## 2     2    1    1     2
## 3     1    2    1     3
## 4     2    2    1     4
## 5     1    3    1     5
## 6     2    3    1     6
## 7     1    1    2     7
## 8     2    1    2     8
## 9     1    2    2     9
## 10    2    2    2    10

Using the function melt, without any additional indicators will break the data down and out of the current structures that hold them. The first 3 columns represent the location of the data and the fourth column indicates the value in that given location. Based on the data output provided, it can be very easy to get lost or make mistakes. Additional data commands that name the columns may be helpful to keep track of the data as shown below.

melt(a, varnames = c("X","Y","Z"))[1:10,]
##    X Y Z value
## 1  1 1 1     1
## 2  2 1 1     2
## 3  1 2 1     3
## 4  2 2 1     4
## 5  1 3 1     5
## 6  2 3 1     6
## 7  1 1 2     7
## 8  2 1 2     8
## 9  1 2 2     9
## 10 2 2 2    10

4.3.2 cast()

The cast() function is what allows us to shape the data after we have melted it down into the ideal formats that are needed. The options exist to run dcast() or acast(). As you can imagine, the dcast() function best serves those that are intending to have a data frame or matrix output.

4.3.2.1 acast()

According to the README, the cast function provides outputs in the following format: x_var+x_2y_var+y_2z_var. Though it may be an easy format to understand, it was not sufficient of an explanation for some. As such, please refer to the example below.

acast(aqm, day ~ month ~ variable)[1:10,,1:2]
## , , ozone
## 
##     5  6   7   8  9
## 1  41 NA 135  39 96
## 2  36 NA  49   9 78
## 3  12 NA  32  16 73
## 4  18 NA  NA  78 91
## 5  NA NA  64  35 47
## 6  28 NA  40  66 32
## 7  23 29  77 122 20
## 8  19 NA  97  89 23
## 9   8 71  97 110 21
## 10 NA 39  85  NA 24
## 
## , , solar.r
## 
##      5   6   7   8   9
## 1  190 286 269  83 167
## 2  118 287 248  24 197
## 3  149 242 236  77 183
## 4  313 186 101  NA 189
## 5   NA 220 175  NA  95
## 6   NA 264 314  NA  92
## 7  299 127 276 255 252
## 8   99 273 267 229 220
## 9   19 291 272 207 230
## 10 194 323 175 222 259

As demonstrated from the data output above, the cast() function shaped the data such that the months are the column headers, divided across days in the rows, and then by variable as the overall “topic”.

It is also possible to cast the data in such a way that the averages are obtained.

acast(aqm, month ~ variable, mean, margins = TRUE)
##          ozone  solar.r      wind     temp    (all)
## 5     23.61538 181.2963 11.622581 65.54839 68.70696
## 6     29.44444 190.1667 10.266667 79.10000 87.38384
## 7     59.11538 216.4839  8.941935 83.90323 93.49748
## 8     59.96154 171.8571  8.793548 83.96774 79.71207
## 9     31.44828 167.4333 10.180000 76.90000 71.82689
## (all) 42.12931 185.9315  9.957516 77.88235 80.05722

In the above example, the data once again separated by what was indicated (e.g. month and variable). We have added however, that the mean of these variables for each month be calculated and additionally, by adding the margins = TRUE command, we have asked for the means of each row to be included.

Overall, the takeaway for how to use cast should be as follows: acast(melted data set, rowvariablenamecolumnvariablenameOPTIONALvariableinformationisseparatedby,…ANYOTHERCOMMANDS,MARGINS).

4.3.2.2 dcast()

Similar to acast(), dcast() reforms the data, however, it places it into a data fram or matrix format.

dcast(aqm, month ~ variable, mean, margins = c("month", "variable"))
##   month    ozone  solar.r      wind     temp    (all)
## 1     5 23.61538 181.2963 11.622581 65.54839 68.70696
## 2     6 29.44444 190.1667 10.266667 79.10000 87.38384
## 3     7 59.11538 216.4839  8.941935 83.90323 93.49748
## 4     8 59.96154 171.8571  8.793548 83.96774 79.71207
## 5     9 31.44828 167.4333 10.180000 76.90000 71.82689
## 6 (all) 42.12931 185.9315  9.957516 77.88235 80.05722

To use a different data set, the example below takes data from the ChickWeight set and melts it down and casts it as a data frame that displays only the time (row) and variable (column) and the corresponding means.

head(ChickWeight)
## Grouped Data: weight ~ Time | Chick
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
names(ChickWeight) <- tolower(names(ChickWeight))
chick_m <- melt(ChickWeight, id=2:4, na.rm=TRUE)
dcast(chick_m, time ~ variable, mean)
##    time    weight
## 1     0  41.06000
## 2     2  49.22000
## 3     4  59.95918
## 4     6  74.30612
## 5     8  91.24490
## 6    10 107.83673
## 7    12 129.24490
## 8    14 143.81250
## 9    16 168.08511
## 10   18 190.19149
## 11   20 209.71739
## 12   21 218.68889

4.4 Stats Package Functions

The Stats package contains various functions that helps users place their data into either the long format or the wide format depending on what is needed by the software being used for analysis. Though there are many functions availalble and it is encouraged that all functions are reviewed (you can do so here: https://www.rdocumentation.org/packages/stats/versions/3.6.0), the primary focus of this section will be on the use of the reshape() function.

4.4.1 Reshape()

More readily than the melt() and cast() functions, reshape is able to guid the direction of the data as it reshapes and rebuilds the data set as indicated.

The reshape() functions require the basic inputs the melt() function did, with the added input of direction, which indicates whether the dataset should be a long format data set or a wide format dataset.

wide <- reshape(Indometh, v.names = "conc", idvar = "Subject",
                timevar = "time", direction = "wide")
wide[,1:5]
##    Subject conc.0.25 conc.0.5 conc.0.75 conc.1
## 1        1      1.50     0.94      0.78   0.48
## 12       2      2.03     1.63      0.71   0.70
## 23       3      2.72     1.49      1.16   0.80
## 34       4      1.85     1.39      1.02   0.89
## 45       5      2.05     1.04      0.81   0.39
## 56       6      2.31     1.44      1.03   0.84
state.x77 <- as.data.frame(state.x77)
long <- reshape(state.x77, idvar = "state", ids = row.names(state.x77),
                times = names(state.x77), timevar = "Characteristic",
                varying = list(names(state.x77)), direction = "long")
longa<-long%>%filter(state=="Alabama")

longa
##   Characteristic Population   state
## 1     Population    3615.00 Alabama
## 2         Income    3624.00 Alabama
## 3     Illiteracy       2.10 Alabama
## 4       Life Exp      69.05 Alabama
## 5         Murder      15.10 Alabama
## 6        HS Grad      41.30 Alabama
## 7          Frost      20.00 Alabama
## 8           Area   50708.00 Alabama

4.5 Summary

Data reshaping is important and there are many tools availalbe to assist users in re-formatting the data sets they have available. In addition to the packages discussed above, other packages have been developed by other R users to facilitate this process; the next generation of ReShape can be found in the tidyr package, and other packages like psych, include tools that help reformat data. Additionally with the use of additional packages like dplyr and forcats, additional power is added to the ability of the user to format and reformat as needed.

4.6 References

Kabacoff, R. I. (n.d.). Qhick-R: Reshape. Retrieved May 03, 2019, from https://www.statmethods.net/management/reshape.html

Wickham, H. (2012). reshape2: Flexibly reshape data: a reboot of the reshape package. R package version, 1(2).

Martin, K. G. (n.d.). The Wide and Long Data Format for Repeated Measures Data. Retrieved May 03, 2019, from https://www.theanalysisfactor.com/wide-and-long-data/

R Core Team (2012). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/