DATA MINING
Desktop Survival Guide by Graham Williams |
|||||
The reshape package was inspired by pivot tables. The package works on homogeneous data only, so your data needs to be all numeric or all character, and not a mixture of the.
The aim is to generate various aggregate summaries of the data. For
example, with the wine dataset we may like to look at the
average values of a number of input variables for each Type. The first
step, using the reshape package, is to melt the
data frame, which expands the non-identifying variables across the
identifying variables:
> wine.molten <- melt(wine, id="Type") > head(wine.molten) Type variable value 1 1 Alcohol 14.23 2 1 Alcohol 13.20 3 1 Alcohol 13.16 4 1 Alcohol 14.37 5 1 Alcohol 13.24 6 1 Alcohol 14.20 > tail(wine.molten) Type variable value 17312 3 Proline 660 17412 3 Proline 740 17512 3 Proline 750 17612 3 Proline 835 17712 3 Proline 840 17812 3 Proline 560 |
> cast(wine.molten, Type ~ variable, mean) Type Alcalinity Alcohol Ash Color Dilution Flavanoids Hue 1 17.03729 13.74475 2.455593 5.528305 3.157797 2.9823729 1.0620339 2 20.23803 12.27873 2.244789 3.086620 2.785352 2.0808451 1.0562817 3 21.41667 13.15375 2.437083 7.396250 1.683542 0.7814583 0.6827083 Magnesium Malic Nonflavanoids Phenols Proanthocyanins Proline 106.3390 2.010678 0.290000 2.840169 1.899322 1115.7119 94.5493 1.932676 0.363662 2.258873 1.630282 519.5070 99.3125 3.333750 0.447500 1.678750 1.153542 629.8958 |
We can also include the column and row totals. We will illustrate this
with a subset of the wine dataset:
> measure <- c("Alcohol", "Malic", "Ash") > wine.molten <- melt(wine, id="Type", measure=measure) > cast(wine.molten, Type ~ variable, mean, margins=c("grand_row", "grand_col")) Type Alcohol Malic Ash . 1 13.74475 2.010678 2.455593 6.070339 2 12.27873 1.932676 2.244789 5.485399 3 13.15375 3.333750 2.437083 6.308194 . 13.00062 2.336348 2.366517 5.901161 |
> aggregate(wine[,-1], list(Type=wine$Type), mean) Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids 1 1 13.74475 2.010678 2.455593 17.03729 106.3390 2.840169 2.9823729 2 2 12.27873 1.932676 2.244789 20.23803 94.5493 2.258873 2.0808451 3 3 13.15375 3.333750 2.437083 21.41667 99.3125 1.678750 0.7814583 Nonflavanoids Proanthocyanins Color Hue Dilution Proline 1 0.290000 1.899322 5.528305 1.0620339 3.157797 1115.7119 2 0.363662 1.630282 3.086620 1.0562817 2.785352 519.5070 3 0.447500 1.153542 7.396250 0.6827083 1.683542 629.8958 |
Another example using reshape.
> dat <- read.table("clipboard", header=TRUE) > dat Q S C 1 1 A 5 2 1 B 10 3 1 C 50 4 1 D 10 5 2 A 20 6 2 E 10 7 2 C 40 8 3 D 5 9 3 F 1 10 3 G 5 11 3 B 75 > res <- reshape(dat, direction = "wide", idvar = "Q", timevar = "S") > res Q C.A C.B C.C C.D C.E C.F C.G 1 1 5 10 50 10 NA NA NA 5 2 20 NA 40 NA 10 NA NA 8 3 NA 75 NA 5 NA 1 5 > res[is.na(res)] <- 0 > names( > res Q C.A C.B C.C C.D C.E C.F C.G 1 1 5 10 50 10 0 0 0 5 2 20 0 40 0 10 0 0 8 3 0 75 0 5 0 1 5 |
Or the same, but using the reshape package:
> library(reshape) > datm <- melt(dat, id=1:2) > cast(datm, Q ~ S) S A B C D E F G Q A B C D E F G 1 5 10 50 10 NA NA NA 2 20 NA 40 NA 10 NA NA 3 NA 75 NA 5 NA 1 5 |
With the basics in hand we can now explore the data in a more graphical fashion, beginning with plots that help understand individual variables (barplot, piechart, and line plots), followed by a number of plots that explore relationships between variables (scatterplot and correlation plot).
Copyright © 2004-2006 Graham.Williams@togaware.com Support further development through the purchase of the PDF version of the book.