Data Mining Survivor: Data_Manipulation

DATA MINING
Desktop Survival Guide
by Graham Williams

Data Frames

A data frame is essentially a list of named vectors of equal length. Unlike a matrix, in which all elements must be of the same data type, the vectors (or columns) of a data frame need not share a type. In this respect a data frame is analogous to a database table: each column has a single data type, but different columns can have different data types.

> age <- c(35, 23, 56, 18)
> gender <- c("m", "m", "f", "f")
> people <- data.frame(Age=age, Gender=gender)
> people
  Age Gender
1  35      m
2  23      m
3  56      f
4  18      f
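To make the contrast with a matrix concrete, the following is a minimal sketch that inspects the column types of the same data and then forces the two vectors into a matrix. The names col_types, m and m_type are introduced here for illustration only, and stringsAsFactors=FALSE is set explicitly as an assumption so that the result does not depend on the R version's default.

```r
age <- c(35, 23, 56, 18)
gender <- c("m", "m", "f", "f")

# stringsAsFactors=FALSE keeps Gender as character regardless of R version.
people <- data.frame(Age=age, Gender=gender, stringsAsFactors=FALSE)

# A data frame is a list underneath, and each column keeps its own type.
is_list <- is.list(people)
col_types <- sapply(people, class)   # Age is "numeric", Gender is "character"

# A matrix holds a single type, so the numbers are coerced to character.
m <- cbind(Age=age, Gender=gender)
m_type <- typeof(m)                  # "character"
```

In the matrix the ages become strings such as "35", so arithmetic on them is no longer possible without converting back.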

The columns of a data frame have names, which can be assigned when the data frame is created, as in the example above. The names can also be changed at any time by assigning to the result of a call to colnames:

> colnames(people)
[1] "Age"    "Gender"
> colnames(people)[2] <- "Sex"
> colnames(people)
[1] "Age" "Sex"
> people
  Age Sex
1  35   m
2  23   m
3  56   f
4  18   f
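Renaming by position relies on knowing where a column sits. As a hedged variation, the sketch below selects the column by its current name instead, which keeps working if the column order changes. The variable same is introduced here for illustration only.

```r
people <- data.frame(Age=c(35, 23, 56, 18), Gender=c("m", "m", "f", "f"))

# Pick the column to rename by matching its current name, not its index.
colnames(people)[colnames(people) == "Gender"] <- "Sex"

# For a data frame, names() reports the same column names as colnames().
same <- identical(names(people), colnames(people))
```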

If the datasets we wish to combine are held in a single list, we can use the do.call function to apply rbind to that list, so that each element of the list becomes one argument to rbind:

j <- list()  # Generate a list of data frames
for (i in letters[1:26])
{
  j[[i]] <- data.frame(rep(i,25),matrix(rnorm(250),nrow=25))
}
j[[1]]
allj <- do.call("rbind", j)  # Combine list of data frames into one.
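A smaller version of the same idea makes the result easier to check by eye. This is only a sketch: the names parts, combined and id are hypothetical, and the seed is fixed purely so that the random values repeat from run to run.

```r
set.seed(42)  # Assumption: fixed seed only for repeatability.

# Build a named list of three small data frames with identical columns.
parts <- list()
for (i in letters[1:3])
{
  parts[[i]] <- data.frame(id=rep(i, 4), matrix(rnorm(8), nrow=4))
}

# Equivalent to rbind(parts$a, parts$b, parts$c), without listing each one.
combined <- do.call("rbind", parts)

dim(combined)             # 12 rows and 3 columns
head(rownames(combined))  # list names prefix the row names: "a.1", "a.2", ...
```

rbind matches data frame columns by name, so this works because every element of the list has the same column names (here id, X1 and X2).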

You can reshape data in a data frame using unstack:

> ds <- data.frame(type=c('x', 'y', 'x', 'x', 'x', 'y', 'y', 'x', 'y', 'y'),
                   value=c(10, 5, 2, 6, 4, 8, 3, 6, 6, 8))
> ds
   type value
1     x    10
2     y     5
3     x     2
4     x     6
5     x     4
6     y     8
7     y     3
8     x     6
9     y     6
10    y     8
> unstack(ds, value ~ type)
   x y
1 10 5
2  2 8
3  6 3
4  4 6
5  6 8
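The reshaping can also be reversed. The sketch below uses stack to turn the unstacked result back into a long layout, and it shows what happens when the groups are not the same size. The names wide, long and uneven are introduced here for illustration only.

```r
ds <- data.frame(type=c('x', 'y', 'x', 'x', 'x', 'y', 'y', 'x', 'y', 'y'),
                 value=c(10, 5, 2, 6, 4, 8, 3, 6, 6, 8))

wide <- unstack(ds, value ~ type)  # One column per type: x and y.

# stack() goes the other way, giving the columns "values" and "ind".
long <- stack(wide)

# With groups of unequal size unstack() cannot build a data frame,
# so it returns a plain list of vectors instead.
uneven <- unstack(ds[-1, ], value ~ type)
```

The row order in long is grouped by type (all x values, then all y values), so it holds the same values as ds but not in the original interleaved order.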

To go a step further and make the values available as variables named after the types, you could use attach:

> attach(unstack(ds, value ~ type))
> x
[1] 10  2  6  4  6
> y
[1] 5 8 3 6 8
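attach places the data frame on the search path, where it stays until detached and can mask other objects named x or y. As a hedged alternative, with evaluates a single expression using the columns as variables and leaves the search path alone. The names wide, x_mean and y_mean below are introduced for illustration only.

```r
ds <- data.frame(type=c('x', 'y', 'x', 'x', 'x', 'y', 'y', 'x', 'y', 'y'),
                 value=c(10, 5, 2, 6, 4, 8, 3, 6, 6, 8))
wide <- unstack(ds, value ~ type)

# Columns x and y are visible only while the expression is evaluated.
x_mean <- with(wide, mean(x))  # 5.6
y_mean <- with(wide, mean(y))  # 6
```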

Subsections

Accessing Columns
