Data Mining Survivor: General_Manipulation

DATA MINING
Desktop Survival Guide
by Graham Williams

Elements

> letters                     # a b c [...] z
> letters[10]                 # "j"
> letters[10:15]              # "j" "k" "l" "m" "n" "o"
> letters[c(1, 2, 4, 8, 16)]  # "a" "b" "d" "h" "p"
> letters[-(10:26)]           # "a" "b" "c" "d" "e" "f" "g" "h" "i"

An operator (or function) can be applied to a vector to return a vector. This is particularly useful for comparison operators, which return a vector of logical (TRUE/FALSE) values that can then be used to select specific elements of a vector:

> letters > "j"                            # FALSE FALSE FALSE [...] TRUE
> letters[letters > "j"]                   # "k" "l" "m" "n" [...] "y" "z"
> letters[letters > "w" | letters < "e"]   # "a" "b" "c" "d" "x" "y" "z"
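
The same logical indexing applies to numeric vectors. The following is a minimal sketch using a made-up vector, ages, which is not part of the original text. It also shows which(), which returns the positions of the TRUE values, and sum(), which counts them.

```r
# Hypothetical data, for illustration only.
ages <- c(23, 41, 17, 65, 34)

# A comparison returns a logical vector of the same length.
adult <- ages >= 18

# Use the logical vector to select the matching elements.
selected <- ages[adult]

# which() gives the positions of the TRUE values.
positions <- which(adult)

# sum() counts the TRUE values, since TRUE is treated as 1.
n_adult <- sum(adult)

print(selected)
print(positions)
print(n_adult)
```

Storing the logical vector in a variable first, as with adult here, avoids repeating the condition when it is needed more than once.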

Here's a useful trick for ignoring the terms where we divide by zero, which would otherwise make the sum infinite (Inf):

> x <- c(0.28, 0.55, 0, 2)
> y <- c(0.53, 1.34, 1.2, 2.07)
> sum(((x-y)^2/x))
[1] Inf
> sum(((x-y)^2/x)[x!=0]) # Exclude the zeros
[1] 1.360392
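
An alternative, sketched here as one possible variation rather than the author's method, is to build the logical mask once and subset both vectors before dividing, so that the division by zero never takes place. The names ok, total and filtered are introduced only for this illustration.

```r
x <- c(0.28, 0.55, 0, 2)
y <- c(0.53, 1.34, 1.2, 2.07)

# Build the logical mask once: TRUE where x is non-zero.
ok <- x != 0

# Subset both vectors first, then divide only the retained elements.
total <- sum((x[ok] - y[ok])^2 / x[ok])

# For comparison: divide everything, then drop the unwanted results.
filtered <- sum(((x - y)^2 / x)[x != 0])

print(total)
print(isTRUE(all.equal(total, filtered)))
```

Both forms give the same sum. Subsetting first may be preferable when a zero numerator is also possible, since 0/0 yields NaN rather than Inf.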

We can also draw a random subset of the rows of a data frame, here 1000 rows sampled without replacement:

> subdataset <- dataset[sample(nrow(dataset), 1000), ]
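
As a minimal sketch of how such a sample might be made reproducible, the example below uses R's built-in iris data set (150 rows) in place of dataset, and an arbitrary seed of 42. Both are assumptions made for illustration. It also shows that negative indexing with the same row numbers yields the complementary subset.

```r
# Fix the random seed so the same rows are drawn on every run.
# The seed value 42 is arbitrary.
set.seed(42)

# iris is a built-in data set with 150 rows, used here as a stand-in.
idx <- sample(nrow(iris), 100)   # 100 distinct row numbers

# The sampled rows, and the remaining rows via negative indexing.
sampled <- iris[idx, ]
remainder <- iris[-idx, ]

print(nrow(sampled))
print(nrow(remainder))
```

Note that sample() without replacement fails if more rows are requested than the data frame contains, so the sample size should not exceed nrow() of the data.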

We can select elements meeting set inclusion conditions. Here we first select the rows of a data frame having particular colours, and then the rows whose colour occurs more than 11 times in the data.

> ds[ds$colour %in% c("green", "blue"),]
> ds[ds$colour %in% names(which(table(ds$colour) > 11)),]
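
The two selections can be traced step by step on a small example. The data frame below is made up for illustration, and the frequency threshold is lowered from 11 to 2 so that the tiny data set still yields a result.

```r
# Hypothetical data frame, for illustration only.
ds <- data.frame(
  id = 1:6,
  colour = c("green", "red", "blue", "red", "green", "red"),
  stringsAsFactors = FALSE
)

# Rows whose colour is one of a given set.
gb <- ds[ds$colour %in% c("green", "blue"), ]

# Count the occurrences of each colour, then keep the names of
# those occurring more than twice (threshold 2 here, 11 in the text).
counts <- table(ds$colour)
frequent <- names(which(counts > 2))

# Rows whose colour is among the frequent ones.
common <- ds[ds$colour %in% frequent, ]

print(gb)
print(frequent)
print(common)
```

Here table() counts each colour, which() keeps the entries passing the threshold, and names() extracts the colour labels, which are then matched against the column with %in%.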

Brought to you by Togaware.