DATA MINING
Desktop Survival Guide by Graham Williams
We might only be interested in the numeric data, so we remove all columns that are not numeric from a dataset. We can use the survey dataset to illustrate this. First load the dataset and have a look at the column names and their types. We use the lapply function to apply the class function to each column of the data frame.
> load("survey.RData")
> colnames(survey)
 [1] "Age"            "Workclass"      "fnlwgt"         "Education"
 [5] "Education.Num"  "Marital.Status" "Occupation"     "Relationship"
 [9] "Race"           "Sex"            "Capital.Gain"   "Capital.Loss"
[13] "Hours.Per.Week" "Native.Country" "Salary.Group"
> lapply(survey, class)
$Age
[1] "integer"

$Workclass
[1] "factor"

$fnlwgt
[1] "integer"

$Education
[1] "factor"

$Education.Num
[1] "integer"

$Marital.Status
[1] "factor"

$Occupation
[1] "factor"

$Relationship
[1] "factor"

$Race
[1] "factor"

$Sex
[1] "factor"

$Capital.Gain
[1] "integer"

$Capital.Loss
[1] "integer"

$Hours.Per.Week
[1] "integer"

$Native.Country
[1] "factor"

$Salary.Group
[1] "factor"
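As a small sketch on a made-up data frame (toy and its columns are hypothetical, standing in for survey), sapply reports the same class information as lapply but simplifies it to a named character vector, which is more compact to read and easy to tabulate.

```r
# Hypothetical stand-in for the survey dataset.
toy <- data.frame(Age = c(39L, 50L, 38L),
                  Sex = factor(c("Male", "Male", "Female")),
                  Hours.Per.Week = c(40L, 13L, 40L))

# One class name per column, returned as a named character vector.
col.classes <- sapply(toy, class)

# Count how many columns there are of each class.
class.counts <- table(col.classes)
```

Note that sapply only simplifies to a vector when every column has a single class. A column with several classes (an ordered factor, for instance) makes it fall back to a list, much like lapply.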
We can now simply use is.numeric to select the numeric
columns and store the result in a new dataset. Here
sapply applies is.numeric to every column and returns a
logical vector marking the numeric ones, which we use as
the column index:
> survey.numeric <- survey[,sapply(survey, is.numeric)]
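As a minimal sketch of the same selection on a made-up data frame (toy and its columns are hypothetical, standing in for survey), the logical vector returned by sapply can be inspected before it is used as a column index. The drop=FALSE argument is an optional safeguard that is not part of the command above: it keeps the result a data frame even when only one column turns out to be numeric.

```r
# Hypothetical stand-in for the survey dataset.
toy <- data.frame(Age = c(39L, 50L, 38L),
                  Sex = factor(c("Male", "Male", "Female")),
                  Hours.Per.Week = c(40L, 13L, 40L))

# One TRUE/FALSE per column: TRUE where the column is numeric.
is.num <- sapply(toy, is.numeric)

# Keep only the numeric columns; drop=FALSE guards the one-column case.
toy.numeric <- toy[, is.num, drop = FALSE]
```

Without drop=FALSE, a selection that matches a single column is simplified to a plain vector, which can surprise code that expects a data frame.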
Alternatively, you could build a list of the columns to remove and then remove them from the dataset itself, so that no second copy of the data needs to be stored.
First build a vector of the indices of the columns to remove, and reverse it: once a column is removed, every column to its right shifts left and its index is then one less, so we remove columns starting from the right. We use sapply to identify the numeric columns (those for which is.numeric is true) and negate the result to pick out the columns to remove.
> rmcols <- rev(seq(1,ncol(survey))[!as.logical(sapply(survey, is.numeric))])
> rmcols
[1] 15 14 10  9  8  7  6  4  2
Now remove the columns from the dataset simply by setting the column to NULL.
> for (i in rmcols) survey[[i]] <- NULL
> colnames(survey)
[1] "Age"            "fnlwgt"         "Education.Num"  "Capital.Gain"
[5] "Capital.Loss"   "Hours.Per.Week"
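The reversal matters only because the loop deletes one column at a time. As a rough sketch on a made-up data frame (toy and its columns are hypothetical, not the survey data), the loop can be compared with assigning NULL to all the unwanted columns in a single step using a logical index. That variant needs no reversal, because the columns are identified before any of them is removed.

```r
# Hypothetical stand-in for the survey dataset.
toy <- data.frame(Age = c(39L, 50L, 38L),
                  Workclass = factor(c("State-gov", "Self-emp", "Private")),
                  fnlwgt = c(77516L, 83311L, 215646L),
                  Sex = factor(c("Male", "Male", "Female")))

# Indices of the non-numeric columns, largest first, as in the text.
rmcols <- rev(unname(which(!sapply(toy, is.numeric))))

# Approach 1: remove one column at a time, right to left.
by.loop <- toy
for (i in rmcols) by.loop[[i]] <- NULL

# Approach 2: remove all non-numeric columns in one assignment.
at.once <- toy
at.once[!sapply(at.once, is.numeric)] <- NULL
```

Both approaches should leave only the Age and fnlwgt columns. The single assignment is shorter and avoids the index bookkeeping.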
This same process can be used to remove or retain columns of any type, simply by using the appropriate R function: e.g., is.factor, is.logical, is.integer, or is.numeric.
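To illustrate swapping the predicate, here is a small sketch on a made-up data frame (toy and its columns are hypothetical). It uses base R's Filter, a different route from the sapply indexing shown earlier: Filter applies a predicate to each column and keeps those for which it returns TRUE. Negating the sapply result gives the complementary selection.

```r
# Hypothetical data frame with a mix of column types.
toy <- data.frame(Age = c(39L, 50L, 38L),
                  Sex = factor(c("Male", "Male", "Female")),
                  Employed = c(TRUE, FALSE, TRUE),
                  Weight = c(70.5, 82.0, 64.3))

# Retain columns of one type by choosing the predicate.
toy.factors  <- Filter(is.factor, toy)
toy.logicals <- Filter(is.logical, toy)
toy.numerics <- Filter(is.numeric, toy)

# Remove rather than retain: keep everything that is not a factor.
toy.nofactors <- toy[, !sapply(toy, is.factor), drop = FALSE]
```

Here is.numeric matches both the integer column Age and the double column Weight, but not the factor or logical columns.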
Copyright © 2004-2006 Graham.Williams@togaware.com