DATA MINING
Desktop Survival Guide by Graham Williams
The survey dataset is a little larger (3.8MB) and illustrates more of the options of the read.csv function: header, strip.white, na.strings, and col.names. The data was extracted from the US Census Bureau database, and is again available from the UCI Machine Learning Repository.
    # Build the URL of the raw data file at the UCI repository.
    UCI <- "ftp://ftp.ics.uci.edu/pub"
    REPOS <- "machine-learning-databases"
    survey.url <- sprintf("%s/%s/adult/adult.data", UCI, REPOS)

    # The file has no header row, pads each field with a space, and
    # marks missing values with "?".
    survey <- read.csv(survey.url, header=FALSE, strip.white=TRUE,
                       na.strings="?",
                       col.names=c("Age", "Workclass", "fnlwgt", "Education",
                                   "Education.Num", "Marital.Status",
                                   "Occupation", "Relationship", "Race", "Sex",
                                   "Capital.Gain", "Capital.Loss",
                                   "Hours.Per.Week", "Native.Country",
                                   "Salary.Group"))

    # Keep local copies, both as CSV and as a compressed R dataset.
    write.table(survey, "survey.csv", sep=",", row.names=FALSE)
    save(survey, file="survey.RData", compress=TRUE)
    > dim(survey)
    [1] 32561    15
    > str(survey)
    `data.frame':   32561 obs. of  15 variables:
     $ Age           : int  39 50 38 53 28 37 49 52 31 42 ...
     $ Workclass     : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
     $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 ...
     $ Education     : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 ...
     $ Education.Num : int  13 13 9 7 13 14 5 9 14 13 ...
     $ Marital.Status: Factor w/ 7 levels "Divorced",..: 5 3 1 3 3 3 4 3 5 3 ...
     $ Occupation    : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 ...
     $ Relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 ...
     $ Race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 ...
     $ Sex           : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
     $ Capital.Gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
     $ Capital.Loss  : int  0 0 0 0 0 0 0 0 0 0 ...
     $ Hours.Per.Week: int  40 13 40 40 40 40 16 45 50 40 ...
     $ Native.Country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 ...
     $ Salary.Group  : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
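The effect of the individual read.csv options can be tried without downloading anything. The following is only a minimal sketch: the three input rows are made up in the style of adult.data, and the name survey.sample is hypothetical. It also assumes a recent R, where (from version 4.0.0) character columns are no longer converted to factors by default, so stringsAsFactors=TRUE is passed to get Factor columns like those in the str output above.

```r
# Three made-up rows in the style of adult.data (hypothetical values):
# no header line, a space after each comma, and "?" for a missing value.
sample.lines <- c("39, State-gov, 13, Male",
                  "50, ?, 13, Male",
                  "28, Private, 9, Female")

survey.sample <- read.csv(text=sample.lines,
                          header=FALSE,           # first line is data, not names
                          strip.white=TRUE,       # drop the blank after each comma
                          na.strings="?",         # read "?" as NA
                          stringsAsFactors=TRUE,  # factors, as in older R defaults
                          col.names=c("Age", "Workclass",
                                      "Education.Num", "Sex"))

str(survey.sample)
colSums(is.na(survey.sample))   # missing values per column
```

Without strip.white=TRUE the second Workclass value would be read as " ?" (with its leading blank) and so would not match na.strings="?".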
Once again, the dataset can be read in from the CSV file or else loaded as an R dataset:
    > survey <- read.csv("survey.csv")

OR

    > load("survey.RData")
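The two ways of getting the data back behave slightly differently: read.csv returns a data frame that has to be assigned, whereas load recreates the saved object under the name it had when it was saved. A minimal sketch of the round trip, using a tiny stand-in data frame with made-up values and temporary files (the names csv.file, rdata.file, from.csv and restored are hypothetical):

```r
# A tiny stand-in data frame (hypothetical values, not the real survey data).
survey <- data.frame(Age=c(39L, 50L),
                     Sex=factor(c("Male", "Female")))

csv.file   <- file.path(tempdir(), "survey.csv")
rdata.file <- file.path(tempdir(), "survey.RData")

# Write both copies, as in the text.
write.table(survey, csv.file, sep=",", row.names=FALSE)
save(survey, file=rdata.file, compress=TRUE)

# read.csv returns the data frame, so the result must be assigned.
# stringsAsFactors=TRUE is assumed here for R >= 4.0.0.
from.csv <- read.csv(csv.file, stringsAsFactors=TRUE)

# load() recreates the object under its original name and returns,
# invisibly, the names of the objects it restored.
rm(survey)
restored <- load(rdata.file)

restored                      # names of the restored objects
all.equal(from.csv, survey)   # compare the CSV copy with the restored one
```

The R dataset keeps column types and factor levels exactly as saved, while the CSV copy is re-parsed on every read, so the read.csv options matter again at that point.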