Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


The Adult Survey Dataset

The survey dataset is a little larger (3.8MB) and illustrates more of the options to the read.csv function: header, strip.white, na.strings, and col.names. The data was extracted from the US Census Bureau database and is again available from the UCI Machine Learning Repository.



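# Build the download URL from the UCI repository root and the dataset path.
UCI <- "ftp://ftp.ics.uci.edu/pub"
REPOS <- "machine-learning-databases"
survey.url <- sprintf("%s/%s/adult/adult.data", UCI, REPOS)

# The raw file has no header row, pads its fields with spaces, and marks
# missing values with "?", so the column names are supplied here.
survey <- read.csv(survey.url, header=FALSE, strip.white=TRUE,
                   na.strings="?",
                   col.names=c("Age", "Workclass", "fnlwgt",
                     "Education", "Education.Num", "Marital.Status",
                     "Occupation", "Relationship", "Race", "Sex",
                     "Capital.Gain", "Capital.Loss",
                     "Hours.Per.Week", "Native.Country",
                     "Salary.Group"))

# Keep local copies, as CSV and as a compressed R dataset.
write.table(survey, "survey.csv", sep=",", row.names=FALSE)
save(survey, file="survey.RData", compress=TRUE)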

http://rattle.togaware.com/code/get-survey.R
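
To see why strip.white and na.strings are used together in the call above, the following minimal sketch reads three sample rows with and without strip.white. The rows are an assumption for illustration only: they imitate the layout of adult.data (fields separated by a comma and a space, "?" for a missing value) and the second and third rows are made up. The names sample.lines, raw and clean are likewise hypothetical. stringsAsFactors=FALSE is set only so that the values print as plain strings.

```r
# Three rows in the style of adult.data; rows two and three are invented.
sample.lines <- c("39, State-gov, 77516, Bachelors",
                  "54, ?, 180211, Some-college",
                  "32, Private, 205019, ?")

# Without strip.white the leading space stays in each character field,
# so " ?" does not match na.strings="?" and survives as an ordinary value.
raw <- read.csv(text=sample.lines, header=FALSE, na.strings="?",
                stringsAsFactors=FALSE)

# With strip.white=TRUE the padding is removed before the comparison,
# so the "?" entries are read as NA. col.names replaces V1, V2, ...
clean <- read.csv(text=sample.lines, header=FALSE, strip.white=TRUE,
                  na.strings="?", stringsAsFactors=FALSE,
                  col.names=c("Age", "Workclass", "fnlwgt", "Education"))

print(raw$V2)                  # values keep their leading space
print(clean$Workclass)         # the "?" entry is now NA
print(colSums(is.na(clean)))   # number of missing values per column
```

The same colSums(is.na(...)) call can be applied to the full survey data frame to count the missing values in each column.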



> dim(survey)
[1] 32561    15
> str(survey)
`data.frame':   32561 obs. of  15 variables:
 $ Age           : int  39 50 38 53 28 37 49 52 31 42 ...
 $ Workclass     : Factor w/ 8 levels "Federal-gov",..: 7 6 4 4 4 4 4 6 4 4 ...
 $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 ...
 $ Education     : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 ...
 $ Education.Num : int  13 13 9 7 13 14 5 9 14 13 ...
 $ Marital.Status: Factor w/ 7 levels "Divorced",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ Occupation    : Factor w/ 14 levels "Adm-clerical",..: 1 4 6 6 10 4 8 ...
 $ Relationship  : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 ...
 $ Race          : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 ...
 $ Sex           : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
 $ Capital.Gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
 $ Capital.Loss  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Hours.Per.Week: int  40 13 40 40 40 40 16 45 50 40 ...
 $ Native.Country: Factor w/ 41 levels "Cambodia","Canada",..: 39 39 39 39 ...
 $ Salary.Group  : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

Once again, the dataset can be read in from the CSV file or else loaded as an R dataset:



> survey <- read.csv("survey.csv")
OR
> load("survey.RData")
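
The two routes behave a little differently, which the following sketch tries to show on a small stand-in data frame rather than on the survey data itself. The data frame toy, its values, and the temporary file names are assumptions made up for the example. stringsAsFactors=TRUE is passed explicitly because R 4.0.0 and later no longer convert strings to factors by default.

```r
# Toy stand-in for the survey data frame (values invented for illustration).
toy <- data.frame(Age=c(39L, 50L, 38L),
                  Workclass=factor(c("State-gov", NA, "Private")))

csv.file <- file.path(tempdir(), "toy.csv")
rdata.file <- file.path(tempdir(), "toy.RData")

# Route 1: plain-text CSV. Missing values are written as NA, which read.csv
# recognises by default, but column types are inferred again on reading.
write.table(toy, csv.file, sep=",", row.names=FALSE)
from.csv <- read.csv(csv.file, stringsAsFactors=TRUE)

# Route 2: binary R dataset. load() recreates the object under the name it
# was saved with ("toy"), so keep a copy and remove the original first.
original <- toy
save(toy, file=rdata.file, compress=TRUE)
rm(toy)
load(rdata.file)

# Compare each restored data frame with the original.
print(all.equal(original, from.csv))
print(all.equal(original, toy))
```

Note that load() takes no assignment: it restores the saved object by name, whereas read.csv returns a data frame that has to be assigned.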



Copyright © 2004-2006 Graham.Williams@togaware.com