DATA MINING
Desktop Survival Guide by Graham Williams
We saw in Chapter some of the R functions that help
us get a basic picture of the scope and type of data in any
dataset. These report the most basic information: the
number and names of the columns and rows (for data frames) and a summary
of the data values themselves. We illustrate this again with the
wine dataset (see Section ):
> load("wine.RData")
> dim(wine)
[1] 178  14
> nrow(wine)
[1] 178
> ncol(wine)
[1] 14
> colnames(wine)
 [1] "Type"            "Alcohol"         "Malic"           "Ash"
 [5] "Alcalinity"      "Magnesium"       "Phenols"         "Flavanoids"
 [9] "Nonflavanoids"   "Proanthocyanins" "Color"           "Hue"
[13] "Dilution"        "Proline"
> rownames(wine)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"
 [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"
[...]
[157] "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167" "168"
[169] "169" "170" "171" "172" "173" "174" "175" "176" "177" "178"
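The way these functions relate to one another can be checked directly. The following is a minimal sketch that assumes wine.RData is not at hand, so it builds a small stand-in data frame from the first six rows shown by head(wine) in this section, keeping only four of the fourteen columns. The name wine6 is hypothetical and is not part of the original dataset.

```r
# Stand-in for the wine data: six rows and four columns copied from the
# head(wine) listing in this section. The name wine6 is hypothetical.
wine6 <- data.frame(
  Type      = factor(c(1, 1, 1, 1, 1, 1), levels = 1:3),
  Alcohol   = c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20),
  Malic     = c(1.71, 1.78, 2.36, 1.95, 2.59, 1.76),
  Magnesium = c(127L, 100L, 101L, 113L, 118L, 112L)
)

dims <- dim(wine6)                        # c(number of rows, number of columns)
print(dims)

# nrow() and ncol() are the two components of dim()
print(dims[1] == nrow(wine6))
print(dims[2] == ncol(wine6))

# There is one column name per column and one row name per row
print(length(colnames(wine6)) == ncol(wine6))
print(length(rownames(wine6)) == nrow(wine6))
```

On the full wine data frame the same checks would compare against the 178 rows and 14 columns reported above.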
Next, we'd like to see what the data itself looks like. We can list
the first few rows of the data using head:
> head(wine)
  Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids
1    1   14.23  1.71 2.43       15.6       127    2.80       3.06          0.28
2    1   13.20  1.78 2.14       11.2       100    2.65       2.76          0.26
3    1   13.16  2.36 2.67       18.6       101    2.80       3.24          0.30
4    1   14.37  1.95 2.50       16.8       113    3.85       3.49          0.24
5    1   13.24  2.59 2.87       21.0       118    2.80       2.69          0.39
6    1   14.20  1.76 2.45       15.2       112    3.27       3.39          0.34
  Proanthocyanins Color  Hue Dilution Proline
1            2.29  5.64 1.04     3.92    1065
2            1.28  4.38 1.05     3.40    1050
3            2.81  5.68 1.03     3.17    1185
4            2.18  7.80 0.86     3.45    1480
5            1.82  4.32 1.04     2.93     735
6            1.97  6.75 1.05     2.85    1450
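By default head lists six rows, as above. As a small sketch of how to control this, the example below uses the n argument of head and its counterpart tail. The data frame toy is invented purely for illustration and is not related to the wine data.

```r
# Toy data frame, invented purely for illustration (not the wine data).
toy <- data.frame(id = 1:10, value = (1:10)^2)

first3 <- head(toy, n = 3)    # first three rows instead of the default six
last2  <- tail(toy, n = 2)    # last two rows
but2   <- head(toy, n = -2)   # negative n: every row except the last two

print(first3)
print(last2)
print(nrow(but2))
```

With the wine data loaded, head(wine, n = 10) and tail(wine) would be used in the same way.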
Next we might look at the structure of the data using the
str (structure) function. This provides a compact overview
of each column: its data type and its first few values:
> str(wine)
`data.frame': 178 obs. of  14 variables:
 $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
 $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
 $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
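The str listing is meant for reading. When the column types are needed inside a script, one possible approach (a sketch, not something the text above prescribes) is to apply class to each column with sapply. The small data frame wine3 below is a hypothetical stand-in holding three rows and three columns copied from the head(wine) listing.

```r
# Three rows and three columns copied from the head(wine) listing; the
# name wine3 is hypothetical and used only for illustration.
wine3 <- data.frame(
  Type      = factor(c(1, 1, 1), levels = 1:3),
  Alcohol   = c(14.23, 13.20, 13.16),
  Magnesium = c(127L, 100L, 101L)
)

# Named character vector with one class per column
col_types <- sapply(wine3, class)
print(col_types)

# Logical vector: TRUE for num and int columns, FALSE for factors
is_num <- sapply(wine3, is.numeric)
print(names(wine3)[is_num])
```

The classes correspond to the Factor, num and int labels that str prints for the same columns.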
The final step in the first look at the data is to get a summary of
each variable using summary:
> summary(wine)
 Type      Alcohol          Malic            Ash          Alcalinity
 1:59   Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60
 2:71   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20
 3:48   Median :13.05   Median :1.865   Median :2.360   Median :19.50
        Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49
        3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50
        Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00
[...]
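The summary function treats each column according to its type: numeric columns get a six-number summary (minimum, quartiles, mean and maximum), while factors get a count per level. The sketch below shows both cases on single vectors. The vectors alcohol and type are hypothetical stand-ins holding the six Alcohol and Type values from the head(wine) listing, so the figures will differ from those for the full dataset.

```r
# Six Alcohol values and six Type values copied from the head(wine)
# listing; the vector names are hypothetical.
alcohol <- c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20)
type    <- factor(c(1, 1, 1, 1, 1, 1), levels = 1:3)

# Numeric vector: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s <- summary(alcohol)
print(s)

# Individual entries can be extracted by name
print(s[["Median"]])

# Factor: one count per level, including levels with no observations
print(summary(type))

# table() reports the same per-level counts
counts <- table(type)
print(counts)
```

With the wine data loaded, summary(wine$Alcohol) and table(wine$Type) would give the corresponding columns of the output above.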
Copyright © 2004-2006 Graham.Williams@togaware.com