Data Mining Survivor: Data_Selection - Training and Test Datasets

DATA MINING
Desktop Survival Guide
by Graham Williams

Training and Test Datasets

In modelling we often build a model on a training set and then evaluate its performance on a separate test set. The simplest way to partition a dataset into training and test sets is with the sample function:

> sub <- sample(nrow(iris), floor(nrow(iris) * 0.8))
> iris.train <- iris[sub, ]
> iris.test <- iris[-sub, ]

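When the first argument to sample is a single positive integer n, the function draws from the integers 1 to n; the second argument is the number of values to draw, by default without replacement. Here 80% of the row indices are selected for the training set, and the negative index in iris[-sub, ] keeps the remaining 20% of the rows for the test set.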
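Because sample draws at random, each run produces a different partition. The following is a minimal sketch of a reproducible version of the split above; the seed value and the variable names n and train_idx are arbitrary choices for illustration, not part of the original example.

```r
# Fix the random number generator state so the split can be repeated.
# The seed value itself is arbitrary.
set.seed(42)

n <- nrow(iris)                          # 150 observations
train_idx <- sample(n, floor(n * 0.8))   # 120 distinct row indices from 1..n

iris.train <- iris[train_idx, ]          # rows chosen for training
iris.test  <- iris[-train_idx, ]         # all remaining rows

nrow(iris.train)  # 120
nrow(iris.test)   # 30

# The two sets share no rows.
length(intersect(rownames(iris.train), rownames(iris.test)))  # 0
```

Note that this simple split does not guarantee that each species appears in the same proportion in both sets.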

The sample.split function of the caTools package also comes in handy here. Given a vector of labels, it returns a logical mask that splits the vector into two subsets, by default two thirds in one (TRUE) and one third in the other (FALSE), while maintaining the relative ratio of the different categorical values represented in the vector:

> library(caTools)
> mask <- sample.split(iris$Species)
> mask
  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[...]
[145]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
> table(iris$Species)

    setosa versicolor  virginica
        50         50         50
> table(iris$Species[mask])

    setosa versicolor  virginica
        33         33         33
> table(iris$Species[!mask])

    setosa versicolor  virginica
        17         17         17
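To make the idea of maintaining the class ratio concrete, here is a rough base R sketch of a stratified split. The helper stratified_mask is hypothetical, written only for illustration; it is not part of caTools and is not claimed to match how sample.split is implemented. It assumes that sampling a fixed fraction within each class is what is wanted.

```r
# Hypothetical helper (not part of caTools): build a logical mask that
# selects roughly `ratio` of the observations within each class of `y`.
stratified_mask <- function(y, ratio = 2/3) {
  mask <- logical(length(y))               # starts as all FALSE
  for (idx in split(seq_along(y), y)) {    # positions belonging to one class
    k <- round(length(idx) * ratio)        # how many to select from this class
    chosen <- idx[sample.int(length(idx), k)]
    mask[chosen] <- TRUE
  }
  mask
}

set.seed(42)  # arbitrary seed, for reproducibility only
mask <- stratified_mask(iris$Species)

table(iris$Species[mask])    # 33 of each species
table(iris$Species[!mask])   # 17 of each species

# The mask indexes rows of the data frame directly.
iris.train <- iris[mask, ]
iris.test  <- iris[!mask, ]
```

A logical mask, whether built this way or returned by sample.split, can be applied to the rows of the data frame as in the last two lines, so the training and test sets keep the same species proportions as the full dataset.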
