DATA MINING
Desktop Survival Guide by Graham Williams
Large datasets often present challenges for R on memory-limited machines. While you may be able to load a large dataset, processing and modelling it may then fail with an error indicating that memory could not be allocated.
To maximise R's capabilities on large datasets, be sure to run a 64-bit operating system (e.g., Debian GNU/Linux, http://www.togaware.com/linux/survivor) on 64-bit hardware (e.g., AMD64, http://en.wikipedia.org/wiki/AMD64) with plenty of RAM (e.g., 16GB). Such capable machines are quite affordable.
Selecting and subsetting the required data from a database (e.g., through the RODBC package) or through other means (e.g., using Python) will generally be faster than loading the entire dataset into R and subsetting it there.
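As a rough sketch of this approach using RODBC (the data source, table, and column names here are hypothetical), only the rows and columns actually needed are transferred into R:

library(RODBC)

# Connect to a hypothetical ODBC data source named "datawarehouse".
channel <- odbcConnect("datawarehouse")

# Select only the columns and rows required for the analysis,
# rather than transferring the whole table into R.
ds <- sqlQuery(channel, "SELECT customer_id, amount, purchase_date
                         FROM   transactions
                         WHERE  purchase_date >= '2006-01-01'")

close(channel)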
On MS/Windows you may need to set the memory size using the command-line flag --max-mem-size. The amount of memory currently in use by, and allocated to, the R process is reported by the memory.size function.
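For example, a larger limit of about 2GB might be requested when starting R from the command line (the exact value and suffix syntax accepted can vary between versions of R; see the R for Windows documentation):

Rgui.exe --max-mem-size=2047M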
In the following example, some 470MB is in use and altogether about 1GB has been allocated.
> memory.size()      # Current memory in use: about 470MB
[1] 477706008
> memory.size(TRUE)  # Current memory allocated: about 1GB
[1] 1050681344
The memory limit currently in force in R is reported by the memory.limit function, which can also be used to set the limit.
> memory.limit()     # Current memory limit: 1GB
[1] 1073741824
> memory.limit(2073741824)  # Request a larger limit (about 2GB)
NULL
> memory.limit()
[1] 2684354560
A suggested process is to work with a subset of the full dataset, small enough to be comfortably held in memory. Explore the data, investigate candidate models, and prototype the final analysis using this smaller dataset. For the final full analysis you may then need to let R run overnight on a machine with plenty of RAM.
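A minimal sketch of this, assuming the full dataset is already loaded as fullds (a hypothetical name), is to draw a random sample of rows to work with:

# Draw a reproducible random sample of 10,000 rows for exploration
# and prototyping; the full dataset is used only for the final run.
set.seed(42)
smallds <- fullds[sample(nrow(fullds), 10000), ]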
A data frame of 150,000 rows and some 55 columns will occupy about 500MB of RAM.
Also, note the difference between data frames and arrays/matrices. For example, rbind'ing data frames is much more expensive than rbind'ing arrays/matrices (a simple timing comparison is sketched below). However, a matrix requires all of its elements to be of the same data type, whereas the columns of a data frame can each have a different type. A number of functions are written to handle either data frames or matrices (e.g., rpart) and it is best, if possible, to use a matrix in these cases. The coercion back to a data frame can always be done afterwards.
Note that to convert a data frame to a matrix you can use
as.matrix:
> m <- as.matrix(dframe)
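As an illustrative (and machine-dependent) comparison of the rbind costs mentioned above, the following builds a small all-numeric data frame and its matrix equivalent and times repeated rbind calls on each:

# An all-numeric data frame and the corresponding matrix.
dframe <- as.data.frame(matrix(runif(1e5), ncol=10))
m <- as.matrix(dframe)

# rbind'ing the data frame is typically noticeably slower than
# rbind'ing the matrix, even though the data are identical.
system.time(for (i in 1:200) tmp <- rbind(dframe, dframe))
system.time(for (i in 1:200) tmp <- rbind(m, m))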
To obtain an estimate of the amount of memory being used by an object
in R use the object.size function:
> object.size(ds)    # Object ds is using about 181MB
[1] 181694428
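A common idiom (not specific to this book) for finding which objects in the workspace are consuming the most memory is to apply object.size across everything returned by ls():

# Report the size, in bytes, of every object in the workspace,
# largest first.
rev(sort(sapply(ls(), function(x) object.size(get(x)))))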
The following function can be used to explore memory requirements:
sizes <- function(rows, cols=1)
{
  # Estimate the size of a single numeric cell from a test vector.
  testListLength <- 1000
  cellSize <- object.size(seq(0.5, testListLength/2, 0.5))/testListLength

  # Total memory required for the requested number of cells.
  cells <- rows * cols
  required <- cells * cellSize

  # Report the requirement in human-readable units.
  if (required > 1e12)
    result <- sprintf("%dTB", required %/% 1e12)
  else if (required > 1e9)
    result <- sprintf("%dGB", required %/% 1e9)
  else if (required > 1e6)
    result <- sprintf("%dMB", required %/% 1e6)
  else if (required > 1e3)
    result <- sprintf("%dKB", required %/% 1e3)
  else
    result <- sprintf("%dBytes", required)
  return(result)
}
For example, on a 32-bit machine, a 1-million-row dataset with 400 columns might require about 3GB of memory!
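That figure can be reproduced with a call such as the following (the exact per-cell size, and hence the reported value, may vary a little between platforms and versions of R):

> sizes(1e6, 400)
[1] "3GB"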
Copyright © 2004-2006 Graham.Williams@togaware.com