DATA MINING
Desktop Survival Guide by Graham Williams |
|||||
The basic usage of RODBC will connect to a known ODBC object using the odbcConnect function and query the database for the tables it exports using sqlTables:
> library(RODBC) > channel <- odbcConnect("DWH") # This may pop up a window to enter username and password > tables <- sqlTables(channel) > columns <- sqlColumns(channel, "clients") |
You can then retrieve the full contents of a table with
sqlFetch:
> ds <- sqlFetch(channel, "tablename") |
> ds <- sqlQuery(channel, "SELECT * FROM clients WHERE age > 35") |
Some ODBC drivers, such as the Netezza ODBC driver, have a pre-fetch option that interacts poorly with applications connecting through the driver. With a pre-fetch option the driver appears to report fewer rows being available than actually available. It seems that the number of rows reported is in fact the pre-fetch limited number of rows. For the Netezza ODBC driver, for example, the default is 256 rows. This confuses the application connecting to ODBC (in this case, R through the RODBC). The symptom is that we only receive 256 rows from the table. Internally, the application is probably using either the SQLExtendedFetch or SQLFetchScroll ODBC functions.
There are a number of solutions to this issue. One from the applications side is to set believeNRows to FALSE. This will then retrieve all the rows from the table. Another solution is at the driver configuration level. For example, in connecting through the Netezza ODBC driver a configuration option is available where you can change the default Prefetch Count value.
An example of the issue is illustrated below:
> channel <- odbcConnect("netezza") > orders <- sqlQuery(channel, "SELECT * FROM orders LIMIT 500") > dim(orders) [1] 256 9 > orders <- sqlQuery(channel, "select * from orders limit 500", believeNRows=FALSE) > dim(orders) [1] 500 9 > odbcCloseAll() |
We can reopen the driver and in the resulting GUI configuration set the Prefetch Count to perhaps 10,000. Then:
> channel <- odbcConnect("netezza") > orders <- sqlQuery(channel, "SELECT * FROM orders LIMIT 500") > dim(orders) [1] 500 9 > orders <- sqlQuery(channel, "SELECT * FROM orders LIMIT 50000") > dim(orders) [1] 10000 9 > orders <- sqlQuery(channel, "SELECT * FROM orders LIMIT 50000", believeNRows=FALSE) > dim(orders) [1] 50000 9 > |
Note that we would not want to default beleiveNRows to FALSE since, for example, with a Teradata query this increase the query time by some 3 times!
For an SQLite database, edit .odbc.ini
[audit] Description=SQLite test audit database for Rattle Driver=SQLite3 Database=/home/kayon/projects/rattle/audit.db # optional lock timeout in milliseconds Timeout=2000 |
Copyright © 2004-2006 Graham.Williams@togaware.com Support further development through the purchase of the PDF version of the book.