Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google

Usage

Virtual Predict was deployed by the department of Medicinal Chemistry at AstraZeneca together with consultants from Virtual Genetics. Known molecular properties from about 1,000 substances were used to create a model to predict water solubility in new untested drug leads. Molecular properties included the number of atoms, bonds and rings, graph radius and diameter, Wiener index, carbon, nitrogen and oxygen counts, molecular weight and volume, and polarizability. By including known solubility values Virtual Predict identified important criteria for predicting solubility of unknown substances with a precision between 87%-93%.

Virtual Predict handles both numerical and categorical data. Structured data in the form of ..... is also handled.

The user can choose from a range of data mining techniques including decision trees (divide-and-conquer, or DAC, in Virtual Predict terminology) for classification and regression. Minimum Description Length (MDL) is employed for ..... Options for choosing variables include Information Gain. Pruning, boosting, bagging, boosted stumps, and naive Bayes are all options. And SAC?

Building a model is straight-forward with the simple and uncluttered MSWindows graphical user interface. Data in comma separated format (CSV) can easily be imported. However, visualisations of the resulting models are not provided.

Building multiple models with different approaches is straight forward and test protocols are provided to empirically compare the performance of the various methods for a particular dataset.

Virtual Predict employs a representation language based on the common logic programming language Prolog.

Copyright © 2004-2006 Graham.Williams@togaware.com
Support further development through the purchase of the PDF version of the book.
Brought to you by Togaware.