Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Text Mining with R

A common procedure for text mining is to "score" each document by a vector that records how often commonly used and subject-specific words and phrases occur in it. If the documents are already classified into a number of classes (for example, relevant versus not relevant), the scored documents form a "training set" that can be used with any of the many supervised learning or classification tools in R (e.g., trees, logistic regression, boosting, random forests, support vector machines, and linear discriminant analysis).
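
The scoring and training steps described above can be sketched in base R. This is a minimal illustration, not a recommended pipeline: the documents, the labels, and the two-word vocabulary are all invented for the example, and logistic regression via glm() stands in for any of the classifiers listed.

```r
# Hypothetical training set: eight short documents, already labelled as
# relevant (1) or not relevant (0). All text and labels are invented.
docs <- c(
  "the engine failed and the engine was replaced",
  "engine oil leak found during inspection",
  "brake service booked for friday",
  "replaced the oil filter",
  "the meeting was moved to friday",
  "lunch on friday then drinks on friday",
  "search engine marketing report",
  "agenda for the budget review"
)
relevant <- c(1, 1, 1, 1, 0, 0, 0, 0)

# Assumed vocabulary of words to count; a real one would be far larger.
vocab <- c("engine", "friday")

# Count how often each vocabulary word occurs in a vector of tokens.
count_words <- function(words) {
  as.integer(table(factor(words, levels = vocab)))
}

# Split each document into lower-case word tokens.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Score each document: one row per document, one column per word.
scores <- t(vapply(tokens, count_words, integer(length(vocab))))
colnames(scores) <- vocab

# Combine the scores and the class labels into a training set.
train <- data.frame(scores, relevant = relevant)

# Fit one of the classifiers mentioned in the text: logistic regression.
fit <- glm(relevant ~ engine + friday, data = train, family = binomial)

# Score a new, unseen document with the same vocabulary and predict
# the probability that it is relevant.
new_doc <- "engine noise reported on friday"
new_words <- strsplit(tolower(new_doc), "[^a-z]+")[[1]]
new_scores <- as.data.frame(t(count_words(new_words)))
names(new_scores) <- vocab
p_relevant <- predict(fit, newdata = new_scores, type = "response")
```

The same train data frame could be passed to other supervised learners instead of glm(). With so few documents the fitted probabilities mean little; the point is only the shape of the workflow: tokenise, count, attach labels, fit, predict.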

See ttda

Text mining begins with feature extraction. Techniques include:

Keyword extraction

Bag of words

Term weighting

Co-occurrence of words
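
The four techniques listed above can each be illustrated in a few lines of base R. The sketch below rests on several assumptions: the three-document corpus is invented, term weighting is shown as tf-idf with idf = log(N / df) (one common choice among several), keyword extraction is reduced to picking the highest-weighted term per document, and co-occurrence is counted at the document level.

```r
# Hypothetical three-document corpus, invented for illustration.
docs <- c(
  "data mining finds patterns in data",
  "text mining finds patterns in text",
  "clustering data then clustering text"
)

# Split each document into lower-case word tokens.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Bag of words: a document-term matrix of raw counts, ignoring word order.
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(tokens, function(words) {
  as.integer(table(factor(words, levels = vocab)))
}, integer(length(vocab))))
colnames(dtm) <- vocab

# Term weighting: tf-idf down-weights words that occur in many documents.
doc_freq <- colSums(dtm > 0)          # number of documents containing each word
idf <- log(nrow(dtm) / doc_freq)      # assumed idf variant: log(N / df)
tfidf <- sweep(dtm, 2, idf, `*`)      # multiply each column by its idf

# Keyword extraction (simplified): the highest-weighted term per document.
keywords <- vocab[apply(tfidf, 1, which.max)]

# Co-occurrence: the number of documents in which two words both appear.
present <- (dtm > 0) * 1
cooc <- crossprod(present)            # same as t(present) %*% present
```

Here cooc["mining", "patterns"] counts the documents that contain both words, and the diagonal of cooc reproduces doc_freq. Any of dtm, tfidf, or rows of cooc could serve as the feature vectors used to score documents.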



Copyright © 2004-2006 Graham.Williams@togaware.com