DATA MINING
Desktop Survival Guide by Graham Williams
A common procedure for text mining is to "score" each document by a vector that records the frequency of occurrence of commonly used and subject-matter-specific words and phrases. Assuming the documents have already been classified into a number of classes (perhaps those that are relevant versus those that are not), you can use this "training set" with any of the many supervised learning or classification tools in R (e.g., trees, logistic regression, boosting, Random Forests, support vector machines, or linear discriminant analysis).
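The scoring step can be sketched in a few lines. This is a minimal illustration, not the guide's own method: it is written in Python with only the standard library, and the vocabulary, the two example documents, and the helper names tokenize and score_document are all hypothetical. It counts single words only; phrases would need an extra matching step.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def score_document(text, vocabulary):
    """Return the count of each vocabulary term in the text, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and labelled documents, for illustration only.
vocabulary = ["data", "mining", "text", "football"]
training_set = [
    ("Text mining is data mining applied to text.", "relevant"),
    ("The football match ended in a draw.", "not relevant"),
]

# One frequency vector per document, plus the matching class labels.
features = [score_document(text, vocabulary) for text, _ in training_set]
labels = [label for _, label in training_set]
```

The features and labels lists together form the kind of training set that a supervised learner expects: one row of term counts per document and one class label per row.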
Text mining begins with feature extraction. Techniques include:
Keyword extraction
Bag of words
Term weighting
Co-occurrence of words
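The techniques listed above can be sketched together on a toy corpus. This is an illustrative sketch under stated assumptions, in Python with only the standard library: the two documents and all function names are hypothetical, term weighting is shown as one common choice (term count multiplied by the log of inverse document frequency), and keyword extraction is reduced to picking the highest-weighted term.

```python
import math
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Bag of words: term counts, ignoring word order."""
    return Counter(tokenize(text))

def tf_idf(bags):
    """Term weighting: count * log(N / number of documents containing the term)."""
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags for term in bag)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in bag.items()}
        for bag in bags
    ]

def co_occurrence(bags):
    """Count, for each unordered pair of distinct terms, the documents containing both."""
    pairs = Counter()
    for bag in bags:
        pairs.update(combinations(sorted(bag), 2))
    return pairs

# Hypothetical toy corpus, for illustration only.
documents = [
    "text mining finds patterns in text",
    "data mining finds patterns in data",
]
bags = [bag_of_words(doc) for doc in documents]
weights = tf_idf(bags)
pairs = co_occurrence(bags)

# Keyword extraction, reduced to the single highest-weighted term per document.
keywords = [max(w, key=w.get) for w in weights]
```

Terms that appear in every document, such as "mining" here, receive a weight of zero, which is why the weighting step helps separate distinctive words from common ones.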