Data Mining
1. Introduction
What is data mining?
Predominant areas in computing
history
What is this course about?
Course structure
Association analysis, data classification,
and clustering.
2. Decision Tree Construction
On-line references
Structure of decision trees
Data input
Decision tree construction: A simplified
example
The Concept Learning System (CLS)
Information Gain
Training, testing and predictive accuracy
Information gain vs the gain ratio
criterion
Difficulties with decision tree
construction
Overfitting
3. Association Analysis
A mathematical model for association
analysis
Large itemsets and association rules
Apriori: constructing large itemsets with
minsup by iteration
Interestingness of Discovered Association
Rules
Application examples
Association analysis vs. classification
Machine Learning Software in Java at the
University of Waikato
Experiments/exercises with weka.associations.Apriori
4. Clustering
Clustering: unsupervised learning
Types of clusters
Different clustering methods
k-means: iterative distance-based
clustering
Dealing with discrete values in k-means
Constructing a hierarchical clustering
using k-means
Incremental clustering/classification:
pros and cons
Steps in COBWEB to construct a clustering
tree
Category utility
4 choices at each level when inserting a
new instance
The COBWEB algorithm in Weka
The cutoff parameter (-C percentage)
How to combine clustering and
classification?
How to measure the quality of clustering?
Density-based clustering methods
Outlier analysis
5. Rule Induction
Classification rules
Decision lists and disjunctive normal
form (DNF)
1R ("1-rule")
Steps in c4.5rules
Running c4.5rules on Mansfield
The default (or no information) rule
Rule Induction by Covering
PRISM: Constructing correct and "perfect"
rules
Divide-and-Conquer vs Separate-and-Conquer
Rule induction algorithms in Weka
Classification vs prediction
Lazy vs eager learning
The k-nearest neighbor algorithm
Genetic algorithms (GA)
6. Bayesian Methods
Alternative hypotheses
Prior knowledge
Imperfect data indicators
Conditional probability
Bayes theorem
Maximum A Posteriori (MAP)
Naive Bayes Classifier
The PlayTennis data with Naive Bayes
Belief networks
Network topology
Conditional probability tables (CPT)
Joint probability distribution in a
belief network
Training belief networks
Incremental construction of belief
networks
Inference in belief networks
The Naive Bayes algorithm in Weka
7. Dealing with Noise and Real-Valued
Attributes
Artificial vs. real-world databases
The Monk's Problems: An example
Sources of noise
Noise Handling
Cross validation
Dealing with contradictions and
redundancy
Expansion of Don't Care values
Handling of ? values
Generation of nonexistent examples
Light-weight leaves/rules
Stopping criteria to avoid overfitting
Overfitting vs underfitting
Occam's Razor
Reduced error pruning with a separate
pruning set
Truncation of rules - TRUNC
"No match" and "multiple
match" when applying induction
results in deduction
Measure of fit
Estimate of probability
Dealing with real-valued attributes:
Discretization
Criteria to stop the recursive splitting
Discretization in C4.5
8. Data Mining from Very Large
Databases
Why large databases?
Data partitioning
Sampling techniques
Cross validation
Windowing in C4.5
Integrative windowing
Bagging, boosting, and their differences
Boosting in C5.0
Incremental batch learning
Aggregation of rules from different data
sources
Leading data mining tools