Knowledge Discovery and Data Mining: CIS 4523/5523, Spring 2022
Assignment 6
Out: March 7
Due: MONDAY March 14 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: One problem for a total of 50 points
Problem 1: (50 points)
Write a research proposal for the class project that you plan to carry out. You will present the project on April 21 and submit the project report by April 28.
Teams of two undergraduate students are allowed, as are teams of one undergraduate and one graduate student. Teams of two graduate students are not allowed.
Write the proposal using the following format:
- Your name(s) and e-mail address (so that the instructor can quickly approve your topic or ask for a revision/clarification);
- Title;
- Objective and Significance;
- Background;
- Proposed Approach (make sure to explain where you will get data and how much preprocessing is needed);
- References.
The proposal description may not exceed 2 pages in 12 pt font.
The following are some research project topic ideas suggested by the authors of the KDDM textbook:
- Evaluating Performance of Classifiers
- Compare the bias and variance of models generated using different evaluation methods (leave-one-out, cross-validation, bootstrap, stratification, etc.)
- References:
- Kohavi, R., A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (1995)
- Efron, B. and Tibshirani, R., Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule (1995)
- Martin, J.K., and Hirschberg, D.S., Small Sample Statistics for Classification Error Rates I: Error Rate Measurements (1996)
- Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms (1998)
- Support Vector Machine (SVM)
- Cost-sensitive learning
- A comparative study and implementation of different techniques for ensemble learning such as bagging, boosting, etc.
- References:
- Freund, Y. and Schapire, R.E., A Short Introduction to Boosting (1999)
- Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? (2002)
- Quinlan, J.R., Boosting, Bagging and C4.5 (1996)
- Bauer, E., Kohavi, R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants (1999)
- Semi-supervised learning (classification with labeled and unlabeled data)
- Applying different semi-supervised learning techniques to UCI data sets.
- References:
- Nigam, K., Using Unlabeled Data to Improve Text Classification (2001)
- Seeger, M., Learning with labeled and unlabeled data (2001)
- Nigam, K. and Ghani, R., Analyzing the Effectiveness and Applicability of Co-training (2000)
- Vittaut, J.N., Amini, M-R., Gallinari, P., Learning Classification with Both Labeled and Unlabeled Data (2002)
- Classification for rare-class problems
- A comparative study and/or implementation of different classification techniques to analyze rare class problems
- References:
- Joshi, M.V., and Agrawal, R., PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-study in Network Intrusion Detection) (2001)
- Joshi, M.V., Agrawal, R., and Kumar, V., Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction (2001)
- Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? (2002)
- Joshi, M.V., Kumar, V., Agrawal, R., On Evaluating Performance of Classifiers for Rare Classes (2002)
- Time Series Prediction/Classification
- A comparative study and/or implementation of time series prediction/classification techniques
- References:
- Geurts, P., Pattern Extraction for Time Series Classification (2001)
- Kadous, M.W., A General Architecture for Supervised Classification of Multivariate Time Series (1998)
- Giles, C.L., Lawrence, S. and Tsoi, A.C., Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference (2001)
- Keogh, E.J. and Pazzani, M.J., An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback (1998)
- Chatfield, C., The Analysis of Time Series, Chapman & Hall (1989)
- Sequence Prediction
- A comparative study and implementation of sequence prediction techniques
- References:
- Laird, P.D., Saul, R., Discrete Sequence Prediction and Its Applications, Machine Learning, 15(1): 43-68 (1994)
- Sun, R. and Giles, C.L., Sequence Learning: From Recognition and Prediction to Sequential Decision Making (2001)
- Lesh, N., Zaki, M.J., and Ogihara, M., Mining features for Sequence Classification (1999)
- Association Rules for Classification
- A comparative study and implementation of classification using association patterns (rules and itemsets)
- References:
- Liu, B., Hsu, W., and Ma, Y., Integrating Classification and Association Rule Mining (1998)
- Liu, B., Ma, Y. and Wong, C-K, Classification Using Association Rules: Weaknesses and Enhancements (2001)
- Li, W., Han, J. and Pei, J., CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules (2001)
- Deshpande, M. and Karypis, G., Using Conjunction of Attribute Values for Classification (2002)
- Spatial Association Rule Mining
- A comparative study on spatial association rule mining.
- References:
- Koperski, K., and Han, J., Discovery of Spatial Association Rules in Geographic Information Databases (1995)
- Shekhar, S. and Huang, Y., Discovering Spatial Co-location Patterns: A Summary of Results (2001)
- Malerba, D., Esposito, F. and Lisi, F., Mining Spatial Association Rules in Census Data (2001)
- Temporal Association Rule Mining
- A comparative study and/or implementation of temporal association rule mining techniques
- References:
- Li, Y., Ning, P., Wang, X.S., and Jajodia, S., Discovering Calendar-based Temporal Association Rules (2001)
- Chen, X. and Petrounias, Mining temporal features in association rules
- Lee, C.H., Lin, C.R. and Chen, M.S., On Mining General Temporal Association Rules in a Publication Database (2001)
- Ozden, B., Ramaswamy, Silberschatz, Cyclic Association Rules (1998)
- Literature on Sequential Association Rule Mining below
- Sequential Association Rule Mining
- A comparative study and/or implementation of sequential association rule mining techniques
- References:
- Srikant, R. and Agrawal, R., Mining Sequential Patterns: Generalizations and Performance Improvements (1996)
- Mannila, H., Toivonen, H., and Verkamo, A.I., Discovery of Frequent Episodes in Event Sequences (1997)
- Joshi, M., Karypis, G., and Kumar, V., A Universal Formulation of Sequential Patterns (1999)
- Borges, J. and Levene, M., Mining Association Rules in Hypertext Databases (1998)
- Outlier Detection
- A comparative study and/or implementation of outlier detection techniques.
- References:
- Knorr, Ng, A Unified Notion of Outliers: Properties and Computation (1997)
- Knorr, Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets (1998)
- Breunig, Kriegel, Ng, Sander, LOF: Identifying Density-Based Local Outliers (2000)
- Aggarwal, Yu, Outlier Detection for High Dimensional Data (2001)
- Tang, Chen, Fu, Cheung, A Robust Outlier Detection Scheme for Large Data Sets (2001)
- Parallel Formulations of Clustering
- Clustering of Time Series
- Scalable clustering algorithms
- Clustering association rules and frequent item sets
The following are some of the research project topics investigated by Temple KDDM students in previous years:
- Early prediction of spatio-temporal events
- Applications of Graph Neural Networks for modeling partially observed data
- Classifying sports events based on spatio-temporal data of players and ball using unreliable labels
- Predictive modeling to forecast store sales
- Emotion analysis based on text mining
- Clustering NYC 311 requests
- MP3 to MIDI conversion via deep learning
- Co-Localization of multiple objects in images using activation map
- Generalized procedure for selecting methylation CpGs associated with cancer
- Classification of vaccine-related tweets using deep learning
- Automatically building book indices
- Time-series and clustering analysis for systemic lupus erythematosus patients’ study
- Identifying complexity of Wikipedia text
- Graphlet-assisted structured regression
- Wire bonding: Predicting failures for the Ultrasonic Transducer
- Exploring bias and variance in supervised learning algorithms
- Clustering of gene expression cancer RNA-Seq Data Set
- Analysis of online product review
- Class imbalance: Credit card fraud analysis
- Physician social network and patient outcomes: An empirical investigation
- Exploring underlying structures of tweets with URLs via clustering
- Dynamic changes of structure of large-scale evolving temporal graphs
- Temporal predictive modeling with sample/feature size constraints
- Opportunistic routing assisted by decision trees in CRNs
- Uncertainty estimation of structured models on evolving graphs
- Structured output prediction with spatio-temporal data
- Using nonlinear gated-experts for traffic speed forecasting
- New clustering schemes to improve the analysis of antibody CDR structure
- Missing data, latent variables and PCA
- Text mining of evaluations of commercial banks
- Health care data mining
- Short text data mining and analysis
- Label propagation for multi-label prediction
- Ready for human-machine cooperation in hierarchical clustering?
- Decentralized estimation using learning vector quantization
- Batch mode active learning for classification and regression
- Inverse active learning modeling in simulated AOD prediction
- Pollution prediction using pre-clustering on informative features
- Uncertainty estimation for predicted aerosol optical depth
- Feature selection for microarray classification
- Analysis of gene functional expression profiles using GO semantic similarity
- Disease data mining survival prediction based on gene expression data
- Using movement data to detect significant regions of infection
- Shape matching improvement
- Relationships between environmental aspects of police officer’s work, family life and stress
The following are some websites where you can find data mining projects with given data: