Knowledge Discovery and Data Mining: CIS 4523/5523, Spring 2022

Assignment 6

Out: March 7

Due: MONDAY March 14 by 5:30pm on Canvas

*Please write your name and TUID at the top of your CANVAS submission.

Number of problems/points: One problem for a total of 50 points

Problem 1: (50 points)

Write a research proposal for the class project that you plan to carry out. You will present the project on April 21 and submit the project report by April 28.

Teams of two undergraduate students, or of one undergraduate and one graduate student, are allowed. Teams of two graduate students are not allowed.

Write the proposal using the following format:

Your name(s) and e-mail address(es) (so that the instructor can quickly approve your topic or ask for a revision/clarification);
Title;
Objective and Significance;
Background;
Proposed Approach (make sure to explain where you will get data and how much preprocessing is needed);
References.
The proposal description may not exceed 2 pages in 12 pt font.

The following are some research project topic ideas suggested by the authors of the KDDM textbook:

Evaluating Performance of Classifiers
- Compare the bias and variance of models generated using different evaluation methods (leave one out, cross validation, bootstrap, stratification, etc.)
- References:
  1. Kohavi, R., A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (1995)
  2. Efron, B. and Tibshirani, R., Cross-Validation and the Bootstrap: Estimating the Error Rate of a Prediction Rule (1995)
  3. Martin, J.K., and Hirschberg, D.S., Small Sample Statistics for Classification Error Rates I: Error Rate Measurements (1996)
  4. Dietterich, T.G., Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms (1998)
Support Vector Machine (SVM)
- Present an overview of SVMs or apply Support Vector Machines to various application domains.
- References:
  1. Mangasarian, O.L., Data Mining via Support Vector Machines (2001)
  2. Burges, C.J.C., A Tutorial on Support Vector Machines for Pattern Recognition (1998)
  3. Joachims, T., Text Categorization with Support Vector Machines: Learning with Many Relevant Features (1998)
  4. Salomon, J., Support Vector Machines for Phoneme Classification (2001)
Ensemble learning
- A comparative study and implementation of different techniques for ensemble learning such as bagging, boosting, etc.
- References:
  1. Freund Y. and Schapire, R.E., A short introduction to boosting (1999)
  2. Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? (2002)
  3. Quinlan, J.R., Boosting, Bagging and C4.5 (1996)
  4. Bauer, E., Kohavi, R., An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants (1999)
Semi-supervised learning (classification with labeled and unlabeled data)
- Applying different semi-supervised learning techniques to UCI data sets.
- References:
  1. Nigam, K., Using Unlabeled Data to Improve Text Classification (2001)
  2. Seeger, M., Learning with labeled and unlabeled data (2001)
  3. Nigam, K. and Ghani, R., Analyzing the Effectiveness and Applicability of Co-training (2000)
  4. Vittaut, J.N., Amini, M-R., Gallinari, P., Learning Classification with Both Labeled and Unlabeled Data (2002)
Classification for rare-class problems
- A comparative study and/or implementation of different classification techniques to analyze rare class problems
- References:
  1. Joshi, M.V., and Agrawal, R., PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-study in Network Intrusion Detection) (2001)
  2. Joshi, M.V., Agrawal, R., and Kumar, V., Mining Needles in a Haystack: Classifying Rare Classes via Two-Phase Rule Induction (2001)
  3. Joshi, M.V., Kumar, V., Agrawal, R., Predicting Rare Classes: Can Boosting Make Any Weak Learner Strong? (2002)
  4. Joshi, M.V., Kumar, V., Agrawal, R., On Evaluating Performance of Classifiers for Rare Classes (2002)
Time Series Prediction/Classification
- A comparative study and/or implementation of time series prediction/classification techniques
- References:
  1. Geurts, P., Pattern Extraction for Time Series Classification (2001)
  2. Kadous, M.W., A General Architecture for Supervised Classification of Multivariate Time Series (1998)
  3. Giles, C.L., Lawrence, S. and Tsoi, A.C., Noisy Time Series Prediction using a Recurrent Neural Network and Grammatical Inference (2001)
  4. Keogh, E.J. and Pazzani, M.J., An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback (1998)
  5. Chatfield, C., The Analysis of Time Series, Chapman & Hall (1989)
Sequence Prediction
- A comparative study and implementation of sequence prediction techniques
- References:
  1. Laird, P.D. and Saul, R., Discrete Sequence Prediction and Its Applications, Machine Learning, 15(1): 43-68 (1994)
  2. Sun, R. and Giles, C.L., Sequence Learning: From Recognition and Prediction to Sequential Decision Making (2001)
  3. Lesh, N., Zaki, M.J., and Ogihara, M., Mining features for Sequence Classification (1999)
Association Rules for Classification
- A comparative study and implementation of classification using association patterns (rules and itemsets)
- References:
  1. Liu, B., Hsu, W., and Ma, Y., Integrating Classification and Association Rule Mining (1998)
  2. Liu, B., Ma, Y. and Wong, C-K, Classification Using Association Rules: Weaknesses and Enhancements (2001)
  3. Li, W., Han, J. and Pei, J., CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules (2001)
  4. Deshpande, M. and Karypis, G., Using Conjunction of Attribute Values for Classification (2002)
Spatial Association Rule Mining
- A comparative study on spatial association rule mining.
- References:
  1. Koperski, K., and Han, J., Discovery of Spatial Association Rules in Geographic Information Databases (1995)
  2. Shekhar, S. and Huang, Y., Discovering Spatial Co-location Patterns: A Summary of Results (2001)
  3. Malerba, D., Esposito, F. and Lisi, F., Mining Spatial Association Rules in Census Data (2001)
Temporal Association Rule Mining
- A comparative study and/or implementation of temporal association rule mining techniques
- References:
  1. Li, Y., Ning, P., Wang, X.S., and Jajodia, S., Discovering Calendar-based Temporal Association Rules (2001)
  2. Chen, X. and Petrounias, Mining temporal features in association rules
  3. Lee, C.H., Lin, C.R. and Chen, M.S., On Mining General Temporal Association Rules in a Publication Database (2001)
  4. Ozden, B., Ramaswamy, S., and Silberschatz, A., Cyclic Association Rules (1998)
  5. Literature on Sequential Association Rule Mining below
Sequential Association Rule Mining
- A comparative study and/or implementation of sequential association rule mining techniques
- References:
  1. Srikant, R. and Agrawal, R., Mining Sequential Patterns: Generalizations and Performance Improvements (1996)
  2. Mannila, H., Toivonen, H., and Verkamo, A.I., Discovery of Frequent Episodes in Event Sequences (1997)
  3. Joshi, M., Karypis, G., and Kumar, V., A Universal Formulation of Sequential Patterns (1999)
  4. Borges J., and Levene, M., Mining Association Rules in Hypertext Databases (1998)
Outlier Detection
- A comparative study and/or implementation of outlier detection techniques.
- References:
  1. Knorr, Ng, A Unified Notion of Outliers: Properties and Computation (1997)
  2. Knorr, Ng, Algorithms for Mining Distance-Based Outliers in Large Datasets (1998)
  3. Breunig, Kriegel, Ng, Sander, LOF: Identifying Density-Based Local Outliers (2000)
  4. Aggarwal, Yu, Outlier Detection for High Dimensional Data (2001)
  5. Tang, Chen, Fu, Cheung, A Robust Outlier Detection Scheme for Large Data Sets (2001)
Parallel Formulations of Clustering
- Study and possible implementation of parallel formulations of clustering techniques.
- References:
  1. Olson, Parallel Algorithms for Hierarchical Clustering (1993)
  2. Nagesh, High Performance Subspace Clustering for Massive Data Sets (1999)
  3. Skillicorn, Strategies for Parallel Data Mining (1999)
  4. Dhillon, Modha, A Data-Clustering Algorithm On Distributed Memory Multiprocessors (2000)
Clustering of Time Series
- Study and possible implementation of time series clustering techniques on actual NASA time series data.
- References:
  1. Oates, Clustering Time Series with Hidden Markov Models and Dynamic Time Warping (1999)
  2. Kalpakis, Gada, Puttagunta, Distance Measures for Effective Clustering of ARIMA Time Series
  3. Oates, Identifying Distinctive Subsequences in Multivariate Time Series by Clustering (1999)
Scalable clustering algorithms
- A comparative study of scalable data mining techniques.
- References:
  1. Zhang, BIRCH: An Efficient Data Clustering Method for Very Large Databases (1999)
  2. Ganti, Ramakrishnan, Clustering Large Datasets in Arbitrary Metric Spaces (1998)
  3. Bradley, Fayyad, Reina, Scaling Clustering Algorithms to Large Databases (1998)
  4. Farnstrom, Lewis, Elkan, Scalability for Clustering Algorithms Revisited (2000)
Clustering association rules and frequent item sets
- A comparative study of techniques for clustering association rules.
- References:
  1. Toivonen, Klemettinen, Pruning and Grouping Discovered Association Rules (1995)
  2. Lent, Swami, Widom, Clustering Association Rules (1997)
  3. Gupta, Strehl, Ghosh, Distance Based Clustering of Association Rules

The following are some of the research project topics investigated by Temple KDDM students in previous years:

Early prediction of spatio-temporal events
Applications of Graph Neural Networks for modeling partially observed data
Classifying sports events based on spatio-temporal data of players and ball using unreliable labels
Predictive modeling to forecast store sales
Emotion analysis based on text mining
Clustering NYC 311 requests
MP3 to MIDI conversion via deep learning
Co-Localization of multiple objects in images using activation map
Generalized procedure for selecting methylation CpGs associated with cancer
Classification of vaccine-related tweets using deep learning
Automatically building book indices
Time-series and clustering analysis for systemic lupus erythematosus patients’ study
Identifying complexity of Wikipedia text
Graphlet-assisted structured regression
Wire bonding: Predicting failures for the Ultrasonic Transducer
Exploring bias and variance in supervised learning algorithms
Clustering of gene expression cancer RNA-Seq Data Set
Analysis of online product review
Class imbalance: Credit card fraud analysis
Physician social network and patient outcomes: An empirical investigation
Exploring underlying structures of tweets with URLs via clustering
Dynamic changes of structure of large-scale evolving temporal graphs
Temporal predictive modeling with sample/feature size constraints
Opportunistic routing assisted by decision trees in CRNs
Uncertainty estimation of structured models on evolving graphs
Structured output prediction with spatio-temporal data
Using nonlinear gated-experts for traffic speed forecasting
New clustering schemes to improve the analysis of antibody CDR structure
Missing data, latent variables and PCA
Text mining of evaluations of commercial banks
Health care data mining
Short text data mining and analysis
Label propagation for multi-label prediction
Ready for human-machine cooperation in hierarchical clustering?
Decentralized estimation using learning vector quantization
Batch mode active learning for classification and regression
Inverse active learning modeling in simulated AOD prediction
Pollution prediction using pre-clustering on informative features
Uncertainty estimation for predicted aerosol optical depth
Feature selection for microarray classification
Analysis of gene functional expression profiles using GO semantic similarity
Disease data mining survival prediction based on gene expression data
Using movement data to detect significant regions of infection
Shape matching improvement
Relationships between environmental aspects of police officer’s work, family life and stress

The following are some websites where you can find data mining projects with provided data:

Reproducibility Challenge: https://paperswithcode.com/rc2021 : The goal is to take a published paper and reproduce its results. Reports will be submitted to the next round of the Reproducibility Challenge competition.
SpaceML: https://spaceml.org/repo
Kaggle competitions: https://www.kaggle.com/competitions
DrivenData: https://www.drivendata.org/competitions/
CrowdAnalytix: https://www.crowdanalytix.com/community
InnoCentive Challenge: https://www.innocentive.com/challenge/
CodaLab Competitions: https://codalab.lisn.upsaclay.fr/competitions/
Zindi Competitions: https://zindi.africa/competitions
Datasource Competitions: https://www.datasource.ai/en/home/data-science-competitions-for-startups
Bitgrit Competitions: https://bitgrit.net/competition/
AIcrowd Challenges: https://www.aicrowd.com/challenges
ML Contests: https://mlcontests.com/