Knowledge Discovery and Data Mining: CIS 4523/5523, Spring 2022

Assignment 6

Out: February 24

Due: MONDAY, March 7 by 5:30 pm on Canvas

*Please write your name and TUID at the top of your CANVAS submission.

Number of problems/points: Five problems for a total of 100 points

Problem 1: (15 points)

Illustrate the vanishing gradient problem with an example of a deep neural network (one with many hidden layers) that uses the sigmoid activation function.
What is one way to overcome this problem? Explain how it works.

Problem 2: (15 points)

The leader algorithm represents each cluster using a point, known as a leader, and assigns each point to the cluster corresponding to the closest leader unless this distance is above a user-specified threshold. In that case, the point becomes the leader of a new cluster.

(a) What are the advantages and disadvantages of the leader algorithm as compared to K-means?

(b) Suggest ways in which the leader algorithm might be improved.
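
For reference, the procedure described in this problem can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the function name leader_clustering, the use of Euclidean distance, and the toy points are all hypothetical choices made for the example.

```python
import math


def leader_clustering(points, threshold):
    """Single pass over the points, following the description above.

    Assumptions (not part of the assignment): Euclidean distance and
    points given as equal-length tuples of numbers.
    Returns (leaders, labels), where labels[i] is the index of the
    cluster that points[i] was assigned to.
    """
    leaders = []
    labels = []
    for p in points:
        # Find the closest existing leader, if any.
        best_idx, best_dist = None, math.inf
        for i, leader in enumerate(leaders):
            d = math.dist(p, leader)
            if d < best_dist:
                best_idx, best_dist = i, d
        if best_idx is None or best_dist > threshold:
            # Too far from every leader: p starts a new cluster.
            leaders.append(p)
            labels.append(len(leaders) - 1)
        else:
            labels.append(best_idx)
    return leaders, labels


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 4.9), (0.1, 0.4)]
leaders, labels = leader_clustering(points, threshold=1.0)
print(leaders)
print(labels)
```

Rerunning the sketch with the points in a different order or with a different threshold is a quick way to explore its behavior before answering (a) and (b).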

Problem 3: (10 points)

Traditional agglomerative hierarchical clustering routines merge two clusters at each step.

Does it seem likely that such an approach accurately captures the (nested) cluster structure of a set of data points? If not, explain how you might post-process the data to obtain a more accurate view of the cluster structure.

Problem 4: (40 points)

Download and install the CLUTO software for clustering high-dimensional data (http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview).

Apply this software to cluster the Enron Emails dataset, available at

https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

(a) Report clustering results when using partitional clustering in the CLUTO package. You are allowed to apply CLUTO to a sample if the data is too large for your computer. In that case, report the sample size you used and how consistent the results are when you repeat the experiment 3 times on 3 samples of that size.
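
If you need to sample, one possible way to draw a document sample is sketched below. It rests on two assumptions that you should verify against the dataset page and the CLUTO manual: that the docword file starts with three header lines (number of documents, vocabulary size, number of nonzero counts) followed by "docID wordID count" triples, and that CLUTO reads a sparse matrix whose first line is "rows columns nonzeros" followed by one line of "column value" pairs per row. The function name sample_docword and the toy input are hypothetical.

```python
import random
from collections import defaultdict


def sample_docword(lines, sample_size, seed):
    """Sample documents from docword-style lines (format assumed above).

    Returns the lines of a sparse matrix in the assumed CLUTO layout.
    A different seed gives a different sample of the same size.
    """
    it = iter(lines)
    n_docs = int(next(it))
    n_words = int(next(it))
    next(it)  # nonzero count of the full file; recomputed for the sample

    rng = random.Random(seed)
    chosen = set(rng.sample(range(1, n_docs + 1), sample_size))

    rows = defaultdict(list)
    for line in it:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        doc_id, word_id, count = map(int, parts)
        if doc_id in chosen:
            rows[doc_id].append((word_id, count))

    nnz = sum(len(r) for r in rows.values())
    out = [f"{sample_size} {n_words} {nnz}"]
    for doc_id in sorted(chosen):
        out.append(" ".join(f"{w} {c}" for w, c in rows[doc_id]))
    return out


# Hypothetical toy input: 3 documents, 4 vocabulary words, 5 nonzero counts.
toy = ["3", "4", "5", "1 1 2", "1 3 1", "2 2 4", "3 1 1", "3 4 2"]
matrix_lines = sample_docword(toy, sample_size=3, seed=0)
print("\n".join(matrix_lines))
```

On the real data you would pass an open file handle instead of the toy list and call the function with three different seeds, writing each result to its own matrix file.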

(b) Report results when using agglomerative clustering algorithms in the CLUTO package. For agglomerative clustering, compare the results obtained with the complete-link and single-link merging schemes. Then, for single-link merging, compare the results obtained with the cosine and Euclidean distance functions.
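
As a reminder of how the two measures in (b) differ on term-count vectors, here is a small self-contained sketch. The three vectors are made up for illustration and are not taken from the Enron data; the function names are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Made-up term-count vectors over a three-word vocabulary.
short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same word proportions as short_doc, ten times longer
other_doc = [0, 1, 2]    # similar length to short_doc, different words

print(cosine_similarity(short_doc, long_doc), cosine_similarity(short_doc, other_doc))
print(euclidean_distance(short_doc, long_doc), euclidean_distance(short_doc, other_doc))
```

Cosine similarity depends only on the direction of the vectors, while Euclidean distance also depends on their lengths, so the two measures can rank the same pair of neighbors differently.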

Problem 5: (20 points)

Propose a ranked list of five Knowledge Discovery and Data Mining topics, one of which you would learn on your own and present as a mini-lecture in class. Mini-lecture topics will be assigned according to your preferences, but if several people prefer the same topic, it will be assigned to the person who provides the most convincing references for preparing the presentation.

Mini-lectures will be presented on March 24, March 31, April 7, and April 14, and the order will be determined based on topics and announced by March 14.

The proposed topics for mini-lectures should be different from topics already discussed in class. Each topic should be appropriate for a 15-minute presentation. You can prepare a presentation based on materials from two textbooks, but you are also allowed to use conference tutorial slides, articles, etc. The following are possible topics to consider (you can also propose different topics relevant to Knowledge Discovery and Data Mining):

Large scale hierarchical classification
Advanced concepts in cluster analysis
Association rule mining
Advanced concepts in association analysis
Anomaly detection
Data stream mining
Text and web mining
Time-series mining
Mining big time series
Sequence pattern mining
Survival analysis
Mining spatial data
Mining graphs
Graph sketching, sampling, and streaming
Mining web data
Mining social networks
Privacy-preserving data mining
Mining spatio-temporal data
Mining semistructured data
Mining with constraints
False discoveries
Lifelong machine learning
Deep Bayesian mining
Data mining for drug discovery
Mining electronic health records
Data mining in transportation
Data mining in power systems
Sports analytics
Explainable data modeling
Active learning
Human-in-the-loop learning
Visual analytics
Fairness-aware machine learning
Transfer learning
Fake news detection
Zero-shot learning
Mining temporal networks
Reinforcement learning
Graph neural networks
Deep reinforcement learning
Deep learning for personalized search and recommender systems
A/B testing at scale