Knowledge Discovery and Data Mining: CIS 4523/5523, Spring 2022
Assignment 6
Out: February 24
Due: MONDAY, March 7 by 5:30 pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: Five problems for a total of 100 points
Problem 1: (15 points)
Problem 2: (15 points)
The leader algorithm represents each cluster using a point, known as a leader, and assigns each point to the cluster corresponding to the closest leader unless this distance is above a user-specified threshold. In that case, the point becomes the leader of a new cluster.
(a) What are the advantages and disadvantages of the leader algorithm as compared to K-means?
(b) Suggest ways in which the leader algorithm might be improved.
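For reference, the description of the leader algorithm above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: points are processed once in the order given, distance is Euclidean, and the function name `leader_clustering` and the sample data are made up for the example, not taken from any course material.

```python
import math


def leader_clustering(points, threshold):
    """Single-pass leader clustering (illustrative sketch).

    Assumes Euclidean distance and processes points in the given order.
    Returns (leaders, labels), where labels[i] indexes into leaders.
    """
    leaders = []  # one representative point per cluster
    labels = []   # cluster index assigned to each input point
    for p in points:
        # Find the closest existing leader, if there is one.
        best_idx, best_dist = None, math.inf
        for i, leader in enumerate(leaders):
            d = math.dist(p, leader)
            if d < best_dist:
                best_idx, best_dist = i, d
        if best_idx is None or best_dist > threshold:
            # Too far from every leader: p becomes the leader of a new cluster.
            leaders.append(p)
            labels.append(len(leaders) - 1)
        else:
            labels.append(best_idx)
    return leaders, labels


# Hypothetical 2-D sample data, chosen only for illustration.
points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 4.9), (0.1, 0.3)]
leaders, labels = leader_clustering(points, threshold=1.0)
print(leaders)
print(labels)
```

Trying different values of `threshold` and different orderings of `points` is a simple way to explore the algorithm's behavior before answering the questions above.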
Problem 3: (10 points)
Traditional agglomerative hierarchical clustering routines merge two clusters at each step.
Does it seem likely that such an approach accurately captures the (nested) cluster structure of a set of data points? If not, explain how you might post-process the data to obtain a more accurate view of the cluster structure.
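To make the "merge two clusters at each step" behavior concrete, here is a minimal sketch of a naive agglomerative routine. It assumes 1-D points, absolute difference as the distance, and single-link merging by default; the function name `agglomerative_merges` and the sample values are hypothetical and chosen only for illustration.

```python
def agglomerative_merges(points, linkage=min):
    """Naive agglomerative clustering on 1-D points (illustrative sketch).

    linkage=min gives single link; linkage=max gives complete link.
    Returns the list of merges as (cluster_a, cluster_b, merge_distance).
    """
    clusters = [[p] for p in points]  # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        # Replace the two merged clusters with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical sample data.
merges = agglomerative_merges([0.0, 1.0, 2.0, 10.0])
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

For n points the routine always records exactly n - 1 binary merges, together with the distance at which each merge happened.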
Problem 4: (40 points)
Download and install the CLUTO software for clustering high-dimensional data (http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview).
Apply this software to cluster the Enron Emails dataset available at
https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
(a) Report clustering results when using partitional clustering in the CLUTO package. You are allowed to apply CLUTO to a sample if the data is too large for your computer. In that case, report the sample size you used and how consistent the results are when you repeat the experiment 3 times on 3 samples of that size.
(b) Report results when using agglomerative clustering algorithms in the CLUTO package. For agglomerative clustering, compare the results when using complete-link vs. single-link merging schemes. Then, for single-link merging, compare the results when using cosine versus Euclidean distance as the distance function.
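As background for the cosine versus Euclidean comparison, the following sketch shows how the two measures can disagree on bag-of-words count vectors. It uses the standard textbook definitions, which may differ in detail from how CLUTO computes them internally; the three toy count vectors are invented for illustration and are not drawn from the Enron data.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.dist(u, v)


# Toy word-count vectors over a 3-word vocabulary (hypothetical data).
short_doc = [1, 2, 0]    # a short document
long_doc = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 1, 2]    # similar length to short_doc, different words

for name, dist in [("cosine", cosine_distance), ("euclidean", euclidean_distance)]:
    print(name,
          "short-long:", round(dist(short_doc, long_doc), 3),
          "short-other:", round(dist(short_doc, other_doc), 3))
```

Under cosine distance, `short_doc` is closest to `long_doc` because document length is normalized away; under Euclidean distance, it is closer to `other_doc` because the raw counts are similar in magnitude.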
Problem 5: (20 points)
Propose a ranked list of five Knowledge Discovery and Data Mining topics, one of which you would learn on your own and present as a mini-lecture in class. Mini-lecture topics will be assigned according to your preferences, but if multiple people rank the same topic as their preference, the topic will be assigned to the person who provides the most convincing references to be used in preparing the presentation.
Mini-lectures will be presented on March 24, March 31, April 7, and April 14, and the order will be determined based on topics and announced by March 14.
The proposed topics for mini-lectures should be different from topics already discussed in class. Each topic should be appropriate for a 15-minute presentation. You can prepare a presentation based on materials from two textbooks, but you are also allowed to use conference tutorial slides, articles, etc. The following are possible topics to consider (you can also propose different topics relevant to Knowledge Discovery and Data Mining):