\documentstyle[12pt,fullpage]{article}

Knowledge Discovery and Data Mining

CIS 4523/5523, Spring 2024

Assignment 3

Out: February 8

Due: February 15 by 5:30pm on Canvas

*Please write your name and TUID at the top of your CANVAS submission.

Number of problems/points: Eight problems for total of 100 points

Problem 1: (10 points)

The following algorithm aims to find the K nearest neighbors of a data object:

1: for i = 1 to number of data objects do

2: Find the distances of the i-th object to all other objects.

3: Sort these distances in decreasing order.

(Keep track of which object is associated with each distance.)

4: return the objects associated with the first K distances of the sorted list

5: end for

(a) (5 points) Describe the potential problems with this algorithm if there are duplicate objects in the data set.

(b) (5 points) How would you fix this problem?

Problem 2: (20 points)

Compute the cosine measure using the frequencies between the following two sentences:

(a) “The sly fox jumped over the lazy dog.”

(b) “The dog jumped at the intruder.”

Problem 3: (10 points)

Transform correlation to a similarity measure with [0,1] range that could be used for clustering time series.

Problem 4: (10 points)

Transform correlation to a similarity measure with [0,1] range that could be used for predicting the behavior of one time series given another.

Problem 5: (20 points)

This exercise compares and contrasts some similarity and distance measures.

(a) (10 points) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

(b) (10 points) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Problem 6: (10 points)

Donor data consists of 11 records in the following format: Name Age Salary Donor(Y/N). Donor training dataset:

Nancy 21 37,000 N

Jim 27 41,000 N

Allen 43 61,000 Y

Jane 38 55,000 N

Steve 44 30,000 N

Peter 51 56,000 Y

Sayani 53 70,000 Y

Lata 56 74,000 Y

Mary 59 25,000 N

Victor 61 68,000 Y

Dale 63 51,000 Y

Compute the Gini index for the entire Donor data set, with respect to the two classes. Compute the Gini index for the portion of the data set with age at least 50.

Problem 7: (10 points)

Repeat the computation of the previous exercise with the use of the entropy criterion. Compute the entropy for the portion of the data set with age greater than 50.

Problem 8: (10 points)

What is the best classification accuracy that can be obtained on Donor dataset with a decision tree of depth 2, where each test results in a binary split?