CIS 4523/5523: Knowledge Discovery and Data Mining
Spring 2024

Homework Policies (applicable for all assignments):


  1. You are required to do the homework problems in order to pass.
  2. Clarity of the solution is as important as correctness.
  3. The penalty for late homework submissions is 20% per day, so submit on time.
  4. Solutions are expected to be your own work. Group work is not allowed unless explicitly approved for a particular problem. If you obtained a hint from an outside source (e.g., library research or a discussion with another person), acknowledge the source and write up the solution on your own. Plagiarism and other anti-intellectual behavior will be dealt with severely.

Assignment 1
Out: January 18
Due: February 01 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.

PROBLEM 1:
Solve the nine tasks described below and submit a report to Canvas as a .pdf file. For each task, include code, its output, and comments/description.

The goal is to provide better data about the top 50 solar flares recorded so far than the data shown by SpaceWeatherLive.com. Use the messy NASA data to add more features for the top 50 solar flares; you need to scrape this information directly from each HTML page. You can read more here about solar flares, coronal mass ejections, and the solar flare alphabet soup.

Use any programming language of your choice. Python is recommended (and used in the explanations below), but this can also be done in R, Java, and other languages. A tutorial on Python is available at www.learnpython.org.

PART 1: Data scraping and preparation

Task 1: Scrape your competitor's data (10 pts)
Scrape data for the top 50 solar flares shown in SpaceWeatherLive.com.
The steps (if you are using Python) are as follows; a minimal sketch follows the list:

  1. pip install or conda install the following Python packages: beautifulsoup4, requests, pandas, numpy, and matplotlib (for visualization)
  2. Use requests to get page content (as in, HTTP GET)
  3. Extract the text from the page
  4. Use BeautifulSoup to read and parse the data, using either the html.parser or lxml parser
  5. Use prettify() to view the content and find the appropriate table
  6. Use find() to save the aforementioned table as a variable
  7. Use pandas to read in the HTML table. HINT: make sure the data is properly typecast.
  8. Set reasonable names for the table columns, e.g., rank, x_classification, date, region, start_time, maximum_time, end_time, movie. DataFrame.columns makes this very simple.
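
For example, a minimal sketch of steps 2-8 (the URL is illustrative, and the assumption that the top-50 list is the first table on the page should be verified against the live site):

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Steps 2-3: HTTP GET the page and extract its text
    url = "https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares"  # illustrative
    html = requests.get(url).text

    # Step 4: parse with BeautifulSoup (html.parser or lxml both work)
    soup = BeautifulSoup(html, "lxml")

    # Steps 5-6: inspect soup.prettify() by eye, then grab the table
    table = soup.find("table")  # assumes the top-50 list is the first table

    # Step 7: let pandas parse the HTML fragment; check dtypes afterwards
    df = pd.read_html(str(table))[0]

    # Step 8: set reasonable column names
    df.columns = ["rank", "x_classification", "date", "region",
                  "start_time", "maximum_time", "end_time", "movie"]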

The result should be a data frame, with the first few rows as:

Dimension: 50 × 8

rank x_class date region start_time max_time end_time movie
1 1 X28.0 2003/11/04 0486 19:29 19:53 20:06 MovieView archive
2 2 X20 2001/04/02 9393 21:32 21:51 22:03 MovieView archive
3 3 X17.2 2003/10/28 0486 09:51 11:10 11:24 MovieView archive
4 4 X17.0 2005/09/07 0808 17:17 17:40 18:03 MovieView archive
5 5 X14.4 2001/04/15 9415 13:19 13:50 13:55 MovieView archive
6 6 X10.0 2003/10/29 0486 20:37 20:49 21:01 MovieView archive
7 7 X9.4 1997/11/06 - 11:49 11:55 12:01 MovieView archive
8 8 X9.0 2006/12/05 0930 10:18 10:35 10:45 MovieView archive
9 9 X8.3 2003/11/02 0486 17:03 17:25 17:39 MovieView archive
10 10 X7.1 2005/01/20 0720 06:36 07:01 07:26 MovieView archive
... with 40 more rows

Task 2: Tidy the top 50 solar flare data (10 pts)

Make this table usable with pandas (a sketch follows the list):

  1. Drop the last column of the table, since we are not going to use it moving forward.
  2. Use the datetime module to combine the date column and each of the three time columns into three datetime columns. You will see why this is useful later on. iterrows() should prove useful here.
  3. Update the values in the dataframe as you do this. DataFrame.at should prove useful (set_value has been removed from recent versions of pandas).
  4. Set regions coded as - to missing (NaN). You can use DataFrame.replace() here.
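
A sketch of steps 1-4, assuming the column names from Task 1:

    from datetime import datetime
    import numpy as np

    df = df.drop(columns=["movie"])  # step 1

    # Steps 2-3: combine the date with each time column into a datetime
    for i, row in df.iterrows():
        for col in ["start_time", "maximum_time", "end_time"]:
            df.at[i, col] = datetime.strptime(f'{row["date"]} {row[col]}',
                                              "%Y/%m/%d %H:%M")
    df = df.drop(columns=["date"]).rename(columns={"start_time": "start_datetime",
                                                   "maximum_time": "max_datetime",
                                                   "end_time": "end_datetime"})

    # Step 4: '-' regions become missing
    df["region"] = df["region"].replace("-", np.nan)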

The result of this step should be a data frame with the first few rows as:

A dataframe: 50 × 6

rank x_class start_datetime max_datetime end_datetime region
1 1 X28.0 2003-11-04 19:29:00 2003-11-04 19:53:00 2003-11-04 20:06:00 0486
2 2 X20 2001-04-02 21:32:00 2001-04-02 21:51:00 2001-04-02 22:03:00 9393
3 3 X17.2 2003-10-28 09:51:00 2003-10-28 11:10:00 2003-10-28 11:24:00 0486
4 4 X17.0 2005-09-07 17:17:00 2005-09-07 17:40:00 2005-09-07 18:03:00 0808
5 5 X14.4 2001-04-15 13:19:00 2001-04-15 13:50:00 2001-04-15 13:55:00 9415
6 6 X10.0 2003-10-29 20:37:00 2003-10-29 20:49:00 2003-10-29 21:01:00 0486
7 7 X9.4 1997-11-06 11:49:00 1997-11-06 11:55:00 1997-11-06 12:01:00 <NA>
8 8 X9.0 2006-12-05 10:18:00 2006-12-05 10:35:00 2006-12-05 10:45:00 0930
9 9 X8.3 2003-11-02 17:03:00 2003-11-02 17:25:00 2003-11-02 17:39:00 0486
10 10 X7.1 2005-01-20 06:36:00 2005-01-20 07:01:00 2005-01-20 07:26:00 0720
... with 40 more rows

Task 3: Scrape the NASA data (15 pts)

Next, you need to scrape NASA data to get additional features about these solar flares. This table format is described here.

Once scraped, perform the following steps (a sketch follows the list):

  1. Use BeautifulSoup functions (e.g., find, findAll) and string functions (e.g., split and built-in slicing capabilities) to obtain each row of data as a long string.
  2. Use the split function to separate each line of text into a data row.
  3. Create a DataFrame with the data from the table.
  4. Choose appropriate names for columns.
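
A sketch of these steps, assuming (as on NASA's Wind/WAVES type II burst list) that the rows live in a single pre-formatted text block; adjust the row filter and field count to the actual page:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    nasa_url = "..."  # fill in: the event-list page linked in the assignment
    soup = BeautifulSoup(requests.get(nasa_url).text, "lxml")

    # Step 1: pull each data row out of the <pre> block as one long string
    lines = soup.find("pre").text.splitlines()

    # Step 2: keep only data rows (they start with a year) and split into fields
    cols = ["start_date", "start_time", "end_date", "end_time",
            "start_frequency", "end_frequency", "flare_location", "flare_region",
            "flare_classification", "cme_date", "cme_time", "cme_angle",
            "cme_width", "cme_speed"]
    # rows with missing trailing fields may need padding to len(cols)
    rows = [ln.split()[:len(cols)] for ln in lines if ln.startswith(("19", "20"))]

    # Steps 3-4: build the DataFrame with appropriate column names
    nasa = pd.DataFrame(rows, columns=cols)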

The result of this step should be similar to:

Dimension: 482 × 14

start_date start_time end_date end_time start_frequency end_frequency flare_location flare_region
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1997/04/01 14:00 04/01 14:15 8000 4000 S25E16 8026
2 1997/04/07 14:30 04/07 17:30 11000 1000 S28E19 8027
3 1997/05/12 05:15 05/14 16:00 12000 80 N21W08 8038
4 1997/05/21 20:20 05/21 22:00 5000 500 N05W12 8040
5 1997/09/23 21:53 09/23 22:16 6000 2000 S29E25 8088
6 1997/11/03 05:15 11/03 12:00 14000 250 S20W13 8100
7 1997/11/03 10:30 11/03 11:30 14000 5000 S16W21 8100
8 1997/11/04 06:00 11/05 04:30 14000 100 S14W33 8100
9 1997/11/06 12:20 11/07 08:30 14000 100 S18W63 8100
10 1997/11/27 13:30 11/27 14:00 14000 7000 N17E63 8113
... with 472 more rows, and 6 more variables: flare_classification <chr>, cme_date <chr>, cme_time <chr>, cme_angle <chr>, cme_width <chr>, cme_speed <chr>

Task 4: Tidy the NASA table (15 pts)

Here we will code missing observations properly, recode columns that correspond to more than one piece of information, and treat dates and times appropriately. A sketch follows the list.

  1. Recode any missing entries as NaN. Refer to the data description to see how missing entries are encoded in each column. Be sure to look carefully at the actual data, as the NASA descriptions might not be completely accurate.
  2. The CPA column (cme_angle) contains angles in degrees for most rows, except for halo flares, which are coded as Halo. Create a new column that indicates if a row corresponds to a halo flare or not, and then replace Halo entries in the cme_angle column with NaN.
  3. The width column indicates if the given value is a lower bound. Create a new column that indicates if width is given as a lower bound, and remove any non-numeric part of the width column.
  4. Combine date and time columns for start, end and cme so they can be encoded as datetime objects.
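
A sketch of steps 1-4; the missing-value codes shown are illustrative, so check them against the format description and the raw rows:

    import numpy as np
    import pandas as pd

    # Step 1: recode missing entries (verify the actual codes on the page)
    nasa = nasa.replace(["----", "-----", "--/--", "--:--"], np.nan)

    # Step 2: flag halo CMEs, then blank the angle
    nasa["is_halo"] = nasa["cme_angle"] == "Halo"
    nasa["cme_angle"] = nasa["cme_angle"].replace("Halo", np.nan)

    # Step 3: flag lower-bound widths and keep only the numeric part
    nasa["width_lower_bound"] = nasa["cme_width"].str.contains(">", na=False)
    nasa["cme_width"] = nasa["cme_width"].str.lstrip(">").astype(float)

    # Step 4: combine dates and times; note that end_date omits the year,
    # so borrow it from start_date before parsing the end/cme datetimes
    nasa["start_datetime"] = pd.to_datetime(
        nasa["start_date"] + " " + nasa["start_time"],
        format="%Y/%m/%d %H:%M", errors="coerce")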

The output of this step should be similar to this:

start_datetime end_datetime start_frequency end_frequency flare_location flare_region importance cme_datetime cpa width speed plot is_halo width_lower_bound
0 1997-04-01 14:00:00 1997-04-01 14:15:00 8000 4000 S25E16 8026 M1.3 1997-04-01 15:18:00 74 79 312 PHTX False False
1 1997-04-07 14:30:00 1997-04-07 17:30:00 11000 1000 S28E19 8027 C6.8 1997-04-07 14:27:00 NaN 360 878 PHTX True False
2 1997-05-12 05:15:00 1997-05-14 16:00:00 12000 80 N21W08 8038 C1.3 1997-05-12 05:30:00 NaN 360 464 PHTX True False
3 1997-05-21 20:20:00 1997-05-21 22:00:00 5000 500 N05W12 8040 M1.3 1997-05-21 21:00:00 263 165 296 PHTX False False
4 1997-09-23 21:53:00 1997-09-23 22:16:00 6000 2000 S29E25 8088 C1.4 1997-09-23 22:02:00 133 155 712 PHTX False False
5 1997-11-03 05:15:00 1997-11-03 12:00:00 14000 250 S20W13 8100 C8.6 1997-11-03 05:28:00 240 109 227 PHTX False False

PART 2: Analysis

Now that you have data from both sites, let’s start some analysis.

Task 5: Replication (10 pts)

Replicate as much as possible of the top 50 solar flare table from SpaceWeatherLive.com using the data obtained from NASA. If you get the top 50 solar flares from the NASA table based on their classification (e.g., X28 is the highest), do you get data for the same solar flare events? Include the code used to get the top 50 solar flares from the NASA table (be careful when ordering by classification; see the sketch below). Write a sentence or two discussing how well you can replicate the SpaceWeatherLive data from the NASA data.
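
Because the classification string mixes a letter and a number (X28.0, M1.3, ...), sorting it lexicographically is wrong; one way to order flares is to map each class to its GOES peak flux:

    # Classification letters are bands of GOES peak X-ray flux (W/m^2);
    # within a band the number is a multiplier, so X28 > X9.4 > M1.3
    SCALE = {"A": 1e-8, "B": 1e-7, "C": 1e-6, "M": 1e-5, "X": 1e-4}

    def class_to_flux(c):
        try:
            return SCALE[c[0]] * float(c[1:])
        except (TypeError, KeyError, ValueError, IndexError):
            return float("nan")

    nasa["flux"] = nasa["flare_classification"].map(class_to_flux)
    top50_nasa = nasa.nlargest(50, "flux")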

Task 6: Integration (15 pts)

Write a function that finds the best matching row in the NASA data for each of the top 50 solar flares in the SpaceWeatherLive data. Here, you have to decide for yourself how you determine what is the best matching entry in the NASA data for each of the top 50 solar flares. In your submission, include an explanation of how you are defining best matching rows across the two datasets in addition to the code used to find the best matches. Finally, use your function to add a new column to the NASA dataset indicating its rank according to SpaceWeatherLive, if it appears in that dataset.
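
One possible (not the required) notion of "best match" is the NASA row whose start time is closest to the SpaceWeatherLive start time, within some tolerance; continuing the sketches above:

    import numpy as np
    import pandas as pd

    def best_match(flare, nasa, tolerance=pd.Timedelta("12h")):
        gap = (nasa["start_datetime"] - flare["start_datetime"]).abs()
        gap = gap[gap <= tolerance]
        return gap.idxmin() if not gap.empty else None  # index of closest NASA row

    nasa["top50_rank"] = np.nan
    for _, flare in top50.iterrows():       # top50: the tidied Task 2 dataframe
        idx = best_match(flare, nasa)
        if idx is not None:
            nasa.at[idx, "top50_rank"] = flare["rank"]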

Task 7: Attributes visualization (7 pts)

Plot attributes in the NASA dataset (e.g., starting or ending frequencies, flare height or width) over time. Use graphical elements (e.g., text or points) to indicate flares that are in the top 50.
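
For example, a matplotlib sketch for one attribute, reusing the top50_rank column built above (swap in any other attribute or marker you prefer):

    import pandas as pd
    import matplotlib.pyplot as plt

    top = nasa[nasa["top50_rank"].notna()]
    fig, ax = plt.subplots()
    ax.scatter(nasa["start_datetime"],
               pd.to_numeric(nasa["start_frequency"], errors="coerce"),
               s=8, label="all flares")
    ax.scatter(top["start_datetime"],
               pd.to_numeric(top["start_frequency"], errors="coerce"),
               color="red", marker="x", label="top 50")
    ax.set_xlabel("start time")
    ax.set_ylabel("start frequency")
    ax.legend()
    plt.show()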

Task 8: Attributes comparison (8 pts)

Do flares in the top 50 tend to have Halo CMEs? You can make a barplot that compares the number (or proportion) of Halo CMEs in the top 50 flares vs. the dataset as a whole.
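
A sketch of the comparison, reusing the is_halo and top50_rank columns built above:

    import pandas as pd
    import matplotlib.pyplot as plt

    props = pd.Series({
        "top 50": nasa.loc[nasa["top50_rank"].notna(), "is_halo"].mean(),
        "all flares": nasa["is_halo"].mean(),
    })
    props.plot.bar()
    plt.ylabel("proportion of Halo CMEs")
    plt.show()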

Task 9: Events distribution (10 pts)

Do strong flares cluster in time? Plot the number of flares per month over time, and add a graphical element (e.g., text or points) to indicate the number of strong flares (those in the top 50) and see whether they cluster.
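
A sketch of the monthly counts, again reusing the columns built above:

    import matplotlib.pyplot as plt

    monthly = nasa.set_index("start_datetime").resample("M").size()
    strong = (nasa[nasa["top50_rank"].notna()]
              .set_index("start_datetime").resample("M").size())

    ax = monthly.plot(label="all flares")
    ax.plot(strong.index, strong.values, "r.", label="top 50")
    ax.set_ylabel("flares per month")
    ax.legend()
    plt.show()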

Submission

Prepare a .pdf file that includes code, its output, and comments/description for each part, and submit it to Canvas. Comments and descriptions should be at most one sentence each. Make sure to name your file in the format Firstname_Lastname.pdf.

Assignment 2
Out: February 01
Due: February 08 by 5:30pm on Canvas
* Submit to Canvas a .pdf file that includes code, its output, and comments/description for each problem. Comments and descriptions should be at most one sentence each. Make sure to name your file in the format Firstname_Lastname.pdf.

Problem 1: (10 points)
You are given a set of m objects that is divided into K groups, where the i-th group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.) A small simulation contrasting the two schemes follows the list.
  1. We randomly select n * m_i / m elements from each group.
  2. We randomly select n elements from the data set, without regard for the group to which an object belongs.
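
A sketch of the contrast (the group sizes here are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = {"A": 600, "B": 300, "C": 100}             # hypothetical m_i; m = 1000
    m, n = sum(sizes.values()), 50

    # Scheme 1: exactly n * m_i / m elements from each group
    per_group = {g: round(n * mi / m) for g, mi in sizes.items()}
    print(per_group)                                   # fixed counts per group

    # Scheme 2: n draws ignoring group membership; counts are now random
    labels = np.repeat(list(sizes), list(sizes.values()))
    draw = rng.choice(labels, size=n, replace=True)
    print({g: int((draw == g).sum()) for g in sizes})  # varies run to run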

Problem 2: (10 points)
Download the image hw2_2022_problem4_Face.pgm from the class homework data folder. Find a PCA package and use it to compute eigenvectors and eigenvalues for this image.
  1. (5 points) Compute 2, 5, and 10 principal components and show the original and the reconstructed images.
  2. (5 points) What is the minimal number of principal components needed to retain 80% of data variance?
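
A sketch using scikit-learn's PCA, treating image rows as samples (imageio is just one way to read a .pgm file):

    import numpy as np
    import imageio.v3 as iio
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    img = iio.imread("hw2_2022_problem4_Face.pgm").astype(float)

    # Part 1: reconstruct from 2, 5, and 10 principal components
    for k in (2, 5, 10):
        pca = PCA(n_components=k).fit(img)
        recon = pca.inverse_transform(pca.transform(img))
        plt.imshow(recon, cmap="gray")
        plt.title(f"{k} components")
        plt.show()

    # Part 2: smallest k whose cumulative explained variance reaches 80%
    ratios = PCA().fit(img).explained_variance_ratio_
    k80 = int(np.argmax(np.cumsum(ratios) >= 0.80)) + 1
    print(k80)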

Problem 3: (40 points)
Download the Heart Disease dataset from http://archive.ics.uci.edu/ml/datasets/Heart+Disease. This dataset contains patient information from the Cleveland hospital, where each row represents a patient. The labels are the test results for the presence of the disease, where "0" means no heart disease and 1-4 represent the level of the disease. The dataset contains some missing values, denoted as "?". There are 303 patients in the original dataset and 75 features. The processed version of the dataset has the following attributes (which will be used in this assignment):
  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type:
    • value 1: typical angina
    • value 2: atypical angina
    • value 3: non-anginal pain
    • value 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results:
    • value 0: normal
    • value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  • thalach: maximum heart rate achieved
  • exang: exercise-induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment:
    • value 1: upsloping
    • value 2: flat
    • value 3: downsloping
  • ca: number of major vessels (0-3) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  • num: label (0-4)

Answer the following questions using one of the following programming languages: Python, R, or Java (you cannot use Excel or any equivalent software). A loading/cleaning sketch follows the list:

  1. (5 points) The task associated with this dataset is multiclass classification. Change the problem to binary classification and compute the proportion of each class in the binary case. Is this a balanced dataset?
  2. (5 points) Remove all patients that have any missing values in their records. How many patients do you have now?
  3. (5 points) Now, impute missing values with the mean values of the corresponding attributes. Report how this imputation affected the overall distribution of the corresponding attributes.
  4. (5 points) Draw a scatter plot and explain the relationship between chest pain type and age.
  5. (5 points) How does sex affect having or not having heart disease? Draw a box plot and explain.
  6. Generate 6 random samples (without replacement) of size 50 and answer the following:
    1. (5 points) What is the proportion of each class in each sample? Is each sample a balanced dataset?
    2. (5 points) How does sex affect having or not having heart disease in each sample? Draw a box plot.
  7. (5 points) Compare the results from question 5 with the results from question 6.2.
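
A loading/cleaning sketch (processed.cleveland.data is the processed Cleveland file from the UCI page):

    import pandas as pd

    cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
    heart = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")

    # Question 1: binarize the label (0 = no disease, 1-4 = disease)
    heart["disease"] = (heart["num"] > 0).astype(int)
    print(heart["disease"].value_counts(normalize=True))

    # Question 2: drop patients with any missing value
    print(len(heart.dropna()))

    # Question 3: mean imputation (compare describe() before and after)
    imputed = heart.fillna(heart.mean(numeric_only=True))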

Problem 4: (40 points)
The decision-makers at GymX would like to improve their services using data mining and machine learning techniques to better understand their customers. They have a large database that contains many fields such as customer_id, customer_name, age, sex, height, weight, membership_type, diet_restrictions, and more. The problem is that the database has many missing values, because most customers do not fill in all the necessary fields when they join the gym. This will affect their customer analysis. Help GymX solve their problem. Download the hw2_2024_problem4_GymX.xlsx dataset from the class homework data folder. The dataset contains the following attributes:
  • Customer ID
  • Customer Name
  • Age
  • Sex (male = 1, female = 0)
  • Height in feet
  • Weight in pounds
  • Membership type (adult, youth, or kids)
  1. (10 points) Report the number of missing values in each feature.
  2. (10 points) Describe a naive solution for missing values and use it to solve the missing data problem. What are the advantages/disadvantages of this solution?
  3. (10 points) Propose a better solution and use it to solve the missing data problem.
  4. (10 points) Compare the results of the naive handling of missing data vs. your better solution (a sketch of both imputation approaches follows this list), based on:
    1. (5 points) Plot a histogram of weight for all customers and report mean and standard deviation
    2. (5 points) Create a bar plot that shows the number of customers from each sex from each membership type.
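
A sketch of one naive approach (overall-mean imputation) and one arguably better approach (imputation within membership-type groups); the column names are guesses from the attribute list above, so match them to the actual spreadsheet:

    import pandas as pd

    gym = pd.read_excel("hw2_2024_problem4_GymX.xlsx")

    # 1: missing values per feature
    print(gym.isna().sum())

    # 2: naive solution - fill each numeric column with its overall mean
    naive = gym.fillna(gym.mean(numeric_only=True))

    # 3: a better solution - impute within membership-type groups, since
    # adults, youth, and kids differ systematically in height and weight
    num_cols = gym.select_dtypes("number").columns
    better = gym.copy()
    better[num_cols] = (gym.groupby("Membership type")[num_cols]  # guessed column name
                           .transform(lambda s: s.fillna(s.mean())))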

Assignment 3
Out: February 8
Due: February 15 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission
Number of problems/points: Eight problems for a total of 100 points

Problem 1: (10 points)
The following algorithm aims to find the K nearest neighbors of a data object:
      1: for i = 1 to number of data objects do
      2:      Find the distances of the i-th object to all other objects.
      3:      Sort these distances in decreasing order.
               (Keep track of which object is associated with each distance.)
      4: return the objects associated with the first K distances of the sorted list
      5: end for

(a) (5 points) Describe the potential problems with this algorithm if there are duplicate objects in the data set.
(b) (5 points) How would you fix this problem?

Problem 2: (20 points)

Compute the cosine measure using the word frequencies of the following two sentences:
(a) "The sly fox jumped over the lazy dog."
(b) "The dog jumped at the intruder."

Problem 3: (10 points)
Transform correlation to a similarity measure with [0,1] range that could be used for clustering time series.

Problem 4: (10 points)
Transform correlation to a similarity measure with [0,1] range that could be used for predicting the behavior of one time series given another.

Problem 5: (20 points)
This exercise compares and contrasts some similarity and distance measures.
  1. (10 points) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each organism is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

  2. (10 points) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Problem 6: (10 points)
Donor data consists of 11 records in the following format: Name, Age, Salary, Donor(Y/N). Donor training dataset:
Name     Age  Salary  Donor(Y/N)
Nancy    21   37,000  N
Jim      27   41,000  N
Allen    43   61,000  Y
Jane     38   55,000  N
Steve    44   30,000  N
Peter    51   56,000  Y
Sayani   53   70,000  Y
Lata     56   74,000  Y
Mary     59   25,000  N
Victor   61   68,000  Y
Dale     63   51,000  Y
Compute the Gini index for the entire Donor data set, with respect to the two classes. Compute the Gini index for the portion of the data set with age at least 50.

Problem 7: (10 points)
Repeat the computation of the previous exercise with the use of the entropy criterion. Compute the entropy for the portion of the data set with age greater than 50.
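
For checking the hand computations in Problems 6 and 7, generic helpers might look like this (the hand derivation is still the expected answer):

    import math
    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    donors = list("NNYNNYYYNYY")          # Donor(Y/N) column, in table order
    ages = [21, 27, 43, 38, 44, 51, 53, 56, 59, 61, 63]
    over50 = [d for d, a in zip(donors, ages) if a >= 50]
    print(gini(donors), gini(over50), entropy(over50))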

Problem 8: (10 points)
What is the best classification accuracy that can be obtained on Donor dataset with a decision tree of depth 2, where each test results in a binary split?

Assignment 4
Out: February 15
Due: February 22 by 5:00pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: Four problems for a total of 100 points

Problem 1: (15 points)
In reservoir sampling with a reservoir of size k, the n-th incoming stream data point is inserted into the reservoir with probability k/n, and one of the old k data points is removed from the reservoir at random to make room for the newly arriving point. After n stream points have arrived, prove that the probability of any point being included in the reservoir is k/n.
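
A simulation can illustrate (but not replace) the proof: every point should end up in the reservoir in roughly k/n of the runs.

    import random
    from collections import Counter

    def reservoir_sample(stream, k, rng):
        res = []
        for n, x in enumerate(stream, start=1):
            if n <= k:
                res.append(x)                    # first k points always enter
            elif rng.random() < k / n:           # later points enter w.p. k/n
                res[rng.randrange(k)] = x        # evict a uniformly random slot
        return res

    hits = Counter()
    for trial in range(10_000):
        hits.update(reservoir_sample(range(100), 10, random.Random(trial)))
    # each of the 100 points should appear ~1,000 times (k/n = 10/100)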

Problem 2: (15 points)
Show that the entropy of a node in a decision tree never increases after splitting it into smaller successor nodes.

Problem 3: (60 points)
  1. Develop decision tree classification software on your own in Python, R, or Java and apply it to build a classification model for the Donor dataset from Assignment 3. Since this dataset is small, apply a leave-one-out training and testing framework and report your findings (a harness sketch follows this list).
  2. Now apply this classifier to solve the Mushroom problem defined at https://www.kaggle.com/uciml/mushroom-classification
    Evaluate results using 10-fold cross validation and report your findings.
  3. Compare test set accuracy when training a mushroom classifier using 500 vs. 5,000 training examples.
  4. Report accuracy on training and test data when using a decision tree that has 10 vs 30 leaves.
  5. Report attribute tests used in the decision tree with 10 leaves.
  6. For a decision tree with 10 leaves report the number of positive and negative training examples at each internal node and at leaves.

Problem 4: (10 points)
Discuss the advantages and disadvantages of a nearest neighbor classifier, over a decision tree.

Assignment 5

Out: February 22
Due: February 29 by 5:00pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: Four problems for a total of 100 points

Problem 1: (30 points)
Download Census Income data set from UCI repository - link is https://archive.ics.uci.edu/ml/datasets/census+income
Use the training data to develop a model that determines whether a person makes over 50K a year.
Solve this problem using the k-nearest neighbors method with k=3 and k=9, and report the F1 score on the test data.
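
A sketch of the k-NN part; X_train/X_test are assumed to be numeric matrices (categorical attributes one-hot encoded, e.g., with pd.get_dummies) and y_train/y_test the over-50K indicator:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import f1_score

    for k in (3, 9):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(k, f1_score(y_test, knn.predict(X_test)))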

Problem 2: (30 points)
Solve the same problem using a feed-forward neural network and report the ROC on the test data.
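
A sketch with scikit-learn's MLPClassifier, reusing the matrices from Problem 1 (the layer sizes here are arbitrary):

    import matplotlib.pyplot as plt
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_curve, roc_auc_score

    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_train, y_train)
    scores = mlp.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr)
    plt.xlabel("false positive rate")
    plt.ylabel("true positive rate")
    print("AUC:", roc_auc_score(y_test, scores))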

Problem 3: (20 points)
Consider a data set containing four points located at the corners of a square. The two points on one diagonal belong to one class, and the two points on the other diagonal belong to the other class. Is this data set linearly separable? Provide a proof.

Problem 4: (20 points)
  1. Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
  2. Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
  3. Repeat part (b) assuming that the student is a smoker.
  4. Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in a dorm, is he or she more likely to be a graduate or undergraduate student? You can assume that living in a dorm and smoking are conditionally independent given whether the student is a graduate or an undergraduate.
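
The arithmetic for parts (a)-(c) can be checked mechanically with Bayes' rule (the derivation itself is the expected answer):

    # priors and conditionals from the problem statement
    p_grad, p_ug = 0.20, 0.80
    p_smoke_g, p_smoke_u = 0.23, 0.15

    p_smoke = p_grad * p_smoke_g + p_ug * p_smoke_u      # total probability
    p_grad_given_smoke = p_grad * p_smoke_g / p_smoke    # part (a), Bayes' rule
    print(p_grad_given_smoke, 1 - p_grad_given_smoke)    # compare for part (c)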

Assignment 6

Out: February 29
Due: MONDAY March 11 by noon on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: One problem for a total of 50 points.

Problem 1: (50 points)
Propose a ranked list of FIVE Knowledge Discovery and Data Mining topics, one of which you would learn on your own and present as a mini-lecture in class. For each of the 5 selected topics, list the references that you would use to prepare the mini-lecture. Mini-lecture topics and presentation days will be assigned by considering your preferences, but if multiple people rank the same topic as their preference, the topic will be assigned to the one who provides the most convincing references for preparing the presentation.

Mini lectures will be presented on March 28, April 4, and April 11. The order of presentations will be determined based on topics and will be announced on March 14.

The proposed topics for mini-lectures should be different from topics already discussed in class. Each topic should be appropriate for a 20-minute presentation. You can prepare a presentation based on materials from two textbooks, but you are also allowed to use conference tutorial slides, articles, etc. The following are possible topics to consider. You can also propose different topics relevant to Knowledge Discovery and Data Mining:

  • Large-scale hierarchical classification
  • Advanced concepts in cluster analysis
  • Association rules mining
  • Advanced concepts in association analysis
  • Anomaly detection
  • Data stream mining
  • Text and web mining
  • Time series mining
  • Mining big time series
  • Sequence pattern mining
  • Survival analysis
  • Mining spatial data
  • Mining graphs
  • Graphs sketching, sampling, streaming
  • Mining web data
  • Mining social networks
  • Privacy-preserving data mining
  • Mining spatio-temporal data
  • Mining semi-structured data
  • Mining with constraints
  • False discoveries
  • Lifelong machine learning
  • Deep Bayesian mining
  • Data mining for drug discovery
  • Mining electronic health records
  • Data mining in transportation
  • Data mining in power systems
  • Sports analytics
  • Explainable data modeling
  • Active learning
  • Human-in-the-loop learning
  • Visual analytics
  • Fairness-aware machine learning
  • Transfer learning
  • Fake news detection
  • Zero-shot and few-shot learning
  • Mining temporal networks
  • Reinforcement learning
  • Graph neural networks
  • Deep reinforcement learning
  • Deep learning for personalized search and recommender systems
  • A/B testing at scale
  • Parallel and distributed data science (cloud, map-reduce, federated learning)

Assignment 7

Out: February 29
Due: March 14 by 5:00pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: Four problems for a total of 100 points.

Problem 1: (15 points)
  1. Illustrate, on an example, the vanishing gradient problem for a deep neural network (with many hidden layers) that uses a sigmoid activation function.
  2. What is a way to overcome this problem (explain how)?
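
A numeric illustration of part 1: the sigmoid derivative is at most 0.25, so the chain-rule product across many sigmoid layers shrinks exponentially (weights are taken as 1 here for simplicity):

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    z, grad = 0.5, 1.0
    for layer in range(20):
        s = sigmoid(z)
        grad *= s * (1 - s)     # per-layer chain-rule factor, <= 0.25
        z = s                   # feed the activation forward
    print(grad)                 # on the order of 1e-13 after 20 layers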

Problem 2: (15 points)
The leader algorithm represents each cluster using a point, known as a leader, and assigns each point to the cluster corresponding to the closest leader unless this distance is above a user-specified threshold. In that case, the point becomes the leader of a new cluster.
  1. What are the advantages and disadvantages of the leader algorithm as compared to K-means?
  2. Suggest ways in which the leader algorithm might be improved.

Problem 3: (15 points)
Traditional agglomerative hierarchical clustering routines merge two clusters at each step.
  1. Does it seem likely that such an approach accurately captures the (nested) cluster structure of a set of data points?
  2. If not, explain how you might post-process the data to obtain a more accurate view of the cluster structure.

Problem 4: (55 points)
Download and install CLUTO software for clustering high-dimensional data (http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview). Apply this software to cluster the Enron Emails dataset available at https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

  1. Report clustering results when using partitional clustering in the CLUTO package. You are allowed to apply CLUTO on a sample if the data is too large for your computer. In such a case, report the sample size you use and how consistent the results are when you repeat the experiment 3 times on 3 samples of that size.

  2. Report results when using agglomerative clustering algorithms in the CLUTO package. For agglomerative clustering, compare the results when using complete-link vs. single-link merging schemes. Then, for single-link merging, compare the results when using the cosine versus the Euclidean distance function.

Assignment 8

Out: March 14
Due: MONDAY March 25 by 5:00pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: One problem for a total of 50 points

Problem 1: (50 points)

Write a research proposal for the class project that you plan to perform, present progress on April 18 and 25, and submit the project report by May 2.

Teams of two undergraduate students are allowed, as are teams of one undergraduate and one graduate student. Teams of two graduate students are not allowed.

Write the proposal using the following format:
(0) Your name(s) and e-mail address (so that the instructor can quickly approve your topic or ask for a revision/clarification)
(1) Title;
(2) Objective and Significance;
(3) Background;
(4) Proposed Approach (make sure to explain where you will get data and how much preprocessing is needed);
(5) References.
The proposal description may not exceed 2 pages in 12 pt font.

Following are some of the research project topics investigated by Temple KDDM students in previous years:

  1. Early prediction of spatio-temporal events
  2. Applications of Graph Neural Networks for modeling partially observed data
  3. Classifying sports events based on spatio-temporal data of players and ball using unreliable labels
  4. Predictive modeling to forecast store sales
  5. Emotion analysis based on text mining
  6. Clustering NYC 311 requests
  7. MP3 to MIDI conversion via deep learning
  8. Co-Localization of multiple objects in images using activation map
  9. Generalized procedure for selecting methylation CpGs associated with cancer
  10. Classification of vaccine-related tweets using deep learning
  11. Automatically building book indices
  12. Time-series and clustering analysis for systemic lupus erythematosus patients’ study
  13. Identifying complexity of Wikipedia text
  14. Graphlet-assisted structured regression
  15. Wire bonding: Predicting failures for the Ultrasonic Transducer
  16. Exploring bias and variance in supervised learning algorithms
  17. Clustering of gene expression cancer RNA-Seq Data Set
  18. Analysis of online product review
  19. Class imbalance: Credit card fraud analysis
  20. Physician social network and patient outcomes: An empirical investigation
  21. Exploring underlying structures of tweets with URLs via clustering
  22. Dynamic changes of structure of large-scale evolving temporal graphs
  23. Temporal predictive modeling with sample/feature size constraints
  24. Opportunistic routing assisted by decision trees in CRNs
  25. Uncertainty estimation of structured models on evolving graphs
  26. Structured output prediction with spatio-temporal data
  27. Using nonlinear gated-experts for traffic speed forecasting
  28. New clustering schemes to improve the analysis of antibody CDR structure
  29. Missing data, latent variables and PCA
  30. Text mining of evaluations of commercial banks
  31. Health care data mining
  32. Short text data mining and analysis
  33. Label propagation for multi-label prediction
  34. Ready for human-machine cooperation in hierarchical clustering?
  35. Decentralized estimation using learning vector quantization
  36. Batch mode active learning for classification and regression
  37. Inverse active learning modeling in simulated AOD prediction
  38. Pollution prediction using pre-clustering on informative features
  39. Uncertainty estimation for predicted aerosol optical depth
  40. Feature selection for microarray classification
  41. Analysis of gene functional expression profiles using GO semantic similarity
  42. Disease data mining survival prediction based on gene expression data
  43. Using movement data to detect significant regions of infection
  44. Shape matching improvement
  45. Relationships between environmental aspects of police officer’s work, family life and stress
  46. Classification of Basketball Strategies using Spatio-Temporal Data
  47. Sentiment Analysis on Social Media about Covid Vaccines to Analyze Public Reaction
  48. Spotify Playlist Recommender System
  49. Spotify Playlist Recommender System
  50. Fake News Detection and Analysis
  51. Data Mining and Analysis on Tweets Related to Current Events in the Russo-Ukrainian War
  52. A Comparison of Bagging and Boosting for Regression and Classification Tasks
  53. A Formal Framework for Credit Card Fraud Analysis
  54. Cancer pathology stage prediction from gene expression quantification data
  55. The Crowd vs. The Expert; Comparing Ensembles for Eigenface Emotion Classification
  56. Does a Clutch Factor Exist in Basketball?
  57. Forecasting store sales
  58. Forecasting store sales
  59. Classification of Salary by Occupation, Gender, and Other Metrics
  60. Exploring bias and variance of models on animal faces classification
  61. Comparison of Recent Conditional Generative Adversarial Networks Models for Image Translation
  62. Twitter bot detection and classification with sentiment analysis
  63. Mineral Classification from Spectral Data
  64. Comparative Analysis of Missing Data Imputation Techniques for MCAR Data
  65. Predicting Philadelphia Voter Turnout with a Random Forest Model
  66. Topic Clustering of Autism Subreddit Data
  67. Location-Based Species Presence Prediction
  68. Cancer Pathological Stage Prediction using MRNA Gene Expression Data
  69. False Discoveries
  70. Comparing Unsupervised Visual Representation Techniques on Hotel Room Images
  71. How do weather conditions affect users’ engagement in social media?
  72. Comparing Image Classifying Methods
  73. A Comparison of Various Forecasting Models in Predicting Rainfall from Spatio-Temporal Data
  74. Reddit-based graph generation
  75. Sentiment analysis in Text mining
  76. A Machine Learning Framework to Identify Early Alzheimer's Disease
  77. Clustering and Regression of Philadelphia Bike Share Data

Following are some research project topic ideas suggested by authors of the KDDM textbook:

  1. Evaluating Performance of Classifiers
  2. Support Vector Machine (SVM)
  3. Cost-sensitive learning
  4. Semi-supervised learning (classification with labeled and unlabeled data)
  5. Classification for rare-class problems
  6. Time Series Prediction/Classification
  7. Sequence Prediction
  8. Association Rules for Classification
  9. Spatial Association Rule Mining
  10. Temporal Association Rule Mining
  11. Sequential Association Rule Mining
  12. Outlier Detection
  13. Parallel Formulations of Clustering
  14. Clustering of Time Series
  15. Scalable clustering algorithms
  16. Clustering association rules and frequent item sets