CIS 4523/5523: Knowledge Discovery and Data Mining
Spring 2026

Homework Policies (applicable for all assignments):


  1. You are required to do the homework problems in order to pass.
  2. Understandability of the solution is as important as correctness.
  3. The penalty for late homework submissions is 20% per day, so submit on time.
  4. Solutions are expected to be your own work. Group work is not allowed unless explicitly approved for a particular problem. If you obtained a hint with help (e.g., through library research, discussion with another person, etc.), acknowledge your source and write up the solution on your own. Plagiarism and other anti-intellectual behavior will be dealt with severely.

Assignment 1
Out: January 15
Due: January 29 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.

PROBLEM 1:
Solve the nine tasks described below and submit a report to Canvas as a .pdf file. For each task, include code, its output, and comments/description.

The goal is to provide better data about the top 50 solar flares recorded so far than those shown by SpaceWeatherLive.com. Use this messy NASA data to add more features for the top 50 solar flares. You will need to scrape this information directly from each HTML page. You can read more here about Solar Flares, coronal mass ejections, and the solar flare alphabet soup.

Use any programming language of your choice. Python is recommended (and used for further explanation), but this can be done in R, Java, and other languages as well. A tutorial on Python is available at www.learnpython.org.

PART 1: Data scraping and preparation

Task 1: Scrape your competitor's data (10 pts)
Scrape data for the top 50 solar flares shown on SpaceWeatherLive.com.
The steps (if you are using Python) are as follows; a code sketch appears after the list:

  1. pip install or conda install the following Python packages: beautifulsoup4, requests, pandas, numpy, matplotlib (for visualization)
  2. Use requests to get page content (as in, HTTP GET)
  3. Extract the text from the page
  4. Use BeautifulSoup to read and parse the data, using either the html.parser or the lxml parser
  5. Use prettify() to view the content and find the appropriate table
  6. Use find() to save the aforementioned table as a variable
  7. Use pandas to read in the HTML table. HINT: make sure the above data is properly typecast.
  8. Set reasonable names for the table columns, e.g., rank, x_class, date, region, start_time, max_time, end_time, movie. DataFrame.columns makes this very simple.
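
For reference, here is a minimal Python sketch of these steps. The URL, the User-Agent header, and the assumption that the top-50 table is the first <table> on the page are placeholders to adapt:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # URL is an assumption -- use the actual top-50 page from the assignment
    url = "https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares"
    # some sites reject the default client, so send a browser-like User-Agent
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")   # parse the HTML

    table = soup.find("table")          # assumes the first <table> is the top-50 list
    df = pd.read_html(str(table))[0]    # pandas handles most of the typecasting

    # set workable column names
    df.columns = ["rank", "x_class", "date", "region",
                  "start_time", "max_time", "end_time", "movie"]
    print(df.head())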

The result should be a data frame, with the first few rows as:

Dimension: 50 × 8

     rank  x_class  date        region  start_time  max_time  end_time  movie
 1      1  X28.0    2003/11/04  0486    19:29       19:53     20:06     MovieView archive
 2      2  X20      2001/04/02  9393    21:32       21:51     22:03     MovieView archive
 3      3  X17.2    2003/10/28  0486    09:51       11:10     11:24     MovieView archive
 4      4  X17.0    2005/09/07  0808    17:17       17:40     18:03     MovieView archive
 5      5  X14.4    2001/04/15  9415    13:19       13:50     13:55     MovieView archive
 6      6  X10.0    2003/10/29  0486    20:37       20:49     21:01     MovieView archive
 7      7  X9.4     1997/11/06  -       11:49       11:55     12:01     MovieView archive
 8      8  X9.0     2006/12/05  0930    10:18       10:35     10:45     MovieView archive
 9      9  X8.3     2003/11/02  0486    17:03       17:25     17:39     MovieView archive
10     10  X7.1     2005/01/20  0720    06:36       07:01     07:26     MovieView archive
... with 40 more rows

Task 2: Tidy the top 50 solar flare data (10 pts)

Make this table usable with pandas (a code sketch follows the list):

  1. Drop the last column of the table, since we are not going to use it moving forward.
  2. Use the datetime module to combine the date column and each of the three time columns into three datetime columns. You will see why this is useful later on. iterrows() should prove useful here.
  3. Update the values in the dataframe as you do this. DataFrame.at (the older set_value is deprecated) should prove useful.
  4. Set regions coded as "-" as missing (NaN). You can use DataFrame.replace() here.
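
A minimal sketch of these steps, assuming df is the frame from Task 1 (the list above suggests iterrows(), but a vectorized pd.to_datetime is equivalent and simpler):

    import numpy as np
    import pandas as pd

    df = df.drop(columns=["movie"])     # 1. drop the movie column

    # 2-3. combine the date with each time column into a datetime column
    for col in ["start_time", "max_time", "end_time"]:
        df[col.replace("_time", "_datetime")] = pd.to_datetime(
            df["date"].astype(str) + " " + df[col].astype(str),
            format="%Y/%m/%d %H:%M")
    df = df.drop(columns=["date", "start_time", "max_time", "end_time"])

    # 4. regions coded as "-" become missing
    df["region"] = df["region"].replace("-", np.nan)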

The result of this step should be a data frame with the first few rows as:

A dataframe: 50 × 6

     rank  x_class  start_datetime       max_datetime         end_datetime         region
 1      1  X28.0    2003-11-04 19:29:00  2003-11-04 19:53:00  2003-11-04 20:06:00  0486
 2      2  X20      2001-04-02 21:32:00  2001-04-02 21:51:00  2001-04-02 22:03:00  9393
 3      3  X17.2    2003-10-28 09:51:00  2003-10-28 11:10:00  2003-10-28 11:24:00  0486
 4      4  X17.0    2005-09-07 17:17:00  2005-09-07 17:40:00  2005-09-07 18:03:00  0808
 5      5  X14.4    2001-04-15 13:19:00  2001-04-15 13:50:00  2001-04-15 13:55:00  9415
 6      6  X10.0    2003-10-29 20:37:00  2003-10-29 20:49:00  2003-10-29 21:01:00  0486
 7      7  X9.4     1997-11-06 11:49:00  1997-11-06 11:55:00  1997-11-06 12:01:00  <NA>
 8      8  X9.0     2006-12-05 10:18:00  2006-12-05 10:35:00  2006-12-05 10:45:00  0930
 9      9  X8.3     2003-11-02 17:03:00  2003-11-02 17:25:00  2003-11-02 17:39:00  0486
10     10  X7.1     2005-01-20 06:36:00  2005-01-20 07:01:00  2005-01-20 07:26:00  0720
... with 40 more rows

Task 3: Scrape the NASA data (15 pts)

Next, you need to scrape the NASA data to get additional features about these solar flares. The table format is described here.

Once scraped, do the next steps (a code sketch follows the list):

  1. Use BeautifulSoup functions (e.g., find, find_all) and string functions (e.g., split and built-in slicing) to obtain each row of data as a long string.
  2. Use the split function to separate each line of text into a data row.
  3. Create a DataFrame with the data from the table.
  4. Choose appropriate names for columns.
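
A minimal sketch under the assumption that the NASA page serves the table as fixed-format text inside a single <pre> block; the URL and the row filter are placeholders to adapt:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # URL is an assumption -- use the NASA page linked in the assignment
    url = "https://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html"
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    lines = soup.find("pre").get_text().split("\n")   # one long string per row

    columns = ["start_date", "start_time", "end_date", "end_time",
               "start_frequency", "end_frequency", "flare_location",
               "flare_region", "flare_classification", "cme_date",
               "cme_time", "cme_angle", "cme_width", "cme_speed"]

    rows = []
    for line in lines:
        fields = line.split()
        # keep only data rows, which start with a year like "1997/04/01"
        if len(fields) >= len(columns) and fields[0][:4].isdigit():
            rows.append(fields[:len(columns)])

    nasa = pd.DataFrame(rows, columns=columns)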

The result of this step should be similar to:

Dimension: 482 × 14

     start_date  start_time  end_date  end_time  start_frequency  end_frequency  flare_location  flare_region
   * <chr>       <chr>       <chr>     <chr>     <chr>            <chr>          <chr>           <chr>
 1   1997/04/01  14:00       04/01     14:15     8000             4000           S25E16          8026
 2   1997/04/07  14:30       04/07     17:30     11000            1000           S28E19          8027
 3   1997/05/12  05:15       05/14     16:00     12000            80             N21W08          8038
 4   1997/05/21  20:20       05/21     22:00     5000             500            N05W12          8040
 5   1997/09/23  21:53       09/23     22:16     6000             2000           S29E25          8088
 6   1997/11/03  05:15       11/03     12:00     14000            250            S20W13          8100
 7   1997/11/03  10:30       11/03     11:30     14000            5000           S16W21          8100
 8   1997/11/04  06:00       11/05     04:30     14000            100            S14W33          8100
 9   1997/11/06  12:20       11/07     08:30     14000            100            S18W63          8100
10   1997/11/27  13:30       11/27     14:00     14000            7000           N17E63          8113
... with 472 more rows, and 6 more variables: flare_classification <chr>, cme_date <chr>, cme_time <chr>, cme_angle <chr>, cme_width <chr>, cme_speed <chr>

Task 4: Tidy the NASA table (15 pts)

Here we will code missing observations properly, recode columns that correspond to more than one piece of information, and treat dates and times appropriately (a code sketch follows the list).

  1. Recode any missing entries as NaN. Refer to the data description to see how missing entries are encoded in each column. Be sure to look carefully at the actual data, as the NASA descriptions might not be completely accurate.
  2. The CPA column (cme_angle) contains angles in degrees for most rows, except for halo flares, which are coded as Halo. Create a new column that indicates if a row corresponds to a halo flare or not, and then replace Halo entries in the cme_angle column with NaN.
  3. Some entries in the width column are marked as lower bounds. Create a new column that indicates whether the width is given as a lower bound, and remove any non-numeric part of the width column.
  4. Combine date and time columns for start, end and cme so they can be encoded as datetime objects.
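
A minimal sketch of these steps, assuming nasa is the frame from Task 3; the missing-value codes below are placeholders that must be taken from the data description and from inspecting the actual rows:

    import numpy as np
    import pandas as pd

    # 2. flag halo CMEs, then blank the "Halo" entries
    nasa["is_halo"] = nasa["cme_angle"] == "Halo"
    nasa["cme_angle"] = nasa["cme_angle"].replace("Halo", np.nan)

    # 3. flag lower-bound widths (e.g., ">360"), then keep the numeric part
    nasa["width_lower_bound"] = nasa["cme_width"].str.startswith(">")
    nasa["cme_width"] = pd.to_numeric(nasa["cme_width"].str.lstrip(">"),
                                      errors="coerce")

    # 1. placeholder missing-value codes -- replace with the real ones
    nasa = nasa.replace(["----", "--:--", "--/--"], np.nan)

    # 4. combine dates and times; end_date lacks the year, so borrow it from
    #    start_date (events spanning New Year would need extra care)
    nasa["start_datetime"] = pd.to_datetime(
        nasa["start_date"] + " " + nasa["start_time"], errors="coerce")
    nasa["end_datetime"] = pd.to_datetime(
        nasa["start_date"].str[:5] + nasa["end_date"] + " " + nasa["end_time"],
        errors="coerce")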

The output of this step should be similar to this:

   start_datetime       end_datetime         start_frequency  end_frequency  flare_location  flare_region  importance  cme_datetime         cpa  width  speed  plot  is_halo  width_lower_bound
0  1997-04-01 14:00:00  1997-04-01 14:15:00  8000             4000           S25E16          8026          M1.3        1997-04-01 15:18:00  74   79     312    PHTX  False    False
1  1997-04-07 14:30:00  1997-04-07 17:30:00  11000            1000           S28E19          8027          C6.8        1997-04-07 14:27:00  NaN  360    878    PHTX  True     False
2  1997-05-12 05:15:00  1997-05-14 16:00:00  12000            80             N21W08          8038          C1.3        1997-05-12 05:30:00  NaN  360    464    PHTX  True     False
3  1997-05-21 20:20:00  1997-05-21 22:00:00  5000             500            N05W12          8040          M1.3        1997-05-21 21:00:00  263  165    296    PHTX  False    False
4  1997-09-23 21:53:00  1997-09-23 22:16:00  6000             2000           S29E25          8088          C1.4        1997-09-23 22:02:00  133  155    712    PHTX  False    False
5  1997-11-03 05:15:00  1997-11-03 12:00:00  14000            250            S20W13          8100          C8.6        1997-11-03 05:28:00  240  109    227    PHTX  False    False

PART 2: Analysis

Now that you have data from both sites, let’s start some analysis.

Task 5: Replication (10 pts)

Replicate as much as possible of the top 50 solar flare table in SpaceWeatherLive.com using the data obtained from NASA. If you get the top 50 solar flares from the NASA table based on their classification (e.g., X28 is the highest), do you get data for the same solar flare events? Include the code used to get the top 50 solar flares from the NASA table (be careful when ordering by classification; see the sketch below). Write a sentence or two discussing how well you can replicate the SpaceWeatherLive data from the NASA data.
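
The ordering pitfall: classifications like "X9.4" and "X28.0" compare incorrectly as strings. A minimal sketch of one way to rank them numerically, assuming the Task 3/4 column names:

    import pandas as pd

    xflares = nasa[nasa["flare_classification"].str.startswith("X", na=False)].copy()
    # strip the leading "X" and rank by the numeric part
    xflares["magnitude"] = pd.to_numeric(
        xflares["flare_classification"].str[1:], errors="coerce")
    top50_nasa = xflares.sort_values("magnitude", ascending=False).head(50)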

Task 6: Integration (15 pts)

Write a function that finds the best matching row in the NASA data for each of the top 50 solar flares in the SpaceWeatherLive data. Here, you have to decide for yourself how you determine the best matching entry in the NASA data for each of the top 50 solar flares. In your submission, include an explanation of how you are defining best matching rows across the two datasets in addition to the code used to find the best matches. Finally, use your function to add a new column to the NASA dataset indicating each row's rank according to SpaceWeatherLive, if it appears in that dataset (a sketch of one possible criterion follows).
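
A sketch of one plausible matching criterion (an assumption, not the only valid choice): pick the NASA row whose start time is closest to the flare's start time. Here df is the tidy top-50 frame from Task 2 and nasa is the tidy NASA frame from Task 4:

    import numpy as np

    def best_match(flare, nasa):
        # index of the NASA row with the nearest start time
        return (nasa["start_datetime"] - flare["start_datetime"]).abs().idxmin()

    nasa["rank"] = np.nan
    for _, flare in df.iterrows():
        nasa.loc[best_match(flare, nasa), "rank"] = flare["rank"]

A stricter definition might also require the classification or flare region to agree; explain whichever rule you choose.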

Task 7: Attributes visualization (7 pts)

Plot attributes in the NASA dataset (e.g., starting or ending frequencies, flare height or width) over time. Use graphical elements (e.g., text or points) to indicate flares that are in the top 50.
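
A minimal matplotlib sketch, assuming the rank column added in Task 6 marks the top-50 rows:

    import pandas as pd
    import matplotlib.pyplot as plt

    freq = pd.to_numeric(nasa["start_frequency"], errors="coerce")
    top = nasa["rank"].notna()

    fig, ax = plt.subplots()
    ax.scatter(nasa["start_datetime"], freq, s=10, label="all flares")
    ax.scatter(nasa.loc[top, "start_datetime"], freq[top],
               color="red", s=40, label="top 50")
    ax.set_xlabel("start time")
    ax.set_ylabel("start frequency")
    ax.legend()
    plt.show()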

Task 8: Attributes comparison (8 pts)

Do flares in the top 50 tend to have Halo CMEs? You can make a barplot that compares the number (or proportion) of Halo CMEs in the top 50 flares vs. the dataset as a whole.
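
A minimal sketch of such a barplot, again assuming the is_halo and rank columns from Tasks 4 and 6:

    import matplotlib.pyplot as plt

    top = nasa["rank"].notna()
    proportions = [nasa["is_halo"].mean(),           # whole dataset
                   nasa.loc[top, "is_halo"].mean()]  # top 50 only
    plt.bar(["all flares", "top 50"], proportions)
    plt.ylabel("proportion of Halo CMEs")
    plt.show()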

Task 9: Events distribution (10 pts)

Do strong flares cluster in time? Plot the number of flares per month over time, and add a graphical element (e.g., text or points) to indicate the number of strong flares (those in the top 50) to see if they cluster.
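
A minimal sketch using pandas resampling, under the same column assumptions as above:

    import matplotlib.pyplot as plt

    monthly = nasa.set_index("start_datetime").resample("M").size()
    strong = (nasa[nasa["rank"].notna()]
              .set_index("start_datetime").resample("M").size())

    ax = monthly.plot(label="all flares")
    strong.plot(ax=ax, style="r.", label="top 50")
    ax.set_ylabel("flares per month")
    ax.legend()
    plt.show()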

Submission

Prepare a .pdf file that includes code, its output, and comments/description for each part, and submit it to Canvas. Comments and descriptions should be at most one sentence each. Make sure to name your file in the format Firstname_Lastname.pdf.

Assignment 2
Out: January 29
Due: February 05 by 5:30pm on Canvas
* Submit to Canvas a .pdf file that includes code, its output, and comments/description for each problem. Comments and descriptions should be at most one sentence each. Make sure to name your file in the format Firstname_Lastname.pdf.

Problem 1: (10 points)
You are given a set of m objects that is divided into K groups, where the i-th group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement; a sketch contrasting the two schemes follows the list.)
  1. We randomly select n * m_i / m elements from each group.
  2. We randomly select n elements from the data set, without regard for the group to which an object belongs.
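
A toy sketch contrasting the two schemes (the group sizes are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.arange(100)               # m = 100 objects
    groups = [data[:60], data[60:]]     # K = 2 groups of sizes 60 and 40
    m, n = 100, 10

    # scheme 1: n * m_i / m draws from each group (stratified)
    scheme1 = np.concatenate(
        [rng.choice(g, size=int(n * len(g) / m), replace=True) for g in groups])

    # scheme 2: n draws from the pooled data, ignoring groups
    scheme2 = rng.choice(data, size=n, replace=True)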

Problem 2: (10 points)
Download the image hw2_Face.pbm from the class homework data folder. Find a PCA package and use it to compute eigenvectors and eigenvalues for this image (a code sketch follows the questions).
  1. (5 points) Compute 2, 5, and 10 principal components and show original and the resulting images.
  2. (5 points) What is the minimal number of principal components needed to retain 80% of data variance?
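
A minimal sketch using scikit-learn's PCA, treating image rows as samples (reading a .pbm via matplotlib requires the Pillow package):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    img = plt.imread("hw2_Face.pbm").astype(float)

    # part 1: reconstruct the image from 2, 5, and 10 components
    for k in [2, 5, 10]:
        pca = PCA(n_components=k)
        restored = pca.inverse_transform(pca.fit_transform(img))
        plt.imshow(restored, cmap="gray")
        plt.title(f"{k} principal components")
        plt.show()

    # part 2: smallest k whose cumulative explained variance reaches 80%
    full = PCA().fit(img)
    k80 = int(np.argmax(np.cumsum(full.explained_variance_ratio_) >= 0.8)) + 1
    print(k80)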

Problem 3: (40 points)
Download the Heart Disease dataset from https://archive.ics.uci.edu/ml/datasets/Heart+Disease. This dataset contains patient information from a Cleveland hospital, where each row represents a patient. Labels are the test results for the presence of the disease, where "0" means no heart disease and 1-4 represent the level of the disease. The dataset contains some missing values, and these values are denoted as "?". There are 303 patients in the original dataset and 75 features. The processed version of the dataset has the following attributes (which will be used in this assignment):
  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type:
    • value 1: typical angina
    • value 2: atypical angina
    • value 3: non-anginal pain
    • value 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results:
    • value 0: normal
    • value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  • thalach: maximum heart rate achieved
  • exang: exercise-induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment:
    • value 1: upsloping
    • value 2: flat
    • value 3: downsloping
  • ca: number of major vessels (0-3) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  • num: Label 0 - 4

Answer the following questions using one of the following programming languages: Python, R, or Java (you cannot use Excel or any equivalent software). A code sketch for loading the data and handling missing values follows the questions:

  1. (5 points) The task associated with this dataset is multiclass classification. Change the problem to binary classification and compute the proportion of each class in the binary case. Is this a balanced dataset?
  2. (5 points) Remove all patients who have any missing values in their records. How many patients do you have now?
  3. (5 points) Now, impute missing values with the mean values of the corresponding attributes. Report how this imputation affected the overall distribution of the corresponding attributes.
  4. (5 points) Draw a scatter plot and explain the relationship between chest pain type and age.
  5. (5 points) How does sex affect having or not having a heart disease? Draw a box plot and explain.
  6. Generate 6 random samples (without replacement) of size 50 and answer the following:
    1. (5 points) What is the proportion of each class in each sample? Is each sample a balanced dataset?
    2. (5 points) How does sex affect having or not having a heart disease in each sample? Draw a box plot.
  7. (5 points) Compare the results from question 5 with the results from question 6.2.
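
A sketch of loading the data and handling missing values (the file name follows the UCI archive's processed Cleveland file; adjust the path to your download):

    import pandas as pd

    cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
            "exang", "oldpeak", "slope", "ca", "thal", "num"]
    df = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")

    # question 1: binarize the label (0 = no disease, 1-4 -> 1 = disease)
    df["disease"] = (df["num"] > 0).astype(int)
    print(df["disease"].value_counts(normalize=True))

    # question 2: drop patients with any missing value
    print(len(df.dropna()))

    # question 3: mean imputation instead
    imputed = df.fillna(df.mean())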

Problem 4: (40 points)
The decision-makers at GymX would like to improve their services using data mining and machine learning techniques to better understand their customers. They have a large database with many fields, including customer_id, customer_name, age, sex, height, weight, membership_type, diet_restrictions, and more. The problem is that the database has a lot of missing data, because most customers do not fill in all the required fields when they join the gym. This problem will affect their customer analysis. Help GymX solve their problem. Download the hw2_GymX.xlsx dataset from the class homework data folder (a code sketch follows the questions below). The dataset contains the following attributes:
  • Customer ID
  • Customer Name
  • Age
  • Sex (male = 1, female = 0)
  • Height in feet
  • Weight in pounds
  • Membership type (adult, youth, or kids)
  1. (10 points) Report the number of missing values in each feature.
  2. (10 points) Describe a naive solution for missing values and use it to solve the missing data problem. What are the advantages/disadvantages of this solution?
  3. (10 points) Propose a better solution and use it to solve the missing data problem.
  4. (10 points) Compare the results of the naïve handling of missing data vs. your better solution based on the following:
    1. (5 points) Plot a histogram of weight for all customers and report mean and standard deviation
    2. (5 points) Create a bar plot showing the number of customers by sex and membership type.
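
A sketch of the two imputation strategies (reading .xlsx requires the openpyxl package; the exact column names in the file are assumptions):

    import pandas as pd

    gym = pd.read_excel("hw2_GymX.xlsx")
    print(gym.isna().sum())     # question 1: missing values per feature

    # naive solution: impute every numeric column with its overall mean
    naive = gym.fillna(gym.mean(numeric_only=True))

    # one better solution: impute within membership-type groups, since kids,
    # youths, and adults differ systematically in age, height, and weight
    better = gym.copy()
    for col in ["Age", "Height", "Weight"]:
        better[col] = (better.groupby("Membership type")[col]
                             .transform(lambda s: s.fillna(s.mean())))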

Assignment 3
Out: February 5
Due: February 12 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission
Number of problems/points: Eight problems for a total of 100 points

Problem 1: (10 points)
The following algorithm aims to find the K nearest neighbors of a data object:
      1: for i = 1 to number of data objects do
      2:      Find the distances of the i-th object to all other objects.
      3:      Sort these distances in decreasing order.
               (Keep track of which object is associated with each distance.)
      4: return the objects associated with the first K distances of the sorted list
      5: end for

(a) (5 points) Describe the potential problems with this algorithm if there are duplicate objects in the data set.
(b) (5 points) How would you fix this problem?

Problem 2: (20 points)

Compute the cosine measure using the term frequencies of the following two sentences (a generic helper sketch follows the sentences):
(a) "The sly fox jumped over the lazy dog."
(b) "The dog jumped at the intruder."

Problem 3: (10 points)
Transform correlation to a similarity measure with [0,1] range that could be used for clustering time series.

Problem 4: (10 points)
Transform correlation to a similarity measure with [0,1] range that could be used for predicting the behavior of one time series given another.

Problem 5: (20 points)
This exercise compares and contrasts some similarity and distance measures.
  1. (10 points) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

  2. (10 points) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Problem 6: (10 points)
Donor data consists of 11 records in the following format: Name Age Salary Donor(Y/N). Donor training dataset:
Name    Age  Salary  Donor(Y/N)
Nancy   21   37,000  N
Jim     27   41,000  N
Allen   43   61,000  Y
Jane    38   55,000  N
Steve   44   30,000  N
Peter   51   56,000  Y
Sayani  53   70,000  Y
Lata    56   74,000  Y
Mary    59   25,000  N
Victor  61   68,000  Y
Dale    63   51,000  Y
Compute the Gini index for the entire Donor data set, with respect to the two classes. Compute the Gini index for the portion of the data set with age at least 50.

Problem 7: (10 points)
Repeat the computation of the previous exercise with the use of the entropy criterion. Compute the entropy for the portion of the data set with age greater than 50.
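
For Problems 6 and 7, generic impurity helpers may be useful; this sketch encodes the standard definitions (Gini = 1 minus the sum of squared class proportions; entropy = minus the sum of p * log2 p):

    import math
    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    donors = list("NNYNNYYYNYY")   # the Donor labels in table order
    print(gini(donors), entropy(donors))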

Problem 8: (10 points)
What is the best classification accuracy that can be obtained on the Donor dataset with a decision tree of depth 2, where each test results in a binary split?

Assignment 4
No assignment yet.
Assignment 5
No assignment yet.
Assignment 6
No assignment yet.
Assignment 7
No assignment yet.
Assignment 8
No assignment yet.