CIS 4523/5523: Knowledge Discovery and Data Mining
Spring 2024

Homework Policies (applicable for all assignments):


  1. You are required to do the homework problems in order to pass.
  2. Clarity of the solution is as important as correctness.
  3. The penalty for late homework submissions is 20% per day, so submit on time.
  4. Solutions are expected to be your own work. Group work is not allowed unless explicitly approved for a particular problem. If you obtained a hint from an outside source (e.g., library research or a discussion with another person), acknowledge the source and write up the solution on your own. Plagiarism and other anti-intellectual behavior will be dealt with severely.

Assignment 1
Out: January 18
Due: February 01 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.

PROBLEM 1:
Solve the nine tasks described below and submit a report to Canvas as a .pdf file. For each task, include code, its output, and comments/description.

The goal is to provide better data about the top 50 solar flares recorded so far than the data shown by SpaceWeatherLive.com. Use the messy NASA data to add more features for the top 50 solar flares; you need to scrape this information directly from each HTML page. You can read more here about solar flares, coronal mass ejections, and the solar flare alphabet soup.

Use any programming language of your choice. Python is recommended (and used in the explanations below), but this can also be done in R, Java, and other languages. A tutorial on Python is available at www.learnpython.org.

PART 1: Data scraping and preparation

Task 1: Scrape your competitor's data (10 pts)
Scrape data for the top 50 solar flares shown in SpaceWeatherLive.com.
The steps (if you are using Python) are as follows; a minimal sketch follows the list:

  1. pip install or conda install the following Python packages: beautifulsoup4, requests, pandas, numpy, and matplotlib (for visualization)
  2. Use requests to get page content (as in, HTTP GET)
  3. Extract the text from the page
  4. Use BeautifulSoup to read and parse the data, using either the html.parser or lxml parser
  5. Use prettify() to view the content and find the appropriate table
  6. Use find() to save the aforementioned table as a variable
  7. Use pandas to read in the HTML table. HINT: make sure the data is properly typecast.
  8. Set reasonable names for the table columns, e.g., rank, x_classification, date, region, start_time, maximum_time, end_time, movie. DataFrame.columns makes this very simple.
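
For example, a minimal sketch of steps 2-8 (the URL is illustrative, and the assumption that the top-50 list is the first table on the page should be verified against the live site):

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Steps 2-3: HTTP GET the page and extract its text
    url = "https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares"  # illustrative
    html = requests.get(url).text

    # Step 4: parse with BeautifulSoup (html.parser or lxml both work)
    soup = BeautifulSoup(html, "lxml")

    # Steps 5-6: inspect soup.prettify() by eye, then grab the table
    table = soup.find("table")  # assumes the top-50 list is the first table

    # Step 7: let pandas parse the HTML fragment; check dtypes afterwards
    df = pd.read_html(str(table))[0]

    # Step 8: set reasonable column names
    df.columns = ["rank", "x_classification", "date", "region",
                  "start_time", "maximum_time", "end_time", "movie"]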

The result should be a data frame, with the first few rows as:

Dimension: 50 × 8

rank x_class date region start_time max_time end_time movie
1 1 X28.0 2003/11/04 0486 19:29 19:53 20:06 MovieView archive
2 2 X20 2001/04/02 9393 21:32 21:51 22:03 MovieView archive
3 3 X17.2 2003/10/28 0486 09:51 11:10 11:24 MovieView archive
4 4 X17.0 2005/09/07 0808 17:17 17:40 18:03 MovieView archive
5 5 X14.4 2001/04/15 9415 13:19 13:50 13:55 MovieView archive
6 6 X10.0 2003/10/29 0486 20:37 20:49 21:01 MovieView archive
7 7 X9.4 1997/11/06 - 11:49 11:55 12:01 MovieView archive
8 8 X9.0 2006/12/05 0930 10:18 10:35 10:45 MovieView archive
9 9 X8.3 2003/11/02 0486 17:03 17:25 17:39 MovieView archive
10 10 X7.1 2005/01/20 0720 06:36 07:01 07:26 MovieView archive
... with 40 more rows

Task 2: Tidy the top 50 solar flare data (10 pts)

Make this table usable with pandas (a sketch follows the list):

  1. Drop the last column of the table, since we are not going to use it moving forward.
  2. Use the datetime module to combine the date column and each of the three time columns into three datetime columns. You will see why this is useful later on. iterrows() should prove useful here.
  3. Update the values in the dataframe as you do this. DataFrame.at should prove useful (set_value has been removed from recent versions of pandas).
  4. Set regions coded as - to missing (NaN). You can use DataFrame.replace() here.
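
A sketch of steps 1-4, assuming the column names from Task 1:

    from datetime import datetime
    import numpy as np

    df = df.drop(columns=["movie"])  # step 1

    # Steps 2-3: combine the date with each time column into a datetime
    for i, row in df.iterrows():
        for col in ["start_time", "maximum_time", "end_time"]:
            df.at[i, col] = datetime.strptime(f'{row["date"]} {row[col]}',
                                              "%Y/%m/%d %H:%M")
    df = df.drop(columns=["date"]).rename(columns={"start_time": "start_datetime",
                                                   "maximum_time": "max_datetime",
                                                   "end_time": "end_datetime"})

    # Step 4: '-' regions become missing
    df["region"] = df["region"].replace("-", np.nan)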

The result of this step should be a data frame with the first few rows as:

A dataframe: 50 × 6

rank x_class start_datetime max_datetime end_datetime region
1 1 X28.0 2003-11-04 19:29:00 2003-11-04 19:53:00 2003-11-04 20:06:00 0486
2 2 X20 2001-04-02 21:32:00 2001-04-02 21:51:00 2001-04-02 22:03:00 9393
3 3 X17.2 2003-10-28 09:51:00 2003-10-28 11:10:00 2003-10-28 11:24:00 0486
4 4 X17.0 2005-09-07 17:17:00 2005-09-07 17:40:00 2005-09-07 18:03:00 0808
5 5 X14.4 2001-04-15 13:19:00 2001-04-15 13:50:00 2001-04-15 13:55:00 9415
6 6 X10.0 2003-10-29 20:37:00 2003-10-29 20:49:00 2003-10-29 21:01:00 0486
7 7 X9.4 1997-11-06 11:49:00 1997-11-06 11:55:00 1997-11-06 12:01:00 <NA>
8 8 X9.0 2006-12-05 10:18:00 2006-12-05 10:35:00 2006-12-05 10:45:00 0930
9 9 X8.3 2003-11-02 17:03:00 2003-11-02 17:25:00 2003-11-02 17:39:00 0486
10 10 X7.1 2005-01-20 06:36:00 2005-01-20 07:01:00 2005-01-20 07:26:00 0720
... with 40 more rows

Task 3: Scrape the NASA data (15 pts)

Next, you need to scrape NASA data to get additional features about these solar flares. This table format is described here.

Once scraped, perform the following steps (a sketch follows the list):

  1. Use BeautifulSoup functions (e.g., find, findAll) and string functions (e.g., split and built-in slicing capabilities) to obtain each row of data as a long string.
  2. Use the split function to separate each line of text into a data row.
  3. Create a DataFrame with the data from the table.
  4. Choose appropriate names for columns.
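
A sketch of these steps, assuming (as on NASA's Wind/WAVES type II burst list) that the rows live in a single pre-formatted text block; adjust the row filter and field count to the actual page:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    nasa_url = "..."  # fill in: the event-list page linked in the assignment
    soup = BeautifulSoup(requests.get(nasa_url).text, "lxml")

    # Step 1: pull each data row out of the <pre> block as one long string
    lines = soup.find("pre").text.splitlines()

    # Step 2: keep only data rows (they start with a year) and split into fields
    cols = ["start_date", "start_time", "end_date", "end_time",
            "start_frequency", "end_frequency", "flare_location", "flare_region",
            "flare_classification", "cme_date", "cme_time", "cme_angle",
            "cme_width", "cme_speed"]
    # rows with missing trailing fields may need padding to len(cols)
    rows = [ln.split()[:len(cols)] for ln in lines if ln.startswith(("19", "20"))]

    # Steps 3-4: build the DataFrame with appropriate column names
    nasa = pd.DataFrame(rows, columns=cols)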

The result of this step should be similar to:

Dimension: 482 × 14

start_date start_time end_date end_time start_frequency end_frequency flare_location flare_region
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1997/04/01 14:00 04/01 14:15 8000 4000 S25E16 8026
2 1997/04/07 14:30 04/07 17:30 11000 1000 S28E19 8027
3 1997/05/12 05:15 05/14 16:00 12000 80 N21W08 8038
4 1997/05/21 20:20 05/21 22:00 5000 500 N05W12 8040
5 1997/09/23 21:53 09/23 22:16 6000 2000 S29E25 8088
6 1997/11/03 05:15 11/03 12:00 14000 250 S20W13 8100
7 1997/11/03 10:30 11/03 11:30 14000 5000 S16W21 8100
8 1997/11/04 06:00 11/05 04:30 14000 100 S14W33 8100
9 1997/11/06 12:20 11/07 08:30 14000 100 S18W63 8100
10 1997/11/27 13:30 11/27 14:00 14000 7000 N17E63 8113
... with 472 more rows, and 6 more variables: flare_classification <chr>, cme_date <chr>, cme_time <chr>, cme_angle <chr>, cme_width <chr>, cme_speed <chr>

Task 4: Tidy the NASA table (15 pts)

Here we will code missing observations properly, recode columns that correspond to more than one piece of information, and treat dates and times appropriately. A sketch follows the list.

  1. Recode any missing entries as NaN. Refer to the data description to see how missing entries are encoded in each column. Be sure to look carefully at the actual data, as the NASA descriptions might not be completely accurate.
  2. The CPA column (cme_angle) contains angles in degrees for most rows, except for halo flares, which are coded as Halo. Create a new column that indicates if a row corresponds to a halo flare or not, and then replace Halo entries in the cme_angle column with NaN.
  3. The width column indicates if the given value is a lower bound. Create a new column that indicates if width is given as a lower bound, and remove any non-numeric part of the width column.
  4. Combine date and time columns for start, end and cme so they can be encoded as datetime objects.
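
A sketch of steps 1-4; the missing-value codes shown are illustrative, so check them against the format description and the raw rows:

    import numpy as np
    import pandas as pd

    # Step 1: recode missing entries (verify the actual codes on the page)
    nasa = nasa.replace(["----", "-----", "--/--", "--:--"], np.nan)

    # Step 2: flag halo CMEs, then blank the angle
    nasa["is_halo"] = nasa["cme_angle"] == "Halo"
    nasa["cme_angle"] = nasa["cme_angle"].replace("Halo", np.nan)

    # Step 3: flag lower-bound widths and keep only the numeric part
    nasa["width_lower_bound"] = nasa["cme_width"].str.contains(">", na=False)
    nasa["cme_width"] = nasa["cme_width"].str.lstrip(">").astype(float)

    # Step 4: combine dates and times; note that end_date omits the year,
    # so borrow it from start_date before parsing the end/cme datetimes
    nasa["start_datetime"] = pd.to_datetime(
        nasa["start_date"] + " " + nasa["start_time"],
        format="%Y/%m/%d %H:%M", errors="coerce")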

The output of this step should be similar to this:

start_datetime end_datetime start_frequency end_frequency flare_location flare_region importance cme_datetime cpa width speed plot is_halo width_lower_bound
0 1997-04-01 14:00:00 1997-04-01 14:15:00 8000 4000 S25E16 8026 M1.3 1997-04-01 15:18:00 74 79 312 PHTX False False
1 1997-04-07 14:30:00 1997-04-07 17:30:00 11000 1000 S28E19 8027 C6.8 1997-04-07 14:27:00 NaN 360 878 PHTX True False
2 1997-05-12 05:15:00 1997-05-14 16:00:00 12000 80 N21W08 8038 C1.3 1997-05-12 05:30:00 NaN 360 464 PHTX True False
3 1997-05-21 20:20:00 1997-05-21 22:00:00 5000 500 N05W12 8040 M1.3 1997-05-21 21:00:00 263 165 296 PHTX False False
4 1997-09-23 21:53:00 1997-09-23 22:16:00 6000 2000 S29E25 8088 C1.4 1997-09-23 22:02:00 133 155 712 PHTX False False
5 1997-11-03 05:15:00 1997-11-03 12:00:00 14000 250 S20W13 8100 C8.6 1997-11-03 05:28:00 240 109 227 PHTX False False

PART 2: Analysis

Now that you have data from both sites, let’s start some analysis.

Task 5: Replication (10 pts)

Replicate as much as possible of the top 50 solar flare table from SpaceWeatherLive.com using the data obtained from NASA. If you get the top 50 solar flares from the NASA table based on their classification (e.g., X28 is the highest), do you get data for the same solar flare events? Include the code used to get the top 50 solar flares from the NASA table (be careful when ordering by classification; see the sketch below). Write a sentence or two discussing how well you can replicate the SpaceWeatherLive data from the NASA data.
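
Because the classification string mixes a letter and a number (X28.0, M1.3, ...), sorting it lexicographically is wrong; one way to order flares is to map each class to its GOES peak flux:

    # Classification letters are bands of GOES peak X-ray flux (W/m^2);
    # within a band the number is a multiplier, so X28 > X9.4 > M1.3
    SCALE = {"A": 1e-8, "B": 1e-7, "C": 1e-6, "M": 1e-5, "X": 1e-4}

    def class_to_flux(c):
        try:
            return SCALE[c[0]] * float(c[1:])
        except (TypeError, KeyError, ValueError, IndexError):
            return float("nan")

    nasa["flux"] = nasa["flare_classification"].map(class_to_flux)
    top50_nasa = nasa.nlargest(50, "flux")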

Task 6: Integration (15 pts)

Write a function that finds the best matching row in the NASA data for each of the top 50 solar flares in the SpaceWeatherLive data. Here, you have to decide for yourself how you determine what is the best matching entry in the NASA data for each of the top 50 solar flares. In your submission, include an explanation of how you are defining best matching rows across the two datasets in addition to the code used to find the best matches. Finally, use your function to add a new column to the NASA dataset indicating its rank according to SpaceWeatherLive, if it appears in that dataset.
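
One possible (not the required) notion of "best match" is the NASA row whose start time is closest to the SpaceWeatherLive start time, within some tolerance; continuing the sketches above:

    import numpy as np
    import pandas as pd

    def best_match(flare, nasa, tolerance=pd.Timedelta("12h")):
        gap = (nasa["start_datetime"] - flare["start_datetime"]).abs()
        gap = gap[gap <= tolerance]
        return gap.idxmin() if not gap.empty else None  # index of closest NASA row

    nasa["top50_rank"] = np.nan
    for _, flare in top50.iterrows():       # top50: the tidied Task 2 dataframe
        idx = best_match(flare, nasa)
        if idx is not None:
            nasa.at[idx, "top50_rank"] = flare["rank"]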

Task 7: Attributes visualization (7 pts)

Plot attributes in the NASA dataset (e.g., starting or ending frequencies, flare height or width) over time. Use graphical elements (e.g., text or points) to indicate flares that are in the top 50.
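
For example, a matplotlib sketch for one attribute, reusing the top50_rank column built above (swap in any other attribute or marker you prefer):

    import pandas as pd
    import matplotlib.pyplot as plt

    top = nasa[nasa["top50_rank"].notna()]
    fig, ax = plt.subplots()
    ax.scatter(nasa["start_datetime"],
               pd.to_numeric(nasa["start_frequency"], errors="coerce"),
               s=8, label="all flares")
    ax.scatter(top["start_datetime"],
               pd.to_numeric(top["start_frequency"], errors="coerce"),
               color="red", marker="x", label="top 50")
    ax.set_xlabel("start time")
    ax.set_ylabel("start frequency")
    ax.legend()
    plt.show()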

Task 8: Attributes comparison (8 pts)

Do flares in the top 50 tend to have Halo CMEs? You can make a barplot that compares the number (or proportion) of Halo CMEs in the top 50 flares vs. the dataset as a whole.
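
A sketch of the comparison, reusing the is_halo and top50_rank columns built above:

    import pandas as pd
    import matplotlib.pyplot as plt

    props = pd.Series({
        "top 50": nasa.loc[nasa["top50_rank"].notna(), "is_halo"].mean(),
        "all flares": nasa["is_halo"].mean(),
    })
    props.plot.bar()
    plt.ylabel("proportion of Halo CMEs")
    plt.show()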

Task 9: Events distribution (10 pts)

Do strong flares cluster in time? Plot the number of flares per month over time, and add a graphical element (e.g., text or points) to indicate the number of strong flares (those in the top 50) and see whether they cluster.
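
A sketch of the monthly counts, again reusing the columns built above:

    import matplotlib.pyplot as plt

    monthly = nasa.set_index("start_datetime").resample("M").size()
    strong = (nasa[nasa["top50_rank"].notna()]
              .set_index("start_datetime").resample("M").size())

    ax = monthly.plot(label="all flares")
    ax.plot(strong.index, strong.values, "r.", label="top 50")
    ax.set_ylabel("flares per month")
    ax.legend()
    plt.show()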

Submission

Prepare a .pdf file that includes code, its output, and comments/description for each part, and submit it to Canvas. Comments and descriptions should be at most one sentence each. Make sure to name your file in the format Firstname_Lastname.pdf.

Assignment 2
Out: February 01
Due: February 08 by 5:30pm on Canvas
* Submit to Canvas a .pdf file that includes code, its output, and comments/description for each problem. Comments and descriptions should be at most one sentence each. Make sure to name your file in the format Firstname_Lastname.pdf.

Problem 1: (10 points)
You are given a set of m objects that is divided into K groups, where the i-th group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement.) A small simulation contrasting the two schemes follows the list.
  1. We randomly select n * m_i / m elements from each group.
  2. We randomly select n elements from the data set, without regard for the group to which an object belongs.
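
A sketch of the contrast (the group sizes here are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = {"A": 600, "B": 300, "C": 100}             # hypothetical m_i; m = 1000
    m, n = sum(sizes.values()), 50

    # Scheme 1: exactly n * m_i / m elements from each group
    per_group = {g: round(n * mi / m) for g, mi in sizes.items()}
    print(per_group)                                   # fixed counts per group

    # Scheme 2: n draws ignoring group membership; counts are now random
    labels = np.repeat(list(sizes), list(sizes.values()))
    draw = rng.choice(labels, size=n, replace=True)
    print({g: int((draw == g).sum()) for g in sizes})  # varies run to run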

Problem 2: (10 points)
Download the image hw2_2022_problem4_Face.pgm from the class homework data folder. Find a PCA package and use it to compute eigenvectors and eigenvalues for this image.
  1. (5 points) Compute 2, 5, and 10 principal components and show the original and the reconstructed images.
  2. (5 points) What is the minimal number of principal components needed to retain 80% of data variance?
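
A sketch using scikit-learn's PCA, treating image rows as samples (imageio is just one way to read a .pgm file):

    import numpy as np
    import imageio.v3 as iio
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    img = iio.imread("hw2_2022_problem4_Face.pgm").astype(float)

    # Part 1: reconstruct from 2, 5, and 10 principal components
    for k in (2, 5, 10):
        pca = PCA(n_components=k).fit(img)
        recon = pca.inverse_transform(pca.transform(img))
        plt.imshow(recon, cmap="gray")
        plt.title(f"{k} components")
        plt.show()

    # Part 2: smallest k whose cumulative explained variance reaches 80%
    ratios = PCA().fit(img).explained_variance_ratio_
    k80 = int(np.argmax(np.cumsum(ratios) >= 0.80)) + 1
    print(k80)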

Problem 3: (40 points)
Download the Heart Disease dataset from http://archive.ics.uci.edu/ml/datasets/Heart+Disease. This dataset contains patient information from the Cleveland hospital, where each row represents a patient. The labels are the test results for the presence of the disease, where "0" means no heart disease and 1-4 represent the level of the disease. The dataset contains some missing values, denoted as "?". There are 303 patients in the original dataset and 75 features. The processed version of the dataset has the following attributes (which will be used in this assignment):
  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type:
    • value 1: typical angina
    • value 2: atypical angina
    • value 3: non-anginal pain
    • value 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results:
    • value 0: normal
    • value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  • thalach: maximum heart rate achieved
  • exang: exercise-induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment:
    • value 1: upsloping
    • value 2: flat
    • value 3: downsloping
  • ca: number of major vessels (0-3) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  • num: label (0-4)

Answer the following questions using one of the following programming languages: Python, R, or Java (you cannot use Excel or any equivalent software). A loading/cleaning sketch follows the list:

  1. (5 points) The task associated with this dataset is multiclass classification. Change the problem to binary classification and compute the proportion of each class in the binary case. Is this a balanced dataset?
  2. (5 points) Remove all patients that have any missing values in their records. How many patients do you have now?
  3. (5 points) Now, impute missing values with the mean values of the corresponding attributes. Report how this imputation affected the overall distribution of the corresponding attributes.
  4. (5 points) Draw a scatter plot and explain the relationship between chest pain type and age.
  5. (5 points) How does sex affect having or not having heart disease? Draw a box plot and explain.
  6. Generate 6 random samples (without replacement) of size 50 and answer the following:
    1. (5 points) What is the proportion of each class in each sample? Is each sample a balanced dataset?
    2. (5 points) How does sex affect having or not having heart disease in each sample? Draw a box plot.
  7. (5 points) Compare the results from question 5 with the results from question 6.2.
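
A loading/cleaning sketch (processed.cleveland.data is the processed Cleveland file from the UCI page):

    import pandas as pd

    cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
    heart = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")

    # Question 1: binarize the label (0 = no disease, 1-4 = disease)
    heart["disease"] = (heart["num"] > 0).astype(int)
    print(heart["disease"].value_counts(normalize=True))

    # Question 2: drop patients with any missing value
    print(len(heart.dropna()))

    # Question 3: mean imputation (compare describe() before and after)
    imputed = heart.fillna(heart.mean(numeric_only=True))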

Problem 4: (40 points)
The decision-makers at GymX would like to improve their services using data mining and machine learning techniques to better understand their customers. They have a large database that contains many fields such as customer_id, customer_name, age, sex, height, weight, membership_type, diet_restrictions, and more. The problem is that the database has many missing values, because most customers do not fill in all the necessary fields when they join the gym. This will affect their customer analysis. Help GymX solve their problem. Download the hw2_2024_problem4_GymX.xlsx dataset from the class homework data folder. The dataset contains the following attributes:
  • Customer ID
  • Customer Name
  • Age
  • Sex (male = 1, female = 0)
  • Height in feet
  • Weight in pounds
  • Membership type (adult, youth, or kids)
  1. (10 points) Report the number of missing values in each feature.
  2. (10 points) Describe a naive solution for missing values and use it to solve the missing data problem. What are the advantages/disadvantages of this solution?
  3. (10 points) Propose a better solution and use it to solve the missing data problem.
  4. (10 points) Compare the results of the naive handling of missing data vs. your better solution (a sketch of both imputation approaches follows this list), based on:
    1. (5 points) Plot a histogram of weight for all customers and report mean and standard deviation
    2. (5 points) Create a bar plot that shows the number of customers from each sex from each membership type.
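
A sketch of one naive approach (overall-mean imputation) and one arguably better approach (imputation within membership-type groups); the column names are guesses from the attribute list above, so match them to the actual spreadsheet:

    import pandas as pd

    gym = pd.read_excel("hw2_2024_problem4_GymX.xlsx")

    # 1: missing values per feature
    print(gym.isna().sum())

    # 2: naive solution - fill each numeric column with its overall mean
    naive = gym.fillna(gym.mean(numeric_only=True))

    # 3: a better solution - impute within membership-type groups, since
    # adults, youth, and kids differ systematically in height and weight
    num_cols = gym.select_dtypes("number").columns
    better = gym.copy()
    better[num_cols] = (gym.groupby("Membership type")[num_cols]  # guessed column name
                           .transform(lambda s: s.fillna(s.mean())))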

Assignment 3
Out: February 8
Due: February 15 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission
Number of problems/points: Eight problems for a total of 100 points

Problem 1: (10 points)
The following algorithm aims to find the K nearest neighbors of a data object:
      1: for i = 1 to number of data objects do
      2:      Find the distances of the i-th object to all other objects.
      3:      Sort these distances in decreasing order.
               (Keep track of which object is associated with each distance.)
      4: return the objects associated with the first K distances of the sorted list
      5: end for

(a) (5 points) Describe the potential problems with this algorithm if there are duplicate objects in the data set.
(b) (5 points) How would you fix this problem?

Problem 2: (20 points)

Compute the cosine measure using the word frequencies of the following two sentences:
(a) "The sly fox jumped over the lazy dog."
(b) "The dog jumped at the intruder."

Problem 3: (10 points)
Transform correlation to a similarity measure with [0,1] range that could be used for clustering time series.

Problem 4: (10 points)
Transform correlation to a similarity measure with [0,1] range that could be used for predicting the behavior of one time series given another.

Problem 5: (20 points)
This exercise compares and contrasts some similarity and distance measures.
  1. (10 points) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each organism is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

  2. (10 points) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Problem 6: (10 points)
Donor data consists of 11 records in the following format: Name, Age, Salary, Donor(Y/N). Donor training dataset:
Name     Age  Salary  Donor(Y/N)
Nancy    21   37,000  N
Jim      27   41,000  N
Allen    43   61,000  Y
Jane     38   55,000  N
Steve    44   30,000  N
Peter    51   56,000  Y
Sayani   53   70,000  Y
Lata     56   74,000  Y
Mary     59   25,000  N
Victor   61   68,000  Y
Dale     63   51,000  Y
Compute the Gini index for the entire Donor data set, with respect to the two classes. Compute the Gini index for the portion of the data set with age at least 50.

Problem 7: (10 points)
Repeat the computation of the previous exercise with the use of the entropy criterion. Compute the entropy for the portion of the data set with age greater than 50.
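
For checking the hand computations in Problems 6 and 7, generic helpers might look like this (the hand derivation is still the expected answer):

    import math
    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    donors = list("NNYNNYYYNYY")          # Donor(Y/N) column, in table order
    ages = [21, 27, 43, 38, 44, 51, 53, 56, 59, 61, 63]
    over50 = [d for d, a in zip(donors, ages) if a >= 50]
    print(gini(donors), gini(over50), entropy(over50))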

Problem 8: (10 points)
What is the best classification accuracy that can be obtained on Donor dataset with a decision tree of depth 2, where each test results in a binary split?

Assignment 4
Out: February 15
Due: February 22 by 5:00pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: Four problems for a total of 100 points

Problem 1: (15 points)
In reservoir sampling with a reservoir of size k, the n-th incoming stream data point is inserted into the reservoir with probability k/n, and one of the old k data points is removed from the reservoir at random to make room for the newly arriving point. After n stream points have arrived, prove that the probability of any point being included in the reservoir is k/n.
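
A simulation can illustrate (but not replace) the proof: every point should end up in the reservoir in roughly k/n of the runs.

    import random
    from collections import Counter

    def reservoir_sample(stream, k, rng):
        res = []
        for n, x in enumerate(stream, start=1):
            if n <= k:
                res.append(x)                    # first k points always enter
            elif rng.random() < k / n:           # later points enter w.p. k/n
                res[rng.randrange(k)] = x        # evict a uniformly random slot
        return res

    hits = Counter()
    for trial in range(10_000):
        hits.update(reservoir_sample(range(100), 10, random.Random(trial)))
    # each of the 100 points should appear ~1,000 times (k/n = 10/100)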

Problem 2: (15 points)
Show that the entropy of a node in a decision tree never increases after splitting it into smaller successor nodes.

Problem 3: (60 points)
  1. Develop decision tree classification software on your own in Python, R, or Java and apply it to build a classification model for the Donor dataset from Assignment 3. Since this dataset is small, apply a leave-one-out training and testing framework and report your findings (a harness sketch follows this list).
  2. Now apply this classifier to solve the Mushroom problem defined at https://www.kaggle.com/uciml/mushroom-classification
    Evaluate results using 10-fold cross validation and report your findings.
  3. Compare test set accuracy when training a mushroom classifier using 500 vs. 5,000 training examples.
  4. Report accuracy on training and test data when using a decision tree that has 10 vs 30 leaves.
  5. Report attribute tests used in the decision tree with 10 leaves.
  6. For a decision tree with 10 leaves report the number of positive and negative training examples at each internal node and at leaves.

Problem 4: (10 points)
Discuss the advantages and disadvantages of a nearest neighbor classifier, over a decision tree.

Assignment 5

Out: February 22
Due: February 29 by 5:00pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: Four problems for a total of 100 points

Problem 1: (30 points)
Download Census Income data set from UCI repository - link is https://archive.ics.uci.edu/ml/datasets/census+income
Use the training data to develop a model that determines whether a person makes over 50K a year.
Solve this problem using the k-nearest neighbors method with k=3 and k=9, and report the F1 score on the test data.
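
A sketch of the k-NN part; X_train/X_test are assumed to be numeric matrices (categorical attributes one-hot encoded, e.g., with pd.get_dummies) and y_train/y_test the over-50K indicator:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import f1_score

    for k in (3, 9):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        print(k, f1_score(y_test, knn.predict(X_test)))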

Problem 2: (30 points)
Solve the same problem using a feed-forward neural network and report the ROC on the test data.
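
A sketch with scikit-learn's MLPClassifier, reusing the matrices from Problem 1 (the layer sizes here are arbitrary):

    import matplotlib.pyplot as plt
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_curve, roc_auc_score

    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_train, y_train)
    scores = mlp.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr)
    plt.xlabel("false positive rate")
    plt.ylabel("true positive rate")
    print("AUC:", roc_auc_score(y_test, scores))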

Problem 3: (20 points)
Consider a data set containing four points located at the corners of a square. The two points on one diagonal belong to one class, and the two points on the other diagonal belong to the other class. Is this data set linearly separable? Provide a proof.

Problem 4: (20 points)
  1. Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
  2. Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
  3. Repeat part (b) assuming that the student is a smoker.
  4. Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in a dorm, is he or she more likely to be a graduate or undergraduate student? You can assume that living in a dorm and smoking are conditionally independent given whether the student is a graduate or an undergraduate.
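
The arithmetic for parts (a)-(c) can be checked mechanically with Bayes' rule (the derivation itself is the expected answer):

    # priors and conditionals from the problem statement
    p_grad, p_ug = 0.20, 0.80
    p_smoke_g, p_smoke_u = 0.23, 0.15

    p_smoke = p_grad * p_smoke_g + p_ug * p_smoke_u      # total probability
    p_grad_given_smoke = p_grad * p_smoke_g / p_smoke    # part (a), Bayes' rule
    print(p_grad_given_smoke, 1 - p_grad_given_smoke)    # compare for part (c)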

Assignment 6

Out: February 29
Due: MONDAY March 11 by noon on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: One problem for a total of 50 points.

Problem 1: (50 points)
Propose a ranked list of FIVE Knowledge Discovery and Data Mining topics, one of which you would learn on your own and present as a mini-lecture in class. For each of the 5 selected topics, list the references that you would use to prepare the mini-lecture. Mini-lecture topics and presentation days will be assigned by considering your preferences, but if multiple people rank the same topic as their preference, the topic will be assigned to the one who provides the most convincing references for preparing the presentation.

Mini lectures will be presented on March 28, April 4, and April 11. The order of presentations will be determined based on topics and will be announced on March 14.

The proposed topics for mini-lectures should be different from topics already discussed in class. Each topic should be appropriate for a 20-minute presentation. You can prepare a presentation based on materials from two textbooks, but you are also allowed to use conference tutorial slides, articles, etc. The following are possible topics to consider. You can also propose different topics relevant to Knowledge Discovery and Data Mining:

  • Large-scale hierarchical classification
  • Advanced concepts in cluster analysis
  • Association rules mining
  • Advanced concepts in association analysis
  • Anomaly detection
  • Data stream mining
  • Text and web mining
  • Time series mining
  • Mining big time series
  • Sequence pattern mining
  • Survival analysis
  • Mining spatial data
  • Mining graphs
  • Graphs sketching, sampling, streaming
  • Mining web data
  • Mining social networks
  • Privacy-preserving data mining
  • Mining spatio-temporal data
  • Mining semi-structured data
  • Mining with constraints
  • False discoveries
  • Lifelong machine learning
  • Deep Bayesian mining
  • Data mining for drug discovery
  • Mining electronic health records
  • Data mining in transportation
  • Data mining in power systems
  • Sports analytics
  • Explainable data modeling
  • Active learning
  • Human-in-the-loop learning
  • Visual analytics
  • Fairness-aware machine learning
  • Transfer learning
  • Fake news detection
  • Zero-shot and few-shot learning
  • Mining temporal networks
  • Reinforcement learning
  • Graph neural networks
  • Deep reinforcement learning
  • Deep learning for personalized search and recommender systems
  • A/B testing at scale
  • Parallel and distributed data science (cloud, map-reduce, federated learning)

Assignment 7

Out: February 29
Due: March 14 by 5:00pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: Four problems for a total of 100 points.

Problem 1: (15 points)
  1. Illustrate, on an example, the vanishing gradient problem for a deep neural network (with many hidden layers) that uses a sigmoid activation function.
  2. What is a way to overcome this problem (explain how)?
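
A numeric illustration of part 1: the sigmoid derivative is at most 0.25, so the chain-rule product across many sigmoid layers shrinks exponentially (weights are taken as 1 here for simplicity):

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    z, grad = 0.5, 1.0
    for layer in range(20):
        s = sigmoid(z)
        grad *= s * (1 - s)     # per-layer chain-rule factor, <= 0.25
        z = s                   # feed the activation forward
    print(grad)                 # on the order of 1e-13 after 20 layers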

Problem 2: (15 points)
The leader algorithm represents each cluster using a point, known as a leader, and assigns each point to the cluster corresponding to the closest leader unless this distance is above a user-specified threshold. In that case, the point becomes the leader of a new cluster.
  1. What are the advantages and disadvantages of the leader algorithm as compared to K-means?
  2. Suggest ways in which the leader algorithm might be improved.

Problem 3: (15 points)
Traditional agglomerative hierarchical clustering routines merge two clusters at each step.
  1. Does it seem likely that such an approach accurately captures the (nested) cluster structure of a set of data points?
  2. If not, explain how you might post-process the data to obtain a more accurate view of the cluster structure.

Problem 4: (55 points)
Download and install CLUTO software for clustering high-dimensional data (http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview). Apply this software to cluster the Enron Emails dataset available at https://archive.ics.uci.edu/ml/datasets/Bag+of+Words

  1. Report clustering results when using partitional clustering in the CLUTO package. You are allowed to apply CLUTO on a sample if the data is too large for your computer. In such a case, report the sample size you use and how consistent the results are when you repeat the experiment 3 times on 3 samples of that size.

  2. Report results when using agglomerative clustering algorithms in the CLUTO package. For agglomerative clustering, compare the results when using complete-link vs. single-link merging schemes. Then, for single-link merging, compare the results when using the cosine versus the Euclidean distance function.

Assignment 8

Out: March 14
Due: MONDAY March 25 by 5:00pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.
Number of problems/points: One problem for a total of 50 points

Problem 1: (50 points)

Write a research proposal for the class project that you plan to perform, present progress on April 18 and 25, and submit the project report by May 2.

Teams of two undergraduate students are allowed, as are teams of one undergraduate and one graduate student. Teams of two graduate students are not allowed.

Write the proposal using the following format:
(0) Your name(s) and e-mail address (so that the instructor can quickly approve your topic or ask for a revision/clarification)
(1) Title;
(2) Objective and Significance;
(3) Background;
(4) Proposed Approach (make sure to explain where you will get data and how much preprocessing is needed);
(5) References.
The proposal description may not exceed 2 pages in 12 pt font.

Following are some of the research project topics investigated by Temple KDDM students in previous years:

  1. Early prediction of spatio-temporal events
  2. Applications of Graph Neural Networks for modeling partially observed data
  3. Classifying sports events based on spatio-temporal data of players and ball using unreliable labels
  4. Predictive modeling to forecast store sales
  5. Emotion analysis based on text mining
  6. Clustering NYC 311 requests
  7. MP3 to MIDI conversion via deep learning
  8. Co-Localization of multiple objects in images using activation map
  9. Generalized procedure for selecting methylation CpGs associated with cancer
  10. Classification of vaccine-related tweets using deep learning
  11. Automatically building book indices
  12. Time-series and clustering analysis for systemic lupus erythematosus patients’ study
  13. Identifying complexity of Wikipedia text
  14. Graphlet-assisted structured regression
  15. Wire bonding: Predicting failures for the Ultrasonic Transducer
  16. Exploring bias and variance in supervised learning algorithms
  17. Clustering of gene expression cancer RNA-Seq Data Set
  18. Analysis of online product review
  19. Class imbalance: Credit card fraud analysis
  20. Physician social network and patient outcomes: An empirical investigation
  21. Exploring underlying structures of tweets with URLs via clustering
  22. Dynamic changes of structure of large-scale evolving temporal graphs
  23. Temporal predictive modeling with sample/feature size constraints
  24. Opportunistic routing assisted by decision trees in CRNs
  25. Uncertainty estimation of structured models on evolving graphs
  26. Structured output prediction with spatio-temporal data
  27. Using nonlinear gated-experts for traffic speed forecasting
  28. New clustering schemes to improve the analysis of antibody CDR structure
  29. Missing data, latent variables and PCA
  30. Text mining of evaluations of commercial banks
  31. Health care data mining
  32. Short text data mining and analysis
  33. Label propagation for multi-label prediction
  34. Ready for human-machine cooperation in hierarchical clustering?
  35. Decentralized estimation using learning vector quantization
  36. Batch mode active learning for classification and regression
  37. Inverse active learning modeling in simulated AOD prediction
  38. Pollution prediction using pre-clustering on informative features
  39. Uncertainty estimation for predicted aerosol optical depth
  40. Feature selection for microarray classification
  41. Analysis of gene functional expression profiles using GO semantic similarity
  42. Disease data mining survival prediction based on gene expression data
  43. Using movement data to detect significant regions of infection
  44. Shape matching improvement
  45. Relationships between environmental aspects of police officer’s work, family life and stress
  46. Classification of Basketball Strategies using Spatio-Temporal Data
  47. Sentiment Analysis on Social Media about Covid Vaccines to Analyze Public Reaction
  48. Spotify Playlist Recommender System
  49. Spotify Playlist Recommender System
  50. Fake News Detection and Analysis
  51. Data Mining and Analysis on Tweets Related to Current Events in the Russo-Ukrainian War
  52. A Comparison of Bagging and Boosting for Regression and Classification Tasks
  53. A Formal Framework for Credit Card Fraud Analysis
  54. Cancer pathology stage prediction from gene expression quantification data
  55. The Crowd vs. The Expert; Comparing Ensembles for Eigenface Emotion Classification
  56. Does a Clutch Factor Exist in Basketball?
  57. Forecasting store sales
  58. Forecasting store sales
  59. Classification of Salary by Occupation, Gender, and Other Metrics
  60. Exploring bias and variance of models on animal faces classification
  61. Comparison of Recent Conditional Generative Adversarial Networks Models for Image Translation
  62. Twitter bot detection and classification with sentiment analysis
  63. Mineral Classification from Spectral Data
  64. Comparative Analysis of Missing Data Imputation Techniques for MCAR Data
  65. Predicting Philadelphia Voter Turnout with a Random Forest Model
  66. Topic Clustering of Autism Subreddit Data
  67. Location-Based Species Presence Prediction
  68. Cancer Pathological Stage Prediction using MRNA Gene Expression Data
  69. False Discoveries
  70. Comparing Unsupervised Visual Representation Techniques on Hotel Room Images
  71. How do weather conditions affect users’ engagement in social media?
  72. Comparing Image Classifying Methods
  73. A Comparison of Various Forecasting Models in Predicting Rainfall from Spatio-Temporal Data
  74. Reddit-based graph generation
  75. Sentiment analysis in Text mining
  76. A Machine Learning Framework to Identify Early Alzheimer's Disease
  77. Clustering and Regression of Philadelphia Bike Share Data

Following are some research project topic ideas suggested by authors of the KDDM textbook:

  1. Evaluating Performance of Classifiers
  2. Support Vector Machine (SVM)
  3. Cost-sensitive learning
  4. Semi-supervised learning (classification with labeled and unlabeled data)
  5. Classification for rare-class problems
  6. Time Series Prediction/Classification
  7. Sequence Prediction
  8. Association Rules for Classification
  9. Spatial Association Rule Mining
  10. Temporal Association Rule Mining
  11. Sequential Association Rule Mining
  12. Outlier Detection
  13. Parallel Formulations of Clustering
  14. Clustering of Time Series
  15. Scalable clustering algorithms
  16. Clustering association rules and frequent item sets