CIS 4523/5523: Knowledge Discovery and Data Mining
Spring 2026

Homework Policies (applicable for all assignments):


  1. You are required to do the homework problems in order to pass.
  2. Understandability of the solution is as important as correctness.
  3. The penalty for late homework submissions is 20% per day, so submit on time.
  4. Solutions are expected to be your own work. Group work is not allowed unless explicitly approved for a particular problem. If you obtained a hint with help (e.g., through library research, discussion with another person, etc.), acknowledge your source and write up the solution on your own. Plagiarism and other anti-intellectual behavior will be dealt with severely.

Assignment 1
Out: January 15
Due: January 29 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission.

PROBLEM 1:
Solve the nine tasks described below and submit a report to Canvas as a .pdf file. For each task, include code, its output, and comments/description.

The goal is to provide better data about the top 50 solar flares recorded so far than those shown by SpaceWeatherLive.com. Use this messy NASA data to add more features for the top 50 solar flares. You will need to scrape this information directly from each HTML page. You can read more here about Solar Flares, coronal mass ejections, and the solar flare alphabet soup.

Use any programming language of your choice. Python is recommended (and used for further explanation), but this can be done in R, Java, and other languages as well. A tutorial on Python is available at www.learnpython.org.

PART 1: Data scraping and preparation

Task 1: Scrape your competitor's data (10 pts)
Scrape data for the top 50 solar flares shown on SpaceWeatherLive.com.
The steps (if you are using Python) are as follows; a code sketch appears after the list:

  1. pip install or conda install the following Python packages: beautifulsoup4, requests, pandas, numpy, matplotlib (for visualization)
  2. Use requests to get page content (as in, HTTP GET)
  3. Extract the text from the page
  4. Use BeautifulSoup to read and parse the data, using either the html.parser or the lxml parser
  5. Use prettify() to view the content and find the appropriate table
  6. Use find() to save the aforementioned table as a variable
  7. Use pandas to read in the HTML table. HINT: make sure the above data is properly typecast.
  8. Set reasonable names for the table columns, e.g., rank, x_class, date, region, start_time, max_time, end_time, movie. DataFrame.columns makes this very simple.
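
For reference, here is a minimal Python sketch of these steps. The URL, the User-Agent header, and the assumption that the top-50 table is the first <table> on the page are placeholders to adapt:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # URL is an assumption -- use the actual top-50 page from the assignment
    url = "https://www.spaceweatherlive.com/en/solar-activity/top-50-solar-flares"
    # some sites reject the default client, so send a browser-like User-Agent
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(response.text, "lxml")   # parse the HTML

    table = soup.find("table")          # assumes the first <table> is the top-50 list
    df = pd.read_html(str(table))[0]    # pandas handles most of the typecasting

    # set workable column names
    df.columns = ["rank", "x_class", "date", "region",
                  "start_time", "max_time", "end_time", "movie"]
    print(df.head())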

The result should be a data frame, with the first few rows as:

Dimension: 50 × 8

     rank  x_class  date        region  start_time  max_time  end_time  movie
 1      1  X28.0    2003/11/04  0486    19:29       19:53     20:06     MovieView archive
 2      2  X20      2001/04/02  9393    21:32       21:51     22:03     MovieView archive
 3      3  X17.2    2003/10/28  0486    09:51       11:10     11:24     MovieView archive
 4      4  X17.0    2005/09/07  0808    17:17       17:40     18:03     MovieView archive
 5      5  X14.4    2001/04/15  9415    13:19       13:50     13:55     MovieView archive
 6      6  X10.0    2003/10/29  0486    20:37       20:49     21:01     MovieView archive
 7      7  X9.4     1997/11/06  -       11:49       11:55     12:01     MovieView archive
 8      8  X9.0     2006/12/05  0930    10:18       10:35     10:45     MovieView archive
 9      9  X8.3     2003/11/02  0486    17:03       17:25     17:39     MovieView archive
10     10  X7.1     2005/01/20  0720    06:36       07:01     07:26     MovieView archive
... with 40 more rows

Task 2: Tidy the top 50 solar flare data (10 pts)

Make this table usable with pandas (a code sketch follows the list):

  1. Drop the last column of the table, since we are not going to use it moving forward.
  2. Use the datetime module to combine the date column and each of the three time columns into three datetime columns. You will see why this is useful later on. iterrows() should prove useful here.
  3. Update the values in the dataframe as you do this. DataFrame.at (the older set_value is deprecated) should prove useful.
  4. Set regions coded as "-" as missing (NaN). You can use DataFrame.replace() here.
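
A minimal sketch of these steps, assuming df is the frame from Task 1 (the list above suggests iterrows(), but a vectorized pd.to_datetime is equivalent and simpler):

    import numpy as np
    import pandas as pd

    df = df.drop(columns=["movie"])     # 1. drop the movie column

    # 2-3. combine the date with each time column into a datetime column
    for col in ["start_time", "max_time", "end_time"]:
        df[col.replace("_time", "_datetime")] = pd.to_datetime(
            df["date"].astype(str) + " " + df[col].astype(str),
            format="%Y/%m/%d %H:%M")
    df = df.drop(columns=["date", "start_time", "max_time", "end_time"])

    # 4. regions coded as "-" become missing
    df["region"] = df["region"].replace("-", np.nan)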

The result of this step should be a data frame with the first few rows as:

A dataframe: 50 × 6

     rank  x_class  start_datetime       max_datetime         end_datetime         region
 1      1  X28.0    2003-11-04 19:29:00  2003-11-04 19:53:00  2003-11-04 20:06:00  0486
 2      2  X20      2001-04-02 21:32:00  2001-04-02 21:51:00  2001-04-02 22:03:00  9393
 3      3  X17.2    2003-10-28 09:51:00  2003-10-28 11:10:00  2003-10-28 11:24:00  0486
 4      4  X17.0    2005-09-07 17:17:00  2005-09-07 17:40:00  2005-09-07 18:03:00  0808
 5      5  X14.4    2001-04-15 13:19:00  2001-04-15 13:50:00  2001-04-15 13:55:00  9415
 6      6  X10.0    2003-10-29 20:37:00  2003-10-29 20:49:00  2003-10-29 21:01:00  0486
 7      7  X9.4     1997-11-06 11:49:00  1997-11-06 11:55:00  1997-11-06 12:01:00  <NA>
 8      8  X9.0     2006-12-05 10:18:00  2006-12-05 10:35:00  2006-12-05 10:45:00  0930
 9      9  X8.3     2003-11-02 17:03:00  2003-11-02 17:25:00  2003-11-02 17:39:00  0486
10     10  X7.1     2005-01-20 06:36:00  2005-01-20 07:01:00  2005-01-20 07:26:00  0720
... with 40 more rows

Task 3: Scrape the NASA data (15 pts)

Next, you need to scrape the NASA data to get additional features about these solar flares. The table format is described here.

Once scraped, do the next steps (a code sketch follows the list):

  1. Use BeautifulSoup functions (e.g., find, find_all) and string functions (e.g., split and built-in slicing) to obtain each row of data as a long string.
  2. Use the split function to separate each line of text into a data row.
  3. Create a DataFrame with the data from the table.
  4. Choose appropriate names for columns.
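
A minimal sketch under the assumption that the NASA page serves the table as fixed-format text inside a single <pre> block; the URL and the row filter are placeholders to adapt:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # URL is an assumption -- use the NASA page linked in the assignment
    url = "https://cdaw.gsfc.nasa.gov/CME_list/radio/waves_type2.html"
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    lines = soup.find("pre").get_text().split("\n")   # one long string per row

    columns = ["start_date", "start_time", "end_date", "end_time",
               "start_frequency", "end_frequency", "flare_location",
               "flare_region", "flare_classification", "cme_date",
               "cme_time", "cme_angle", "cme_width", "cme_speed"]

    rows = []
    for line in lines:
        fields = line.split()
        # keep only data rows, which start with a year like "1997/04/01"
        if len(fields) >= len(columns) and fields[0][:4].isdigit():
            rows.append(fields[:len(columns)])

    nasa = pd.DataFrame(rows, columns=columns)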

The result of this step should be similar to:

Dimension: 482 × 14

     start_date  start_time  end_date  end_time  start_frequency  end_frequency  flare_location  flare_region
   * <chr>       <chr>       <chr>     <chr>     <chr>            <chr>          <chr>           <chr>
 1   1997/04/01  14:00       04/01     14:15     8000             4000           S25E16          8026
 2   1997/04/07  14:30       04/07     17:30     11000            1000           S28E19          8027
 3   1997/05/12  05:15       05/14     16:00     12000            80             N21W08          8038
 4   1997/05/21  20:20       05/21     22:00     5000             500            N05W12          8040
 5   1997/09/23  21:53       09/23     22:16     6000             2000           S29E25          8088
 6   1997/11/03  05:15       11/03     12:00     14000            250            S20W13          8100
 7   1997/11/03  10:30       11/03     11:30     14000            5000           S16W21          8100
 8   1997/11/04  06:00       11/05     04:30     14000            100            S14W33          8100
 9   1997/11/06  12:20       11/07     08:30     14000            100            S18W63          8100
10   1997/11/27  13:30       11/27     14:00     14000            7000           N17E63          8113
... with 472 more rows, and 6 more variables: flare_classification <chr>, cme_date <chr>, cme_time <chr>, cme_angle <chr>, cme_width <chr>, cme_speed <chr>

Task 4: Tidy the NASA table (15 pts)

Here we will code missing observations properly, recode columns that correspond to more than one piece of information, and treat dates and times appropriately (a code sketch follows the list).

  1. Recode any missing entries as NaN. Refer to the data description to see how missing entries are encoded in each column. Be sure to look carefully at the actual data, as the NASA descriptions might not be completely accurate.
  2. The CPA column (cme_angle) contains angles in degrees for most rows, except for halo flares, which are coded as Halo. Create a new column that indicates if a row corresponds to a halo flare or not, and then replace Halo entries in the cme_angle column with NaN.
  3. Some entries in the width column are marked as lower bounds. Create a new column that indicates whether the width is given as a lower bound, and remove any non-numeric part of the width column.
  4. Combine date and time columns for start, end and cme so they can be encoded as datetime objects.
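
A minimal sketch of these steps, assuming nasa is the frame from Task 3; the missing-value codes below are placeholders that must be taken from the data description and from inspecting the actual rows:

    import numpy as np
    import pandas as pd

    # 2. flag halo CMEs, then blank the "Halo" entries
    nasa["is_halo"] = nasa["cme_angle"] == "Halo"
    nasa["cme_angle"] = nasa["cme_angle"].replace("Halo", np.nan)

    # 3. flag lower-bound widths (e.g., ">360"), then keep the numeric part
    nasa["width_lower_bound"] = nasa["cme_width"].str.startswith(">")
    nasa["cme_width"] = pd.to_numeric(nasa["cme_width"].str.lstrip(">"),
                                      errors="coerce")

    # 1. placeholder missing-value codes -- replace with the real ones
    nasa = nasa.replace(["----", "--:--", "--/--"], np.nan)

    # 4. combine dates and times; end_date lacks the year, so borrow it from
    #    start_date (events spanning New Year would need extra care)
    nasa["start_datetime"] = pd.to_datetime(
        nasa["start_date"] + " " + nasa["start_time"], errors="coerce")
    nasa["end_datetime"] = pd.to_datetime(
        nasa["start_date"].str[:5] + nasa["end_date"] + " " + nasa["end_time"],
        errors="coerce")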

The output of this step should be similar to this:

   start_datetime       end_datetime         start_frequency  end_frequency  flare_location  flare_region  importance  cme_datetime         cpa  width  speed  plot  is_halo  width_lower_bound
0  1997-04-01 14:00:00  1997-04-01 14:15:00  8000             4000           S25E16          8026          M1.3        1997-04-01 15:18:00  74   79     312    PHTX  False    False
1  1997-04-07 14:30:00  1997-04-07 17:30:00  11000            1000           S28E19          8027          C6.8        1997-04-07 14:27:00  NaN  360    878    PHTX  True     False
2  1997-05-12 05:15:00  1997-05-14 16:00:00  12000            80             N21W08          8038          C1.3        1997-05-12 05:30:00  NaN  360    464    PHTX  True     False
3  1997-05-21 20:20:00  1997-05-21 22:00:00  5000             500            N05W12          8040          M1.3        1997-05-21 21:00:00  263  165    296    PHTX  False    False
4  1997-09-23 21:53:00  1997-09-23 22:16:00  6000             2000           S29E25          8088          C1.4        1997-09-23 22:02:00  133  155    712    PHTX  False    False
5  1997-11-03 05:15:00  1997-11-03 12:00:00  14000            250            S20W13          8100          C8.6        1997-11-03 05:28:00  240  109    227    PHTX  False    False

PART 2: Analysis

Now that you have data from both sites, let’s start some analysis.

Task 5: Replication (10 pts)

Replicate as much as possible of the top 50 solar flare table in SpaceWeatherLive.com using the data obtained from NASA. If you get the top 50 solar flares from the NASA table based on their classification (e.g., X28 is the highest), do you get data for the same solar flare events? Include the code used to get the top 50 solar flares from the NASA table (be careful when ordering by classification; see the sketch below). Write a sentence or two discussing how well you can replicate the SpaceWeatherLive data from the NASA data.
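
The ordering pitfall: classifications like "X9.4" and "X28.0" compare incorrectly as strings. A minimal sketch of one way to rank them numerically, assuming the Task 3/4 column names:

    import pandas as pd

    xflares = nasa[nasa["flare_classification"].str.startswith("X", na=False)].copy()
    # strip the leading "X" and rank by the numeric part
    xflares["magnitude"] = pd.to_numeric(
        xflares["flare_classification"].str[1:], errors="coerce")
    top50_nasa = xflares.sort_values("magnitude", ascending=False).head(50)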

Task 6: Integration (15 pts)

Write a function that finds the best matching row in the NASA data for each of the top 50 solar flares in the SpaceWeatherLive data. Here, you have to decide for yourself how you determine the best matching entry in the NASA data for each of the top 50 solar flares. In your submission, include an explanation of how you are defining best matching rows across the two datasets in addition to the code used to find the best matches. Finally, use your function to add a new column to the NASA dataset indicating each row's rank according to SpaceWeatherLive, if it appears in that dataset (a sketch of one possible criterion follows).
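
A sketch of one plausible matching criterion (an assumption, not the only valid choice): pick the NASA row whose start time is closest to the flare's start time. Here df is the tidy top-50 frame from Task 2 and nasa is the tidy NASA frame from Task 4:

    import numpy as np

    def best_match(flare, nasa):
        # index of the NASA row with the nearest start time
        return (nasa["start_datetime"] - flare["start_datetime"]).abs().idxmin()

    nasa["rank"] = np.nan
    for _, flare in df.iterrows():
        nasa.loc[best_match(flare, nasa), "rank"] = flare["rank"]

A stricter definition might also require the classification or flare region to agree; explain whichever rule you choose.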

Task 7: Attributes visualization (7 pts)

Plot attributes in the NASA dataset (e.g., starting or ending frequencies, flare height or width) over time. Use graphical elements (e.g., text or points) to indicate flares that are in the top 50.
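
A minimal matplotlib sketch, assuming the rank column added in Task 6 marks the top-50 rows:

    import pandas as pd
    import matplotlib.pyplot as plt

    freq = pd.to_numeric(nasa["start_frequency"], errors="coerce")
    top = nasa["rank"].notna()

    fig, ax = plt.subplots()
    ax.scatter(nasa["start_datetime"], freq, s=10, label="all flares")
    ax.scatter(nasa.loc[top, "start_datetime"], freq[top],
               color="red", s=40, label="top 50")
    ax.set_xlabel("start time")
    ax.set_ylabel("start frequency")
    ax.legend()
    plt.show()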

Task 8: Attributes comparison (8 pts)

Do flares in the top 50 tend to have Halo CMEs? You can make a barplot that compares the number (or proportion) of Halo CMEs in the top 50 flares vs. the dataset as a whole.
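
A minimal sketch of such a barplot, again assuming the is_halo and rank columns from Tasks 4 and 6:

    import matplotlib.pyplot as plt

    top = nasa["rank"].notna()
    proportions = [nasa["is_halo"].mean(),           # whole dataset
                   nasa.loc[top, "is_halo"].mean()]  # top 50 only
    plt.bar(["all flares", "top 50"], proportions)
    plt.ylabel("proportion of Halo CMEs")
    plt.show()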

Task 9: Events distribution (10 pts)

Do strong flares cluster in time? Plot the number of flares per month over time, and add a graphical element (e.g., text or points) to indicate the number of strong flares (those in the top 50) to see if they cluster.
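
A minimal sketch using pandas resampling, under the same column assumptions as above:

    import matplotlib.pyplot as plt

    monthly = nasa.set_index("start_datetime").resample("M").size()
    strong = (nasa[nasa["rank"].notna()]
              .set_index("start_datetime").resample("M").size())

    ax = monthly.plot(label="all flares")
    strong.plot(ax=ax, style="r.", label="top 50")
    ax.set_ylabel("flares per month")
    ax.legend()
    plt.show()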

Submission

Prepare a .pdf file that includes code, its output, and comments/description for each part, and submit it to Canvas. Comments and descriptions should be at most one sentence each. Make sure to name your file in the format Firstname_Lastname.pdf.

Assignment 2
Out: January 29
Due: February 05 by 5:30pm on Canvas
* Submit to Canvas a .pdf file that includes code, its output, and comments/description for each problem. Comments and descriptions should be at most one sentence each. Make sure to name your file in the format Firstname_Lastname.pdf.

Problem 1: (10 points)
You are given a set of m objects that is divided into K groups, where the i-th group is of size m_i. If the goal is to obtain a sample of size n < m, what is the difference between the following two sampling schemes? (Assume sampling with replacement; a sketch contrasting the two schemes follows the list.)
  1. We randomly select n * m_i / m elements from each group.
  2. We randomly select n elements from the data set, without regard for the group to which an object belongs.
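
A toy sketch contrasting the two schemes (the group sizes are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    data = np.arange(100)               # m = 100 objects
    groups = [data[:60], data[60:]]     # K = 2 groups of sizes 60 and 40
    m, n = 100, 10

    # scheme 1: n * m_i / m draws from each group (stratified)
    scheme1 = np.concatenate(
        [rng.choice(g, size=int(n * len(g) / m), replace=True) for g in groups])

    # scheme 2: n draws from the pooled data, ignoring groups
    scheme2 = rng.choice(data, size=n, replace=True)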

Problem 2: (10 points)
Download the image hw2_Face.pbm from the class homework data folder. Find a PCA package and use it to compute eigenvectors and eigenvalues for this image (a code sketch follows the questions).
  1. (5 points) Compute 2, 5, and 10 principal components and show original and the resulting images.
  2. (5 points) What is the minimal number of principal components needed to retain 80% of data variance?
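
A minimal sketch using scikit-learn's PCA, treating image rows as samples (reading a .pbm via matplotlib requires the Pillow package):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    img = plt.imread("hw2_Face.pbm").astype(float)

    # part 1: reconstruct the image from 2, 5, and 10 components
    for k in [2, 5, 10]:
        pca = PCA(n_components=k)
        restored = pca.inverse_transform(pca.fit_transform(img))
        plt.imshow(restored, cmap="gray")
        plt.title(f"{k} principal components")
        plt.show()

    # part 2: smallest k whose cumulative explained variance reaches 80%
    full = PCA().fit(img)
    k80 = int(np.argmax(np.cumsum(full.explained_variance_ratio_) >= 0.8)) + 1
    print(k80)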

Problem 3: (40 points)
Download the Heart Disease dataset from https://archive.ics.uci.edu/ml/datasets/Heart+Disease. This dataset contains patient information from a Cleveland hospital, where each row represents a patient. Labels are the test results for the presence of the disease, where "0" means no heart disease and 1-4 represent the level of the disease. The dataset contains some missing values, and these values are denoted as "?". There are 303 patients in the original dataset and 75 features. The processed version of the dataset has the following attributes (which will be used in this assignment):
  • age: age in years
  • sex: sex (1 = male; 0 = female)
  • cp: chest pain type:
    • value 1: typical angina
    • value 2: atypical angina
    • value 3: non-anginal pain
    • value 4: asymptomatic
  • trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  • chol: serum cholesterol in mg/dl
  • fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg: resting electrocardiographic results:
    • value 0: normal
    • value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  • thalach: maximum heart rate achieved
  • exang: exercise-induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: the slope of the peak exercise ST segment:
    • value 1: upsloping
    • value 2: flat
    • value 3: downsloping
  • ca: number of major vessels (0-3) colored by fluoroscopy
  • thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  • num: Label 0 - 4

Answer the following questions using one of the following programming languages: Python, R, or Java (you cannot use Excel or any equivalent software). A code sketch for loading the data and handling missing values follows the questions:

  1. (5 points) The task associated with this dataset is multiclass classification. Change the problem to binary classification and compute the proportion of each class in the binary case. Is this a balanced dataset?
  2. (5 points) Remove all patients who have any missing values in their records. How many patients do you have now?
  3. (5 points) Now, impute missing values with the mean values of the corresponding attributes. Report how this imputation affected the overall distribution of the corresponding attributes.
  4. (5 points) Draw a scatter plot and explain the relationship between chest pain type and age.
  5. (5 points) How does sex affect having or not having a heart disease? Draw a box plot and explain.
  6. Generate 6 random samples (without replacement) of size 50 and answer the following:
    1. (5 points) What is the proportion of each class in each sample? Is each sample a balanced dataset?
    2. (5 points) How does sex affect having or not having a heart disease in each sample? Draw a box plot.
  7. (5 points) Compare the results from question 5 with the results from question 6.2.
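
A sketch of loading the data and handling missing values (the file name follows the UCI archive's processed Cleveland file; adjust the path to your download):

    import pandas as pd

    cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
            "exang", "oldpeak", "slope", "ca", "thal", "num"]
    df = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")

    # question 1: binarize the label (0 = no disease, 1-4 -> 1 = disease)
    df["disease"] = (df["num"] > 0).astype(int)
    print(df["disease"].value_counts(normalize=True))

    # question 2: drop patients with any missing value
    print(len(df.dropna()))

    # question 3: mean imputation instead
    imputed = df.fillna(df.mean())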

Problem 4: (40 points)
The decision-makers at GymX would like to improve their services using data mining and machine learning techniques to better understand their customers. They have a large database with many fields, including customer_id, customer_name, age, sex, height, weight, membership_type, diet_restrictions, and more. The problem is that the database has a lot of missing data, because most customers do not fill in all the required fields when they join the gym. This problem will affect their customer analysis. Help GymX solve their problem. Download the hw2_GymX.xlsx dataset from the class homework data folder (a code sketch follows the questions below). The dataset contains the following attributes:
  • Customer ID
  • Customer Name
  • Age
  • Sex (male = 1, female = 0)
  • Height in feet
  • Weight in pounds
  • Membership type (adult, youth, or kids)
  1. (10 points) Report the number of missing values in each feature.
  2. (10 points) Describe a naive solution for missing values and use it to solve the missing data problem. What are the advantages/disadvantages of this solution?
  3. (10 points) Propose a better solution and use it to solve the missing data problem.
  4. (10 points) Compare the results of the naïve handling of missing data vs. your better solution based on the following:
    1. (5 points) Plot a histogram of weight for all customers and report mean and standard deviation
    2. (5 points) Create a bar plot showing the number of customers by sex and membership type.
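
A sketch of the two imputation strategies (reading .xlsx requires the openpyxl package; the exact column names in the file are assumptions):

    import pandas as pd

    gym = pd.read_excel("hw2_GymX.xlsx")
    print(gym.isna().sum())     # question 1: missing values per feature

    # naive solution: impute every numeric column with its overall mean
    naive = gym.fillna(gym.mean(numeric_only=True))

    # one better solution: impute within membership-type groups, since kids,
    # youths, and adults differ systematically in age, height, and weight
    better = gym.copy()
    for col in ["Age", "Height", "Weight"]:
        better[col] = (better.groupby("Membership type")[col]
                             .transform(lambda s: s.fillna(s.mean())))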

Assignment 3
Out: February 5
Due: February 12 by 5:30pm on Canvas
*Please write your name and TUID at the top of your CANVAS submission
Number of problems/points: Eight problems for a total of 100 points

Problem 1: (10 points)
The following algorithm aims to find the K nearest neighbors of a data object:
      1: for i = 1 to number of data objects do
      2:      Find the distances of the i-th object to all other objects.
      3:      Sort these distances in decreasing order.
               (Keep track of which object is associated with each distance.)
      4: return the objects associated with the first K distances of the sorted list
      5: end for

(a) (5 points) Describe the potential problems with this algorithm if there are duplicate objects in the data set.
(b) (5 points) How would you fix this problem?

Problem 2: (20 points)

Compute the cosine measure using the term frequencies of the following two sentences (a generic helper sketch follows the sentences):
(a) "The sly fox jumped over the lazy dog."
(b) "The dog jumped at the intruder."

Problem 3: (10 points)
Transform correlation to a similarity measure with [0,1] range that could be used for clustering time series.

Problem 4: (10 points)
Transform correlation to a similarity measure with [0,1] range that could be used for predicting the behavior of one time series given another.

Problem 5: (20 points)
This exercise compares and contrasts some similarity and distance measures.
  1. (10 points) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different between two binary vectors. The Jaccard similarity is a measure of the similarity between two binary vectors. Suppose that you are comparing how similar two organisms of different species are in terms of the number of genes they share. Describe which measure, Hamming or Jaccard, would be more appropriate for comparing the genetic makeup of two organisms. Explain. (Assume that each animal is represented as a binary vector, where each attribute is 1 if a particular gene is present in the organism and 0 otherwise.)

  2. (10 points) If you wanted to compare the genetic makeup of two organisms of the same species, e.g., two human beings, would you use the Hamming distance, the Jaccard coefficient, or a different measure of similarity or distance? Explain. (Note that two human beings share > 99.9% of the same genes.)

Problem 6: (10 points)
Donor data consists of 11 records in the following format: Name Age Salary Donor(Y/N). Donor training dataset:
Name    Age  Salary  Donor(Y/N)
Nancy   21   37,000  N
Jim     27   41,000  N
Allen   43   61,000  Y
Jane    38   55,000  N
Steve   44   30,000  N
Peter   51   56,000  Y
Sayani  53   70,000  Y
Lata    56   74,000  Y
Mary    59   25,000  N
Victor  61   68,000  Y
Dale    63   51,000  Y
Compute the Gini index for the entire Donor data set, with respect to the two classes. Compute the Gini index for the portion of the data set with age at least 50.

Problem 7: (10 points)
Repeat the computation of the previous exercise with the use of the entropy criterion. Compute the entropy for the portion of the data set with age greater than 50.
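
For Problems 6 and 7, generic impurity helpers may be useful; this sketch encodes the standard definitions (Gini = 1 minus the sum of squared class proportions; entropy = minus the sum of p * log2 p):

    import math
    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    donors = list("NNYNNYYYNYY")   # the Donor labels in table order
    print(gini(donors), entropy(donors))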

Problem 8: (10 points)
What is the best classification accuracy that can be obtained on the Donor dataset with a decision tree of depth 2, where each test results in a binary split?

Assignment 4
No assignment yet.
Assignment 5
No assignment yet.
Assignment 6
No assignment yet.
Assignment 7
No assignment yet.
Assignment 8
No assignment yet.