The goal is to provide better data about the top 50 solar flares recorded so far than the data shown by SpaceWeatherLive.com. Use this messy NASA data to add more features for the top 50 solar flares. You need to scrape this information directly from each HTML page. You can read more here about solar flares, coronal mass ejections, and the solar flare alphabet soup.
Use any programming language of your choice. Python is recommended (and used for further explanation), but this can be done in R, Java, and other languages as well. A tutorial on Python is available at www.learnpython.org.
PART 1: Data scraping and preparation
Task 1: Scrape your competitor's data (10 pts)
Scrape data for the top 50 solar flares shown in SpaceWeatherLive.com.
Steps to do this are (if you are using Python):
The result should be a data frame, with the first few rows as:
Dimension: 50 × 8
rank x_class date region start_time max_time end_time movie
1 1 X28.0 2003/11/04 0486 19:29 19:53 20:06 MovieView archive
2 2 X20 2001/04/02 9393 21:32 21:51 22:03 MovieView archive
3 3 X17.2 2003/10/28 0486 09:51 11:10 11:24 MovieView archive
4 4 X17.0 2005/09/07 0808 17:17 17:40 18:03 MovieView archive
5 5 X14.4 2001/04/15 9415 13:19 13:50 13:55 MovieView archive
6 6 X10.0 2003/10/29 0486 20:37 20:49 21:01 MovieView archive
7 7 X9.4 1997/11/06 - 11:49 11:55 12:01 MovieView archive
8 8 X9.0 2006/12/05 0930 10:18 10:35 10:45 MovieView archive
9 9 X8.3 2003/11/02 0486 17:03 17:25 17:39 MovieView archive
10 10 X7.1 2005/01/20 0720 06:36 07:01 07:26 MovieView archive
... with 40 more rows
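A minimal sketch of the parsing step, assuming the page has already been downloaded as a string and that the flares sit in an ordinary HTML `<table>` with one `<td>` per column. The sample markup, the class name `TableRowParser`, and the helper `parse_table_rows` are hypothetical; the real page layout may differ and should be inspected first.

```python
from html.parser import HTMLParser

# Column names taken from the expected output shown above.
COLUMNS = ["rank", "x_class", "date", "region",
           "start_time", "max_time", "end_time", "movie"]


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (headers, whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row:
            # Rows without any <td> (for example a header row) are skipped.
            self.rows.append(self._row)
            self._row = None


def parse_table_rows(html):
    """Return one dict per table row, keyed by COLUMNS."""
    parser = TableRowParser()
    parser.feed(html)
    return [dict(zip(COLUMNS, cells)) for cells in parser.rows]


# Hypothetical markup that mimics a single row of the table.
sample_html = """
<table>
  <tr><th>Rank</th><th>Class</th></tr>
  <tr><td>1</td><td>X28.0</td><td>2003/11/04</td><td>0486</td>
      <td>19:29</td><td>19:53</td><td>20:06</td>
      <td><a href="#">Movie</a><a href="#">View archive</a></td></tr>
</table>
"""
flares = parse_table_rows(sample_html)
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(flares)` to obtain the data frame. Note that the two links in the last cell collapse into the single string `MovieView archive`, which is the value shown in the `movie` column above.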
Task 2: Tidy the top 50 solar flare data (10 pts)
Make sure this table is usable using pandas:
The result of this step should be a data frame with the first few rows as:
A dataframe: 50 × 6
rank x_class start_datetime max_datetime end_datetime region
1 1 X28.0 2003-11-04 19:29:00 2003-11-04 19:53:00 2003-11-04 20:06:00 0486
2 2 X20 2001-04-02 21:32:00 2001-04-02 21:51:00 2001-04-02 22:03:00 9393
3 3 X17.2 2003-10-28 09:51:00 2003-10-28 11:10:00 2003-10-28 11:24:00 0486
4 4 X17.0 2005-09-07 17:17:00 2005-09-07 17:40:00 2005-09-07 18:03:00 0808
5 5 X14.4 2001-04-15 13:19:00 2001-04-15 13:50:00 2001-04-15 13:55:00 9415
6 6 X10.0 2003-10-29 20:37:00 2003-10-29 20:49:00 2003-10-29 21:01:00 0486
7 7 X9.4 1997-11-06 11:49:00 1997-11-06 11:55:00 1997-11-06 12:01:00 <NA>
8 8 X9.0 2006-12-05 10:18:00 2006-12-05 10:35:00 2006-12-05 10:45:00 0930
9 9 X8.3 2003-11-02 17:03:00 2003-11-02 17:25:00 2003-11-02 17:39:00 0486
10 10 X7.1 2005-01-20 06:36:00 2005-01-20 07:01:00 2005-01-20 07:26:00 0720
... with 40 more rows
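Comparing the two printouts suggests three changes: the `movie` column is dropped, the date is combined with each time into a datetime, and the `-` region becomes a missing value. The sketch below shows one way to do this with pandas on two rows copied from the table; it is an illustration under those assumptions, not the required solution.

```python
import pandas as pd

# Two rows copied from the Task 1 output above.
raw = pd.DataFrame({
    "rank": [1, 7],
    "x_class": ["X28.0", "X9.4"],
    "date": ["2003/11/04", "1997/11/06"],
    "region": ["0486", "-"],
    "start_time": ["19:29", "11:49"],
    "max_time": ["19:53", "11:55"],
    "end_time": ["20:06", "12:01"],
    "movie": ["MovieView archive", "MovieView archive"],
})

# Drop the column that carries no information.
tidy = raw.drop(columns=["movie"])

# Combine the shared date with each of the three times.
for part in ["start", "max", "end"]:
    tidy[f"{part}_datetime"] = pd.to_datetime(
        tidy["date"] + " " + tidy[f"{part}_time"],
        format="%Y/%m/%d %H:%M",
    )

# Code the "-" placeholder as a missing value.
tidy["region"] = tidy["region"].mask(tidy["region"] == "-")

# Keep only the six columns of the expected output, in that order.
tidy = tidy[["rank", "x_class", "start_datetime",
             "max_datetime", "end_datetime", "region"]]
```

Keeping `region` as a string preserves the leading zero in values such as `0486`.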
Task 3: Scrape the NASA data (15 pts)
Next, you need to scrape NASA data to get additional features about these solar flares. This table format is described here.
Once scraped, do the next steps:
The result of this step should be similar to:
Dimension: 482 × 14
start_date start_time end_date end_time start_frequency end_frequency flare_location flare_region
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1997/04/01 14:00 04/01 14:15 8000 4000 S25E16 8026
2 1997/04/07 14:30 04/07 17:30 11000 1000 S28E19 8027
3 1997/05/12 05:15 05/14 16:00 12000 80 N21W08 8038
4 1997/05/21 20:20 05/21 22:00 5000 500 N05W12 8040
5 1997/09/23 21:53 09/23 22:16 6000 2000 S29E25 8088
6 1997/11/03 05:15 11/03 12:00 14000 250 S20W13 8100
7 1997/11/03 10:30 11/03 11:30 14000 5000 S16W21 8100
8 1997/11/04 06:00 11/05 04:30 14000 100 S14W33 8100
9 1997/11/06 12:20 11/07 08:30 14000 100 S18W63 8100
10 1997/11/27 13:30 11/27 14:00 14000 7000 N17E63 8113
... with 472 more rows, and 6 more variables: flare_classification <chr>, cme_date <chr>, cme_time <chr>, cme_angle <chr>, cme_width <chr>, cme_speed <chr>
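A minimal sketch of turning the NASA text into rows, assuming that each event occupies one whitespace-separated line starting with a `YYYY/MM/DD` date and that the first 14 fields correspond to the 14 columns named above. The sample line is reconstructed from the example rows in this handout, and the helper names `parse_nasa_line` and `parse_nasa_text` are hypothetical; check both assumptions against the actual page.

```python
import re

# Column names taken from the expected output above.
NASA_COLUMNS = [
    "start_date", "start_time", "end_date", "end_time",
    "start_frequency", "end_frequency", "flare_location", "flare_region",
    "flare_classification", "cme_date", "cme_time",
    "cme_angle", "cme_width", "cme_speed",
]

# Assumption: data lines begin with a date such as 1997/04/01.
DATA_LINE = re.compile(r"^\d{4}/\d{2}/\d{2}\s")


def parse_nasa_line(line):
    """Split one data line and keep the first 14 fields as strings."""
    fields = line.split()
    return dict(zip(NASA_COLUMNS, fields[:len(NASA_COLUMNS)]))


def parse_nasa_text(text):
    """Parse every data line in a block of text, skipping everything else."""
    return [parse_nasa_line(line)
            for line in text.splitlines()
            if DATA_LINE.match(line)]


# Hypothetical excerpt: one header-like line and one data line.
sample_text = """Start            End
1997/04/01 14:00 04/01 14:15 8000 4000 S25E16 8026 M1.3 04/01 15:18 74 79 312 PHTX
"""
rows = parse_nasa_text(sample_text)
```

All values are kept as strings at this stage, matching the `<chr>` column types in the expected output; conversion to numbers and datetimes belongs to the tidying step.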
Task 4: Tidy the NASA table (15 pts)
Here we will code missing observations properly, recode columns that correspond to more than one piece of information, and treat dates and times appropriately.
The output of this step should be similar to this:
start_datetime end_datetime start_frequency end_frequency flare_location flare_region importance cme_datetime cpa width speed plot is_halo width_lower_bound
0 1997-04-01 14:00:00 1997-04-01 14:15:00 8000 4000 S25E16 8026 M1.3 1997-04-01 15:18:00 74 79 312 PHTX False False
1 1997-04-07 14:30:00 1997-04-07 17:30:00 11000 1000 S28E19 8027 C6.8 1997-04-07 14:27:00 NaN 360 878 PHTX True False
2 1997-05-12 05:15:00 1997-05-14 16:00:00 12000 80 N21W08 8038 C1.3 1997-05-12 05:30:00 NaN 360 464 PHTX True False
3 1997-05-21 20:20:00 1997-05-21 22:00:00 5000 500 N05W12 8040 M1.3 1997-05-21 21:00:00 263 165 296 PHTX False False
4 1997-09-23 21:53:00 1997-09-23 22:16:00 6000 2000 S29E25 8088 C1.4 1997-09-23 22:02:00 133 155 712 PHTX False False
5 1997-11-03 05:15:00 1997-11-03 12:00:00 14000 250 S20W13 8100 C8.6 1997-11-03 05:28:00 240 109 227 PHTX False False
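The recoding can be sketched with pandas as follows. The raw encodings used here are assumptions that must be verified against the NASA page: that a halo CME is marked by the text `Halo` in the angle column, that a width given as a lower bound carries a leading `>`, and that missing entries are runs of dashes. The third row is made up purely to exercise the last two cases.

```python
import pandas as pd

# First two rows follow the examples above; the third row is hypothetical.
raw = pd.DataFrame({
    "start_date": ["1997/04/01", "1997/04/07", "2000/01/01"],
    "start_time": ["14:00", "14:30", "00:00"],
    "end_date": ["04/01", "04/07", "01/01"],
    "end_time": ["14:15", "17:30", "01:00"],
    "cme_angle": ["74", "Halo", "----"],
    "cme_width": ["79", "360", ">120"],
})

tidy = pd.DataFrame(index=raw.index)
fmt = "%Y/%m/%d %H:%M"

# The end date has no year, so borrow it from the start date.
year = raw["start_date"].str[:4]
tidy["start_datetime"] = pd.to_datetime(
    raw["start_date"] + " " + raw["start_time"], format=fmt)
tidy["end_datetime"] = pd.to_datetime(
    year + "/" + raw["end_date"] + " " + raw["end_time"], format=fmt)

# One column, two pieces of information: halo flag and position angle.
tidy["is_halo"] = raw["cme_angle"] == "Halo"
tidy["cpa"] = pd.to_numeric(raw["cme_angle"], errors="coerce")

# Same idea for the width: lower-bound flag and numeric value.
tidy["width_lower_bound"] = raw["cme_width"].str.startswith(">")
tidy["width"] = pd.to_numeric(raw["cme_width"].str.lstrip(">"),
                              errors="coerce")
```

With `errors="coerce"`, anything that is not a number (such as `Halo` or a dash placeholder) becomes `NaN`. Borrowing the year from the start date gives a wrong result for an event that runs past New Year, so such rows would need a separate check.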
PART 2: Analysis
Now that you have data from both sites, let’s start some analysis.
Task 5: Replication (10 pts)
Replicate as much as possible of the top 50 solar flare table on SpaceWeatherLive.com using the data obtained from NASA. If you take the top 50 solar flares from the NASA table based on their classification (e.g., X28 is the highest), do you get data for the same solar flare events? Include the code used to get the top 50 solar flares from the NASA table (be careful when ordering by classification). Write a sentence or two discussing how well you can replicate the SpaceWeatherLive data from the NASA data.
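The warning about ordering matters because classifications are strings: sorted as text, `X9.4` comes before `X28.0`. One possible fix, sketched below, is to split each label into its class letter and its numeric magnitude and sort on that pair. The helper name `class_key` is hypothetical, and the sketch assumes every label is a letter from A, B, C, M, X followed by a number; missing or malformed labels would have to be filtered out first.

```python
# Flare class letters from weakest to strongest.
CLASS_ORDER = {"A": 0, "B": 1, "C": 2, "M": 3, "X": 4}


def class_key(label):
    """Sort key: (rank of the class letter, numeric magnitude)."""
    letter, magnitude = label[0], float(label[1:])
    return (CLASS_ORDER[letter], magnitude)


# Labels taken from the tables in this handout.
labels = ["X9.4", "M1.3", "X28.0", "C6.8", "X17.2"]

naive = sorted(labels, reverse=True)                  # plain string sort
ranked = sorted(labels, key=class_key, reverse=True)  # letter, then number

print(naive)   # ['X9.4', 'X28.0', 'X17.2', 'M1.3', 'C6.8']
print(ranked)  # ['X28.0', 'X17.2', 'X9.4', 'M1.3', 'C6.8']
```

In a data frame, the same two components can be stored as helper columns and used to sort before taking the first 50 rows.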
Task 6: Integration (15 pts)
Write a function that finds the best matching row in the NASA data for each of the top 50 solar flares in the SpaceWeatherLive data. Here, you have to decide for yourself how you determine what is the best matching entry in the NASA data for each of the top 50 solar flares. In your submission, include an explanation of how you are defining best matching rows across the two datasets in addition to the code used to find the best matches. Finally, use your function to add a new column to the NASA dataset indicating its rank according to SpaceWeatherLive, if it appears in that dataset.
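As one possible definition (not the only reasonable one), the best match could be the NASA row whose start time is closest to the flare's start time, provided the gap stays under some tolerance. The sketch below implements just that; the function name `best_match` and the 3-hour tolerance are hypothetical choices, and a fuller solution might also compare region or classification.

```python
from datetime import datetime, timedelta


def best_match(flare_start, nasa_starts, max_gap=timedelta(hours=3)):
    """Index of the NASA start time closest to flare_start, or None.

    Rows further away than max_gap (an arbitrary tolerance) are never
    accepted as a match.
    """
    best_index, best_gap = None, None
    for i, start in enumerate(nasa_starts):
        gap = abs(start - flare_start)
        if gap <= max_gap and (best_gap is None or gap < best_gap):
            best_index, best_gap = i, gap
    return best_index


# Three NASA start times copied from the Task 3 example rows.
nasa_starts = [
    datetime(1997, 11, 4, 6, 0),
    datetime(1997, 11, 6, 12, 20),
    datetime(1997, 11, 27, 13, 30),
]

# The X9.4 flare (rank 7) started at 1997-11-06 11:49.
x94 = best_match(datetime(1997, 11, 6, 11, 49), nasa_starts)

# The X28.0 flare (rank 1) has no candidate among these three rows.
x28 = best_match(datetime(2003, 11, 4, 19, 29), nasa_starts)
```

Here `x94` is `1` (a gap of 31 minutes) and `x28` is `None`. Running the function for each of the 50 flares and writing the flare's rank into the matched NASA row gives the new rank column, with unmatched rows left missing.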
Task 7: Attributes visualization (7 pts)
Plot attributes in the NASA dataset (e.g., starting or ending frequencies, flare height or width) over time. Use graphical elements (e.g., text or points) to indicate flares in the top 50 flares.
Task 8: Attributes comparison (8 pts)
Do flares in the top 50 tend to have Halo CMEs? You can make a barplot that compares the number (or proportion) of Halo CMEs in the top 50 flares vs. the dataset as a whole.
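The numbers behind such a barplot can be computed as sketched below. The data is a made-up toy table, and the column name `swl_rank` (the SpaceWeatherLive rank, missing when the flare is not in the top 50) is a hypothetical name for the column added in Task 6.

```python
import pandas as pd

nan = float("nan")

# Toy data for illustration only; swl_rank is a hypothetical column name.
nasa = pd.DataFrame({
    "is_halo": [True, False, True, False, False, True],
    "swl_rank": [1, nan, 7, nan, nan, nan],
})

# A row belongs to the top 50 if it received a rank.
in_top50 = nasa["swl_rank"].notna()

# The mean of a boolean column is the proportion of True values.
halo_share = pd.Series({
    "top 50": nasa.loc[in_top50, "is_halo"].mean(),
    "all flares": nasa["is_halo"].mean(),
})
```

With matplotlib installed, `halo_share.plot.bar()` draws the comparison. Proportions are usually easier to compare than raw counts here because the two groups differ greatly in size.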
Task 9: Events distribution (10 pts)
Do strong flares cluster in time? Plot the number of flares per month over time, and add a graphical element (e.g., text or points) to indicate the number of strong flares (in the top 50) to see if they cluster.
Prepare a .pdf file that includes code, its output, and comments/description for each part and submit to Canvas. Comments and descriptions should be up to 1 sentence. Make sure to name your file in the format Firstname_Lastname.pdf.
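The monthly counts can be obtained by grouping on the month of the start time, as in the sketch below. The start times are copied from the Task 3 example rows; the `swl_rank` values are illustrative only, and the column name is a hypothetical name for the rank column added in Task 6.

```python
import pandas as pd

nan = float("nan")

nasa = pd.DataFrame({
    "start_datetime": pd.to_datetime([
        "1997-04-01 14:00", "1997-04-07 14:30", "1997-05-12 05:15",
        "1997-11-03 05:15", "1997-11-03 10:30", "1997-11-04 06:00",
        "1997-11-06 12:20",
    ]),
    # Illustrative ranks: only the last row is treated as a top 50 flare.
    "swl_rank": [nan, nan, nan, nan, nan, nan, 7],
})

# Reduce each timestamp to its calendar month.
month = nasa["start_datetime"].dt.to_period("M").rename("month")

# "size" counts all rows; "count" counts only rows with a non-missing rank.
per_month = nasa.groupby(month).agg(
    flares=("start_datetime", "size"),
    top50=("swl_rank", "count"),
)
```

Months without any flare do not appear in `per_month`, so a plot over a continuous time axis may need those months filled in with zeros. The `top50` column can then be drawn as points or text on top of the `flares` series.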
Answer the following questions using one of the following programming languages Python, R, or Java (you cannot use excel or any equivalent software):
Name | Age | Salary | Donor(Y/N) |
---|---|---|---|
Nancy | 21 | 37,000 | N |
Jim | 27 | 41,000 | N |
Allen | 43 | 61,000 | Y |
Jane | 38 | 55,000 | N |
Steve | 44 | 30,000 | N |
Peter | 51 | 56,000 | Y |
Sayani | 53 | 70,000 | Y |
Lata | 56 | 74,000 | Y |
Mary | 59 | 25,000 | N |
Victor | 61 | 68,000 | Y |
Dale | 63 | 51,000 | Y |
Mini lectures will be presented on March 28, April 4, and April 11. The order of presentations will be determined based on topics and will be announced on March 14.
The proposed topics for mini-lectures should be different from topics already discussed in class. Each topic should be appropriate for a 20-minute presentation. You can prepare a presentation based on materials from two textbooks, but you are also allowed to use conference tutorial slides, articles, etc. The following are possible topics to consider. You can also propose different topics relevant to Knowledge Discovery and Data Mining:
Write a research proposal for the class project that you plan to perform, present progress on April 18 and 25, and submit the project report by May 2.
Teams of two undergraduate students, or of one undergraduate and one graduate student, are allowed. Teams of two graduate students are not allowed.
Write the proposal using the following format:
(0) Your name(s) and e-mail address (so that the instructor can quickly approve your topic or ask for a revision/clarification);
(1) Title;
(2) Objective and Significance;
(3) Background;
(4) Proposed Approach (make sure to explain where you will get data and how much preprocessing is needed);
(5) References.
The proposal description may not exceed 2 pages in 12 pt style.
Following are some of the research project topics investigated by Temple KDDM students in previous years:
Following are some research project topic ideas suggested by authors of the KDDM textbook: