Tutorial (Structured Data Processing)#

(Last updated: Jan 24, 2025)1

This tutorial will familiarize you with the data science pipeline of processing structured data, using a real-world example of building models to predict and explain the presence of bad smell events in Pittsburgh based on air quality and weather data. The models are used to send push notifications about bad smell events to inform citizens, as well as to explain local pollution patterns to inform stakeholders.

The scenario is in the next section of this tutorial, and more details are in the introduction section of the Smell Pittsburgh paper. We will use the same dataset as used in the Smell Pittsburgh paper as an example of structured data. During this tutorial, we will explain what the variables in the dataset mean and also guide you through model building. Below is the pipeline of this tutorial.

../_images/smellpgh-pipeline.png

You can use the following link to jump to the tasks and assignments:

Task 4: Preprocess Sensor Data
- Assignment for Task 4
Task 5: Preprocess Smell Data
- Assignment for Task 5
Task 6: Prepare Features and Labels
- Assignment for Task 6
Task 7: Train and Evaluate Models

Scenario#

Local citizens in Pittsburgh are organizing communities to advocate for changes in air pollution regulations. Their goal is to investigate the air pollution patterns in the city to understand the potential sources related to the bad odor. The communities rely on the Smell Pittsburgh application (as indicated in the figure below) to collect smell reports from citizens that live in the Pittsburgh region. Also, there are air quality and weather monitoring stations in the Pittsburgh city region that provide sensor measurements, including common air pollutants and wind information.

../_images/smellpgh-ui.png

You work in a data science team to develop models to map the sensor data to bad smell events. Your team has been working with the Pittsburgh local citizens closely for a long time, and therefore you know the meaning of each variable in the feature set that is used to train the machine learning model. The Pittsburgh community needs your help timely to analyze the data that can help them present evidence of air pollution to the municipality and explain the patterns to the general public.

Import Packages#

Important

You need to install the packages in this link in your Python development environment.

We put all the packages that are needed for this tutorial below:

Set the global style of all Pandas table output:

import pandas as pd
import numpy as np
import seaborn as sns
from os.path import isfile, join
from os import listdir
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import plotly.express as px

# Import the answers for the tasks
from util.answer import (
   answer_preprocess_sensor,
   answer_preprocess_smell,
   answer_sum_current_and_future_data,
   answer_experiment
)

# Import the utility functions that are provided
from util.util import (
    check_answer_df,
    compute_feature_importance,
    train_and_evaluate,
    plot_smell_by_day_and_hour,
    convert_wind_direction,
    insert_previous_data_to_cols,
    get_pca_result
)

from IPython.core.display import HTML

def set_global_df_style():
    styles = """
    <style>
        .dataframe * {font-size: 1em !important;}
    </style>
    """
    return HTML(styles)

# Call this function in a notebook cell to apply the style globally
set_global_df_style()

Task Answers#

Click on one of the following links to check answers for the assignments in this tutorial. Do not check the answers before practicing the tasks.

Click this for task answers if you open this notebook on your local machine
Click this for task answers if you view this notebook on a web browser

Utility File#

Click on one of the following links to check the provided functions in the utility file for this tutorial.

Click this for utility functions if you open this notebook on your local machine
Click this for utility functions if you view this notebook on a web browser

Task 4: Preprocess Sensor Data#

In this task, we will process the sensor data from various air quality monitoring stations in Pittsburgh. First, we need to load all the sensor data.

path = "smellpgh-v1/esdr_raw"
list_of_files = [f for f in listdir(path) if isfile(join(path, f))]
sensor_raw_list = []
for f in list_of_files:
    sensor_raw_list.append(pd.read_csv(join(path, f)).set_index("EpochTime"))

Now, the sensor_raw_list variable contains all the data frames with sensor values from different air quality monitoring stations. Noted that sensor_raw_list is an array of data frames. We can print one of them to take a look, as shown below.

sensor_raw_list[0]

	3.feed_1.SO2_PPM	3.feed_1.H2S_PPM	3.feed_1.SIGTHETA_DEG	3.feed_1.SONICWD_DEG	3.feed_1.SONICWS_MPH
EpochTime
1477891800	0.0	0.0	51.7	343.0	3.6
1477895400	0.0	0.0	52.7	351.0	3.5
1477899000	0.0	0.0	52.6	359.0	3.4
1477902600	0.0	0.0	48.3	5.0	2.1
1477906200	0.0	0.0	31.1	41.0	2.2
...	...	...	...	...	...
1538267400	0.0	0.0	35.2	39.0	1.7
1538271000	0.0	0.0	48.2	53.0	1.3
1538274600	0.0	0.0	30.9	62.0	1.5
1538278200	0.0	0.0	21.5	53.0	2.0
1538281800	0.0	0.0	52.1	36.0	1.7

16729 rows × 5 columns

The EpochTime index is the timestamp in epoch time, which means the number of seconds that have elapsed since January 1st, 1970 (midnight UTC/GMT). Other columns mean the sensor data from an air quality monitoring station.

Next, we need to resample and merge all the sensor data frames so that they can be used for modeling. Our goal is to have a dataframe that looks like the following:

df_sensor = answer_preprocess_sensor(sensor_raw_list)
df_sensor

	3.feed_1.SO2_PPM	3.feed_1.H2S_PPM	3.feed_1.SIGTHETA_DEG	3.feed_1.SONICWD_DEG	3.feed_1.SONICWS_MPH	3.feed_23.CO_PPM	3.feed_23.PM10_UG_M3	3.feed_29.PM10_UG_M3	3.feed_29.PM25_UG_M3	3.feed_11067.CO_PPB..3.feed_43.CO_PPB	...	3.feed_3.SO2_PPM	3.feed_3.SONICWD_DEG	3.feed_3.SONICWS_MPH	3.feed_3.SIGTHETA_DEG	3.feed_3.PM10B_UG_M3	3.feed_5975.PM2_5	3.feed_27.NO_PPB	3.feed_27.NOY_PPB	3.feed_27.CO_PPB	3.feed_27.SO2_PPB
EpochTime
2016-10-31 06:00:00+00:00	0.0	0.0	51.7	343.0	3.6	0.2	7.0	8.0	8.0	159.5	...	0.0	344.0	2.9	43.0	9.0	0.0	0.1	2.6	-1.0	0.2
2016-10-31 07:00:00+00:00	0.0	0.0	52.7	351.0	3.5	0.2	8.0	8.0	8.0	-1.0	...	0.0	330.0	2.5	43.6	13.0	5.0	-1.0	-1.0	106.1	0.0
2016-10-31 08:00:00+00:00	0.0	0.0	52.6	359.0	3.4	0.2	5.0	7.0	7.0	133.0	...	0.0	0.0	3.1	40.9	7.0	9.0	0.2	2.1	105.8	-1.0
2016-10-31 09:00:00+00:00	0.0	0.0	48.3	5.0	2.1	0.2	3.0	4.0	4.0	236.6	...	0.0	325.0	1.9	40.0	11.0	3.0	0.1	3.1	111.7	0.0
2016-10-31 10:00:00+00:00	0.0	0.0	31.1	41.0	2.2	0.2	5.0	5.0	4.0	269.3	...	0.0	347.0	1.4	45.1	10.0	9.0	0.1	2.5	127.2	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2018-09-30 01:00:00+00:00	0.0	0.0	35.2	39.0	1.7	0.3	20.0	15.0	10.0	455.1	...	0.0	39.0	1.3	57.3	11.0	5.0	0.1	12.1	301.0	0.0
2018-09-30 02:00:00+00:00	0.0	0.0	48.2	53.0	1.3	0.3	25.0	19.0	12.0	761.2	...	0.0	70.0	1.0	54.4	21.0	7.0	0.2	13.5	357.7	0.0
2018-09-30 03:00:00+00:00	0.0	0.0	30.9	62.0	1.5	0.4	23.0	55.0	33.0	1125.4	...	0.0	75.0	0.7	59.5	33.0	8.0	0.6	13.8	373.6	0.0
2018-09-30 04:00:00+00:00	0.0	0.0	21.5	53.0	2.0	0.5	23.0	63.0	39.0	1039.5	...	0.0	74.0	0.7	50.7	32.0	17.0	0.6	11.4	381.0	0.0
2018-09-30 05:00:00+00:00	0.0	0.0	52.1	36.0	1.7	0.4	25.0	47.0	33.0	1636.3	...	0.0	65.0	0.4	77.1	27.0	18.0	2.0	13.1	377.7	0.0

16776 rows × 43 columns

In the expected output above, the EpochTime index is converted from timestamps into pandas datetime objects, which has the format year-month-day hour:minute:second+timezone. The +00:00 string means the GMT/UTC timezone. Other columns mean the average value of the sensor data in the previous hour. For example, 2016-10-31 06:00:00+00:00 means October 31 in 2016 at 6AM UTC time, and the cell with column 3.feed_1.SO2_PPM means the averaged SO2 (sulfur dioxide) values from 5:00 to 6:00.

The column name suffix SO2_PPM means sulfur dioxide in unit PPM (parts per million). The prefix 3.feed_1. in the column name means a specific sensor (feed ID 1). You can ignore the 3. at the begining of the column name. You can find the list of sensors, their names with feed ID (which will be in the data frame columns), and also the meaning of all the suffixes from this link.

Some column names look like 3.feed_11067.X..3.feed_43.X. This means that the column has data from two sensor stations (feed ID 11067 and 43). The reason is that some sensor stations are replaced by the new ones over time. So in this case, we merge sensor readings from both feed ID 11067 and 43.

Assignment for Task 4#

Your task (which is your assignment) is to write a function to do the following:

Sensors can report in various frequencies. So, for each data frame, we need to resample the data by computing the hourly average of sensor measurements from the “previous” hour. For example, at time 8:00, we want to know the average of sensor values between 7:00 and 8:00.
- Hint: Use the pandas.to_datetime function when converting timestamps to datetime objects. Type ?pd.to_datetime in a code cell for more information.
- Hint: Use the pandas.DataFrame.resample function to resample data. Type ?pd.DataFrame.resample in a code cell for more information.
- Hint: Use the pandas.merge_ordered function when merging data frames. Type ?pd.merge_ordered in a code cell for more information.
Then, merge all the data frames based on their time stamp, which is the EpochTime column.
Finally, fill in the missing data with the value -1. The reason for not using 0 is that we want the model to know if sensors give values (including zero) or no data.
- Hint: Use the pandas.DataFrame.fillna function when treating missing values. Type ?pd.DataFrame.fillna in a code cell for more information.
Notice that you need to convert the column types in the dataframe to float to avoid errors when checking if the answers match the desired output. You can use the pandas.DataFrame.astype function to do this df = df.astype(float).

def preprocess_sensor(df_list):
    """
    Preprocess sensor data.

    Parameters
    ----------
    df_list : list of pandas.DataFrame
        A list of data frames that contain sensor data from multiple stations.

    Returns
    -------
    pandas.DataFrame
        The preprocessed sensor data.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The code below tests if the output of your function matches the expected output.

check_answer_df(preprocess_sensor(sensor_raw_list), df_sensor, n=1)

Test case 1 failed.

Your output is:
None

Expected output is:
                           3.feed_1.SO2_PPM  3.feed_1.H2S_PPM  \
EpochTime                                                       
2016-10-31 06:00:00+00:00               0.0               0.0   
2016-10-31 07:00:00+00:00               0.0               0.0   
2016-10-31 08:00:00+00:00               0.0               0.0   
2016-10-31 09:00:00+00:00               0.0               0.0   
2016-10-31 10:00:00+00:00               0.0               0.0   
...                                     ...               ...   
2018-09-30 01:00:00+00:00               0.0               0.0   
2018-09-30 02:00:00+00:00               0.0               0.0   
2018-09-30 03:00:00+00:00               0.0               0.0   
2018-09-30 04:00:00+00:00               0.0               0.0   
2018-09-30 05:00:00+00:00               0.0               0.0   

                           3.feed_1.SIGTHETA_DEG  3.feed_1.SONICWD_DEG  \
EpochTime                                                                
2016-10-31 06:00:00+00:00                   51.7                 343.0   
2016-10-31 07:00:00+00:00                   52.7                 351.0   
2016-10-31 08:00:00+00:00                   52.6                 359.0   
2016-10-31 09:00:00+00:00                   48.3                   5.0   
2016-10-31 10:00:00+00:00                   31.1                  41.0   
...                                          ...                   ...   
2018-09-30 01:00:00+00:00                   35.2                  39.0   
2018-09-30 02:00:00+00:00                   48.2                  53.0   
2018-09-30 03:00:00+00:00                   30.9                  62.0   
2018-09-30 04:00:00+00:00                   21.5                  53.0   
2018-09-30 05:00:00+00:00                   52.1                  36.0   

                           3.feed_1.SONICWS_MPH  3.feed_23.CO_PPM  \
EpochTime                                                           
2016-10-31 06:00:00+00:00                   3.6               0.2   
2016-10-31 07:00:00+00:00                   3.5               0.2   
2016-10-31 08:00:00+00:00                   3.4               0.2   
2016-10-31 09:00:00+00:00                   2.1               0.2   
2016-10-31 10:00:00+00:00                   2.2               0.2   
...                                         ...               ...   
2018-09-30 01:00:00+00:00                   1.7               0.3   
2018-09-30 02:00:00+00:00                   1.3               0.3   
2018-09-30 03:00:00+00:00                   1.5               0.4   
2018-09-30 04:00:00+00:00                   2.0               0.5   
2018-09-30 05:00:00+00:00                   1.7               0.4   

                           3.feed_23.PM10_UG_M3  3.feed_29.PM10_UG_M3  \
EpochTime                                                               
2016-10-31 06:00:00+00:00                   7.0                   8.0   
2016-10-31 07:00:00+00:00                   8.0                   8.0   
2016-10-31 08:00:00+00:00                   5.0                   7.0   
2016-10-31 09:00:00+00:00                   3.0                   4.0   
2016-10-31 10:00:00+00:00                   5.0                   5.0   
...                                         ...                   ...   
2018-09-30 01:00:00+00:00                  20.0                  15.0   
2018-09-30 02:00:00+00:00                  25.0                  19.0   
2018-09-30 03:00:00+00:00                  23.0                  55.0   
2018-09-30 04:00:00+00:00                  23.0                  63.0   
2018-09-30 05:00:00+00:00                  25.0                  47.0   

                           3.feed_29.PM25_UG_M3  \
EpochTime                                         
2016-10-31 06:00:00+00:00                   8.0   
2016-10-31 07:00:00+00:00                   8.0   
2016-10-31 08:00:00+00:00                   7.0   
2016-10-31 09:00:00+00:00                   4.0   
2016-10-31 10:00:00+00:00                   4.0   
...                                         ...   
2018-09-30 01:00:00+00:00                  10.0   
2018-09-30 02:00:00+00:00                  12.0   
2018-09-30 03:00:00+00:00                  33.0   
2018-09-30 04:00:00+00:00                  39.0   
2018-09-30 05:00:00+00:00                  33.0   

                           3.feed_11067.CO_PPB..3.feed_43.CO_PPB  ...  \
EpochTime                                                         ...   
2016-10-31 06:00:00+00:00                                  159.5  ...   
2016-10-31 07:00:00+00:00                                   -1.0  ...   
2016-10-31 08:00:00+00:00                                  133.0  ...   
2016-10-31 09:00:00+00:00                                  236.6  ...   
2016-10-31 10:00:00+00:00                                  269.3  ...   
...                                                          ...  ...   
2018-09-30 01:00:00+00:00                                  455.1  ...   
2018-09-30 02:00:00+00:00                                  761.2  ...   
2018-09-30 03:00:00+00:00                                 1125.4  ...   
2018-09-30 04:00:00+00:00                                 1039.5  ...   
2018-09-30 05:00:00+00:00                                 1636.3  ...   

                           3.feed_3.SO2_PPM  3.feed_3.SONICWD_DEG  \
EpochTime                                                           
2016-10-31 06:00:00+00:00               0.0                 344.0   
2016-10-31 07:00:00+00:00               0.0                 330.0   
2016-10-31 08:00:00+00:00               0.0                   0.0   
2016-10-31 09:00:00+00:00               0.0                 325.0   
2016-10-31 10:00:00+00:00               0.0                 347.0   
...                                     ...                   ...   
2018-09-30 01:00:00+00:00               0.0                  39.0   
2018-09-30 02:00:00+00:00               0.0                  70.0   
2018-09-30 03:00:00+00:00               0.0                  75.0   
2018-09-30 04:00:00+00:00               0.0                  74.0   
2018-09-30 05:00:00+00:00               0.0                  65.0   

                           3.feed_3.SONICWS_MPH  3.feed_3.SIGTHETA_DEG  \
EpochTime                                                                
2016-10-31 06:00:00+00:00                   2.9                   43.0   
2016-10-31 07:00:00+00:00                   2.5                   43.6   
2016-10-31 08:00:00+00:00                   3.1                   40.9   
2016-10-31 09:00:00+00:00                   1.9                   40.0   
2016-10-31 10:00:00+00:00                   1.4                   45.1   
...                                         ...                    ...   
2018-09-30 01:00:00+00:00                   1.3                   57.3   
2018-09-30 02:00:00+00:00                   1.0                   54.4   
2018-09-30 03:00:00+00:00                   0.7                   59.5   
2018-09-30 04:00:00+00:00                   0.7                   50.7   
2018-09-30 05:00:00+00:00                   0.4                   77.1   

                           3.feed_3.PM10B_UG_M3  3.feed_5975.PM2_5  \
EpochTime                                                            
2016-10-31 06:00:00+00:00                   9.0                0.0   
2016-10-31 07:00:00+00:00                  13.0                5.0   
2016-10-31 08:00:00+00:00                   7.0                9.0   
2016-10-31 09:00:00+00:00                  11.0                3.0   
2016-10-31 10:00:00+00:00                  10.0                9.0   
...                                         ...                ...   
2018-09-30 01:00:00+00:00                  11.0                5.0   
2018-09-30 02:00:00+00:00                  21.0                7.0   
2018-09-30 03:00:00+00:00                  33.0                8.0   
2018-09-30 04:00:00+00:00                  32.0               17.0   
2018-09-30 05:00:00+00:00                  27.0               18.0   

                           3.feed_27.NO_PPB  3.feed_27.NOY_PPB  \
EpochTime                                                        
2016-10-31 06:00:00+00:00               0.1                2.6   
2016-10-31 07:00:00+00:00              -1.0               -1.0   
2016-10-31 08:00:00+00:00               0.2                2.1   
2016-10-31 09:00:00+00:00               0.1                3.1   
2016-10-31 10:00:00+00:00               0.1                2.5   
...                                     ...                ...   
2018-09-30 01:00:00+00:00               0.1               12.1   
2018-09-30 02:00:00+00:00               0.2               13.5   
2018-09-30 03:00:00+00:00               0.6               13.8   
2018-09-30 04:00:00+00:00               0.6               11.4   
2018-09-30 05:00:00+00:00               2.0               13.1   

                           3.feed_27.CO_PPB  3.feed_27.SO2_PPB  
EpochTime                                                       
2016-10-31 06:00:00+00:00              -1.0                0.2  
2016-10-31 07:00:00+00:00             106.1                0.0  
2016-10-31 08:00:00+00:00             105.8               -1.0  
2016-10-31 09:00:00+00:00             111.7                0.0  
2016-10-31 10:00:00+00:00             127.2                0.0  
...                                     ...                ...  
2018-09-30 01:00:00+00:00             301.0                0.0  
2018-09-30 02:00:00+00:00             357.7                0.0  
2018-09-30 03:00:00+00:00             373.6                0.0  
2018-09-30 04:00:00+00:00             381.0                0.0  
2018-09-30 05:00:00+00:00             377.7                0.0  

[16776 rows x 43 columns]

Task 5: Preprocess Smell Data#

In this task, we will preprocess the smell data. First, we need to load the raw smell data.

smell_raw = pd.read_csv("smellpgh-v1/smell_raw.csv").set_index("EpochTime")
smell_raw

	feelings_symptoms	smell_description	smell_value	zipcode
EpochTime
1477935134	NaN	NaN	1	15221
1477956180	NaN	Woodsmoke	2	15218
1477956293	NaN	Wood smoke	3	15218
1477973293	Eye irritation, nose burns, headache, woke me up	Industrial	5	15207
1478001989	NaN	Industrial smoke	2	15213
...	...	...	...	...
1538248172	NaN	Sour sewage	3	15213
1538255258	Coughing	Smoke	2	15104
1538268796	NaN	Like a burning candle	3	15232
1538281653	No	Greasy air	3	15222
1538282980	Eye irritation	Wood smoke and gun powder very Smokey too as f...	5	15217

10353 rows × 4 columns

The meaning of EpochTime is explained in the previous task. Other columns mean the self-reported symptoms, descriptions of smell, severity ratings (the smell_value column), and the zipcode where the report is submitted in Pittsburgh, Pennsylvania. For example, the second row means that the smell report was submitted from the 15218 zipcode with wood smoke description and severity rating 2. For more description about the smell, please check the Smell Pittsburgh website.

Next, we need to resample the smell data so that they can be used for modeling. Our goal is to have a dataframe that looks like the following:

df_smell = answer_preprocess_smell(smell_raw)
df_smell

	smell_value
EpochTime
2016-10-31 23:00:00+00:00	3.0
2016-11-01 00:00:00+00:00	0.0
2016-11-01 01:00:00+00:00	0.0
2016-11-01 02:00:00+00:00	0.0
2016-11-01 03:00:00+00:00	0.0
...	...
2018-09-30 00:00:00+00:00	3.0
2018-09-30 01:00:00+00:00	0.0
2018-09-30 02:00:00+00:00	0.0
2018-09-30 03:00:00+00:00	0.0
2018-09-30 04:00:00+00:00	8.0

16758 rows × 1 columns

In the latest row, the timestamp is 2018-09-30 04:00:00+00:00, which means this row contains the data from 3:00 to 4:00 on September 30 in 2018. This row has smell_value 8, which means the sum of smell report ratings in the above mentioned time range. Notice that the expected output ignores all smell ratings from 1 to 2. This is becasue we only want the ratings that indicate bad smell, which will be further explained below.

Assignment for Task 5#

Your task (which is your assignment) is to write a function to do the following:

First, remove the feelings_symptoms, smell_description, and zipcode columns since we do not need them.
- Hint: Use the pandas.DataFrame.drop function. Type ?pd.DataFrame.drop in a code cell for more information.
We only want the reports that indicate bad smell. You need to select only the reports with rating 3, 4, or 5 in the smell_value column.
Then, we want to know the severity of bad smell within an hour in the future. For example, at time 8:00, we want to know the sum of smell values between 8:00 and 9:00. So you need to resample the data by computing the hourly sum of smell values from the “future” hour.
- Hint: Use the pandas.to_datetime function when converting timestamps to datetime objects. Type ?pd.to_datetime in a code cell for more information.
- Hint: Use the pandas.DataFrame.resample function to resample data. Type ?pd.DataFrame.resample in a code cell for more information.
Finally, fill in the missing data with the value 0. The reason is that missing data means there are no smell reports (provided by citizens) within an hour, so we assume that there is no bad smell within this period of time. Notice that this is an assumption and also a limitation since citizens rarely report good smell.
- Hint: Use the pandas.DataFrame.fillna function when treating missing values. Type ?pd.DataFrame.fillna in a code cell for more information.
Notice that you need to convert the column types in the dataframe to float to avoid errors when checking if the answers match the desired output. You can use the pandas.DataFrame.astype function to do this df = df.astype(float).

def preprocess_smell(df):
    """
    Preprocess smell data.

    Parameters
    ----------
    df : pandas.DataFrame
        The raw smell reports data.

    Returns
    -------
    pandas.DataFrame
        The preprocessed smell data.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The code below tests if the output of your function matches the expected output.

check_answer_df(preprocess_smell(smell_raw), df_smell, n=1)

Test case 1 failed.

Your output is:
None

Expected output is:
                           smell_value
EpochTime                             
2016-10-31 23:00:00+00:00          3.0
2016-11-01 00:00:00+00:00          0.0
2016-11-01 01:00:00+00:00          0.0
2016-11-01 02:00:00+00:00          0.0
2016-11-01 03:00:00+00:00          0.0
...                                ...
2018-09-30 00:00:00+00:00          3.0
2018-09-30 01:00:00+00:00          0.0
2018-09-30 02:00:00+00:00          0.0
2018-09-30 03:00:00+00:00          0.0
2018-09-30 04:00:00+00:00          8.0

[16758 rows x 1 columns]

Now, we can plot the distribution of smell values by using the pandas.DataFrame.plot function.

fig = df_smell.plot(kind="hist", bins=20, ylim=(0,100), edgecolor="black").set_yticks([0,50,100], labels=["0","50",">100"])

../_images/tutorial-structured-data_51_0.png

From the plot above, we can observe that a lot of the time, the smell values are fairly low. This means that smell events only happen occasionally, and thus our dataset is highly imbalanced.

We can also plot the average number of smell reports distributed by the day of week (Sunday to Saturday) and the hour of day (0 to 23), using a provided function plot_smell_by_day_and_hour in the utility file.

plot_smell_by_day_and_hour(df_smell)

../_images/tutorial-structured-data_54_0.png

From the plot above, we can observe that citizens tend to report smell in the morning.

Task 6: Prepare Features and Labels#

Now we have the preprocessed data in the df_sensor and df_smell variables. Our next task is to prepare features and labels for modeling, as shown in the figure below.

../_images/smellpgh-predict.png

Our goal is to construct two data frames, df_x and df_y, that represent the features and labels, respectively. First, we will deal with the sensor data. We need a function to insert columns (that indicate previous n_hr hours of sensor) to the existing data frame, where n_hr should be a parameter that we can control. We provide the insert_previous_data_to_cols function in the utility file to do this.

The code below shows a test case, which is a part of the sensor data.

# Below is an example input.
df_sensor_example_in = df_sensor[["3.feed_1.SONICWS_MPH"]][0:5]
df_sensor_example_in

	3.feed_1.SONICWS_MPH
EpochTime
2016-10-31 06:00:00+00:00	3.6
2016-10-31 07:00:00+00:00	3.5
2016-10-31 08:00:00+00:00	3.4
2016-10-31 09:00:00+00:00	2.1
2016-10-31 10:00:00+00:00	2.2

# Below is the expected output of the above example input.
df_sensor_example_out = insert_previous_data_to_cols(df_sensor_example_in, n_hr=2)
df_sensor_example_out

	3.feed_1.SONICWS_MPH_pre_1h	3.feed_1.SONICWS_MPH_pre_2h	3.feed_1.SONICWS_MPH_pre_3h
EpochTime
2016-10-31 08:00:00+00:00	3.4	3.5	3.6
2016-10-31 09:00:00+00:00	2.1	3.4	3.5
2016-10-31 10:00:00+00:00	2.2	2.1	3.4

The reason that there are 2 less rows in the expected output is because we set n_hr=2, which means there are missing data in the original first and second row (because there was no previous data for these rows). So in the code, we removed these rows.

Notice that the insert_previous_data_to_cols function added suffixes to the column names to indicate the number of hours that the sensor measurements came from previously. Pay attention to the meaning of time range here.

For example, in the first row, the 3.feed_1.SONICWS_MPH_pre_1h column has value 3.4, which means the average reading of wind speed (in unit MPH) between the current time stamp (which is 8:00) and the previous 1 hour (which is 7:00).
In the second column of the first row, the 3.feed_1.SONICWS_MPH_pre_2h has value 3.5, which means the average reading of wind speed between the previous 1 hour (which is 7:00) and 2 hours (which is 6:00).
It is important to note here that suffix pre_2h does not mean the average rating within 2 hours between the current time stamp and the time that is 2 hours ago.

Then, we also need a function to convert wind direction into sine and cosine components, which is a common technique for encoding cyclical features (i.e., any that that circulates within a set of values, such as hours of the day, days of the week). The formula is below:

\[\begin{split} x_{sin} = \sin{\Big(\frac{2 \cdot \pi \cdot x}{\max{(x)}}\Big)} \\ x_{cos} = \cos{\Big(\frac{2 \cdot \pi \cdot x}{\max{(x)}}\Big)} \end{split}\]

There are several reasons to do this instead of using the original wind direction degrees (that range from 0 to 360). First, by applying sine and cosine to the degrees, we can transform the original data to a continuous variable. The original data is not continuous since there are no values below 0 or above 360, and there is no information to tell that 0 degrees and 360 degrees are the same. Second, the decomposed sine and cosine components allow us to inspect the effect of wind on the north-south and east-west directions separately, which may help us explain the importance of wind directions. See the code for function convert_wind_direction in the utility file for achieving this.

The code below shows a test case, which is a part of the sensor data.

# Below is an example input.
df_wind_example_in = df_sensor[["3.feed_1.SONICWD_DEG"]][0:5]
df_wind_example_in

	3.feed_1.SONICWD_DEG
EpochTime
2016-10-31 06:00:00+00:00	343.0
2016-10-31 07:00:00+00:00	351.0
2016-10-31 08:00:00+00:00	359.0
2016-10-31 09:00:00+00:00	5.0
2016-10-31 10:00:00+00:00	41.0

# Below is the expected output of the above example input.
df_wind_example_out = convert_wind_direction(df_wind_example_in)
df_wind_example_out

	3.feed_1.SONICWD_DEG_cosine	3.feed_1.SONICWD_DEG_sine
EpochTime
2016-10-31 06:00:00+00:00	0.956305	-0.292372
2016-10-31 07:00:00+00:00	0.987688	-0.156434
2016-10-31 08:00:00+00:00	0.999848	-0.017452
2016-10-31 09:00:00+00:00	0.996195	0.087156
2016-10-31 10:00:00+00:00	0.754710	0.656059

We have dealt with the sensor data. Next, we will deal with the smell data. We need a function to sum up smell values in the future hours, where n_hr should be a parameter that we can control. The code below shows a test case, which is a part of the smell data.

# Below is an example input.
df_smell_example_in = df_smell[107:112]
df_smell_example_in

	smell_value
EpochTime
2016-11-05 10:00:00+00:00	8.0
2016-11-05 11:00:00+00:00	13.0
2016-11-05 12:00:00+00:00	40.0
2016-11-05 13:00:00+00:00	22.0
2016-11-05 14:00:00+00:00	4.0

# Below is the expected output of the above example input.
df_smell_example_out1 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=1)
df_smell_example_out1

	smell_value
EpochTime
2016-11-05 10:00:00+00:00	21.0
2016-11-05 11:00:00+00:00	53.0
2016-11-05 12:00:00+00:00	62.0
2016-11-05 13:00:00+00:00	26.0

# Below is another expected output with a different n_hr.
df_smell_example_out2 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=3)
df_smell_example_out2

	smell_value
EpochTime
2016-11-05 10:00:00+00:00	83.0
2016-11-05 11:00:00+00:00	79.0

In the output above, notice that row 2016-11-05 10:00:00+00:00 has smell value 83, and the setting is n_hr=3, which means the sum of smell values within n_hr+1 hours (i.e., from 10:00 to 14:00) is 83. Pay attention to this setup since it can be confusing. The reason of n_hr+1 (but not n_hr) is because the input data already indicates the sum of smell values within the future 1 hour.

# Below is another expected output when n_hr is 0.
df_smell_example_out3 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=0)
df_smell_example_out3

	smell_value
EpochTime
2016-11-05 10:00:00+00:00	8.0
2016-11-05 11:00:00+00:00	13.0
2016-11-05 12:00:00+00:00	40.0
2016-11-05 13:00:00+00:00	22.0
2016-11-05 14:00:00+00:00	4.0

Assignment for Task 6#

Your task (which is your assignment) is to write a function to do the following:

First, perform a windowing operation to sum up smell values within a specified n_hr time window.
- Hint: Use the pandas.DataFrame.rolling function when summing up values within a window. Type ?pd.DataFrame.rolling in a code cell for more information.
- Hint: Use the pandas.DataFrame.shift fuction to shift the rolled data back n_hr hours because we want to sum up the values in the future (the rolling function operates on the values in the past). Type ?pd.DataFrame.shift in a code cell for more information.
Finally, Remove the last n_hr hours of data because they have the wrong data due to shifting. For example, the last row does not have data in the future to operate if we set n_hr=1.
- Hint: Use pandas.DataFrame.iloc to select the rows that you want to keep. Type ?pd.DataFrame.iloc in a code cell for more information.
You need to handle the edge case when n_hr=0, which should output the original data frame.
Notice that you need to convert the column types in the dataframe to float to avoid errors when checking if the answers match the desired output. You can use the pandas.DataFrame.astype function to do this df = df.astype(float).

def sum_current_and_future_data(df, n_hr=0):
    """
    Sum up data in the current and future hours.

    Parameters
    ----------
    df : pandas.DataFrame
        The preprocessed smell data.
    n_hr : int
         Number of hours that we want to sum up the future smell data.

    Returns
    -------
    pandas.DataFrame
        The transformed smell data.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The code below tests if the output of your function matches the expected output.

check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=1), df_smell_example_out1, n=1)

Test case 1 failed.

Your output is:
None

Expected output is:
                           smell_value
EpochTime                             
2016-11-05 10:00:00+00:00         21.0
2016-11-05 11:00:00+00:00         53.0
2016-11-05 12:00:00+00:00         62.0
2016-11-05 13:00:00+00:00         26.0

check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=3), df_smell_example_out2, n=2)

Test case 2 failed.

Your output is:
None

Expected output is:
                           smell_value
EpochTime                             
2016-11-05 10:00:00+00:00         83.0
2016-11-05 11:00:00+00:00         79.0

check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=0), df_smell_example_out3, n=3)

Test case 3 failed.

Your output is:
None

Expected output is:
                           smell_value
EpochTime                             
2016-11-05 10:00:00+00:00          8.0
2016-11-05 11:00:00+00:00         13.0
2016-11-05 12:00:00+00:00         40.0
2016-11-05 13:00:00+00:00         22.0
2016-11-05 14:00:00+00:00          4.0

Finally, we need a function to compute the features and labels, based on the insert_previous_data_to_cols and sum_current_and_future_data functions. The code is below.

def compute_feature_label(df_smell, df_sensor, b_hr_sensor=0, f_hr_smell=0):
    """
    Compute features and labels from the smell and sensor data.

    Parameters
    ----------
    df_smell : pandas.DataFrame
        The preprocessed smell data.
    df_sensor : pandas.DataFrame
        The preprocessed sensor data.
    b_hr_sensor : int
        Number of hours that we want to insert the previous sensor data.
    f_hr_smell : int
        Number of hours that we want to sum up the future smell data.

    Returns
    -------
    df_x : pandas.DataFrame
        The features that we want to use for modeling.
    df_y : pandas.DataFrame
        The labels that we want to use for modeling.
    """
    # Copy data frames to prevent editing the original ones.
    df_smell = df_smell.copy(deep=True)
    df_sensor = df_sensor.copy(deep=True)

    # Replace -1 values in sensor data to NaN
    df_sensor[df_sensor==-1] = np.nan

    # Convert all wind directions.
    df_sensor = convert_wind_direction(df_sensor)

    # Scale sensor data and fill in missing values
    df_sensor = (df_sensor - df_sensor.mean()) / df_sensor.std()
    df_sensor = df_sensor.round(6)
    df_sensor = df_sensor.fillna(-1)

    # Insert previous sensor data as features.
    # Noice that the df_sensor is already using the previous data.
    # So b_hr_sensor=0 means using data from the previous 1 hour.
    # And b_hr_sensor=n means using data from the previous n+1 hours.
    df_sensor = insert_previous_data_to_cols(df_sensor, b_hr_sensor)

    # Sum up current and future smell values as label.
    # Notice that the df_smell is already the data from the future 1 hour.
    # (as indicated in the preprocessing phase of smell data)
    # So f_hr_smell=0 means using data from the future 1 hour.
    # And f_hr_smell=n means using data from the future n+1 hours.
    df_smell = answer_sum_current_and_future_data(df_smell, f_hr_smell)

    # Add suffix to the column name of the smell data to prevent confusion.
    # See the description above for the reason of adding 1 to the f_hr_smell.
    df_smell.columns += "_future_" + str(f_hr_smell+1) + "h"

    # We need to first merge these two timestamps based on the available data.
    # In this way, we synchronize the time stamps in the sensor and smell data.
    # This also means that the sensor and smell data have the same number of data points.
    df = pd.merge_ordered(df_sensor.reset_index(), df_smell.reset_index(), on=df_smell.index.name, how="inner", fill_method=None)

    # Sanity check: there should be no missing data.
    assert df.isna().sum().sum() == 0, "Error! There is missing data."

    # Separate features (x) and labels (y).
    df_x = df[df_sensor.columns].copy()
    df_y = df[df_smell.columns].copy()

    return df_x, df_y

We will use the sensor data within the previous 3 hours to predict bad smell within the future 8 hours. To use the compute_feature_label function that we just build, we need to set b_hr_sensor=2 and f_hr_smell=7 because originally df_sensor already contains data from the previous 1 hour, and df_smell already contains data from the future 1 hour.

Note that b_hr_sensor=n means that we want to insert previous n+1 hours of sensor data , and f_hr_smell=m means that we want to sum up the smell values of the future m+1 hours. For example, suppose that the current time is 8:00, setting b_hr_sensor=2 means that we use all sensor data from 5:00 to 8:00 (as features df_x in prediction), and setting f_hr_smell=7 means that we sum up the smell values from 8:00 to 16:00 (as labels df_y in prediction).

df_x, df_y = compute_feature_label(df_smell, df_sensor, b_hr_sensor=2, f_hr_smell=7)

Below is the data frame of features (i.e., the predictor variable). There are negative float numbers because the compute_feature_label function scales the sensor data by subtracting its mean value and dividing it by the standard deviation.

df_x

	3.feed_1.SO2_PPM_pre_1h	3.feed_1.H2S_PPM_pre_1h	3.feed_1.SIGTHETA_DEG_pre_1h	3.feed_1.SONICWS_MPH_pre_1h	3.feed_23.CO_PPM_pre_1h	3.feed_23.PM10_UG_M3_pre_1h	3.feed_29.PM10_UG_M3_pre_1h	3.feed_29.PM25_UG_M3_pre_1h	3.feed_11067.CO_PPB..3.feed_43.CO_PPB_pre_1h	3.feed_11067.NO2_PPB..3.feed_43.NO2_PPB_pre_1h	...	3.feed_1.SONICWD_DEG_cosine_pre_3h	3.feed_1.SONICWD_DEG_sine_pre_3h	3.feed_11067.SONICWD_DEG..3.feed_43.SONICWD_DEG_cosine_pre_3h	3.feed_11067.SONICWD_DEG..3.feed_43.SONICWD_DEG_sine_pre_3h	3.feed_28.SONICWD_DEG_cosine_pre_3h	3.feed_28.SONICWD_DEG_sine_pre_3h	3.feed_26.SONICWD_DEG_cosine_pre_3h	3.feed_26.SONICWD_DEG_sine_pre_3h	3.feed_3.SONICWD_DEG_cosine_pre_3h	3.feed_3.SONICWD_DEG_sine_pre_3h
0	-0.273112	-0.403688	-1.520058	-0.599075	-0.388936	-0.777225	-0.406466	-0.395826	-0.716551	-0.585693	...	-1.456291	0.536881	-1.196856	0.088730	0.279097	1.746934	-0.383942	1.929446	-0.542867	1.331119
1	-0.273112	-0.403688	-1.433654	-0.684709	-0.388936	-0.690974	0.007500	-0.305936	-0.426597	0.488014	...	0.417996	1.487643	0.440858	1.692426	1.089779	1.481480	0.945548	1.350182	0.512949	1.355712
2	-0.273112	-0.403688	1.142731	-0.941610	0.147335	-0.173473	-0.147737	-0.216045	-0.444787	0.829648	...	0.810100	1.341408	0.340580	1.703113	0.799733	1.640186	0.726159	1.603583	0.537757	1.347897
3	-0.273112	-0.403688	-0.082623	-0.941610	0.147335	-0.432224	-0.302974	-0.216045	-0.796641	0.081306	...	1.194228	1.053158	0.589287	1.663170	0.960380	1.562966	1.185067	0.816616	0.512949	1.355712
4	-0.273112	-0.403688	1.527618	-0.984426	0.147335	-0.259723	-0.458211	-0.485717	-0.762976	-0.504352	...	1.551329	0.235803	0.850317	1.569438	1.623480	0.780539	1.225168	0.602477	0.659294	1.303117
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
16746	-0.273112	-0.403688	0.011635	-0.128090	-0.925207	-0.604724	-0.561703	-0.575607	0.529598	-0.748376	...	1.194228	-0.675401	1.597272	0.742850	1.707841	0.380565	1.119099	-0.254164	1.210281	-0.738344
16747	-0.273112	-0.403688	0.443651	0.000361	-0.925207	-0.690974	-0.406466	-0.665498	0.662087	-0.292864	...	0.692773	-1.019622	1.666606	0.012696	1.693445	0.048275	1.098706	-0.303805	1.327922	-0.583204
16748	-0.273112	-0.403688	0.443651	-0.256540	-0.925207	-0.604724	-0.458211	-0.575607	0.181817	-0.862254	...	0.545323	-1.074614	1.515300	-0.445851	1.489886	-0.500401	0.609087	-0.931955	0.798657	-1.062282
16749	-0.273112	-0.403688	0.270844	-0.085273	-0.925207	-0.518474	-0.302974	-0.575607	0.856204	-0.439279	...	-0.138053	-1.148336	1.526941	-0.423174	1.402368	-0.626362	0.237194	-1.124325	0.706601	-1.107672
16750	-0.273112	-0.403688	-0.341833	0.085995	-0.925207	-0.432224	-0.406466	-0.395826	0.798647	-0.309133	...	-0.322341	-1.121240	-0.438378	-1.051525	0.581575	-1.153330	-0.684058	-1.060369	-0.161757	-1.243870

16751 rows × 144 columns

Below is the data frame of labels (i.e., the response variable).

df_y

	smell_value_future_8h
0	8.0
1	5.0
2	5.0
3	5.0
4	5.0
...	...
16746	6.0
16747	6.0
16748	6.0
16749	3.0
16750	11.0

16751 rows × 1 columns

Task 7: Train and Evaluate Models#

We have processed raw data and prepared the compute_feature_label function to convert smell and sensor data into features df_x and labels df_y. In this task, you will work on training basic and advanced models to predict bad smell events (i.e., the situation that the smell value is high) using the sensor data from air quality monitoring stations.

First, let us threshold the features to make them binary for our classification task. We will use value 40 as the threshold to indicate a smell event. The threshold 40 was used in the Smell Pittsburgh research paper. It is equivalent to the situation that 10 people reported smell with rating 4 within 8 hours.

df_y_40 = (df_y>=40).astype(int)
df_y_40

	smell_value_future_8h
0	0
1	0
2	0
3	0
4	0
...	...
16746	0
16747	0
16748	0
16749	0
16750	0

16751 rows × 1 columns

The descriptive statistics below tell us that the dataset is imbalanced, which means that the numbers of data points in the positive and negative label groups have a big difference.

print("There are %d rows with smell events." % (df_y_40.sum().iloc[0]))
print("This means %.2f proportion of the data has smell events." % (df_y_40.sum().iloc[0]/len(df_y_40)))

There are 1439 rows with smell events.
This means 0.09 proportion of the data has smell events.

Principal Component Analysis#

We can also plot the result from the Principal Component Analysis (PCA) to check if there are patterns in the data.

PCA is a linear dimensionality reduction technique aiming to transform the original variables into a new set of uncorrelated variables (principal components) that preserve as much information or variability in the dataset as possible. To do so, it applies a linear transformation to the data onto a new coordinate system such that the directions (principal components) that capture the largest variation in the data can be identified.

If you are interested in the mathematical explanation of PCA, you can check the following hidden block. Otherwise, please feel free to skip the math.

Click the button to show the math for PCA

Principal components are the eigenvectors of the covariance matrix of the centered data. The magnitude of their corresponding eigenvalues indicates the explained variance that each principal component captures. That way, the eigenvectors with the \(n\) highest principal components will be chosen as principal components.

The step to apply PCA with \(n\) components to a dataset \(X\) are the following:

Calculate covariance matrix of the centered data:

Subtract the mean (\(\mu_X\)) of each feature

\[ \bar{X} = X - \mu_X, \]

and compute the covariance matrix of the centered data \(\bar{X}\):

\[ C = \frac{1}{N} \bar{X}^T \bar{X}.\]
Compute Eigenvalues and Eigenvectors:

Derive the eigenvalues \(\{\lambda_i\}\) and eigenvectors \(\{v_i\}\) of the covariance matrix \(C\). Each eigenvector represents a principal component, and its corresponding eigenvalue indicates the variance explained by that principal component.
Choose the principal components

Order the eigenvalues in descending order.

\[ \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_D \]

and select \(v_1\), \(v_2\), …, \(v_n\) as principal components.
Data Transformation:

Use the original data \(X\) and the chosen principal components to obtain the transformed data \(X'\):

\[ X' = X P_k \]

where \(P_n\) is the matrix with the first \(n\) eigenvectors as columns.

Dimensionality reduction techiniques are often used to visualize larger-dimensional data in a 2D or 3D space. Let’s apply PCA to our example.

The get_pca_result function in the utility file creates a dataframe with the n principal components in the dataset.

The code below prints the outputs from the get_pca_result function.

df_pc_pca, df_r_pca = get_pca_result(df_x, df_y_40, n=3)
print(df_pc_pca)
print(df_r_pca)

            PC1       PC2       PC3  y  size
    2.523064 -7.411848 -4.221265  0    15
    1.880829 -7.723335 -1.952091  0    15
    0.004774 -7.342995 -0.352490  0    15
   -1.981156 -6.788872  0.291602  0    15
   -3.210972 -6.704316 -0.586818  0    15
...         ...       ...       ... ..   ...
3.846017 -4.412932  3.026400  0    15
3.778439 -3.782017  3.155601  0    15
3.554057 -2.261703  2.188408  0    15
3.308703 -2.001378  1.878951  0    15
3.675381 -1.200747  1.680042  0    15

[16751 rows x 5 columns]
     var   pc
0.257  PC1
0.120  PC2
0.085  PC3

Now let us plot the ratios of explained variances (i.e., the eigenvalues). The intuition is that if the explained variance ratio is larger, the corresponding principal component is more important and can represent more information.

# Plot the eigenvalues.
ax1 = sns.barplot(x="pc",y="var", data=df_r_pca, color="c")
ax1 = ax1.set(xlabel="Principal Component",
       ylabel="Ratio of Variance Explained",
       title="Variance Explained by Principal Components")

../_images/tutorial-structured-data_107_0.png

We can also plot the first and second principal components to see the distribution of labels. In the figure below, we see that label 1 (which means having a smell event) is concentrated on one side. However, it looks like labels 0 and 1 do not separate well, which means that the classifier will have difficulty distinguishing these two groups.

# Plot two principal components.
ax2 = sns.lmplot(x="PC1", y="PC2", data=df_pc_pca,
    fit_reg=False, hue="y", legend=True,
    scatter_kws={"s": 10, "alpha" :0.4})
ax2 = ax2.set(title="Distribution of Labels by Principal Components")

../_images/tutorial-structured-data_109_0.png

In the interactive figure below, we can zoom in, zoom out, and rotate the view to check the distribution of labels. We can confirm that label 1 is concentrated on one side. But we also observe that there is a large part of the data that may not be separated well by the classifier (especially linear classifiers). We probably need non-linear classifiers to be able to tell the differences between these two groups.

# Use the Plotly package to show the PCA results.
fig = px.scatter_3d(df_pc_pca.sample(n=5000), x="PC1", y="PC2", z="PC3",
                    color="y", symbol="y", opacity=0.6,
                    size="size", size_max=15,
                    category_orders=dict(y=["0", "1"]),
                    height=700)
fig.show()

Evaluation Metrics#

Next, let us pick a subset of the sensor data (instead of using all of them) and prepare the function for computing the evaluation metrics. Our intuition is that the smell may come from chemical compounds near major pollution sources. From the knowledge of local people, there is a large pollution source, which is the Clairton Mill Works that belongs to the United States Steel Corporation. This pollution source is located at the south part of Pittsburgh. This factory produces petroleum coke, which is a fuel to refine steel. And during the coke refining process, it generates pollutants.

We think that H2S (hydrogen sulfide, smells like rotten eggs) and SO2 (sulfur dioxide, smells like burnt matches) near the pollution source may be good features. So we first select the columns with H2S and SO2 measurements from a monitoring station near this pollution source.

df_x_subset = df_x[["3.feed_28.H2S_PPM_pre_1h", "3.feed_28.SO2_PPM_pre_1h"]]
df_x_subset

	3.feed_28.H2S_PPM_pre_1h	3.feed_28.SO2_PPM_pre_1h
0	-0.426339	-0.348809
1	-0.426339	-0.453409
2	-0.426339	-0.453409
3	-0.426339	-0.348809
4	-0.426339	-0.453409
...	...	...
16746	-0.426339	-0.244209
16747	-0.426339	-0.244209
16748	-0.426339	-0.244209
16749	-0.426339	-0.244209
16750	-0.426339	-0.244209

16751 rows × 2 columns

Next, we will train and evaluate a model (F) that maps features (i.e., the sensor readings) to labels (i.e., the smell events). We have the scorer and train_and_evaluate functions ready to help us train and evaluate models.

The train_and_evaluate function prints the averaged f1-score, averaged precision, averaged recall, and averaged accuracy across all the folds. These metrics are always in the range of zero and one, with zero being the worst and one being the best. We also printed the confusion matrix that contains true positives, false positives, true negatives, and false negatives. To understand the evaluation metrics, let us first take a look at the confusion matrix, explained below:

True Positives
- There is a smell event in the real world, and the model correctly predicts that there is a smell event.
False Positives
- There is no smell event in the real world, but the model falsely predicts that there is a smell event.
True Negatives
- There is no smell event in the real world, and the model correctly predicts that there is no smell event.
False Negatives
- There is a smell event in the real world, but the model falsely predicts that there is no smell event.

The accuracy metric is defined in the equation below:

accuracy = (true positives + true negatives) / total number of data points

Accuracy is the number of correct predictions divided by the total number of data points. It is a good metric if the data distribution is not skewed (i.e., the number of data records that have a bad smell and do not have a bad smell is roughly equal). But if the data is skewed, which is the case in our dataset, we will need another set of evaluation metrics: f1-score, precision, and recall. We will use an example later to explain why accuracy is an unfair metric for our dataset.

The precision metric is defined in the equation below:

precision = true positives / (true positives + false positives)

In other words, precision means how precise the prediction is. High precision means that if the model predicts “yes” for smell events, it is highly likely that the prediction is correct. We want high precision because we want the model to be as precise as possible when it says there will be smell events.

Next, the recall metric is defined in the equation below:

recall = true positives / (true positives + false negatives)

In other words, recall means the ability of the model to catch events. High recall means that the model has a low chance to miss the events that happen in the real world. We want high recall because we want the model to catch all smell events without missing them.

Typically, there is a tradeoff between precision and recall, and one may need to choose to go for a high precision but low recall model, or we go for a high recall but low precision model. The tradeoff depends on the context. For example, in medical applications, one may not want to miss the events (e.g., cancer) since the events are connected to patients’ quality of life. In our application of predicting smell events, we may not want the model to make false predictions when saying “yes” to smell events. The reason is that people may lose trust in the prediction model when we make real-world interventions incorrectly, such as sending push notifications to inform the users about the bad smell events.

The f1-score metric is a combination of recall and precision, as indicated below:

f1-score = 2 * (precision * recall) / (precision + recall)

Use the Dummy Classifier#

Now that we have explained the evaluation metrics. Let us first use a dummy classifier that always predicts no smell events. In other words, the dummy classifier never predicts “yes” about the presence of smell events. Later we will guide you through using more advanced machine learning models.

dummy_model = DummyClassifier(strategy="constant", constant=0)
train_and_evaluate(dummy_model, df_x_subset, df_y_40, train_size=336, test_size=168)

Use model DummyClassifier(constant=0, strategy='constant')
Perform cross-validation, please wait...
================================================
average f1-score: 0.0
average precision: 0.0
average recall: 0.0
average accuracy: 0.92
================================================

The printed message above shows the evaluation result of the dummy classifier. We see that the accuracy is 0.92, which is very high. But the f1-score, precision, and recall are zero since there are no true positives. This is because the Smell Pittsburgh dataset has a skewed distribution of smell events, which means that there are a lot of “no” (i.e., label 0) but only a small part of “yes” (i.e., label 1). This skewed data distribution corresponds to what happened in Pittsburgh. Most of the time, the odors in the city area are OK and not too bad. Occasionally, there can be very bad pollution odors, where many people complain.

By the definition of accuracy, the dummy classifier (which always says “no”) has a very high accuracy of 0.92. This is because only 9% of the data indicate bad smell events. So, you can see that accuracy is not a fair evaluation metric for the Smell Pittsburgh dataset. And instead, we need to go for the f1-score, precision, and recall metrics.

This step uses cross-validation to evaluate the machine learning model, where the data is divided into several parts, and some parts are used for training. Other parts are used for testing. Typically people use K-fold cross-validation, which means that the entire dataset is split into K parts. One part is used for testing (i.e., the testing set), and the other parts are used for training (i.e., the training set). This procedure is repeated K times so that every fold has the chance of being tested. The result is averaged to indicate the performance of the model, for example, averaged accuracy. We can then compare the results for different machine learning pipelines.

However, the script uses a different cross-validation approach, where we only use the previous folds to train the model to test future folds. For example, if we want to test the third fold, we will only use a part of the data from the first and second fold to train the model. The reason is that the Smell Pittsburgh dataset is primarily time-series data, which means the dataset has timestamps for every data record. In other words, things that happened in the past may affect the future. So, in fact, it does not make sense to use the data in the future to train a model to predict what happened in the past. Our time-series cross-validation approach is shown in the following figure.

../_images/smellpgh-cross-validation.png

For the train_and_evaluate function, test_size is the number of samples for testing, and train_size is the number of samples for training. We need to set these numbers for time-series cross-validation. For example, setting test_size to 168 means using 168 samples for testing, which also means 168 hours (or 7 days) of data. Setting train_size to 336 means using 336 samples for testing, which also means 336 hours (or 14 days) of data. So, this means we are using previous 14 days of sensor data to train the model, and then use the model to predict the smell events in the next 7 days. In this setting, every Sunday we can re-train the model with the updated data, so that we have the updated model to predict smell events every week.

Use the Decision Tree Model#

Now, instead of using the dummy classifier, we are going to use a different model. Let us use the Decision Tree model and compare its performance with the dummy classifier.

dt_model = DecisionTreeClassifier()
train_and_evaluate(dt_model, df_x_subset, df_y_40, train_size=336, test_size=168)

Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.14
average precision: 0.22
average recall: 0.12
average accuracy: 0.9
================================================

From the printed message above, notice that the Decision Tree model produces non-zero true positives and false positives (compared to the dummy classifier). Also, notice that f1-score, precision, and recall are no longer zero.

Decision Tree is a type of machine learning model. You can think of it as how a medical doctor diagnoses patients. For example, to determine if the patients need treatments, the medical doctor may ask the patients to describe symptoms. Depending on the symptoms, the doctor decides which treatment should be applied for the patient.

One can think of the smell prediction model as an air quality expert who is diagnosing the pollution patterns based on air quality and weather data. The treatment is to send a push notification to citizens to inform them of the presence of bad smell to help people plan daily activities. This decision-making process can be seen as a tree structure as shown in the following figure, where the first node is the most important factor to decide the treatment.

../_images/smellpgh-decision-tree.png

The above figure is just a toy example to show what a decision tree is. In our case, we put the features (X) into the decision tree to train it. Then, the tree will decide which feature to use and what is the threshold to split the data based on the features. This procedure is repeated several times (represented by the depth of the tree). Finally, the model will make a prediction (y) at the final node of the tree.

You can find the visualization of a decision tree (trained using the real data) in Figure 8 in the Smell Pittsburgh paper. More information about the Decision Tree can be found in the following link and paper:

Use the Random Forest Model#

Now that you have tried the Decision Tree model. Let us use a more advanced model, Random Forest, for smell event prediction.

rf_model = RandomForestClassifier()
train_and_evaluate(rf_model, df_x_subset, df_y_40, train_size=336, test_size=168)

Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.13
average precision: 0.23
average recall: 0.1
average accuracy: 0.91
================================================

Notice that the performance of the model does not look much better than the Decision Tree model. And in fact, both models currently have poor performance. This can have several meanings, as indicated in the following list. You will explore some of these questions in the assignment for this task.

Firstly, do we really believe that we are using a good set of features? Is it sufficient to only use the H2S and SO2 features? Is it sufficient to only include the data from the previous hour (i.e., the "3.feed_28.H2S_PPM_pre_1h" column)?
Secondly, the machine learning pipeline uses 14 days of data in the past (i.e., train_size=336) to predict smell events in the future 7 days (i.e., test_size=168). Do we believe that 14 days are sufficient for training a good model?
Finally, the decision tree and random forest models has many hyper-parameters (e.g., maximum depth of tree). Currently, the model uses the default hyper-parameters. Do we believe that the default setting is good?

Now let us take a look at the Random Forest model, which is a type of ensemble model. You can think about the ensemble model as a committee that makes decisions collaboratively, such as using majority voting. For example, to determine the treatment of a patient, we can ask the committee of medical doctors for a collaborative decision. The committee has several members who correspond to various machine learning models. The Random Forest model is a committee that is formed with many decision trees. Each tree is trained using different sets of data, as shown in the following figure.

../_images/smellpgh-random-forest.png

In other words, we first trained many decision trees, and each of them has access to only a part of the data (but not all of the data) that are randomly selected. So, each decision tree sees different sets of features. Then, we ask the committee members (i.e., the decision trees) to make predictions, and the final result is the one that receives the highest votes.

The intuition for having a committee (instead of only a single tree) is that we believe a diverse set of models can make better decisions collaboratively. There is mathematical proof about this intuition, but the proof is outside the scope of this course. More information about the Random Forest model can be found in the following link and paper:

Compute Feature Importance#

After training the model and evaluate its performance, we now have a better understanding about how to predict smell events. However, what if we want to know which are the important features? For example, which pollutants are the major source of the bad smell? Which pollution source is likely related to the bad smell? Under what situation will the pollutants travel to the Pittsburgh city area? This information can be important to help the municipality evaluate air pollution policies. This information can also help local communities to advocate for policy changes.

It turns out that we can permute the data in a specific column to know the importance. If a column (corresponding to a feature) is important, permuting the data specifically for the column will make the model performacne decrease. A higher decrease of a metric (e.g., f1-score) means that the feature is more important. It also means the feature is important for the model to make decisions. So, we can compute the “decrease” of a metric and use it as feature importance. We have provided a function compute_feature_importance in the utility file to do this.

compute_feature_importance(rf_model, df_x_subset, df_y_40, scoring="f1")

Computer feature importance using RandomForestClassifier()
=====================================================================
   feature_importance              feature_name
0             0.39398  3.feed_28.H2S_PPM_pre_1h
1             0.24576  3.feed_28.SO2_PPM_pre_1h
=====================================================================

Notice that the compute_feature_importance function can take a very long time to run if you use a large amount of training data.

The above code prints feature importance, which indicates the influence of each feature on the model prediction result. Here we use the Random Forest model. For example, in the printed message, the 3.feed_28.H2S_PPM_pre_1h feature has the highest importance. The values in the feature importance here represent the decrease of f1-score (because we use scoring="f1") if we randomly permute the data related to the feature.

For example, if we randomly permute the H2S measurement in the 3.feed_28.H2S_PPM_pre_1h column for all the data records (but keep other features the same), the f1-score of the Random Forest model will drop about 0.39. The intuition is that if a feature is more important, the model performance will decrease more when the feature is randomly permuted (i.e., when the data that is associated with the feature are messed up). More information can be found in the following link:

More information about feature importance

Notice that to use this technique, the model needs to fit the data reasonably well. Also depending on the number of features you are using, the step of computing the feature importance can take a lot of time.

Assignment for Task 7#

You have learned how to use different models and feature sets. For this assignment, use the knowledge that you learned to conduct experiments to answer the following questions:

Is including sensor data in the past a good idea to help improve model performance?
- Hint: Check precision, recall, and F1-score.
Is the wind direction from the air quality monitoring station (i.e., feed 28) that near the pollution source a good feature to predict bad smell?
- Hint: Compare the performance of models with different feature sets and check the feature importance.

You need to tweak parameters in the code to conduct a pilot experiment to understand if wind direction is a good feature for predicting the presence of bad smell. Also, you need to inspect if including more data from the previous hours is helpful. Specifically, you need to do the experiment using the combinations of two models (Decision Tree and Random Forest) and four feature sets (as described below). So in total, there should be 8 situations (2 models and 4 feature sets) to compare.

The 1st feature set below has the H2S and SO2 data from the previous 1 hour:

3.feed_28.H2S_PPM_pre_1h
3.feed_28.SO2_PPM_pre_1h

The 2nd feature set below has the H2S, SO2, and wind data from the previous 1 hour:

3.feed_28.H2S_PPM_pre_1h
3.feed_28.SO2_PPM_pre_1h
3.feed_28.SONICWD_DEG_sine_pre_1h
3.feed_28.SONICWD_DEG_cosine_pre_1h

The 3rd feature set below has the H2S and SO2 data from the previous 2 hours:

3.feed_28.H2S_PPM_pre_1h
3.feed_28.H2S_PPM_pre_2h
3.feed_28.SO2_PPM_pre_1h
3.feed_28.SO2_PPM_pre_2h

The 4th feature set below has the H2S, SO2, and wind data from the previous 2 hours:

3.feed_28.H2S_PPM_pre_1h
3.feed_28.SO2_PPM_pre_1h
3.feed_28.SONICWD_DEG_sine_pre_1h
3.feed_28.SONICWD_DEG_cosine_pre_1h
3.feed_28.H2S_PPM_pre_2h
3.feed_28.SO2_PPM_pre_2h
3.feed_28.SONICWD_DEG_sine_pre_2h
3.feed_28.SONICWD_DEG_cosine_pre_2h

def experiment(df_x, df_y):
    """
    Perform experiments and print the results.

    Parameters
    ----------
    df_x : pandas.DataFrame
        The data frame that contains all features.
    df_y : pandas.DataFrame
         The data frame that contains labels.
    """
    ###################################
    # Fill in your answer here
    print("None")
    ###################################

experiment(df_x, df_y_40)

None

You need to write the function above that can do similar things as the following from the answer.

answer_experiment(df_x, df_y_40)

Use feature set ['3.feed_28.H2S_PPM_pre_1h', '3.feed_28.SO2_PPM_pre_1h']
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.17
average precision: 0.28
average recall: 0.15
average accuracy: 0.91
================================================
Computer feature importance using DecisionTreeClassifier()
=====================================================================
   feature_importance              feature_name
0             0.36567  3.feed_28.H2S_PPM_pre_1h
1             0.24042  3.feed_28.SO2_PPM_pre_1h
=====================================================================

Use feature set ['3.feed_28.H2S_PPM_pre_1h', '3.feed_28.SO2_PPM_pre_1h', '3.feed_28.SONICWD_DEG_sine_pre_1h', '3.feed_28.SONICWD_DEG_cosine_pre_1h']
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.22
average precision: 0.24
average recall: 0.23
average accuracy: 0.88
================================================
Computer feature importance using DecisionTreeClassifier()
=====================================================================
   feature_importance                         feature_name
0             0.60139             3.feed_28.H2S_PPM_pre_1h
1             0.53011             3.feed_28.SO2_PPM_pre_1h
2             0.48496    3.feed_28.SONICWD_DEG_sine_pre_1h
3             0.42971  3.feed_28.SONICWD_DEG_cosine_pre_1h
=====================================================================

Use feature set ['3.feed_28.H2S_PPM_pre_1h', '3.feed_28.SO2_PPM_pre_1h', '3.feed_28.H2S_PPM_pre_2h', '3.feed_28.SO2_PPM_pre_2h']
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.19
average precision: 0.24
average recall: 0.17
average accuracy: 0.89
================================================
Computer feature importance using DecisionTreeClassifier()
=====================================================================
   feature_importance              feature_name
0             0.60700  3.feed_28.H2S_PPM_pre_1h
1             0.51984  3.feed_28.SO2_PPM_pre_1h
2             0.42018  3.feed_28.H2S_PPM_pre_2h
3             0.40782  3.feed_28.SO2_PPM_pre_2h
=====================================================================

Use feature set ['3.feed_28.H2S_PPM_pre_1h', '3.feed_28.SO2_PPM_pre_1h', '3.feed_28.H2S_PPM_pre_2h', '3.feed_28.SO2_PPM_pre_2h', '3.feed_28.SONICWD_DEG_sine_pre_1h', '3.feed_28.SONICWD_DEG_cosine_pre_1h', '3.feed_28.SONICWD_DEG_sine_pre_2h', '3.feed_28.SONICWD_DEG_cosine_pre_2h']
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.22
average precision: 0.22
average recall: 0.24
average accuracy: 0.87
================================================
Computer feature importance using DecisionTreeClassifier()
=====================================================================
   feature_importance                         feature_name
0             0.64432             3.feed_28.H2S_PPM_pre_1h
1             0.63424    3.feed_28.SONICWD_DEG_sine_pre_1h
2             0.53237    3.feed_28.SONICWD_DEG_sine_pre_2h
3             0.48941  3.feed_28.SONICWD_DEG_cosine_pre_2h
4             0.44407             3.feed_28.H2S_PPM_pre_2h
5             0.42375  3.feed_28.SONICWD_DEG_cosine_pre_1h
6             0.37589             3.feed_28.SO2_PPM_pre_1h
7             0.36048             3.feed_28.SO2_PPM_pre_2h
=====================================================================

Use feature set ['3.feed_28.H2S_PPM_pre_1h', '3.feed_28.SO2_PPM_pre_1h']
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.16
average precision: 0.27
average recall: 0.13
average accuracy: 0.91
================================================
Computer feature importance using RandomForestClassifier()
=====================================================================
   feature_importance              feature_name
0             0.39334  3.feed_28.H2S_PPM_pre_1h
1             0.25501  3.feed_28.SO2_PPM_pre_1h
=====================================================================

Use feature set ['3.feed_28.H2S_PPM_pre_1h', '3.feed_28.SO2_PPM_pre_1h', '3.feed_28.SONICWD_DEG_sine_pre_1h', '3.feed_28.SONICWD_DEG_cosine_pre_1h']
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.19
average precision: 0.28
average recall: 0.18
average accuracy: 0.9
================================================
Computer feature importance using RandomForestClassifier()
=====================================================================
   feature_importance                         feature_name
0             0.62467             3.feed_28.H2S_PPM_pre_1h
1             0.45242             3.feed_28.SO2_PPM_pre_1h
2             0.43412    3.feed_28.SONICWD_DEG_sine_pre_1h
3             0.37243  3.feed_28.SONICWD_DEG_cosine_pre_1h
=====================================================================

Use feature set ['3.feed_28.H2S_PPM_pre_1h', '3.feed_28.SO2_PPM_pre_1h', '3.feed_28.H2S_PPM_pre_2h', '3.feed_28.SO2_PPM_pre_2h']
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.19
average precision: 0.29
average recall: 0.16
average accuracy: 0.91
================================================
Computer feature importance using RandomForestClassifier()
=====================================================================
   feature_importance              feature_name
0             0.66323  3.feed_28.H2S_PPM_pre_1h
1             0.54248  3.feed_28.H2S_PPM_pre_2h
2             0.43138  3.feed_28.SO2_PPM_pre_2h
3             0.36777  3.feed_28.SO2_PPM_pre_1h
=====================================================================

Use feature set ['3.feed_28.H2S_PPM_pre_1h', '3.feed_28.SO2_PPM_pre_1h', '3.feed_28.H2S_PPM_pre_2h', '3.feed_28.SO2_PPM_pre_2h', '3.feed_28.SONICWD_DEG_sine_pre_1h', '3.feed_28.SONICWD_DEG_cosine_pre_1h', '3.feed_28.SONICWD_DEG_sine_pre_2h', '3.feed_28.SONICWD_DEG_cosine_pre_2h']
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.19
average precision: 0.33
average recall: 0.16
average accuracy: 0.91
================================================
Computer feature importance using RandomForestClassifier()
=====================================================================
   feature_importance                         feature_name
0             0.44466             3.feed_28.H2S_PPM_pre_1h
1             0.39277    3.feed_28.SONICWD_DEG_sine_pre_1h
2             0.37966             3.feed_28.H2S_PPM_pre_2h
3             0.36402    3.feed_28.SONICWD_DEG_sine_pre_2h
4             0.33916  3.feed_28.SONICWD_DEG_cosine_pre_1h
5             0.28473             3.feed_28.SO2_PPM_pre_2h
6             0.27674             3.feed_28.SO2_PPM_pre_1h
7             0.27121  3.feed_28.SONICWD_DEG_cosine_pre_2h
=====================================================================

Now, think about the following two questions that we asked you.

In your opinion, is including sensor data in the past a good idea to help improve model performance?
In your opinion, is the wind direction from the air quality monitoring station (i.e., feed 28) that near the pollution source a good feature to predict bad smell?

1: Credit: this teaching material is created by Yen-Chia Hsu and revised by Alejandro Monroy.

Structured Data Processing

Tutorial (Structured Data Processing)

Contents

Tutorial (Structured Data Processing)#

Scenario#

Import Packages#

Task Answers#

Utility File#

Task 4: Preprocess Sensor Data#

Assignment for Task 4#

Task 5: Preprocess Smell Data#

Assignment for Task 5#

Task 6: Prepare Features and Labels#

Assignment for Task 6#

Task 7: Train and Evaluate Models#

Principal Component Analysis#

Evaluation Metrics#

Use the Dummy Classifier#

Use the Decision Tree Model#

Use the Random Forest Model#

Compute Feature Importance#

Assignment for Task 7#