Tutorial (Structured Data Processing)
Contents
Tutorial (Structured Data Processing)#
(Last updated: Feb 6, 2024)1
This tutorial will familiarize you with the data science pipeline of processing structured data, using a real-world example of building models to predict and explain the presence of bad smell events in Pittsburgh based on air quality and weather data. The models are used to send push notifications about bad smell events to inform citizens, as well as to explain local pollution patterns to inform stakeholders.
The scenario is in the next section of this tutorial, and more details are in the introduction section of the Smell Pittsburgh paper. We will use the same dataset as used in the Smell Pittsburgh paper as an example of structured data. During this tutorial, we will explain what the variables in the dataset mean and also guide you through model building. Below is the pipeline of this tutorial.
You can use the following link to jump to the tasks and assignments:
Scenario#
Local citizens in Pittsburgh are organizing communities to advocate for changes in air pollution regulations. Their goal is to investigate the air pollution patterns in the city to understand the potential sources related to the bad odor. The communities rely on the Smell Pittsburgh application (as indicated in the figure below) to collect smell reports from citizens that live in the Pittsburgh region. Also, there are air quality and weather monitoring stations in the Pittsburgh city region that provide sensor measurements, including common air pollutants and wind information.
You work in a data science team to develop models to map the sensor data to bad smell events. Your team has been working with the Pittsburgh local citizens closely for a long time, and therefore you know the meaning of each variable in the feature set that is used to train the machine learning model. The Pittsburgh community needs your help timely to analyze the data that can help them present evidence of air pollution to the municipality and explain the patterns to the general public.
Import Packages#
Important
You need to install the packages in this link in your Python development environment.
We put all the packages that are needed for this tutorial below:
import pandas as pd
import numpy as np
import pytz
import seaborn as sns
import matplotlib.pyplot as plt
from os.path import isfile, join
from os import listdir
from copy import deepcopy
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
from pandas.api.indexers import FixedForwardWindowIndexer
from sklearn.decomposition import PCA
import plotly.express as px
Set the global style of all Pandas table output:
from IPython.core.display import HTML
def set_global_df_style():
styles = """
<style>
.dataframe * {font-size: 1em !important;}
</style>
"""
return HTML(styles)
# Call this function in a notebook cell to apply the style globally
set_global_df_style()
Task Answers#
The code block below contains answers for the assignments in this tutorial. Do not check the answers in the next cell before practicing the tasks.
def check_answer_df(df_result, df_answer, n=1):
"""
This function checks if two output dataframes are the same.
Parameters
----------
df_result : pandas.DataFrame
The result from the output of a function.
df_answer: pandas.DataFrame
The expected output of the function.
n : int
The numbering of the test case.
"""
try:
assert df_answer.equals(df_result)
print("Test case %d passed." % n)
except:
print("Test case %d failed." % n)
print("")
print("Your output is:")
print(df_result)
print("")
print("Expected output is:")
print(df_answer)
def answer_preprocess_sensor(df_list):
"""
This function is the answer of task 5.
Preprocess sensor data.
Parameters
----------
df_list : list of pandas.DataFrame
A list of data frames that contain sensor data from multiple stations.
Returns
-------
pandas.DataFrame
The preprocessed sensor data.
"""
# Resample all the data frames.
df_resample_list = []
for df in df_list:
# Convert the timestamp to datetime.
df.index = pd.to_datetime(df.index, unit="s", utc=True)
# Resample the timestamps by hour and average all the previous values.
# Because we want data from the past, so label need to be "right".
df_resample_list.append(df.resample("60Min", label="right").mean())
# Merge all data frames.
df = df_resample_list.pop(0)
index_name = df.index.name
while len(df_resample_list) != 0:
# We need to use outer merging since we want to preserve data from both data frames.
df = pd.merge_ordered(df, df_resample_list.pop(0), on=df.index.name, how="outer", fill_method=None)
# Move the datetime column to index
df = df.set_index(index_name)
# Fill in the missing data with value -1.
df = df.fillna(-1)
return df
def answer_preprocess_smell(df):
"""
This function is the answer of task 4.
Preprocess smell data.
Parameters
----------
df : pandas.DataFrame
The raw smell reports data.
Returns
-------
pandas.DataFrame
The preprocessed smell data.
"""
# Copy the dataframe to avoid editing the original one.
df = df.copy(deep=True)
# Drop the columns that we do not need.
df = df.drop(columns=["feelings_symptoms", "smell_description", "zipcode"])
# Select only the reports within the range of 3 and 5.
df = df[(df["smell_value"]>=3)&(df["smell_value"]<=5)]
# Convert the timestamp to datetime.
df.index = pd.to_datetime(df.index, unit="s", utc=True)
# Resample the timestamps by hour and sum up all the future values.
# Because we want data from the future, so label need to be "left".
df = df.resample("60Min", label="left").sum()
# Fill in the missing data with value 0.
df = df.fillna(0)
return df
def answer_sum_current_and_future_data(df, n_hr=0):
"""
This function is the answer of task 6.
Sum up data in the current and future hours.
Parameters
----------
df : pandas.DataFrame
The preprocessed smell data.
n_hr : int
Number of hours that we want to sum up the future smell data.
Returns
-------
pandas.DataFrame
The transformed smell data.
"""
# Copy data frame to prevent editing the original one.
df = df.copy(deep=True)
# Fast return if n_hr is 0
if n_hr == 0: return df
# Sum up all smell_values in future hours.
# The rolling function only works for summing up previous values.
# So we need to shift back to get the value in the future.
# Be careful that we need to add 1 to the rolling window size.
# Becasue window size 1 means only using the current data.
# Parameter "closed" need to be "right" because we want the current data.
df = df.rolling(n_hr+1, min_periods=1, closed="right").sum().shift(-1*n_hr)
# Below is an alternative implementation of rolling.
#indexer = FixedForwardWindowIndexer(window_size=n_hr+1)
#df = df.rolling(indexer, min_periods=1).sum()
# Delete the last n_hr rows.
# These n_hr rows have wrong data due to data shifting.
df = df.iloc[:-1*n_hr]
return df
def answer_experiment(df_x, df_y):
"""
This function is the answer of task 7.
Perform experiments and print the results.
Parameters
----------
df_x : pandas.DataFrame
The data frame that contains all features.
df_y : pandas.DataFrame
The data frame that contains labels.
"""
wind = "3.feed_28.SONICWD_DEG"
h2s = "3.feed_28.H2S_PPM"
dow = "day_of_week"
hod = "hour_of_day"
fs1 = [h2s + "_pre_1h", dow + "_sine", dow + "_cosine", hod + "_sine", hod + "_cosine"]
fs1w = fs1 + [wind + "_sine_pre_1h", wind + "_cosine_pre_1h"]
fs2 = fs1 + [h2s + "_pre_2h"]
fs2w = fs2 + [wind + "_sine_pre_2h", wind + "_cosine_pre_2h"]
feature_sets = [fs1, fs1w, fs2, fs2w]
models = [DecisionTreeClassifier(), RandomForestClassifier()]
for m in models:
for fs in feature_sets:
print("Use feature set %s" % (str(fs)))
df_x_fs = df_x[fs]
train_and_evaluate(m, df_x_fs, df_y, train_size=336, test_size=168)
compute_feature_importance(m, df_x_fs, df_y, scoring="f1")
print("")
Task 4: Preprocess Sensor Data#
In this task, we will process the sensor data from various air quality monitoring stations in Pittsburgh. First, we need to load all the sensor data.
path = "smellpgh-v1/esdr_raw"
list_of_files = [f for f in listdir(path) if isfile(join(path, f))]
sensor_raw_list = []
for f in list_of_files:
sensor_raw_list.append(pd.read_csv(join(path, f)).set_index("EpochTime"))
Now, the sensor_raw_list
variable contains all the data frames with sensor values from different air quality monitoring stations. Noted that sensor_raw_list
is an array of data frames. We can print one of them to take a look, as shown below.
sensor_raw_list[0]
3.feed_1.SO2_PPM | 3.feed_1.H2S_PPM | 3.feed_1.SIGTHETA_DEG | 3.feed_1.SONICWD_DEG | 3.feed_1.SONICWS_MPH | |
---|---|---|---|---|---|
EpochTime | |||||
1477891800 | 0.0 | 0.0 | 51.7 | 343.0 | 3.6 |
1477895400 | 0.0 | 0.0 | 52.7 | 351.0 | 3.5 |
1477899000 | 0.0 | 0.0 | 52.6 | 359.0 | 3.4 |
1477902600 | 0.0 | 0.0 | 48.3 | 5.0 | 2.1 |
1477906200 | 0.0 | 0.0 | 31.1 | 41.0 | 2.2 |
... | ... | ... | ... | ... | ... |
1538267400 | 0.0 | 0.0 | 35.2 | 39.0 | 1.7 |
1538271000 | 0.0 | 0.0 | 48.2 | 53.0 | 1.3 |
1538274600 | 0.0 | 0.0 | 30.9 | 62.0 | 1.5 |
1538278200 | 0.0 | 0.0 | 21.5 | 53.0 | 2.0 |
1538281800 | 0.0 | 0.0 | 52.1 | 36.0 | 1.7 |
16729 rows × 5 columns
The EpochTime
index is the timestamp in epoch time, which means the number of seconds that have elapsed since January 1st, 1970 (midnight UTC/GMT). Other columns mean the sensor data from an air quality monitoring station.
Next, we need to resample and merge all the sensor data frames so that they can be used for modeling. Our goal is to have a dataframe that looks like the following:
df_sensor = answer_preprocess_sensor(sensor_raw_list)
df_sensor
3.feed_1.SO2_PPM | 3.feed_1.H2S_PPM | 3.feed_1.SIGTHETA_DEG | 3.feed_1.SONICWD_DEG | 3.feed_1.SONICWS_MPH | 3.feed_23.CO_PPM | 3.feed_23.PM10_UG_M3 | 3.feed_29.PM10_UG_M3 | 3.feed_29.PM25_UG_M3 | 3.feed_11067.CO_PPB..3.feed_43.CO_PPB | ... | 3.feed_3.SO2_PPM | 3.feed_3.SONICWD_DEG | 3.feed_3.SONICWS_MPH | 3.feed_3.SIGTHETA_DEG | 3.feed_3.PM10B_UG_M3 | 3.feed_5975.PM2_5 | 3.feed_27.NO_PPB | 3.feed_27.NOY_PPB | 3.feed_27.CO_PPB | 3.feed_27.SO2_PPB | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EpochTime | |||||||||||||||||||||
2016-10-31 06:00:00+00:00 | 0.0 | 0.0 | 51.7 | 343.0 | 3.6 | 0.2 | 7.0 | 8.0 | 8.0 | 159.5 | ... | 0.0 | 344.0 | 2.9 | 43.0 | 9.0 | 0.0 | 0.1 | 2.6 | -1.0 | 0.2 |
2016-10-31 07:00:00+00:00 | 0.0 | 0.0 | 52.7 | 351.0 | 3.5 | 0.2 | 8.0 | 8.0 | 8.0 | -1.0 | ... | 0.0 | 330.0 | 2.5 | 43.6 | 13.0 | 5.0 | -1.0 | -1.0 | 106.1 | 0.0 |
2016-10-31 08:00:00+00:00 | 0.0 | 0.0 | 52.6 | 359.0 | 3.4 | 0.2 | 5.0 | 7.0 | 7.0 | 133.0 | ... | 0.0 | 0.0 | 3.1 | 40.9 | 7.0 | 9.0 | 0.2 | 2.1 | 105.8 | -1.0 |
2016-10-31 09:00:00+00:00 | 0.0 | 0.0 | 48.3 | 5.0 | 2.1 | 0.2 | 3.0 | 4.0 | 4.0 | 236.6 | ... | 0.0 | 325.0 | 1.9 | 40.0 | 11.0 | 3.0 | 0.1 | 3.1 | 111.7 | 0.0 |
2016-10-31 10:00:00+00:00 | 0.0 | 0.0 | 31.1 | 41.0 | 2.2 | 0.2 | 5.0 | 5.0 | 4.0 | 269.3 | ... | 0.0 | 347.0 | 1.4 | 45.1 | 10.0 | 9.0 | 0.1 | 2.5 | 127.2 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-09-30 01:00:00+00:00 | 0.0 | 0.0 | 35.2 | 39.0 | 1.7 | 0.3 | 20.0 | 15.0 | 10.0 | 455.1 | ... | 0.0 | 39.0 | 1.3 | 57.3 | 11.0 | 5.0 | 0.1 | 12.1 | 301.0 | 0.0 |
2018-09-30 02:00:00+00:00 | 0.0 | 0.0 | 48.2 | 53.0 | 1.3 | 0.3 | 25.0 | 19.0 | 12.0 | 761.2 | ... | 0.0 | 70.0 | 1.0 | 54.4 | 21.0 | 7.0 | 0.2 | 13.5 | 357.7 | 0.0 |
2018-09-30 03:00:00+00:00 | 0.0 | 0.0 | 30.9 | 62.0 | 1.5 | 0.4 | 23.0 | 55.0 | 33.0 | 1125.4 | ... | 0.0 | 75.0 | 0.7 | 59.5 | 33.0 | 8.0 | 0.6 | 13.8 | 373.6 | 0.0 |
2018-09-30 04:00:00+00:00 | 0.0 | 0.0 | 21.5 | 53.0 | 2.0 | 0.5 | 23.0 | 63.0 | 39.0 | 1039.5 | ... | 0.0 | 74.0 | 0.7 | 50.7 | 32.0 | 17.0 | 0.6 | 11.4 | 381.0 | 0.0 |
2018-09-30 05:00:00+00:00 | 0.0 | 0.0 | 52.1 | 36.0 | 1.7 | 0.4 | 25.0 | 47.0 | 33.0 | 1636.3 | ... | 0.0 | 65.0 | 0.4 | 77.1 | 27.0 | 18.0 | 2.0 | 13.1 | 377.7 | 0.0 |
16776 rows × 43 columns
In the expected output above, the EpochTime
index is converted from timestamps into pandas datetime objects, which has the format year-month-day hour:minute:second+timezone
. The +00:00
string means the GMT/UTC timezone. Other columns mean the average value of the sensor data in the previous hour. For example, 2016-10-31 06:00:00+00:00
means October 31 in 2016 at 6AM UTC time, and the cell with column 3.feed_1.SO2_PPM
means the averaged SO2 (sulfur dioxide) values from 5:00 to 6:00.
The column name suffix SO2_PPM
means sulfur dioxide in unit PPM (parts per million). The prefix 3.feed_1.
in the column name means a specific sensor (feed ID 1). You can ignore the 3.
at the begining of the column name. You can find the list of sensors, their names with feed ID (which will be in the data frame columns), and also the meaning of all the suffixes from this link.
Some column names look like 3.feed_11067.SIGTHETA_DEG..3.feed_43.SIGTHETA_DEG
. This means that the column has data from two sensor stations (feed ID 11067 and 43). The reason is that some sensor stations are replaced by the new ones over time. So in this case, we merge sensor readings from both feed ID 11067 and 43.
Assignment for Task 4#
Your task (which is your assignment) is to write a function to do the following:
Sensors can report in various frequencies. So, for each data frame, we need to resample the data by computing the hourly average of sensor measurements from the “previous” hour. For example, at time 8:00, we want to know the average of sensor values between 7:00 and 8:00.
Hint: Use the
pandas.to_datetime
function when converting timestamps to datetime objects. Type?pd.to_datetime
in a code cell for more information.Hint: Use the
pandas.DataFrame.resample
function to resample data. Type?pd.DataFrame.resample
in a code cell for more information.Hint: Use the
pandas.merge_ordered
function when merging data frames. Type?pd.merge_ordered
in a code cell for more information.
Then, merge all the data frames based on their time stamp, which is the
EpochTime
column.Finally, fill in the missing data with the value -1. The reason for not using 0 is that we want the model to know if sensors give values (including zero) or no data.
Hint: Use the
pandas.DataFrame.fillna
function when treating missing values. Type?pd.DataFrame.fillna
in a code cell for more information.
def preprocess_sensor(df_list):
"""
Preprocess sensor data.
Parameters
----------
df_list : list of pandas.DataFrame
A list of data frames that contain sensor data from multiple stations.
Returns
-------
pandas.DataFrame
The preprocessed sensor data.
"""
###################################
# Fill in your answer here
return None
###################################
The code below tests if the output of your function matches the expected output.
check_answer_df(preprocess_sensor(sensor_raw_list), df_sensor, n=1)
Test case 1 failed.
Your output is:
None
Expected output is:
3.feed_1.SO2_PPM 3.feed_1.H2S_PPM \
EpochTime
2016-10-31 06:00:00+00:00 0.0 0.0
2016-10-31 07:00:00+00:00 0.0 0.0
2016-10-31 08:00:00+00:00 0.0 0.0
2016-10-31 09:00:00+00:00 0.0 0.0
2016-10-31 10:00:00+00:00 0.0 0.0
... ... ...
2018-09-30 01:00:00+00:00 0.0 0.0
2018-09-30 02:00:00+00:00 0.0 0.0
2018-09-30 03:00:00+00:00 0.0 0.0
2018-09-30 04:00:00+00:00 0.0 0.0
2018-09-30 05:00:00+00:00 0.0 0.0
3.feed_1.SIGTHETA_DEG 3.feed_1.SONICWD_DEG \
EpochTime
2016-10-31 06:00:00+00:00 51.7 343.0
2016-10-31 07:00:00+00:00 52.7 351.0
2016-10-31 08:00:00+00:00 52.6 359.0
2016-10-31 09:00:00+00:00 48.3 5.0
2016-10-31 10:00:00+00:00 31.1 41.0
... ... ...
2018-09-30 01:00:00+00:00 35.2 39.0
2018-09-30 02:00:00+00:00 48.2 53.0
2018-09-30 03:00:00+00:00 30.9 62.0
2018-09-30 04:00:00+00:00 21.5 53.0
2018-09-30 05:00:00+00:00 52.1 36.0
3.feed_1.SONICWS_MPH 3.feed_23.CO_PPM \
EpochTime
2016-10-31 06:00:00+00:00 3.6 0.2
2016-10-31 07:00:00+00:00 3.5 0.2
2016-10-31 08:00:00+00:00 3.4 0.2
2016-10-31 09:00:00+00:00 2.1 0.2
2016-10-31 10:00:00+00:00 2.2 0.2
... ... ...
2018-09-30 01:00:00+00:00 1.7 0.3
2018-09-30 02:00:00+00:00 1.3 0.3
2018-09-30 03:00:00+00:00 1.5 0.4
2018-09-30 04:00:00+00:00 2.0 0.5
2018-09-30 05:00:00+00:00 1.7 0.4
3.feed_23.PM10_UG_M3 3.feed_29.PM10_UG_M3 \
EpochTime
2016-10-31 06:00:00+00:00 7.0 8.0
2016-10-31 07:00:00+00:00 8.0 8.0
2016-10-31 08:00:00+00:00 5.0 7.0
2016-10-31 09:00:00+00:00 3.0 4.0
2016-10-31 10:00:00+00:00 5.0 5.0
... ... ...
2018-09-30 01:00:00+00:00 20.0 15.0
2018-09-30 02:00:00+00:00 25.0 19.0
2018-09-30 03:00:00+00:00 23.0 55.0
2018-09-30 04:00:00+00:00 23.0 63.0
2018-09-30 05:00:00+00:00 25.0 47.0
3.feed_29.PM25_UG_M3 \
EpochTime
2016-10-31 06:00:00+00:00 8.0
2016-10-31 07:00:00+00:00 8.0
2016-10-31 08:00:00+00:00 7.0
2016-10-31 09:00:00+00:00 4.0
2016-10-31 10:00:00+00:00 4.0
... ...
2018-09-30 01:00:00+00:00 10.0
2018-09-30 02:00:00+00:00 12.0
2018-09-30 03:00:00+00:00 33.0
2018-09-30 04:00:00+00:00 39.0
2018-09-30 05:00:00+00:00 33.0
3.feed_11067.CO_PPB..3.feed_43.CO_PPB ... \
EpochTime ...
2016-10-31 06:00:00+00:00 159.5 ...
2016-10-31 07:00:00+00:00 -1.0 ...
2016-10-31 08:00:00+00:00 133.0 ...
2016-10-31 09:00:00+00:00 236.6 ...
2016-10-31 10:00:00+00:00 269.3 ...
... ... ...
2018-09-30 01:00:00+00:00 455.1 ...
2018-09-30 02:00:00+00:00 761.2 ...
2018-09-30 03:00:00+00:00 1125.4 ...
2018-09-30 04:00:00+00:00 1039.5 ...
2018-09-30 05:00:00+00:00 1636.3 ...
3.feed_3.SO2_PPM 3.feed_3.SONICWD_DEG \
EpochTime
2016-10-31 06:00:00+00:00 0.0 344.0
2016-10-31 07:00:00+00:00 0.0 330.0
2016-10-31 08:00:00+00:00 0.0 0.0
2016-10-31 09:00:00+00:00 0.0 325.0
2016-10-31 10:00:00+00:00 0.0 347.0
... ... ...
2018-09-30 01:00:00+00:00 0.0 39.0
2018-09-30 02:00:00+00:00 0.0 70.0
2018-09-30 03:00:00+00:00 0.0 75.0
2018-09-30 04:00:00+00:00 0.0 74.0
2018-09-30 05:00:00+00:00 0.0 65.0
3.feed_3.SONICWS_MPH 3.feed_3.SIGTHETA_DEG \
EpochTime
2016-10-31 06:00:00+00:00 2.9 43.0
2016-10-31 07:00:00+00:00 2.5 43.6
2016-10-31 08:00:00+00:00 3.1 40.9
2016-10-31 09:00:00+00:00 1.9 40.0
2016-10-31 10:00:00+00:00 1.4 45.1
... ... ...
2018-09-30 01:00:00+00:00 1.3 57.3
2018-09-30 02:00:00+00:00 1.0 54.4
2018-09-30 03:00:00+00:00 0.7 59.5
2018-09-30 04:00:00+00:00 0.7 50.7
2018-09-30 05:00:00+00:00 0.4 77.1
3.feed_3.PM10B_UG_M3 3.feed_5975.PM2_5 \
EpochTime
2016-10-31 06:00:00+00:00 9.0 0.0
2016-10-31 07:00:00+00:00 13.0 5.0
2016-10-31 08:00:00+00:00 7.0 9.0
2016-10-31 09:00:00+00:00 11.0 3.0
2016-10-31 10:00:00+00:00 10.0 9.0
... ... ...
2018-09-30 01:00:00+00:00 11.0 5.0
2018-09-30 02:00:00+00:00 21.0 7.0
2018-09-30 03:00:00+00:00 33.0 8.0
2018-09-30 04:00:00+00:00 32.0 17.0
2018-09-30 05:00:00+00:00 27.0 18.0
3.feed_27.NO_PPB 3.feed_27.NOY_PPB \
EpochTime
2016-10-31 06:00:00+00:00 0.1 2.6
2016-10-31 07:00:00+00:00 -1.0 -1.0
2016-10-31 08:00:00+00:00 0.2 2.1
2016-10-31 09:00:00+00:00 0.1 3.1
2016-10-31 10:00:00+00:00 0.1 2.5
... ... ...
2018-09-30 01:00:00+00:00 0.1 12.1
2018-09-30 02:00:00+00:00 0.2 13.5
2018-09-30 03:00:00+00:00 0.6 13.8
2018-09-30 04:00:00+00:00 0.6 11.4
2018-09-30 05:00:00+00:00 2.0 13.1
3.feed_27.CO_PPB 3.feed_27.SO2_PPB
EpochTime
2016-10-31 06:00:00+00:00 -1.0 0.2
2016-10-31 07:00:00+00:00 106.1 0.0
2016-10-31 08:00:00+00:00 105.8 -1.0
2016-10-31 09:00:00+00:00 111.7 0.0
2016-10-31 10:00:00+00:00 127.2 0.0
... ... ...
2018-09-30 01:00:00+00:00 301.0 0.0
2018-09-30 02:00:00+00:00 357.7 0.0
2018-09-30 03:00:00+00:00 373.6 0.0
2018-09-30 04:00:00+00:00 381.0 0.0
2018-09-30 05:00:00+00:00 377.7 0.0
[16776 rows x 43 columns]
Task 5: Preprocess Smell Data#
In this task, we will preprocess the smell data. First, we need to load the raw smell data.
smell_raw = pd.read_csv("smellpgh-v1/smell_raw.csv").set_index("EpochTime")
smell_raw
feelings_symptoms | smell_description | smell_value | zipcode | |
---|---|---|---|---|
EpochTime | ||||
1477935134 | NaN | NaN | 1 | 15221 |
1477956180 | NaN | Woodsmoke | 2 | 15218 |
1477956293 | NaN | Wood smoke | 3 | 15218 |
1477973293 | Eye irritation, nose burns, headache, woke me up | Industrial | 5 | 15207 |
1478001989 | NaN | Industrial smoke | 2 | 15213 |
... | ... | ... | ... | ... |
1538248172 | NaN | Sour sewage | 3 | 15213 |
1538255258 | Coughing | Smoke | 2 | 15104 |
1538268796 | NaN | Like a burning candle | 3 | 15232 |
1538281653 | No | Greasy air | 3 | 15222 |
1538282980 | Eye irritation | Wood smoke and gun powder very Smokey too as f... | 5 | 15217 |
10353 rows × 4 columns
The meaning of EpochTime
is explained in the previous task. Other columns mean the self-reported symptoms, descriptions of smell, severity ratings (the smell_value
column), and the zipcode where the report is submitted in Pittsburgh, Pennsylvania. For example, the second row means that the smell report was submitted from the 15218 zipcode with wood smoke description and severity rating 2. For more description about the smell, please check the Smell Pittsburgh website.
Next, we need to resample the smell data so that they can be used for modeling. Our goal is to have a dataframe that looks like the following:
df_smell = answer_preprocess_smell(smell_raw)
df_smell
smell_value | |
---|---|
EpochTime | |
2016-10-31 23:00:00+00:00 | 3 |
2016-11-01 00:00:00+00:00 | 0 |
2016-11-01 01:00:00+00:00 | 0 |
2016-11-01 02:00:00+00:00 | 0 |
2016-11-01 03:00:00+00:00 | 0 |
... | ... |
2018-09-30 00:00:00+00:00 | 3 |
2018-09-30 01:00:00+00:00 | 0 |
2018-09-30 02:00:00+00:00 | 0 |
2018-09-30 03:00:00+00:00 | 0 |
2018-09-30 04:00:00+00:00 | 8 |
16758 rows × 1 columns
In the latest row, the timestamp is 2018-09-30 04:00:00+00:00
, which means this row contains the data from 3:00 to 4:00 on September 30 in 2018. This row has smell_value
8, which means the sum of smell report ratings in the above mentioned time range. Notice that the expected output ignores all smell ratings from 1 to 2. This is becasue we only want the ratings that indicate bad smell, which will be further explained below.
Assignment for Task 5#
Your task (which is your assignment) is to write a function to do the following:
First, remove the
feelings_symptoms
,smell_description
, andzipcode
columns since we do not need them.Hint: Use the
pandas.DataFrame.drop
function. Type?pd.DataFrame.drop
in a code cell for more information.
We only want the reports that indicate bad smell. You need to select only the reports with rating 3, 4, or 5 in the
smell_value
column.Then, we want to know the severity of bad smell within an hour in the future. For example, at time 8:00, we want to know the sum of smell values between 8:00 and 9:00. So you need to resample the data by computing the hourly sum of smell values from the “future” hour.
Hint: Use the
pandas.to_datetime
function when converting timestamps to datetime objects. Type?pd.to_datetime
in a code cell for more information.Hint: Use the
pandas.DataFrame.resample
function to resample data. Type?pd.DataFrame.resample
in a code cell for more information.
Finally, fill in the missing data with the value 0. The reason is that missing data means there are no smell reports (provided by citizens) within an hour, so we assume that there is no bad smell within this period of time. Notice that this is an assumption and also a limitation since citizens rarely report good smell.
Hint: Use the
pandas.DataFrame.fillna
function when treating missing values. Type?pd.DataFrame.fillna
in a code cell for more information.
def preprocess_smell(df):
"""
Preprocess smell data.
Parameters
----------
df : pandas.DataFrame
The raw smell reports data.
Returns
-------
pandas.DataFrame
The preprocessed smell data.
"""
###################################
# Fill in your answer here
return None
###################################
The code below tests if the output of your function matches the expected output.
check_answer_df(preprocess_smell(smell_raw), df_smell, n=1)
Test case 1 failed.
Your output is:
None
Expected output is:
smell_value
EpochTime
2016-10-31 23:00:00+00:00 3
2016-11-01 00:00:00+00:00 0
2016-11-01 01:00:00+00:00 0
2016-11-01 02:00:00+00:00 0
2016-11-01 03:00:00+00:00 0
... ...
2018-09-30 00:00:00+00:00 3
2018-09-30 01:00:00+00:00 0
2018-09-30 02:00:00+00:00 0
2018-09-30 03:00:00+00:00 0
2018-09-30 04:00:00+00:00 8
[16758 rows x 1 columns]
Now, we can plot the distribution of smell values by using the pandas.DataFrame.plot
function.
fig = df_smell.plot(kind="hist", bins=20, ylim=(0,100), edgecolor="black").set_yticks([0,50,100], labels=["0","50",">100"])
From the plot above, we can observe that a lot of the time, the smell values are fairly low. This means that smell events only happen occasionally, and thus our dataset is highly imbalanced.
We can also plot the average number of smell reports distributed by the day of week (Sunday to Saturday) and the hour of day (0 to 23), using the code below.
def is_datetime_obj_tz_aware(dt):
"""
Find if the datetime object is timezone aware.
Parameters
----------
dt : pandas.DatetimeIndex
A datatime index object.
"""
return dt.tz is not None
def plot_smell_by_day_and_hour(df):
"""
Plot the average number of smell reports by the day of week and the hour of day.
Parameters
----------
df : pandas.DataFrame
The preprocessed smell data.
"""
# Copy the data frame to prevent editing the original one.
df = df.copy(deep=True)
# Convert timestamps to the local time in Pittsburgh.
if is_datetime_obj_tz_aware(df.index):
df.index = df.index.tz_convert(pytz.timezone("US/Eastern"))
else:
df.index = df.index.tz_localize(pytz.utc, ambiguous="infer").tz_convert(pytz.timezone("US/Eastern"))
# Compute the day of week and the hour of day.
df["day_of_week"] = df.index.dayofweek
df["hour_of_day"] = df.index.hour
# Plot the graph.
y_l = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
df_pivot = pd.pivot_table(df, values="smell_value", index=["day_of_week"], columns=["hour_of_day"], aggfunc="mean")
f, ax = plt.subplots(figsize=(14, 6))
sns.heatmap(df_pivot, annot=False, cmap="Blues", fmt="g", linewidths=0.1, yticklabels=y_l, ax=ax)
plot_smell_by_day_and_hour(df_smell)
From the plot above, we can observe that citizens tend to report smell in the morning.
Task 6: Prepare Features and Labels#
Now we have the preprocessed data in the df_sensor
and df_smell
variables. Our next task is to prepare features and labels for modeling, as shown in the figure below.
Our goal is to construct two data frames, df_x
and df_y
, that represent the features and labels, respectively. First, we will deal with the sensor data. We need a function to insert columns (that indicate previous n_hr
hours of sensor) to the existing data frame, where n_hr
should be a parameter that we can control. The code below can help us achieve this.
def insert_previous_data_to_cols(df, n_hr=0):
"""
Insert columns to indicate the data from the previous hours.
Parameters
----------
df : pandas.DataFrame
The preprocessed sensor data.
n_hr : int
Number of hours that we want to insert the previous sensor data.
Returns
-------
pandas.DataFrame
The transformed sensor data.
"""
# Copy data frame to prevent editing the original one.
df = df.copy(deep=True)
# Add the data from the previous hours.
df_all = []
for h in range(1, n_hr + 1):
# Shift the data frame to get previous data.
df_pre = df.shift(h)
# Edit the name to indicate it is previous data.
# The orginal data frame already has data from the previous 1 hour.
# (as indicated in the preprocessing phase of sensor data)
# So we need to add 1 here.
df_pre.columns += "_pre_" + str(h+1) + "h"
# Add the data to an array for merging.
df_all.append(df_pre)
# Rename the columns in the original data frame.
# The orginal data frame already has data from the previous 1 hour.
# (as indicated in the preprocessing phase of sensor data)
df.columns += "_pre_1h"
# Merge all data.
df_merge = df
for d in df_all:
# The join function merges dataframes by index.
df_merge = df_merge.join(d)
# Delete the first n_hr rows.
# These n_hr rows have no data due to data shifting.
df_merge = df_merge.iloc[n_hr:]
return df_merge
The code below shows a test case, which is a part of the sensor data.
# Below is an example input.
df_sensor_example_in = df_sensor[["3.feed_1.SONICWS_MPH"]][0:5]
df_sensor_example_in
3.feed_1.SONICWS_MPH | |
---|---|
EpochTime | |
2016-10-31 06:00:00+00:00 | 3.6 |
2016-10-31 07:00:00+00:00 | 3.5 |
2016-10-31 08:00:00+00:00 | 3.4 |
2016-10-31 09:00:00+00:00 | 2.1 |
2016-10-31 10:00:00+00:00 | 2.2 |
# Below is the expected output of the above example input.
df_sensor_example_out = insert_previous_data_to_cols(df_sensor_example_in, n_hr=2)
df_sensor_example_out
3.feed_1.SONICWS_MPH_pre_1h | 3.feed_1.SONICWS_MPH_pre_2h | 3.feed_1.SONICWS_MPH_pre_3h | |
---|---|---|---|
EpochTime | |||
2016-10-31 08:00:00+00:00 | 3.4 | 3.5 | 3.6 |
2016-10-31 09:00:00+00:00 | 2.1 | 3.4 | 3.5 |
2016-10-31 10:00:00+00:00 | 2.2 | 2.1 | 3.4 |
The reason that there are 2 less rows in the expected output is because we set n_hr=2
, which means there are missing data in the original first and second row (because there was no previous data for these rows). So in the code, we removed these rows.
Notice that the insert_previous_data_to_cols
function added suffixes to the column names to indicate the number of hours that the sensor measurements came from previously. Pay attention to the meaning of time range here.
For example, in the first row, the
3.feed_1.SONICWS_MPH_pre_1h
column has value3.4
, which means the average reading of wind speed (in unit MPH) between the current time stamp (which is8:00
) and the previous 1 hour (which is7:00
).In the second column of the first row, the
3.feed_1.SONICWS_MPH_pre_2h
has value3.5
, which means the average reading of wind speed between the previous 1 hour (which is7:00
) and 2 hours (which is6:00
).It is important to note here that suffix
pre_2h
does not mean the average rating within 2 hours between the current time stamp and the time that is 2 hours ago.
Then, we also need a function to convert wind direction into sine and cosine components, which is a common technique for encoding cyclical features (i.e., any that that circulates within a set of values, such as hours of the day, days of the week). The formula is below:
There are several reasons to do this instead of using the original wind direction degrees (that range from 0 to 360). First, by applying sine and cosine to the degrees, we can transform the original data to a continuous variable. The original data is not continuous since there are no values below 0 or above 360, and there is no information to tell that 0 degrees and 360 degrees are the same. Second, the decomposed sine and cosine components allow us to inspect the effect of wind on the north-south and east-west directions separately, which may help us explain the importance of wind directions. Below is the code for achieving this.
def convert_wind_direction(df):
"""
Convert wind directions to sine and cosine components.
Parameters
----------
df : pandas.DataFrame
The data frame that contains the wind direction data.
Returns
-------
pandas.DataFrame
The transformed data frame.
"""
# Copy data frame to prevent editing the original one.
df_cp = df.copy(deep=True)
# Convert columns with wind directions.
for c in df.columns:
if "SONICWD_DEG" in c:
df_c = df[c]
df_c_cos = np.cos(np.deg2rad(df_c))
df_c_sin = np.sin(np.deg2rad(df_c))
df_c_cos.name += "_cosine"
df_c_sin.name += "_sine"
df_cp.drop([c], axis=1, inplace=True)
df_cp[df_c_cos.name] = df_c_cos
df_cp[df_c_sin.name] = df_c_sin
return df_cp
The code below shows a test case, which is a part of the sensor data.
# Below is an example input.
df_wind_example_in = df_sensor[["3.feed_1.SONICWD_DEG"]][0:5]
df_wind_example_in
3.feed_1.SONICWD_DEG | |
---|---|
EpochTime | |
2016-10-31 06:00:00+00:00 | 343.0 |
2016-10-31 07:00:00+00:00 | 351.0 |
2016-10-31 08:00:00+00:00 | 359.0 |
2016-10-31 09:00:00+00:00 | 5.0 |
2016-10-31 10:00:00+00:00 | 41.0 |
# Below is the expected output of the above example input.
df_wind_example_out = convert_wind_direction(df_wind_example_in)
df_wind_example_out
3.feed_1.SONICWD_DEG_cosine | 3.feed_1.SONICWD_DEG_sine | |
---|---|---|
EpochTime | ||
2016-10-31 06:00:00+00:00 | 0.956305 | -0.292372 |
2016-10-31 07:00:00+00:00 | 0.987688 | -0.156434 |
2016-10-31 08:00:00+00:00 | 0.999848 | -0.017452 |
2016-10-31 09:00:00+00:00 | 0.996195 | 0.087156 |
2016-10-31 10:00:00+00:00 | 0.754710 | 0.656059 |
We have dealt with the sensor data. Next, we will deal with the smell data. We need a function to sum up smell values in the future hours, where n_hr
should be a parameter that we can control. The code below shows a test case, which is a part of the smell data.
# Below is an example input.
df_smell_example_in = df_smell[107:112]
df_smell_example_in
smell_value | |
---|---|
EpochTime | |
2016-11-05 10:00:00+00:00 | 8 |
2016-11-05 11:00:00+00:00 | 13 |
2016-11-05 12:00:00+00:00 | 40 |
2016-11-05 13:00:00+00:00 | 22 |
2016-11-05 14:00:00+00:00 | 4 |
# Below is the expected output of the above example input.
df_smell_example_out1 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=1)
df_smell_example_out1
smell_value | |
---|---|
EpochTime | |
2016-11-05 10:00:00+00:00 | 21.0 |
2016-11-05 11:00:00+00:00 | 53.0 |
2016-11-05 12:00:00+00:00 | 62.0 |
2016-11-05 13:00:00+00:00 | 26.0 |
# Below is another expected output with a different n_hr.
df_smell_example_out2 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=3)
df_smell_example_out2
smell_value | |
---|---|
EpochTime | |
2016-11-05 10:00:00+00:00 | 83.0 |
2016-11-05 11:00:00+00:00 | 79.0 |
In the output above, notice that row 2016-11-05 10:00:00+00:00
has smell value 83
, and the setting is n_hr=3
, which means the sum of smell values within n_hr+1
hours (i.e., from 10:00
to 14:00
) is 83. Pay attention to this setup since it can be confusing. The reason of n_hr+1
(but not n_hr
) is because the input data already indicates the sum of smell values within the future 1 hour.
# Below is another expected output when n_hr is 0.
df_smell_example_out3 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=0)
df_smell_example_out3
smell_value | |
---|---|
EpochTime | |
2016-11-05 10:00:00+00:00 | 8 |
2016-11-05 11:00:00+00:00 | 13 |
2016-11-05 12:00:00+00:00 | 40 |
2016-11-05 13:00:00+00:00 | 22 |
2016-11-05 14:00:00+00:00 | 4 |
Assignment for Task 6#
Your task (which is your assignment) is to write a function to do the following:
First, perform a windowing operation to sum up smell values within a specified
n_hr
time window.Hint: Use the
pandas.DataFrame.rolling
function when summing up values within a window. Type?pd.DataFrame.rolling
in a code cell for more information.Hint: Use the
pandas.DataFrame.shift
fuction to shift the rolled data backn_hr
hours because we want to sum up the values in the future (the rolling function operates on the values in the past). Type?pd.DataFrame.shift
in a code cell for more information.
Finally, Remove the last
n_hr
hours of data because they have the wrong data due to shifting. For example, the last row does not have data in the future to operate if we setn_hr=1
.Hint: Use
pandas.DataFrame.iloc
to select the rows that you want to keep. Type?pd.DataFrame.iloc
in a code cell for more information.
You need to handle the edge case when
n_hr=0
, which should output the original data frame.
def sum_current_and_future_data(df, n_hr=0):
"""
Sum up data in the current and future hours.
Parameters
----------
df : pandas.DataFrame
The preprocessed smell data.
n_hr : int
Number of hours that we want to sum up the future smell data.
Returns
-------
pandas.DataFrame
The transformed smell data.
"""
###################################
# Fill in your answer here
return None
###################################
The code below tests if the output of your function matches the expected output.
check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=1), df_smell_example_out1, n=1)
Test case 1 failed.
Your output is:
None
Expected output is:
smell_value
EpochTime
2016-11-05 10:00:00+00:00 21.0
2016-11-05 11:00:00+00:00 53.0
2016-11-05 12:00:00+00:00 62.0
2016-11-05 13:00:00+00:00 26.0
check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=3), df_smell_example_out2, n=2)
Test case 2 failed.
Your output is:
None
Expected output is:
smell_value
EpochTime
2016-11-05 10:00:00+00:00 83.0
2016-11-05 11:00:00+00:00 79.0
check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=0), df_smell_example_out3, n=3)
Test case 3 failed.
Your output is:
None
Expected output is:
smell_value
EpochTime
2016-11-05 10:00:00+00:00 8
2016-11-05 11:00:00+00:00 13
2016-11-05 12:00:00+00:00 40
2016-11-05 13:00:00+00:00 22
2016-11-05 14:00:00+00:00 4
Finally, we need a function to compute the features and labels, based on the above insert_previous_data_to_cols
and sum_current_and_future_data
functions that you implemented. The code is below.
def compute_feature_label(df_smell, df_sensor, b_hr_sensor=0, f_hr_smell=0):
"""
Compute features and labels from the smell and sensor data.
Parameters
----------
df_smell : pandas.DataFrame
The preprocessed smell data.
df_sensor : pandas.DataFrame
The preprocessed sensor data.
b_hr_sensor : int
Number of hours that we want to insert the previous sensor data.
f_hr_smell : int
Number of hours that we want to sum up the future smell data.
Returns
-------
df_x : pandas.DataFrame
The features that we want to use for modeling.
df_y : pandas.DataFrame
The labels that we want to use for modeling.
"""
# Copy data frames to prevent editing the original ones.
df_smell = df_smell.copy(deep=True)
df_sensor = df_sensor.copy(deep=True)
# Replace -1 values in sensor data to NaN
df_sensor[df_sensor==-1] = np.nan
# Convert all wind directions.
df_sensor = convert_wind_direction(df_sensor)
# Scale sensor data and fill in missing values
df_sensor = (df_sensor - df_sensor.mean()) / df_sensor.std()
df_sensor = df_sensor.round(6)
df_sensor = df_sensor.fillna(-1)
# Insert previous sensor data as features.
# Noice that the df_sensor is already using the previous data.
# So b_hr_sensor=0 means using data from the previous 1 hour.
# And b_hr_sensor=n means using data from the previous n+1 hours.
df_sensor = insert_previous_data_to_cols(df_sensor, b_hr_sensor)
# Sum up current and future smell values as label.
# Notice that the df_smell is already the data from the future 1 hour.
# (as indicated in the preprocessing phase of smell data)
# So f_hr_smell=0 means using data from the future 1 hour.
# And f_hr_smell=n means using data from the future n+1 hours.
df_smell = answer_sum_current_and_future_data(df_smell, f_hr_smell)
# Add suffix to the column name of the smell data to prevent confusion.
# See the description above for the reason of adding 1 to the f_hr_smell.
df_smell.columns += "_future_" + str(f_hr_smell+1) + "h"
# We need to first merge these two timestamps based on the available data.
# In this way, we synchronize the time stamps in the sensor and smell data.
# This also means that the sensor and smell data have the same number of data points.
df = pd.merge_ordered(df_sensor.reset_index(), df_smell.reset_index(), on=df_smell.index.name, how="inner", fill_method=None)
# Sanity check: there should be no missing data.
assert df.isna().sum().sum() == 0, "Error! There is missing data."
# Separate features (x) and labels (y).
df_x = df[df_sensor.columns].copy()
df_y = df[df_smell.columns].copy()
# Add the hour of day and the day of week.
dow_radian = df["EpochTime"].dt.dayofweek.copy(deep=True) * 2 * np.pi / 6.0
tod_radian = df["EpochTime"].dt.hour.copy(deep=True) * 2 * np.pi / 23.0
df_x.loc[:,"day_of_week_sine"] = np.sin(dow_radian)
df_x.loc[:,"day_of_week_cosine"] = np.cos(dow_radian)
df_x.loc[:,"hour_of_day_sine"] = np.sin(tod_radian)
df_x.loc[:,"hour_of_day_cosine"] = np.cos(tod_radian)
return df_x, df_y
We will use the sensor data within the previous 3 hours to predict bad smell within the future 8 hours. To use the compute_feature_label
function that we just build, we need to set b_hr_sensor=2
and f_hr_smell=7
because originally df_sensor
already contains data from the previous 1 hour, and df_smell
already contains data from the future 1 hour.
Note that b_hr_sensor=n
means that we want to insert previous n+1
hours of sensor data , and f_hr_smell=m
means that we want to sum up the smell values of the future m+1
hours. For example, suppose that the current time is 8:00, setting b_hr_sensor=2
means that we use all sensor data from 5:00 to 8:00 (as features df_x
in prediction), and setting f_hr_smell=7
means that we sum up the smell values from 8:00 to 16:00 (as labels df_y
in prediction).
df_x, df_y = compute_feature_label(df_smell, df_sensor, b_hr_sensor=2, f_hr_smell=7)
Below is the data frame of features (i.e., the predictor variable).
df_x
3.feed_1.SO2_PPM_pre_1h | 3.feed_1.H2S_PPM_pre_1h | 3.feed_1.SIGTHETA_DEG_pre_1h | 3.feed_1.SONICWS_MPH_pre_1h | 3.feed_23.CO_PPM_pre_1h | 3.feed_23.PM10_UG_M3_pre_1h | 3.feed_29.PM10_UG_M3_pre_1h | 3.feed_29.PM25_UG_M3_pre_1h | 3.feed_11067.CO_PPB..3.feed_43.CO_PPB_pre_1h | 3.feed_11067.NO2_PPB..3.feed_43.NO2_PPB_pre_1h | ... | 3.feed_28.SONICWD_DEG_cosine_pre_3h | 3.feed_28.SONICWD_DEG_sine_pre_3h | 3.feed_26.SONICWD_DEG_cosine_pre_3h | 3.feed_26.SONICWD_DEG_sine_pre_3h | 3.feed_3.SONICWD_DEG_cosine_pre_3h | 3.feed_3.SONICWD_DEG_sine_pre_3h | day_of_week_sine | day_of_week_cosine | hour_of_day_sine | hour_of_day_cosine | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.273112 | -0.403688 | -1.520058 | -0.599075 | -0.388936 | -0.777225 | -0.406466 | -0.395826 | -0.716551 | -0.585693 | ... | 0.279097 | 1.746934 | -0.383942 | 1.929446 | -0.542867 | 1.331119 | 0.000000 | 1.0 | -2.449294e-16 | 1.000000 |
1 | -0.273112 | -0.403688 | -1.433654 | -0.684709 | -0.388936 | -0.690974 | 0.007500 | -0.305936 | -0.426597 | 0.488014 | ... | 1.089779 | 1.481480 | 0.945548 | 1.350182 | 0.512949 | 1.355712 | 0.866025 | 0.5 | 0.000000e+00 | 1.000000 |
2 | -0.273112 | -0.403688 | 1.142731 | -0.941610 | 0.147335 | -0.173473 | -0.147737 | -0.216045 | -0.444787 | 0.829648 | ... | 0.799733 | 1.640186 | 0.726159 | 1.603583 | 0.537757 | 1.347897 | 0.866025 | 0.5 | 2.697968e-01 | 0.962917 |
3 | -0.273112 | -0.403688 | -0.082623 | -0.941610 | 0.147335 | -0.432224 | -0.302974 | -0.216045 | -0.796641 | 0.081306 | ... | 0.960380 | 1.562966 | 1.185067 | 0.816616 | 0.512949 | 1.355712 | 0.866025 | 0.5 | 5.195840e-01 | 0.854419 |
4 | -0.273112 | -0.403688 | 1.527618 | -0.984426 | 0.147335 | -0.259723 | -0.458211 | -0.485717 | -0.762976 | -0.504352 | ... | 1.623480 | 0.780539 | 1.225168 | 0.602477 | 0.659294 | 1.303117 | 0.866025 | 0.5 | 7.308360e-01 | 0.682553 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
16746 | -0.273112 | -0.403688 | 0.011635 | -0.128090 | -0.925207 | -0.604724 | -0.561703 | -0.575607 | 0.529598 | -0.748376 | ... | 1.707841 | 0.380565 | 1.119099 | -0.254164 | 1.210281 | -0.738344 | -0.866025 | 0.5 | -9.976688e-01 | -0.068242 |
16747 | -0.273112 | -0.403688 | 0.443651 | 0.000361 | -0.925207 | -0.690974 | -0.406466 | -0.665498 | 0.662087 | -0.292864 | ... | 1.693445 | 0.048275 | 1.098706 | -0.303805 | 1.327922 | -0.583204 | -0.866025 | 0.5 | -9.790841e-01 | 0.203456 |
16748 | -0.273112 | -0.403688 | 0.443651 | -0.256540 | -0.925207 | -0.604724 | -0.458211 | -0.575607 | 0.181817 | -0.862254 | ... | 1.489886 | -0.500401 | 0.609087 | -0.931955 | 0.798657 | -1.062282 | -0.866025 | 0.5 | -8.878852e-01 | 0.460065 |
16749 | -0.273112 | -0.403688 | 0.270844 | -0.085273 | -0.925207 | -0.518474 | -0.302974 | -0.575607 | 0.856204 | -0.439279 | ... | 1.402368 | -0.626362 | 0.237194 | -1.124325 | 0.706601 | -1.107672 | -0.866025 | 0.5 | -7.308360e-01 | 0.682553 |
16750 | -0.273112 | -0.403688 | -0.341833 | 0.085995 | -0.925207 | -0.432224 | -0.406466 | -0.395826 | 0.798647 | -0.309133 | ... | 0.581575 | -1.153330 | -0.684058 | -1.060369 | -0.161757 | -1.243870 | -0.866025 | 0.5 | -5.195840e-01 | 0.854419 |
16751 rows × 148 columns
Below is the data frame of labels (i.e., the response variable).
df_y
smell_value_future_8h | |
---|---|
0 | 8.0 |
1 | 5.0 |
2 | 5.0 |
3 | 5.0 |
4 | 5.0 |
... | ... |
16746 | 6.0 |
16747 | 6.0 |
16748 | 6.0 |
16749 | 3.0 |
16750 | 11.0 |
16751 rows × 1 columns
Task 7: Train and Evaluate Models#
We have processed raw data and prepared the compute_feature_label
function to convert smell and sensor data into features df_x
and labels df_y
. In this task, you will work on training basic and advanced models to predict bad smell events (i.e., the situation that the smell value is high) using the sensor data from air quality monitoring stations.
First, let us threshold the features to make them binary for our classification task. We will use value 40 as the threshold to indicate a smell event. The threshold 40 was used in the Smell Pittsburgh research paper. It is equivalent to the situation that 10 people reported smell with rating 4 within 8 hours.
df_y_40 = (df_y>=40).astype(int)
df_y_40
smell_value_future_8h | |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
... | ... |
16746 | 0 |
16747 | 0 |
16748 | 0 |
16749 | 0 |
16750 | 0 |
16751 rows × 1 columns
The descriptive statistics below tell us that the dataset is imbalanced, which means that the numbers of data points in the positive and negative label groups have a big difference.
print("There are %d rows with smell events." % (df_y_40.sum().iloc[0]))
print("This means %.2f proportion of the data has smell events." % (df_y_40.sum().iloc[0]/len(df_y_40)))
There are 1439 rows with smell events.
This means 0.09 proportion of the data has smell events.
Principal Component Analysis#
We can also plot the result from the Principal Component Analysis (PCA) to check if there are patterns in the data.
PCA is a linear dimensionality reduction technique aiming to transform the original variables into a new set of uncorrelated variables (principal components) that preserve as much information or variability in the dataset as possible. To do so, it applies a linear transformation to the data onto a new coordinate system such that the directions (principal components) that capture the largest variation in the data can be identified.
If you are interested in the mathematical explanation of PCA, you can check the following hidden block. Otherwise, please feel free to skip the math.
Click the button to show the math for PCA
Principal components are the eigenvectors of the covariance matrix of the centered data. The magnitude of their corresponding eigenvalues indicates the explained variance that each principal component captures. That way, the eigenvectors with the \(n\) highest principal components will be chosen as principal components.
The step to apply PCA with \(n\) components to a dataset \(X\) are the following:
Calculate covariance matrix of the centered data:
Subtract the mean (\(\mu_X\)) of each feature
\[ \bar{X} = X - \mu_X, \]and compute the covariance matrix of the centered data \(\bar{X}\):
\[ C = \frac{1}{N} \bar{X}^T \bar{X}.\]Compute Eigenvalues and Eigenvectors:
Derive the eigenvalues \(\{\lambda_i\}\) and eigenvectors \(\{v_i\}\) of the covariance matrix \(C\). Each eigenvector represents a principal component, and its corresponding eigenvalue indicates the variance explained by that principal component.
Choose the principal components
Order the eigenvalues in descending order.
\[ \lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_D \]and select \(v_1\), \(v_2\), …, \(v_n\) as principal components.
Data Transformation:
Use the original data \(X\) and the chosen principal components to obtain the transformed data \(X'\):
\[ X' = X P_k \]where \(P_n\) is the matrix with the first \(n\) eigenvectors as columns.
Dimensionality reduction techiniques are often used to visualize larger-dimensional data in a 2D or 3D space. Let’s apply PCA to our example.
The following function creates a dataframe with the n
principal components in the dataset:
def get_pca_result(x, y, n=3):
"""
Get the result of Principal Component Analysis.
Parameters
----------
x : pandas.DataFrame
The features.
y : pandas.DataFrame
The labels.
n : int
Number of principal components.
Returns
-------
df_pc : pandas.DataFrame
A data frame with information about principal components.
df_r : pandas.DataFrame
A data frame with information about ratios of explained variances.
"""
# Copy the data to prevent editing it.
x, y = deepcopy(x), deepcopy(y)
# Run the principal component analysis.
pca = PCA(n_components=n)
# Compute the eigenvectors, which are the principal components.
pc = pca.fit_transform(x)
columns = ["PC" + str(i) for i in range(1,1+n)]
df_pc = pd.DataFrame(data=pc, columns=columns)
# Set the label for plotting.
df_pc["y"] = y.astype(str)
# Set the marker size for plotting.
df_pc["size"] = 15
# Get eigenvalues (i.e., variances explained by principal components).
r = np.round(pca.explained_variance_ratio_, 3)
df_r = pd.DataFrame({"var":r, "pc":columns})
return df_pc, df_r
The code below prints the outputs from the get_pca_result
function.
df_pc_pca, df_r_pca = get_pca_result(df_x, df_y_40, n=3)
print(df_pc_pca)
print(df_r_pca)
PC1 PC2 PC3 y size
0 -2.544053 -7.423711 -4.220014 0 15
1 -1.899612 -7.726739 -1.953110 0 15
2 -0.020275 -7.352263 -0.359142 0 15
3 1.970582 -6.803431 0.281523 0 15
4 3.207700 -6.725773 -0.596667 0 15
... ... ... ... .. ...
16746 -3.848883 -4.378285 3.056364 0 15
16747 -3.791501 -3.742726 3.182619 0 15
16748 -3.576693 -2.224402 2.208638 0 15
16749 -3.338212 -1.966243 1.894476 0 15
16750 -3.708470 -1.175944 1.687017 0 15
[16751 rows x 5 columns]
var pc
0 0.254 PC1
1 0.118 PC2
2 0.084 PC3
Now let us plot the ratios of explained variances (i.e., the eigenvalues). The intuition is that if the explained variance ratio is larger, the corresponding principal component is more important and can represent more information.
# Plot the eigenvalues.
ax1 = sns.barplot(x="pc",y="var", data=df_r_pca, color="c")
ax1 = ax1.set(xlabel="Principal Component",
ylabel="Ratio of Variance Explained",
title="Variance Explained by Principal Components")
We can also plot the first and second principal components to see the distribution of labels. In the figure below, we see that label 1
(which means having a smell event) is concentrated on one side. However, it looks like labels 0
and 1
do not separate well, which means that the classifier will have difficulty distinguishing these two groups.
# Plot two principal components.
ax2 = sns.lmplot(x="PC1", y="PC2", data=df_pc_pca,
fit_reg=False, hue="y", legend=True,
scatter_kws={"s": 10, "alpha" :0.4})
ax2 = ax2.set(title="Distribution of Labels by Principal Components")
In the interactive figure below, we can zoom in, zoom out, and rotate the view to check the distribution of labels. We can confirm that label 1
is concentrated on one side. But we also observe that there is a large part of the data that may not be separated well by the classifier (especially linear classifiers). We probably need non-linear classifiers to be able to tell the differences between these two groups.
# Use the Plotly package to show the PCA results.
fig = px.scatter_3d(df_pc_pca.sample(n=5000), x="PC1", y="PC2", z="PC3",
color="y", symbol="y", opacity=0.6,
size="size", size_max=15,
category_orders=dict(y=["0", "1"]),
height=700)
fig.show()
Evaluation Metrics#
Next, let us pick a subset of the sensor data (instead of using all of them) and prepare the function for computing the evaluation metrics. Our intuition is that the smell may come from chemical compounds near major pollution sources. From the knowledge of local people, there is a large pollution source, which is the Clairton Mill Works that belongs to the United States Steel Corporation. This pollution source is located at the south part of Pittsburgh. This factory produces petroleum coke, which is a fuel to refine steel. And during the coke refining process, it generates pollutants.
One of the pollutant is H2S (hydrogen sulfide), which smells like rotten eggs. We think that H2S near the pollution source may be a good feature. So we first select the column with H2S measurements from a monitoring station near this pollution source. We also think that the day of week and the hour of day may be good features, as the air pollution may have a certain patterns in time.
df_x_subset = df_x[["3.feed_28.H2S_PPM_pre_1h", "day_of_week_sine", "day_of_week_cosine", "hour_of_day_sine", "hour_of_day_cosine"]]
df_x_subset
3.feed_28.H2S_PPM_pre_1h | day_of_week_sine | day_of_week_cosine | hour_of_day_sine | hour_of_day_cosine | |
---|---|---|---|---|---|
0 | -0.426339 | 0.000000 | 1.0 | -2.449294e-16 | 1.000000 |
1 | -0.426339 | 0.866025 | 0.5 | 0.000000e+00 | 1.000000 |
2 | -0.426339 | 0.866025 | 0.5 | 2.697968e-01 | 0.962917 |
3 | -0.426339 | 0.866025 | 0.5 | 5.195840e-01 | 0.854419 |
4 | -0.426339 | 0.866025 | 0.5 | 7.308360e-01 | 0.682553 |
... | ... | ... | ... | ... | ... |
16746 | -0.426339 | -0.866025 | 0.5 | -9.976688e-01 | -0.068242 |
16747 | -0.426339 | -0.866025 | 0.5 | -9.790841e-01 | 0.203456 |
16748 | -0.426339 | -0.866025 | 0.5 | -8.878852e-01 | 0.460065 |
16749 | -0.426339 | -0.866025 | 0.5 | -7.308360e-01 | 0.682553 |
16750 | -0.426339 | -0.866025 | 0.5 | -5.195840e-01 | 0.854419 |
16751 rows × 5 columns
Next, we will train and evaluate a model (F) that maps features (i.e., the sensor readings) to labels (i.e., the smell events). We have the functions ready in the following code cell to help us train and evaluate models.
def scorer(model, x, y):
"""
A customized scoring function to evaluate a classifier.
Parameters
----------
model : a sklearn model object
The classifier model.
x : pandas.DataFrame
The feature matrix.
y : pandas.Series
The label vector.
Returns
-------
dict of int or float
A dictionary of evaluation metrics.
"""
y_pred = model.predict(x)
c = confusion_matrix(y, y_pred, labels=[0,1])
p = precision_recall_fscore_support(y, y_pred, average="binary", zero_division=0)
a = accuracy_score(y, y_pred)
return {"tn": c[0,0], "fp": c[0,1], "fn": c[1,0], "tp": c[1,1],
"precision": p[0], "recall": p[1], "f1": p[2], "accuracy": a}
def train_and_evaluate(model, df_x, df_y, train_size=336, test_size=168):
"""
Parameters
----------
model : a sklearn model object
The classifier model.
df_x : pandas.DataFrame
The dataframe with features.
df_y : pandas.DataFrame
The dataframe with labels.
train_size : int
Number of samples for training.
test_size : int
Number of samples for testing.
"""
print("Use model", model)
print("Perform cross-validation, please wait...")
# Create time series splits for cross-validation.
splits = []
dataset_size = df_x.shape[0]
for i in range(train_size, dataset_size, test_size):
start = i - train_size
end = i + test_size
if (end >= dataset_size): break
train_index = range(start, i)
test_index = range(i, end)
splits.append((list(train_index), list(test_index)))
# Perform cross-validation.
cv_res = cross_validate(model, df_x, df_y.squeeze(), cv=splits, scoring=scorer)
# Print evaluation metrics.
print("================================================")
print("average f1-score:", round(np.mean(cv_res["test_f1"]), 2))
print("average precision:", round(np.mean(cv_res["test_precision"]), 2))
print("average recall:", round(np.mean(cv_res["test_recall"]), 2))
print("average accuracy:", round(np.mean(cv_res["test_accuracy"]), 2))
print("================================================")
The train_and_evaluate
function prints the averaged f1-score, averaged precision, averaged recall, and averaged accuracy across all the folds. These metrics are always in the range of zero and one, with zero being the worst and one being the best. We also printed the confusion matrix that contains true positives, false positives, true negatives, and false negatives. To understand the evaluation metrics, let us first take a look at the confusion matrix, explained below:
True Positives
There is a smell event in the real world, and the model correctly predicts that there is a smell event.
False Positives
There is no smell event in the real world, but the model falsely predicts that there is a smell event.
True Negatives
There is no smell event in the real world, and the model correctly predicts that there is no smell event.
False Negatives
There is a smell event in the real world, but the model falsely predicts that there is no smell event.
The accuracy metric is defined in the equation below:
accuracy = (true positives + true negatives) / total number of data points
Accuracy is the number of correct predictions divided by the total number of data points. It is a good metric if the data distribution is not skewed (i.e., the number of data records that have a bad smell and do not have a bad smell is roughly equal). But if the data is skewed, which is the case in our dataset, we will need another set of evaluation metrics: f1-score, precision, and recall. We will use an example later to explain why accuracy is an unfair metric for our dataset.
The precision metric is defined in the equation below:
precision = true positives / (true positives + false positives)
In other words, precision means how precise the prediction is. High precision means that if the model predicts “yes” for smell events, it is highly likely that the prediction is correct. We want high precision because we want the model to be as precise as possible when it says there will be smell events.
Next, the recall metric is defined in the equation below:
recall = true positives / (true positives + false negatives)
In other words, recall means the ability of the model to catch events. High recall means that the model has a low chance to miss the events that happen in the real world. We want high recall because we want the model to catch all smell events without missing them.
Typically, there is a tradeoff between precision and recall, and one may need to choose to go for a high precision but low recall model, or we go for a high recall but low precision model. The tradeoff depends on the context. For example, in medical applications, one may not want to miss the events (e.g., cancer) since the events are connected to patients’ quality of life. In our application of predicting smell events, we may not want the model to make false predictions when saying “yes” to smell events. The reason is that people may lose trust in the prediction model when we make real-world interventions incorrectly, such as sending push notifications to inform the users about the bad smell events.
The f1-score metric is a combination of recall and precision, as indicated below:
f1-score = 2 * (precision * recall) / (precision + recall)
Use the Dummy Classifier#
Now that we have explained the evaluation metrics. Let us first use a dummy classifier that always predicts no smell events. In other words, the dummy classifier never predicts “yes” about the presence of smell events. Later we will guide you through using more advanced machine learning models.
dummy_model = DummyClassifier(strategy="constant", constant=0)
train_and_evaluate(dummy_model, df_x_subset, df_y_40, train_size=336, test_size=168)
Use model DummyClassifier(constant=0, strategy='constant')
Perform cross-validation, please wait...
================================================
average f1-score: 0.0
average precision: 0.0
average recall: 0.0
average accuracy: 0.92
================================================
The printed message above shows the evaluation result of the dummy classifier. We see that the accuracy is 0.92, which is very high. But the f1-score, precision, and recall are zero since there are no true positives. This is because the Smell Pittsburgh dataset has a skewed distribution of smell events, which means that there are a lot of “no” (i.e., label 0
) but only a small part of “yes” (i.e., label 1
). This skewed data distribution corresponds to what happened in Pittsburgh. Most of the time, the odors in the city area are OK and not too bad. Occasionally, there can be very bad pollution odors, where many people complain.
By the definition of accuracy, the dummy classifier (which always says “no”) has a very high accuracy of 0.92. This is because only 9% of the data indicate bad smell events. So, you can see that accuracy is not a fair evaluation metric for the Smell Pittsburgh dataset. And instead, we need to go for the f1-score, precision, and recall metrics.
This step uses cross-validation to evaluate the machine learning model, where the data is divided into several parts, and some parts are used for training. Other parts are used for testing. Typically people use K-fold cross-validation, which means that the entire dataset is split into K parts. One part is used for testing (i.e., the testing set), and the other parts are used for training (i.e., the training set). This procedure is repeated K times so that every fold has the chance of being tested. The result is averaged to indicate the performance of the model, for example, averaged accuracy. We can then compare the results for different machine learning pipelines.
However, the script uses a different cross-validation approach, where we only use the previous folds to train the model to test future folds. For example, if we want to test the third fold, we will only use a part of the data from the first and second fold to train the model. The reason is that the Smell Pittsburgh dataset is primarily time-series data, which means the dataset has timestamps for every data record. In other words, things that happened in the past may affect the future. So, in fact, it does not make sense to use the data in the future to train a model to predict what happened in the past. Our time-series cross-validation approach is shown in the following figure.
For the train_and_evaluate
function, test_size
is the number of samples for testing, and train_size
is the number of samples for training. We need to set these numbers for time-series cross-validation. For example, setting test_size
to 168 means using 168 samples for testing, which also means 168 hours (or 7 days) of data. Setting train_size
to 336 means using 336 samples for testing, which also means 336 hours (or 14 days) of data. So, this means we are using previous 14 days of sensor data to train the model, and then use the model to predict the smell events in the next 7 days. In this setting, every Sunday we can re-train the model with the updated data, so that we have the updated model to predict smell events every week.
Use the Decision Tree Model#
Now, instead of using the dummy classifier, we are going to use a different model. Let us use the Decision Tree model and compare its performance with the dummy classifier.
dt_model = DecisionTreeClassifier()
train_and_evaluate(dt_model, df_x_subset, df_y_40, train_size=336, test_size=168)
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.13
average precision: 0.18
average recall: 0.13
average accuracy: 0.88
================================================
From the printed message above, notice that the Decision Tree model produces non-zero true positives and false positives (compared to the dummy classifier). Also, notice that f1-score, precision, and recall are no longer zero.
Decision Tree is a type of machine learning model. You can think of it as how a medical doctor diagnoses patients. For example, to determine if the patients need treatments, the medical doctor may ask the patients to describe symptoms. Depending on the symptoms, the doctor decides which treatment should be applied for the patient.
One can think of the smell prediction model as an air quality expert who is diagnosing the pollution patterns based on air quality and weather data. The treatment is to send a push notification to citizens to inform them of the presence of bad smell to help people plan daily activities. This decision-making process can be seen as a tree structure as shown in the following figure, where the first node is the most important factor to decide the treatment.
The above figure is just a toy example to show what a decision tree is. In our case, we put the features (X) into the decision tree to train it. Then, the tree will decide which feature to use and what is the threshold to split the data based on the features. This procedure is repeated several times (represented by the depth of the tree). Finally, the model will make a prediction (y) at the final node of the tree.
You can find the visualization of a decision tree (trained using the real data) in Figure 8 in the Smell Pittsburgh paper. More information about the Decision Tree can be found in the following URL and paper:
More information about Decision Tree: https://scikit-learn.org/stable/modules/tree.html
Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1(1), 81-106.
Use the Random Forest Model#
Now that you have tried the Decision Tree model. Let us use a more advanced model, Random Forest, for smell event prediction.
rf_model = RandomForestClassifier()
train_and_evaluate(rf_model, df_x_subset, df_y_40, train_size=336, test_size=168)
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.11
average precision: 0.18
average recall: 0.1
average accuracy: 0.89
================================================
Notice that the performance of the model does not look much better than the Decision Tree model. And in fact, both models currently have poor performance. This can have several meanings, as indicated in the following list. You will explore these questions more in the assignment for this task.
Firstly, do we really believe that we are using a good set of features? Is it sufficient to only use the H2S (hydrogen sulfide) feature? Is it sufficient to only include the data from the previous hour (i.e., the
"3.feed_28.H2S_PPM_pre_1h"
column)?Secondly, the machine learning pipeline uses 14 days of data in the past (i.e.,
train_size=336
) to predict smell events in the future 7 days (i.e.,test_size=168
). Do we believe that 14 days are sufficient for training a good model?Finally, the decision tree and random forest models has many hyper-parameters (e.g., maximum depth of tree). Currently, the model uses the default hyper-parameters. Do we believe that the default setting is good?
Now let us take a look at the Random Forest model, which is a type of ensemble model. You can think about the ensemble model as a committee that makes decisions collaboratively, such as using majority voting. For example, to determine the treatment of a patient, we can ask the committee of medical doctors for a collaborative decision. The committee has several members who correspond to various machine learning models. The Random Forest model is a committee that is formed with many decision trees. Each tree is trained using different sets of data, as shown in the following figure.
In other words, we first trained many decision trees, and each of them has access to only a part of the data (but not all of the data) that are randomly selected. So, each decision tree sees different sets of features. Then, we ask the committee members (i.e., the decision trees) to make predictions, and the final result is the one that receives the highest votes.
The intuition for having a committee (instead of only a single tree) is that we believe a diverse set of models can make better decisions collaboratively. There is mathematical proof about this intuition, but the proof is outside the scope of this course. More information about the Random Forest model can be found in the following URL and paper:
More information about Random Forest: https://scikit-learn.org/stable/modules/ensemble.html
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Compute Feature Importance#
After training the model and evaluate its performance, we now have a better understanding about how to predict smell events. However, what if we want to know which are the important features? For example, which pollutants are the major source of the bad smell? Which pollution source is likely related to the bad smell? Under what situation will the pollutants travel to the Pittsburgh city area? This information can be important to help the municipality evaluate air pollution policies. This information can also help local communities to advocate for policy changes.
It turns out that we can permute the data in a specific column to know the importance. If a column (corresponding to a feature) is important, permuting the data specifically for the column will make the model performacne decrease. A higher decrease of a metric (e.g., f1-score) means that the feature is more important. It also means the feature is important for the model to make decisions. So, we can compute the “decrease” of a metric and use it as feature importance. We have provided the function in the following coding block to do this.
def compute_feature_importance(model, df_x, df_y, scoring="f1"):
"""
Compute feature importance of a model.
Parameters
----------
model : a sklearn model object
The classifier model.
df_x : pandas.DataFrame
The dataframe with features.
df_y : pandas.DataFrame
The dataframe with labels.
scoring : str
A scoring function as documented below.
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
"""
if model is None:
model = RandomForestClassifier()
print("Computer feature importance using", model)
model.fit(df_x, df_y.squeeze())
result = permutation_importance(model, df_x, df_y, n_repeats=10, random_state=0, scoring=scoring)
feat_names = df_x.columns.copy()
feat_ims = np.array(result.importances_mean)
sorted_ims_idx = np.argsort(feat_ims)[::-1]
feat_names = feat_names[sorted_ims_idx]
feat_ims = np.round(feat_ims[sorted_ims_idx], 5)
df = pd.DataFrame()
df["feature_importance"] = feat_ims
df["feature_name"] = feat_names
print("=====================================================================")
print(df)
print("=====================================================================")
compute_feature_importance(rf_model, df_x_subset, df_y_40, scoring="f1")
Computer feature importance using RandomForestClassifier()
=====================================================================
feature_importance feature_name
0 0.45507 3.feed_28.H2S_PPM_pre_1h
1 0.26435 hour_of_day_sine
2 0.22926 hour_of_day_cosine
3 0.15682 day_of_week_sine
4 0.13854 day_of_week_cosine
=====================================================================
Notice that the compute_feature_importance
function can take a very long time to run if you use a large amount of training data.
The above code prints feature importance, which indicates the influence of each feature on the model prediction result. Here we use the Random Forest model. For example, in the printed message, the 3.feed_28.H2S_PPM_pre_1h
feature has the highest importance. The values in the feature importance here represent the decrease of f1-score (because we use scoring="f1"
) if we randomly permute the data related to the feature.
For example, if we randomly permute the H2S measurement in the 3.feed_28.H2S_PPM_pre_1h
column for all the data records (but keep other features the same), the f1-score of the Random Forest model will drop about 0.45
. The intuition is that if a feature is more important, the model performance will decrease more when the feature is randomly permuted (i.e., when the data that is associated with the feature are messed up). More information can be found in the following URL:
More information about feature importance: https://scikit-learn.org/stable/modules/permutation_importance.html
Notice that to use this technique, the model needs to fit the data reasonably well. Also depending on the number of features you are using, the step of computing the feature importance can take a lot of time.
Assignment for Task 7#
You have learned how to use different models and feature sets. For this assignment, use the knowledge that you learned to conduct experiments to answer the following questions:
Is including sensor data in the past a good idea to help improve model performance?
Hint: Check precision, recall, and F1-score.
Is the wind direction from the air quality monitoring station (i.e., feed 28) that near the pollution source a good feature to predict bad smell?
Hint: Compare the performance of models with different feature sets and check the feature importance.
You need to tweak parameters in the code to conduct a pilot experiment to understand if wind direction is a good feature for predicting the presence of bad smell. Also, you need to inspect if including more data from the previous hours is helpful. Specifically, you need to do the experiment using the combinations of two models (Decision Tree and Random Forest) and four feature sets (as described below). So in total, there should be 8 situations to compare.
The 1st feature set below has the H2S data from the previous 1 hour:
3.feed_28.H2S_PPM_pre_1h
day_of_week_sine
day_of_week_cosine
hour_of_day_sine
hour_of_day_cosine
The 2nd feature set below has the H2S and wind data from the previous 1 hour:
3.feed_28.H2S_PPM_pre_1h
3.feed_28.SONICWD_DEG_sine_pre_1h
3.feed_28.SONICWD_DEG_cosine_pre_1h
day_of_week_sine
day_of_week_cosine
hour_of_day_sine
hour_of_day_cosine
The 3rd feature set below has the H2S data from the previous 2 hours:
3.feed_28.H2S_PPM_pre_1h
3.feed_28.H2S_PPM_pre_2h
day_of_week_sine
day_of_week_cosine
hour_of_day_sine
hour_of_day_cosine
The 4th feature set below has the H2S and wind data from the previous 2 hours:
3.feed_28.H2S_PPM_pre_1h
3.feed_28.SONICWD_DEG_sine_pre_1h
3.feed_28.SONICWD_DEG_cosine_pre_1h
3.feed_28.H2S_PPM_pre_2h
3.feed_28.SONICWD_DEG_sine_pre_2h
3.feed_28.SONICWD_DEG_cosine_pre_2h
day_of_week_sine
day_of_week_cosine
hour_of_day_sine
hour_of_day_cosine
def experiment(df_x, df_y):
"""
Perform experiments and print the results.
Parameters
----------
df_x : pandas.DataFrame
The data frame that contains all features.
df_y : pandas.DataFrame
The data frame that contains labels.
"""
###################################
# Fill in your answer here
print("None")
###################################
experiment(df_x, df_y_40)
None
You need to write the function above that can do similar things as the following from the answer.
answer_experiment(df_x, df_y_40)
Use feature set ['3.feed_28.H2S_PPM_pre_1h', 'day_of_week_sine', 'day_of_week_cosine', 'hour_of_day_sine', 'hour_of_day_cosine']
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.12
average precision: 0.17
average recall: 0.12
average accuracy: 0.88
================================================
Computer feature importance using DecisionTreeClassifier()
=====================================================================
feature_importance feature_name
0 0.43245 3.feed_28.H2S_PPM_pre_1h
1 0.22597 hour_of_day_sine
2 0.18844 hour_of_day_cosine
3 0.15539 day_of_week_sine
4 0.11170 day_of_week_cosine
=====================================================================
Use feature set ['3.feed_28.H2S_PPM_pre_1h', 'day_of_week_sine', 'day_of_week_cosine', 'hour_of_day_sine', 'hour_of_day_cosine', '3.feed_28.SONICWD_DEG_sine_pre_1h', '3.feed_28.SONICWD_DEG_cosine_pre_1h']
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.18
average precision: 0.22
average recall: 0.19
average accuracy: 0.87
================================================
Computer feature importance using DecisionTreeClassifier()
=====================================================================
feature_importance feature_name
0 0.66030 3.feed_28.H2S_PPM_pre_1h
1 0.63520 3.feed_28.SONICWD_DEG_sine_pre_1h
2 0.53829 3.feed_28.SONICWD_DEG_cosine_pre_1h
3 0.44334 hour_of_day_sine
4 0.40960 hour_of_day_cosine
5 0.28764 day_of_week_cosine
6 0.26576 day_of_week_sine
=====================================================================
Use feature set ['3.feed_28.H2S_PPM_pre_1h', 'day_of_week_sine', 'day_of_week_cosine', 'hour_of_day_sine', 'hour_of_day_cosine', '3.feed_28.H2S_PPM_pre_2h']
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.14
average precision: 0.18
average recall: 0.14
average accuracy: 0.88
================================================
Computer feature importance using DecisionTreeClassifier()
=====================================================================
feature_importance feature_name
0 0.56613 3.feed_28.H2S_PPM_pre_1h
1 0.51759 3.feed_28.H2S_PPM_pre_2h
2 0.36724 hour_of_day_sine
3 0.34119 hour_of_day_cosine
4 0.23746 day_of_week_sine
5 0.17990 day_of_week_cosine
=====================================================================
Use feature set ['3.feed_28.H2S_PPM_pre_1h', 'day_of_week_sine', 'day_of_week_cosine', 'hour_of_day_sine', 'hour_of_day_cosine', '3.feed_28.H2S_PPM_pre_2h', '3.feed_28.SONICWD_DEG_sine_pre_2h', '3.feed_28.SONICWD_DEG_cosine_pre_2h']
Use model DecisionTreeClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.16
average precision: 0.19
average recall: 0.16
average accuracy: 0.87
================================================
Computer feature importance using DecisionTreeClassifier()
=====================================================================
feature_importance feature_name
0 0.63482 3.feed_28.SONICWD_DEG_sine_pre_2h
1 0.62400 3.feed_28.H2S_PPM_pre_1h
2 0.51480 3.feed_28.SONICWD_DEG_cosine_pre_2h
3 0.47225 3.feed_28.H2S_PPM_pre_2h
4 0.41479 hour_of_day_sine
5 0.36548 hour_of_day_cosine
6 0.25713 day_of_week_sine
7 0.23126 day_of_week_cosine
=====================================================================
Use feature set ['3.feed_28.H2S_PPM_pre_1h', 'day_of_week_sine', 'day_of_week_cosine', 'hour_of_day_sine', 'hour_of_day_cosine']
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.11
average precision: 0.19
average recall: 0.1
average accuracy: 0.89
================================================
Computer feature importance using RandomForestClassifier()
=====================================================================
feature_importance feature_name
0 0.45746 3.feed_28.H2S_PPM_pre_1h
1 0.26045 hour_of_day_sine
2 0.23248 hour_of_day_cosine
3 0.16363 day_of_week_sine
4 0.14233 day_of_week_cosine
=====================================================================
Use feature set ['3.feed_28.H2S_PPM_pre_1h', 'day_of_week_sine', 'day_of_week_cosine', 'hour_of_day_sine', 'hour_of_day_cosine', '3.feed_28.SONICWD_DEG_sine_pre_1h', '3.feed_28.SONICWD_DEG_cosine_pre_1h']
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.12
average precision: 0.22
average recall: 0.1
average accuracy: 0.9
================================================
Computer feature importance using RandomForestClassifier()
=====================================================================
feature_importance feature_name
0 0.64428 3.feed_28.H2S_PPM_pre_1h
1 0.52672 3.feed_28.SONICWD_DEG_sine_pre_1h
2 0.45130 3.feed_28.SONICWD_DEG_cosine_pre_1h
3 0.44752 hour_of_day_sine
4 0.44427 hour_of_day_cosine
5 0.25992 day_of_week_cosine
6 0.24912 day_of_week_sine
=====================================================================
Use feature set ['3.feed_28.H2S_PPM_pre_1h', 'day_of_week_sine', 'day_of_week_cosine', 'hour_of_day_sine', 'hour_of_day_cosine', '3.feed_28.H2S_PPM_pre_2h']
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.12
average precision: 0.19
average recall: 0.11
average accuracy: 0.89
================================================
Computer feature importance using RandomForestClassifier()
=====================================================================
feature_importance feature_name
0 0.63206 3.feed_28.H2S_PPM_pre_1h
1 0.50660 3.feed_28.H2S_PPM_pre_2h
2 0.40831 hour_of_day_sine
3 0.37534 hour_of_day_cosine
4 0.23481 day_of_week_sine
5 0.21186 day_of_week_cosine
=====================================================================
Use feature set ['3.feed_28.H2S_PPM_pre_1h', 'day_of_week_sine', 'day_of_week_cosine', 'hour_of_day_sine', 'hour_of_day_cosine', '3.feed_28.H2S_PPM_pre_2h', '3.feed_28.SONICWD_DEG_sine_pre_2h', '3.feed_28.SONICWD_DEG_cosine_pre_2h']
Use model RandomForestClassifier()
Perform cross-validation, please wait...
================================================
average f1-score: 0.13
average precision: 0.23
average recall: 0.11
average accuracy: 0.91
================================================
Computer feature importance using RandomForestClassifier()
=====================================================================
feature_importance feature_name
0 0.54452 3.feed_28.H2S_PPM_pre_1h
1 0.47807 3.feed_28.H2S_PPM_pre_2h
2 0.47356 3.feed_28.SONICWD_DEG_sine_pre_2h
3 0.40433 hour_of_day_cosine
4 0.40221 hour_of_day_sine
5 0.37935 3.feed_28.SONICWD_DEG_cosine_pre_2h
6 0.21947 day_of_week_sine
7 0.21812 day_of_week_cosine
=====================================================================
- 1
Credit: this teaching material is created by Yen-Chia Hsu and revised by Alejandro Monroy.