Tutorial (Structured Data Processing)
Contents
Tutorial (Structured Data Processing)#
(Last updated: Feb 16, 2023)1
This tutorial will familiarize you with the data science pipeline of processing structured data, using a real-world example of building models to predict and explain the presence of bad smell events in Pittsburgh based on air quality and weather data. The models are used to send push notifications about bad smell events to inform citizens, as well as to explain local pollution patterns to inform stakeholders.
The scenario is in the next section of this tutorial, and more details are in the introduction section of the Smell Pittsburgh paper. We will use the same dataset as used in the Smell Pittsburgh paper as an example of structured data. During this tutorial, we will explain what the variables in the dataset mean and also guide you through model building. Below is the pipeline of this tutorial.
You can use the following link to jump to the tasks and assignments:
Scenario#
Local citizens in Pittsburgh are organizing communities to advocate for changes in air pollution regulations. Their goal is to investigate the air pollution patterns in the city to understand the potential sources related to the bad odor. The communities rely on the Smell Pittsburgh application (as indicated in the figure below) to collect smell reports from citizens that live in the Pittsburgh region. Also, there are air quality and weather monitoring stations in the Pittsburgh city region that provide sensor measurements, including common air pollutants and wind information.
You work in a data science team to develop models to map the sensor data to bad smell events. Your team has been working with the Pittsburgh local citizens closely for a long time, and therefore you know the meaning of each variable in the feature set that is used to train the machine learning model. The Pittsburgh community needs your help timely to analyze the data that can help them present evidence of air pollution to the municipality and explain the patterns to the general public.
Import Packages#
We put all the packages that are needed for this tutorial below:
import pandas as pd
import numpy as np
import pytz
import seaborn as sns
import matplotlib.pyplot as plt
from os.path import isfile, join
from os import listdir
from copy import deepcopy
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
from pandas.api.indexers import FixedForwardWindowIndexer
from sklearn.decomposition import PCA
import plotly.express as px
Task Answers#
The code block below contains answers for the assignments in this tutorial. Do not check the answers in the next cell before practicing the tasks.
def check_answer_df(df_result, df_answer, n=1):
"""
This function checks if two output dataframes are the same.
Parameters
----------
df_result : pandas.DataFrame
The result from the output of a function.
df_answer: pandas.DataFrame
The expected output of the function.
n : int
The numbering of the test case.
"""
try:
assert df_answer.equals(df_result)
print("Test case %d passed." % n)
except:
print("Test case %d failed." % n)
print("")
print("Your output is:")
print(df_result)
print("")
print("Expected output is:")
print(df_answer)
def answer_preprocess_sensor(df_list):
"""
This function is the answer of task 5.
Preprocess sensor data.
Parameters
----------
df_list : list of pandas.DataFrame
A list of data frames that contain sensor data from multiple stations.
Returns
-------
pandas.DataFrame
The preprocessed sensor data.
"""
# Resample all the data frames.
df_resample_list = []
for df in df_list:
# Convert the timestamp to datetime.
df.index = pd.to_datetime(df.index, unit="s", utc=True)
# Resample the timestamps by hour and average all the previous values.
# Because we want data from the past, so label need to be "right".
df_resample_list.append(df.resample("60Min", label="right").mean())
# Merge all data frames.
df = df_resample_list.pop(0)
index_name = df.index.name
while len(df_resample_list) != 0:
# We need to use outer merging since we want to preserve data from both data frames.
df = pd.merge_ordered(df, df_resample_list.pop(0), on=df.index.name, how="outer", fill_method=None)
# Move the datetime column to index
df = df.set_index(index_name)
# Fill in the missing data with value -1.
df = df.fillna(-1)
return df
def answer_preprocess_smell(df):
"""
This function is the answer of task 4.
Preprocess smell data.
Parameters
----------
df : pandas.DataFrame
The raw smell reports data.
Returns
-------
pandas.DataFrame
The preprocessed smell data.
"""
# Copy the dataframe to avoid editing the original one.
df = df.copy(deep=True)
# Drop the columns that we do not need.
df = df.drop(columns=["feelings_symptoms", "smell_description", "zipcode"])
# Select only the reports within the range of 3 and 5.
df = df[(df["smell_value"]>=3)&(df["smell_value"]<=5)]
# Convert the timestamp to datetime.
df.index = pd.to_datetime(df.index, unit="s", utc=True)
# Resample the timestamps by hour and sum up all the future values.
# Because we want data from the future, so label need to be "left".
df = df.resample("60Min", label="left").sum()
# Fill in the missing data with value 0.
df = df.fillna(0)
return df
def answer_sum_current_and_future_data(df, n_hr=0):
"""
This function is the answer of task 6.
Sum up data in the current and future hours.
Parameters
----------
df : pandas.DataFrame
The preprocessed smell data.
n_hr : int
Number of hours that we want to sum up the future smell data.
Returns
-------
pandas.DataFrame
The transformed smell data.
"""
# Copy data frame to prevent editing the original one.
df = df.copy(deep=True)
# Fast return if n_hr is 0
if n_hr == 0: return df
# Sum up all smell_values in future hours.
# The rolling function only works for summing up previous values.
# So we need to shift back to get the value in the future.
# Be careful that we need to add 1 to the rolling window size.
# Becasue window size 1 means only using the current data.
# Parameter "closed" need to be "right" because we want the current data.
df = df.rolling(n_hr+1, min_periods=1, closed="right").sum().shift(-1*n_hr)
# Below is an alternative implementation of rolling.
#indexer = FixedForwardWindowIndexer(window_size=n_hr+1)
#df = df.rolling(indexer, min_periods=1).sum()
# Delete the last n_hr rows.
# These n_hr rows have wrong data due to data shifting.
df = df.iloc[:-1*n_hr]
return df
def answer_experiment(df_x, df_y):
"""
This function is the answer of task 7.
Perform experiments and print the results.
Parameters
----------
df_x : pandas.DataFrame
The data frame that contains all features.
df_y : pandas.DataFrame
The data frame that contains labels.
"""
wind = "3.feed_28.SONICWD_DEG"
h2s = "3.feed_28.H2S_PPM"
dow = "day_of_week"
hod = "hour_of_day"
fs1 = [h2s + "_pre_1h", dow + "_sine", dow + "_cosine", hod + "_sine", hod + "_cosine"]
fs1w = fs1 + [wind + "_sine_pre_1h", wind + "_cosine_pre_1h"]
fs2 = fs1 + [h2s + "_pre_2h"]
fs2w = fs2 + [wind + "_sine_pre_2h", wind + "_cosine_pre_2h"]
feature_sets = [fs1, fs1w, fs2, fs2w]
models = [DecisionTreeClassifier(), RandomForestClassifier()]
for m in models:
for fs in feature_sets:
print("Use feature set %s" % (str(fs)))
df_x_fs = df_x[fs]
train_and_evaluate(m, df_x_fs, df_y, train_size=336, test_size=168)
compute_feature_importance(m, df_x_fs, df_y, scoring="f1")
print("")
Task 4: Preprocess Sensor Data#
In this task, we will process the sensor data from various air quality monitoring stations in Pittsburgh. First, we need to load all the sensor data.
path = "smellpgh-v1/esdr_raw"
list_of_files = [f for f in listdir(path) if isfile(join(path, f))]
sensor_raw_list = []
for f in list_of_files:
sensor_raw_list.append(pd.read_csv(join(path, f)).set_index("EpochTime"))
Now, the sensor_raw_list
variable contains all the data frames with sensor values from different air quality monitoring stations. Noted that sensor_raw_list
is an array of data frames. We can print one of them to take a look, as shown below.
print(sensor_raw_list[0])
3.feed_1.SO2_PPM 3.feed_1.H2S_PPM 3.feed_1.SIGTHETA_DEG \
EpochTime
1477891800 0.0 0.0 51.7
1477895400 0.0 0.0 52.7
1477899000 0.0 0.0 52.6
1477902600 0.0 0.0 48.3
1477906200 0.0 0.0 31.1
... ... ... ...
1538267400 0.0 0.0 35.2
1538271000 0.0 0.0 48.2
1538274600 0.0 0.0 30.9
1538278200 0.0 0.0 21.5
1538281800 0.0 0.0 52.1
3.feed_1.SONICWD_DEG 3.feed_1.SONICWS_MPH
EpochTime
1477891800 343.0 3.6
1477895400 351.0 3.5
1477899000 359.0 3.4
1477902600 5.0 2.1
1477906200 41.0 2.2
... ... ...
1538267400 39.0 1.7
1538271000 53.0 1.3
1538274600 62.0 1.5
1538278200 53.0 2.0
1538281800 36.0 1.7
[16729 rows x 5 columns]
The EpochTime
index is the timestamp in epoch time, which means the number of seconds that have elapsed since January 1st, 1970 (midnight UTC/GMT). Other columns mean the sensor data from an air quality monitoring station.
Next, we need to resample and merge all the sensor data frames so that they can be used for modeling. Our goal is to have a dataframe that looks like the following:
df_sensor = answer_preprocess_sensor(sensor_raw_list)
print(df_sensor)
3.feed_1.SO2_PPM 3.feed_1.H2S_PPM \
EpochTime
2016-10-31 06:00:00+00:00 0.0 0.0
2016-10-31 07:00:00+00:00 0.0 0.0
2016-10-31 08:00:00+00:00 0.0 0.0
2016-10-31 09:00:00+00:00 0.0 0.0
2016-10-31 10:00:00+00:00 0.0 0.0
... ... ...
2018-09-30 01:00:00+00:00 0.0 0.0
2018-09-30 02:00:00+00:00 0.0 0.0
2018-09-30 03:00:00+00:00 0.0 0.0
2018-09-30 04:00:00+00:00 0.0 0.0
2018-09-30 05:00:00+00:00 0.0 0.0
3.feed_1.SIGTHETA_DEG 3.feed_1.SONICWD_DEG \
EpochTime
2016-10-31 06:00:00+00:00 51.7 343.0
2016-10-31 07:00:00+00:00 52.7 351.0
2016-10-31 08:00:00+00:00 52.6 359.0
2016-10-31 09:00:00+00:00 48.3 5.0
2016-10-31 10:00:00+00:00 31.1 41.0
... ... ...
2018-09-30 01:00:00+00:00 35.2 39.0
2018-09-30 02:00:00+00:00 48.2 53.0
2018-09-30 03:00:00+00:00 30.9 62.0
2018-09-30 04:00:00+00:00 21.5 53.0
2018-09-30 05:00:00+00:00 52.1 36.0
3.feed_1.SONICWS_MPH 3.feed_23.CO_PPM \
EpochTime
2016-10-31 06:00:00+00:00 3.6 0.2
2016-10-31 07:00:00+00:00 3.5 0.2
2016-10-31 08:00:00+00:00 3.4 0.2
2016-10-31 09:00:00+00:00 2.1 0.2
2016-10-31 10:00:00+00:00 2.2 0.2
... ... ...
2018-09-30 01:00:00+00:00 1.7 0.3
2018-09-30 02:00:00+00:00 1.3 0.3
2018-09-30 03:00:00+00:00 1.5 0.4
2018-09-30 04:00:00+00:00 2.0 0.5
2018-09-30 05:00:00+00:00 1.7 0.4
3.feed_23.PM10_UG_M3 3.feed_29.PM10_UG_M3 \
EpochTime
2016-10-31 06:00:00+00:00 7.0 8.0
2016-10-31 07:00:00+00:00 8.0 8.0
2016-10-31 08:00:00+00:00 5.0 7.0
2016-10-31 09:00:00+00:00 3.0 4.0
2016-10-31 10:00:00+00:00 5.0 5.0
... ... ...
2018-09-30 01:00:00+00:00 20.0 15.0
2018-09-30 02:00:00+00:00 25.0 19.0
2018-09-30 03:00:00+00:00 23.0 55.0
2018-09-30 04:00:00+00:00 23.0 63.0
2018-09-30 05:00:00+00:00 25.0 47.0
3.feed_29.PM25_UG_M3 \
EpochTime
2016-10-31 06:00:00+00:00 8.0
2016-10-31 07:00:00+00:00 8.0
2016-10-31 08:00:00+00:00 7.0
2016-10-31 09:00:00+00:00 4.0
2016-10-31 10:00:00+00:00 4.0
... ...
2018-09-30 01:00:00+00:00 10.0
2018-09-30 02:00:00+00:00 12.0
2018-09-30 03:00:00+00:00 33.0
2018-09-30 04:00:00+00:00 39.0
2018-09-30 05:00:00+00:00 33.0
3.feed_11067.CO_PPB..3.feed_43.CO_PPB ... \
EpochTime ...
2016-10-31 06:00:00+00:00 159.5 ...
2016-10-31 07:00:00+00:00 -1.0 ...
2016-10-31 08:00:00+00:00 133.0 ...
2016-10-31 09:00:00+00:00 236.6 ...
2016-10-31 10:00:00+00:00 269.3 ...
... ... ...
2018-09-30 01:00:00+00:00 455.1 ...
2018-09-30 02:00:00+00:00 761.2 ...
2018-09-30 03:00:00+00:00 1125.4 ...
2018-09-30 04:00:00+00:00 1039.5 ...
2018-09-30 05:00:00+00:00 1636.3 ...
3.feed_3.SO2_PPM 3.feed_3.SONICWD_DEG \
EpochTime
2016-10-31 06:00:00+00:00 0.0 344.0
2016-10-31 07:00:00+00:00 0.0 330.0
2016-10-31 08:00:00+00:00 0.0 0.0
2016-10-31 09:00:00+00:00 0.0 325.0
2016-10-31 10:00:00+00:00 0.0 347.0
... ... ...
2018-09-30 01:00:00+00:00 0.0 39.0
2018-09-30 02:00:00+00:00 0.0 70.0
2018-09-30 03:00:00+00:00 0.0 75.0
2018-09-30 04:00:00+00:00 0.0 74.0
2018-09-30 05:00:00+00:00 0.0 65.0
3.feed_3.SONICWS_MPH 3.feed_3.SIGTHETA_DEG \
EpochTime
2016-10-31 06:00:00+00:00 2.9 43.0
2016-10-31 07:00:00+00:00 2.5 43.6
2016-10-31 08:00:00+00:00 3.1 40.9
2016-10-31 09:00:00+00:00 1.9 40.0
2016-10-31 10:00:00+00:00 1.4 45.1
... ... ...
2018-09-30 01:00:00+00:00 1.3 57.3
2018-09-30 02:00:00+00:00 1.0 54.4
2018-09-30 03:00:00+00:00 0.7 59.5
2018-09-30 04:00:00+00:00 0.7 50.7
2018-09-30 05:00:00+00:00 0.4 77.1
3.feed_3.PM10B_UG_M3 3.feed_5975.PM2_5 \
EpochTime
2016-10-31 06:00:00+00:00 9.0 0.0
2016-10-31 07:00:00+00:00 13.0 5.0
2016-10-31 08:00:00+00:00 7.0 9.0
2016-10-31 09:00:00+00:00 11.0 3.0
2016-10-31 10:00:00+00:00 10.0 9.0
... ... ...
2018-09-30 01:00:00+00:00 11.0 5.0
2018-09-30 02:00:00+00:00 21.0 7.0
2018-09-30 03:00:00+00:00 33.0 8.0
2018-09-30 04:00:00+00:00 32.0 17.0
2018-09-30 05:00:00+00:00 27.0 18.0
3.feed_27.NO_PPB 3.feed_27.NOY_PPB \
EpochTime
2016-10-31 06:00:00+00:00 0.1 2.6
2016-10-31 07:00:00+00:00 -1.0 -1.0
2016-10-31 08:00:00+00:00 0.2 2.1
2016-10-31 09:00:00+00:00 0.1 3.1
2016-10-31 10:00:00+00:00 0.1 2.5
... ... ...
2018-09-30 01:00:00+00:00 0.1 12.1
2018-09-30 02:00:00+00:00 0.2 13.5
2018-09-30 03:00:00+00:00 0.6 13.8
2018-09-30 04:00:00+00:00 0.6 11.4
2018-09-30 05:00:00+00:00 2.0 13.1
3.feed_27.CO_PPB 3.feed_27.SO2_PPB
EpochTime
2016-10-31 06:00:00+00:00 -1.0 0.2
2016-10-31 07:00:00+00:00 106.1 0.0
2016-10-31 08:00:00+00:00 105.8 -1.0
2016-10-31 09:00:00+00:00 111.7 0.0
2016-10-31 10:00:00+00:00 127.2 0.0
... ... ...
2018-09-30 01:00:00+00:00 301.0 0.0
2018-09-30 02:00:00+00:00 357.7 0.0
2018-09-30 03:00:00+00:00 373.6 0.0
2018-09-30 04:00:00+00:00 381.0 0.0
2018-09-30 05:00:00+00:00 377.7 0.0
[16776 rows x 43 columns]
In the expected output above, the EpochTime
index is converted from timestamps into pandas datetime objects, which has the format year-month-day hour:minute:second+timezone
. The +00:00
string means the GMT/UTC timezone. Other columns mean the average value of the sensor data in the previous hour. For example, 2016-10-31 06:00:00+00:00
means October 31 in 2016 at 6AM UTC time, and the cell with column 3.feed_1.SO2_PPM
means the averaged SO2 (sulfur dioxide) values from 5:00 to 6:00.
The column name suffix SO2_PPM
means sulfur dioxide in unit PPM (parts per million). The prefix 3.feed_1.
in the column name means a specific sensor (feed ID 1). You can ignore the 3.
at the begining of the column name. You can find the list of sensors, their names with feed ID (which will be in the data frame columns), and also the meaning of all the suffixes from this link.
Some column names look like 3.feed_11067.SIGTHETA_DEG..3.feed_43.SIGTHETA_DEG
. This means that the column has data from two sensor stations (feed ID 11067 and 43). The reason is that some sensor stations are replaced by the new ones over time. So in this case, we merge sensor readings from both feed ID 11067 and 43.
Assignment for Task 4#
Your task (which is your assignment) is to write a function to do the following:
Sensors can report in various frequencies. So, for each data frame, we need to resample the data by computing the hourly average of sensor measurements from the “previous” hour. For example, at time 8:00, we want to know the average of sensor values between 7:00 and 8:00.
Hint: Use the
pandas.to_datetime
function when converting timestamps to datetime objects. Type?pd.to_datetime
in a code cell for more information.Hint: Use the
pandas.DataFrame.resample
function to resample data. Type?pd.DataFrame.resample
in a code cell for more information.Hint: Use the
pandas.merge_ordered
function when merging data frames. Type?pd.merge_ordered
in a code cell for more information.
Then, merge all the data frames based on their time stamp, which is the
EpochTime
column.Finally, fill in the missing data with the value -1. The reason for not using 0 is that we want the model to know if sensors give values (including zero) or no data.
Hint: Use the
pandas.DataFrame.fillna
function when treating missing values. Type?pd.DataFrame.fillna
in a code cell for more information.
def preprocess_sensor(df_list):
"""
Preprocess sensor data.
Parameters
----------
df_list : list of pandas.DataFrame
A list of data frames that contain sensor data from multiple stations.
Returns
-------
pandas.DataFrame
The preprocessed sensor data.
"""
###################################
# Fill in your answer here
return None
###################################
The code below tests if the output of your function matches the expected output.
check_answer_df(preprocess_sensor(sensor_raw_list), df_sensor, n=1)
Test case 1 failed.
Your output is:
None
Expected output is:
3.feed_1.SO2_PPM 3.feed_1.H2S_PPM \
EpochTime
2016-10-31 06:00:00+00:00 0.0 0.0
2016-10-31 07:00:00+00:00 0.0 0.0
2016-10-31 08:00:00+00:00 0.0 0.0
2016-10-31 09:00:00+00:00 0.0 0.0
2016-10-31 10:00:00+00:00 0.0 0.0
... ... ...
2018-09-30 01:00:00+00:00 0.0 0.0
2018-09-30 02:00:00+00:00 0.0 0.0
2018-09-30 03:00:00+00:00 0.0 0.0
2018-09-30 04:00:00+00:00 0.0 0.0
2018-09-30 05:00:00+00:00 0.0 0.0
3.feed_1.SIGTHETA_DEG 3.feed_1.SONICWD_DEG \
EpochTime
2016-10-31 06:00:00+00:00 51.7 343.0
2016-10-31 07:00:00+00:00 52.7 351.0
2016-10-31 08:00:00+00:00 52.6 359.0
2016-10-31 09:00:00+00:00 48.3 5.0
2016-10-31 10:00:00+00:00 31.1 41.0
... ... ...
2018-09-30 01:00:00+00:00 35.2 39.0
2018-09-30 02:00:00+00:00 48.2 53.0
2018-09-30 03:00:00+00:00 30.9 62.0
2018-09-30 04:00:00+00:00 21.5 53.0
2018-09-30 05:00:00+00:00 52.1 36.0
3.feed_1.SONICWS_MPH 3.feed_23.CO_PPM \
EpochTime
2016-10-31 06:00:00+00:00 3.6 0.2
2016-10-31 07:00:00+00:00 3.5 0.2
2016-10-31 08:00:00+00:00 3.4 0.2
2016-10-31 09:00:00+00:00 2.1 0.2
2016-10-31 10:00:00+00:00 2.2 0.2
... ... ...
2018-09-30 01:00:00+00:00 1.7 0.3
2018-09-30 02:00:00+00:00 1.3 0.3
2018-09-30 03:00:00+00:00 1.5 0.4
2018-09-30 04:00:00+00:00 2.0 0.5
2018-09-30 05:00:00+00:00 1.7 0.4
3.feed_23.PM10_UG_M3 3.feed_29.PM10_UG_M3 \
EpochTime
2016-10-31 06:00:00+00:00 7.0 8.0
2016-10-31 07:00:00+00:00 8.0 8.0
2016-10-31 08:00:00+00:00 5.0 7.0
2016-10-31 09:00:00+00:00 3.0 4.0
2016-10-31 10:00:00+00:00 5.0 5.0
... ... ...
2018-09-30 01:00:00+00:00 20.0 15.0
2018-09-30 02:00:00+00:00 25.0 19.0
2018-09-30 03:00:00+00:00 23.0 55.0
2018-09-30 04:00:00+00:00 23.0 63.0
2018-09-30 05:00:00+00:00 25.0 47.0
3.feed_29.PM25_UG_M3 \
EpochTime
2016-10-31 06:00:00+00:00 8.0
2016-10-31 07:00:00+00:00 8.0
2016-10-31 08:00:00+00:00 7.0
2016-10-31 09:00:00+00:00 4.0
2016-10-31 10:00:00+00:00 4.0
... ...
2018-09-30 01:00:00+00:00 10.0
2018-09-30 02:00:00+00:00 12.0
2018-09-30 03:00:00+00:00 33.0
2018-09-30 04:00:00+00:00 39.0
2018-09-30 05:00:00+00:00 33.0
3.feed_11067.CO_PPB..3.feed_43.CO_PPB ... \
EpochTime ...
2016-10-31 06:00:00+00:00 159.5 ...
2016-10-31 07:00:00+00:00 -1.0 ...
2016-10-31 08:00:00+00:00 133.0 ...
2016-10-31 09:00:00+00:00 236.6 ...
2016-10-31 10:00:00+00:00 269.3 ...
... ... ...
2018-09-30 01:00:00+00:00 455.1 ...
2018-09-30 02:00:00+00:00 761.2 ...
2018-09-30 03:00:00+00:00 1125.4 ...
2018-09-30 04:00:00+00:00 1039.5 ...
2018-09-30 05:00:00+00:00 1636.3 ...
3.feed_3.SO2_PPM 3.feed_3.SONICWD_DEG \
EpochTime
2016-10-31 06:00:00+00:00 0.0 344.0
2016-10-31 07:00:00+00:00 0.0 330.0
2016-10-31 08:00:00+00:00 0.0 0.0
2016-10-31 09:00:00+00:00 0.0 325.0
2016-10-31 10:00:00+00:00 0.0 347.0
... ... ...
2018-09-30 01:00:00+00:00 0.0 39.0
2018-09-30 02:00:00+00:00 0.0 70.0
2018-09-30 03:00:00+00:00 0.0 75.0
2018-09-30 04:00:00+00:00 0.0 74.0
2018-09-30 05:00:00+00:00 0.0 65.0
3.feed_3.SONICWS_MPH 3.feed_3.SIGTHETA_DEG \
EpochTime
2016-10-31 06:00:00+00:00 2.9 43.0
2016-10-31 07:00:00+00:00 2.5 43.6
2016-10-31 08:00:00+00:00 3.1 40.9
2016-10-31 09:00:00+00:00 1.9 40.0
2016-10-31 10:00:00+00:00 1.4 45.1
... ... ...
2018-09-30 01:00:00+00:00 1.3 57.3
2018-09-30 02:00:00+00:00 1.0 54.4
2018-09-30 03:00:00+00:00 0.7 59.5
2018-09-30 04:00:00+00:00 0.7 50.7
2018-09-30 05:00:00+00:00 0.4 77.1
3.feed_3.PM10B_UG_M3 3.feed_5975.PM2_5 \
EpochTime
2016-10-31 06:00:00+00:00 9.0 0.0
2016-10-31 07:00:00+00:00 13.0 5.0
2016-10-31 08:00:00+00:00 7.0 9.0
2016-10-31 09:00:00+00:00 11.0 3.0
2016-10-31 10:00:00+00:00 10.0 9.0
... ... ...
2018-09-30 01:00:00+00:00 11.0 5.0
2018-09-30 02:00:00+00:00 21.0 7.0
2018-09-30 03:00:00+00:00 33.0 8.0
2018-09-30 04:00:00+00:00 32.0 17.0
2018-09-30 05:00:00+00:00 27.0 18.0
3.feed_27.NO_PPB 3.feed_27.NOY_PPB \
EpochTime
2016-10-31 06:00:00+00:00 0.1 2.6
2016-10-31 07:00:00+00:00 -1.0 -1.0
2016-10-31 08:00:00+00:00 0.2 2.1
2016-10-31 09:00:00+00:00 0.1 3.1
2016-10-31 10:00:00+00:00 0.1 2.5
... ... ...
2018-09-30 01:00:00+00:00 0.1 12.1
2018-09-30 02:00:00+00:00 0.2 13.5
2018-09-30 03:00:00+00:00 0.6 13.8
2018-09-30 04:00:00+00:00 0.6 11.4
2018-09-30 05:00:00+00:00 2.0 13.1
3.feed_27.CO_PPB 3.feed_27.SO2_PPB
EpochTime
2016-10-31 06:00:00+00:00 -1.0 0.2
2016-10-31 07:00:00+00:00 106.1 0.0
2016-10-31 08:00:00+00:00 105.8 -1.0
2016-10-31 09:00:00+00:00 111.7 0.0
2016-10-31 10:00:00+00:00 127.2 0.0
... ... ...
2018-09-30 01:00:00+00:00 301.0 0.0
2018-09-30 02:00:00+00:00 357.7 0.0
2018-09-30 03:00:00+00:00 373.6 0.0
2018-09-30 04:00:00+00:00 381.0 0.0
2018-09-30 05:00:00+00:00 377.7 0.0
[16776 rows x 43 columns]
Task 5: Preprocess Smell Data#
In this task, we will preprocess the smell data. First, we need to load the raw smell data.
smell_raw = pd.read_csv("smellpgh-v1/smell_raw.csv").set_index("EpochTime")
print(smell_raw)
feelings_symptoms \
EpochTime
1477935134 NaN
1477956180 NaN
1477956293 NaN
1477973293 Eye irritation, nose burns, headache, woke me up
1478001989 NaN
... ...
1538248172 NaN
1538255258 Coughing
1538268796 NaN
1538281653 No
1538282980 Eye irritation
smell_description smell_value \
EpochTime
1477935134 NaN 1
1477956180 Woodsmoke 2
1477956293 Wood smoke 3
1477973293 Industrial 5
1478001989 Industrial smoke 2
... ... ...
1538248172 Sour sewage 3
1538255258 Smoke 2
1538268796 Like a burning candle 3
1538281653 Greasy air 3
1538282980 Wood smoke and gun powder very Smokey too as f... 5
zipcode
EpochTime
1477935134 15221
1477956180 15218
1477956293 15218
1477973293 15207
1478001989 15213
... ...
1538248172 15213
1538255258 15104
1538268796 15232
1538281653 15222
1538282980 15217
[10353 rows x 4 columns]
The meaning of EpochTime
is explained in the previous task. Other columns mean the self-reported symptoms, descriptions of smell, severity ratings (the smell_value
column), and the zipcode where the report is submitted in Pittsburgh, Pennsylvania. For example, the second row means that the smell report was submitted from the 15218 zipcode with wood smoke description and severity rating 2. For more description about the smell, please check the Smell Pittsburgh website.
Next, we need to resample the smell data so that they can be used for modeling. Our goal is to have a dataframe that looks like the following:
df_smell = answer_preprocess_smell(smell_raw)
print(df_smell)
smell_value
EpochTime
2016-10-31 23:00:00+00:00 3
2016-11-01 00:00:00+00:00 0
2016-11-01 01:00:00+00:00 0
2016-11-01 02:00:00+00:00 0
2016-11-01 03:00:00+00:00 0
... ...
2018-09-30 00:00:00+00:00 3
2018-09-30 01:00:00+00:00 0
2018-09-30 02:00:00+00:00 0
2018-09-30 03:00:00+00:00 0
2018-09-30 04:00:00+00:00 8
[16758 rows x 1 columns]
In the latest row, the timestamp is 2018-09-30 04:00:00+00:00
, which means this row contains the data from 3:00 to 4:00 on September 30 in 2018. This row has smell_value
8, which means the sum of smell report ratings in the above mentioned time range. Notice that the expected output ignores all smell ratings from 1 to 2. This is becasue we only want the ratings that indicate bad smell, which will be further explained below.
Assignment for Task 5#
Your task (which is your assignment) is to write a function to do the following:
First, remove the
feelings_symptoms
,smell_description
, andzipcode
columns since we do not need them.Hint: Use the
pandas.DataFrame.drop
function. Type?pd.DataFrame.drop
in a code cell for more information.
We only want the reports that indicate bad smell. You need to select only the reports with rating 3, 4, or 5 in the
smell_value
column.Then, we want to know the severity of bad smell within an hour in the future. For example, at time 8:00, we want to know the sum of smell values between 8:00 and 9:00. So you need to resample the data by computing the hourly sum of smell values from the “future” hour.
Hint: Use the
pandas.to_datetime
function when converting timestamps to datetime objects. Type?pd.to_datetime
in a code cell for more information.Hint: Use the
pandas.DataFrame.resample
function to resample data. Type?pd.DataFrame.resample
in a code cell for more information.
Finally, fill in the missing data with the value 0. The reason is that missing data means there are no smell reports (provided by citizens) within an hour, so we assume that there is no bad smell within this period of time. Notice that this is an assumption and also a limitation since citizens rarely report good smell.
Hint: Use the
pandas.DataFrame.fillna
function when treating missing values. Type?pd.DataFrame.fillna
in a code cell for more information.
def preprocess_smell(df):
"""
Preprocess smell data.
Parameters
----------
df : pandas.DataFrame
The raw smell reports data.
Returns
-------
pandas.DataFrame
The preprocessed smell data.
"""
###################################
# Fill in your answer here
return None
###################################
The code below tests if the output of your function matches the expected output.
check_answer_df(preprocess_smell(smell_raw), df_smell, n=1)
Test case 1 failed.
Your output is:
None
Expected output is:
smell_value
EpochTime
2016-10-31 23:00:00+00:00 3
2016-11-01 00:00:00+00:00 0
2016-11-01 01:00:00+00:00 0
2016-11-01 02:00:00+00:00 0
2016-11-01 03:00:00+00:00 0
... ...
2018-09-30 00:00:00+00:00 3
2018-09-30 01:00:00+00:00 0
2018-09-30 02:00:00+00:00 0
2018-09-30 03:00:00+00:00 0
2018-09-30 04:00:00+00:00 8
[16758 rows x 1 columns]
Now, we can plot the distribution of smell values by using the pandas.DataFrame.plot
function.
fig = df_smell.plot(kind="hist", bins=20, ylim=(0,100), edgecolor="black").set_yticks([0,50,100], labels=["0","50",">100"])

From the plot above, we can observe that a lot of the time, the smell values are fairly low. This means that smell events only happen occasionally, and thus our dataset is highly imbalanced.
We can also plot the average number of smell reports distributed by the day of week (Sunday to Saturday) and the hour of day (0 to 23), using the code below.
def is_datetime_obj_tz_aware(dt):
"""
Find if the datetime object is timezone aware.
Parameters
----------
dt : pandas.DatetimeIndex
A datatime index object.
"""
return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None
def plot_smell_by_day_and_hour(df):
"""
Plot the average number of smell reports by the day of week and the hour of day.
Parameters
----------
df : pandas.DataFrame
The preprocessed smell data.
"""
# Copy the data frame to prevent editing the original one.
df = df.copy(deep=True)
# Convert timestamps to the local time in Pittsburgh.
if is_datetime_obj_tz_aware(df.index):
df.index = df.index.tz_convert(pytz.timezone("US/Eastern"))
else:
df.index = df.index.tz_localize(pytz.utc, ambiguous="infer").tz_convert(pytz.timezone("US/Eastern"))
# Compute the day of week and the hour of day.
df["day_of_week"] = df.index.dayofweek
df["hour_of_day"] = df.index.hour
# Plot the graph.
y_l = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
df_pivot = pd.pivot_table(df, values="smell_value", index=["day_of_week"], columns=["hour_of_day"], aggfunc=np.mean)
f, ax = plt.subplots(figsize=(14, 6))
sns.heatmap(df_pivot, annot=False, cmap="Blues", fmt="g", linewidths=0.1, yticklabels=y_l, ax=ax)
plot_smell_by_day_and_hour(df_smell)

From the plot above, we can observe that citizens tend to report smell in the morning.
Task 6: Prepare Features and Labels#
Now we have the preprocessed data in the df_sensor
and df_smell
variables. Our next task is to prepare features and labels for modeling, as shown in the figure below.
Our goal is to construct two data frames, df_x
and df_y
, that represent the features and labels, respectively. First, we will deal with the sensor data. We need a function to insert columns (that indicate previous n_hr
hours of sensor) to the existing data frame, where n_hr
should be a parameter that we can control. The code below can help us achieve this.
def insert_previous_data_to_cols(df, n_hr=0):
"""
Insert columns to indicate the data from the previous hours.
Parameters
----------
df : pandas.DataFrame
The preprocessed sensor data.
n_hr : int
Number of hours that we want to insert the previous sensor data.
Returns
-------
pandas.DataFrame
The transformed sensor data.
"""
# Copy data frame to prevent editing the original one.
df = df.copy(deep=True)
# Add the data from the previous hours.
df_all = []
for h in range(1, n_hr + 1):
# Shift the data frame to get previous data.
df_pre = df.shift(h)
# Edit the name to indicate it is previous data.
# The orginal data frame already has data from the previous 1 hour.
# (as indicated in the preprocessing phase of sensor data)
# So we need to add 1 here.
df_pre.columns += "_pre_" + str(h+1) + "h"
# Add the data to an array for merging.
df_all.append(df_pre)
# Rename the columns in the original data frame.
# The orginal data frame already has data from the previous 1 hour.
# (as indicated in the preprocessing phase of sensor data)
df.columns += "_pre_1h"
# Merge all data.
df_merge = df
for d in df_all:
# The join function merges dataframes by index.
df_merge = df_merge.join(d)
# Delete the first n_hr rows.
# These n_hr rows have no data due to data shifting.
df_merge = df_merge.iloc[n_hr:]
return df_merge
The code below shows a test case, which is a part of the sensor data.
# Below is an example input.
df_sensor_example_in = df_sensor[["3.feed_1.SONICWS_MPH"]][0:5]
print(df_sensor_example_in)
3.feed_1.SONICWS_MPH
EpochTime
2016-10-31 06:00:00+00:00 3.6
2016-10-31 07:00:00+00:00 3.5
2016-10-31 08:00:00+00:00 3.4
2016-10-31 09:00:00+00:00 2.1
2016-10-31 10:00:00+00:00 2.2
# Below is the expected output of the above example input.
df_sensor_example_out = insert_previous_data_to_cols(df_sensor_example_in, n_hr=2)
print(df_sensor_example_out)
3.feed_1.SONICWS_MPH_pre_1h \
EpochTime
2016-10-31 08:00:00+00:00 3.4
2016-10-31 09:00:00+00:00 2.1
2016-10-31 10:00:00+00:00 2.2
3.feed_1.SONICWS_MPH_pre_2h \
EpochTime
2016-10-31 08:00:00+00:00 3.5
2016-10-31 09:00:00+00:00 3.4
2016-10-31 10:00:00+00:00 2.1
3.feed_1.SONICWS_MPH_pre_3h
EpochTime
2016-10-31 08:00:00+00:00 3.6
2016-10-31 09:00:00+00:00 3.5
2016-10-31 10:00:00+00:00 3.4
The reason that there are 2 less rows in the expected output is because we set n_hr=2
, which means there are missing data in the original first and second row (because there was no previous data for these rows). So in the code, we removed these rows.
Notice that the insert_previous_data_to_cols
function added suffixes to the column names to indicate the number of hours that the sensor measurements came from previously. Pay attention to the meaning of time range here.
For example, in the first row, the
3.feed_1.SONICWS_MPH_pre_1h
column has value3.4
, which means the average reading of wind speed (in unit MPH) between the current time stamp (which is8:00
) and the previous 1 hour (which is7:00
).In the second column of the first row, the
3.feed_1.SONICWS_MPH_pre_2h
has value3.5
, which means the average reading of wind speed between the previous 1 hour (which is7:00
) and 2 hours (which is6:00
).It is important to note here that suffix
pre_2h
does not mean the average rating within 2 hours between the current time stamp and the time that is 2 hours ago.
Then, we also need a function to convert wind direction into sine and cosine components, which is a common technique for encoding cyclical features (i.e., any that that circulates within a set of values, such as hours of the day, days of the week). The formula is below:
There are several reasons to do this instead of using the original wind direction degrees (that range from 0 to 360). First, by applying sine and cosine to the degrees, we can transform the original data to a continuous variable. The original data is not continuous since there are no values below 0 or above 360, and there is no information to tell that 0 degrees and 360 degrees are the same. Second, the decomposed sine and cosine components allow us to inspect the effect of wind on the north-south and east-west directions separately, which may help us explain the importance of wind directions. Below is the code for achieving this.
def convert_wind_direction(df):
"""
Convert wind directions to sine and cosine components.
Parameters
----------
df : pandas.DataFrame
The data frame that contains the wind direction data.
Returns
-------
pandas.DataFrame
The transformed data frame.
"""
# Copy data frame to prevent editing the original one.
df_cp = df.copy(deep=True)
# Convert columns with wind directions.
for c in df.columns:
if "SONICWD_DEG" in c:
df_c = df[c]
df_c_cos = np.cos(np.deg2rad(df_c))
df_c_sin = np.sin(np.deg2rad(df_c))
df_c_cos.name += "_cosine"
df_c_sin.name += "_sine"
df_cp.drop([c], axis=1, inplace=True)
df_cp[df_c_cos.name] = df_c_cos
df_cp[df_c_sin.name] = df_c_sin
return df_cp
The code below shows a test case, which is a part of the sensor data.
# Below is an example input.
df_wind_example_in = df_sensor[["3.feed_1.SONICWD_DEG"]][0:5]
print(df_wind_example_in)
3.feed_1.SONICWD_DEG
EpochTime
2016-10-31 06:00:00+00:00 343.0
2016-10-31 07:00:00+00:00 351.0
2016-10-31 08:00:00+00:00 359.0
2016-10-31 09:00:00+00:00 5.0
2016-10-31 10:00:00+00:00 41.0
# Below is the expected output of the above example input.
df_wind_example_out = convert_wind_direction(df_wind_example_in)
print(df_wind_example_out)
3.feed_1.SONICWD_DEG_cosine \
EpochTime
2016-10-31 06:00:00+00:00 0.956305
2016-10-31 07:00:00+00:00 0.987688
2016-10-31 08:00:00+00:00 0.999848
2016-10-31 09:00:00+00:00 0.996195
2016-10-31 10:00:00+00:00 0.754710
3.feed_1.SONICWD_DEG_sine
EpochTime
2016-10-31 06:00:00+00:00 -0.292372
2016-10-31 07:00:00+00:00 -0.156434
2016-10-31 08:00:00+00:00 -0.017452
2016-10-31 09:00:00+00:00 0.087156
2016-10-31 10:00:00+00:00 0.656059
We have dealt with the sensor data. Next, we will deal with the smell data. We need a function to sum up smell values in the future hours, where n_hr
should be a parameter that we can control. The code below shows a test case, which is a part of the smell data.
# Below is an example input.
df_smell_example_in = df_smell[107:112]
print(df_smell_example_in)
smell_value
EpochTime
2016-11-05 10:00:00+00:00 8
2016-11-05 11:00:00+00:00 13
2016-11-05 12:00:00+00:00 40
2016-11-05 13:00:00+00:00 22
2016-11-05 14:00:00+00:00 4
# Below is the expected output of the above example input.
df_smell_example_out1 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=1)
print(df_smell_example_out1)
smell_value
EpochTime
2016-11-05 10:00:00+00:00 21.0
2016-11-05 11:00:00+00:00 53.0
2016-11-05 12:00:00+00:00 62.0
2016-11-05 13:00:00+00:00 26.0
# Below is another expected output with a different n_hr.
df_smell_example_out2 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=3)
print(df_smell_example_out2)
smell_value
EpochTime
2016-11-05 10:00:00+00:00 83.0
2016-11-05 11:00:00+00:00 79.0
In the output above, notice that row 2016-11-05 10:00:00+00:00
has smell value 83
, and the setting is n_hr=3
, which means the sum of smell values within n_hr+1
hours (i.e., from 10:00
to 14:00
) is 83. Pay attention to this setup since it can be confusing. The reason of n_hr+1
(but not n_hr
) is because the input data already indicates the sum of smell values within the future 1 hour.
# Below is another expected output when n_hr is 0.
df_smell_example_out3 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=0)
print(df_smell_example_out3)
smell_value
EpochTime
2016-11-05 10:00:00+00:00 8
2016-11-05 11:00:00+00:00 13
2016-11-05 12:00:00+00:00 40
2016-11-05 13:00:00+00:00 22
2016-11-05 14:00:00+00:00 4
Assignment for Task 6#
Your task (which is your assignment) is to write a function to do the following:
First, perform a windowing operation to sum up smell values within a specified
n_hr
time window.Hint: Use the
pandas.DataFrame.rolling
function when summing up values within a window. Type?pd.DataFrame.rolling
in a code cell for more information.Hint: Use the
pandas.DataFrame.shift
fuction to shift the rolled data backn_hr
hours because we want to sum up the values in the future (the rolling function operates on the values in the past). Type?pd.DataFrame.shift
in a code cell for more information.
Finally, Remove the last
n_hr
hours of data because they have the wrong data due to shifting. For example, the last row does not have data in the future to operate if we setn_hr=1
.Hint: Use
pandas.DataFrame.iloc
to select the rows that you want to keep. Type?pd.DataFrame.iloc
in a code cell for more information.
You need to handle the edge case when
n_hr=0
, which should output the original data frame.
def sum_current_and_future_data(df, n_hr=0):
"""
Sum up data in the current and future hours.
Parameters
----------
df : pandas.DataFrame
The preprocessed smell data.
n_hr : int
Number of hours that we want to sum up the future smell data.
Returns
-------
pandas.DataFrame
The transformed smell data.
"""
###################################
# Fill in your answer here
return None
###################################
The code below tests if the output of your function matches the expected output.
check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=1), df_smell_example_out1, n=1)
Test case 1 failed.
Your output is:
None
Expected output is:
smell_value
EpochTime
2016-11-05 10:00:00+00:00 21.0
2016-11-05 11:00:00+00:00 53.0
2016-11-05 12:00:00+00:00 62.0
2016-11-05 13:00:00+00:00 26.0
check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=3), df_smell_example_out2, n=2)
Test case 2 failed.
Your output is:
None
Expected output is:
smell_value
EpochTime
2016-11-05 10:00:00+00:00 83.0
2016-11-05 11:00:00+00:00 79.0
check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=0), df_smell_example_out3, n=3)
Test case 3 failed.
Your output is:
None
Expected output is:
smell_value
EpochTime
2016-11-05 10:00:00+00:00 8
2016-11-05 11:00:00+00:00 13
2016-11-05 12:00:00+00:00 40
2016-11-05 13:00:00+00:00 22
2016-11-05 14:00:00+00:00 4
Finally, we need a function to compute the features and labels, based on the above insert_previous_data_to_cols
and sum_current_and_future_data
functions that you implemented. The code is below.
def compute_feature_label(df_smell, df_sensor, b_hr_sensor=0, f_hr_smell=0):
"""
Compute features and labels from the smell and sensor data.
Parameters
----------
df_smell : pandas.DataFrame
The preprocessed smell data.
df_sensor : pandas.DataFrame
The preprocessed sensor data.
b_hr_sensor : int
Number of hours that we want to insert the previous sensor data.
f_hr_smell : int
Number of hours that we want to sum up the future smell data.
Returns
-------
df_x : pandas.DataFrame
The features that we want to use for modeling.
df_y : pandas.DataFrame
The labels that we want to use for modeling.
"""
# Copy data frames to prevent editing the original ones.
df_smell = df_smell.copy(deep=True)
df_sensor = df_sensor.copy(deep=True)
# Replace -1 values in sensor data to NaN
df_sensor[df_sensor==-1] = np.nan
# Convert all wind directions.
df_sensor = convert_wind_direction(df_sensor)
# Scale sensor data and fill in missing values
df_sensor = (df_sensor - df_sensor.mean()) / df_sensor.std()
df_sensor = df_sensor.round(6)
df_sensor = df_sensor.fillna(-1)
# Insert previous sensor data as features.
# Noice that the df_sensor is already using the previous data.
# So b_hr_sensor=0 means using data from the previous 1 hour.
# And b_hr_sensor=n means using data from the previous n+1 hours.
df_sensor = insert_previous_data_to_cols(df_sensor, b_hr_sensor)
# Sum up current and future smell values as label.
# Notice that the df_smell is already the data from the future 1 hour.
# (as indicated in the preprocessing phase of smell data)
# So f_hr_smell=0 means using data from the future 1 hour.
# And f_hr_smell=n means using data from the future n+1 hours.
df_smell = answer_sum_current_and_future_data(df_smell, f_hr_smell)
# Add suffix to the column name of the smell data to prevent confusion.
# See the description above for the reason of adding 1 to the f_hr_smell.
df_smell.columns += "_future_" + str(f_hr_smell+1) + "h"
# We need to first merge these two timestamps based on the available data.
# In this way, we synchronize the time stamps in the sensor and smell data.
# This also means that the sensor and smell data have the same number of data points.
df = pd.merge_ordered(df_sensor.reset_index(), df_smell.reset_index(), on=df_smell.index.name, how="inner", fill_method=None)
# Sanity check: there should be no missing data.
assert df.isna().sum().sum() == 0, "Error! There is missing data."
# Separate features (x) and labels (y).
df_x = df[df_sensor.columns]
df_y = df[df_smell.columns]
# Add the hour of day and the day of week.
dow_radian = df["EpochTime"].dt.dayofweek.copy(deep=True) * 2 * np.pi / 6.0
tod_radian = df["EpochTime"].dt.hour.copy(deep=True) * 2 * np.pi / 23.0
df_x.loc[:,"day_of_week_sine"] = np.sin(dow_radian)
df_x.loc[:,"day_of_week_cosine"] = np.cos(dow_radian)
df_x.loc[:,"hour_of_day_sine"] = np.sin(tod_radian)
df_x.loc[:,"hour_of_day_cosine"] = np.cos(tod_radian)
return df_x, df_y
We will use the sensor data within the previous 3 hours to predict bad smell within the future 8 hours. To use the compute_feature_label
function that we just build, we need to set b_hr_sensor=2
and f_hr_smell=7
because originally df_sensor
already contains data from the previous 1 hour, and df_smell
already contains data from the future 1 hour.
Note that b_hr_sensor=n
means that we want to insert previous n+1
hours of sensor data , and f_hr_smell=m
means that we want to sum up the smell values of the future m+1
hours. For example, suppose that the current time is 8:00, setting b_hr_sensor=2
means that we use all sensor data from 5:00 to 8:00 (as features df_x
in prediction), and setting f_hr_smell=7
means that we sum up the smell values from 8:00 to 16:00 (as labels df_y
in prediction).
df_x, df_y = compute_feature_label(df_smell, df_sensor, b_hr_sensor=2, f_hr_smell=7)
/var/folders/xr/ddxdh8x16q53_r8yf2zj9m600000gn/T/ipykernel_73417/3562418639.py:70: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_x.loc[:,"day_of_week_sine"] = np.sin(dow_radian)
Below is the data frame of features (i.e., the predictor variable).
print(df_x)
3.feed_1.SO2_PPM_pre_1h 3.feed_1.H2S_PPM_pre_1h \
0 -0.273112 -0.403688
1 -0.273112 -0.403688
2 -0.273112 -0.403688
3 -0.273112 -0.403688
4 -0.273112 -0.403688
... ... ...
16746 -0.273112 -0.403688
16747 -0.273112 -0.403688
16748 -0.273112 -0.403688
16749 -0.273112 -0.403688
16750 -0.273112 -0.403688
3.feed_1.SIGTHETA_DEG_pre_1h 3.feed_1.SONICWS_MPH_pre_1h \
0 -1.520058 -0.599075
1 -1.433654 -0.684709
2 1.142731 -0.941610
3 -0.082623 -0.941610
4 1.527618 -0.984426
... ... ...
16746 0.011635 -0.128090
16747 0.443651 0.000361
16748 0.443651 -0.256540
16749 0.270844 -0.085273
16750 -0.341833 0.085995
3.feed_23.CO_PPM_pre_1h 3.feed_23.PM10_UG_M3_pre_1h \
0 -0.388936 -0.777225
1 -0.388936 -0.690974
2 0.147335 -0.173473
3 0.147335 -0.432224
4 0.147335 -0.259723
... ... ...
16746 -0.925207 -0.604724
16747 -0.925207 -0.690974
16748 -0.925207 -0.604724
16749 -0.925207 -0.518474
16750 -0.925207 -0.432224
3.feed_29.PM10_UG_M3_pre_1h 3.feed_29.PM25_UG_M3_pre_1h \
0 -0.406466 -0.395826
1 0.007500 -0.305936
2 -0.147737 -0.216045
3 -0.302974 -0.216045
4 -0.458211 -0.485717
... ... ...
16746 -0.561703 -0.575607
16747 -0.406466 -0.665498
16748 -0.458211 -0.575607
16749 -0.302974 -0.575607
16750 -0.406466 -0.395826
3.feed_11067.CO_PPB..3.feed_43.CO_PPB_pre_1h \
0 -0.716551
1 -0.426597
2 -0.444787
3 -0.796641
4 -0.762976
... ...
16746 0.529598
16747 0.662087
16748 0.181817
16749 0.856204
16750 0.798647
3.feed_11067.NO2_PPB..3.feed_43.NO2_PPB_pre_1h ... \
0 -0.585693 ...
1 0.488014 ...
2 0.829648 ...
3 0.081306 ...
4 -0.504352 ...
... ... ...
16746 -0.748376 ...
16747 -0.292864 ...
16748 -0.862254 ...
16749 -0.439279 ...
16750 -0.309133 ...
3.feed_28.SONICWD_DEG_cosine_pre_3h 3.feed_28.SONICWD_DEG_sine_pre_3h \
0 0.279097 1.746934
1 1.089779 1.481480
2 0.799733 1.640186
3 0.960380 1.562966
4 1.623480 0.780539
... ... ...
16746 1.707841 0.380565
16747 1.693445 0.048275
16748 1.489886 -0.500401
16749 1.402368 -0.626362
16750 0.581575 -1.153330
3.feed_26.SONICWD_DEG_cosine_pre_3h 3.feed_26.SONICWD_DEG_sine_pre_3h \
0 -0.383942 1.929446
1 0.945548 1.350182
2 0.726159 1.603583
3 1.185067 0.816616
4 1.225168 0.602477
... ... ...
16746 1.119099 -0.254164
16747 1.098706 -0.303805
16748 0.609087 -0.931955
16749 0.237194 -1.124325
16750 -0.684058 -1.060369
3.feed_3.SONICWD_DEG_cosine_pre_3h 3.feed_3.SONICWD_DEG_sine_pre_3h \
0 -0.542867 1.331119
1 0.512949 1.355712
2 0.537757 1.347897
3 0.512949 1.355712
4 0.659294 1.303117
... ... ...
16746 1.210281 -0.738344
16747 1.327922 -0.583204
16748 0.798657 -1.062282
16749 0.706601 -1.107672
16750 -0.161757 -1.243870
day_of_week_sine day_of_week_cosine hour_of_day_sine \
0 0.000000 1.0 -2.449294e-16
1 0.866025 0.5 0.000000e+00
2 0.866025 0.5 2.697968e-01
3 0.866025 0.5 5.195840e-01
4 0.866025 0.5 7.308360e-01
... ... ... ...
16746 -0.866025 0.5 -9.976688e-01
16747 -0.866025 0.5 -9.790841e-01
16748 -0.866025 0.5 -8.878852e-01
16749 -0.866025 0.5 -7.308360e-01
16750 -0.866025 0.5 -5.195840e-01
hour_of_day_cosine
0 1.000000
1 1.000000
2 0.962917
3 0.854419
4 0.682553
... ...
16746 -0.068242
16747 0.203456
16748 0.460065
16749 0.682553
16750 0.854419
[16751 rows x 148 columns]
Below is the data frame of labels (i.e., the response variable).
print(df_y)
smell_value_future_8h
0 8.0
1 5.0
2 5.0
3 5.0
4 5.0
... ...
16746 6.0
16747 6.0
16748 6.0
16749 3.0
16750 11.0
[16751 rows x 1 columns]
Task 7: Train and Evaluate Models#
We have processed raw data and prepared the compute_feature_label
function to convert smell and sensor data into features df_x
and labels df_y
. In this task, you will work on training basic and advanced models to predict bad smell events (i.e., the situation that the smell value is high) using the sensor data from air quality monitoring stations.
First, let us threshold the features to make them binary for our classification task. We will use value 40 as the threshold to indicate a smell event. The threshold 40 was used in the Smell Pittsburgh research paper. It is equivalent to the situation that 10 people reported smell with rating 4 within 8 hours.
df_y_40 = (df_y>=40).astype(int)
print(df_y_40)
smell_value_future_8h
0 0
1 0
2 0
3 0
4 0
... ...
16746 0
16747 0
16748 0
16749 0
16750 0
[16751 rows x 1 columns]
The descriptive statistics below tell us that the dataset is imbalanced, which means that the numbers of data points in the positive and negative label groups have a big difference.
print("There are %d rows with smell events." % (df_y_40.sum()))
print("This means %.2f proportion of the data has smell events." % (df_y_40.sum()/len(df_y_40)))
There are 1439 rows with smell events.
This means 0.09 proportion of the data has smell events.
We can also plot the result from the Principal Component Analysis to check if there are patterns in the data. The following code block outputs the result.
def get_pca_result(x, y, n=3):
"""
Get the result of Principal Component Analysis.
Parameters
----------
x : pandas.DataFrame
The features.
y : pandas.DataFrame
The labels.
n : int
Number of principal components.
Returns
-------
df_pc : pandas.DataFrame
A data frame with information about principal components.
df_r : pandas.DataFrame
A data frame with information about ratios of explained variances.
"""
# Copy the data to prevent editing it.
x, y = deepcopy(x), deepcopy(y)
# Run the principal component analysis.
pca = PCA(n_components=n)
# Compute the eigenvectors, which are the principal components.
pc = pca.fit_transform(x)
columns = ["PC" + str(i) for i in range(1,1+n)]
df_pc = pd.DataFrame(data=pc, columns=columns)
# Set the label for plotting.
df_pc["y"] = y.astype(str)
# Set the marker size for plotting.
df_pc["size"] = 15
# Get eigenvalues (i.e., variances explained by principal components).
r = np.round(pca.explained_variance_ratio_, 3)
df_r = pd.DataFrame({"var":r, "pc":columns})
return df_pc, df_r
The code below prints the outputs from the get_pca_result
function.
df_pc_pca, df_r_pca = get_pca_result(df_x, df_y_40, n=3)
print(df_pc_pca)
print(df_r_pca)
PC1 PC2 PC3 y size
0 -2.544053 -7.423711 -4.220014 0 15
1 -1.899612 -7.726739 -1.953111 0 15
2 -0.020275 -7.352264 -0.359143 0 15
3 1.970582 -6.803432 0.281522 0 15
4 3.207700 -6.725773 -0.596668 0 15
... ... ... ... .. ...
16746 -3.848883 -4.378285 3.056364 0 15
16747 -3.791501 -3.742726 3.182619 0 15
16748 -3.576693 -2.224403 2.208638 0 15
16749 -3.338212 -1.966243 1.894475 0 15
16750 -3.708470 -1.175944 1.687017 0 15
[16751 rows x 5 columns]
var pc
0 0.254 PC1
1 0.118 PC2
2 0.084 PC3
Now let us plot the ratios of explained variances (i.e., the eigenvalues). The intuition is that if the explained variance ratio is larger, the corresponding principal component is more important and can represent more information.
# Plot the eigenvalues.
ax1 = sns.barplot(x="pc",y="var", data=df_r_pca, color="c")
ax1 = ax1.set(xlabel="Principal Component",
ylabel="Ratio of Variance Explained",
title="Variance Explained by Principal Components")

We can also plot the first and second principal components to see the distribution of labels. In the figure below, we see that label 1
(which means having a smell event) is concentrated on one side. However, it looks like labels 0
and 1
do not separate well, which means that the classifier will have difficulty distinguishing these two groups.
# Plot two principal components.
ax2 = sns.lmplot(x="PC1", y="PC2", data=df_pc_pca,
fit_reg=False, hue="y", legend=True,
scatter_kws={"s": 10, "alpha" :0.4})
ax2 = ax2.set(title="Distribution of Labels by Principal Components")

In the interactive figure below, we can zoom in, zoom out, and rotate the view to check the distribution of labels. We can confirm that label 1
is concentrated on one side. But we also observe that there is a large part of the data that may not be separated well by the classifier (especially linear classifiers). We probably need non-linear classifiers to be able to tell the differences between these two groups.
# Use the Plotly package to show the PCA results.
fig = px.scatter_3d(df_pc_pca.sample(n=5000), x="PC1", y="PC2", z="PC3",
color="y", symbol="y", opacity=0.6,
size="size", size_max=15,
category_orders=dict(y=["0", "1"]),
height=700)
fig.show()