Tutorial (Structured Data Processing)#

(Last updated: Feb 6, 2024)1

This tutorial will familiarize you with the data science pipeline of processing structured data, using a real-world example of building models to predict and explain the presence of bad smell events in Pittsburgh based on air quality and weather data. The models are used to send push notifications about bad smell events to inform citizens, as well as to explain local pollution patterns to inform stakeholders.

The scenario is in the next section of this tutorial, and more details are in the introduction section of the Smell Pittsburgh paper. We will use the same dataset as used in the Smell Pittsburgh paper as an example of structured data. During this tutorial, we will explain what the variables in the dataset mean and also guide you through model building. Below is the pipeline of this tutorial.


../_images/smellpgh-pipeline.png


You can use the following link to jump to the tasks and assignments:

Scenario#

Local citizens in Pittsburgh are organizing communities to advocate for changes in air pollution regulations. Their goal is to investigate the air pollution patterns in the city to understand the potential sources related to the bad odor. The communities rely on the Smell Pittsburgh application (as indicated in the figure below) to collect smell reports from citizens that live in the Pittsburgh region. Also, there are air quality and weather monitoring stations in the Pittsburgh city region that provide sensor measurements, including common air pollutants and wind information.


../_images/smellpgh-ui.png


You work in a data science team to develop models to map the sensor data to bad smell events. Your team has been working with the Pittsburgh local citizens closely for a long time, and therefore you know the meaning of each variable in the feature set that is used to train the machine learning model. The Pittsburgh community needs your help timely to analyze the data that can help them present evidence of air pollution to the municipality and explain the patterns to the general public.

Import Packages#

Important

You need to install the packages in this link in your Python development environment.

We put all the packages that are needed for this tutorial below:

import pandas as pd
import numpy as np
import pytz
import seaborn as sns
import matplotlib.pyplot as plt
from os.path import isfile, join
from os import listdir
from copy import deepcopy
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.inspection import permutation_importance
from pandas.api.indexers import FixedForwardWindowIndexer
from sklearn.decomposition import PCA
import plotly.express as px

Set the global style of all Pandas table output:

from IPython.core.display import HTML

def set_global_df_style():
    styles = """
    <style>
        .dataframe * {font-size: 1em !important;}
    </style>
    """
    return HTML(styles)

# Call this function in a notebook cell to apply the style globally
set_global_df_style()

Task Answers#

The code block below contains answers for the assignments in this tutorial. Do not check the answers in the next cell before practicing the tasks.

def check_answer_df(df_result, df_answer, n=1):
    """
    This function checks if two output dataframes are the same.

    Parameters
    ----------
    df_result : pandas.DataFrame
        The result from the output of a function.
    df_answer: pandas.DataFrame
        The expected output of the function.
    n : int
        The numbering of the test case.
    """
    try:
        assert df_answer.equals(df_result)
        print("Test case %d passed." % n)
    except:
        print("Test case %d failed." % n)
        print("")
        print("Your output is:")
        print(df_result)
        print("")
        print("Expected output is:")
        print(df_answer)


def answer_preprocess_sensor(df_list):
    """
    This function is the answer of task 5.
    Preprocess sensor data.

    Parameters
    ----------
    df_list : list of pandas.DataFrame
        A list of data frames that contain sensor data from multiple stations.

    Returns
    -------
    pandas.DataFrame
        The preprocessed sensor data.
    """
    # Resample all the data frames.
    df_resample_list = []
    for df in df_list:
        # Convert the timestamp to datetime.
        df.index = pd.to_datetime(df.index, unit="s", utc=True)
        # Resample the timestamps by hour and average all the previous values.
        # Because we want data from the past, so label need to be "right".
        df_resample_list.append(df.resample("60Min", label="right").mean())

    # Merge all data frames.
    df = df_resample_list.pop(0)
    index_name = df.index.name
    while len(df_resample_list) != 0:
        # We need to use outer merging since we want to preserve data from both data frames.
        df = pd.merge_ordered(df, df_resample_list.pop(0), on=df.index.name, how="outer", fill_method=None)
        # Move the datetime column to index
        df = df.set_index(index_name)

    # Fill in the missing data with value -1.
    df = df.fillna(-1)
    return df


def answer_preprocess_smell(df):
    """
    This function is the answer of task 4.
    Preprocess smell data.

    Parameters
    ----------
    df : pandas.DataFrame
        The raw smell reports data.

    Returns
    -------
    pandas.DataFrame
        The preprocessed smell data.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # Drop the columns that we do not need.
    df = df.drop(columns=["feelings_symptoms", "smell_description", "zipcode"])

    # Select only the reports within the range of 3 and 5.
    df = df[(df["smell_value"]>=3)&(df["smell_value"]<=5)]

    # Convert the timestamp to datetime.
    df.index = pd.to_datetime(df.index, unit="s", utc=True)

    # Resample the timestamps by hour and sum up all the future values.
    # Because we want data from the future, so label need to be "left".
    df = df.resample("60Min", label="left").sum()

    # Fill in the missing data with value 0.
    df = df.fillna(0)
    return df


def answer_sum_current_and_future_data(df, n_hr=0):
    """
    This function is the answer of task 6.
    Sum up data in the current and future hours.

    Parameters
    ----------
    df : pandas.DataFrame
        The preprocessed smell data.
    n_hr : int
         Number of hours that we want to sum up the future smell data.

    Returns
    -------
    pandas.DataFrame
        The transformed smell data.
    """
    # Copy data frame to prevent editing the original one.
    df = df.copy(deep=True)

    # Fast return if n_hr is 0
    if n_hr == 0: return df

    # Sum up all smell_values in future hours.
    # The rolling function only works for summing up previous values.
    # So we need to shift back to get the value in the future.
    # Be careful that we need to add 1 to the rolling window size.
    # Becasue window size 1 means only using the current data.
    # Parameter "closed" need to be "right" because we want the current data.
    df = df.rolling(n_hr+1, min_periods=1, closed="right").sum().shift(-1*n_hr)

    # Below is an alternative implementation of rolling.
    #indexer = FixedForwardWindowIndexer(window_size=n_hr+1)
    #df = df.rolling(indexer, min_periods=1).sum()

    # Delete the last n_hr rows.
    # These n_hr rows have wrong data due to data shifting.
    df = df.iloc[:-1*n_hr]
    return df


def answer_experiment(df_x, df_y):
    """
    This function is the answer of task 7.
    Perform experiments and print the results.

    Parameters
    ----------
    df_x : pandas.DataFrame
        The data frame that contains all features.
    df_y : pandas.DataFrame
         The data frame that contains labels.
    """
    wind = "3.feed_28.SONICWD_DEG"
    h2s = "3.feed_28.H2S_PPM"
    dow = "day_of_week"
    hod = "hour_of_day"
    fs1 = [h2s + "_pre_1h", dow + "_sine", dow + "_cosine", hod + "_sine", hod + "_cosine"]
    fs1w = fs1 + [wind + "_sine_pre_1h", wind + "_cosine_pre_1h"]
    fs2 = fs1 + [h2s + "_pre_2h"]
    fs2w = fs2 + [wind + "_sine_pre_2h", wind + "_cosine_pre_2h"]
    feature_sets = [fs1, fs1w, fs2, fs2w]
    models = [DecisionTreeClassifier(), RandomForestClassifier()]
    for m in models:
        for fs in feature_sets:
            print("Use feature set %s" % (str(fs)))
            df_x_fs = df_x[fs]
            train_and_evaluate(m, df_x_fs, df_y, train_size=336, test_size=168)
            compute_feature_importance(m, df_x_fs, df_y, scoring="f1")
            print("")

Task 4: Preprocess Sensor Data#

In this task, we will process the sensor data from various air quality monitoring stations in Pittsburgh. First, we need to load all the sensor data.

path = "smellpgh-v1/esdr_raw"
list_of_files = [f for f in listdir(path) if isfile(join(path, f))]
sensor_raw_list = []
for f in list_of_files:
    sensor_raw_list.append(pd.read_csv(join(path, f)).set_index("EpochTime"))

Now, the sensor_raw_list variable contains all the data frames with sensor values from different air quality monitoring stations. Noted that sensor_raw_list is an array of data frames. We can print one of them to take a look, as shown below.

sensor_raw_list[0]
3.feed_1.SO2_PPM 3.feed_1.H2S_PPM 3.feed_1.SIGTHETA_DEG 3.feed_1.SONICWD_DEG 3.feed_1.SONICWS_MPH
EpochTime
1477891800 0.0 0.0 51.7 343.0 3.6
1477895400 0.0 0.0 52.7 351.0 3.5
1477899000 0.0 0.0 52.6 359.0 3.4
1477902600 0.0 0.0 48.3 5.0 2.1
1477906200 0.0 0.0 31.1 41.0 2.2
... ... ... ... ... ...
1538267400 0.0 0.0 35.2 39.0 1.7
1538271000 0.0 0.0 48.2 53.0 1.3
1538274600 0.0 0.0 30.9 62.0 1.5
1538278200 0.0 0.0 21.5 53.0 2.0
1538281800 0.0 0.0 52.1 36.0 1.7

16729 rows × 5 columns

The EpochTime index is the timestamp in epoch time, which means the number of seconds that have elapsed since January 1st, 1970 (midnight UTC/GMT). Other columns mean the sensor data from an air quality monitoring station.

Next, we need to resample and merge all the sensor data frames so that they can be used for modeling. Our goal is to have a dataframe that looks like the following:

df_sensor = answer_preprocess_sensor(sensor_raw_list)
df_sensor
3.feed_1.SO2_PPM 3.feed_1.H2S_PPM 3.feed_1.SIGTHETA_DEG 3.feed_1.SONICWD_DEG 3.feed_1.SONICWS_MPH 3.feed_23.CO_PPM 3.feed_23.PM10_UG_M3 3.feed_29.PM10_UG_M3 3.feed_29.PM25_UG_M3 3.feed_11067.CO_PPB..3.feed_43.CO_PPB ... 3.feed_3.SO2_PPM 3.feed_3.SONICWD_DEG 3.feed_3.SONICWS_MPH 3.feed_3.SIGTHETA_DEG 3.feed_3.PM10B_UG_M3 3.feed_5975.PM2_5 3.feed_27.NO_PPB 3.feed_27.NOY_PPB 3.feed_27.CO_PPB 3.feed_27.SO2_PPB
EpochTime
2016-10-31 06:00:00+00:00 0.0 0.0 51.7 343.0 3.6 0.2 7.0 8.0 8.0 159.5 ... 0.0 344.0 2.9 43.0 9.0 0.0 0.1 2.6 -1.0 0.2
2016-10-31 07:00:00+00:00 0.0 0.0 52.7 351.0 3.5 0.2 8.0 8.0 8.0 -1.0 ... 0.0 330.0 2.5 43.6 13.0 5.0 -1.0 -1.0 106.1 0.0
2016-10-31 08:00:00+00:00 0.0 0.0 52.6 359.0 3.4 0.2 5.0 7.0 7.0 133.0 ... 0.0 0.0 3.1 40.9 7.0 9.0 0.2 2.1 105.8 -1.0
2016-10-31 09:00:00+00:00 0.0 0.0 48.3 5.0 2.1 0.2 3.0 4.0 4.0 236.6 ... 0.0 325.0 1.9 40.0 11.0 3.0 0.1 3.1 111.7 0.0
2016-10-31 10:00:00+00:00 0.0 0.0 31.1 41.0 2.2 0.2 5.0 5.0 4.0 269.3 ... 0.0 347.0 1.4 45.1 10.0 9.0 0.1 2.5 127.2 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018-09-30 01:00:00+00:00 0.0 0.0 35.2 39.0 1.7 0.3 20.0 15.0 10.0 455.1 ... 0.0 39.0 1.3 57.3 11.0 5.0 0.1 12.1 301.0 0.0
2018-09-30 02:00:00+00:00 0.0 0.0 48.2 53.0 1.3 0.3 25.0 19.0 12.0 761.2 ... 0.0 70.0 1.0 54.4 21.0 7.0 0.2 13.5 357.7 0.0
2018-09-30 03:00:00+00:00 0.0 0.0 30.9 62.0 1.5 0.4 23.0 55.0 33.0 1125.4 ... 0.0 75.0 0.7 59.5 33.0 8.0 0.6 13.8 373.6 0.0
2018-09-30 04:00:00+00:00 0.0 0.0 21.5 53.0 2.0 0.5 23.0 63.0 39.0 1039.5 ... 0.0 74.0 0.7 50.7 32.0 17.0 0.6 11.4 381.0 0.0
2018-09-30 05:00:00+00:00 0.0 0.0 52.1 36.0 1.7 0.4 25.0 47.0 33.0 1636.3 ... 0.0 65.0 0.4 77.1 27.0 18.0 2.0 13.1 377.7 0.0

16776 rows × 43 columns

In the expected output above, the EpochTime index is converted from timestamps into pandas datetime objects, which has the format year-month-day hour:minute:second+timezone. The +00:00 string means the GMT/UTC timezone. Other columns mean the average value of the sensor data in the previous hour. For example, 2016-10-31 06:00:00+00:00 means October 31 in 2016 at 6AM UTC time, and the cell with column 3.feed_1.SO2_PPM means the averaged SO2 (sulfur dioxide) values from 5:00 to 6:00.

The column name suffix SO2_PPM means sulfur dioxide in unit PPM (parts per million). The prefix 3.feed_1. in the column name means a specific sensor (feed ID 1). You can ignore the 3. at the begining of the column name. You can find the list of sensors, their names with feed ID (which will be in the data frame columns), and also the meaning of all the suffixes from this link.

Some column names look like 3.feed_11067.SIGTHETA_DEG..3.feed_43.SIGTHETA_DEG. This means that the column has data from two sensor stations (feed ID 11067 and 43). The reason is that some sensor stations are replaced by the new ones over time. So in this case, we merge sensor readings from both feed ID 11067 and 43.

Assignment for Task 4#

Your task (which is your assignment) is to write a function to do the following:

  • Sensors can report in various frequencies. So, for each data frame, we need to resample the data by computing the hourly average of sensor measurements from the “previous” hour. For example, at time 8:00, we want to know the average of sensor values between 7:00 and 8:00.

    • Hint: Use the pandas.to_datetime function when converting timestamps to datetime objects. Type ?pd.to_datetime in a code cell for more information.

    • Hint: Use the pandas.DataFrame.resample function to resample data. Type ?pd.DataFrame.resample in a code cell for more information.

    • Hint: Use the pandas.merge_ordered function when merging data frames. Type ?pd.merge_ordered in a code cell for more information.

  • Then, merge all the data frames based on their time stamp, which is the EpochTime column.

  • Finally, fill in the missing data with the value -1. The reason for not using 0 is that we want the model to know if sensors give values (including zero) or no data.

    • Hint: Use the pandas.DataFrame.fillna function when treating missing values. Type ?pd.DataFrame.fillna in a code cell for more information.

def preprocess_sensor(df_list):
    """
    Preprocess sensor data.

    Parameters
    ----------
    df_list : list of pandas.DataFrame
        A list of data frames that contain sensor data from multiple stations.

    Returns
    -------
    pandas.DataFrame
        The preprocessed sensor data.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The code below tests if the output of your function matches the expected output.

check_answer_df(preprocess_sensor(sensor_raw_list), df_sensor, n=1)
Test case 1 failed.

Your output is:
None

Expected output is:
                           3.feed_1.SO2_PPM  3.feed_1.H2S_PPM  \
EpochTime                                                       
2016-10-31 06:00:00+00:00               0.0               0.0   
2016-10-31 07:00:00+00:00               0.0               0.0   
2016-10-31 08:00:00+00:00               0.0               0.0   
2016-10-31 09:00:00+00:00               0.0               0.0   
2016-10-31 10:00:00+00:00               0.0               0.0   
...                                     ...               ...   
2018-09-30 01:00:00+00:00               0.0               0.0   
2018-09-30 02:00:00+00:00               0.0               0.0   
2018-09-30 03:00:00+00:00               0.0               0.0   
2018-09-30 04:00:00+00:00               0.0               0.0   
2018-09-30 05:00:00+00:00               0.0               0.0   

                           3.feed_1.SIGTHETA_DEG  3.feed_1.SONICWD_DEG  \
EpochTime                                                                
2016-10-31 06:00:00+00:00                   51.7                 343.0   
2016-10-31 07:00:00+00:00                   52.7                 351.0   
2016-10-31 08:00:00+00:00                   52.6                 359.0   
2016-10-31 09:00:00+00:00                   48.3                   5.0   
2016-10-31 10:00:00+00:00                   31.1                  41.0   
...                                          ...                   ...   
2018-09-30 01:00:00+00:00                   35.2                  39.0   
2018-09-30 02:00:00+00:00                   48.2                  53.0   
2018-09-30 03:00:00+00:00                   30.9                  62.0   
2018-09-30 04:00:00+00:00                   21.5                  53.0   
2018-09-30 05:00:00+00:00                   52.1                  36.0   

                           3.feed_1.SONICWS_MPH  3.feed_23.CO_PPM  \
EpochTime                                                           
2016-10-31 06:00:00+00:00                   3.6               0.2   
2016-10-31 07:00:00+00:00                   3.5               0.2   
2016-10-31 08:00:00+00:00                   3.4               0.2   
2016-10-31 09:00:00+00:00                   2.1               0.2   
2016-10-31 10:00:00+00:00                   2.2               0.2   
...                                         ...               ...   
2018-09-30 01:00:00+00:00                   1.7               0.3   
2018-09-30 02:00:00+00:00                   1.3               0.3   
2018-09-30 03:00:00+00:00                   1.5               0.4   
2018-09-30 04:00:00+00:00                   2.0               0.5   
2018-09-30 05:00:00+00:00                   1.7               0.4   

                           3.feed_23.PM10_UG_M3  3.feed_29.PM10_UG_M3  \
EpochTime                                                               
2016-10-31 06:00:00+00:00                   7.0                   8.0   
2016-10-31 07:00:00+00:00                   8.0                   8.0   
2016-10-31 08:00:00+00:00                   5.0                   7.0   
2016-10-31 09:00:00+00:00                   3.0                   4.0   
2016-10-31 10:00:00+00:00                   5.0                   5.0   
...                                         ...                   ...   
2018-09-30 01:00:00+00:00                  20.0                  15.0   
2018-09-30 02:00:00+00:00                  25.0                  19.0   
2018-09-30 03:00:00+00:00                  23.0                  55.0   
2018-09-30 04:00:00+00:00                  23.0                  63.0   
2018-09-30 05:00:00+00:00                  25.0                  47.0   

                           3.feed_29.PM25_UG_M3  \
EpochTime                                         
2016-10-31 06:00:00+00:00                   8.0   
2016-10-31 07:00:00+00:00                   8.0   
2016-10-31 08:00:00+00:00                   7.0   
2016-10-31 09:00:00+00:00                   4.0   
2016-10-31 10:00:00+00:00                   4.0   
...                                         ...   
2018-09-30 01:00:00+00:00                  10.0   
2018-09-30 02:00:00+00:00                  12.0   
2018-09-30 03:00:00+00:00                  33.0   
2018-09-30 04:00:00+00:00                  39.0   
2018-09-30 05:00:00+00:00                  33.0   

                           3.feed_11067.CO_PPB..3.feed_43.CO_PPB  ...  \
EpochTime                                                         ...   
2016-10-31 06:00:00+00:00                                  159.5  ...   
2016-10-31 07:00:00+00:00                                   -1.0  ...   
2016-10-31 08:00:00+00:00                                  133.0  ...   
2016-10-31 09:00:00+00:00                                  236.6  ...   
2016-10-31 10:00:00+00:00                                  269.3  ...   
...                                                          ...  ...   
2018-09-30 01:00:00+00:00                                  455.1  ...   
2018-09-30 02:00:00+00:00                                  761.2  ...   
2018-09-30 03:00:00+00:00                                 1125.4  ...   
2018-09-30 04:00:00+00:00                                 1039.5  ...   
2018-09-30 05:00:00+00:00                                 1636.3  ...   

                           3.feed_3.SO2_PPM  3.feed_3.SONICWD_DEG  \
EpochTime                                                           
2016-10-31 06:00:00+00:00               0.0                 344.0   
2016-10-31 07:00:00+00:00               0.0                 330.0   
2016-10-31 08:00:00+00:00               0.0                   0.0   
2016-10-31 09:00:00+00:00               0.0                 325.0   
2016-10-31 10:00:00+00:00               0.0                 347.0   
...                                     ...                   ...   
2018-09-30 01:00:00+00:00               0.0                  39.0   
2018-09-30 02:00:00+00:00               0.0                  70.0   
2018-09-30 03:00:00+00:00               0.0                  75.0   
2018-09-30 04:00:00+00:00               0.0                  74.0   
2018-09-30 05:00:00+00:00               0.0                  65.0   

                           3.feed_3.SONICWS_MPH  3.feed_3.SIGTHETA_DEG  \
EpochTime                                                                
2016-10-31 06:00:00+00:00                   2.9                   43.0   
2016-10-31 07:00:00+00:00                   2.5                   43.6   
2016-10-31 08:00:00+00:00                   3.1                   40.9   
2016-10-31 09:00:00+00:00                   1.9                   40.0   
2016-10-31 10:00:00+00:00                   1.4                   45.1   
...                                         ...                    ...   
2018-09-30 01:00:00+00:00                   1.3                   57.3   
2018-09-30 02:00:00+00:00                   1.0                   54.4   
2018-09-30 03:00:00+00:00                   0.7                   59.5   
2018-09-30 04:00:00+00:00                   0.7                   50.7   
2018-09-30 05:00:00+00:00                   0.4                   77.1   

                           3.feed_3.PM10B_UG_M3  3.feed_5975.PM2_5  \
EpochTime                                                            
2016-10-31 06:00:00+00:00                   9.0                0.0   
2016-10-31 07:00:00+00:00                  13.0                5.0   
2016-10-31 08:00:00+00:00                   7.0                9.0   
2016-10-31 09:00:00+00:00                  11.0                3.0   
2016-10-31 10:00:00+00:00                  10.0                9.0   
...                                         ...                ...   
2018-09-30 01:00:00+00:00                  11.0                5.0   
2018-09-30 02:00:00+00:00                  21.0                7.0   
2018-09-30 03:00:00+00:00                  33.0                8.0   
2018-09-30 04:00:00+00:00                  32.0               17.0   
2018-09-30 05:00:00+00:00                  27.0               18.0   

                           3.feed_27.NO_PPB  3.feed_27.NOY_PPB  \
EpochTime                                                        
2016-10-31 06:00:00+00:00               0.1                2.6   
2016-10-31 07:00:00+00:00              -1.0               -1.0   
2016-10-31 08:00:00+00:00               0.2                2.1   
2016-10-31 09:00:00+00:00               0.1                3.1   
2016-10-31 10:00:00+00:00               0.1                2.5   
...                                     ...                ...   
2018-09-30 01:00:00+00:00               0.1               12.1   
2018-09-30 02:00:00+00:00               0.2               13.5   
2018-09-30 03:00:00+00:00               0.6               13.8   
2018-09-30 04:00:00+00:00               0.6               11.4   
2018-09-30 05:00:00+00:00               2.0               13.1   

                           3.feed_27.CO_PPB  3.feed_27.SO2_PPB  
EpochTime                                                       
2016-10-31 06:00:00+00:00              -1.0                0.2  
2016-10-31 07:00:00+00:00             106.1                0.0  
2016-10-31 08:00:00+00:00             105.8               -1.0  
2016-10-31 09:00:00+00:00             111.7                0.0  
2016-10-31 10:00:00+00:00             127.2                0.0  
...                                     ...                ...  
2018-09-30 01:00:00+00:00             301.0                0.0  
2018-09-30 02:00:00+00:00             357.7                0.0  
2018-09-30 03:00:00+00:00             373.6                0.0  
2018-09-30 04:00:00+00:00             381.0                0.0  
2018-09-30 05:00:00+00:00             377.7                0.0  

[16776 rows x 43 columns]

Task 5: Preprocess Smell Data#

In this task, we will preprocess the smell data. First, we need to load the raw smell data.

smell_raw = pd.read_csv("smellpgh-v1/smell_raw.csv").set_index("EpochTime")
smell_raw
feelings_symptoms smell_description smell_value zipcode
EpochTime
1477935134 NaN NaN 1 15221
1477956180 NaN Woodsmoke 2 15218
1477956293 NaN Wood smoke 3 15218
1477973293 Eye irritation, nose burns, headache, woke me up Industrial 5 15207
1478001989 NaN Industrial smoke 2 15213
... ... ... ... ...
1538248172 NaN Sour sewage 3 15213
1538255258 Coughing Smoke 2 15104
1538268796 NaN Like a burning candle 3 15232
1538281653 No Greasy air 3 15222
1538282980 Eye irritation Wood smoke and gun powder very Smokey too as f... 5 15217

10353 rows × 4 columns

The meaning of EpochTime is explained in the previous task. Other columns mean the self-reported symptoms, descriptions of smell, severity ratings (the smell_value column), and the zipcode where the report is submitted in Pittsburgh, Pennsylvania. For example, the second row means that the smell report was submitted from the 15218 zipcode with wood smoke description and severity rating 2. For more description about the smell, please check the Smell Pittsburgh website.

Next, we need to resample the smell data so that they can be used for modeling. Our goal is to have a dataframe that looks like the following:

df_smell = answer_preprocess_smell(smell_raw)
df_smell
smell_value
EpochTime
2016-10-31 23:00:00+00:00 3
2016-11-01 00:00:00+00:00 0
2016-11-01 01:00:00+00:00 0
2016-11-01 02:00:00+00:00 0
2016-11-01 03:00:00+00:00 0
... ...
2018-09-30 00:00:00+00:00 3
2018-09-30 01:00:00+00:00 0
2018-09-30 02:00:00+00:00 0
2018-09-30 03:00:00+00:00 0
2018-09-30 04:00:00+00:00 8

16758 rows × 1 columns

In the latest row, the timestamp is 2018-09-30 04:00:00+00:00, which means this row contains the data from 3:00 to 4:00 on September 30 in 2018. This row has smell_value 8, which means the sum of smell report ratings in the above mentioned time range. Notice that the expected output ignores all smell ratings from 1 to 2. This is becasue we only want the ratings that indicate bad smell, which will be further explained below.

Assignment for Task 5#

Your task (which is your assignment) is to write a function to do the following:

  • First, remove the feelings_symptoms, smell_description, and zipcode columns since we do not need them.

    • Hint: Use the pandas.DataFrame.drop function. Type ?pd.DataFrame.drop in a code cell for more information.

  • We only want the reports that indicate bad smell. You need to select only the reports with rating 3, 4, or 5 in the smell_value column.

  • Then, we want to know the severity of bad smell within an hour in the future. For example, at time 8:00, we want to know the sum of smell values between 8:00 and 9:00. So you need to resample the data by computing the hourly sum of smell values from the “future” hour.

    • Hint: Use the pandas.to_datetime function when converting timestamps to datetime objects. Type ?pd.to_datetime in a code cell for more information.

    • Hint: Use the pandas.DataFrame.resample function to resample data. Type ?pd.DataFrame.resample in a code cell for more information.

  • Finally, fill in the missing data with the value 0. The reason is that missing data means there are no smell reports (provided by citizens) within an hour, so we assume that there is no bad smell within this period of time. Notice that this is an assumption and also a limitation since citizens rarely report good smell.

    • Hint: Use the pandas.DataFrame.fillna function when treating missing values. Type ?pd.DataFrame.fillna in a code cell for more information.

def preprocess_smell(df):
    """
    Preprocess smell data.

    Parameters
    ----------
    df : pandas.DataFrame
        The raw smell reports data.

    Returns
    -------
    pandas.DataFrame
        The preprocessed smell data.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The code below tests if the output of your function matches the expected output.

check_answer_df(preprocess_smell(smell_raw), df_smell, n=1)
Test case 1 failed.

Your output is:
None

Expected output is:
                           smell_value
EpochTime                             
2016-10-31 23:00:00+00:00            3
2016-11-01 00:00:00+00:00            0
2016-11-01 01:00:00+00:00            0
2016-11-01 02:00:00+00:00            0
2016-11-01 03:00:00+00:00            0
...                                ...
2018-09-30 00:00:00+00:00            3
2018-09-30 01:00:00+00:00            0
2018-09-30 02:00:00+00:00            0
2018-09-30 03:00:00+00:00            0
2018-09-30 04:00:00+00:00            8

[16758 rows x 1 columns]

Now, we can plot the distribution of smell values by using the pandas.DataFrame.plot function.

fig = df_smell.plot(kind="hist", bins=20, ylim=(0,100), edgecolor="black").set_yticks([0,50,100], labels=["0","50",">100"])
../_images/tutorial-structured-data_48_0.png

From the plot above, we can observe that a lot of the time, the smell values are fairly low. This means that smell events only happen occasionally, and thus our dataset is highly imbalanced.

We can also plot the average number of smell reports distributed by the day of week (Sunday to Saturday) and the hour of day (0 to 23), using the code below.

def is_datetime_obj_tz_aware(dt):
    """
    Find if the datetime object is timezone aware.

    Parameters
    ----------
    dt : pandas.DatetimeIndex
        A datatime index object.
    """
    return dt.tz is not None


def plot_smell_by_day_and_hour(df):
    """
    Plot the average number of smell reports by the day of week and the hour of day.

    Parameters
    ----------
    df : pandas.DataFrame
        The preprocessed smell data.
    """
    # Copy the data frame to prevent editing the original one.
    df = df.copy(deep=True)

    # Convert timestamps to the local time in Pittsburgh.
    if is_datetime_obj_tz_aware(df.index):
        df.index = df.index.tz_convert(pytz.timezone("US/Eastern"))
    else:
        df.index = df.index.tz_localize(pytz.utc, ambiguous="infer").tz_convert(pytz.timezone("US/Eastern"))

    # Compute the day of week and the hour of day.
    df["day_of_week"] = df.index.dayofweek
    df["hour_of_day"] = df.index.hour

    # Plot the graph.
    y_l = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
    df_pivot = pd.pivot_table(df, values="smell_value", index=["day_of_week"], columns=["hour_of_day"], aggfunc="mean")
    f, ax = plt.subplots(figsize=(14, 6))
    sns.heatmap(df_pivot, annot=False, cmap="Blues", fmt="g", linewidths=0.1, yticklabels=y_l, ax=ax)


plot_smell_by_day_and_hour(df_smell)
../_images/tutorial-structured-data_51_0.png

From the plot above, we can observe that citizens tend to report smell in the morning.

Task 6: Prepare Features and Labels#

Now we have the preprocessed data in the df_sensor and df_smell variables. Our next task is to prepare features and labels for modeling, as shown in the figure below.


../_images/smellpgh-predict.png


Our goal is to construct two data frames, df_x and df_y, that represent the features and labels, respectively. First, we will deal with the sensor data. We need a function to insert columns (that indicate previous n_hr hours of sensor) to the existing data frame, where n_hr should be a parameter that we can control. The code below can help us achieve this.

def insert_previous_data_to_cols(df, n_hr=0):
    """
    Insert columns to indicate the data from the previous hours.

    Parameters
    ----------
    df : pandas.DataFrame
        The preprocessed sensor data.
    n_hr : int
        Number of hours that we want to insert the previous sensor data.

    Returns
    -------
    pandas.DataFrame
        The transformed sensor data.
    """
    # Copy data frame to prevent editing the original one.
    df = df.copy(deep=True)

    # Add the data from the previous hours.
    df_all = []
    for h in range(1, n_hr + 1):
        # Shift the data frame to get previous data.
        df_pre = df.shift(h)
        # Edit the name to indicate it is previous data.
        # The orginal data frame already has data from the previous 1 hour.
        # (as indicated in the preprocessing phase of sensor data)
        # So we need to add 1 here.
        df_pre.columns += "_pre_" + str(h+1) + "h"
        # Add the data to an array for merging.
        df_all.append(df_pre)

    # Rename the columns in the original data frame.
    # The orginal data frame already has data from the previous 1 hour.
    # (as indicated in the preprocessing phase of sensor data)
    df.columns += "_pre_1h"

    # Merge all data.
    df_merge = df
    for d in df_all:
        # The join function merges dataframes by index.
        df_merge = df_merge.join(d)

    # Delete the first n_hr rows.
    # These n_hr rows have no data due to data shifting.
    df_merge = df_merge.iloc[n_hr:]
    return df_merge

The code below shows a test case, which is a part of the sensor data.

# Below is an example input.
df_sensor_example_in = df_sensor[["3.feed_1.SONICWS_MPH"]][0:5]
df_sensor_example_in
3.feed_1.SONICWS_MPH
EpochTime
2016-10-31 06:00:00+00:00 3.6
2016-10-31 07:00:00+00:00 3.5
2016-10-31 08:00:00+00:00 3.4
2016-10-31 09:00:00+00:00 2.1
2016-10-31 10:00:00+00:00 2.2
# Below is the expected output of the above example input.
df_sensor_example_out = insert_previous_data_to_cols(df_sensor_example_in, n_hr=2)
df_sensor_example_out
3.feed_1.SONICWS_MPH_pre_1h 3.feed_1.SONICWS_MPH_pre_2h 3.feed_1.SONICWS_MPH_pre_3h
EpochTime
2016-10-31 08:00:00+00:00 3.4 3.5 3.6
2016-10-31 09:00:00+00:00 2.1 3.4 3.5
2016-10-31 10:00:00+00:00 2.2 2.1 3.4

The reason that there are 2 less rows in the expected output is because we set n_hr=2, which means there are missing data in the original first and second row (because there was no previous data for these rows). So in the code, we removed these rows.

Notice that the insert_previous_data_to_cols function added suffixes to the column names to indicate the number of hours that the sensor measurements came from previously. Pay attention to the meaning of time range here.

  • For example, in the first row, the 3.feed_1.SONICWS_MPH_pre_1h column has value 3.4, which means the average reading of wind speed (in unit MPH) between the current time stamp (which is 8:00) and the previous 1 hour (which is 7:00).

  • In the second column of the first row, the 3.feed_1.SONICWS_MPH_pre_2h has value 3.5, which means the average reading of wind speed between the previous 1 hour (which is 7:00) and 2 hours (which is 6:00).

  • It is important to note here that suffix pre_2h does not mean the average rating within 2 hours between the current time stamp and the time that is 2 hours ago.

Then, we also need a function to convert wind direction into sine and cosine components, which is a common technique for encoding cyclical features (i.e., any that that circulates within a set of values, such as hours of the day, days of the week). The formula is below:

\[\begin{split} x_{sin} = \sin{\Big(\frac{2 \cdot \pi \cdot x}{\max{(x)}}\Big)} \\ x_{cos} = \cos{\Big(\frac{2 \cdot \pi \cdot x}{\max{(x)}}\Big)} \end{split}\]

There are several reasons to do this instead of using the original wind direction degrees (that range from 0 to 360). First, by applying sine and cosine to the degrees, we can transform the original data to a continuous variable. The original data is not continuous since there are no values below 0 or above 360, and there is no information to tell that 0 degrees and 360 degrees are the same. Second, the decomposed sine and cosine components allow us to inspect the effect of wind on the north-south and east-west directions separately, which may help us explain the importance of wind directions. Below is the code for achieving this.

def convert_wind_direction(df):
    """
    Convert wind directions to sine and cosine components.

    Parameters
    ----------
    df : pandas.DataFrame
        The data frame that contains the wind direction data.

    Returns
    -------
    pandas.DataFrame
        The transformed data frame.
    """
    # Copy data frame to prevent editing the original one.
    df_cp = df.copy(deep=True)

    # Convert columns with wind directions.
    for c in df.columns:
        if "SONICWD_DEG" in c:
            df_c = df[c]
            df_c_cos = np.cos(np.deg2rad(df_c))
            df_c_sin = np.sin(np.deg2rad(df_c))
            df_c_cos.name += "_cosine"
            df_c_sin.name += "_sine"
            df_cp.drop([c], axis=1, inplace=True)
            df_cp[df_c_cos.name] = df_c_cos
            df_cp[df_c_sin.name] = df_c_sin
    return df_cp

The code below shows a test case, which is a part of the sensor data.

# Below is an example input.
df_wind_example_in = df_sensor[["3.feed_1.SONICWD_DEG"]][0:5]
df_wind_example_in
3.feed_1.SONICWD_DEG
EpochTime
2016-10-31 06:00:00+00:00 343.0
2016-10-31 07:00:00+00:00 351.0
2016-10-31 08:00:00+00:00 359.0
2016-10-31 09:00:00+00:00 5.0
2016-10-31 10:00:00+00:00 41.0
# Below is the expected output of the above example input.
df_wind_example_out = convert_wind_direction(df_wind_example_in)
df_wind_example_out
3.feed_1.SONICWD_DEG_cosine 3.feed_1.SONICWD_DEG_sine
EpochTime
2016-10-31 06:00:00+00:00 0.956305 -0.292372
2016-10-31 07:00:00+00:00 0.987688 -0.156434
2016-10-31 08:00:00+00:00 0.999848 -0.017452
2016-10-31 09:00:00+00:00 0.996195 0.087156
2016-10-31 10:00:00+00:00 0.754710 0.656059

We have dealt with the sensor data. Next, we will deal with the smell data. We need a function to sum up smell values in the future hours, where n_hr should be a parameter that we can control. The code below shows a test case, which is a part of the smell data.

# Below is an example input.
df_smell_example_in = df_smell[107:112]
df_smell_example_in
smell_value
EpochTime
2016-11-05 10:00:00+00:00 8
2016-11-05 11:00:00+00:00 13
2016-11-05 12:00:00+00:00 40
2016-11-05 13:00:00+00:00 22
2016-11-05 14:00:00+00:00 4
# Below is the expected output of the above example input.
df_smell_example_out1 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=1)
df_smell_example_out1
smell_value
EpochTime
2016-11-05 10:00:00+00:00 21.0
2016-11-05 11:00:00+00:00 53.0
2016-11-05 12:00:00+00:00 62.0
2016-11-05 13:00:00+00:00 26.0
# Below is another expected output with a different n_hr.
df_smell_example_out2 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=3)
df_smell_example_out2
smell_value
EpochTime
2016-11-05 10:00:00+00:00 83.0
2016-11-05 11:00:00+00:00 79.0

In the output above, notice that row 2016-11-05 10:00:00+00:00 has smell value 83, and the setting is n_hr=3, which means the sum of smell values within n_hr+1 hours (i.e., from 10:00 to 14:00) is 83. Pay attention to this setup since it can be confusing. The reason of n_hr+1 (but not n_hr) is because the input data already indicates the sum of smell values within the future 1 hour.

# Below is another expected output when n_hr is 0.
df_smell_example_out3 = answer_sum_current_and_future_data(df_smell_example_in, n_hr=0)
df_smell_example_out3
smell_value
EpochTime
2016-11-05 10:00:00+00:00 8
2016-11-05 11:00:00+00:00 13
2016-11-05 12:00:00+00:00 40
2016-11-05 13:00:00+00:00 22
2016-11-05 14:00:00+00:00 4

Assignment for Task 6#

Your task (which is your assignment) is to write a function to do the following:

  • First, perform a windowing operation to sum up smell values within a specified n_hr time window.

    • Hint: Use the pandas.DataFrame.rolling function when summing up values within a window. Type ?pd.DataFrame.rolling in a code cell for more information.

    • Hint: Use the pandas.DataFrame.shift fuction to shift the rolled data back n_hr hours because we want to sum up the values in the future (the rolling function operates on the values in the past). Type ?pd.DataFrame.shift in a code cell for more information.

  • Finally, Remove the last n_hr hours of data because they have the wrong data due to shifting. For example, the last row does not have data in the future to operate if we set n_hr=1.

    • Hint: Use pandas.DataFrame.iloc to select the rows that you want to keep. Type ?pd.DataFrame.iloc in a code cell for more information.

  • You need to handle the edge case when n_hr=0, which should output the original data frame.

def sum_current_and_future_data(df, n_hr=0):
    """
    Sum up data in the current and future hours.

    Parameters
    ----------
    df : pandas.DataFrame
        The preprocessed smell data.
    n_hr : int
         Number of hours that we want to sum up the future smell data.

    Returns
    -------
    pandas.DataFrame
        The transformed smell data.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The code below tests if the output of your function matches the expected output.

check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=1), df_smell_example_out1, n=1)
Test case 1 failed.

Your output is:
None

Expected output is:
                           smell_value
EpochTime                             
2016-11-05 10:00:00+00:00         21.0
2016-11-05 11:00:00+00:00         53.0
2016-11-05 12:00:00+00:00         62.0
2016-11-05 13:00:00+00:00         26.0
check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=3), df_smell_example_out2, n=2)
Test case 2 failed.

Your output is:
None

Expected output is:
                           smell_value
EpochTime                             
2016-11-05 10:00:00+00:00         83.0
2016-11-05 11:00:00+00:00         79.0
check_answer_df(sum_current_and_future_data(df_smell_example_in, n_hr=0), df_smell_example_out3, n=3)
Test case 3 failed.

Your output is:
None

Expected output is:
                           smell_value
EpochTime                             
2016-11-05 10:00:00+00:00            8
2016-11-05 11:00:00+00:00           13
2016-11-05 12:00:00+00:00           40
2016-11-05 13:00:00+00:00           22
2016-11-05 14:00:00+00:00            4

Finally, we need a function to compute the features and labels, based on the above insert_previous_data_to_cols and sum_current_and_future_data functions that you implemented. The code is below.

def compute_feature_label(df_smell, df_sensor, b_hr_sensor=0, f_hr_smell=0):
    """
    Compute features and labels from the smell and sensor data.

    Parameters
    ----------
    df_smell : pandas.DataFrame
        The preprocessed smell data.
    df_sensor : pandas.DataFrame
        The preprocessed sensor data.
    b_hr_sensor : int
        Number of hours that we want to insert the previous sensor data.
    f_hr_smell : int
        Number of hours that we want to sum up the future smell data.

    Returns
    -------
    df_x : pandas.DataFrame
        The features that we want to use for modeling.
    df_y : pandas.DataFrame
        The labels that we want to use for modeling.
    """
    # Copy data frames to prevent editing the original ones.
    df_smell = df_smell.copy(deep=True)
    df_sensor = df_sensor.copy(deep=True)

    # Replace -1 values in sensor data to NaN
    df_sensor[df_sensor==-1] = np.nan

    # Convert all wind directions.
    df_sensor = convert_wind_direction(df_sensor)

    # Scale sensor data and fill in missing values
    df_sensor = (df_sensor - df_sensor.mean()) / df_sensor.std()
    df_sensor = df_sensor.round(6)
    df_sensor = df_sensor.fillna(-1)

    # Insert previous sensor data as features.
    # Noice that the df_sensor is already using the previous data.
    # So b_hr_sensor=0 means using data from the previous 1 hour.
    # And b_hr_sensor=n means using data from the previous n+1 hours.
    df_sensor = insert_previous_data_to_cols(df_sensor, b_hr_sensor)

    # Sum up current and future smell values as label.
    # Notice that the df_smell is already the data from the future 1 hour.
    # (as indicated in the preprocessing phase of smell data)
    # So f_hr_smell=0 means using data from the future 1 hour.
    # And f_hr_smell=n means using data from the future n+1 hours.
    df_smell = answer_sum_current_and_future_data(df_smell, f_hr_smell)

    # Add suffix to the column name of the smell data to prevent confusion.
    # See the description above for the reason of adding 1 to the f_hr_smell.
    df_smell.columns += "_future_" + str(f_hr_smell+1) + "h"

    # We need to first merge these two timestamps based on the available data.
    # In this way, we synchronize the time stamps in the sensor and smell data.
    # This also means that the sensor and smell data have the same number of data points.
    df = pd.merge_ordered(df_sensor.reset_index(), df_smell.reset_index(), on=df_smell.index.name, how="inner", fill_method=None)

    # Sanity check: there should be no missing data.
    assert df.isna().sum().sum() == 0, "Error! There is missing data."

    # Separate features (x) and labels (y).
    df_x = df[df_sensor.columns].copy()
    df_y = df[df_smell.columns].copy()

    # Add the hour of day and the day of week.
    dow_radian = df["EpochTime"].dt.dayofweek.copy(deep=True) * 2 * np.pi / 6.0
    tod_radian = df["EpochTime"].dt.hour.copy(deep=True) * 2 * np.pi / 23.0
    df_x.loc[:,"day_of_week_sine"] = np.sin(dow_radian)
    df_x.loc[:,"day_of_week_cosine"] = np.cos(dow_radian)
    df_x.loc[:,"hour_of_day_sine"] = np.sin(tod_radian)
    df_x.loc[:,"hour_of_day_cosine"] = np.cos(tod_radian)
    return df_x, df_y

We will use the sensor data within the previous 3 hours to predict bad smell within the future 8 hours. To use the compute_feature_label function that we just build, we need to set b_hr_sensor=2 and f_hr_smell=7 because originally df_sensor already contains data from the previous 1 hour, and df_smell already contains data from the future 1 hour.

Note that b_hr_sensor=n means that we want to insert previous n+1 hours of sensor data , and f_hr_smell=m means that we want to sum up the smell values of the future m+1 hours. For example, suppose that the current time is 8:00, setting b_hr_sensor=2 means that we use all sensor data from 5:00 to 8:00 (as features df_x in prediction), and setting f_hr_smell=7 means that we sum up the smell values from 8:00 to 16:00 (as labels df_y in prediction).

df_x, df_y = compute_feature_label(df_smell, df_sensor, b_hr_sensor=2, f_hr_smell=7)

Below is the data frame of features (i.e., the predictor variable).

df_x
3.feed_1.SO2_PPM_pre_1h 3.feed_1.H2S_PPM_pre_1h 3.feed_1.SIGTHETA_DEG_pre_1h 3.feed_1.SONICWS_MPH_pre_1h 3.feed_23.CO_PPM_pre_1h 3.feed_23.PM10_UG_M3_pre_1h 3.feed_29.PM10_UG_M3_pre_1h 3.feed_29.PM25_UG_M3_pre_1h 3.feed_11067.CO_PPB..3.feed_43.CO_PPB_pre_1h 3.feed_11067.NO2_PPB..3.feed_43.NO2_PPB_pre_1h ... 3.feed_28.SONICWD_DEG_cosine_pre_3h 3.feed_28.SONICWD_DEG_sine_pre_3h 3.feed_26.SONICWD_DEG_cosine_pre_3h 3.feed_26.SONICWD_DEG_sine_pre_3h 3.feed_3.SONICWD_DEG_cosine_pre_3h 3.feed_3.SONICWD_DEG_sine_pre_3h day_of_week_sine day_of_week_cosine hour_of_day_sine hour_of_day_cosine
0 -0.273112 -0.403688 -1.520058 -0.599075 -0.388936 -0.777225 -0.406466 -0.395826 -0.716551 -0.585693 ... 0.279097 1.746934 -0.383942 1.929446 -0.542867 1.331119 0.000000 1.0 -2.449294e-16 1.000000
1 -0.273112 -0.403688 -1.433654 -0.684709 -0.388936 -0.690974 0.007500 -0.305936 -0.426597 0.488014 ... 1.089779 1.481480 0.945548 1.350182 0.512949 1.355712 0.866025 0.5 0.000000e+00 1.000000
2 -0.273112 -0.403688 1.142731 -0.941610 0.147335 -0.173473 -0.147737 -0.216045 -0.444787 0.829648 ... 0.799733 1.640186 0.726159 1.603583 0.537757 1.347897 0.866025 0.5 2.697968e-01 0.962917
3 -0.273112 -0.403688 -0.082623 -0.941610 0.147335 -0.432224 -0.302974 -0.216045 -0.796641 0.081306 ... 0.960380 1.562966 1.185067 0.816616 0.512949 1.355712 0.866025 0.5 5.195840e-01 0.854419
4 -0.273112 -0.403688 1.527618 -0.984426 0.147335 -0.259723 -0.458211 -0.485717 -0.762976 -0.504352 ... 1.623480 0.780539 1.225168 0.602477 0.659294 1.303117 0.866025 0.5 7.308360e-01 0.682553
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16746 -0.273112 -0.403688 0.011635 -0.128090 -0.925207 -0.604724 -0.561703 -0.575607 0.529598 -0.748376 ... 1.707841 0.380565 1.119099 -0.254164 1.210281 -0.738344 -0.866025 0.5 -9.976688e-01 -0.068242
16747 -0.273112 -0.403688 0.443651 0.000361 -0.925207 -0.690974 -0.406466 -0.665498 0.662087 -0.292864 ... 1.693445 0.048275 1.098706 -0.303805 1.327922 -0.583204 -0.866025 0.5 -9.790841e-01 0.203456
16748 -0.273112 -0.403688 0.443651 -0.256540 -0.925207 -0.604724 -0.458211 -0.575607 0.181817 -0.862254 ... 1.489886 -0.500401 0.609087 -0.931955 0.798657 -1.062282 -0.866025 0.5 -8.878852e-01 0.460065
16749 -0.273112 -0.403688 0.270844 -0.085273 -0.925207 -0.518474 -0.302974 -0.575607 0.856204 -0.439279 ... 1.402368 -0.626362 0.237194 -1.124325 0.706601 -1.107672 -0.866025 0.5 -7.308360e-01 0.682553
16750 -0.273112 -0.403688 -0.341833 0.085995 -0.925207 -0.432224 -0.406466 -0.395826 0.798647 -0.309133 ... 0.581575 -1.153330 -0.684058 -1.060369 -0.161757 -1.243870 -0.866025 0.5 -5.195840e-01 0.854419

16751 rows × 148 columns

Below is the data frame of labels (i.e., the response variable).

df_y
smell_value_future_8h
0 8.0
1 5.0
2 5.0
3 5.0
4 5.0
... ...
16746 6.0
16747 6.0
16748 6.0
16749 3.0
16750 11.0

16751 rows × 1 columns

Task 7: Train and Evaluate Models#

We have processed raw data and prepared the compute_feature_label function to convert smell and sensor data into features df_x and labels df_y. In this task, you will work on training basic and advanced models to predict bad smell events (i.e., the situation that the smell value is high) using the sensor data from air quality monitoring stations.

First, let us threshold the features to make them binary for our classification task. We will use value 40 as the threshold to indicate a smell event. The threshold 40 was used in the Smell Pittsburgh research paper. It is equivalent to the situation that 10 people reported smell with rating 4 within 8 hours.

df_y_40 = (df_y>=40).astype(int)
df_y_40
smell_value_future_8h
0 0
1 0
2 0
3 0
4 0
... ...
16746 0
16747 0
16748 0
16749 0
16750 0

16751 rows × 1 columns

The descriptive statistics below tell us that the dataset is imbalanced, which means that the numbers of data points in the positive and negative label groups have a big difference.

print("There are %d rows with smell events." % (df_y_40.sum().iloc[0]))
print("This means %.2f proportion of the data has smell events." % (df_y_40.sum().iloc[0]/len(df_y_40)))
There are 1439 rows with smell events.
This means 0.09 proportion of the data has smell events.

Principal Component Analysis#

We can also plot the result from the Principal Component Analysis (PCA) to check if there are patterns in the data.

PCA is a linear dimensionality reduction technique aiming to transform the original variables into a new set of uncorrelated variables (principal components) that preserve as much information or variability in the dataset as possible. To do so, it applies a linear transformation to the data onto a new coordinate system such that the directions (principal components) that capture the largest variation in the data can be identified.

If you are interested in the mathematical explanation of PCA, you can check the following hidden block. Otherwise, please feel free to skip the math.

Dimensionality reduction techiniques are often used to visualize larger-dimensional data in a 2D or 3D space. Let’s apply PCA to our example.

The following function creates a dataframe with the n principal components in the dataset:

def get_pca_result(x, y, n=3):
    """
    Get the result of Principal Component Analysis.

    Parameters
    ----------
    x : pandas.DataFrame
        The features.
    y : pandas.DataFrame
        The labels.
    n : int
        Number of principal components.

    Returns
    -------
    df_pc : pandas.DataFrame
        A data frame with information about principal components.
    df_r : pandas.DataFrame
        A data frame with information about ratios of explained variances.
    """
    # Copy the data to prevent editing it.
    x, y = deepcopy(x), deepcopy(y)

    # Run the principal component analysis.
    pca = PCA(n_components=n)

    # Compute the eigenvectors, which are the principal components.
    pc = pca.fit_transform(x)
    columns = ["PC" + str(i) for i in range(1,1+n)]
    df_pc = pd.DataFrame(data=pc, columns=columns)

    # Set the label for plotting.
    df_pc["y"] = y.astype(str)

    # Set the marker size for plotting.
    df_pc["size"] = 15

    # Get eigenvalues (i.e., variances explained by principal components).
    r = np.round(pca.explained_variance_ratio_, 3)
    df_r = pd.DataFrame({"var":r, "pc":columns})
    return df_pc, df_r

The code below prints the outputs from the get_pca_result function.

df_pc_pca, df_r_pca = get_pca_result(df_x, df_y_40, n=3)
print(df_pc_pca)
print(df_r_pca)
            PC1       PC2       PC3  y  size
0     -2.544053 -7.423711 -4.220014  0    15
1     -1.899612 -7.726739 -1.953110  0    15
2     -0.020275 -7.352263 -0.359142  0    15
3      1.970582 -6.803431  0.281523  0    15
4      3.207700 -6.725773 -0.596667  0    15
...         ...       ...       ... ..   ...
16746 -3.848883 -4.378285  3.056364  0    15
16747 -3.791501 -3.742726  3.182619  0    15
16748 -3.576693 -2.224402  2.208638  0    15
16749 -3.338212 -1.966243  1.894476  0    15
16750 -3.708470 -1.175944  1.687017  0    15

[16751 rows x 5 columns]
     var   pc
0  0.254  PC1
1  0.118  PC2
2  0.084  PC3

Now let us plot the ratios of explained variances (i.e., the eigenvalues). The intuition is that if the explained variance ratio is larger, the corresponding principal component is more important and can represent more information.

# Plot the eigenvalues.
ax1 = sns.barplot(x="pc",y="var", data=df_r_pca, color="c")
ax1 = ax1.set(xlabel="Principal Component",
       ylabel="Ratio of Variance Explained",
       title="Variance Explained by Principal Components")
../_images/tutorial-structured-data_107_0.png

We can also plot the first and second principal components to see the distribution of labels. In the figure below, we see that label 1 (which means having a smell event) is concentrated on one side. However, it looks like labels 0 and 1 do not separate well, which means that the classifier will have difficulty distinguishing these two groups.

# Plot two principal components.
ax2 = sns.lmplot(x="PC1", y="PC2", data=df_pc_pca,
    fit_reg=False, hue="y", legend=True,
    scatter_kws={"s": 10, "alpha" :0.4})
ax2 = ax2.set(title="Distribution of Labels by Principal Components")
../_images/tutorial-structured-data_109_0.png

In the interactive figure below, we can zoom in, zoom out, and rotate the view to check the distribution of labels. We can confirm that label 1 is concentrated on one side. But we also observe that there is a large part of the data that may not be separated well by the classifier (especially linear classifiers). We probably need non-linear classifiers to be able to tell the differences between these two groups.

# Use the Plotly package to show the PCA results.
fig = px.scatter_3d(df_pc_pca.sample(n=5000), x="PC1", y="PC2", z="PC3",
                    color="y", symbol="y", opacity=0.6,
                    size="size", size_max=15,
                    category_orders=dict(y=["0", "1"]),
                    height=700)
fig.show()