Model development - Fine Dust Concentration Prediction

Version 1.0.0, April 26th 2023

Version  Date             Author          Amendments                                 Status
1.0.0    April 26th 2023  Skyler Vermeer  Layout + basic setup (introduction etc.)   Draft
1.1.0    xxxx             Skyler Vermeer  xxxx                                       Draft

Introduction

In this document we will be developing, evaluating and iterating upon multiple models to see which one gives the best results. Seeing as we want to predict Particulate Matter concentrations at a certain location at a certain time, a time-series model is a logical first approach, as it takes seasonality and trends into account.

After exploring research into similar models and the approaches used there, the models we are going to explore are ARIMA, ETS, PROPHET, SNAIVE and Seasonal ARIMA. In this document we will explore all of them, evaluate them and, where they show potential, develop a final model that is fit for deployment based on them.

We will evaluate them based on how accurate they are (their RMSE, MAE and R² scores) and on how long they would take to run when deployed.
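To keep that comparison consistent across chapters, the sketch below shows one way such an evaluation could be wrapped in a helper. This is purely illustrative (the function name and structure are our own assumptions); the chapters themselves compute the metrics inline.

# illustrative sketch only: score a forecast with RMSE/MAE and optionally time the forecast call
import time
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error

def evaluate_forecast(y_true, y_pred, forecast_fn=None):
    """Return RMSE and MAE for a forecast, optionally timing the forecast call itself."""
    metrics = {'RMSE': sqrt(mean_squared_error(y_true, y_pred)),
               'MAE': mean_absolute_error(y_true, y_pred)}
    if forecast_fn is not None:
        start = time.perf_counter()
        forecast_fn()  # e.g. lambda: model_fit.forecast(12)
        metrics['forecast_seconds'] = time.perf_counter() - start
    return metrics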

Chapter 1 - Setting up

In this chapter we will be setting up our project, importing relevant packages and retrieving and preparing the relevant data for modelling.

Before we do anything, we need to load in our standard packages, printing out their version so our experiments can be reproduced.

import sklearn as sk
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

import warnings
warnings.filterwarnings('ignore')


%matplotlib inline

print('scikit-learn version:', sk.__version__)
print('numpy version:', np.__version__)
print(np.__file__)
print('matplotlib version:', matplotlib.__version__)
print('seaborn version:', sns.__version__)
print('pandas version:', pd.__version__)
scikit-learn version: 1.1.2
numpy version: 1.23.0
C:\Users\micro\AppData\Roaming\Python\Python39\site-packages\numpy\__init__.py
matplotlib version: 3.6.2
seaborn version: 0.12.2
pandas version: 1.5.2

Now that we have the relevant packages, it is time to load the data.

df_measurements = pd.read_csv("csv_files/full_measurement.csv")
print(df_measurements.sample(10))
               Date      Time  Location   PM10  PM2.5    PM1
1704657  2020-10-25  08:09:59       536   3.11   0.82   0.49
5178180  2022-07-23  21:00:00       572   6.66   4.13   3.44
4112962  2022-11-25  22:40:00       562  25.73  17.23  12.05
813186   2022-05-18  22:20:00       528  28.15  20.42  17.96
1209868  2022-06-23  06:00:00       531  11.98   7.14   5.83
3771851  2023-05-02  23:20:00       558   9.14   5.80   4.39
1302747  2021-12-02  17:50:00       532   2.51   0.26   0.26
1739831  2021-06-28  08:30:01       536  22.17  20.31  18.62
1469092  2020-12-05  21:20:00       534  12.16   6.69   5.26
4204727  2022-06-20  02:40:00       563  26.03   7.56   5.01

We have already partially prepared the data; however, since we are going to explore a lot of options, any further preparation will have to happen on a case-by-case basis.

Let’s import a few packages that we will need to use that aren’t part of our standard list.

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
from datetime import datetime
from math import sqrt, floor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from prophet import Prophet
from decimal import Decimal
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.arima.model import ARIMA
import itertools
import statsmodels.api as sm
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsforecast import StatsForecast
from statsforecast.models import SeasonalNaive

Now let’s make a function for plotting the decomposed time series, as this is something that we will have to do multiple times.

def plot_decomposed_time(original, model_type, period_num, suptitle_text):
    result = seasonal_decompose(original, model=model_type, period = period_num)
    trend = result.trend
    seasonal = result.seasonal
    residual = result.resid

    decomposed_combo = [['Original', original], ['Trend', trend], ['Cyclic', seasonal], ['Residual',residual]]

    fig, axes = plt.subplots(4, 1, sharex=True, sharey=False)
    for idx in range(len(axes)):
        axes[idx].plot(decomposed_combo[idx][1])
        axes[idx].title.set_text(decomposed_combo[idx][0])

    fig.tight_layout(rect=[0, 0.03, 1, 0.95])
    fig.autofmt_xdate()
    fig.suptitle(suptitle_text)
    fig.set_figheight(10)
    fig.set_figwidth(15)
    plt.ioff()
    plt.close(fig)
    return fig

Based on our research in the EDA, we know that the 10-minute measurements introduce a lot of noise, disguising the trends. That is why we will resample the data to a 60-minute frequency, using the hourly average.

measurements_by_datetime = df_measurements.copy()
measurements_by_datetime['Datetime'] = pd.to_datetime(measurements_by_datetime['Date'] + ' ' + measurements_by_datetime['Time'])
measurements_by_datetime = measurements_by_datetime.set_index('Datetime')
df_521_data = measurements_by_datetime.loc[measurements_by_datetime.Location == 521]

hourly_521_data = df_521_data.resample('60min').mean().ffill()
hourly_521_data = hourly_521_data.asfreq('60min')
hourly_521_data
Location PM10 PM2.5 PM1
Datetime
2021-01-21 13:00:00 521.0 20.935000 10.983333 7.190000
2021-01-21 14:00:00 521.0 21.180000 9.538333 6.596667
2021-01-21 15:00:00 521.0 18.338333 9.216667 6.063333
2021-01-21 16:00:00 521.0 16.801667 8.466667 5.746667
2021-01-21 17:00:00 521.0 12.163333 6.010000 4.270000
... ... ... ... ...
2023-05-19 05:00:00 521.0 21.443333 12.723333 9.218333
2023-05-19 06:00:00 521.0 22.588333 12.730000 9.610000
2023-05-19 07:00:00 521.0 25.113333 13.068333 9.831667
2023-05-19 08:00:00 521.0 22.680000 13.041667 9.475000
2023-05-19 09:00:00 521.0 26.280000 11.330000 8.150000

20349 rows × 4 columns

Now we have the basic hourly data in place for trying out models, so we don’t have to keep repeating these steps. Next, let’s load the weather data that we will combine with the measurements.

hour_weather_data_full = pd.read_csv("csv_files/hourdata_eindhoven_2021_16may2023.csv")
hour_weather_data_full
Date Hour WindDir WindSpeed_avg PrecipationHour Temp
0 20210101 1 290 10 0 20
1 20210101 2 250 10 0 18
2 20210101 3 220 10 0 11
3 20210101 4 270 20 0 10
4 20210101 5 250 20 0 3
... ... ... ... ... ... ...
20779 20230516 20 330 40 0 105
20780 20230516 21 320 20 0 90
20781 20230516 22 300 20 0 78
20782 20230516 23 300 20 0 71
20783 20230516 24 290 20 0 67

20784 rows × 6 columns

weather_measurements_by_datetime_full = hour_weather_data_full.copy()
datetimes = []
for index in weather_measurements_by_datetime_full.index:
    string_to_convert = str(weather_measurements_by_datetime_full['Date'][index]) + " " + str(weather_measurements_by_datetime_full['Hour'][index] - 1)
    datetimes.append(datetime.strptime(string_to_convert, "%Y%m%d %H"))

weather_measurements_by_datetime_full['DateTime'] = datetimes
print(weather_measurements_by_datetime_full.tail())
           Date  Hour  WindDir  WindSpeed_avg  PrecipationHour  Temp  \
20779  20230516    20      330             40                0   105   
20780  20230516    21      320             20                0    90   
20781  20230516    22      300             20                0    78   
20782  20230516    23      300             20                0    71   
20783  20230516    24      290             20                0    67   

                 DateTime  
20779 2023-05-16 19:00:00  
20780 2023-05-16 20:00:00  
20781 2023-05-16 21:00:00  
20782 2023-05-16 22:00:00  
20783 2023-05-16 23:00:00  
full_measurement = pd.read_csv('csv_files/full_measurement.csv')

full_measurement['DateTime'] = pd.to_datetime(full_measurement['Date'] + ' ' + full_measurement['Time'])
measurement_df = full_measurement.set_index('DateTime', drop=True)
grouper = measurement_df.groupby([pd.Grouper(freq='1H'), 'Location'])  # .mean().ffill().asfreq('1H')
# hourly_measurement_full.sample(20)
result = grouper.mean().ffill()
hourly_measurement_full = result.reset_index()
hourly_measurement_full
DateTime Location PM10 PM2.5 PM1
0 2019-04-18 11:00:00 574 20.190000 15.050000 13.930000
1 2019-04-18 11:00:00 575 19.540000 11.460000 9.900000
2 2019-04-18 12:00:00 574 19.470000 14.322000 13.376000
3 2019-04-18 12:00:00 575 18.830000 10.296000 9.004000
4 2019-04-18 13:00:00 574 14.772000 10.352000 9.628000
... ... ... ... ... ...
964167 2023-05-19 09:00:00 576 19.206667 6.463333 5.495000
964168 2023-05-19 09:00:00 577 12.033333 6.821667 4.931667
964169 2023-05-19 10:00:00 575 14.460000 8.080000 5.745000
964170 2023-05-19 10:00:00 576 15.602500 6.862500 5.937500
964171 2023-05-19 10:00:00 577 12.578333 6.605000 4.628333

964172 rows × 5 columns

combine_weather_measurements = hourly_measurement_full.copy()
combine_weather_measurements['WindDir'] = np.nan
combine_weather_measurements['WindSpeed_avg'] = np.nan
combine_weather_measurements['PrecipationHour'] = np.nan
combine_weather_measurements['Temp'] = np.nan

print(len(weather_measurements_by_datetime_full.index))
combine_weather_measurements_full = pd.DataFrame()
for index in weather_measurements_by_datetime_full.index:
    relevant_data = combine_weather_measurements[combine_weather_measurements['DateTime'] == weather_measurements_by_datetime_full['DateTime'][index]].copy()
    relevant_data['WindDir'] = weather_measurements_by_datetime_full['WindDir'][index]
    relevant_data['WindSpeed_avg'] = weather_measurements_by_datetime_full['WindSpeed_avg'][index]
    relevant_data['PrecipationHour'] = weather_measurements_by_datetime_full['PrecipationHour'][index]
    relevant_data['Temp'] = weather_measurements_by_datetime_full['Temp'][index]
    combine_weather_measurements_full = pd.concat([combine_weather_measurements_full, relevant_data])
20784
combine_weather_measurements_full
DateTime Location PM10 PM2.5 PM1 WindDir WindSpeed_avg PrecipationHour Temp
69454 2021-01-01 00:00:00 522 60.440000 51.553333 40.851667 290 10 0 20
69455 2021-01-01 00:00:00 524 51.945000 44.296667 21.755000 290 10 0 20
69456 2021-01-01 00:00:00 526 67.430000 60.678333 38.430000 290 10 0 20
69457 2021-01-01 00:00:00 527 70.378333 62.808333 39.130000 290 10 0 20
69458 2021-01-01 00:00:00 529 56.486667 51.555000 30.910000 290 10 0 20
... ... ... ... ... ... ... ... ... ...
961572 2023-05-16 23:00:00 572 14.198333 8.448333 5.388333 290 20 0 67
961573 2023-05-16 23:00:00 574 10.563333 4.800000 2.753333 290 20 0 67
961574 2023-05-16 23:00:00 575 10.540000 7.101667 4.821667 290 20 0 67
961575 2023-05-16 23:00:00 576 16.710000 5.246667 4.116667 290 20 0 67
961576 2023-05-16 23:00:00 577 9.560000 6.486667 4.405000 290 20 0 67

892123 rows × 9 columns
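As an aside, the row-by-row loop above is easy to follow but slow on almost 900,000 rows; the same combination can in principle be done with a single vectorized merge on DateTime. A minimal sketch, assuming the column names used above:

# sketch of a vectorized alternative to the loop above: join weather onto the hourly measurements by DateTime
weather_cols = ['DateTime', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']
combined_fast = hourly_measurement_full.merge(
    weather_measurements_by_datetime_full[weather_cols],
    on='DateTime',
    how='inner',  # keep only hours present in both datasets, like the loop does
)
print(combined_fast.shape)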

combine_weather_measurements_full.to_csv('csv_files/hour_combined_data_eindhoven_2021_16may2023.csv', index=False)
useable_data = pd.read_csv('csv_files/hour_combined_data_eindhoven_2021_16may2023.csv')
useable_data
DateTime Location PM10 PM2.5 PM1 WindDir WindSpeed_avg PrecipationHour Temp
0 2021-01-01 00:00:00 522 60.440000 51.553333 40.851667 290 10 0 20
1 2021-01-01 00:00:00 524 51.945000 44.296667 21.755000 290 10 0 20
2 2021-01-01 00:00:00 526 67.430000 60.678333 38.430000 290 10 0 20
3 2021-01-01 00:00:00 527 70.378333 62.808333 39.130000 290 10 0 20
4 2021-01-01 00:00:00 529 56.486667 51.555000 30.910000 290 10 0 20
... ... ... ... ... ... ... ... ... ...
892118 2023-05-16 23:00:00 572 14.198333 8.448333 5.388333 290 20 0 67
892119 2023-05-16 23:00:00 574 10.563333 4.800000 2.753333 290 20 0 67
892120 2023-05-16 23:00:00 575 10.540000 7.101667 4.821667 290 20 0 67
892121 2023-05-16 23:00:00 576 16.710000 5.246667 4.116667 290 20 0 67
892122 2023-05-16 23:00:00 577 9.560000 6.486667 4.405000 290 20 0 67

892123 rows × 9 columns

Chapter 2 - ARIMA

In this chapter we are going to see whether ARIMA (AutoRegressive Integrated Moving Average) might be an option for this project. To do this, we will look at the results of a basic model and at any factors that would drastically improve a non-basic model, and in the Evaluation chapter we will look at whether, with this knowledge and some experimenting, this model is the one to continue developing.

First we need to create a basic model and use it to predict our PM values. We want it to be location based, as the location has a drastic influence on the values, and to produce a separate prediction for each PM fraction.

Let’s start with getting the hourly PM10 values. We want a train and test set; for now we will hold out the last 12 hours as the test set.

arima_PM10_X = hourly_521_data.PM10
# size = int(len(arima_PM10_X) * 0.9)
arima_train, arima_test = arima_PM10_X[0:-12], arima_PM10_X[-12:]
history = [x for x in arima_train]
predictions = list()

len(arima_test)
12
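Before fitting, one way to sanity-check the autoregressive order (we use order=(6,1,1) below) is to look at the autocorrelation of the training series; this is what the autocorrelation_plot import from Chapter 1 could be used for. A quick sketch:

# optional check: autocorrelation of the training series, to get a feel for a reasonable AR order
autocorrelation_plot(arima_train)
plt.title("Autocorrelation of hourly PM10 at location 521")
plt.show()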

Now let’s predict for the items in the test set what ARIMA thinks the value is.

model = ARIMA(history, order=(6,1,1))
model_fit = model.fit()
output = model_fit.forecast(len(arima_test))
print(output)

# for t in range(len(arima_test)):
#     model = ARIMA(history, order=(6,1,1))
#     model_fit = model.fit()
#     output = model_fit.forecast()
#     yhat = output[0]
#     predictions.append(yhat)
#     obs = arima_test[t]
#     history.append(obs)
#     print(t)
[24.50263313 24.64901739 24.72725562 24.70783426 24.65342629 24.61453263
 24.58566049 24.56195447 24.54244349 24.52567607 24.51070663 24.49724354]

Now let’s see what the difference between our prediction and the actual values is.

test_dates = hourly_521_data.index[-12:len(arima_PM10_X)].strftime('%Y-%m-%d').tolist()
arima_test_values = []

for item in arima_test:
    arima_test_values.append(item)
pred_real_comparison = {'dates':test_dates, 'true':arima_test_values, 'pred':output}
prc_df = pd.DataFrame(pred_real_comparison)


fig, ax = plt.subplots()
ax.plot(prc_df.true)
ax.plot(prc_df.pred,color='red')
ax.set_xticks(range(0, len(prc_df), 2))  # set the tick positions explicitly so the labels line up
ax.set_xticklabels(prc_df.dates[::2], rotation=90)
fig.suptitle("Difference between Arima model predictions and actual PM10")
plt.xlabel("Date")
plt.ylabel("PM10 (μg/m³)")


plt.show()

[Figure: Difference between ARIMA model predictions and actual PM10]

OBSERVATIONS

predictions = pred_real_comparison['pred']
predictions = predictions.tolist()
rmse = sqrt(mean_squared_error(arima_test, predictions))
print('Test RMSE: %.3f' % rmse)
mae = mean_absolute_error(arima_test, predictions)
print('MAE: %.3f' % mae)
r2_train = r2_score(prc_df.true, prc_df.pred)
print('R2 score: {}'.format(r2_train))
Test RMSE: 2.204
MAE: 2.017
R2 score: -1.27875243625301

OBSERVATIONS

Chapter 3 - ETS

In this chapter we are going to see whether ETS (Exponential Smoothing) might be an option for this project. To do this, we will look at the results of a basic model and at any aspects that would drastically improve the model later, and in the Evaluation chapter we will look at whether, with this knowledge and some experimenting, we can improve this model.

First we need to prepare the data we want to use. We are going to be using the location 521 data with 60min frequency again, to start.

ets_df = hourly_521_data.PM10

Now let’s take a look at the data when it is decomposed. Let’s start with the trends.

final = seasonal_decompose(ets_df,model='multiplicative')
final.trend.plot()
<AxesSubplot: xlabel='Datetime'>

[Figure: Trend component of the multiplicative decomposition of hourly PM10 at location 521]

It’s useful to zoom in on the seasonality, as a timeframe that spans multiple years of hourly data produces a lot of points. Let’s use the first 100 hours for now.

final.seasonal[:100].plot()
plt.title("Multiplicative ETS decomposition Seasonality", fontsize=15)
plt.show()

[Figure: Multiplicative ETS decomposition seasonality (first 100 hours)]

Now let’s show it all together.

final_decomposed = plot_decomposed_time(ets_df, 'multiplicative', 20, 'Decomposed plotting of average PM10 at location 521 with a 1H average per item')
final_decomposed

[Figure: Decomposed plot of average PM10 at location 521 with a 1H average per item]

The next goal is to create a training and testing dataset. We can play a bit with the train/test split later; for now we will again hold out the last 12 hours as the test set. We will also plot the datapoints of the training and test set.

ets_train = ets_df[0:-12]
ets_test= ets_df[-12:]

print("\n Training data start at \n")
print (ets_train[ets_train.index == ets_train.index.min()],['Date','PM10'])
print("\n Training data ends at \n")
print (ets_train[ets_train.index == ets_train.index.max()],['Date','PM10'])

print("\n Test data start at \n")
print (ets_test[ets_test.index == ets_test.index.min()],['Date','PM10'])

print("\n Test data ends at \n")
print (ets_test[ets_test.index == ets_test.index.max()],['Date','PM10'])

plt.scatter(ets_train.index, ets_train, label = 'Train')
plt.scatter(ets_test.index, ets_test,  label = 'Test')
plt.legend(loc = 'best')
plt.title('Original data after split')
plt.show()
 Training data start at 

Datetime
2021-01-21 13:00:00    20.935
Freq: 60T, Name: PM10, dtype: float64 ['Date', 'PM10']

 Training data ends at 

Datetime
2023-05-18 21:00:00    23.788333
Freq: 60T, Name: PM10, dtype: float64 ['Date', 'PM10']

 Test data start at 

Datetime
2023-05-18 22:00:00    22.07
Freq: 60T, Name: PM10, dtype: float64 ['Date', 'PM10']

 Test data ends at 

Datetime
2023-05-19 09:00:00    26.28
Freq: 60T, Name: PM10, dtype: float64 ['Date', 'PM10']

[Figure: Original data after train/test split]

Now let’s see which combination of additive and multiplicative trend and seasonality calculation provides us with the best model.

warnings.filterwarnings("ignore")

# note: 'add'/'additive' and 'mul'/'multiplicative' are aliases, which is why some combinations below give identical results
trends = seasonals = ['add', 'mul', 'additive', 'multiplicative']

for trend in trends:
    for seasonal in seasonals:
        model = ExponentialSmoothing(ets_train, trend=trend, seasonal=seasonal, seasonal_periods=12).fit()

        print('Trend: {} Seasonal: {} - AIC:{}'.format(trend, seasonal, model.aic))
Trend: add Seasonal: add - AIC:51584.38578673731
Trend: add Seasonal: mul - AIC:51703.90055441872
Trend: add Seasonal: additive - AIC:51584.38578673731
Trend: add Seasonal: multiplicative - AIC:51703.90055441872
Trend: mul Seasonal: add - AIC:53051.08022200621
Trend: mul Seasonal: mul - AIC:53350.007462004534
Trend: mul Seasonal: additive - AIC:53051.08022200621
Trend: mul Seasonal: multiplicative - AIC:53350.007462004534
Trend: additive Seasonal: add - AIC:51584.38578673731
Trend: additive Seasonal: mul - AIC:51703.90055441872
Trend: additive Seasonal: additive - AIC:51584.38578673731
Trend: additive Seasonal: multiplicative - AIC:51703.90055441872
Trend: multiplicative Seasonal: add - AIC:53051.08022200621
Trend: multiplicative Seasonal: mul - AIC:53350.007462004534
Trend: multiplicative Seasonal: additive - AIC:53051.08022200621
Trend: multiplicative Seasonal: multiplicative - AIC:53350.007462004534

The best AIC score is the lowest one, which here is achieved with an additive trend and additive seasonality (AIC ≈ 51584). We can now create a model using this combination.
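Rather than reading the winning combination off the printout, we could also track it programmatically; a minimal sketch reusing the loop above (the variable names are our own):

# sketch: keep the (trend, seasonal) pair with the lowest AIC
best_aic, best_combo = float('inf'), None
for trend in ['add', 'mul']:
    for seasonal in ['add', 'mul']:
        fitted = ExponentialSmoothing(ets_train, trend=trend, seasonal=seasonal,
                                      seasonal_periods=12).fit()
        if fitted.aic < best_aic:
            best_aic, best_combo = fitted.aic, (trend, seasonal)
print(best_combo, best_aic)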

print(ets_test)
Datetime
2023-05-18 22:00:00    22.070000
2023-05-18 23:00:00    22.978333
2023-05-19 00:00:00    22.141667
2023-05-19 01:00:00    22.991667
2023-05-19 02:00:00    21.128333
2023-05-19 03:00:00    21.970000
2023-05-19 04:00:00    24.230000
2023-05-19 05:00:00    21.443333
2023-05-19 06:00:00    22.588333
2023-05-19 07:00:00    25.113333
2023-05-19 08:00:00    22.680000
2023-05-19 09:00:00    26.280000
Freq: 60T, Name: PM10, dtype: float64
model = ExponentialSmoothing(ets_train, trend='add', seasonal='add', seasonal_periods=24).fit()
ets_test.plot(label="Test data")
prediction = model.forecast(len(ets_test))
prediction.plot(label="Prediction")
# model.forecast(len(ets_test)).plot(label = 'Prediction')
plt.title("PM10 per 60 minutes", fontsize=15)
plt.xlabel('Date Time combo per 60 minutes')
plt.ylabel('PM10 (μg/m³)')
plt.legend()
plt.show()

[Figure: PM10 per 60 minutes, ETS prediction vs. test data]

OBSERVATION

Let’s calculate the RMSE, MAE, and R2 score so we can compare it to the other models later.

y_true = ets_test.values
y_pred = model.forecast(len(ets_test))
mae = mean_absolute_error(y_true, y_pred)
rmse = sqrt(mean_squared_error(y_true, y_pred))
r2_train = r2_score(y_true, y_pred)

print('MAE: %.3f' % mae)
print('RMSE: %.3f' % rmse)
print('R2 score: {}'.format(r2_train))
MAE: 1.855
RMSE: 2.112
R2 score: -1.092018206614978

Chapter 4 - PROPHET

In this chapter we are going to see whether PROPHET might be an option for this project. To do this, we will look at the results of a basic model and at whether we already know of adjustments that would significantly improve its performance, and in the Evaluation chapter we will look at whether, with this knowledge and some experimenting, we can improve this model.

First we need to prepare the data that we are going to use for this model. Prophet requires the date column to be named ds and the target column to be named y.

# prophet_df = hourly_521_data.copy()
prophet_df = df_521_data.copy()
prophet_df = prophet_df['PM10']
# prophet_df['ds'] = prophet_df.index.values
# prophet_df['ds'] = pd.to_datetime(prophet_df['Date'] + ' ' + prophet_df['Time'])
prophet_df = prophet_df.reset_index(drop=False)
prophet_df.columns = ['ds','y']

We could also aggregate to the mean per timestamp, but for now we leave this step commented out and use the raw 10-minute data.

# prophet_df = prophet_df.groupby('ds').mean('y').reset_index()

Now that we have the data to use, we will create a training and test dataset.

num_entries = prophet_df['ds'].count()
num_test_entries = -72
prophet_train = prophet_df.drop(prophet_df.iloc[num_test_entries:].index)
prophet_test = prophet_df[num_test_entries:]

It is time to create our basic model. One of the perks of Prophet is that you can easily add the holidays of the country you choose.

model = Prophet(yearly_seasonality=True, daily_seasonality=True, weekly_seasonality=True)
model.add_country_holidays(country_name='NL')
model.fit(prophet_train)
model.train_holiday_names
14:03:16 - cmdstanpy - INFO - Chain [1] start processing
14:07:57 - cmdstanpy - INFO - Chain [1] done processing




0         Nieuwjaarsdag
1        Eerste paasdag
2         Goede Vrijdag
3        Tweede paasdag
4            Hemelvaart
5    Eerste Pinksterdag
6    Tweede Pinksterdag
7       Eerste Kerstdag
8       Tweede Kerstdag
9            Koningsdag
dtype: object
dates = prophet_test['ds']

future = pd.DataFrame(dates)
future.columns = ['ds']

Now that we have the model, we can predict our ‘future’ using the timestamps of the test data.

prophecy = model.predict(future)
hypothesis = prophecy.copy()
# note: 'yhat_middle' as computed here is the width of the prediction interval (upper minus lower), not a central estimate
hypothesis['yhat_middle'] = hypothesis['yhat_upper'] - hypothesis['yhat_lower']

Now let’s see what the metrics are.

prophet_true = prophet_test['y'].values
prophet_pred = hypothesis['yhat_middle'].values
mae = mean_absolute_error(prophet_true, prophet_pred)
print('MAE: %.3f' % mae)
rmse = sqrt(mean_squared_error(prophet_test['y'].values, prophecy['yhat'].values))
print('Test RMSE: %.3f' % rmse)
MAE: 6.230
Test RMSE: 3.342

And what it looks like visually.

sns.set(rc={'figure.figsize':(15,6)})

fig = sns.lineplot(data=prophet_test, x='ds', y ='y', color='black')
sns.lineplot(data=hypothesis, x='ds', y ='yhat', color='green')

fig.set(title="PM10", xlabel="Time", ylabel="ug/m3")
plt.xticks(rotation=90)
plt.show()

[Figure: PM10, Prophet prediction vs. actual test values]

prophet_data = combine_weather_measurements_full[combine_weather_measurements_full['Location'] == 521].copy()
prophet_data = prophet_data.rename(columns={'DateTime':'ds', 'PM10': 'y'})
prophet_data
ds Location y PM2.5 PM1 WindDir WindSpeed_avg PrecipationHour Temp
81916 2021-01-21 13:00:00 521 20.935000 10.983333 7.190000 210 90 0 84
81952 2021-01-21 14:00:00 521 21.180000 9.538333 6.596667 200 90 0 81
81988 2021-01-21 15:00:00 521 18.338333 9.216667 6.063333 200 70 0 82
82024 2021-01-21 16:00:00 521 16.801667 8.466667 5.746667 200 60 0 81
82060 2021-01-21 17:00:00 521 12.163333 6.010000 4.270000 190 60 2 62
... ... ... ... ... ... ... ... ... ...
961357 2023-05-16 19:00:00 521 17.615000 7.656667 5.011667 330 40 0 105
961401 2023-05-16 20:00:00 521 23.405000 8.698333 5.396667 320 20 0 90
961445 2023-05-16 21:00:00 521 22.431667 9.130000 5.641667 300 20 0 78
961489 2023-05-16 22:00:00 521 24.321667 10.835000 6.460000 300 20 0 71
961533 2023-05-16 23:00:00 521 24.141667 10.641667 6.711667 290 20 0 67

18732 rows × 9 columns

prophet_train = prophet_data[:-12][['ds', 'y', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']]
prophet_test = prophet_data[-12:][['ds', 'y', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']]

model = Prophet(yearly_seasonality=True, daily_seasonality=True, weekly_seasonality=True)
model.add_country_holidays(country_name='NL')
model.add_regressor('WindDir')
model.add_regressor('WindSpeed_avg')
model.add_regressor('PrecipationHour')
model.add_regressor('Temp')
model.fit(prophet_train)

14:08:01 - cmdstanpy - INFO - Chain [1] start processing
14:08:13 - cmdstanpy - INFO - Chain [1] done processing




<prophet.forecaster.Prophet at 0x1f0503526d0>
future = prophet_test.drop('y', axis=1)
prophecy = model.predict(future)
future
ds WindDir WindSpeed_avg PrecipationHour Temp
961049 2023-05-16 12:00:00 330 60 0 140
961093 2023-05-16 13:00:00 330 50 0 134
961137 2023-05-16 14:00:00 350 40 0 135
961181 2023-05-16 15:00:00 330 40 0 141
961225 2023-05-16 16:00:00 320 50 0 141
961269 2023-05-16 17:00:00 320 40 0 129
961313 2023-05-16 18:00:00 310 40 0 117
961357 2023-05-16 19:00:00 330 40 0 105
961401 2023-05-16 20:00:00 320 20 0 90
961445 2023-05-16 21:00:00 300 20 0 78
961489 2023-05-16 22:00:00 300 20 0 71
961533 2023-05-16 23:00:00 290 20 0 67
prophet_true = prophet_test['y'].values
prophet_pred = prophecy['yhat'].values
mae = mean_absolute_error(prophet_true, prophet_pred)
print('MAE: %.3f' % mae)
rmse = sqrt(mean_squared_error(prophet_test['y'].values, prophecy['yhat'].values))
print('Test RMSE: %.3f' % rmse)
r2 = r2_score(prophet_test['y'].values, prophecy['yhat'].values)
print('Test r2: %.3f' % r2)
MAE: 2.452
Test RMSE: 3.040
Test r2: 0.588
sns.set(rc={'figure.figsize': (15, 6)})

fig = sns.lineplot(data=prophet_test, x='ds', y='y', color='black')
sns.lineplot(data=prophecy, x='ds', y='yhat', color='green')

fig.set(title="PM10", xlabel="Time", ylabel="ug/m3")
plt.xticks(rotation=90)
plt.show()

[Figure: PM10, Prophet prediction with weather regressors vs. actual test values]

prophet_data = combine_weather_measurements_full[combine_weather_measurements_full['Location'] == 521].copy()
prophet_data = prophet_data.rename(columns={'DateTime':'ds', 'PM2.5': 'y'})
prophet_train = prophet_data[:-12][['ds', 'y', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']]
prophet_test = prophet_data[-12:][['ds', 'y', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']]

model = Prophet(yearly_seasonality=True, daily_seasonality=True, weekly_seasonality=True)
model.add_country_holidays(country_name='NL')
model.add_regressor('WindDir')
model.add_regressor('WindSpeed_avg')
model.add_regressor('PrecipationHour')
model.add_regressor('Temp')
model.fit(prophet_train)
future = prophet_test.drop('y', axis=1)
prophecy = model.predict(future)
prophet_true = prophet_test['y'].values
prophet_pred = prophecy['yhat'].values
mae = mean_absolute_error(prophet_true, prophet_pred)
print('MAE: %.3f' % mae)
rmse = sqrt(mean_squared_error(prophet_test['y'].values, prophecy['yhat'].values))
print('Test RMSE: %.3f' % rmse)
r2 = r2_score(prophet_test['y'].values, prophecy['yhat'].values)
print('Test r2: %.3f' % r2)

14:08:16 - cmdstanpy - INFO - Chain [1] start processing
14:08:28 - cmdstanpy - INFO - Chain [1] done processing

MAE: 1.216
Test RMSE: 1.374
Test r2: 0.469
sns.set(rc={'figure.figsize': (15, 6)})

fig = sns.lineplot(data=prophet_test, x='ds', y='y', color='black')
sns.lineplot(data=prophecy, x='ds', y='yhat', color='green')

fig.set(title="PM2.5", xlabel="Time", ylabel="ug/m3")
plt.xticks(rotation=90)
plt.show()

[Figure: PM2.5, Prophet prediction with weather regressors vs. actual test values]

prophet_data = combine_weather_measurements_full[combine_weather_measurements_full['Location'] == 521].copy()
prophet_data = prophet_data.rename(columns={'DateTime':'ds', 'PM1': 'y'})
prophet_train = prophet_data[:-12][['ds', 'y', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']]
prophet_test = prophet_data[-12:][['ds', 'y', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']]

model = Prophet(yearly_seasonality=True, daily_seasonality=True, weekly_seasonality=True)
model.add_country_holidays(country_name='NL')
model.add_regressor('WindDir')
model.add_regressor('WindSpeed_avg')
model.add_regressor('PrecipationHour')
model.add_regressor('Temp')
model.fit(prophet_train)
future = prophet_test.drop('y', axis=1)
prophecy = model.predict(future)
prophet_true = prophet_test['y'].values
prophet_pred = prophecy['yhat'].values
mae = mean_absolute_error(prophet_true, prophet_pred)
print('MAE: %.3f' % mae)
rmse = sqrt(mean_squared_error(prophet_test['y'].values, prophecy['yhat'].values))
print('Test RMSE: %.3f' % rmse)
r2 = r2_score(prophet_test['y'].values, prophecy['yhat'].values)
print('Test r2: %.3f' % r2)

14:08:31 - cmdstanpy - INFO - Chain [1] start processing
14:08:41 - cmdstanpy - INFO - Chain [1] done processing

MAE: 1.294
Test RMSE: 1.504
Test r2: -1.171
sns.set(rc={'figure.figsize': (15, 6)})

fig = sns.lineplot(data=prophet_test, x='ds', y='y', color='black')
sns.lineplot(data=prophecy, x='ds', y='yhat', color='green')

fig.set(title="PM1", xlabel="Time", ylabel="ug/m3")
plt.xticks(rotation=90)
plt.show()

[Figure: PM1, Prophet prediction with weather regressors vs. actual test values]
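The PM10, PM2.5 and PM1 runs above repeat the same steps with a different target column. If we keep iterating on Prophet, it may be worth wrapping them in a small helper; below is a sketch reusing the names defined above (the helper itself is our own and is not used elsewhere in this document):

def fit_prophet_for_target(location_data, target_col, test_hours=12):
    """Fit Prophet with the weather regressors for one PM column and return test-set metrics."""
    df = location_data.rename(columns={'DateTime': 'ds', target_col: 'y'})
    cols = ['ds', 'y', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']
    train_df, test_df = df[:-test_hours][cols], df[-test_hours:][cols]

    m = Prophet(yearly_seasonality=True, daily_seasonality=True, weekly_seasonality=True)
    m.add_country_holidays(country_name='NL')
    for regressor in ['WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']:
        m.add_regressor(regressor)
    m.fit(train_df)

    forecast = m.predict(test_df.drop('y', axis=1))
    return {'MAE': mean_absolute_error(test_df['y'], forecast['yhat']),
            'RMSE': sqrt(mean_squared_error(test_df['y'], forecast['yhat'])),
            'R2': r2_score(test_df['y'], forecast['yhat'])}

# example usage for all three PM fractions at location 521
location_521 = combine_weather_measurements_full[combine_weather_measurements_full['Location'] == 521].copy()
for target in ['PM10', 'PM2.5', 'PM1']:
    print(target, fit_prophet_for_target(location_521, target))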

OBSERVATIONS

Chapter 5 - SNAIVE

In this chapter we are going to see whether SNAIVE (Seasonal Naive) might be an option for this project. To do this, we will look at the results of a basic model, and in the Evaluation chapter we will look at whether, with this knowledge and some experimenting, we can improve this model.

First we need to create a basic visualization of the time data.

print(hourly_521_data)
                     Location       PM10      PM2.5       PM1
Datetime                                                     
2021-01-21 13:00:00     521.0  20.935000  10.983333  7.190000
2021-01-21 14:00:00     521.0  21.180000   9.538333  6.596667
2021-01-21 15:00:00     521.0  18.338333   9.216667  6.063333
2021-01-21 16:00:00     521.0  16.801667   8.466667  5.746667
2021-01-21 17:00:00     521.0  12.163333   6.010000  4.270000
...                       ...        ...        ...       ...
2023-05-19 05:00:00     521.0  21.443333  12.723333  9.218333
2023-05-19 06:00:00     521.0  22.588333  12.730000  9.610000
2023-05-19 07:00:00     521.0  25.113333  13.068333  9.831667
2023-05-19 08:00:00     521.0  22.680000  13.041667  9.475000
2023-05-19 09:00:00     521.0  26.280000  11.330000  8.150000

[20349 rows x 4 columns]
series = hourly_521_data.copy()
series_PM10 = series['PM10']
len(series_PM10)
20349

Now that we have the data to plot, we want to visualize the different aspects of the time series. Let’s grab a snapshot of 1000 entries near the end.

fig_mean_total = plot_decomposed_time(series_PM10[17000:18000], 'additive', 20, 'Decomposed plotting of a 1000-hour snapshot of average PM10 at location 521')
fig_mean_total

[Figure: Decomposed plot of a 1000-hour snapshot of average PM10 at location 521]

# PM10_202122 = series.PM10
train = series_PM10[0:int(len(series_PM10)*0.9)]
test = series_PM10[int(len(series_PM10)*0.9):]

y_hat_naive = pd.DataFrame(test)
y_hat_naive['naive_forecast'] = train[len(train) - 1]

fig = plt.figure(figsize=(20,5))
plt.grid()
plt.plot(train.index, train, label='Train')
plt.plot(test.index, test , label='Test')
plt.plot(y_hat_naive.index, y_hat_naive['naive_forecast'], label='Naive forecast')
plt.legend(loc='best')
plt.title('Naive Method')
plt.xticks(rotation=45)
plt.show()

[Figure: Naive method forecast vs. train and test data]

rmse = sqrt(mean_squared_error(test, y_hat_naive['naive_forecast']))
print('Test RMSE: %.3f' % rmse)

mae = mean_absolute_error(test, y_hat_naive['naive_forecast'])
print('MAE: %.3f' % mae)
Test RMSE: 11.106
MAE: 9.327
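The naive forecast above simply repeats the last training value for the whole test horizon. A seasonal naive forecast instead repeats the value observed one full season earlier, which for hourly data with a daily cycle means 24 hours back. As a rough illustration of the idea for the first day of the test set (this manual version is for intuition only; below we use statsforecast’s SeasonalNaive):

# illustration: a seasonal naive prediction is the value observed 24 hours earlier
manual_seasonal_naive = series_PM10.shift(24)[test.index[:24]]
print(manual_seasonal_naive.head())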
df_testing = train.reset_index()
df_testing['unique_id'] = "PM10Data"
df_testing.rename(columns={'Datetime':'ds', 'PM10':'y'}, inplace = True)
df_testing.head(10)

sf = StatsForecast(
    models = [SeasonalNaive(season_length = 24)],
    freq = '60T'
)

sf.fit(df_testing)

days = test.index.unique()
test_days = test.groupby(test.index).mean()

y = df_testing['y'].to_numpy()
y_hat_dict = sf.predict(h=len(test_days))
y_hat_dict = y_hat_dict.reset_index()
y_hat_dict.index = y_hat_dict.ds


forecast_df = pd.DataFrame(y_hat_dict['SeasonalNaive'])
# plt.plot(train.index, train)
plt.plot(test.index,test)

plt.plot(forecast_df.index, forecast_df)
plt.xticks(rotation=45)
plt.show()

[Figure: Seasonal naive forecast vs. test data]

rmse = sqrt(mean_squared_error(test_days, forecast_df))
print('Test RMSE: %.3f' % rmse)

mae = mean_absolute_error(test_days, forecast_df)
print('MAE: %.3f' % mae)
Test RMSE: 32.623
MAE: 30.422

OBSERVATION

Chapter 6 - Seasonal ARIMA

In this chapter we are going to see whether Seasonal ARIMA (SARIMA) might be an option for this project. To do this, we will look at the results of a basic model, and in the Evaluation chapter we will look at whether, with this knowledge and some experimenting, we can improve this model.

df_521_data.index
DatetimeIndex(['2021-01-21 13:00:00', '2021-01-21 13:10:00',
               '2021-01-21 13:20:00', '2021-01-21 13:30:00',
               '2021-01-21 13:40:00', '2021-01-21 13:50:00',
               '2021-01-21 14:00:00', '2021-01-21 14:10:00',
               '2021-01-21 14:20:00', '2021-01-21 14:30:00',
               ...
               '2023-05-19 07:30:01', '2023-05-19 07:40:00',
               '2023-05-19 07:50:00', '2023-05-19 08:00:00',
               '2023-05-19 08:10:00', '2023-05-19 08:20:01',
               '2023-05-19 08:30:00', '2023-05-19 08:40:00',
               '2023-05-19 08:50:00', '2023-05-19 09:00:00'],
              dtype='datetime64[ns]', name='Datetime', length=111871, freq=None)
sarima_df = df_521_data.copy()
sarima_df = sarima_df.resample('60min').mean().ffill().asfreq('60min')
sarima_df = sarima_df.PM10
train = sarima_df[0:int(len(sarima_df)*0.95)].asfreq('60T')
test = sarima_df[int(len(sarima_df)*0.95):].asfreq('60T')
# train = sarima_df[0:-1000].asfreq('60T')
# test = sarima_df[-100:].asfreq('60T')
mod = sm.tsa.statespace.SARIMAX(train,
                                order=(1,1,1),
                                seasonal_order=(1,1,1,24),
                                enforce_stationarity=False,
                                enforce_invertibility=False)

results = mod.fit(low_memory=True)
# pred = results.get_prediction(start = test.head().index[0], end= 100, dynamic=True)
pred = results.get_forecast(len(test))
pred_ci = pred.conf_int()
len(pred_ci)
1018
pred.predicted_mean
2023-04-07 00:00:00    37.037621
2023-04-07 01:00:00    37.326010
2023-04-07 02:00:00    37.309312
2023-04-07 03:00:00    36.760874
2023-04-07 04:00:00    36.717244
                         ...    
2023-05-19 05:00:00    39.125892
2023-05-19 06:00:00    39.230595
2023-05-19 07:00:00    39.475869
2023-05-19 08:00:00    39.222370
2023-05-19 09:00:00    38.787616
Freq: 60T, Name: predicted_mean, Length: 1018, dtype: float64
ax = test.plot(label='Real Data', figsize=(25, 10))
pred.predicted_mean.plot(label='Prediction')

# ax.fill_between(pred_ci.index,
#                 pred_ci.iloc[:, 0],
#                 pred_ci.iloc[:, 1], color='k', alpha=.2)

plt.legend()
# plt.xlim(['2022-02-20', '2023-02-21'])
# plt.title("PM10 per 10 Minutes", fontsize=15)
plt.xlabel('Date Time combo per 60 Minutes')
plt.ylabel('PM10 (ug/m3)')
plt.show()

[Figure: SARIMA forecast vs. actual PM10 test data]

y_true = test.values
y_pred = pred.predicted_mean
mae = mean_absolute_error(y_true, y_pred)
rmse = sqrt(mean_squared_error(y_true, y_pred))
r2_train = r2_score(y_true, y_pred)

print('MAE: %.3f' % mae)
print('RMSE: %.3f' % rmse)
print('R2 score: {}'.format(r2_train))
MAE: 15.843
RMSE: 17.372
R2 score: -1.4834429241888407
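The (1,1,1)×(1,1,1,24) orders above were picked by hand. If we decide to develop this model further, a small AIC-based grid search over candidate orders (using the itertools import from Chapter 1) could look roughly like the sketch below; note that this can be very slow on the full training set, so one would likely run it on a shorter subset first:

# sketch: compare a few candidate SARIMA orders by AIC (potentially very slow on ~19,000 hourly points)
candidate_orders = list(itertools.product(range(0, 2), range(0, 2), range(0, 2)))
best_aic, best_combo = float('inf'), None
for order in candidate_orders:
    for seasonal_order in [o + (24,) for o in candidate_orders]:
        try:
            fit = sm.tsa.statespace.SARIMAX(train, order=order,
                                            seasonal_order=seasonal_order,
                                            enforce_stationarity=False,
                                            enforce_invertibility=False).fit(disp=False)
            if fit.aic < best_aic:
                best_aic, best_combo = fit.aic, (order, seasonal_order)
        except Exception:
            continue
print(best_combo, best_aic)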

Chapter 7 - Polynomial Features

# convert the DateTime strings to Unix timestamps so the date can be used as a numeric feature
useable_data['DateTime'] = pd.to_datetime(useable_data['DateTime']).map(pd.Timestamp.timestamp)
useable_data
DateTime Location PM10 PM2.5 PM1 WindDir WindSpeed_avg PrecipationHour Temp
0 1609459200.0 522 60.440000 51.553333 40.851667 290 10 0 20
1 1609459200.0 524 51.945000 44.296667 21.755000 290 10 0 20
2 1609459200.0 526 67.430000 60.678333 38.430000 290 10 0 20
3 1609459200.0 527 70.378333 62.808333 39.130000 290 10 0 20
4 1609459200.0 529 56.486667 51.555000 30.910000 290 10 0 20
... ... ... ... ... ... ... ... ... ...
892118 1684278000.0 572 14.198333 8.448333 5.388333 290 20 0 67
892119 1684278000.0 574 10.563333 4.800000 2.753333 290 20 0 67
892120 1684278000.0 575 10.540000 7.101667 4.821667 290 20 0 67
892121 1684278000.0 576 16.710000 5.246667 4.116667 290 20 0 67
892122 1684278000.0 577 9.560000 6.486667 4.405000 290 20 0 67

892123 rows × 9 columns

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X_train = useable_data[:-24].drop(['PM10','PM2.5','PM1'], axis=1)
Y_train = useable_data[:-24]['PM10']
X_test = useable_data[-24:].drop(['PM10','PM2.5','PM1'], axis=1)
Y_test = useable_data[-24:]['PM10']

def create_polynomial_regression_model(degree):
  "Creates a polynomial regression model for the given degree"

  poly_features = PolynomialFeatures(degree=degree)

  # transforms the existing features to higher degree features.
  X_train_poly = poly_features.fit_transform(X_train)

  # fit the transformed features to Linear Regression
  poly_model = LinearRegression()
  poly_model.fit(X_train_poly, Y_train)

  # predicting on training data-set
  y_train_predicted = poly_model.predict(X_train_poly)

  # predicting on test data-set
  y_test_predict = poly_model.predict(poly_features.fit_transform(X_test))

  # evaluating the model on training dataset
  rmse_train = np.sqrt(mean_squared_error(Y_train, y_train_predicted))
  r2_train = r2_score(Y_train, y_train_predicted)

  # evaluating the model on test dataset
  rmse_test = np.sqrt(mean_squared_error(Y_test, y_test_predict))
  r2_test = r2_score(Y_test, y_test_predict)

  print("The model performance for the training set")
  print("-------------------------------------------")
  print("RMSE of training set is {}".format(rmse_train))
  print("R2 score of training set is {}".format(r2_train))

  print("\n")

  print("The model performance for the test set")
  print("-------------------------------------------")
  print("RMSE of test set is {}".format(rmse_test))
  print("R2 score of test set is {}".format(r2_test))
  plt.plot(y_test_predict)
for i in range(3):
    print("degree is: {}".format(i))
    create_polynomial_regression_model(i)
plt.plot(Y_test.reset_index(drop=True))
plt.show()
degree is: 0
The model performance for the training set
-------------------------------------------
RMSE of training set is 14.46229192559551
R2 score of training set is 0.0


The model performance for the test set
-------------------------------------------
RMSE of test set is 5.993982502208196
R2 score of test set is -0.26582615618008254
degree is: 1
The model performance for the training set
-------------------------------------------
RMSE of training set is 14.07516622892575
R2 score of training set is 0.052819348524525744


The model performance for the test set
-------------------------------------------
RMSE of test set is 7.587458967938542
R2 score of test set is -1.0283170191605833
degree is: 2
The model performance for the training set
-------------------------------------------
RMSE of training set is 14.011162933271896
R2 score of training set is 0.061413897114652505


The model performance for the test set
-------------------------------------------
RMSE of test set is 7.45098038658431
R2 score of test set is -0.9560050160488545

[Figure: Polynomial regression predictions (degrees 0-2) vs. actual PM10]

poly_features = PolynomialFeatures(degree=1)

# transforms the existing features to higher degree features.
X_train_poly = poly_features.fit_transform(X_train)

# fit the transformed features to Linear Regression
poly_model = LinearRegression()
poly_model.fit(X_train_poly, Y_train)

# predicting on training data-set
y_train_predicted = poly_model.predict(X_train_poly)

# predicting on test data-set
y_test_predict = poly_model.predict(poly_features.fit_transform(X_test))

plt.plot(y_test_predict)
plt.plot(Y_test.reset_index(drop=True))
plt.show()

[Figure: Degree-1 polynomial regression predictions vs. actual PM10]

y_test_predict
array([21.57553263, 21.55679041, 21.48182152, 21.44433707, 21.42559485,
       21.3881104 , 21.36936818, 21.35062596, 21.31314152, 21.29439929,
       21.27565707, 21.25691485, 21.23817263, 21.2194304 , 21.20068818,
       21.16320374, 21.14446151, 21.12571929, 21.10697707, 21.08823485,
       21.0507504 , 21.03200818, 21.01326596, 20.99452374])

KNN for Locations

from sklearn.neighbors import KNeighborsClassifier
locations = pd.read_csv('csv_files/locaties-airboxen.csv', sep=';')

neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(locations[['Lon','Lat']], locations['ID'])
KNeighborsClassifier(n_neighbors=1)
print(neigh.predict([[5.486614062741882,51.4460197870039]]))
[534]
