Data Collection and Preparation - City Digital Twin AR Zoom lens

Skyler Vermeer - Intern Lectorate HTES

Version 1.3.3, March 21st 2023

Version Control

Version Date Author Amendments Status
1.0.0 February 16th 2023 Skyler Vermeer Layout + test out downloaded files data Draft
1.1.0 - 9 February 17th - March 13th 2023 Skyler Vermeer Various variations of code to scrape data from the API Draft
1.2.0 March 3rd 2023 Skyler Vermeer Exploring 1 location (1 day) Draft - old version still being updated
1.3.0 March 14th 2023 Skyler Vermeer Combining all data into full datasets Draft
1.3.1 March 15th 2023 Skyler Vermeer Extend documentation (Sourcing, layout requirement) Draft
1.3.2 March 16th 2023 Skyler Vermeer Verification of start dates and units, overall layout and explanation update. Draft
1.3.3 March 21st 2023 Skyler Vermeer Requirements, Data Quality Draft
1.3.4 March 28th 2023 Skyler Vermeer Exploring Single day Draft
1.3.5 March 30th 2023 Skyler Vermeer Exploring Single day - continued Draft

The first thing we do in any exploratory data analysis is import our basic packages and print their versions, so that what we did is transparent and can be reproduced.

import os
from datetime import datetime

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

print('pandas version:', pd.__version__)
print('numpy version:', np.__version__)
print('matplotlib version:', matplotlib.__version__)
pandas version: 1.5.2
numpy version: 1.23.0
matplotlib version: 3.6.2

Data Sourcing

To obtain data that is relevant to my Fine Dust domain and can help with answering my subquestions, “What data is available regarding this problem?” and “How can AI be used to make predictions using this data?”, I went to the Eindhoven Open Data platform. I did this because the City Digital Twin project works on the basis of creating a Digital Twin of the city of Eindhoven using the open data that the city shares. Here I searched for all the datasets relevant to fine dust.

There are different datasets related to fine dust on the platform.

  1. Real-time fine dust monitoring {SOURCE}: contains real-time fine dust data for Eindhoven, updated every hour and collected every 10 minutes. It is collected at just under 50 locations in and around the city of Eindhoven and is meant to give an overview of the air quality in South-East Brabant. It measures PM1, PM2.5, PM10 and NO2, at locations with varying concentrations of fine dust particles. The platform lets you download the data, select specific locations and download in batches of up to 70 days. It also has an API {SOURCE} through which you can obtain historical data, search for specific moments, and filter which features you want.
  2. Locaties Caireboxen Fijnstof-monitoring {SOURCE}: contains the locations of the fine dust monitoring boxes in the Eindhoven municipality. It has the ID, name, town and street, observables, API URL and location of the monitoring boxes.

During my research I was referred to some datasets for additional information or data:

  1. SamenMeten {SOURCE}: contains data collected by a variety of organizations, combined by RIVM into one data portal. Most sensors measure multiple times an hour, and air quality is averaged per hour. The measurements are calibrated with the use of data science, to correct unwanted influences such as humidity. Raw sensor measurements are also available.
  2. WorldClim {SOURCE}: contains climate data for 1970-2000. There is monthly climate data for various weather features, as well as 19 bioclimatic variables, available at 4 spatial resolutions ranging from 30 seconds to 10 minutes. There is also monthly weather data from 1960-2018 for minimum and maximum temperature and precipitation. Based on this, a model was created for future weather and climate predictions.

Based on the research and the data we have available, we are going to visualize the current fine dust data and predict the amount of fine dust (PM1, PM2.5 and PM10) in the future based on historical data, if possible also incorporating weather and climate. This fits the requests of the client, helps with answering the research question, and the data allows for it. The weather data is added because, based on my domain research, the weather is an important aspect in the behaviour of fine dust, so if possible we want to incorporate it. There are other prediction opportunities, such as classifying the type of fine dust and whether it is harmful or not, but that is not possible with the current data.

Data requirements

For this project we want to create a prediction of the amount of fine dust in the air at a certain moment in the future. We also want to show the current amount of fine dust in the area. To do this, we will need a few things from our data.

  • The data must be collected consistently, so we can see trends and outliers.
  • The data must be collected regularly, so we can show the current amount of fine dust in the area.
  • The data must contain some form of timestamp.
  • The data must contain some form of location.
  • The data must contain a measurement of fine dust, preferably PM1, PM2.5 and PM10.

Based on this, we will be using the real-time fine dust monitoring that was shared on the Eindhoven Open Data platform. We will use the Locaties Caireboxen fijnstof monitoring to verify the locations that are being monitored.

If we want, we could use WorldClim to incorporate weather and climate data into our prediction. However, for now we will first explore the fine dust monitoring dataset and work from there.

Collecting the Data

Downloading

To start this project, I manually downloaded data from December 21st 2022 to February 16th 2023 due to the 70 day maximum. I stored the manually downloaded files in a directory called csv_files/manual_download for ease of access.

manual_file_path = "C:/Users/Public/Documents/Github/Internship-S5/cdt-a-i-r-zoom-lens-back-end/csv_files/manual_download/"

Let’s take a look at what this data looks like

pd.read_csv(manual_file_path + 'Dustmonitoring data.2022-12-21.2023-02-16.csv', sep=';')
Unnamed: 0 Unnamed: 1 I01 I01.1 I01.2 I01.3 I01.4 I01.5 I02 I02.1 ... I57.2 I57.3 I57.4 I57.5 I58 I58.1 I58.2 I58.3 I58.4 I58.5
0 Time Time Lat Lon PM1 PM2.5 PM10 NO2 Lat Lon ... PM1 PM2.5 PM10 NO2 Lat Lon PM1 PM2.5 PM10 NO2
1 UTC Local ° ° μg/m³ μg/m³ μg/m³ μg/m³\r\n ° ° ... μg/m³ μg/m³ μg/m³ μg/m³\r\n ° ° μg/m³ μg/m³ μg/m³ μg/m³\r\n
2 2022-12-21 00:00:00 2022-12-21 01:00:00 51,4063 5,4574 17,21 19,13 30,32 22 NaN NaN ... 10,44 10,71 20,22 11 NaN NaN NaN NaN NaN NaN
3 2022-12-21 00:00:01 2022-12-21 01:00:01 NaN NaN NaN NaN NaN NaN 51,4379 5,3581 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 2022-12-21 00:00:04 2022-12-21 01:00:04 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 51,4904 5,3941 15,75 16,54 17,94 15
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
28915 2023-02-16 23:40:04 2023-02-17 00:40:04 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 51,4904 5,3941 11,01 11,85 13,49 13.0
28916 2023-02-16 23:49:59 2023-02-17 00:49:59 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28917 2023-02-16 23:50:00 2023-02-17 00:50:00 51,4063 5,4573 15,00 16,16 17,20 2.0 51,4379 5,3582 ... 9,80 10,20 24,09 16.0 NaN NaN NaN NaN NaN NaN
28918 2023-02-16 23:50:01 2023-02-17 00:50:01 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
28919 2023-02-16 23:50:04 2023-02-17 00:50:04 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 51,4904 5,3941 10,48 11,02 11,56 13.0

28920 rows × 287 columns

We can see that the information is quite unwieldy. It technically contains the data, but it has a column for every location and measurement, and the columns that are not related to the location of the entry are NaN. It would need a lot of adjusting to gather the info, so this is not really a good approach. Let’s look at a selection of a single location.

pd.read_csv(manual_file_path + 'Dustmonitoring data.2022-12-21.2023-02-16_I01.csv', sep=';')
Unnamed: 0 Unnamed: 1 I01 I01.1 I01.2 I01.3 I01.4 I01.5
0 Time Time Lat Lon PM1 PM2.5 PM10 NO2
1 UTC Local ° ° μg/m³ μg/m³ μg/m³ μg/m³\r\n
2 2022-12-21 00:00:00 2022-12-21 01:00:00 51,4063 5,4574 17,21 19,13 30,32 22
3 2022-12-21 00:10:00 2022-12-21 01:10:00 51,4063 5,4574 18,29 20,75 25,83 25
4 2022-12-21 00:20:01 2022-12-21 01:20:01 51,4063 5,4574 18,88 22,04 32,27 25
... ... ... ... ... ... ... ... ...
8340 2023-02-16 23:10:00 2023-02-17 00:10:00 51,4063 5,4573 19,13 21,14 29,29 2
8341 2023-02-16 23:20:00 2023-02-17 00:20:00 51,4063 5,4573 18,37 20,23 32,35 2
8342 2023-02-16 23:30:00 2023-02-17 00:30:00 51,4063 5,4573 17,36 19,13 26,48 2
8343 2023-02-16 23:40:01 2023-02-17 00:40:01 51,4063 5,4573 18,19 19,54 25,84 2
8344 2023-02-16 23:50:00 2023-02-17 00:50:00 51,4063 5,4573 15,00 16,16 17,20 2

8345 rows × 8 columns

This would work, as it only needs the top 2 rows to be merged into the column names. However, manually downloading everything would take too much work, because 58 locations in batches of 70 days would be around 580+ requests. There is also an API. The API works with OData: if you use the correct query, the response also contains an @odata.nextLink, which gives you the query for the next batch of the same data beyond what you have already retrieved. This means that if we follow these links automatically, we can scrape the data in a loop (and, in the future, resume from a specific point).

I found that https://api.dustmonitoring.nl/v1/project(30)/location(521)/stream gives the data for location ‘521’, and that the response includes an @odata.nextLink.
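
A minimal sketch of what such a paginated response looks like, assuming the requests library and the endpoint above (the variable names are just illustrative):

import requests

resp = requests.get("https://api.dustmonitoring.nl/v1/project(30)/location(521)/stream")
payload = resp.json()
print(len(payload['value']))           # number of records in this batch
print(payload.get('@odata.nextLink'))  # URL of the next batch, or None if this was the last one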

Using the API

So first I am going to get all the available locations.

import requests as request

res = request.get("https://api.dustmonitoring.nl/v1/project(30)/locations")
data = res.json()

locations = []
for item in data['value']:
    locations.append(item['ID'])

Then it is time to obtain the info from all these locations.

print(locations)
print(len(locations))
[531, 521, 542, 543, 553, 544, 545, 532, 533, 554, 534, 535, 546, 536, 556, 522, 557, 547, 549, 524, 537, 525, 526, 539, 551, 540, 558, 527, 528, 529, 530, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571, 572, 573, 574, 575, 576, 577]
49
file_path = "C:/Users/Public/Documents/Github/Internship-S5/cdt-a-i-r-zoom-lens-back-end/csv_files/raw_data/"
file_list = os.listdir(file_path)
for location in locations:
    request_string = "https://api.dustmonitoring.nl/v1/project(30)/location(" + str(location) + ")/stream"
    index = 0
    dataframes = []
    # Keep following the @odata.nextLink until a response no longer contains one
    while index != -1:
        res = request.get(request_string)
        data = res.json()
        df = pd.DataFrame(data['value'])
        dataframes.append(df)
        try:
            request_string = data['@odata.nextLink']
        except KeyError:
            index = -1
        print(request_string)
    # Combine all batches for this location and save them as one raw csv file
    combined_df = pd.concat(dataframes)

    combined_df.to_csv(file_path + str(location) + '_raw_data')
print("done")

# This code block is to save a location if the last step (saving) threw an error for whatever reason

# combined_df.to_csv(file_path + str(577) + '_raw_data')
# This code block is to finish a location if it was interrupted while scraping so it doesn't have to go from the start.

# while index != -1:
#     res = request.get(request_string)
#     data = res.json()
#     df = pd.DataFrame(data['value'])
#     dataframes.append(df)
#     try:
#         request_string = data['@odata.nextLink']
#     except KeyError:
#         index = -1
#     print(request_string)
# combined_df = pd.DataFrame()
# for dataframe in dataframes:
#     combined_df = combined_df.append(dataframe)
#
# combined_df.to_csv(file_path + str(location) + '_raw_data')

The next piece of code updates the dataframes created during the initial scraping, so the data can be combined with KNMI data. It is practically the same, but it continues from the last known data point of each location.

current_data = pd.read_csv(file_path + str(531) + "_raw_data")
current_data.tail(1).Timestamp.values[0]
'2023-05-19T09:00:00Z'
for location in locations:
    current_data = pd.read_csv(file_path + str(location) + "_raw_data")
    last_datetime = current_data.tail(1).Timestamp.values[0]
    request_string = "https://api.dustmonitoring.nl/v1/project(30)/location(" + str(location) + ")/stream?$filter=timestamp gt " + last_datetime
    index = 0
    dataframes = []
    while index != -1:
        res = request.get(request_string)
        data = res.json()
        df = pd.DataFrame(data['value'])
        dataframes.append(df)
        try:
            request_string = data['@odata.nextLink']
        except KeyError:
            index = -1
        print(request_string)
    combined_df = current_data
    for dataframe in dataframes:
        combined_df = pd.concat([combined_df, dataframe], ignore_index=True)

    combined_df.to_csv(file_path + str(location) + '_raw_data')
print("done")

Preparation

Single location

Now that we have all the data, it is time to combine the raw data into a large dataset. But first, let’s look at a single location dataset.

df_521 = pd.read_csv(file_path + '521_raw_data')
print(df_521.sample(5))
        Unnamed: 0.4  Unnamed: 0.3  Unnamed: 0.2  Unnamed: 0.1  Unnamed: 0  \
86328          86328       86328.0       86328.0       86328.0        72.0   
16279          16279       16279.0       16279.0       16279.0         7.0   
262576        262576      262576.0      262576.0      262576.0        64.0   
607332        607332      607332.0      607332.0      607332.0         NaN   
585241        585241      585241.0      585241.0      585241.0        25.0   

                   Timestamp  Location  Entity   Name       Value  \
86328   2021-06-13T16:00:00Z       521     129  PM2.5      8.3600   
16279   2021-02-09T09:30:00Z       521     130   PM10     12.1900   
262576  2022-01-30T05:20:00Z       521      10    Lat     51.3468   
607332  2023-03-12T07:20:01Z       521     145    UFP  11782.0000   
585241  2023-02-16T16:40:01Z       521      10    Lat     51.4379   

                  Unit  
86328            μg/m³  
16279            μg/m³  
262576               °  
607332  counts/cm³\r\n  
585241               °  

We see that we start with unnamed columns, because the dataframes were combined and saved including their index, so we need to remove those. It would also be useful to separate the timestamp into a date and a time. The name and unit can be combined into a separate reference dataframe, as that combination is only needed for reference. It is also useful to separate the Lon/Lat values from the rest of the values: even though we need them, they are essentially the same per location, with a few inaccuracies in measuring, so we should save the locations in a separate csv with one row per location. Entity is the ID associated with a measurement, so all Lat measurements have entity 10 and all PM10 measurements have entity 130. Also, some of the units have \r\n attached to them, which isn’t part of the unit.
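
As a side note, the trailing \r\n could also be stripped in a single vectorized operation; a minimal sketch, assuming df_521 as loaded above (shown for reference only, below we instead clean the small unit reference table):

# strip whitespace such as "\r\n" from the units, leaving the other units unchanged
df_521['Unit'] = df_521['Unit'].str.strip()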

First, let’s remove the extra, unnamed index column.

print(df_521.loc[:, 'Timestamp':'Unit'].sample(5))
                   Timestamp  Location  Entity   Name    Value   Unit
48853   2021-04-22T08:00:00Z       521      10    Lat  51.3468      °
131742  2021-08-16T00:00:00Z       521     129  PM2.5   9.8400  μg/m³
313889  2022-03-30T21:30:00Z       521      10    Lat  51.3468      °
661700  2023-05-10T19:10:00Z       521     128    PM1   9.2100  μg/m³
388315  2022-07-22T23:10:01Z       521      11    Lon   4.7966      °

Now, let’s separate the values. The names are short enough that they can easily be used as an index. We want to split the data into three datasets: a dataset with the different locations (ID combined with Lat and Lon), a dataset with the different measurements (Entity ID, Name and Unit), and a dataset with the main data we will be using, which has separate columns for day and time, a location id, a name and a measurement.

Let’s check first whether there is one unique unit per measurement or whether there are multiple.

df_521.groupby('Name')['Unit'].unique()
Name
Lat                   [°]
Lon                   [°]
NO2           [μg/m³\r\n]
PM1               [μg/m³]
PM10              [μg/m³]
PM2.5             [μg/m³]
UFP      [counts/cm³\r\n]
Name: Unit, dtype: object

It looks like there is exactly one unit per measurement name. Of course this is just one location and we still need to check the others, but it is positive that at least for this location it is a static combination. Let’s save this as a static dataframe for now, and we can verify later whether it is the same for all locations.

units = df_521.groupby('Name')['Unit'].unique()
entities = df_521.groupby('Name')['Entity'].unique()
measurement_info = pd.DataFrame()
# Strip the trailing "\r\n" from the NO2 and UFP units (positions 2 and 6 in the grouped names)
units[2][0] = units[2][0][:-2]
units[6][0] = units[6][0][:-2]

for idx in range(len(units)):
    measurement_info = pd.concat(
        [measurement_info, pd.DataFrame([units.index[idx], units[idx][0], entities[idx][0]]).T])
measurement_info.columns = ['Name', 'Unit', 'Entity_ID']
measurement_info = measurement_info.reset_index(drop=True)
print(measurement_info)
    Name        Unit Entity_ID
0    Lat           °        10
1    Lon           °        11
2    NO2       μg/m³       146
3    PM1       μg/m³       128
4   PM10       μg/m³       130
5  PM2.5       μg/m³       129
6    UFP  counts/cm³       145
measurement_info.to_csv(
    'C:/Users/Public/Documents/Github/Internship-S5/cdt-a-i-r-zoom-lens-back-end/csv_files/' + 'measurement_info',
    index=False)

Now, let’s see about the locations. We know that though there is only one correct longitude and latitude combination per location, due to some errors there are sometimes other coordinates. However, there is also the dataset saved in ‘locaties-airboxen.csv’, which contains the actual locations of the sensors.

airboxes = pd.read_csv('csv_files/locaties-airboxen.csv', sep=';')
airbox_521 = airboxes.loc[airboxes.ID == 521]
airbox_521
ID Name Lat Lon Description Observables/0 Observables/1 Observables/2 Observables/3 Observables/4 URL geopoint
12 521 I02 51.437895 5.358142 Eindhoven, Scherpenering Lat Lon PM1 PM2.5 PM10 https://api.dustmonitoring.nl/v1/project(30)/l... 51.437895, 5.358142
lon = airbox_521['Lon']
lat = airbox_521['Lat']
print(str(521) + ', Lat: ' + str(lat.values[0]) + ', Lon: ' + str(lon.values[0]))
521, Lat: 51.437895, Lon: 5.358142

Now we know how to easily get more data about the different locations. The last and most important thing we need is the data that we actually want to use. For that we want separate columns for day and time, a location id, a name and a measurement.

measurement_521_data = df_521.loc[:, ['Timestamp', 'Location', 'Name', 'Value']]
measurement_521_drop_unused = measurement_521_data[
    (measurement_521_data.Name != 'Lat') & (measurement_521_data.Name != 'Lon') & (
            measurement_521_data.Name != 'NO2') & (measurement_521_data.Name != 'UFP')].copy()

print(measurement_521_drop_unused)
                   Timestamp  Location   Name  Value
54      2021-01-21T13:00:00Z       521   PM10  22.03
55      2021-01-21T13:00:00Z       521  PM2.5  10.54
56      2021-01-21T13:00:00Z       521    PM1   7.29
59      2021-01-21T13:10:00Z       521   PM10  20.90
60      2021-01-21T13:10:00Z       521  PM2.5  10.52
...                      ...       ...    ...    ...
670192  2023-05-19T08:50:00Z       521  PM2.5  13.44
670193  2023-05-19T08:50:00Z       521   PM10  19.53
670198  2023-05-19T09:00:00Z       521    PM1   8.15
670199  2023-05-19T09:00:00Z       521  PM2.5  11.33
670200  2023-05-19T09:00:00Z       521   PM10  26.28

[335613 rows x 4 columns]
measurement_521_drop_unused['Timestamp'] = pd.to_datetime(measurement_521_drop_unused['Timestamp'])
measurement_521_drop_unused['Date'] = [d.date() for d in measurement_521_drop_unused['Timestamp']]
measurement_521_drop_unused['Time'] = [d.time() for d in measurement_521_drop_unused['Timestamp']]
measurement_521_drop_unused.head(5)
Timestamp Location Name Value Date Time
54 2021-01-21 13:00:00+00:00 521 PM10 22.03 2021-01-21 13:00:00
55 2021-01-21 13:00:00+00:00 521 PM2.5 10.54 2021-01-21 13:00:00
56 2021-01-21 13:00:00+00:00 521 PM1 7.29 2021-01-21 13:00:00
59 2021-01-21 13:10:00+00:00 521 PM10 20.90 2021-01-21 13:10:00
60 2021-01-21 13:10:00+00:00 521 PM2.5 10.52 2021-01-21 13:10:00

Now that we have successfully split the Date and Time columns, it is time to convert the data so that each timestamp has a single row, with the different measurement names as separate columns.

df_521_measurements = pd.DataFrame(columns=['Date', 'Time', 'Location', 'PM10', 'PM2.5', 'PM1'])
date_time_groups = measurement_521_drop_unused.groupby(['Date', 'Time'])
groupings = [date_time_groups.get_group(x) for x in date_time_groups.groups]

print(groupings[:5])
print(groupings[-5:])
[                   Timestamp  Location   Name  Value        Date      Time
54 2021-01-21 13:00:00+00:00       521   PM10  22.03  2021-01-21  13:00:00
55 2021-01-21 13:00:00+00:00       521  PM2.5  10.54  2021-01-21  13:00:00
56 2021-01-21 13:00:00+00:00       521    PM1   7.29  2021-01-21  13:00:00,                    Timestamp  Location   Name  Value        Date      Time
59 2021-01-21 13:10:00+00:00       521   PM10  20.90  2021-01-21  13:10:00
60 2021-01-21 13:10:00+00:00       521  PM2.5  10.52  2021-01-21  13:10:00
61 2021-01-21 13:10:00+00:00       521    PM1   7.02  2021-01-21  13:10:00,                    Timestamp  Location   Name  Value        Date      Time
64 2021-01-21 13:20:00+00:00       521   PM10  21.30  2021-01-21  13:20:00
65 2021-01-21 13:20:00+00:00       521  PM2.5  10.93  2021-01-21  13:20:00
66 2021-01-21 13:20:00+00:00       521    PM1   6.66  2021-01-21  13:20:00,                    Timestamp  Location   Name  Value        Date      Time
69 2021-01-21 13:30:00+00:00       521   PM10  20.23  2021-01-21  13:30:00
70 2021-01-21 13:30:00+00:00       521  PM2.5  11.72  2021-01-21  13:30:00
71 2021-01-21 13:30:00+00:00       521    PM1   7.67  2021-01-21  13:30:00,                    Timestamp  Location   Name  Value        Date      Time
74 2021-01-21 13:40:00+00:00       521   PM10  19.25  2021-01-21  13:40:00
75 2021-01-21 13:40:00+00:00       521  PM2.5  11.37  2021-01-21  13:40:00
76 2021-01-21 13:40:00+00:00       521    PM1   6.90  2021-01-21  13:40:00]
[                       Timestamp  Location   Name  Value        Date      Time
670170 2023-05-19 08:20:01+00:00       521    PM1   8.67  2023-05-19  08:20:01
670171 2023-05-19 08:20:01+00:00       521  PM2.5  12.33  2023-05-19  08:20:01
670172 2023-05-19 08:20:01+00:00       521   PM10  19.85  2023-05-19  08:20:01,                        Timestamp  Location   Name  Value        Date      Time
670177 2023-05-19 08:30:00+00:00       521    PM1   9.25  2023-05-19  08:30:00
670178 2023-05-19 08:30:00+00:00       521  PM2.5  12.79  2023-05-19  08:30:00
670179 2023-05-19 08:30:00+00:00       521   PM10  29.53  2023-05-19  08:30:00,                        Timestamp  Location   Name  Value        Date      Time
670184 2023-05-19 08:40:00+00:00       521    PM1   8.97  2023-05-19  08:40:00
670185 2023-05-19 08:40:00+00:00       521  PM2.5  12.51  2023-05-19  08:40:00
670186 2023-05-19 08:40:00+00:00       521   PM10  21.14  2023-05-19  08:40:00,                        Timestamp  Location   Name  Value        Date      Time
670191 2023-05-19 08:50:00+00:00       521    PM1   9.54  2023-05-19  08:50:00
670192 2023-05-19 08:50:00+00:00       521  PM2.5  13.44  2023-05-19  08:50:00
670193 2023-05-19 08:50:00+00:00       521   PM10  19.53  2023-05-19  08:50:00,                        Timestamp  Location   Name  Value        Date      Time
670198 2023-05-19 09:00:00+00:00       521    PM1   8.15  2023-05-19  09:00:00
670199 2023-05-19 09:00:00+00:00       521  PM2.5  11.33  2023-05-19  09:00:00
670200 2023-05-19 09:00:00+00:00       521   PM10  26.28  2023-05-19  09:00:00]

It looks like the grouping is going according to plan, so now we want to use the info from each grouping to create a single row in which the different measurement names become separate columns.

for x in date_time_groups.groups:
    current_group = date_time_groups.get_group(x)
    location = current_group['Location'][current_group.index[0]]
    new_row = {'Date': [x[0]], 'Time': [x[1]], 'Location': [location], 'PM10': [np.nan], 'PM2.5': [np.nan],
               'PM1': [np.nan]}
    for row in current_group.values:
        new_row[row[2]] = [row[3]]
    df_521_measurements = pd.concat([df_521_measurements, pd.DataFrame(new_row)])
print(df_521_measurements)
          Date      Time Location   PM10  PM2.5   PM1
0   2021-01-21  13:00:00      521  22.03  10.54  7.29
0   2021-01-21  13:10:00      521  20.90  10.52  7.02
0   2021-01-21  13:20:00      521  21.30  10.93  6.66
0   2021-01-21  13:30:00      521  20.23  11.72  7.67
0   2021-01-21  13:40:00      521  19.25  11.37  6.90
..         ...       ...      ...    ...    ...   ...
0   2023-05-19  08:20:01      521  19.85  12.33  8.67
0   2023-05-19  08:30:00      521  29.53  12.79  9.25
0   2023-05-19  08:40:00      521  21.14  12.51  8.97
0   2023-05-19  08:50:00      521  19.53  13.44  9.54
0   2023-05-19  09:00:00      521  26.28  11.33  8.15

[111871 rows x 6 columns]

All the rows now have an index of 0, so we want to reset the index; the data is already in chronological order.

df_521_measurements = df_521_measurements.reset_index(drop=True)
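
As a side note, the same reshaping can also be done in a single step with a pivot table; a minimal sketch, assuming the measurement_521_drop_unused dataframe from above (df_521_pivot is just an illustrative name, the loop above remains the version used here):

# pivot the Name column into separate PM1/PM2.5/PM10 columns, one row per timestamp
df_521_pivot = (measurement_521_drop_unused
                .pivot_table(index=['Date', 'Time', 'Location'], columns='Name', values='Value')
                .reset_index())
print(df_521_pivot.head())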

Now we want to visualize a single day at a single location, to see any quick patterns.

df_521_measurements.Date[0]
datetime.date(2021, 1, 21)
df_lcd = df_521_measurements.loc[df_521_measurements['Date'] == pd.to_datetime('2023-02-20')]
print(df_lcd)
             Date      Time Location   PM10  PM2.5    PM1
99235  2023-02-20  00:00:00      521  20.17  15.49  13.95
99236  2023-02-20  00:10:00      521  20.61  15.38  13.11
99237  2023-02-20  00:20:00      521  22.29  15.65  12.96
99238  2023-02-20  00:30:00      521  21.63  15.73  13.35
99239  2023-02-20  00:40:01      521  23.97  16.24  14.77
...           ...       ...      ...    ...    ...    ...
99373  2023-02-20  23:10:01      521  39.96  29.24  25.05
99374  2023-02-20  23:20:00      521  52.30  32.50  25.66
99375  2023-02-20  23:30:00      521  42.31  31.38  24.80
99376  2023-02-20  23:40:00      521  42.03  30.57  24.89
99377  2023-02-20  23:50:00      521  39.30  30.58  23.82

[143 rows x 6 columns]
fig, ax = plt.subplots()

fig.canvas.draw()
plt.scatter(df_lcd.index, df_lcd['PM10'], label='PM10', alpha=0.5)
plt.scatter(df_lcd.index, df_lcd['PM2.5'], label='PM2.5', alpha=0.5)
plt.scatter(df_lcd.index, df_lcd['PM1'], label='PM1', alpha=0.5)

ax.set_xticks(df_lcd.index[::6])
ax.set_xticklabels(df_lcd['Time'][::6], rotation=90)

fig.suptitle('20-2-2023 Fine dust particles')
ax.set_xlabel('Time')
ax.set_ylabel('Particles (μg/m³)')
plt.legend()
plt.show()

[Figure: scatter plot of PM10, PM2.5 and PM1 (μg/m³) over time on 20-02-2023]

From this we can see that PM1 and PM2.5 are nearly identical, at least in comparison to PM10; if we look at the actual data we can see there are differences. We can also see that some data seems to be missing, as after 3:00:00 comes 4:10:00, so somewhere along the line a 10-minute measurement went missing. Another thing we can observe is that the increases and decreases of the different particle sizes happen at similar moments, so there are probably some trends to observe there.

measurement_521_drop_unused
Timestamp Location Name Value Date Time
54 2021-01-21 13:00:00+00:00 521 PM10 22.03 2021-01-21 13:00:00
55 2021-01-21 13:00:00+00:00 521 PM2.5 10.54 2021-01-21 13:00:00
56 2021-01-21 13:00:00+00:00 521 PM1 7.29 2021-01-21 13:00:00
59 2021-01-21 13:10:00+00:00 521 PM10 20.90 2021-01-21 13:10:00
60 2021-01-21 13:10:00+00:00 521 PM2.5 10.52 2021-01-21 13:10:00
... ... ... ... ... ... ...
670192 2023-05-19 08:50:00+00:00 521 PM2.5 13.44 2023-05-19 08:50:00
670193 2023-05-19 08:50:00+00:00 521 PM10 19.53 2023-05-19 08:50:00
670198 2023-05-19 09:00:00+00:00 521 PM1 8.15 2023-05-19 09:00:00
670199 2023-05-19 09:00:00+00:00 521 PM2.5 11.33 2023-05-19 09:00:00
670200 2023-05-19 09:00:00+00:00 521 PM10 26.28 2023-05-19 09:00:00

335613 rows × 6 columns

df_521_PM10 = measurement_521_drop_unused.loc[measurement_521_drop_unused['Name'] == "PM10"]
df_521_PM1 = measurement_521_drop_unused.loc[measurement_521_drop_unused['Name'] == "PM1"]
df_521_PM25 = measurement_521_drop_unused.loc[measurement_521_drop_unused['Name'] == "PM2.5"]
df_521_PM25 = df_521_PM25.set_index('Timestamp')
df_521_PM1 = df_521_PM1.set_index('Timestamp')
df_521_PM10 = df_521_PM10.set_index('Timestamp')
fig, axs = plt.subplots(3, sharex=True, figsize=(15, 10))
fig.suptitle('The different sizes of particles individual values on location 521')
axs[0].plot(df_521_PM10['Value'], 'tab:blue')
axs[0].title.set_text('PM10')
axs[1].plot(df_521_PM25['Value'], 'tab:orange')
axs[1].title.set_text('PM2.5')
axs[2].plot(df_521_PM1['Value'], 'tab:green')
axs[2].title.set_text('PM1')

plt.tight_layout()

plt.subplots_adjust(hspace=0.75)
plt.xticks(rotation=90)
plt.show()

[Figure: PM10, PM2.5 and PM1 values over the full measurement period at location 521, three stacked line plots]

As we can see, even over a longer period of time there are definitely relations between the different particle sizes, but also individual behaviour. We can also very clearly see two moments where the data was not recorded for a long period of time. In PM10 we can see a small outlier spike at the end of 2021, in PM1 there is an outlier halfway through 2022, and PM2.5 doesn’t seem to have any notable outliers.
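
To put a rough number on that relation, a minimal sketch of the pairwise correlations between the particle sizes, assuming the df_521_measurements dataframe from above:

# correlation matrix between the three particle size columns
print(df_521_measurements[['PM1', 'PM2.5', 'PM10']].astype(float).corr())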

# Rolling statistics over a window of 7 samples (roughly 70 minutes of 10-minute measurements)
rolling_mean = df_521_PM10['Value'].rolling(7).mean()
rolling_std = df_521_PM10['Value'].rolling(7).std()
plt.figure(figsize=(15, 10))
plt.plot(df_521_PM10['Value'], color="blue", label="Original PM10 Data")
plt.plot(rolling_mean, color="red", label="Rolling Mean PM10")
plt.plot(rolling_std, color="black", label="Rolling Standard Deviation in PM10")
plt.title("PM10 Time Series, Rolling Mean, Standard Deviation")
plt.legend(loc="best")
plt.xticks(rotation=90)
plt.show()

[Figure: PM10 time series with 7-sample rolling mean and rolling standard deviation]

from statsmodels.tsa.stattools import adfuller

adft = adfuller(df_521_PM10['Value'], autolag="AIC")

output_df = pd.DataFrame({"Values": [adft[0], adft[1], adft[2], adft[3], adft[4]['1%'], adft[4]['5%'], adft[4]['10%']],
                          "Metric": ["Test Statistics", "p-value", "No. of lags used", "Number of observations used",
                                     "critical value (1%)", "critical value (5%)", "critical value (10%)"]})
print(output_df)
         Values                       Metric
0 -1.762838e+01              Test Statistics
1  3.808919e-30                      p-value
2  6.500000e+01             No. of lags used
3  1.118050e+05  Number of observations used
4 -3.430408e+00          critical value (1%)
5 -2.861566e+00          critical value (5%)
6 -2.566784e+00         critical value (10%)

The most important part of this is the p-value: the probability of obtaining a test statistic at least as extreme as the one observed, under the null hypothesis of the ADF test that the series has a unit root (is non-stationary). We have a very small p-value, so we can reject that null hypothesis and treat the series as stationary, which is good.
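
A minimal sketch of that decision rule, assuming the adft result from above and the usual 0.05 significance level:

p_value = adft[1]
if p_value < 0.05:
    # small p-value: reject the unit-root null hypothesis, the series can be treated as stationary
    print('p =', p_value, '-> series looks stationary')
else:
    print('p =', p_value, '-> cannot reject the unit-root null hypothesis')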

from statsmodels.tsa.seasonal import STL
# plt.figure(figsize=(60, 20))
# Resample to a regular 10-minute grid and forward-fill gaps so STL gets an evenly spaced series
data = df_521_PM10.resample('10min').mean().ffill()
# period=144: one day of 10-minute samples (daily seasonality)
res = STL(data['Value'], period=144).fit()
res.plot()
plt.gcf().set_size_inches(20,10)
plt.show()

[Figure: STL decomposition of PM10 with daily seasonality (period=144): observed, trend, seasonal and residual components]
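
For reference, the period values used here and below correspond to the number of 10-minute samples per seasonal cycle; a quick check of that arithmetic:

samples_per_hour = 6                     # one measurement every 10 minutes
samples_per_day = samples_per_hour * 24  # 144 -> daily seasonality
samples_per_week = samples_per_day * 7   # 1008 -> weekly seasonality
print(samples_per_day, samples_per_week)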

# plt.figure(figsize=(30, 20))
res2 = STL(data['Value'], period=1008).fit()
res2.plot()
plt.gcf().set_size_inches(20,10)
plt.show()

[Figure: STL decomposition of PM10 with weekly seasonality (period=1008)]

# plt.figure(figsize=(30, 20))
res3 = STL(data['Value'], period=30240).fit()
res3.plot()
plt.gcf().set_size_inches(20,10)
plt.show()

[Figure: STL decomposition of PM10 with period=30240]

Test with weekly seasonality over 1 month (April 2021).

subset1 = data.loc['2021-04-01':'2021-04-30']
res4 = STL(subset1['Value'], period=1008).fit()
res4.plot()
plt.gcf().set_size_inches(20,10)
plt.show()

[Figure: STL decomposition of the April 2021 subset with weekly seasonality (period=1008)]

Combining all the data

Now that we have all the data, and we know what a single location looks like and how we need to alter the data to use it, we can combine all the data into a single dataset to work from.

location_data = pd.DataFrame(columns=['location_id', 'Lat', 'Lon'])
measurement_full = pd.DataFrame(columns=['Date', 'Time', 'Location', 'PM10', 'PM2.5', 'PM1'])

for dataset in file_list:
    current_df = pd.read_csv(file_path + dataset)

    # get location: take the most frequently occurring coordinate pair as the sensor's location
    lon = current_df.loc[current_df['Name'] == 'Lon']
    lat = current_df.loc[current_df['Name'] == 'Lat']
    loc_data = {'location_id': [current_df['Location'][0]],
                'Lat': [lat.groupby('Value')['Value'].count().sort_values(ascending=False).head(1).index.values[0]],
                'Lon': [lon.groupby('Value')['Value'].count().sort_values(ascending=False).head(1).index.values[0]]}
    location_data = pd.concat([location_data, pd.DataFrame(loc_data)])

    # get measurements: keep only the PM values and split the timestamp into date and time
    df_measurements_data = current_df.loc[:, ['Timestamp', 'Location', 'Name', 'Value']]
    df_measurements_drop_unused = df_measurements_data[
        (df_measurements_data.Name != 'Lat') & (df_measurements_data.Name != 'Lon') & (
                df_measurements_data.Name != 'NO2') & (df_measurements_data.Name != 'UFP')].copy()
    df_measurements_drop_unused['Timestamp'] = pd.to_datetime(df_measurements_drop_unused['Timestamp'])
    df_measurements_drop_unused['Date'] = [d.date() for d in df_measurements_drop_unused['Timestamp']]
    df_measurements_drop_unused['Time'] = [d.time() for d in df_measurements_drop_unused['Timestamp']]
    df_measurements = pd.DataFrame(columns=['Date', 'Time', 'Location', 'PM10', 'PM2.5', 'PM1'])

    date_time_groups = df_measurements_drop_unused.groupby(['Date', 'Time'])
    for x in date_time_groups.groups:
        current_group = date_time_groups.get_group(x)
        location = current_group['Location'][current_group.index[0]]
        new_row = {'Date': [x[0]], 'Time': [x[1]], 'Location': [location], 'PM10': [np.nan], 'PM2.5': [np.nan],
                   'PM1': [np.nan]}
        for row in current_group.values:
            new_row[row[2]] = [row[3]]
        df_measurements = pd.concat([df_measurements, pd.DataFrame(new_row)])

    measurement_full = pd.concat([measurement_full, df_measurements])
    print(dataset)

Everything is combined, so let’s take a look at whether everything is like we want it to be.

It seems like only the index is still a problem, which was to be expected. So if we reset the index, we can save the data into a single file, which we can later just load again to continue preparing and analysing our data.

measurement_full = measurement_full.reset_index(drop=True)
measurement_full.to_csv(
    "C:/Users/Public/Documents/Github/Internship-S5/cdt-a-i-r-zoom-lens-back-end/csv_files/full_measurement.csv",
    index=False)
len(measurement_full)

We also need to save the location data that we separated from our raw data.

location_data.to_csv(
    "C:/Users/Public/Documents/Github/Internship-S5/cdt-a-i-r-zoom-lens-back-end/csv_files/location_data.csv",
    index=False)

Seeing as we have now saved these files in a separate directory from the raw data directory, we also want a new file path to work with.

Analysing the combined data

Now let’s get our dataframes again and start analysing them.

df_file_path = "C:/Users/Public/Documents/Github/Internship-S5/cdt-a-i-r-zoom-lens-back-end/csv_files/"
loc_df = pd.read_csv(df_file_path + 'location_data.csv')
measurement_df = pd.read_csv(df_file_path + 'full_measurement.csv')
print(loc_df.sample(10))
print(measurement_df.sample(10))
    location_id      Lat     Lon
18          542  51.5108  5.3606
20          544  51.4379  5.3582
26          553  51.4915  5.4392
41          570  51.5835  5.5480
31          560  51.4414  5.4714
0           521  51.3468  5.3887
15          537  52.6558  4.7969
29          557  51.5696  5.3113
9           531  51.4063  5.4574
17          540  51.4797  5.4751
               Date      Time  Location   PM10  PM2.5    PM1
3883319  2023-02-06  06:50:01       560  11.67   8.65   7.86
4226370  2022-11-18  04:46:00       563   7.50   3.81   2.93
1319576  2022-05-27  07:00:01       532  11.91   7.19   4.55
5072896  2022-01-11  05:40:00       571  31.12  26.72  24.59
1790736  2022-08-17  08:50:00       536  23.60  21.00  19.39
895283   2021-05-26  03:10:00       529  15.61   7.72   5.06
4916557  2022-11-29  08:10:00       569  30.32  30.17  26.16
5008998  2022-08-22  21:40:00       570   7.80   5.04   3.45
1865333  2021-06-24  01:30:00       537   9.88   7.38   6.62
4726573  2023-05-08  00:40:00       567  15.60  11.00  10.30

When we were scraping the data, we noticed that location 575 took much longer than the rest, as it was getting data from 2015. However, let’s see how much of that data is still relevant to us.

longest_location = measurement_df[(measurement_df.Location == 575)]
print(longest_location.head(10))
               Date      Time  Location   PM10  PM2.5    PM1
5390315  2019-04-18  11:50:00       575  19.54  11.46   9.90
5390316  2019-04-18  12:00:00       575  19.95  10.91   9.86
5390317  2019-04-18  12:10:00       575  28.21  11.53  10.39
5390318  2019-04-18  12:20:00       575  14.92  10.62   9.07
5390319  2019-04-18  12:40:00       575  18.03   9.98   8.84
5390320  2019-04-18  12:50:00       575  13.04   8.44   6.86
5390321  2019-04-18  13:20:00       575  14.63   7.20   6.47
5390322  2019-04-18  13:30:00       575  14.60   6.50   5.67
5390323  2019-04-18  13:40:00       575  12.77   6.90   5.94
5390324  2019-04-18  13:50:00       575  16.92   6.87   6.03

It looks like, though collection at this location started in 2015, the data we need was not recorded until April 2019. Let’s check the other locations as well, so we know the distribution of when the data we actually use started being collected.

starting_dates = []
for location in measurement_df.Location.unique():
    location_df = measurement_df[(measurement_df.Location == location)]['Date']
    starting_dates.append(location_df.head(1).values[0])

print(starting_dates)
['2021-01-21', '2020-10-19', '2020-12-02', '2020-12-02', '2020-09-14', '2020-09-14', '2021-01-21', '2020-10-19', '2019-05-16', '2021-03-30', '2020-12-02', '2020-10-19', '2020-12-02', '2020-10-19', '2020-09-14', '2020-10-19', '2020-12-02', '2020-10-19', '2020-12-02', '2020-09-14', '2020-09-14', '2020-12-02', '2020-10-19', '2021-01-20', '2020-11-09', '2020-10-19', '2020-12-02', '2021-01-21', '2020-12-02', '2021-01-22', '2020-10-19', '2020-10-19', '2021-01-21', '2020-09-14', '2021-01-22', '2020-09-14', '2021-01-21', '2020-09-14', '2021-01-21', '2021-01-21', '2021-05-01', '2021-03-30', '2021-03-30', '2021-03-26', '2021-05-01', '2019-04-18', '2019-04-18', '2021-05-01', '2021-05-01']

Now that we have the first date of relevant data for each location, let’s see how it is distributed.

from collections import Counter, OrderedDict

c = Counter(starting_dates)
ordered_data = OrderedDict(sorted(c.items(), key=lambda x: datetime.strptime(x[0], '%Y-%m-%d'), reverse=False))

plt.bar(ordered_data.keys(), ordered_data.values())
plt.suptitle("Amount of locations that started recording relevant data on that date")
plt.xticks(rotation=90)
plt.xlabel("Start date")
plt.ylabel("Amount of locations")
plt.show()

[Figure: bar chart of the number of locations per recording start date]

As we can see, two locations started recording in April 2019, and these were the first; the earliest records are from April 18th 2019. Most locations started recording at the end of 2020, somewhere between September 14th and December 2nd. Quite a few locations started recording at the start of 2021, and the latest a location started recording was May 2021. Something to note here is that only three sensors (two from April 2019 and one from May 2019) were placed before covid, while the rest were placed during it. This means that outliers in the observations don’t have a ‘base level’ from before covid to compare against, which needs to be taken into account when analysing the results.

Something else that we need to verify is whether everything is measured in the same units. We know that at location 521 it is, but we need to verify whether this is the case for all the locations.

units = []
for file in file_list:
    df_temp = pd.read_csv(file_path + file)
    unique_units = df_temp.groupby('Name')['Unit'].unique()
    units.append(unique_units)
unit_df = pd.DataFrame()
for unit in range(len(units)):
    for row in range(len(units[unit])):
        units[unit][row] = units[unit][row][0]
    unit_df = pd.concat([unit_df, units[unit].to_frame().T], ignore_index=True)

unique_unit = unit_df.drop_duplicates()
print(unique_unit)
Name Lat Lon        NO2    PM1   PM10  PM2.5             UFP
0      °   °  μg/m³\r\n  μg/m³  μg/m³  μg/m³  counts/cm³\r\n
6      °   °  μg/m³\r\n  μg/m³  μg/m³  μg/m³             NaN

It looks like the only difference in the unit values is at locations where the UFP unit is NaN. Seeing as we are not doing anything with UFP, this doesn’t matter, so we have now verified that the same units of measurement are used at every single location for the measurements we use. Now that we have verified everything we needed, we can take a definitive look at the data quality.

Data Quality

Usually you would check the data quality before you prepare the data, so you can adjust accordingly. However, in our case collecting and preparing the data were intertwined: the way of collecting was adjusted so that less preparation was needed. That is why the data quality is checked after the initial preparation, when the data is not yet completely prepared but prepared enough that we can work with it.

The aspects we will check are the following:

  • Validity - check whether there is always a number where we expect one, and whether the types and values make sense.
  • Accuracy - we cannot definitively say the data is accurate (for example, the reported location occasionally jumps, so it is prone to mistakes), but we don't need 100% accuracy; instead we explain why we trust this source and compare a few samples with SamenMeten.
  • Consistency - check that the different locations use the same units.
  • Completeness - how much data is missing.
  • Uniqueness - the data is not unique, as SamenMeten and similar portals also cover it, but we describe the differences.
  • Timeliness - check how long a request to the API for the current data of one location takes right now.

Validity

Let’s start by looking at .info() to see whether the types match what we expect.

measurement_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5739479 entries, 0 to 5739478
Data columns (total 6 columns):
 #   Column    Dtype  
---  ------    -----  
 0   Date      object 
 1   Time      object 
 2   Location  int64  
 3   PM10      float64
 4   PM2.5     float64
 5   PM1       float64
dtypes: float64(3), int64(1), object(2)
memory usage: 262.7+ MB

It looks like the types match what we are expecting: Date and Time are objects, Location is an int as it is an ID, and PM10, PM2.5 and PM1 are all floats as they are measurements. However, let’s check whether the measurements are all positive, as a negative amount of fine dust does not exist.

measurement_df.describe()
Location PM10 PM2.5 PM1
count 5.739479e+06 5.735059e+06 5.739478e+06 5.739479e+06
mean 5.485570e+02 1.835883e+01 1.273283e+01 1.031460e+01
std 1.692974e+01 1.443371e+01 1.396131e+01 1.273105e+01
min 5.210000e+02 -0.000000e+00 -3.200000e+00 -3.590000e+00
25% 5.330000e+02 9.990000e+00 5.500000e+00 4.120000e+00
50% 5.470000e+02 1.518000e+01 9.330000e+00 7.140000e+00
75% 5.640000e+02 2.360000e+01 1.652000e+01 1.352000e+01
max 5.770000e+02 1.196890e+03 1.358270e+03 1.355330e+03

As we can see, there are some negative values. Not many, as the 25% quartile is already positive, but they do exist. Let’s see how often they occur.
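
As a quick overall check, a minimal sketch that counts the negative entries per column directly, assuming the measurement_df loaded above:

# number of negative values per particle size column
print((measurement_df[['PM10', 'PM2.5', 'PM1']] < 0).sum())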

negative_percentages = pd.DataFrame()

for location in measurement_df.Location.unique():
    loc_data = measurement_df[(measurement_df.Location == location)]
    neg_data = loc_data[(loc_data.PM10 < 0) | (loc_data['PM2.5'] < 0) | (loc_data.PM1 < 0)]
    neg_df_data = {'loc_id': [location], 'neg_amounts': [len(neg_data)], 'total_amounts': [len(loc_data)],
                   'percentage_negative': [(len(neg_data) / len(loc_data)) * 100]}
    negative_percentages = pd.concat([negative_percentages, pd.DataFrame(neg_df_data)])

print(negative_percentages.sort_values('percentage_negative', ascending=False))
   loc_id  neg_amounts  total_amounts  percentage_negative
0     566         5662         120626             4.693847
0     537         3388         130891             2.588413
0     573          805          43670             1.843371
0     544         2304         127384             1.808704
0     553         2061         115215             1.788830
0     560         1689         118812             1.421574
0     572          672          74668             0.899984
0     562         1037         130176             0.796614
0     540          856         128359             0.666880
0     551          691         118343             0.583896
0     536          640         130430             0.490685
0     522          546         117745             0.463714
0     526          633         139222             0.454670
0     527          546         134782             0.405099
0     568          423         109085             0.387771
0     558          425         118283             0.359308
0     546          424         121509             0.348945
0     529          369         113568             0.324915
0     569          327         104493             0.312940
0     524          363         123710             0.293428
0     570          287          98287             0.292002
0     530          528         183580             0.287613
0     535          291         106338             0.273656
0     575          338         147377             0.229344
0     574          251         136859             0.183400
0     545          171         125742             0.135993
0     525          162         121321             0.133530
0     556           98         112554             0.087069
0     571           73          95844             0.076165
0     533          100         132893             0.075249
0     549           61         116901             0.052181
0     528           54         116834             0.046219
0     534           55         124645             0.044125
0     543           58         133510             0.043442
0     567           46         110198             0.041743
0     576            8         102129             0.007833
0     577            2          99658             0.002007
0     547            1         112618             0.000888
0     565            0         118270             0.000000
0     564            0         125884             0.000000
0     563            0         114562             0.000000
0     561            0         114681             0.000000
0     557            0         116113             0.000000
0     554            0         116303             0.000000
0     542            0         125188             0.000000
0     539            0         125269             0.000000
0     532            0          78550             0.000000
0     531            0          94529             0.000000
0     521            0         111871             0.000000
plt.hist(negative_percentages.percentage_negative)
plt.suptitle("Distribution of percentage of negative values")
plt.xlabel("% of negative values")
plt.ylabel("Distribution")
plt.show()

[Figure: histogram of the percentage of negative values per location]

As we can see there are a few negative values, but most locations have less than 1% or even no negative values at all. There are a few that are a bit higher than 1%, and one is almost 5%. This is not extremely concerning, but it does hurt the validity a little. Overall, the combination of correct types and only a small share of negative values means the Validity of the data is decent: quite good, but with the issue of the negative values.

Accuracy

We cannot definitively say the data is accurate. Accuracy here is not something you can simply verify, like checking whether an animal labelled as a dog really is a dog. These are measurements, and measurements can be wrong; especially with something like fine dust, where the concentration a meter away from the sensor can already be very different and constantly changes with the wind, we can only determine how much trust we put in the idea that the data is accurate. For this we will look at a few aspects:

  • Who is collecting it
  • Who is using it
  • Samenmeten

The data is collected by ODZOB, the Omgevingsdienst Zuidoost-Brabant (Environment Service South-East Brabant). They execute environmental and construction tasks mandated by law, such as preventing noise disturbance, limiting health risks and tackling soil pollution {SOURCE, https://odzob.nl/expertises}. This means that the results of this organisation are trusted to be accurate enough for tasks mandated by law, which is also shown by the fact that they work with multiple municipalities, and the data is used to inform local lawmakers. SamenMeten, the portal of the RIVM, the national institute for public health and the environment, also shares the data of these sensors on their data portal. So if local lawmakers use the data, the national institute shares it, and we know it is collected by experts, we can say that the Accuracy is acceptable.

Consistency

We know from checking the units just now that the units used are consistent, so there is definitely a shared definition between the different locations. We can therefore say the Consistency of the data is good.

Completeness

We know from .describe() that there are hardly any NaN values. However, this doesn’t mean that there isn’t any missing data; it just means that whenever data was sent through, it was almost always complete. When we were previously visualizing the single-location data, we saw that there was a jump of 10 minutes, which means that somewhere along the line a measurement was not collected. That isn’t too bad on its own, but that was only one gap on the last collected day at one location, so let’s see how many timestamps are missing overall.

measurement_df.head()
Date Time Location PM10 PM2.5 PM1
0 2021-01-21 13:00:00 521 22.03 10.54 7.29
1 2021-01-21 13:10:00 521 20.90 10.52 7.02
2 2021-01-21 13:20:00 521 21.30 10.93 6.66
3 2021-01-21 13:30:00 521 20.23 11.72 7.67
4 2021-01-21 13:40:00 521 19.25 11.37 6.90
import time

sets_of_10 = pd.DataFrame()

fmt = '%Y-%m-%d %H:%M:%S'
for location in measurement_df.Location.unique():
    loc_data = measurement_df[(measurement_df.Location == location)]
    d1 = datetime.strptime(loc_data.head(1).Date.values[0] + ' ' + loc_data.head(1).Time.values[0], fmt)
    d2 = datetime.strptime(loc_data.tail(1).Date.values[0] + ' ' + loc_data.tail(1).Time.values[0], fmt)

    d1_ts = time.mktime(d1.timetuple())
    d2_ts = time.mktime(d2.timetuple())

    supposed_sets = int(int(d2_ts - d1_ts) / 600)  # expected number of 10-minute intervals between the first and last timestamp
    sets_data = {'loc_id': [location], 'supposed_sets': [supposed_sets], 'actual_sets': [len(loc_data)],
                 'percentage_missing': [100 - (
                         (len(loc_data) / supposed_sets) * 100)]}
    sets_of_10 = pd.concat([sets_of_10, pd.DataFrame(sets_data)])

print(sets_of_10)
   loc_id  supposed_sets  actual_sets  percentage_missing
0     521         122082       111871            8.364050
0     522         130448       117745            9.737980
0     524         129296       123710            4.320319
0     525         129296       121321            6.168018
0     526         140660       139222            1.022323
0     527         140661       134782            4.179552
0     528         122085       116834            4.301102
0     529         135641       113568           16.273103
0     530         210787       183580           12.907342
0     531         112308        94529           15.830573
0     532          94121        78550           16.543598
0     533         135639       132893            2.024491
0     534         129295       124645            3.596427
0     535         135640       106338           21.602772
0     536         140660       130430            7.272857
0     537         135641       130891            3.501891
0     539         129296       125269            3.114559
0     540         135641       128359            5.368583
0     542         129294       125188            3.175708
0     543         140659       133510            5.082504
0     544         140659       127384            9.437718
0     545         129295       125742            2.747979
0     546         135640       121509           10.418018
0     547         122214       112618            7.851801
0     549         131090       116901           10.823861
0     551         135635       118343           12.748922
0     553         129281       115215           10.880176
0     554         122083       116303            4.734484
0     556         129296       112554           12.948583
0     557         121926       116113            4.767646
0     558         135630       118283           12.789943
0     560         135641       118812           12.407016
0     561         122084       114681            6.063858
0     562         140661       130176            7.454092
0     563         121923       114562            6.037417
0     564         140661       125884           10.505400
0     565         122085       118270            3.124872
0     566         140662       120626           14.244074
0     567         122086       110198            9.737398
0     568         122086       109085           10.649051
0     569         107770       104493            3.040735
0     570         112312        98287           12.487535
0     571         112312        95844           14.662725
0     572         112863        74668           33.841915
0     573          49251        43670           11.331750
0     574         214835       136859           36.295762
0     575         214838       147377           31.400869
0     576         107775       102129            5.238692
0     577         107777        99658            7.533147

As we can see, there are definitely some missing timestamps. Now let’s see how many.

plt.hist(sets_of_10.percentage_missing, bins=10)
plt.suptitle('Distribution of Percentage of Missing Timestamps')
plt.xlabel('Percentage missing')
plt.ylabel('Amount of locations')
plt.show()

[Figure: histogram of the percentage of missing timestamps per location]

As we can see, most locations have around 10% missing data or less. There are a few with more, but for 45 of the locations there is at most about 20% data loss over years of data. The 4 remaining locations have a larger amount of missing data, between 20 and 40%. However, that might be because data collection didn’t go as well in the beginning. Let’s see which locations have this much missing data and whether they are the locations that have been around the longest.

highest_missing = sets_of_10.sort_values('percentage_missing', ascending=False)[:4]
for location in highest_missing.loc_id:
    loc_data = measurement_df[(measurement_df.Location == location)]
print(loc_data.head(1))
               Date      Time  Location  PM10  PM2.5   PM1
1593250  2020-10-19  10:29:55       535  1.96  -0.79 -0.79
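
Note that the loop above only prints the first row of the last of those four locations; a minimal sketch that prints the first recorded date for each of them, assuming highest_missing and measurement_df from above:

# first recorded date per location with the most missing timestamps
for location in highest_missing.loc_id:
    loc_data = measurement_df[measurement_df.Location == location]
    print(location, loc_data.head(1).Date.values[0])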

Uniqueness

The data is not very unique. Similar fine dust measurements around Eindhoven can be seen on the SamenMeten platform of the RIVM. On the other hand, there aren’t many datasets like it, with the exception of the wide-scale similar measurements that SamenMeten shows. This, in combination with the difference in measurement techniques, makes the Uniqueness of the data moderate.

Timeliness

It took quite a while to get the data from the API, so for someone who wants to collect all the data from scratch the timeliness is poor. However, now that we have the historical data, the only factor that matters is how quickly the API can return a small amount of data, namely the current measurements.

import requests as request
start_time = time.time()
res = request.get("https://api.dustmonitoring.nl/v1/project(30)/observations?$expand=data&$filter=location%20eq%20521")
data = res.json()
print("--- %s seconds ---" % (time.time() - start_time))
--- 0.5507380962371826 seconds ---

It looks like it doesn’t take long to get data from the API if you aren’t requesting it in bulk. This is fortunate, as it means that the Timeliness of the data (once the historical data has already been collected) is excellent.

Compare with report

Comparing with ILM2 report

Subdivide the known location IDs into airport, city and outdoor groups.

airport_loc = ['I02', 'I14', 'I25']
outdoors_loc = ['I18', 'I45', 'I42', 'I43', 'I33', 'I03', 'I46', 'I47', 'I58', 'I13', 'I56', 'I10', 'I05', 'I41', 'I16', 'I44'] # 4 extra? I55 I54
city_loc = ['I51', 'I49', 'I52', 'I48', 'I04', 'I32', 'I37', 'I23', 'I07', 'I22', 'I28', 'I40', 'I08', 'I30', 'I19', 'I36', 'I39', 'I29', 'I12', 'I09', 'I11', 'I24'] # 'I57'
divided_loc_df = pd.DataFrame()

for airport in airport_loc:
    loc = airboxes[airboxes['Name'] == airport]
    dict_loc = {'ID': loc['ID'].values, 'Name': loc['Name'].values, 'Group': ['Airport']}
    divided_loc_df = pd.concat([divided_loc_df, pd.DataFrame(dict_loc)])

for outdoor in outdoors_loc:
    loc = airboxes[airboxes['Name'] == outdoor]
    dict_loc = {'ID': loc['ID'].values, 'Name': loc['Name'].values, 'Group': ['Outdoors']}
    divided_loc_df = pd.concat([divided_loc_df, pd.DataFrame(dict_loc)])

for city in city_loc:
    loc = airboxes[airboxes['Name'] == city]
    dict_loc = {'ID': loc['ID'].values, 'Name': loc['Name'].values, 'Group': ['City']}
    divided_loc_df = pd.concat([divided_loc_df, pd.DataFrame(dict_loc)])

print(divided_loc_df)
    ID Name     Group
0  521  I02   Airport
0  536  I14   Airport
0  525  I25   Airport
0  532  I18  Outdoors
0  564  I45  Outdoors
0  561  I42  Outdoors
0  562  I43  Outdoors
0  558  I33  Outdoors
0  542  I03  Outdoors
0  565  I46  Outdoors
0  566  I47  Outdoors
0  577  I58  Outdoors
0  542  I13  Outdoors
0  575  I56  Outdoors
0  554  I10  Outdoors
0  553  I05  Outdoors
0  560  I41  Outdoors
0  556  I16  Outdoors
0  563  I44  Outdoors
0  570  I51      City
0  568  I49      City
0  571  I52      City
0  567  I48      City
0  543  I04      City
0  540  I32      City
0  528  I37      City
0  524  I23      City
0  545  I07      City
0  549  I22      City
0  526  I28      City
0  530  I40      City
0  532  I08      City
0  551  I30      City
0  547  I19      City
0  527  I36      City
0  529  I39      City
0  539  I29      City
0  535  I12      City
0  533  I09      City
0  534  I11      City
0  537  I24      City

Expected: 99% uptime.

import time

df_2021_observation = pd.DataFrame()
df_2021 = pd.DataFrame()

fmt = '%Y-%m-%d %H:%M:%S'
for location in measurement_df.Location.unique():
    loc_data = measurement_df[(measurement_df.Location == location)]
    d1 = datetime.strptime(loc_data.head(1).Date.values[0] + ' ' + loc_data.head(1).Time.values[0], fmt)
    if d1.date().year < 2021:
        d1 = datetime.strptime('2021-01-01 00:00:00', fmt)
    # d2 = datetime.strptime(loc_data.tail(1).Date.values[0] + ' ' + loc_data.tail(1).Time.values[0], fmt)
    d2 = datetime.strptime('2021-12-31 23:59:59', fmt)

    d1_ts = time.mktime(d1.timetuple())
    d2_ts = time.mktime(d2.timetuple())

    mask = (loc_data['Date'] > str(d1.date())) & (loc_data['Date'] <= str(d2.date()))
    loc_2021_df = loc_data.loc[mask]
    df_2021 = pd.concat([df_2021, loc_2021_df])

    supposed_sets = int(int(d2_ts - d1_ts) / 600)
    sets_data = {'loc_id': [location], 'supposed_sets': [supposed_sets], 'actual_sets': [len(loc_2021_df)],
                 'percentage_missing': [100 - (
                         (len(loc_2021_df) / supposed_sets) * 100)]}
    df_2021_observation = pd.concat([df_2021_observation, pd.DataFrame(sets_data)])

print(df_2021_observation)
   loc_id  supposed_sets  actual_sets  percentage_missing
0     521          49601        44812            9.655047
0     522          52559        47230           10.139082
0     524          52559        51913            1.229095
0     525          52559        51315            2.366864
0     526          52559        51836            1.375597
0     527          52559        52039            0.989364
0     528          49601        48689            1.838673
0     529          52559        52033            1.000780
0     530          52559        51989            1.084496
0     531          39827        29089           26.961609
0     532          52559        42347           19.429593
0     533          52559        52034            0.998877
0     534          52559        51977            1.107327
0     535          52559        52015            1.035027
0     536          52559        52039            0.989364
0     537          52559        51886            1.280466
0     539          52559        49667            5.502388
0     540          52559        50091            4.695675
0     542          52559        48943            6.879887
0     543          52559        51194            2.597081
0     544          52559        52033            1.000780
0     545          52559        50657            3.618790
0     546          52559        39450           24.941494
0     547          49731        48451            2.573847
0     549          52559        49382            6.044636
0     551          52559        52035            0.996975
0     553          52559        51170            2.642744
0     554          49601        44265           10.757848
0     556          52559        50650            3.632109
0     557          49443        44112           10.782113
0     558          52559        46832           10.896326
0     560          52559        52026            1.014098
0     561          49601        42667           13.979557
0     562          52559        51059            2.853936
0     563          49440        42496           14.045307
0     564          52559        51899            1.255732
0     565          49601        46377            6.499869
0     566          52559        41651           20.753820
0     567          49601        45314            8.642971
0     568          49601        45303            8.665148
0     569          35285        32408            8.153606
0     570          39827        34514           13.340196
0     571          39827        32084           19.441585
0     572          40378        24960           38.184160
0     573          35285        29656           15.952955
0     574          52559        21573           58.954699
0     575          52559        26243           50.069446
0     576          35285        34958            0.926739
0     577          35285        27579           21.839308
plt.hist(df_2021_observation.percentage_missing, bins=50)
plt.suptitle('Distribution of Percentage of Missing Timestamps in 2021')
plt.xlabel('Percentage missing')
plt.ylabel('Amount of locations')
plt.show()

png

Result: about half of the locations come close to this 99% uptime.
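
As a rough check of that statement, the sketch below (using the df_2021_observation frame from above; the 2% threshold for “close to 99% uptime” is my own, loose interpretation) counts how many locations qualify.

close_threshold = 2.0  # percent of missing timestamps still counted as 'close to' 99% uptime
close_locations = df_2021_observation[df_2021_observation['percentage_missing'] <= close_threshold]
print(str(len(close_locations)) + ' of ' + str(len(df_2021_observation)) + ' locations have at most '
      + str(close_threshold) + '% missing timestamps in 2021')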

Yearly average per particle size

ilm2_df = df_2021[(df_2021.Location == 521)]
print('For ILM2:')
print('PM10 year mean: ' + str(ilm2_df['PM10'].mean()))
print('PM2.5 year mean: ' + str(ilm2_df['PM2.5'].mean()))
print('PM1 year mean: ' + str(ilm2_df['PM1'].mean()))

print('\nOverall:')
print('PM10 year mean: ' + str(df_2021['PM10'].mean()))
print('PM2.5 year mean: ' + str(df_2021['PM2.5'].mean()))
print('PM1 year mean: ' + str(df_2021['PM1'].mean()))
For ILM2:
PM10 year mean: 20.534862313665986
PM2.5 year mean: 14.855719226992772
PM1 year mean: 12.831332009283232

Overall:
PM10 year mean: 18.318628310984085
PM2.5 year mean: 12.777047338836285
PM1 year mean: 10.463851545963406

Expected ILM: PM10: 24.5, PM2.5: 18.7

Result ILM: PM10: 20.5, PM2.5: 14.8

Expected Total: PM10: 11.5, PM2.5: 13.9, PM1: 19.3

Result Total: PM10: 18.3, PM2.5: 12.8, PM1: 10.5
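
To make the comparison a bit more explicit, the sketch below (report values copied from the “Expected ILM” line above, measured values taken from ilm2_df as used in the cell above) prints the relative deviation for PM10 and PM2.5.

expected_ilm = {'PM10': 24.5, 'PM2.5': 18.7}  # values from the report
for pollutant, expected in expected_ilm.items():
    measured = ilm2_df[pollutant].mean()
    deviation = (measured - expected) / expected * 100
    print('%s: measured %.1f vs expected %.1f (%.0f%% difference)' % (pollutant, measured, expected, deviation))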

df_2021
               Date      Time  Location   PM10  PM2.5    PM1
66       2021-01-22  00:00:00       521   6.42   3.42   2.77
67       2021-01-22  00:10:01       521   6.53   2.67   2.15
68       2021-01-22  00:20:00       521   7.15   2.93   2.03
69       2021-01-22  00:30:00       521   6.84   2.52   1.95
70       2021-01-22  00:40:00       521   6.32   2.60   2.06
...             ...       ...       ...    ...    ...    ...
5667539  2021-12-31  23:10:04       577  23.53  16.54  11.50
5667540  2021-12-31  23:20:04       577  26.89  20.32  14.93
5667541  2021-12-31  23:30:04       577  24.43  18.48  12.45
5667542  2021-12-31  23:40:04       577  25.76  20.31  15.27
5667543  2021-12-31  23:50:04       577  32.12  24.04  17.81

2194942 rows × 6 columns

divided_loc_df = divided_loc_df.reset_index(drop=True)
print(divided_loc_df)
     ID Name     Group
0   521  I02   Airport
1   536  I14   Airport
2   525  I25   Airport
3   532  I18  Outdoors
4   564  I45  Outdoors
5   561  I42  Outdoors
6   562  I43  Outdoors
7   558  I33  Outdoors
8   542  I03  Outdoors
9   565  I46  Outdoors
10  566  I47  Outdoors
11  577  I58  Outdoors
12  542  I13  Outdoors
13  575  I56  Outdoors
14  554  I10  Outdoors
15  553  I05  Outdoors
16  560  I41  Outdoors
17  556  I16  Outdoors
18  563  I44  Outdoors
19  570  I51      City
20  568  I49      City
21  571  I52      City
22  567  I48      City
23  543  I04      City
24  540  I32      City
25  528  I37      City
26  524  I23      City
27  545  I07      City
28  549  I22      City
29  526  I28      City
30  530  I40      City
31  532  I08      City
32  551  I30      City
33  547  I19      City
34  527  I36      City
35  529  I39      City
36  539  I29      City
37  535  I12      City
38  533  I09      City
39  534  I11      City
40  537  I24      City

Monthly average per group (see the sketch after the output below).

groups_2021 = pd.DataFrame()
for location in divided_loc_df['ID']:
    loc_specific = df_2021[(df_2021.Location == location)]
    # if divided_loc_df[(divided_loc_df.ID == location)].Group.values == ['Airport']:
    #     loc_specific['Group'] = 'Airport'
    print(str(location) + ' ' + divided_loc_df[(divided_loc_df.ID == location)].Group.values)
    # if divided_loc_df[(divided_loc_df.ID == location)].Group.values == ['Outdoors']:
    #     loc_specific['Group'] = 'Outdoors'
    # if divided_loc_df[(divided_loc_df.ID == location)].Group.values == ['City']:
    #     loc_specific['Group'] = 'City'
    # groups_2021 = pd.concat([groups_2021, loc_specific])
    # groups_2021_airport = pd.concat(groups_2021_airport, df_2021[(df_2021.Location == location) and (divided_loc_df[location]['Group'] == 'Airport')])
    # groups_2021_outdoors = df_2021[(df_2021.Location == location.ID) and (location['Group'] == 'Outdoors')]
    # groups_2021_city = df_2021[(df_2021.Location == location.ID) and (location['Group'] == 'City')]

groups_2021
# print(groups_2021_airport)

['521 Airport']
['536 Airport']
['525 Airport']
['532 Outdoors' '532 City']
['564 Outdoors']
['561 Outdoors']
['562 Outdoors']
['558 Outdoors']
['542 Outdoors' '542 Outdoors']
['565 Outdoors']
['566 Outdoors']
['577 Outdoors']
['542 Outdoors' '542 Outdoors']
['575 Outdoors']
['554 Outdoors']
['553 Outdoors']
['560 Outdoors']
['556 Outdoors']
['563 Outdoors']
['570 City']
['568 City']
['571 City']
['567 City']
['543 City']
['540 City']
['528 City']
['524 City']
['545 City']
['549 City']
['526 City']
['530 City']
['532 Outdoors' '532 City']
['551 City']
['547 City']
['527 City']
['529 City']
['539 City']
['535 City']
['533 City']
['534 City']
['537 City']
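
The cell above was left unfinished because of the duplicated IDs it revealed. A possible way to still get the monthly average per group (a sketch using df_2021 and divided_loc_df as built above; it simply keeps the first listed group for the duplicated IDs 532 and 542) is:

groups = divided_loc_df.drop_duplicates(subset='ID')[['ID', 'Group']]  # keep the first group per ID
df_2021_groups = df_2021.merge(groups, left_on='Location', right_on='ID', how='left')
df_2021_groups['Month'] = pd.to_datetime(df_2021_groups['Date']).dt.month
monthly_group_means = df_2021_groups.groupby(['Group', 'Month'])[['PM10', 'PM2.5', 'PM1']].mean()
print(monthly_group_means)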

Are all airbox locations unique?

print(len(airboxes.ID))
print(len(airboxes.ID.unique()))
print(len(airboxes.Name.unique()))
print(len(airboxes.Description.unique()))
42
40
42
40

There are two IDs that are used twice; each occurrence has a ‘unique’ name, and the lon/lat combinations also don’t match.
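
To see exactly which rows are affected, a short sketch (using the airboxes frame from the cell above) can list the duplicated IDs:

duplicated_airboxes = airboxes[airboxes['ID'].duplicated(keep=False)]  # all rows whose ID occurs more than once
print(duplicated_airboxes.sort_values('ID'))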


Now with more insights:

  • 75% of the relevant period needs to be measured
  • Background noise is determined per hour as the lowest average over the entire ILM2 network
  • The ‘weekgang’ (weekly profile), which shows an average week, uses averages per hour.

Things that are odd based on previous ideas:

  • Negative values should be automatically excluded (a quick check follows after this list)
  • There should be at most about 35 days of no data for a single location, and only 8 locations have 1+ day of no data.
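
Earlier output already showed negative PM values (for example -0.79 for location 535), so they are clearly not excluded in our copy of the data. A quick sketch (using the full measurement_df) to count them:

negative_rows = measurement_df[(measurement_df[['PM10', 'PM2.5', 'PM1']] < 0).any(axis=1)]
print(str(len(negative_rows)) + ' of ' + str(len(measurement_df)) + ' rows contain a negative PM value')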

‘Claims’ that we want to verify/compare:

  • The average over a week of PM2.5, PM10 and NO2, with their 25/75 percentiles
  • (maybe: a wind direction plot based on the Eindhoven Airport weather station)
  • Negative values are only mentioned for NO2; is this correct?
  • 8 locations have 1+ day of no data, with a maximum of 35 days.
  • Fig. 10, a profile of the month of June in 10-minute measurements
  • ILM037 has an average of 20.9 µg/m³ PM10

df_521_NO2 = measurement_521_data[
    (measurement_521_data.Name == 'NO2')].copy()

df_521_NO2['Timestamp'] = pd.to_datetime(df_521_NO2['Timestamp'])
df_521_NO2['Date'] = [d.date() for d in df_521_NO2['Timestamp']]
df_521_NO2['Time'] = [d.time() for d in df_521_NO2['Timestamp']]
df_521_NO2 = df_521_NO2.set_index('Timestamp')

df_521_NO2['Weekday'] = df_521_NO2.index.weekday

df_521_PM10['Weekday'] = df_521_PM10.index.weekday
df_521_PM25['Weekday'] = df_521_PM25.index.weekday
df_521_PM1['Weekday'] = df_521_PM1.index.weekday
# df = df.groupby([df['date_UTC'].dt.hour, df['w_day']]).pm2_5_ugm3.mean().reset_index().dropna()
weekday_hourly_average_PM10_521 = df_521_PM10.groupby([df_521_PM10['Weekday'], df_521_PM10.index.hour]).Value.mean()
weekday_hourly_average_PM25_521 = df_521_PM25.groupby([df_521_PM25['Weekday'], df_521_PM25.index.hour]).Value.mean()
weekday_hourly_average_PM1_521 = df_521_PM1.groupby([df_521_PM1['Weekday'], df_521_PM1.index.hour]).Value.mean()
weekday_hourly_average_NO2_521 = df_521_NO2.groupby([df_521_NO2['Weekday'], df_521_NO2.index.hour]).Value.mean()
weekday_hourly_75quantile_PM10_521 = df_521_PM10.groupby([df_521_PM10['Weekday'], df_521_PM10.index.hour]).Value.quantile(q=0.75)
weekday_hourly_25quantile_PM10_521 = df_521_PM10.groupby([df_521_PM10['Weekday'], df_521_PM10.index.hour]).Value.quantile(q=0.25)

weekday_hourly_75quantile_PM25_521 = df_521_PM25.groupby([df_521_PM25['Weekday'], df_521_PM25.index.hour]).Value.quantile(q=0.75)
weekday_hourly_25quantile_PM25_521 = df_521_PM25.groupby([df_521_PM25['Weekday'], df_521_PM25.index.hour]).Value.quantile(q=0.25)

weekday_hourly_75quantile_NO2_521 = df_521_NO2.groupby([df_521_NO2['Weekday'], df_521_NO2.index.hour]).Value.quantile(q=0.75)
weekday_hourly_25quantile_NO2_521 = df_521_NO2.groupby([df_521_NO2['Weekday'], df_521_NO2.index.hour]).Value.quantile(q=0.25)
combined_dataframe = pd.DataFrame({'mean_PM10': weekday_hourly_average_PM10_521, 'mean_PM25': weekday_hourly_average_PM25_521, 'mean_NO2': weekday_hourly_average_NO2_521})
combined_dataframe_reset = combined_dataframe.reset_index(drop=True)
fig, ax = plt.subplots(figsize=(20,5))
ax.plot(combined_dataframe_reset)
ax.fill_between(combined_dataframe_reset.index, weekday_hourly_25quantile_PM10_521, weekday_hourly_75quantile_PM10_521, facecolor='blue', alpha=0.2)
ax.fill_between(combined_dataframe_reset.index, weekday_hourly_25quantile_PM25_521, weekday_hourly_75quantile_PM25_521, facecolor='orange', alpha=0.2)
ax.fill_between(combined_dataframe_reset.index, weekday_hourly_25quantile_NO2_521, weekday_hourly_75quantile_NO2_521, facecolor='green', alpha=0.2)

ax.set_xticks(ax.get_xticks() + 10,["","Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", "", ""])
fig.suptitle("Weekgang 521")
ax.legend(['PM10', 'PM2.5', 'NO2'])
ax.set_ylabel("concentraties [µg/m\xB3]")
# fig.show()
Text(0, 0.5, 'concentraties [µg/m³]')

png

from statsmodels.tsa.seasonal import STL  # STL decomposition (needed if not already imported earlier in the notebook)

data = df_521_PM10.resample('60min').mean().ffill()
res = STL(data['Value'], period=24).fit()
res.plot()
plt.gcf().set_size_inches(20,10)
plt.show()

png

df_2021['Datetime'] = pd.to_datetime(df_2021['Date'] + ' ' + df_2021['Time'])
df_2021 = df_2021.set_index('Datetime')

data = df_2021.resample('60min').mean().ffill()
res = STL(data['PM10'].iloc[data.index > "2021-09-01"], period=24).fit()
res.plot()
plt.gcf().set_size_inches(20,10)
plt.show()

png

res = STL(data['PM10'].iloc[data.index > "2021-09-01"], period=168).fit()
res.plot()
plt.gcf().set_size_inches(20,10)
plt.show()

png

Weather Data

Info

BRON: KONINKLIJK NEDERLANDS METEOROLOGISCH INSTITUUT (KNMI) Opmerking: door stationsverplaatsingen en veranderingen in waarneemmethodieken zijn deze tijdreeksen van uurwaarden mogelijk inhomogeen! Dat betekent dat deze reeks van gemeten waarden niet geschikt is voor trendanalyse. Voor studies naar klimaatverandering verwijzen we naar de gehomogeniseerde dagreeksen http://www.knmi.nl/nederland-nu/klimatologie/daggegevens of de Centraal Nederland Temperatuur http://www.knmi.nl/kennis-en-datacentrum/achtergrond/centraal-nederland-temperatuur-cnt.

SOURCE: ROYAL NETHERLANDS METEOROLOGICAL INSTITUTE (KNMI) Comment: These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for trend analysis. For climate change studies we refer to the homogenized series of daily data http://www.knmi.nl/nederland-nu/klimatologie/daggegevens or the Central Netherlands Temperature http://www.knmi.nl/kennis-en-datacentrum/achtergrond/centraal-nederland-temperatuur-cnt.

  • YYYYMMDD = date (YYYY=year, MM=month, DD=day)
  • HH = time (HH=hour, UT; 12 UT = 13 MET, 14 MEZT; hourly division 05 runs from 04.00 UT to 05.00 UT)
  • DD = mean wind direction (in degrees) for the 10-minute period preceding the observation time stamp (360=north, 90=east, 180=south, 270=west, 0=calm, 990=variable)
  • FH = mean wind speed (in 0.1 m/s) for the hour preceding the observation time stamp
  • FF = mean wind speed (in 0.1 m/s) for the 10-minute period preceding the observation time stamp
  • FX = maximum wind gust (3-second mean wind speed, in 0.1 m/s) in the preceding hour
  • T = temperature (in 0.1 degrees Celsius) at 1.50 m at the time of observation
  • T10N = minimum temperature (in 0.1 degrees Celsius) at 0.1 m in the preceding 6-hour period
  • TD = dew point temperature (in 0.1 degrees Celsius) at 1.50 m at the time of observation
  • SQ = sunshine duration (in 0.1 hour) during the hourly division, calculated from global radiation (-1 for <0.05 hour)
  • Q = global radiation (in J/cm2) during the hourly division
  • DR = precipitation duration (in 0.1 hour) during the hourly division
  • RH = hourly precipitation amount (in 0.1 mm) (-1 for <0.05 mm)
  • P = air pressure (in 0.1 hPa) reduced to mean sea level, at the time of observation
  • VV = horizontal visibility at the time of observation (0=less than 100m, 1=100-200m, 2=200-300m, …, 49=4900-5000m, 50=5-6km, 56=6-7km, 57=7-8km, …, 79=29-30km, 80=30-35km, 81=35-40km, …, 89=more than 70km)
  • N = cloud cover (in octants) at the time of observation (9=sky invisible)
  • U = relative atmospheric humidity (in percent) at 1.50 m at the time of observation
  • WW = present weather code (00-99), observed visually (WW) or automatically (WaWa), for the current weather or the weather in the preceding hour; see https://cdn.knmi.nl/knmi/pdf/bibliotheek/scholierenpdf/weercodes_Nederland.pdf
  • IX = indicator present weather code (1=manned and recorded (using code from visual observations), 2,3=manned and omitted (no significant weather phenomenon to report, not available), 4=automatically recorded (using code from visual observations), 5,6=automatically omitted (no significant weather phenomenon to report, not available), 7=automatically set (using code from automated observations))
  • M = fog, 0=no occurrence, 1=occurred during the preceding hour and/or at the time of observation
  • R = rainfall, 0=no occurrence, 1=occurred during the preceding hour and/or at the time of observation
  • S = snow, 0=no occurrence, 1=occurred during the preceding hour and/or at the time of observation
  • O = thunder, 0=no occurrence, 1=occurred during the preceding hour and/or at the time of observation
  • Y = ice formation, 0=no occurrence, 1=occurred during the preceding hour and/or at the time of observation

eindhoven_weather_hourly = pd.read_csv('csv_files/uurgeg_370_2021-2030.csv', skiprows=31, sep='\,', engine='python')
for item in eindhoven_weather_hourly['"# STN'].index:
    eindhoven_weather_hourly['"# STN'][item] = int(eindhoven_weather_hourly['"# STN'][item][1:])
    eindhoven_weather_hourly['    Y"'][item] = int(eindhoven_weather_hourly['    Y"'][item][:-1])

eindhoven_weather_hourly = eindhoven_weather_hourly.rename(columns={'"# STN': '# STN', '    Y"': 'Y'})
columns = []
for col in eindhoven_weather_hourly.columns:
    columns.append(col.strip())
eindhoven_weather_hourly.columns = columns
eindhoven_weather_hourly_relevant = eindhoven_weather_hourly[['YYYYMMDD','HH', 'DD', 'FH', 'RH', 'T']]
eindhoven_weather_hourly_relevant.columns = ['Date', 'Hour', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']
# eindhoven_weather_hourly_relevant.to_csv(df_file_path + 'hourdata_eindhoven_2021_16may2023.csv', index=False)
eindhoven_weather_hourly_relevant.tail(1)
           Date  Hour  WindDir  WindSpeed_avg  PrecipationHour  Temp
20783  20230516    24      290             20                0    67
eindhoven_weather_hourly = pd.read_csv('csv_files/uurgeg_370_2021-2030.csv', skiprows=31, skipinitialspace=True)
eindhoven_weather_hourly.tail()
# STN,YYYYMMDD, HH, DD, FH, FF, FX, T, T10N, TD, SQ, Q, DR, RH, P, VV, N, U, WW, IX, M, R, S, O, Y
20779 370,20230516, 20, 330, 40, 40, 70, 1...
20780 370,20230516, 21, 320, 20, 20, 40, ...
20781 370,20230516, 22, 300, 20, 20, 30, ...
20782 370,20230516, 23, 300, 20, 20, 30, ...
20783 370,20230516, 24, 290, 20, 20, 30, ...
eindhoven_weather_hourly_full = pd.read_csv('csv_files/hourdata_eindhoven_2021_16may2023.csv')
eindhoven_weather_hourly_full
           Date  Hour  WindDir  WindSpeed_avg  PrecipationHour  Temp
0      20210101     1      290             10                0    20
1      20210101     2      250             10                0    18
2      20210101     3      220             10                0    11
3      20210101     4      270             20                0    10
4      20210101     5      250             20                0     3
...         ...   ...      ...            ...              ...   ...
20779  20230516    20      330             40                0   105
20780  20230516    21      320             20                0    90
20781  20230516    22      300             20                0    78
20782  20230516    23      300             20                0    71
20783  20230516    24      290             20                0    67

20784 rows × 6 columns

weather_measurements_by_datetime = eindhoven_weather_hourly_relevant.copy()
datetimes = []
for index in weather_measurements_by_datetime.index:
    string_to_convert = str(weather_measurements_by_datetime['Date'][index]) + " " + str(weather_measurements_by_datetime['Hour'][index] - 1)
    datetimes.append(datetime.strptime(string_to_convert, "%Y%m%d %H"))

weather_measurements_by_datetime['DateTime'] = datetimes
print(weather_measurements_by_datetime.tail())
           Date  Hour  WindDir  WindSpeed_avg  PrecipationHour  Temp  \
20779  20230516    20      330             40                0   105   
20780  20230516    21      320             20                0    90   
20781  20230516    22      300             20                0    78   
20782  20230516    23      300             20                0    71   
20783  20230516    24      290             20                0    67   

                 DateTime  
20779 2023-05-16 19:00:00  
20780 2023-05-16 20:00:00  
20781 2023-05-16 21:00:00  
20782 2023-05-16 22:00:00  
20783 2023-05-16 23:00:00  
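
The loop above works, but the same conversion can also be done vectorised (a sketch on the same Date and Hour columns; KNMI hour 1 covers 00:00-01:00, hence the Hour - 1; the weather_vectorized name is only for illustration):

weather_vectorized = eindhoven_weather_hourly_relevant.copy()
weather_vectorized['DateTime'] = (pd.to_datetime(weather_vectorized['Date'].astype(str), format='%Y%m%d')
                                  + pd.to_timedelta(weather_vectorized['Hour'] - 1, unit='h'))
print(weather_vectorized.tail())
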
measurement_df = pd.read_csv(df_file_path + 'full_measurement.csv')
measurement_df.tail()
               Date      Time  Location   PM10  PM2.5   PM1
5739474  2023-05-19  10:10:03       577  11.65   6.02  4.69
5739475  2023-05-19  10:20:03       577  13.06   6.85  4.92
5739476  2023-05-19  10:30:03       577  11.15   7.33  4.76
5739477  2023-05-19  10:40:03       577  11.88   6.35  4.47
5739478  2023-05-19  10:50:03       577  14.06   7.43  5.20
measurement_df['DateTime'] = pd.to_datetime(measurement_df['Date'] + ' ' + measurement_df['Time'])
measurement_df = measurement_df.set_index('DateTime', drop=True)
grouper = measurement_df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper.mean().ffill()  # hourly mean per (hour, location), then forward-fill any gaps
hourly_measurement_full = result.reset_index()
weather_measurements_by_datetime
           Date  Hour  WindDir  WindSpeed_avg  PrecipationHour  Temp             DateTime
0      20210101     1      290             10                0    20  2021-01-01 00:00:00
1      20210101     2      250             10                0    18  2021-01-01 01:00:00
2      20210101     3      220             10                0    11  2021-01-01 02:00:00
3      20210101     4      270             20                0    10  2021-01-01 03:00:00
4      20210101     5      250             20                0     3  2021-01-01 04:00:00
...         ...   ...      ...            ...              ...   ...                  ...
20779  20230516    20      330             40                0   105  2023-05-16 19:00:00
20780  20230516    21      320             20                0    90  2023-05-16 20:00:00
20781  20230516    22      300             20                0    78  2023-05-16 21:00:00
20782  20230516    23      300             20                0    71  2023-05-16 22:00:00
20783  20230516    24      290             20                0    67  2023-05-16 23:00:00

20784 rows × 7 columns

combine_weather_measurements = hourly_measurement_full.copy()
combine_weather_measurements['WindDir'] = np.nan
combine_weather_measurements['WindSpeed_avg'] = np.nan
combine_weather_measurements['PrecipationHour'] = np.nan
combine_weather_measurements['Temp'] = np.nan

print(len(weather_measurements_by_datetime.index))
combine_weather_measurements_full = pd.DataFrame()
for index in weather_measurements_by_datetime.index:
    relevant_data = combine_weather_measurements[combine_weather_measurements['DateTime'] == weather_measurements_by_datetime['DateTime'][index]].copy()
    relevant_data['WindDir'] = weather_measurements_by_datetime['WindDir'][index]
    relevant_data['WindSpeed_avg'] = weather_measurements_by_datetime['WindSpeed_avg'][index]
    relevant_data['PrecipationHour'] = weather_measurements_by_datetime['PrecipationHour'][index]
    relevant_data['Temp'] = weather_measurements_by_datetime['Temp'][index]
    combine_weather_measurements_full = pd.concat([combine_weather_measurements_full, relevant_data])
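
The row-by-row loop above is slow for roughly 20,000 hourly timestamps. An alternative (a sketch using the hourly_measurement_full and weather_measurements_by_datetime frames from above) is a single join on DateTime; note that, unlike the loop, a left join also keeps measurement hours for which there is no weather row yet, with NaN weather values.

combine_via_merge = hourly_measurement_full.merge(
    weather_measurements_by_datetime[['DateTime', 'WindDir', 'WindSpeed_avg', 'PrecipationHour', 'Temp']],
    on='DateTime', how='left')
print(combine_via_merge.tail())
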
remaining = combine_weather_measurements[combine_weather_measurements['DateTime'] >= '2023-05-17 00:00:00']
import requests as request
for date in remaining['DateTime'].map(lambda t: t.date()).unique():
    relevant_data = combine_weather_measurements[combine_weather_measurements['DateTime'].dt.date == date].copy()
    url = 'http://api.weatherapi.com/v1/history.json?key=2bd0e4d77ef74cf5bec123551231805&q=Eindhoven&dt=' + str(date)
    res = request.get(url)
    data = res.json()
    for hour in range(len(data['forecast']['forecastday'][0]['hour'])):
        relevant_hour = relevant_data[relevant_data['DateTime'].dt.hour == hour].copy()
        relevant_hour['WindDir'] = data['forecast']['forecastday'][0]['hour'][hour]['wind_degree']
        relevant_hour['WindSpeed_avg'] = data['forecast']['forecastday'][0]['hour'][hour]['wind_kph'] / 3.6  # km/h to m/s; note that the KNMI FH values above are in 0.1 m/s
        relevant_hour['PrecipationHour'] = data['forecast']['forecastday'][0]['hour'][hour]['precip_mm']
        relevant_hour['Temp'] = data['forecast']['forecastday'][0]['hour'][hour]['temp_c']
        combine_weather_measurements_full = pd.concat([combine_weather_measurements_full, relevant_hour])

# combine_weather_measurements_full.to_csv(df_file_path + 'hour_combined_data_eindhoven_2021_19may2023.csv')
current_data = combine_weather_measurements_full[combine_weather_measurements_full['Location'] == 521]
if pd.to_datetime(current_data.tail(1)['DateTime'].values[0]).date() < datetime.now().date():
    print(pd.to_datetime(current_data.tail(1)['DateTime'].values[0]).date())
    print(datetime.now().date())
elif pd.Timestamp(current_data.tail(1)['DateTime'].values[0]).hour < datetime.now().hour:
    print(pd.Timestamp(current_data.tail(1)['DateTime'].values[0]).hour)
    print(datetime.now().hour)

2023-05-16
2023-06-05
last_datetime = current_data.tail(1).DateTime.values[0]
pd.to_datetime(last_datetime).strftime("%Y-%m-%dT%H:%M:%SZ")
'2023-05-16T23:00:00Z'


A portfolio by Skyler Vermeer