bharathsudharsan/Air-Quality-IoT-Analytics

Air-Quality-IoT-Analytics

Introduction

Air pollution is a global problem and one of the most dangerous environmental risks to human health. In Europe, air quality remains poor in many cities, where pollutant concentrations exceed regulated limits. In this work, we present a real-world, end-to-end air quality use case that uses IoT devices and wireless technology to improve the living experience of citizens in urban areas. Specifically, we use low-cost IoT devices to monitor air quality at multiple locations and build a historical air quality dataset containing pollutant concentrations that are accurate (calibrated by experts for precision close to that of air-quality stations) and reliable (collected over resilient LoRa networks). These data can serve as a foundation for the European Environment Agency (EEA), the World Health Organization (WHO), and other e-government bodies to design ML algorithms for advanced air-quality analytics.

[ipynb] Anomaly Detection using TRAFAIR Air Quality Dataset.ipynb: We ran 12 unsupervised anomaly detection algorithms, including Angle-based Outlier Detection, Isolation Forest, and Clustering-Based Local Outlier, on the TRAFAIR air quality dataset. The anomaly scores of each model are provided along with t-distributed Stochastic Neighbor Embedding (t-SNE, 3D) and Uniform Manifold Approximation and Projection (UMAP, 2D) plots.

[html] Anomaly Detection using TRAFAIR Air Quality Dataset.html: IPython notebook converted/exported to HTML.

Download the [html] file and open it in a browser. The [ipynb] file can be viewed on the GitHub page, but because the file is large, the page may need to be reloaded. It is therefore best to download the notebook and open it in Google Colab or Jupyter Notebook.

Air quality dataset build description

Numerous IoT devices are used to monitor and collect pollution levels on an urban scale in Modena, Florence, Pisa, Livorno, Santiago de Compostela, and Zaragoza (cities in Italy and Spain). As shown below, 14 IoT devices are installed at 12 points in Modena, Italy (the remaining cities have similar installations). The same figure shows the hardware of an installed device: each device contains 4 cells (sensors), one for each gas (NO, NO2, CO, and Ox). Each cell measures the gas level through 2 channels (the auxiliary and the working channel) and reports a value for each gas and channel in millivolts (mV).

Unsupervised anomaly detection models

The accuracy of raw sensor measurements is affected by events such as low battery voltage, weather conditions, air humidity, and physical disturbances. Because such events are common in real-world distributed sensor deployments, we need to run algorithms centrally, after the data reach the LoRa servers, to identify and remove the resulting abnormal data patterns.

The following anomaly detection models are created using PyCaret and trained using a part of the TRAFAIR dataset.

| ID | Name | Reference |
| --- | --- | --- |
| abod | Angle-based Outlier Detection | pyod.models.abod.ABOD |
| cluster | Clustering-Based Local Outlier | pyod.models.cblof.CBLOF |
| cof | Connectivity-Based Local Outlier | pyod.models.cof.COF |
| iforest | Isolation Forest | pyod.models.iforest.IForest |
| histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
| knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
| lof | Local Outlier Factor | pyod.models.lof.LOF |
| svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
| pca | Principal Component Analysis | pyod.models.pca.PCA |
| mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |
| sod | Subspace Outlier Detection | pyod.models.sod.SOD |
| sos | Stochastic Outlier Selection | pyod.models.sos.SOS |
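
As a rough illustration of how one of the listed models could be created and applied with PyCaret, here is a minimal sketch. It is not the notebook's exact code. The helper `train_and_label`, the seed value, and the decision to drop the `id_sensor_low_cost` and `phenomenon_time` columns before training are assumptions made for this example, and `setup` arguments can differ between PyCaret versions.

```python
# Model IDs copied from the table above; each one maps to a pyod detector.
MODEL_IDS = [
    "abod", "cluster", "cof", "iforest", "histogram", "knn",
    "lof", "svm", "pca", "mcd", "sod", "sos",
]

# Assumption: these columns identify a row rather than describe a measurement,
# so this sketch leaves them out of the training features.
NON_FEATURE_COLUMNS = ["id_sensor_low_cost", "phenomenon_time"]


def train_and_label(df, model_id="iforest", seed=42):
    """Fit one detector on a pandas DataFrame and return (model, labelled rows).

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module still loads where PyCaret is not installed.
    from pycaret.anomaly import setup, create_model, assign_model

    if model_id not in MODEL_IDS:
        raise ValueError(f"unknown model id: {model_id}")

    features = df.drop(columns=NON_FEATURE_COLUMNS, errors="ignore")
    setup(data=features, session_id=seed, verbose=False)
    model = create_model(model_id)
    # assign_model appends the 'Anomaly' and 'Anomaly_Score' columns.
    labelled = assign_model(model)
    return model, labelled
```

With a DataFrame `df` holding the sensor rows (how it is loaded is left open here), `model, labelled = train_and_label(df, "knn")` would return the fitted detector and the labelled rows.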

Assign anomaly labels to dataset

Data collection started in August 2019. The data rows generated by the IoT devices contain sensor measurements, temperature, humidity, and battery voltage values, each with a timestamp. These data rows are encapsulated into LoRa packets at the device level and sent to the LoRaWAN server via gateways.

Two columns, 'Anomaly' and 'Anomaly_Score', are appended to each row. In the 'Anomaly' column, 0 stands for inliers and 1 for outliers/anomalies. 'Anomaly_Score' is the value computed by the algorithm; outliers are assigned larger anomaly scores.

| | id_sensor_low_cost | phenomenon_time | battery_voltage | humidity | temperature | no_we | no_aux | no2_we | no2_aux | ox_we | ox_aux | co_we | co_aux | Anomaly | Anomaly_Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4005 | 2020-02-02 05:53:25.263916 | 4.901 | 100.758118 | 5.027297 | 296.813965 | -4.577637 | 243.255615 | 18.493652 | 235.656738 | 4.119873 | 451.812744 | 129.638672 | 0 | -299.285983 |
| 1 | 4005 | 2020-02-08 23:47:16.844981 | 4.840 | 81.787628 | 1.938469 | 312.194824 | 8.697510 | 246.643066 | 22.064209 | 239.593506 | 7.965088 | 553.161621 | 223.571777 | 0 | 102.858787 |
| 2 | 4005 | 2020-01-21 07:01:46.596547 | 4.956 | 99.678558 | -0.571204 | 347.717285 | 46.875000 | 246.917725 | 21.057129 | 238.769531 | 6.774902 | 894.927979 | 555.725098 | 1 | 11492.205912 |
| 3 | 4005 | 2020-03-03 18:30:47.208524 | 4.418 | 88.692230 | 7.343918 | 310.089111 | 4.119873 | 244.171143 | 20.324707 | 236.206055 | 6.317139 | 461.059570 | 132.476807 | 0 | -685.653888 |
| 4 | 4005 | 2020-02-19 14:59:04.939508 | 4.730 | 54.386658 | 17.945677 | 314.483643 | 0.091553 | 235.565186 | 12.176514 | 230.895996 | 0.732422 | 369.140625 | 61.157227 | 0 | 2816.180134 |
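
To give some intuition for why outliers receive larger scores, the sketch below applies the idea behind one of the listed detectors (k-nearest neighbors) to the `co_we` and `co_aux` values of the five sample rows, using only the Python standard library. Here the score of a point is the distance to its k-th nearest neighbor. This is an illustration only: it is not the pyod implementation, the choice of two columns and `k=2` is an assumption, and it does not reproduce the `Anomaly_Score` values in the table.

```python
import math

# (co_we, co_aux) in mV for the five sample rows shown above.
points = [
    (451.812744, 129.638672),
    (553.161621, 223.571777),
    (894.927979, 555.725098),
    (461.059570, 132.476807),
    (369.140625, 61.157227),
]


def knn_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


scores = knn_scores(points, k=2)
most_anomalous = scores.index(max(scores))
print(most_anomalous)  # prints 2
```

On these two columns, row 2 lies far from the other four rows and so receives the largest score; it is also the row labelled 1 in the 'Anomaly' column above.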

Uniform Manifold Approximation and Projection (UMAP) for outliers

This plot can be used to analyze the behavior of an anomaly detection model. Users are free to use any model of their choice to detect and remove anomalies (caused by events such as low battery voltage, physical disturbances, etc.). The output is clean data that can power advanced air quality analytics tasks.

How to remove the anomalies (yellow points): run the trained unsupervised anomaly detection model of your choice from the list above, then remove the data rows whose anomaly scores exceed a threshold that you set.
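
The threshold step can be sketched as follows, using the scores of the five sample rows from the table above and plain Python. The helper `drop_anomalies` and the cut-off of 5000 are hypothetical; a suitable threshold depends on the chosen model and on the score distribution of your data.

```python
# Index, label, and score of the five sample rows from the table above.
rows = [
    {"index": 0, "Anomaly": 0, "Anomaly_Score": -299.285983},
    {"index": 1, "Anomaly": 0, "Anomaly_Score": 102.858787},
    {"index": 2, "Anomaly": 1, "Anomaly_Score": 11492.205912},
    {"index": 3, "Anomaly": 0, "Anomaly_Score": -685.653888},
    {"index": 4, "Anomaly": 0, "Anomaly_Score": 2816.180134},
]


def drop_anomalies(rows, threshold):
    """Keep only the rows whose anomaly score is at or below the threshold."""
    return [r for r in rows if r["Anomaly_Score"] <= threshold]


THRESHOLD = 5000.0  # hypothetical cut-off chosen for this example
clean = drop_anomalies(rows, THRESHOLD)
print([r["index"] for r in clean])  # prints [0, 1, 3, 4]
```

With this cut-off, only row 2 is removed, which matches its label of 1 in the 'Anomaly' column.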

Air quality data analytics

The battle against air pollution will be won with data. The following are a few potential applications of the dataset: (i) train ML algorithms that can be deployed centrally (on a cloud platform) and exposed through APIs for forecasting/predicting (model inference) urban air quality at any given location in Spain and Italy; (ii) generate air quality index heat maps of the past, present, and future to rank areas by emission level and track changes; (iii) support assessment, intervention, and e-government decision-making with reliable historical data when forming pollution-reducing policies.

Note: In the [html] and [ipynb] files, the interactive plots are not displayed because of the high-resolution output produced by Plotly. They reappear when the notebook is rerun from the start.

If the code is useful, please consider citing the papers using the BibTeX entries below.

@inproceedings{rollo2021ubicomp,
  title     = {Air Quality Sensor Network Data Acquisition, Cleaning, Visualization, and Analytics: A Real-world IoT Use Case},
  author    = {Rollo, Federica and Sudharsan, Bharath and Po, Laura and Breslin, John G},
  editor    = {Afsaneh Doryab and Qin Lv and Michael Beigl},
  booktitle = {UbiComp/ISWC '21: 2021 {ACM} International Joint Conference on Pervasive
               and Ubiquitous Computing and 2021 {ACM} International Symposium on
               Wearable Computers, Virtual Event, September 21-25, 2021},
  pages     = {67--68},
  publisher = {{ACM}},
  year      = {2021},
  url       = {https://doi.org/10.1145/3460418.3479277},
  doi       = {10.1145/3460418.3479277}
}

@inproceedings{sudharsan2021iotdidemo,
  title     = {Demo abstract: Porting and execution of anomalies detection models on embedded systems in iot},
  author    = {Sudharsan, Bharath and Patel, Pankesh and Wahid, Abdul and Yahya, Muhammad and Breslin, John G and Ali, Muhammad Intizar},
  booktitle = {International Conference on Internet-of-Things Design and Implementation (IoTDI)},
  year      = {2021}
}

About

Repo and code of the UbiComp-ISWC 2021 paper: 'Air Quality Sensor Network Data Acquisition, Cleaning, Visualization, and Analytics: A Real-world IoT Use Case'