A web application for aggregating wildfire data and predicting wildfire causes with the help of ML.


Wildfire-cover

Wildfires

Developed by M.Cihat Unal

Overview

The API provides a user interface for SQL aggregations and an XGBoost model that predicts the cause of wildfires from the given inputs. The 1.88 Million US Wildfires dataset was used for training. This dataset includes several tables, but only the "Fires" table is used for both model training and the SQL aggregations.

Model Training Details

| identifier | learning rate | tree method | database |
| --- | --- | --- | --- |
| XGBoostClassifier | 0.5 | hist | 1.88 Million US Wildfires (Preprocessed) |

Data Installation and Preparation

First, create "data" and "logs" folders in the project directory. Then download the dataset and put it under the "data" folder. The steps I followed while preparing the data for training are listed below. Open a terminal in the project's directory first, then go into the "operation" folder.

  • As mentioned above, 1.88 Million US Wildfires is in SQL format and includes many tables. We are going to extract only the Fires table, convert it to a CSV file, and save it. For this:
python extract_db_to_csv.py

It will save the DataFrame as "1.88_Million_US_Wildfires.csv", keeping only the relevant columns. You can examine the extracted CSV file.
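The extraction step can be sketched with pandas and the standard sqlite3 module. This is a minimal, self-contained illustration rather than the actual script: the in-memory table and its columns are hypothetical stand-ins for the real FPA_FOD database.

```python
import sqlite3

import pandas as pd

# Tiny in-memory stand-in for the FPA_FOD SQLite database (hypothetical rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Fires (STATE TEXT, FIRE_SIZE REAL, STAT_CAUSE_DESCR TEXT)")
con.executemany(
    "INSERT INTO Fires VALUES (?, ?, ?)",
    [("CA", 9.0, "Lightning"), ("NM", 0.1, "Campfire")],
)

# Read only the columns of interest from the Fires table and save them as CSV.
df = pd.read_sql_query("SELECT STATE, FIRE_SIZE, STAT_CAUSE_DESCR FROM Fires", con)
df.to_csv("1.88_Million_US_Wildfires.csv", index=False)
```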

Before training the model, we should extract the useful information from the dataset and remove everything unnecessary so the model performs well.

  • To prepare the data for training, we need to:
    • Convert columns to a numerical format (if they are not already).
    • Drop unnecessary columns.
    • Drop duplicates.
    • Convert "DISCOVERY_DATE", which is in Julian date format, to an ordinary calendar date and save it in a "DATE" column.
    • Split the "DATE" column into "MONTH" and "DAY_OF_WEEK" to increase the number of features.
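In pandas, the Julian-date conversion and the two derived date features can be sketched as below. The two sample values are made up; `origin="julian"` together with `unit="D"` performs the conversion:

```python
import pandas as pd

# Two made-up Julian day numbers standing in for the real discovery-date column.
df = pd.DataFrame({"DISCOVERY_DATE": [2453137.5, 2453403.5]})

# Julian day number -> ordinary calendar date.
df["DATE"] = pd.to_datetime(df["DISCOVERY_DATE"], unit="D", origin="julian")

# Derive the two extra features from the date.
df["MONTH"] = df["DATE"].dt.month
df["DAY_OF_WEEK"] = df["DATE"].dt.dayofweek
```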

To perform the aforementioned steps:

python data_preprocessing.py

Lastly, the final DataFrame is ready for training and is exported to "wildfire_cleansed.csv". It is also saved to "wildfires.sqlite" for use in the UI's aggregations. Both datasets can be found in the data folder.

Running the API

via Docker

Build the image inside the Dockerfile's directory

docker build -t wildfire .

Then run the image on the host network:

docker run --network host --name wildfire-cont wildfire

Finally, you can reach the API from your browser by entering:

http://localhost:5000/

via Python in Terminal

Open the terminal in the project's directory. Install the requirements first.

pip install -r requirements.txt

Then, run the main.py file

python main.py

User Interface

You will see this page when the API runs successfully.

image

Example Usage

Wildfire Cause Prediction

After entering the inputs, click the submit button to see the predicted wildfire cause. image

SQL Aggregation

You can run SQL queries against the "Fires" table only; you can see its columns with: Select * From Fires.
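For instance, a simple aggregation counting fires per state could look like the query below. The snippet runs it against a tiny in-memory stand-in table, purely to illustrate the kind of SQL the query box accepts:

```python
import sqlite3

# In-memory stand-in for the "Fires" table in wildfires.sqlite (hypothetical rows).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Fires (STATE TEXT, FIRE_SIZE REAL)")
con.executemany("INSERT INTO Fires VALUES (?, ?)",
                [("CA", 9.0), ("CA", 1.5), ("NM", 0.2)])

# An aggregation you could type into the UI's query box.
rows = con.execute(
    "SELECT STATE, COUNT(*) AS fire_count FROM Fires "
    "GROUP BY STATE ORDER BY fire_count DESC"
).fetchall()
print(rows)  # [('CA', 2), ('NM', 1)]
```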

Enter the SQL Query in the textbox. image

Then click the search button. You will see a page like this: image

You can examine the results through the pages.

In the end, you can return to the home page by clicking the "Go Back" button.

Train Model

Training can be run with different parameters via command-line arguments:

python train.py --learning_rate 0.3 --train_size 0.7 --tree_method hist --model_name wildfire.pkl
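These flags map onto a command-line parser roughly as follows. This is a hypothetical re-creation of train.py's argument handling, not the actual code; the defaults are guesses based on the table above:

```python
import argparse

# Hypothetical re-creation of train.py's CLI (the real script may differ).
parser = argparse.ArgumentParser(description="Train the wildfire-cause XGBoost model")
parser.add_argument("--learning_rate", type=float, default=0.5)
parser.add_argument("--train_size", type=float, default=0.7)
parser.add_argument("--tree_method", default="hist")
parser.add_argument("--model_name", default="wildfire.pkl")

args = parser.parse_args(["--learning_rate", "0.3", "--tree_method", "hist"])
print(args.learning_rate, args.model_name)  # 0.3 wildfire.pkl
```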

Inference

You can also use the model for inference by providing the inputs (all of them are required):

python inference.py --state NM --date 22.07.2008 --latitude 40.8213 --longitude -121.5397 --fire_size 9.0

Examine my work further

You can browse my work in the Jupyter notebooks in the notebooks folder. The notebooks cover:

  • Examining data in detail
  • EDA (Exploratory Data Analysis)
  • Data Cleaning
  • Correlation Matrix
  • Random Forest and Decision Tree training
  • Hyperparameter optimization.

Improvement Suggestions

  • The SQL database is slow to load, so MongoDB could be a more latency-efficient alternative.
  • The current model has 56.42% accuracy. There are 12 labels in total, which may be too many to predict correctly, and the data is also imbalanced across labels. Hence, the label count can be lowered by defining new labels and distributing the existing labels among them. For example:
    • natural = ['Lightning']
    • accidental = ['Structure','Fireworks','Powerline','Railroad','Smoking','Children','Campfire','Equipment Use','Debris Burning']
    • malicious = ['Arson']
    • other = ['Missing/Undefined','Miscellaneous']
  • MLflow can be used to track ML operations.
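The regrouping suggested above can be sketched as a lookup table; the group names mirror the bullet list and could be applied to the label column before training:

```python
# Proposed regrouping of the original causes into 4 coarser labels.
groups = {
    "natural": ["Lightning"],
    "accidental": ["Structure", "Fireworks", "Powerline", "Railroad", "Smoking",
                   "Children", "Campfire", "Equipment Use", "Debris Burning"],
    "malicious": ["Arson"],
    "other": ["Missing/Undefined", "Miscellaneous"],
}

# Invert to a cause -> group lookup, e.g. for relabelling the target column.
cause_to_group = {cause: group for group, causes in groups.items() for cause in causes}
print(cause_to_group["Campfire"])  # accidental
```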

Citation

  • Short, Karen C. 2017. Spatial wildfire occurrence data for the United States, 1992-2015 [FPA_FOD_20170508]. 4th Edition. Fort Collins, CO: Forest Service Research Data Archive. https://doi.org/10.2737/RDS-2013-0009.4
