Net Hourly Electrical Power Output Prediction in a Combined Cycle Power Plant

Dataset

The dataset is open source, available here, and contains 9568 data points collected from a combined cycle power plant (CCPP) over 6 years (2006-2011), when the plant was set to work at full load. A CCPP is composed of gas turbines, steam turbines, and heat recovery steam generators. In a CCPP, the electricity, in the range of 420.26-495.76 MW, is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the vacuum is collected from and affects the steam turbine, the other three ambient variables affect the gas turbine performance.

- Features

Features consist of hourly average ambient variables, namely:

Ambient Temperature (AT) in the range 1.81-37.11 °C
Ambient Pressure (AP) in the range 992.89-1033.30 millibar
Relative Humidity (RH) in the range 25.56%-100.16%
Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg

- Target

The target is to predict the net hourly electrical power output (EP) of the plant.

Implementation and Results Interpretation

Step 1: Importing the necessary modules

The electricity prediction system utilizes the following Python libraries.

NumPy
Pandas
Seaborn
Matplotlib
Scikit-learn

Step 2: Importing dataset, exploratory data analysis

The combined cycle power plant dataset, which spans a variety of ambient conditions over 6 years of operation, was read using the pandas.read_csv() function. All features were found to be numeric, with no NaN values. The descriptive statistics for the dataset are listed in Table 1.

Table 1: Dataset statistics

Stats	AT	V	AP	RH	EP
count	9568	9568	9568	9568	9568
mean	19.651231	54.305804	1013.259078	73.308978	454.365009
std	7.452473	12.707893	5.938784	14.600269	17.066995
min	1.810000	25.360000	992.890000	25.560000	420.260000
25%	13.510000	41.740000	1009.100000	63.327500	439.750000
50%	20.345000	52.080000	1012.940000	74.975000	451.550000
75%	25.720000	66.540000	1017.260000	84.830000	468.430000
max	37.110000	81.560000	1033.300000	100.160000	495.760000
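
The loading and sanity checks described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the four rows are made-up placeholders read from an in-memory string, and the column names are assumed to match the headers of Table 1. With the real file, the string buffer would be replaced by the path to the CSV.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real dataset has 9568 rows.
# Column names are assumed to follow Table 1.
csv_text = """AT,V,AP,RH,EP
14.2,41.5,1015.3,72.1,465.0
25.1,60.2,1010.8,65.4,442.3
9.8,39.9,1020.1,88.7,478.6
30.4,70.1,1008.2,55.0,433.9
"""

# Replace the StringIO buffer with the path to the real CSV file.
df = pd.read_csv(io.StringIO(csv_text))

# Check that every column is numeric and that no value is missing.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)
nan_count = int(df.isna().sum().sum())

# count, mean, std, min, quartiles, and max, as in Table 1.
summary = df.describe()

print("all numeric:", all_numeric)
print("NaN values:", nan_count)
print(summary)
```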

a. Checking skewness in data

To analyze the density distribution and spread of the data, a pair plot was drawn using the seaborn module. From Figure 1, it can be observed that the kernel density estimate (KDE) subplots, shown on the diagonal, are roughly symmetric rather than strongly left- or right-skewed. This eliminates the need for a log transformation.

Figure 1: Pairwise relationships in dataset
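
The visual check in Figure 1 can be complemented with a numeric one. The sketch below is an assumption-laden illustration rather than part of the original notebook: it uses two synthetic columns (one symmetric, one right-skewed) to show how the sample skewness from pandas separates the two cases, and how a log transformation removes the skew when it is present.

```python
import numpy as np
import pandas as pd

# Synthetic columns for illustration only (not the CCPP data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "symmetric": rng.normal(loc=20.0, scale=7.0, size=5000),
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
})

# Sample skewness per column: near 0 means roughly symmetric,
# clearly positive means a long right tail.
skewness = df.skew()

# A log transformation pulls in the right tail of the skewed column.
skew_after_log = float(np.log(df["right_skewed"]).skew())

print(skewness)
print("skewness after log transform:", round(skew_after_log, 3))
```

Applied to the real features, values close to zero would support the conclusion that no transformation is needed.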

b. Analyzing linearity trend with target variable

The regression plots in Figure 2 show how each independent variable varies with the dependent variable. The relationship with output electrical power (EP) is approximately linear in every case, with differing intercepts: the slope is negative for ambient temperature (AT) and exhaust vacuum (V), and positive for ambient pressure (AP) and relative humidity (RH).

Figure 2: Linearity trend of features with response variable
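
The sign of each trend can also be read off a one-variable least-squares fit. The following sketch uses synthetic data with arbitrary slopes, so the numbers carry no meaning for the plant; the helper name trend_slope is hypothetical and only illustrates the idea.

```python
import numpy as np


def trend_slope(x, y):
    """Return the slope of the least-squares line y = slope * x + intercept."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)


# Synthetic data for illustration only: one decreasing and one increasing trend.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=500)
y_decreasing = 100.0 - 2.0 * x + rng.normal(scale=1.0, size=500)
y_increasing = 50.0 + 0.5 * x + rng.normal(scale=1.0, size=500)

slope_neg = trend_slope(x, y_decreasing)
slope_pos = trend_slope(x, y_increasing)

print("decreasing trend slope:", round(slope_neg, 3))
print("increasing trend slope:", round(slope_pos, 3))
```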

c. Checking multicollinearity

Multicollinearity occurs when two or more input features are highly correlated with each other, in addition to being strongly correlated with the target variable. From Figure 3, it can be observed that the predictors AT and V have a correlation of 0.84. A general intuition would therefore be that including both temperature and vacuum in the regression model could lead to overfitting. However, experiments with different subsets of the independent variables showed that the model made better predictions when trained on all four ambient features.

Figure 3: Correlation matrix

Moreover, the last column of the correlation matrix confirms the observations drawn from Figure 2. The strong negative correlation of AT and V with EP agrees with the decreasing linear trend. The correlation between RH and EP is weak, owing to the high variance in humidity values (Table 1) and the scattered data (Figure 2, subplot 4).
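
A correlation matrix like the one in Figure 3 can be computed with pandas and scanned for highly correlated predictor pairs. The sketch below is illustrative only: the data are synthetic, the column names x1, x2, x3, and y are placeholders, and the 0.8 cut-off is an arbitrary assumption rather than a value taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: x2 is built to correlate with x1.
rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.85 * x1 + np.sqrt(1 - 0.85**2) * rng.normal(size=n)
x3 = rng.normal(size=n)
y = -2.0 * x1 - 1.0 * x2 + 0.3 * x3 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": y})

# Pearson correlation matrix (the pandas default).
corr = df.corr()

# Flag predictor pairs whose absolute correlation exceeds the cut-off.
predictors = ["x1", "x2", "x3"]
threshold = 0.8  # arbitrary cut-off chosen for this sketch
flagged = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(predictors)
    for b in predictors[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]

print(corr.round(2))
print("highly correlated predictor pairs:", flagged)
```

A flagged pair is only a prompt for further checks; as noted above, comparing model performance with and without the correlated feature is what settles whether to keep it.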

Step 3: Preprocessing

Since the dataset contained neither outliers nor skewed distributions, it could be treated as clean data, which saved the computational cost of data cleaning and manipulation. After extracting the independent and dependent variables, the only preprocessing step performed was feature scaling with MinMaxScaler from the scikit-learn module, i.e., the input features were scaled to the range [0, 1]. The dataset was then split into a training set and a test set. The resulting shapes are given in Table 2.

Table 2: Train-test split

Parameters	Training set	Test set
Split ratio	70%	30%
Features	(6697,4)	(2871,4)
Target	(6697,1)	(2871,1)
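
The scaling and splitting steps can be sketched as below, following the order described above (scale, then split). Random placeholder arrays stand in for the real features and target so that the shapes in Table 2 can be reproduced; the random_state value is an assumption, since the seed used in the notebook is not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder arrays with the dataset's shape (9568 rows, 4 features).
# The values are random and only roughly mimic the feature ranges.
rng = np.random.default_rng(42)
X = rng.uniform(
    low=[1.8, 25.0, 990.0, 25.0],
    high=[37.0, 82.0, 1034.0, 100.0],
    size=(9568, 4),
)
y = rng.uniform(420.0, 496.0, size=(9568, 1))

# Scale every feature to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

# 70/30 split; random_state is an arbitrary choice for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.30, random_state=0
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

A common alternative is to fit the scaler on the training set only and reuse it on the test set, which keeps test-set minima and maxima from influencing the scaling.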

Step 4: Building Machine Learning Models

We have developed the electrical power prediction system based on four different regression models, namely:

Multiple Linear Regressor
Support Vector Regressor
Random Forest Regressor (using 10 estimators)
K-Nearest Neighbors Regressor

These four models were trained on the training set, and predictions were made on the test set. Table 3 compares the true and predicted electrical outputs for the first few test samples.

Table 3: Actual EP vs Predicted EP

Actual EP (MW)	MLR (MW)	SVR (MW)	Random Forest (MW)	KNN (MW)
431.23	431.690360	446.207784	434.829	435.482
460.01	458.157572	454.278001	456.961	457.708
461.14	463.972658	456.869228	466.618	467.592
445.90	447.510	450.720681	446.643	447.510
451.29	456.906435	451.807264	461.511	458.602
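
Training the four regressors and collecting their test-set predictions can be sketched as follows. Only the 10 estimators for the random forest come from the description above; all other hyperparameters are scikit-learn defaults, which is an assumption, and the data are synthetic stand-ins for the scaled features with arbitrary coefficients.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in for the scaled features and target (illustration only).
rng = np.random.default_rng(0)
coef = np.array([-30.0, -10.0, 2.0, -3.0])  # arbitrary coefficients
X_train = rng.uniform(size=(300, 4))
X_test = rng.uniform(size=(100, 4))
y_train = 480.0 + X_train @ coef + rng.normal(scale=2.0, size=300)
y_test = 480.0 + X_test @ coef + rng.normal(scale=2.0, size=100)

# Default hyperparameters are assumed except for the forest size.
models = {
    "MLR": LinearRegression(),
    "SVR": SVR(),
    "Random Forest": RandomForestRegressor(n_estimators=10, random_state=0),
    "KNN": KNeighborsRegressor(),
}

# Fit each model on the training set and predict on the test set.
predictions = {
    name: model.fit(X_train, y_train).predict(X_test)
    for name, model in models.items()
}

for name, y_pred in predictions.items():
    print(name, np.round(y_pred[:3], 3))
```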

Step 5: Performance Evaluation

Once trained and tested, each model was evaluated using the R^2 score, the mean absolute error (MAE), and the root mean squared error (RMSE). The evaluation results are tabulated in Table 4. The Random Forest Regressor performed best, with the lowest MAE and RMSE and the highest R^2 score.

Table 4: Performance evaluation

ML Model	MAE	RMSE	R^2 score
MLR	3.5785	4.4736	0.9316
SVR	3.1967	4.1662	0.9406
Random Forest	2.9016	3.8929	0.94822
KNN	3.2245	4.2781	0.93746
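
The three metrics in Table 4 can be computed with scikit-learn as sketched below. The helper name evaluate is hypothetical, and the four-value arrays are a toy example unrelated to the plant data; RMSE is taken as the square root of the mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for one set of predictions."""
    mae = float(mean_absolute_error(y_true, y_pred))
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    r2 = float(r2_score(y_true, y_pred))
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Toy values for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

scores = evaluate(y_true, y_pred)
print(scores)
```

Calling evaluate once per model on the same test set yields one row of the table per model.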

Step 6: Cross-Validation

Ten-fold cross-validation was applied to the dataset. The corresponding results are shown in Table 5; Random Forest again scores best on all three metrics.

Table 5: Cross Validation performance evaluation

ML Model	MAE	RMSE	R^2 score
MLR	3.6278	4.5565	0.9285
SVR	3.1578	4.1636	0.9403
Random Forest	2.4329	3.4312	0.95939
KNN	2.6886	3.7312	0.9520
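
A 10-fold cross-validation that reports the same three metrics can be sketched with scikit-learn's cross_validate. The data below are synthetic, only the linear model is shown, and the shuffling and random_state settings are assumptions; the original notebook may have configured the folds differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the scaled features and target (illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
y = 480.0 + X @ np.array([-30.0, -10.0, 2.0, -3.0]) + rng.normal(scale=2.0, size=500)

# Ten folds; shuffling and the seed are assumptions made for this sketch.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# scikit-learn reports error metrics negated so that higher is always better.
scoring = {
    "MAE": "neg_mean_absolute_error",
    "RMSE": "neg_root_mean_squared_error",
    "R2": "r2",
}
results = cross_validate(LinearRegression(), X, y, cv=cv, scoring=scoring)

# Average over the folds, flipping the sign of the error metrics.
mean_mae = float(-results["test_MAE"].mean())
mean_rmse = float(-results["test_RMSE"].mean())
mean_r2 = float(results["test_R2"].mean())

print("MAE:", round(mean_mae, 4))
print("RMSE:", round(mean_rmse, 4))
print("R^2:", round(mean_r2, 4))
```

Repeating the call with each of the four regressors gives one row of Table 5 per model.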

rymshasaeed/Net-Hourly-Electrical-Power-Prediction-via-Regression
