Skip to content

celikmustafa89/usedCarDatabase

Repository files navigation

USED CAR PRICE PREDICTION using LINEAR REGRESSION

1. DATASET

1. dateCrawled : when this ad was first crawled, all field-values are taken from this date
2. name : "name" of the car
3. seller : private or dealer
4. offerType : the selling type of the car
5. price : the price on the ad to sell the car
6. abtest : unknown
7. vehicleType : type of the car. Limousine, kleinwagen, kombi, bus etc.
8. yearOfRegistration : at which year the car was first registered
9. gearbox : manuel or otomatic
10. powerPS : power of the car in PS
11. model : model of the car
12. kilometer : how many kilometers the car has driven
13. monthOfRegistration : at which month the car was first registered
14. fuelType : benzin, diesel, lpg etc
15. brand : brand of the car. Mercedes, Porsche, audi etc…
16. notRepairedDamage : if the car has a damage which is not repaired yet. Yes or no
17. dateCreated : the date for which the ad at ebay was created
18. nrOfPictures : number of pictures in the ad
19. postalCode : code that shows the location of the car
20. lastSeenOnline : when the crawler saw this ad last online

2. DATA EXPLORATION

Dataset consists of 371528 rows and 20 columns [dateCrawled, name, seller, offerType, price, abtest, vehicleType, yearOfRegistration, gearbox, powerPS, model, kilometer, monthOfRegistration, fuelType, brand, notRepairedDamage, dateCreated, nrOfPictures, postalCode, lastSeen].

Table 1
Figure 1: Missing Value

According to Figure 1:

  • 5 columns [vehicleType, gearbox, model, fuelType, notRepairedDamage] have missing values. Depending on the model, these columns can be removed completely. However, these columns could be important for the model. So, the records that have null values can be removed. By removing the null values column can be saved for the model. In the table, notRepairedDamage column has 72060 missing value which is the 19% of the whole dataset. In this case, removing each record is not the best option. It seems better to drop whole column.
  • In the whole dataset, 110572 records has at least one missing value. Removing these records are not the best option because it causes 33% data loss. So, data loss can be decreased by dropping some non-important columns or a representing value can be assigned the missing values.

According to Table 1 and the Histogram Graph of the columns:

  • dateCrawled (280500), name (233531), lastSeen (182806) columns have too many unique values. Depending on the learning model, these columns can be dropped.
Table 1
  • Histogram of nrOfPictures column and Table 1 says that this column has only one values and it has no meaning for model and visualization. This column can be dropped.
  • Histogram graph of seller column and Table 1 says that this column has 2 unique values and “gewerblich” value has only 3 records. So, this column has no distinguishing information for the model. This column can be dropped, but first the records that has “gewerblich” value should be removed.
  • Based on histogram graph of offerType, Table 1 and column description, this column has 2 unique values. “gesuch” value has only 12 records. So, this column has no distinguishing information for the model. This column can be dropped, after removing the records that has “gesuch” value.
  • Based on histogram graph of abtest and Table 1, this column has consistent values for the model.
  • Based on histogram graph of monthOfRegistration and Table 1, this column should have 12 unique values. However, it has 13 values. Also, dataset has yearOfRegistration column. So, this column can be dropped.
  • postalCode column has 8150 unique values. This information can be used to determine the location of the car. However, for the model, location is not a distinguishing information because all car has postal code in Germany. Car location does not affect the model. This column can be dropped.
  • dateCreated column has 114 unique values. This column has no distinguishing information for the model. So, this column can be dropped
  • Based on histogram of price column and its description, this column has 10772 records that has 0 value which means the car is free. Also, records have some value more than 100000 euro. It is not a consistent variable because cars can’t be that much expensive. To make dataset more consistent, prices values which are 0 and larger than 100000 can be removed. The histogram graph of price column are below. First graph shows the original distribution of the prices. Second graph shows the prices without free cars. Third graph shows the prices between 0 and 100000 euro. Other records can be removed from the dataset to have consistent values.
  • Based on histogram of yearsOfRegistration column and its description result, this column has inconsistent values. For example, it has year values larger than 2017. Also, there are some values smaller than 1900 which does not make sense because in that time engine did not invented. So, yersOfRegistration records can be limited between 1900- yearsOfRegistration-2017.
  • Based on histogram of powerPS and Table 1, this column has some inconsistent values. powerPS can’t be less than 0. Also, powerPS can’t be stronger than 1000. So, it can be limited between 0-powerPS-1000. Other records can be removed from the dataset to have consistent values.
  • Based on histogram of fuelType and Table 1, this column has 7 different value type. Apart from “benzin” and “diesel”, other values have very low records. So, these values do not mean much for the model. These records can be removed from the dataset.
  • Based on vehicleType histogram and Table 1, it has 8 different values. Values are consistent and can be used for the model.
  • Based on histogram of model and Table 1, this column has many different values. It can be useful for price prediction but to make the model simpler this column can be dropped.
  • Based on histogram of kilometer and table 1, this column has consistent values. Also, it can be useful for the model.
  • Based on histogram of gearbox and Table 1, this column has 2 different values and values are consistent.
  • Based on histogram of notRepairedDamage, this column has 2 different values. Values are consistent and can be useful for the model.
  • Based on histogram of the brand, this column has 40 different values. It has consistent records and can be useful for the model. Also, it can be said that volkswagen is the most popular car.
  • Based on year v.s. price graph, it can be said that new cars are more expensive than old ones. However, there some old cars which are also expensive.
  • Based on powerPS v.s price graph, low powerPS cars are cheaper than others.

3. PROBLEM SETTING

The problem is predicting the actual price of the car by looking given parameters.
There are 371528 records in dataset. 243422 of them are remaining after data cleaning process.
2 models are implemented.
One of them uses the 30% of the dataset as a test set.
Other one uses 33% for testing issue

4. MODELLING

Before the implementation of the model. Dataset is cleaned using python libraries, and total number of column is decreased to 10 ['vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model','kilometer', 'fuelType', 'brand', 'notRepairedDamage', 'price'].

After that, cleaned dataset is preprocessed using WEKA. WEKA attribute evaluator selected some attributes. It suggests the following attributes for the model ['yearOfRegistration', 'gearbox', 'powerPS' ,'kilometer', 'fuelType', 'notRepairedDamage].

There are 2 different implementation of linear regression model:

  • First model is implemented using python sklearn library.
  • It has 10 attributes.
    30 percentage of the dataset is used for testing.

  • Second model is implemented using WEKA.
  • Attributeselection filter of WEKA applied on the dataset.
    Filter removed the 3 attributes. Remaining 7 attributes are used for learning.
    33 percentage of the dataset is used for testing.

5. CONCLUSION and IMPROVEMENT POINTS

First 20 prediction of the both model is on the below table.
Python Sklearn Regression model has lower error than the WEKA model.
It gives pretty good result for such a simple model.
In order to improve the model:
- dataset cleaning could be more effienct. According to histogram graph of the attributes, they have lots of inconsistent values.
- age of the car attribute can improve the model. Age attribute could be extracted by using registrationDate and dateCreated attributes.
- according to attributeselection filter of the WEKA, "kilometer" and "powerPS" attributes are the most important ones.

Following results are taken from python sklearn model:

Root mean squared error: 555.6768635632562
mean_absulute_error: 397.8973442835626
mean_squared_error: 308776.7766994987

Following results are taken from WEKA model:

Root mean squared error:4958.1553
Mean absolute error: 3103.1781
Correlation coefficient: 0.7864
Relative absolute error: 56.7852 %
Root relative squared error: 61.7716 %
Total Number of Instances: 82763

PYTHON SKLEARN LINEAR REGRESSION MODEL RESULT

Root mean squared error: 555.6768635632562

inst# actual predicted error
1 400 419.435 19.435
2 400 602.408 202.408
3 624 955.720 331.720
4 2579 1690.700 888.299
5 2166 1737.427 428.572
6 1697 2051.277 354.277
7 835 839.545 4.545
8 3062 2340.980 721.019
9 2273 1970.432 302.567
10 897 1485.134 588.134
11 2006 2126.953 120.953
12 1225 1281.999 56.999
13 1149 695.919 453.080
14 1863 1228.076 634.923
15 3352 2866.412 485.587
16 2056 1947.652 108.347
17 0 664.973 664.973
18 1975 1852.990 122.009
19 932 868.003 63.996
20 3435 2710.145 724.854

WEKA MODEL RESULT

Root mean squared error: 4958.1553

inst# actual predicted error
1 9499 5840.168 -3658.832
2 7990 12515.27 4525.27
3 4000 5139.611 1139.611
4 10700 14125.198 3425.198
5 690 -3346.537 -4036.537
6 1450 3834.998 2384.998
7 1000 720.987 -279.013
8 7800 5329.186 -2470.814
9 8150 8553.781 403.781
10 1111 720.987 -390.013
11 9899 16116.077 6217.077
12 150 -4427.849 -4577.849
13 2000 5802.808 3802.808
14 15000 11530.067 -3469.933
15 8000 10208.546 2208.546
16 4190 4656.816 466.816
17 16000 20750.782 4750.782
18 6000 7567.218 1567.218
19 2800 4278.547 1478.547
20 399 1198.466 799.466