Data-Driven Decision Making: Selecting the Best Regression Model for E-commerce Sales

lucashomuniz/Project-12


✅ PROJECT-12

This project presents a complete guide to building, training, evaluating, and selecting the best of three regression models: a Benchmark Linear Regression, Ridge Regression, and LASSO Regression. We demonstrate the entire process, from defining the business problem to interpreting the model and delivering the results to the decision maker. The company in question is an e-commerce that sells products through both its website and its mobile application. To make a purchase, customers must register on the portal, using either the website or the application. Each time a customer logs in, the system records how long they remain logged in. The company also keeps sales records, including the total amount spent per month by each customer.

The aim of the project is to increase sales, given that the current budget only allows investing in either the website or the application. The goal, therefore, is to improve the customer experience while navigating the system, increasing session length, engagement, and, consequently, sales. It is important to emphasize that the data used in this project are fictitious but representative of real e-commerce data. They cover one month of portal operation, and each column of the dataset has a self-explanatory title.

Keywords: Python Language, Data Visualization, Data Analysis, Linear Regression, Benchmark, Ridge Regression, LASSO Regression, Machine Learning, e-commerce.

✅ PROCESS

Exploratory analysis is performed right at the beginning, after loading the data. In this step, cleaning processes are carried out, such as removing duplicate and missing values, along with any dataset-specific transformations. The main objective is to understand the dataframe, focusing on the types of numerical and categorical variables, their distributions, and the treatment of outliers using boxplots, the description table, and frequency counts. In exploratory analysis, it is essential to eliminate duplicate rows and duplicate columns (variables): duplicates introduce redundant information that could bias the resulting model. The objective is a generalizable model, free of unwanted biases.
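The cleaning steps above can be sketched with pandas. The dataframe and column names here are made up for illustration and do not reflect the project's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe standing in for the e-commerce dataset.
df = pd.DataFrame({
    "time_logged_app": [12.1, 11.3, np.nan, 10.8, 11.3],
    "time_logged_site": [36.9, 37.2, 38.1, np.nan, 37.2],
    "spend_total_value": [587.0, 612.5, 499.9, 540.3, 612.5],
})

df = df.drop_duplicates()           # remove duplicate rows
df = df.dropna()                    # remove rows with missing values
df = df.loc[:, ~df.T.duplicated()]  # remove duplicate columns

print(df.describe())  # description table (count, mean, quartiles, ...)
print(df.dtypes)      # variable types

# Outlier treatment: the same 1.5*IQR rule a boxplot visualizes.
q1, q3 = df["spend_total_value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["spend_total_value"] < q1 - 1.5 * iqr) |
              (df["spend_total_value"] > q3 + 1.5 * iqr)]
print(len(outliers), "outlier rows")
```
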

Through the scatter plot generated from the correlation table, it is possible to observe the interaction between the variables. Note that an increase in application login time is clearly related to an increase in the total amount spent (moderate positive correlation). In a Machine Learning regression project, it is desirable for the predictor variables to be highly correlated with the target variable. However, high correlation between predictor variables should be avoided, as it can lead to multicollinearity issues. The scatter plot also shows that the variables "client_registration_time" and "spend_total_value" have a high positive correlation: as customer registration time increases, so does the total amount spent. In other words, older customers tend to spend more.

[Figure: scatter plot of the correlations between the variables]
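A minimal sketch of computing the correlation table with pandas, on synthetic data built to mimic the relationships described above (the column names and effect sizes are illustrative assumptions, not the project's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200

# Synthetic predictors and a target that depends positively on both.
registration_time = rng.uniform(1, 6, n)
time_app = rng.uniform(8, 14, n)
spend = 60 * registration_time + 38 * time_app + rng.normal(0, 20, n)

df = pd.DataFrame({
    "client_registration_time": registration_time,
    "time_logged_app": time_app,
    "spend_total_value": spend,
})

# Correlation of each predictor with the target.
corr = df.corr()
print(corr["spend_total_value"].sort_values(ascending=False))
```

`pandas.plotting.scatter_matrix(df)` would render the corresponding scatter plot for visual inspection.
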

The subsequent step is the Feature Engineering process, in which deeper transformations are performed if necessary, along with the creation and modification of variables. During this phase, one option is to perform feature selection in order to keep the most informative variables for the Machine Learning process. Finally, one of the most relevant techniques at this stage is building the correlation table, which makes it possible to identify the strength and direction (positive or negative) of the relationships between variables, and in particular to look for evidence of multicollinearity among the predictors.
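One simple way to look for evidence of multicollinearity is to flag predictor pairs whose absolute correlation exceeds a threshold. This is a hypothetical helper, and the 0.8 cutoff is a judgment call rather than a value from the project:

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, |corr|) for every pair above the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 2)))
    return pairs

# Illustrative data: x2 is nearly a copy of x1, so that pair should be flagged.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(highly_correlated_pairs(df))
```
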

The next step is pre-processing, in which variables still in text format are converted to numeric format. This step also organizes the entire Machine Learning pipeline, including the choice of the main algorithm and the application of label encoding, normalization, standardization, and scaling. A widely used technique here is splitting the dataframe into training and testing sets. This split matters because the model is trained on the training set and later evaluated on the test set: once trained, the model cannot be evaluated on the same data it was trained on, as it is already familiar with it. To assess performance, we need new data whose outcomes are already known.
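The train/test split and scaling described above can be sketched with scikit-learn. The data is synthetic and the 70/30 split ratio is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=50, scale=10, size=(500, 3))  # three numeric predictors
y = X @ np.array([63.7, 26.2, 38.6]) + rng.normal(scale=25, size=500)

# Hold out 30% for testing; the model never sees these rows while training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the scaler on the training set only, then apply it to both splits,
# so no information leaks from the test set into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```
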


✅ CONCLUSION

Three popular Machine Learning algorithms for regression were compared: Benchmark Linear Regression, Ridge Regression, and LASSO Regression. Benchmark Linear Regression is a simple, easy-to-understand model that allows direct interpretation of the coefficients. However, it assumes a linear relationship between the variables, is sensitive to outliers, and cannot capture non-linear relationships in the data. Ridge Regression, on the other hand, incorporates a regularization term that deals with multicollinearity and reduces model complexity, which prevents overfitting and improves performance on unseen data. Despite this, Ridge Regression coefficients are less interpretable, choosing the regularization parameter can be challenging, and in some cases the penalty introduces bias in the estimated coefficients.

LASSO Regression combines regularization with automatic variable selection. This is useful for avoiding overfitting and improves model generalization. Furthermore, LASSO Regression tends to produce a sparse set of coefficients, making it easier to interpret the model and identify the most relevant variables. However, it is also sensitive to multicollinearity and may select or exclude important variables suboptimally when they are highly correlated. Choosing the regularization parameter properly remains a challenge. In summary, Benchmark Linear Regression is simple but limited by its assumption of linearity and its sensitivity to outliers; Ridge Regression deals with multicollinearity and overfitting, albeit with less interpretable coefficients; and LASSO Regression combines regularization with variable selection, producing sparser models, but is also sensitive to multicollinearity. The choice of algorithm depends on the specific characteristics and objectives of the problem at hand.

After analyzing the results, model selection was performed based on the performance metrics. Model 3, which uses LASSO Regression, presented a slightly higher error rate (RMSE) than the other models and was therefore discarded. Models 1 (Benchmark Linear Regression) and 2 (Ridge Regression) performed very similarly, with nearly identical error rates. In this scenario, model simplicity became the tie-breaker: in the absence of a significant difference in performance, it is preferable to opt for the simpler model, which is easier to understand and interpret. Model 1, the Benchmark Linear Regression, was therefore selected as the best option, considering both performance and simplicity.
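A sketch of the comparison described above, fitting the three models with scikit-learn on synthetic data and computing test RMSE. The alpha values and the data are illustrative assumptions, not the project's actual settings or results:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 4))
coefs = np.array([63.7, 26.2, 38.6, 0.7])  # made-up "true" effects
y = X @ coefs + rng.normal(scale=20, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "benchmark": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
}

# Fit each model on the training set and score it on the held-out test set.
rmse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rmse[name] = mean_squared_error(y_te, pred) ** 0.5

print(rmse)
```

The model with the lowest RMSE wins on performance; when two are nearly tied, the simpler one is preferred, as discussed above.
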

Interpreting the coefficients reveals a clear pattern: holding all other features constant, an increase of 1 unit in customer registration time is associated with an increase of R$ 63.74 in the total amount spent per customer per month. Likewise, an increase of 1 unit in the average number of clicks per session is related to an increase of R$ 26.24, and an increase of 1 unit in the total time logged into the application is associated with an increase of R$ 38.57, while an increase of 1 unit in the total time logged into the website is related to an increase of only R$ 0.68 in the total amount spent per customer per month.
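This "holding all else constant" reading of a linear coefficient can be verified directly on a fitted model. The data below is synthetic, with effect sizes merely echoing the magnitudes reported above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 1000
X = rng.normal(size=(n, 2))
# Target built with known effects (63.74 and 0.68 are illustrative).
y = 63.74 * X[:, 0] + 0.68 * X[:, 1] + rng.normal(scale=5, size=n)

model = LinearRegression().fit(X, y)
b1, b2 = model.coef_

# A 1-unit increase in feature 0, with feature 1 held fixed,
# moves the prediction by exactly the fitted coefficient b1.
x = np.array([[1.0, 2.0]])
delta = model.predict(x + [[1, 0]]) - model.predict(x)
print(round(b1, 2), round(float(delta[0]), 2))
```
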

These results show that it would be more profitable for the company to invest in updating its application, since the expected return is higher there. It is also important to implement policies that encourage customers to stay logged in longer, as this too translates into increased sales; updating the application itself can be one way to increase the time customers spend logged in. At this moment, on the other hand, it is not worth investing in updating the website, as the expected return would be minimal.