GSoC2024 Solution to DL Starter Problem #244
Open
GSoC2024 Solution to DL Starter Problem
Here I present the solution I implemented for this DL Starter Problem (GSoC2024).
Solution Description
First, I implemented a function that linearly interpolates the measurements, which gave a good first approximation. I then implemented polynomial interpolation and searched for the best polynomial degree. This method draws on a wider range of the measurements than linear interpolation, which only uses the two neighbouring points, and it improved the solution.
Next, I looked for ML models suited to predicting these measurements. I started with a Support Vector Regression (SVR) model, which outperformed the previous methods after brief parameter tuning. It extracted more detailed information from the data, fitting the measurements better and capturing complex relationships more accurately. Finally, I experimented with Deep Neural Networks, but overfitting, high computational demands, and a poor approximation of the measurements made them less feasible and less appropriate for the scope of the project.
Results
The following graphs show 3 sample galaxies for each interpolation method. Each plot compares the original measurements with the predicted ones and also shows the fitted interpolation function, to illustrate how the method adjusted to the measurements.
Linear Interpolation
We can clearly see that the interpolated points at the common wavelengths are obtained directly from the linear interpolation. This is a very simple way to obtain the measurements, although more advanced interpolation methods can offer a more reliable solution by drawing on a wider range of the measurements we have.
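As a minimal sketch of this step, assuming each galaxy's measurements are available as arrays of wavelengths and fluxes: the names `wavelengths`, `fluxes` and `common_grid` are hypothetical, the values are synthetic, and NumPy's `np.interp` stands in for whatever interpolation routine the solution actually uses.

```python
import numpy as np

# Synthetic stand-in for one galaxy's measurements (not the project's data).
wavelengths = np.array([400.0, 500.0, 650.0, 800.0])  # must be increasing
fluxes = np.array([1.0, 2.0, 1.5, 3.0])

# Hypothetical common wavelength grid shared by all galaxies.
common_grid = np.array([450.0, 500.0, 725.0])

# np.interp connects neighbouring measurements with straight lines and
# evaluates those lines at the requested wavelengths.
interpolated = np.interp(common_grid, wavelengths, fluxes)
print(interpolated)
```

Each interpolated value depends only on the two measurements that bracket it, which is why the method ignores the overall trend of the data.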
Polynomial Interpolation
The results obtained with polynomial interpolation appear to be better grounded in the information and potential relationships between measurements than those of linear interpolation. This method offers a well-adjusted interpolation that follows both the general trend and the specific data points closely.
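One possible way to explore the polynomial degree is sketched below. It is an illustration under assumptions, not the code of this solution: the data are synthetic, the held-out split and the degree range 1 to 8 are arbitrary choices, and `validation_rmse` is a hypothetical helper. It uses `numpy.polynomial.Polynomial.fit`, which rescales the wavelengths internally for numerical stability.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for one galaxy's measurements (not the project's data).
rng = np.random.default_rng(0)
wavelengths = np.linspace(400.0, 900.0, 30)
fluxes = 1.0 + 0.5 * np.sin(wavelengths / 80.0)
fluxes = fluxes + rng.normal(scale=0.05, size=wavelengths.size)

# Hold out every third point to judge each degree on data it did not see.
val_mask = np.arange(wavelengths.size) % 3 == 1
train_x, train_y = wavelengths[~val_mask], fluxes[~val_mask]
val_x, val_y = wavelengths[val_mask], fluxes[val_mask]


def validation_rmse(degree):
    """Fit a polynomial on the training points, score it on held-out points."""
    poly = Polynomial.fit(train_x, train_y, deg=degree)
    residuals = poly(val_x) - val_y
    return float(np.sqrt(np.mean(residuals ** 2)))


# Assumed search range; too low underfits, too high overfits.
degrees = range(1, 9)
scores = {degree: validation_rmse(degree) for degree in degrees}
best_degree = min(scores, key=scores.get)

# Refit on all measurements with the selected degree.
best_poly = Polynomial.fit(wavelengths, fluxes, deg=best_degree)
print(best_degree, scores[best_degree])
```

Scoring on held-out points rather than on the fitted points is what keeps the search from always preferring the highest degree.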
SVR Interpolation
Interpolating with a Support Vector Regression (SVR) model gives significantly improved results, indicating that this approach effectively leverages a broader range of information. Unlike simpler interpolation methods, SVR captures complex relationships within the data by learning a detailed curve that better represents the actual wavelength functions. The total computation time with this model is affordable, at under approximately 5 minutes.
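A minimal sketch of SVR-based interpolation with scikit-learn follows. The kernel and the values of `C`, `epsilon` and `gamma` are assumptions chosen for illustration, not the tuned parameters of this solution, and the data are synthetic.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for one galaxy's measurements (not the project's data).
rng = np.random.default_rng(0)
wavelengths = np.linspace(400.0, 900.0, 40)
fluxes = 1.0 + 0.5 * np.sin(wavelengths / 80.0)
fluxes = fluxes + rng.normal(scale=0.05, size=wavelengths.size)

# Scaling the wavelengths first keeps the RBF kernel width meaningful.
# Hyperparameter values here are illustrative assumptions.
model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma="scale"),
)

# scikit-learn expects a 2D feature matrix, hence the reshape.
model.fit(wavelengths.reshape(-1, 1), fluxes)

# Hypothetical common wavelength grid on which to predict.
common_grid = np.linspace(420.0, 880.0, 10)
predicted = model.predict(common_grid.reshape(-1, 1))
print(predicted)
```

The `epsilon` parameter sets a tolerance band around the curve inside which errors are ignored, so the fit follows the trend without chasing every noisy point.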
DNN Interpolation
With a deep neural network model, the fit is less accurate. With extensive training the model tends to overfit, producing a curve that resembles linear interpolation, while with insufficient training its predictions do not align well with the expected values. Moreover, the training demands are high for the scope of this project, which imposes practical constraints on time and computational resources. This suggests that although deep neural networks offer powerful modeling capabilities, they may not be the most efficient or effective choice for projects with limited resources or for data with these specific underlying patterns.
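The trade-off between too little and too much training can be observed by comparing the error on the fitted points with the error on held-out points. The sketch below is only an illustration: it uses scikit-learn's `MLPRegressor` as a small stand-in network (the framework and architecture used in this solution are not stated here), synthetic data, and arbitrary iteration budgets; `train_and_validation_rmse` is a hypothetical helper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one galaxy's measurements (not the project's data).
rng = np.random.default_rng(0)
wavelengths = np.linspace(400.0, 900.0, 30)
fluxes = 1.0 + 0.5 * np.sin(wavelengths / 80.0)
fluxes = fluxes + rng.normal(scale=0.05, size=wavelengths.size)

# Hold out every third point to measure how well the network generalises.
val_mask = np.arange(wavelengths.size) % 3 == 1
train_x = wavelengths[~val_mask].reshape(-1, 1)
train_y = fluxes[~val_mask]
val_x = wavelengths[val_mask].reshape(-1, 1)
val_y = fluxes[val_mask]


def rmse(prediction, target):
    return float(np.sqrt(np.mean((prediction - target) ** 2)))


def train_and_validation_rmse(max_iter):
    """Train with a given iteration budget, return (train, validation) RMSE."""
    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(64, 64), solver="lbfgs",
                     max_iter=max_iter, random_state=0),
    )
    model.fit(train_x, train_y)
    return rmse(model.predict(train_x), train_y), rmse(model.predict(val_x), val_y)


# Assumed budgets: short, moderate and long training.
# Short budgets may emit a ConvergenceWarning, which is expected here.
results = {budget: train_and_validation_rmse(budget) for budget in (5, 50, 2000)}
for budget, (train_err, val_err) in results.items():
    print(budget, train_err, val_err)
```

A training error that keeps falling while the held-out error stalls or rises is the usual sign of overfitting; both errors staying high points to insufficient training.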
Final Conclusion
After reviewing the results of the different interpolation methods, the Support Vector Regression (SVR) model stands out as the most promising. It appears to draw on a broader range of information from the measurements and fits them with more adaptability and precision than the other techniques. Polynomial interpolation required a careful choice of degree to avoid overfitting or underfitting, and its results were limited by the properties of polynomials. The deep neural network model suffered from overfitting and high computational demands. The SVR model, by contrast, captures the complex relationships within the data while keeping the training cost affordable.
Future Improvements
With more time, the parameters of the ML methods could be fine-tuned using cross-validation on the data we have. I could also investigate other ML models that have given good results on similar problems in the past. Finally, another approach would be to use pretrained interpolation models, or to search for similar data on which to train the ML models and try to improve their results.
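As a sketch of what cross-validated fine-tuning could look like for the SVR model, the example below uses scikit-learn's `GridSearchCV`. The parameter grid, the 5-fold split and the synthetic data are assumptions for illustration, not values taken from this solution.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for one galaxy's measurements (not the project's data).
rng = np.random.default_rng(0)
wavelengths = np.linspace(400.0, 900.0, 40)
fluxes = 1.0 + 0.5 * np.sin(wavelengths / 80.0)
fluxes = fluxes + rng.normal(scale=0.05, size=wavelengths.size)
X = wavelengths.reshape(-1, 1)

pipeline = Pipeline([("scale", StandardScaler()), ("svr", SVR(kernel="rbf"))])

# Hypothetical grid; step names prefix the parameters ("svr__C", ...).
param_grid = {
    "svr__C": [0.1, 1.0, 10.0],
    "svr__epsilon": [0.01, 0.1],
    "svr__gamma": ["scale", 0.1, 1.0],
}

# Shuffled folds, so each fold spans the whole wavelength range.
search = GridSearchCV(
    pipeline,
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",
)
search.fit(X, fluxes)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated RMSE of the best setting
```

Scikit-learn reports error metrics as negative scores so that higher is always better, which is why the sign is flipped before printing.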
Issue number: #243