Reproducible code #61

Open
eshom opened this issue Jul 30, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@eshom
Collaborator

eshom commented Jul 30, 2021

I think that, as a standard, all scripts should be completely independent and reproducible, i.e. people should be able to copy and paste the code into their R REPL session without errors. This is currently not the case for many scripts in this repo. Instead of supplying example data, many algorithms are written as "templates" where users have to supply their own data. However, there is no information about what structure that data should have.

R has many built-in datasets, so these can be used to run the algorithms. If a script is just a function definition, it should include an example usage of the function.
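
A minimal sketch of what such a copy-and-paste script could look like, assuming the built-in `iris` dataset and base R's `kmeans()` as stand-ins; the seed value and the choice of three clusters are illustrative and not taken from any script in the repo:

```r
# Self-contained clustering example: no external data, no extra packages.
# iris ships with R, so this runs as-is in a fresh session.
set.seed(42)                      # fixed seed so the result is repeatable

features <- iris[, 1:4]           # the four numeric measurement columns
fit <- kmeans(features, centers = 3, nstart = 10)

# Compare the clusters found with the known species labels
table(cluster = fit$cluster, species = iris$Species)
```

Because the data and the seed are both part of the script, two people pasting it into separate sessions should see the same table.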

I could list here all scripts that need to be written this way.

What do you think?

@eshom eshom added the enhancement New feature or request label Jul 30, 2021
@eshom
Collaborator Author

eshom commented Jul 30, 2021

  • ./Data-Preprocessing/lasso.R
  • ./Data-Preprocessing/K_Folds.R
  • ./Data-Preprocessing/data_processing.R
  • ./Data-Preprocessing/dimensionality_reduction_algorithms.R
  • ./Classification-Algorithms/lasso.R
  • ./Classification-Algorithms/decision_tree.R
  • ./Classification-Algorithms/KNN.R
  • ./Classification-Algorithms/gradient_boosting_algorithms.R
  • ./Classification-Algorithms/LightGBM.R
  • ./Classification-Algorithms/SVM.R
  • ./Classification-Algorithms/xgboost.R
  • ./Classification-Algorithms/naive_bayes.R
  • ./Classification-Algorithms/random_forest.R
  • ./Clustering-Algorithms/K-Means.R
  • ./Clustering-Algorithms/dbscan_clustering.R
  • ./Clustering-Algorithms/gmm.R
  • ./Clustering-Algorithms/pam.R
  • ./Clustering-Algorithms/kmeans_raw_R.R
  • ./Association-Algorithms/apriori.R
  • ./Regression-Algorithms/logistic_regression2.R
  • ./Regression-Algorithms/logistic_regression.R
  • ./Regression-Algorithms/linear_regression.R
  • ./Regression-Algorithms/KNN.R
  • ./Regression-Algorithms/gradient_boosting_algorithms.R
  • ./Regression-Algorithms/LightGBM.R
  • ./Regression-Algorithms/ANN.R
  • ./Regression-Algorithms/multiple_linear_regression.R
  • ./Regression-Algorithms/linearRegressionRawR.R
  • ./Data-Manipulation/OneHotEncode.R
  • ./Data-Manipulation/LabelEncode.R

@Panquesito7 Panquesito7 pinned this issue Jul 30, 2021
@siriak
Member

siriak commented Aug 1, 2021

How can the scripts be tested if they don't accept data as arguments? I think we need to add unit tests instead. They will test our code and provide users with examples at the same time.
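
As a rough sketch of that idea, the snippet below defines an algorithm as a function that takes its data as an argument and then checks it with base R's `stopifnot()`. The function `min_max_scale()` is hypothetical and not a script from the repo; `mtcars` is a built-in dataset used only for illustration:

```r
# Hypothetical algorithm written as a function that accepts data
min_max_scale <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# The test doubles as a usage example
scaled <- min_max_scale(mtcars$mpg)
stopifnot(
  length(scaled) == nrow(mtcars),  # one output value per input value
  min(scaled) == 0,                # smallest value maps to 0
  max(scaled) == 1                 # largest value maps to 1
)
```

A dedicated framework such as testthat could replace `stopifnot()` here; base R is used only to keep the sketch free of dependencies.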

@eshom
Collaborator Author

eshom commented Aug 1, 2021

I think this can be part of the documentation solution we talked about in #59. Using knitr, we can turn scripts into HTML reports, which would nicely incorporate example output. Errors caused by bad scripts can be handled, printed, and reviewed. I can write the R code for this, but I'm not sure how to set up GitHub Actions correctly.

@siriak
Member

siriak commented Aug 1, 2021

So you suggest having algorithms separated from data and unit tests that will show usage of the algorithms? And the tests can be transformed into HTML reports for convenience? Sounds good to me

@eshom
Collaborator Author

eshom commented Aug 1, 2021

Hmm, not exactly. What I mean is that specially formatted scripts can be turned into HTML reports (https://rdrr.io/cran/knitr/man/spin.html). Data would still need to be part of the algorithms. Because this function runs the actual script while compiling the report, an error is thrown if there is any problem with the script. That error can be part of a test. At the same time, good scripts would compile to nice HTML reports.
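
The "an error becomes a test failure" part can be sketched in base R without knitr. The helper `runs_cleanly()` and both temporary scripts below are hypothetical; they only illustrate running a script and catching its error, which is the step a spin-based check would build on:

```r
# Hypothetical helper: run one script in a fresh environment and report
# whether it finished without an error.
runs_cleanly <- function(path) {
  tryCatch({
    source(path, local = new.env())
    TRUE
  }, error = function(e) {
    message("Script failed: ", conditionMessage(e))
    FALSE
  })
}

# A tiny script in spin format: lines starting with #' become report text,
# but to plain R they are ordinary comments.
good_script <- tempfile(fileext = ".R")
writeLines(c(
  "#' # Summary of a built-in dataset",
  "summary(cars$speed)"
), good_script)

# A "template" script that depends on data nobody supplied
bad_script <- tempfile(fileext = ".R")
writeLines("summary(undefined_data)", bad_script)

good_ok <- runs_cleanly(good_script)  # TRUE
bad_ok  <- runs_cleanly(bad_script)   # FALSE, and the error message is printed
```

With knitr installed, calling `knitr::spin()` on `good_script` would compile the same file into a report rather than only checking that it runs.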

It would make more sense once we have a prototype running in https://github.com/Panquesito7/R/tree/documentation_stuff

@alexgarland
Contributor

I agree with you on this fundamental issue; for linearRegressionRaw.R, I replaced a reference to the diamonds dataset with a specifically simulated and reproducible (via a set seed) synthetic dataset.
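
A minimal sketch of that kind of seeded synthetic dataset, assuming a simple linear relationship; the seed, sample size, and coefficients here are illustrative and are not the values used in the actual script:

```r
# Simulate a small regression dataset that is identical on every run
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n, sd = 1)   # true intercept 2, true slope 3
sim_data <- data.frame(x = x, y = y)

# Fit with base R only; the estimates should land near the true values
fit <- lm(y ~ x, data = sim_data)
coef(fit)
```

Knowing the true coefficients also gives a built-in sanity check that a dataset like diamonds cannot offer.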

Half of the challenge here is going to be eliminating extraneous library calls, such as with the tidyverse functions and datasets.

@eshom
Collaborator Author

eshom commented Aug 1, 2021

I personally don't mind if third-party packages are used, but either the include.only argument of library() should be used, so that only the objects that appear in the code are attached to the search path, or, preferably, library() calls should be replaced entirely with the double colon operator (::) to make everything more explicit.

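In either case, there should be some check that the required packages are installed. Something like:

# Install ggplot2 only if it is missing, then attach it
if (!requireNamespace("ggplot2", quietly = TRUE)) {
    install.packages("ggplot2")
}
library(ggplot2)

# The rest of the code
# ...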
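
The two options mentioned above can be sketched with the tools package, which ships with R, so nothing needs to be installed. The file name "model.R" is just an illustrative string, and `include.only` requires R 3.6.0 or later:

```r
# Option 1: attach only the names the script actually uses
library(tools, include.only = "file_ext")
ext <- file_ext("model.R")                     # "R"

# Option 2: attach nothing and qualify each call with ::
stem <- tools::file_path_sans_ext("model.R")   # "model"
```

The second form makes it obvious at every call site which package a function comes from, at the cost of slightly longer lines.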


3 participants