New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
QuadratiK: A Collection of Methods Using Kernel-Based Quadratic Distances for Statistical Inference and Clustering #632
Comments
@ropensci-review-bot check srr |
1 similar comment
@ropensci-review-bot check srr |
'srr' standards compliance:
✔️ This package complies with > 50% of all standads and may be submitted. |
Thanks for the submission @giovsaraceno ! I'm getting some advice from the other editors about your question. One thing that would be really helpful - could you push up your documentation to a GitHub page? From the |
Hi @giovsaraceno, Mark here from the rOpenSci stats team to answer your question. We've done our best to clarify the role of Probability Distributions Standards:
So packages should generally fit within some main category, with Probability Distributions being an additional category. In your case, Dimensionality Reduction seems like the appropriate main category, but it seems like your package would also fit within Probability Distributions. Given that, the next step would be for you to estimate what proportion of those standards you think might apply to your package? Our general rule-of-thumb is that at least 50% should apply, but for Probability Distributions as an additional category, that figure may be lower. We are particularly keen to document compliance with this category, because it is where our standards have a large overlap with many core routines of the R language itself. As always, we encourage feedback on our standards, so please also feel very welcome to open issues in the Stats Software repository, or add comments or questions in the discussion pages. Thanks for you submission! |
Thanks @ldecicco-USGS for your guidance during this process. Following your suggestion, I've now pushed the documentation for the QuadratiK package to a GitHub page. You can find it displayed on the main page of the GitHub repository. Here's the direct link for easy access: QuadratiK package GitHub page. |
Hi Mark, Thank you for the additional clarification regarding the standards for Probability Distributions and their integration with other statistical software categories. Following your guidance, we have conducted a thorough review of the standards applicable to the Probability Distributions category in relation to our package. Based on our assessment, we found that the current version of our package satisfies 14% of the standards directly. Furthermore, we identified that an additional 36% of the standards could potentially apply to our package, but this would require us to make some enhancements, including the addition of checks and test codes. We feel the remaining 50% of the standards are not applicable to our package. We are committed to improve our package and aim to fulfill the applicable standards. To this end, we plan to work on a separate branch dedicated to implementing these enhancements, with the goal of meeting the 50% of the standards for the Probability Distributions category. Before proceeding, we would greatly appreciate your opinion on this plan. Thank you for your time and support. Giovanni |
Hi Mark, We addressed the enhancements we discussed, and our package now meets 50% of the standards for the Probability Distributions category. These updates are in the probability-distributions-standards branch of our repository. Thank you, Giovanni |
Hi Giovanni, your Those are very minor points which you may ignore for the moment if you'd like to get the review process started, or you could quickly address them straight away if you prefer. Either way, feel free to ask the bot to |
Hi, thank you for your suggestions on our compliance statements and testing practices. As for comparing results from different distributions, the rpkb function in our package provides options to generate random observations using three distinct algorithms based on different probability distributions. We've conducted tests to confirm that each method functions as intended. We added also a new vignette in which the methods are compared by graphically displaying the generated points. Is this what you are looking for? We're inclined to address them promptly. We would appreciate if we can get an answer to the questions posed above so that we can start the review process. |
Sorry we didn't reply faster, @giovsaraceno. In, say, a single-variable distribution tests might include:
|
Thanks @noamross for your explanation. We have taken your suggestions into consideration and have implemented them accordingly. |
@ropensci-review-bot check package |
Thanks, about to send the query. |
🚀 The following problems were found in your submission template:
👋 |
Checks for QuadratiK (v1.0.0)git hash: 21541a40
Important: All failing checks above must be addressed prior to proceeding (Checks marked with 👀 may be optionally addressed.) Package License: GPL (>= 3) 1. rOpenSci Statistical Standards (
|
type | package | ncalls |
---|---|---|
internal | base | 382 |
internal | QuadratiK | 50 |
internal | utils | 10 |
internal | grDevices | 1 |
imports | stats | 29 |
imports | methods | 26 |
imports | sn | 14 |
imports | ggpp | 2 |
imports | cluster | 1 |
imports | mclust | 1 |
imports | moments | 1 |
imports | rrcov | 1 |
imports | clusterRepro | NA |
imports | doParallel | NA |
imports | foreach | NA |
imports | ggplot2 | NA |
imports | ggpubr | NA |
imports | MASS | NA |
imports | movMF | NA |
imports | mvtnorm | NA |
imports | Rcpp | NA |
imports | RcppEigen | NA |
imports | rgl | NA |
imports | rlecuyer | NA |
imports | Tinflex | NA |
suggests | knitr | NA |
suggests | rmarkdown | NA |
suggests | roxygen2 | NA |
suggests | testthat | NA |
linking_to | Rcpp | NA |
linking_to | RcppEigen | NA |
Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.
base
list (46), data.frame (26), matrix (24), nrow (23), t (20), log (19), rep (19), ncol (18), c (14), numeric (12), for (11), sqrt (10), length (8), mean (8), as.numeric (6), return (6), sample (6), T (6), vapply (6), apply (5), as.factor (5), table (5), unique (5), as.vector (4), cumsum (4), exp (4), rbind (4), sum (4), as.matrix (3), kappa (3), lapply (3), lgamma (3), pi (3), q (3), replace (3), unlist (3), as.integer (2), diag (2), max (2), readline (2), rownames (2), rowSums (2), which (2), which.max (2), with (2), beta (1), colMeans (1), expand.grid (1), F (1), factor (1), if (1), levels (1), norm (1), rep.int (1), round (1), seq_len (1), subset (1)
QuadratiK
DOF (3), kbNormTest (3), normal_CV (3), C_d_lambda (2), compute_CV (2), cv_ksample (2), d2lpdf (2), dlpdf (2), lpdf (2), norm_vec (2), objective_norm (2), poisson_CV (2), rejvmf (2), sample_hypersphere (2), statPoissonUnif (2), compare_qq (1), compute_stats (1), computeKernelMatrix (1), computePoissonMatrix (1), dpkb (1), elbowMethod (1), generate_SN (1), NonparamCentering (1), objective_2 (1), objective_k (1), ParamCentering (1), pkbc_validation (1), rejacg (1), rejpsaw (1), select_h (1), stat_ksample_cpp (1), stat2sample (1)
stats
df (12), quantile (4), dist (2), rnorm (2), runif (2), aggregate (1), cov (1), D (1), qchisq (1), sd (1), sigma (1), uniroot (1)
methods
setMethod (12), setGeneric (8), new (3), setClass (3)
sn
rmsn (14)
utils
data (8), prompt (2)
ggpp
annotate (2)
cluster
silhouette (1)
grDevices
colorRampPalette (1)
mclust
adjustedRandIndex (1)
moments
skewness (1)
rrcov
PcaLocantore (1)
NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.
3. Statistical Properties
This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.
Details of statistical properties (click to open)
The package has:
- code in C++ (17% in 2 files) and R (83% in 12 files)
- 4 authors
- 5 vignettes
- 1 internal data file
- 21 imported packages
- 24 exported functions (median 14 lines of code)
- 56 non-exported functions in R (median 16 lines of code)
- 16 R functions (median 13 lines of code)
Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:
loc
= "Lines of Code"fn
= "function"exp
/not_exp
= exported / not exported
All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown()
function
The final measure (fn_call_network_size
) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.
measure | value | percentile | noteworthy |
---|---|---|---|
files_R | 12 | 65.5 | |
files_src | 2 | 79.1 | |
files_vignettes | 5 | 96.9 | |
files_tests | 10 | 90.7 | |
loc_R | 1408 | 76.6 | |
loc_src | 281 | 34.1 | |
loc_vignettes | 235 | 55.3 | |
loc_tests | 394 | 70.0 | |
num_vignettes | 5 | 97.9 | TRUE |
data_size_total | 11842 | 71.9 | |
data_size_median | 11842 | 80.1 | |
n_fns_r | 80 | 70.4 | |
n_fns_r_exported | 24 | 72.5 | |
n_fns_r_not_exported | 56 | 70.6 | |
n_fns_src | 16 | 40.4 | |
n_fns_per_file_r | 5 | 67.1 | |
n_fns_per_file_src | 8 | 69.1 | |
num_params_per_fn | 5 | 69.6 | |
loc_per_fn_r | 15 | 46.1 | |
loc_per_fn_r_exp | 14 | 35.1 | |
loc_per_fn_r_not_exp | 16 | 54.8 | |
loc_per_fn_src | 13 | 41.6 | |
rel_whitespace_R | 24 | 82.7 | |
rel_whitespace_src | 18 | 36.2 | |
rel_whitespace_vignettes | 16 | 29.2 | |
rel_whitespace_tests | 34 | 78.1 | |
doclines_per_fn_exp | 50 | 62.8 | |
doclines_per_fn_not_exp | 0 | 0.0 | TRUE |
fn_call_network_size | 50 | 66.3 |
3a. Network visualisation
Click to see the interactive network visualisation of calls between objects in package
4. goodpractice
and other checks
Details of goodpractice checks (click to open)
3a. Continuous Integration Badges
(There do not appear to be any)
GitHub Workflow Results
id | name | conclusion | sha | run_number | date |
---|---|---|---|---|---|
8851531581 | pages build and deployment | success | 21541a | 25 | 2024-04-26 |
8851531648 | pkgcheck | failure | 21541a | 60 | 2024-04-26 |
8851531643 | pkgdown | success | 21541a | 25 | 2024-04-26 |
8851531649 | R-CMD-check | success | 21541a | 83 | 2024-04-26 |
8851531642 | test-coverage | success | 21541a | 83 | 2024-04-26 |
3b. goodpractice
results
R CMD check
with rcmdcheck
R CMD check generated the following warning:
- checking whether package ‘QuadratiK’ can be installed ... WARNING
Found the following significant warnings:
Warning: 'rgl.init' failed, running with 'rgl.useNULL = TRUE'.
See ‘/tmp/RtmpQrtXuf/file133861d90686/QuadratiK.Rcheck/00install.out’ for details.
R CMD check generated the following note:
- checking installed package size ... NOTE
installed size is 16.6Mb
sub-directories of 1Mb or more:
libs 15.0Mb
R CMD check generated the following check_fails:
- no_import_package_as_a_whole
- rcmdcheck_examples_run_without_warnings
- rcmdcheck_significant_compilation_warnings
- rcmdcheck_reasonable_installed_size
Test coverage with covr
Package coverage: 78.21
Cyclocomplexity with cyclocomp
The following function have cyclocomplexity >= 15:
function | cyclocomplexity |
---|---|
select_h | 46 |
Static code analyses with lintr
lintr found the following 20 potential issues:
message | number of times |
---|---|
Avoid library() and require() calls in packages | 9 |
Lines should not be more than 80 characters. | 9 |
Use <-, not =, for assignment. | 2 |
5. Other Checks
Details of other checks (click to open)
✖️ Package contains the following unexpected files:
- src/RcppExports.o
- src/kernel_function.o
✖️ The following function name is duplicated in other packages:
-
extract_stats
from ggstatsplot
Package Versions
package | version |
---|---|
pkgstats | 0.1.3.13 |
pkgcheck | 0.1.2.21 |
srr | 0.1.2.9 |
Editor-in-Chief Instructions:
Processing may not proceed until the items marked with ✖️ have been resolved.
We have solved all the marked items and we are now ready to request the automatic bot check. |
Submitting Author Giovanni Saraceno
Submitting Author Github Handle: @giovsaraceno
Other Package Authors Github handles: @rmj3197
Repository: https://github.com/giovsaraceno/QuadratiK-package§
Submission type: Stats
Language: en
Scope
Data Lifecycle Packages
Statistical Packages
Bayesian and Monte Carlo Routines
Dimensionality Reduction, Clustering, and Unsupervised Learning
Machine Learning
Regression and Supervised Learning
Exploratory Data Analysis (EDA) and Summary Statistics
Spatial Analyses
Time Series Analyses
Explain how and why the package falls under these categories (briefly, 1-2 sentences). Please note any areas you are unsure of:
This category is the most suitable due to QuadratiK's clustering technique, specifically designed for spherical data. The package's clustering algorithm falls within the realm of unsupervised learning, where the focus is on identifying groupings in the data without pre-labeled categories. The two- and k-sample tests serve as additional tools for testing the differences between the identified groups.
Following the link https://stats-devguide.ropensci.org/standards.html we noticed in the "Table of contents" that category 6.9 refers to Probability Distribution. We are unsure how we fit and if we fit this category. Can you please advise?
Yes, we have incorporated documentation of standards into our QuadratiK package by utilizing the srr package, considering the categories "General" and "Dimensionality Reduction, Clustering, and Unsupervised Learning", in line with the recommendations provided in the rOpenSci Statistical Software Peer Review Guide.
The QuadratiK package offers robust tools for goodness-of-fit testing, a fundamental aspect in statistical analysis, where accurately assessing the fit of probability distributions is essential. This is especially critical in research domains where model accuracy has direct implications on conclusions and further research directions. Spherical data structures are common in fields such as biology, geosciences and astronomy, where data points are naturally mapped to a sphere. QuadratiK provides a tailored approach to effectively handle and interpret these data. Furthermore, this package is also of particular interest to professionals in health and biological sciences, where understanding and interpreting spherical data can be crucial in studies ranging from molecular biology to epidemiology. Moreover, its implementation in both R and Python broadens its accessibility, catering to a wide audience accustomed to these popular programming languages.
Yes, there are other R packages that address goodness-of-fit (GoF) testing and multivariate analysis. Notable among these are the energy package for energy statistics-based tests. The function kmmd in the kernlab package offers a kernel-based test which has similar mathematical formulation. The package sphunif provides all the tests for uniformity on the sphere available in literature. The list of implemented tests includes the test for uniformity based on the Poisson kernel. However, there are fundamental differences between the methods encoded in the aforementioned packages and those offered in the QuadratiK package.
QuadratiK uniquely focuses on kernel-based quadratic distances methods for GoF testing, offering a comprehensive set of tools for one-sample, two-sample, and k-sample tests. This specialization provides more nuanced and robust methodologies for statistical analysis, especially in complex multivariate contexts. QuadratiK is optimized for high-dimensional datasets, employing efficient C++ implementations. This makes it particularly suitable for contemporary large-scale data analysis challenges. The package introduces advanced methods for kernel centering and critical value computation, as well as optimal tuning parameter selection based on midpower analysis. QuadratiK includes a unique clustering algorithm for spherical data. These innovations are not covered in other available packages. With implementations in both R and Python, QuadratiK appeals to a wider audience across different programming communities. We also provide a user-friendly dashboard application which further enhances accessibility, catering to users with varying levels of statistical and programming expertise.
In summary there are fundamental differences between QuadratiK and all existing R packages:
Yes, our package, QuadratiK, is compliant with the rOpenSci guidelines on Ethics, Data Privacy, and Human Subjects Research. We have carefully considered and adhered to ethical standards and data privacy laws relevant to our work.
Please see the question posed in the first bullet.
The text was updated successfully, but these errors were encountered: