Suggestion #18

Open
statquant opened this issue Aug 12, 2021 · 31 comments

Comments

@statquant

statquant commented Aug 12, 2021

Hello, I wanted to suggest some package additions and removals:

  • nanotime is the first package to handle nanoseconds in R; I think it should be used in place of clock
  • clustermq is a package that leverages ZeroMQ to send function calls to workers on a grid with no files involved
  • qs is a binary format like fst that supports all objects (fst supports only data.frames), though without random access
  • lubridate is (I think) fairly slow, so I am surprised it’s on the list
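
A minimal sketch of the qs versus fst distinction drawn in the list above. It assumes `qs::qsave()`/`qs::qread()` and `fst::write_fst()`/`fst::read_fst()` are available when those packages are installed; the base R `saveRDS()` round trip is only a stand-in so that the example runs without either package, and the object `obj` is made up for illustration.

```r
# Hypothetical nested object: a list, which fst (data frames only) cannot store.
obj <- list(model = "demo",
            params = c(a = 1, b = 2),
            data = data.frame(x = 1:5, y = letters[1:5]))

# Base R stand-in: serialise any R object to a single binary file.
rds_path <- tempfile(fileext = ".rds")
saveRDS(obj, rds_path)
obj_rds <- readRDS(rds_path)

# qs (only if installed): any object, but the file is read back as a whole.
if (requireNamespace("qs", quietly = TRUE)) {
  qs_path <- tempfile(fileext = ".qs")
  qs::qsave(obj, qs_path)
  obj_qs <- qs::qread(qs_path)
}

# fst (only if installed): data frames only, but a row range can be read.
if (requireNamespace("fst", quietly = TRUE)) {
  fst_path <- tempfile(fileext = ".fst")
  fst::write_fst(obj$data, fst_path)
  rows_2_to_4 <- fst::read_fst(fst_path, from = 2, to = 4)
}
```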

Many thanks for this universe. I did not know kit; it's great to discover new packages.
Regards

@SebKrantz
Member

SebKrantz commented Aug 12, 2021

Hello, thank you! I already considered nanotime, but did not include it for now because my thinking was it provides a specialized class that few people require and those that require it know about it. But I can include it for sure. Lubridate and ggplot2 are on the list because I haven’t quite found convenient replacements for them, and the fastverse should still be somewhat well rounded. I don’t know about the other packages, but you can send a pull request to the development branch, making a new category for reading and writing files.

Otherwise I’ll look at them during the weekend...

@statquant
Author

Hello, I personally never found lubridate very helpful. I totally agree that ggplot2 should just be there because there is nothing else quite like it. I will do a PR (adding links to each repo too while we're at it).

@SebKrantz
Member

Great, thanks! So what do you use for standard Date and POSIXct manipulation? I know clock does it, but it is mostly geared towards its own set of classes.

@statquant
Author

For Date I use base R's Date or, these days, data.table::IDate, which makes sure the internal representation is integer so it is handled much faster by data.table. Most of the time I only need to add or remove a number of days. What operations do you usually have to do?
For timestamps I use a mix of POSIXct and nanotime, which are interchangeable. Note that there is also nanoduration, which can serve as a proxy for "time of day" and is very useful.
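
A small sketch of the storage difference mentioned above, assuming `data.table::as.IDate()` is available when data.table is installed. The date itself is arbitrary; only the base R part runs unconditionally.

```r
# Base R Date: stored as a double (days since 1970-01-01).
d <- as.Date("2021-08-12")
typeof(d)            # "double"
d_plus_30 <- d + 30L # adding a number of days is plain arithmetic

# data.table's IDate (only if installed): same idea with integer storage.
if (requireNamespace("data.table", quietly = TRUE)) {
  id <- data.table::as.IDate("2021-08-12")
  typeof(id)            # "integer"
  id_plus_30 <- id + 30L
}
```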

@nickforr

Hope nobody minds me jumping into this issue, but is it worth mentioning the arrow package alongside fst and qs, as the parquet format gives options for sharing binary files with Python etc. (apologies if fst does this and I’ve missed that)?

@SebKrantz
Member

SebKrantz commented Aug 12, 2021

Thanks @nickforr, as I said, a category for reading and writing files can be added featuring arrow, vroom etc.; just make a PR to the development branch. Also mention the number of dependencies; you can use fastverse_deps(pck, recursive = TRUE).
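
To illustrate what a recursive dependency count means, here is a sketch on a toy dependency graph. The package names `pkgA` to `pkgE` and the helper `recursive_deps()` are hypothetical, and this is not how `fastverse_deps()` is implemented; it only shows why the recursive count can exceed the number of direct imports.

```r
# Hypothetical dependency graph: each entry lists a package's direct imports.
toy_graph <- list(
  pkgA = c("pkgB", "pkgC"),
  pkgB = c("pkgD"),
  pkgC = c("pkgD", "pkgE"),
  pkgD = character(0),
  pkgE = character(0)
)

# Walk the graph breadth-first and collect every package reachable from `pkg`.
recursive_deps <- function(pkg, graph) {
  seen <- character(0)
  queue <- graph[[pkg]]
  while (length(queue) > 0) {
    nxt <- queue[1]
    queue <- queue[-1]
    if (nxt %in% seen) next          # skip packages already visited
    seen <- c(seen, nxt)
    queue <- c(queue, graph[[nxt]])  # enqueue that package's own imports
  }
  sort(seen)
}

direct <- toy_graph[["pkgA"]]                 # 2 direct dependencies
recursive <- recursive_deps("pkgA", toy_graph) # 4 once indirect ones count
```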

@statquant I know about and have mentioned IDate, but it’s a data.table thing that is not totally portable. As an economist I deal a lot with monthly and quarterly data, where I use a mix of lubridate and xts/zoo. We can keep this thread open; I definitely don’t mind good packages like nanotime being added. I’m not yet convinced lubridate should be removed. I also have not benchmarked it tbh; I just know that dependency-wise it is definitely different from the rest of the tidyverse and it serves a lot of common tasks.

@SebKrantz
Member

Just one note to both of you: you need to fork, implement, and send a PR on the development branch. You cannot fork the "main" or "CRAN-Version" branch and send a PR from those to development, as that will include other stuff I don't want in development.

@SebKrantz
Member

I've added nanotime, qs and arrow for now, but perhaps you can still improve on my descriptions and add the links, as you find time.

@eddelbuettel

(Came here late via the commit you just made.... thanks for that)

> So what do you use for standard Date and POSIXct manipulation?

anytime::anydate() and anytime::anytime() really do all I need or want, and never need a format (on sane inputs). Others are slowly copying its design. I need to check when base R did this, but it now also 'guesses' over two or three plausible formats in some converters. And lubridate, apparently, also does (or will). But like @statquant I never found a use for lubridate, likely because when it first started it was very slow. The C++ rewrites and pruning of dependencies made it better.
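
A short sketch of the format guessing described above. The base R part relies on `as.Date()` trying "%Y-%m-%d" and then "%Y/%m/%d" when no format is given; the `anytime::anydate()` call assumes the anytime package is installed and is skipped otherwise. The input strings are arbitrary examples.

```r
# Base R: two formats are tried automatically when `format` is omitted.
d1 <- as.Date("2021-08-12")
d2 <- as.Date("2021/08/12")

# Anything else needs an explicit format in base R.
d3 <- as.Date("12.08.2021", format = "%d.%m.%Y")

# anytime (only if installed): parse without supplying a format.
if (requireNamespace("anytime", quietly = TRUE)) {
  a1 <- anytime::anydate("2021-08-12")
}
```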

And nanotime is good when you have to deal with sub-microsecond timestamps, as is now common with high-frequency trading data. Its S4 class is pretty sane (thanks to @lsilvest who rewrote my more basic S3 class and then added a ton more useful features). So yes, adding it here makes sense even if it (sadly) is not quite as minimal in its dependencies. At least recursively it still doesn't blow up. I haven't looked at clock at all as we covered the same ground earlier for our needs so 🤷‍♂️
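
A sketch of why sub-microsecond timestamps need a dedicated class. POSIXct stores seconds since the epoch as a double, so for present-day dates one nanosecond is below the resolution of the number. The `nanotime::nanotime()` call assumes the package is installed and follows the ISO-style string form from its documentation; the timestamp is arbitrary.

```r
# POSIXct is a double holding seconds since 1970-01-01 UTC.
t0 <- as.POSIXct("2021-08-12 12:00:00", tz = "UTC")
secs <- as.numeric(t0)

# Adding one nanosecond does not change the stored value:
# doubles near 1.6e9 are spaced roughly 2.4e-7 apart.
lost <- (secs + 1e-9) == secs

# nanotime (only if installed): integer64 nanoseconds since the epoch.
if (requireNamespace("nanotime", quietly = TRUE)) {
  nt <- nanotime::nanotime("2021-08-12T12:00:00.000000001+00:00")
}
```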

@BenoitLondon

Hello!
This "package" is a nice idea. I myself used the (defunct?) pkgverse package to build my own package universe...
Sorry to bump on this issue but I wanted to suggest some packages.
Maybe they are too specialized or do not meet your coding standards...

@SebKrantz
Member

SebKrantz commented Aug 16, 2021

Thanks @eddelbuettel for clarifying this about lubridate and nanotime. I think it is good then to have all these packages here. I recall from my use that I didn't find lubridate terribly slow, and indeed it has both C and C++ functions.

Thanks also @BenoitLondon for these suggestions. I did not know about pkgverse, but it's a nice idea. I could create a function fastverse_child() allowing the creation of a 0-dependency extensible verse like the fastverse - for a future release.

Regarding the packages you suggested, I am happy to add stringdist.

The others I think don't qualify because (1) speedglm and ranger are packages to estimate specific kinds of statistical models. The fastverse focuses on general purpose statistical computing and data manipulation, and for good reason: we are talking about more than 50 packages in the estimation category, from various fast lm's, glm's, panel data and time series models (e.g. Kalman Filter) to various fast machine learning models (random forests have at least two faster implementations, and there are several fast knn and other classifiers). We could add an "Estimation" category as an extra feature at the end of the README file, but then we should try to be comprehensive and also need to move in broad strokes, e.g. just listing the packages under several categories: 'linear models', 'time series', 'classifiers', 'imputation' etc. At the moment I certainly don't have time to check out all those packages and determine their dependencies, but if you want to undertake a comprehensive mapping of fast and low-dependency estimation packages I can add it to the README. Just picking out two packages here is definitely not an option, and estimation packages will never be added to the documentation under ?fastverse_extend.

future also for me does not qualify because it is a parallel computing package. Parallel implementation alone makes nothing fast; it depends on the code that is being parallelized, and C/C++ level parallelism (as you have in data.table, fst, roll etc.) also does a significantly better job at that. In any case, the fastverse includes packages allowing you to write 'fast code' for statistical computing and data manipulation. Everything else is for the "High-Performance Computing" Task View on CRAN. redux appears to be in the same category, although I don't fully understand it.
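
A rough illustration of the point that speed depends on the code itself. The sketch compares an interpreted loop with a vectorised call on the same data; no parallelism is used, and the function names `slow_sum_sq()` and `fast_sum_sq()` are made up for this example. Timings will vary by machine, so none are shown.

```r
set.seed(1)
x <- runif(1e5)

# Interpreted element-by-element loop.
slow_sum_sq <- function(v) {
  total <- 0
  for (val in v) total <- total + val^2
  total
}

# Vectorised version: the loop runs in compiled code inside sum() and `^`.
fast_sum_sq <- function(v) sum(v^2)

res_slow <- slow_sum_sq(x)
res_fast <- fast_sum_sq(x)

# Compare elapsed time over repeated calls.
t_slow <- system.time(for (i in 1:20) slow_sum_sq(x))
t_fast <- system.time(for (i in 1:20) fast_sum_sq(x))
```

Spreading the slow loop over several workers would at best divide its run time by the number of cores, on top of the cost of starting workers and moving data, whereas the vectorised rewrite needs no workers at all.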

@SebKrantz
Member

Now added stringdist and the links.

@BenoitLondon

BenoitLondon commented Aug 16, 2021

Thanks! Yeah, I think the ability to create several of our own *-verses is quite nice, depending on what you're working on.

some examples :

  • data-verse (which could be fastverse)
  • web-verse (with httr, rvest etc)
  • pkg-dev-verse (with testing pkg, usethis etc)
  • ML-verse (with ranger speedglm h2o etc)

Can create a new issue for this if you want?

@SebKrantz
Member

Thank you, yes I can add it as an extra, but the purpose of my package is not to make a verse-creating package, but to emphasize packages with certain desirable properties. Full flexibility to customize this verse, both globally and for specific projects, has already been granted (see vignette). The disadvantage of creating wholly separate verses is that it requires creating a source package which is not available on CRAN, whereas simply adding a configuration file inside a project directory is much easier. So I'll keep it in the back of my head and implement it if feasible. I don't think an extra issue is necessary. Thanks.

@BenoitLondon

sure, makes sense thanks!

@SebKrantz
Member

So @BenoitLondon, I have just pushed an update to GitHub which includes a function fastverse_child that does what you want. Feel free to check it out and give feedback.

@emmansh

emmansh commented Sep 3, 2021

@SebKrantz this organization is a great initiative. If I may, I would like to mention {rrapply}, which is a great package for dealing with lists. It provides great speed with no dependencies.

@SebKrantz
Member

Thanks @emmansh, this package is interesting. I will check it out.

@s3alfisc

Hi @SebKrantz, maybe this is out of scope for the fastverse, but I wanted to point you towards the dqrng package, which provides very fast sampling of random numbers. Here is a benchmark I did a while back:

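library(dqrng)
library(bench)

# Draw m * n values from {1, -1} with replacement
m <- 1000
n <- 99999
all <- m * n
# Compare base sample() with dqrng's dqsample();
# check = FALSE because the two generators return different random draws
bm <- bench::mark(samp = sample(x = c(1, -1), size = all, replace = TRUE),
                  dqsamp = dqsample(x = c(1, -1), size = all, replace = TRUE),
                  check = FALSE,
                  iterations = 3)
bm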

# # A tibble: 2 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 samp          6.37s    6.59s     0.153    1.12GB    0.153     3     3     19.56s
# 2 dqsamp        1.07s    1.43s     0.723    1.12GB    0.482     3     2      4.15s
# # ... with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
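
For reference, the speed-up implied by the table above can be computed directly, and the comparison can be rerun at a smaller size. This is a sketch: the reduced `size` and the seeds are arbitrary choices not taken from the benchmark, and the dqrng calls (`dqset.seed()`, `dqsample()`) run only if the package is installed.

```r
# Ratios taken from the benchmark table above (base time / dqrng time).
median_speedup <- 6.59 / 1.43  # roughly 4.6x
min_speedup <- 6.37 / 1.07     # roughly 6x

# Smaller-scale rerun; `size` is an arbitrary reduced value.
size <- 1e6
set.seed(1)
s_base <- sample(x = c(1, -1), size = size, replace = TRUE)

if (requireNamespace("dqrng", quietly = TRUE)) {
  dqrng::dqset.seed(1)
  s_dq <- dqrng::dqsample(x = c(1, -1), size = size, replace = TRUE)
}
```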

@SebKrantz
Member

Thanks @s3alfisc, yes, random number generation is part of general purpose statistical computing and I am happy to include it.

@t-wojciech

Here is a new package ast2ast that translates some functions from R to C++, so they are faster. I think it's worth looking into.

@SebKrantz
Member

Thanks @t-wojciech. I also recently became aware of several approaches to compiling R to make it faster. I'll investigate and think about featuring such packages in the fastverse over the coming weeks.

@t-wojciech

rpolars provides bindings to Polars. It's still in the early stages (not available on CRAN), but it promises to be interesting.

SebKrantz added a commit that referenced this issue Apr 10, 2023
Add r-polars and Tidier.jl (#18).
@tony-aw

tony-aw commented Jun 25, 2023

> Hello, thank you! I already considered nanotime, but did not include it for now because my thinking was it provides a specialized class that few people require and those that require it know about it. But I can include it for sure. Lubridate and ggplot2 are on the list because I haven’t quite found convenient replacements for them, and the fastverse should still be somewhat well rounded. I don’t know about the other packages, but you can send a pull request to the development branch, making a new category for reading and writing files.
>
> Otherwise I’ll look at them during the weekend...

Sorry to jump in like this. But regarding an alternative to ggplot2: what do you think of the vegabrite R package (https://github.com/vegawidget/vegabrite)? It looks promising (but it's still somewhat experimental).

@SebKrantz
Member

Thanks, it's interesting indeed, especially for interactive visualization in R. However, it imports vegawidget, and through that incurs 34 dependencies. So given that this is experimental and has a high dependency count, it is not really a fastverse candidate. But I agree with you, a lightweight and more performant system for complex graphics in R would be very nice.

@tony-aw

tony-aw commented Jun 29, 2023

Hi, thank you for your response.
Yes, including recursive dependencies the number is indeed high, mostly due to the recursive dependencies of its dependency htmlwidgets. If only that package could reduce its dependencies...

@tony-aw

tony-aw commented Oct 6, 2023

By the way, stringr has 7 dependencies, not 3:
cli, glue (≥ 1.6.1), lifecycle (≥ 1.0.3), magrittr, rlang (≥ 1.0.0), stringi (≥ 1.5.3), vctrs.
Why would stringr be in the list, considering it's just a wrapper around stringi, though with unnecessarily many dependencies?

@SebKrantz
Member

Agreed, it could be removed, I sometimes still use it because of the more convenient API.

@tony-aw

tony-aw commented Oct 23, 2023

The function names and arguments of 'stringi' and 'stringr' are quite similar, or do you mean something else? Also, sorry if this is a stupid question, but what does the API have to do with the fastverse? It's about high speed and minimal dependencies, right?
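
To make the naming comparison concrete, here is a sketch of the same pattern match in base R, stringi and stringr. The stringi and stringr calls assume those packages are installed and are skipped otherwise; the input vector is arbitrary.

```r
x <- c("apple", "banana", "cherry")

# Base R: pattern first, then the strings.
hit_base <- grepl("an", x)

# stringi (only if installed): strings first, engine named in the function.
if (requireNamespace("stringi", quietly = TRUE)) {
  hit_stringi <- stringi::stri_detect_regex(x, "an")
  stopifnot(identical(hit_stringi, hit_base))
}

# stringr (only if installed): strings first, shorter function name.
if (requireNamespace("stringr", quietly = TRUE)) {
  hit_stringr <- stringr::str_detect(x, "an")
  stopifnot(identical(hit_stringr, hit_base))
}
```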

@waynelapierre

waynelapierre commented Jan 14, 2024

Some suggestions:

  1. remove magrittr. it is slow and has been obsolete for a while: https://michaelbarrowman.co.uk/post/the-new-base-pipe/
  2. nowadays very few R users resort to Java for speed
  3. it makes no sense to resort to Julia for speed as it is slow and bloated
  4. many packages are just wrappers of the real fast ones (stringr to stringi, tidytable to data.table, etc.). since this repo is called fastverse not tidyverse, you might want to keep only the real fast ones to avoid confusing readers.
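
For readers unfamiliar with the base pipe mentioned in the first point, a minimal sketch follows. It assumes R 4.1 or later for `|>`; the magrittr comparison runs only if that package is installed. It shows equivalence of results only and makes no claim about relative speed.

```r
x <- c(1, 4, 9)

# Base pipe (R >= 4.1): rewritten by the parser into nested calls.
res_base <- x |> sqrt() |> sum()

# magrittr pipe (only if installed): an ordinary function, `%>%`.
if (requireNamespace("magrittr", quietly = TRUE)) {
  `%>%` <- magrittr::`%>%`
  res_magrittr <- x %>% sqrt() %>% sum()
  stopifnot(identical(res_base, res_magrittr))
}
```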

@SebKrantz
Member

Thanks, I have adjusted the README a bit, putting stringr, snakecase and lubridate into the notes below each section. I want to keep magrittr due to the reasons mentioned here. Bindings to faster languages and data.table wrappers were moved to the end of the README.


10 participants