Methods for large corpora? #9

Open
gnewton opened this issue Dec 27, 2018 · 1 comment

@gnewton

gnewton commented Dec 27, 2018

Sort of related to #8...

You have methods in the API, like in your example, that take an array of strings (docs).

matrix, _ := vectoriser.FitTransform(testCorpus...)

I'd like to use this for very large corpora: tens or hundreds of millions of documents, none of them tiny. Putting all of these into a single array of strings means holding the whole corpus in memory at once, which does not sound workable at that scale.
Any chance the methods that currently take a string array for the documents could instead accept a function or interface that iterates over the docs? (Or could new methods be added that support this?)

Thanks,
Glen

@james-bowman
Owner

Thanks, this is on the agenda. Was thinking something like a FitPartial() method and/or adding support for more generalisable input streams.
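
As a rough illustration of the `FitPartial()` idea, the sketch below grows a vocabulary one batch at a time. It is only an assumption about how such a method might behave; the library has no such method in this thread, and `StreamingVocab`, `NewStreamingVocab`, `Transform`, and `Size` are hypothetical names. Tokenisation is a naive whitespace split.

```go
package main

import "strings"

// StreamingVocab is a hypothetical type, not part of the library. It
// learns a term -> column index mapping incrementally, so batches of
// documents can be fed in without loading the whole corpus.
type StreamingVocab struct {
	index map[string]int
}

func NewStreamingVocab() *StreamingVocab {
	return &StreamingVocab{index: make(map[string]int)}
}

// FitPartial extends the vocabulary with any unseen terms in this batch.
// Terms learned from earlier batches keep their column index.
func (v *StreamingVocab) FitPartial(docs ...string) {
	for _, doc := range docs {
		for _, tok := range strings.Fields(strings.ToLower(doc)) {
			if _, seen := v.index[tok]; !seen {
				v.index[tok] = len(v.index)
			}
		}
	}
}

// Transform returns sparse term counts (column index -> count) for one
// document. Terms outside the learned vocabulary are skipped.
func (v *StreamingVocab) Transform(doc string) map[int]int {
	row := make(map[int]int)
	for _, tok := range strings.Fields(strings.ToLower(doc)) {
		if col, ok := v.index[tok]; ok {
			row[col]++
		}
	}
	return row
}

// Size reports the number of distinct terms learned so far.
func (v *StreamingVocab) Size() int {
	return len(v.index)
}
```

One consequence of this design is that the number of columns is only known after the last batch, so rows transformed early may be narrower than rows transformed later. A two-pass approach (fit over the whole stream, then transform) avoids that, at the cost of reading the corpus twice.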
