
Pluggable cache system? #103

Open
jimsmart opened this issue Feb 11, 2018 · 26 comments · May be fixed by #766

@jimsmart

Hi,

Thanks for Colly! :)

I have a task involving hundreds of thousands of pages, so obviously I am using Colly's caching, but it's basically 'too much' for the filesystem (wasted disk space, a pain to manage, slow to back up, etc.).

I'd like to propose a pluggable cache system, similar to how you've made other Colly components.

Perhaps with an API like this:-

type Cache interface {
	Init() error
	Get(url string) (*colly.Response, error)
	Put(url string, r colly.Response) error
	Remove(url string) error
}

...or...

type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

The first one won't be possible if you then wish to implement FileSystemCache in a subpackage of Colly though, as that would create an import cycle (the subpackage would need to import colly for colly.Response, while colly imports the subpackage).
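For illustration only, a minimal in-memory implementation of the second interface might look something like this (a sketch; memoryCache is hypothetical, not part of the proposal):

import (
	"errors"
	"sync"
)

// memoryCache is a toy in-memory Cache, keyed by URL.
type memoryCache struct {
	mu    sync.RWMutex
	pages map[string][]byte
}

func (c *memoryCache) Init() error {
	c.pages = make(map[string][]byte)
	return nil
}

func (c *memoryCache) Get(url string) ([]byte, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	data, ok := c.pages[url]
	if !ok {
		return nil, errors.New("cache miss: " + url)
	}
	return data, nil
}

func (c *memoryCache) Put(url string, data []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.pages[url] = data
	return nil
}

func (c *memoryCache) Remove(url string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.pages, url)
	return nil
}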

The reason I also need a Remove method is that one project targets a site that sometimes serves maintenance pages, and whilst I can detect these, Colly currently has no way of saying stop, or of rejecting a page after processing. Obviously, the last thing I want part way through a crawl is a poisoned cache. But that's probably a separate issue, and one I can live with if I can remove the bad pages myself.

If pluggable caches were to be implemented, I have existing code from another project that has a cache built using SQLite as a key-value store, compressing/decompressing the data with Zstandard (it's both surprisingly quick and super efficient on disk space), that I would happily port over. This can either become part of Colly, or a separate thing on my own Github.

I did start implementing this myself, but ran into a problem with how I went about it. I followed the pattern you have established of having the separate components as subpackages, and then got bitten because my FileSystemCache couldn't easily reach into the Collector to get the CacheDir (I was trying to preserve existing behaviour / API compatibility). Maybe that's not an issue. Maybe these bits shouldn't be subpackages. Once I started thinking things like that, I figured it was best to discuss before, or instead of, proceeding any further.

— Your thoughts?

@jimsmart (Author)

Another option is to pass/return io.Reader across the interface, instead of []byte.
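A sketch of that variant, reusing the method set from the proposal above:

type Cache interface {
	Init() error
	Get(url string) (io.Reader, error)
	Put(url string, r io.Reader) error
	Remove(url string) error
}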

@asciimoo (Member)

asciimoo commented Feb 11, 2018

@jimsmart thanks for this great idea and for the detailed and clear specification.

I'm currently working on a storage interface (04d96ab) which follows pretty much the same concept. It allows us to use any kind of backend to store visited URLs and cookies (http://go-colly.org/docs/best_practices/storage/).
Let me describe its architecture; perhaps it can help in designing a more sophisticated cache functionality.
The storage interface is a standalone module in colly: https://github.com/gocolly/colly/blob/master/storage/storage.go, and colly imports it. The storage module also contains the default naive implementation of the interface, for backward compatibility. I decided to create separate packages for the different implementations of the Storage interface, to reduce colly's dependencies. The Redis storage implementation can be found here: https://github.com/gocolly/redisstorage
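For reference, the Storage interface in that file looks roughly like this (paraphrased; see the linked source for the authoritative version):

import "net/url"

// Storage handles a Collector's internal data,
// such as visited URLs and cookies.
type Storage interface {
	// Init initializes the storage backend.
	Init() error
	// Visited stores a request ID that the Collector has visited.
	Visited(requestID uint64) error
	// IsVisited reports whether the request was visited before.
	IsVisited(requestID uint64) (bool, error)
	// Cookies retrieves stored cookies for the given host.
	Cookies(u *url.URL) string
	// SetCookies stores cookies for the given host.
	SetCookies(u *url.URL, cookies string)
}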

I think a similar concept could work with the 2nd cache interface from your specification. What do you think?

Also, I'd suggest an additional Close() error function in the interface to be able to terminate connections.

@jimsmart (Author)

jimsmart commented Feb 11, 2018

Yes. That's what I was thinking, similar to package storage.

And have a SetCache(c cache.Cache) error method on Collector.

I don't need Close() myself... Get(), Put() and Remove() should all be atomic operations: each method would simply obtain a database connection on execution and close it before returning. All Go DB drivers use a connection pool behind the scenes anyhow, so it's a good pattern to follow; it has minimal overhead and is easy to reason about. It's better than having an open DB connection hanging about for hours/days: releasing it back to the pool after every use lets the driver deal with broken or stale connections, etc. My Init() method here wasn't for creating a connection; it was to create the DB / DB table if it doesn't already exist (or, for the filesystem, create the cache dir).
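A minimal sketch of that pattern using database/sql with SQLite-flavoured SQL, assuming a table (url TEXT PRIMARY KEY, data BLOB); sqlCache and the schema are illustrative, not the actual collysqlite code:

import "database/sql"

// sqlCache is a hypothetical SQL-backed cache.
type sqlCache struct {
	db *sql.DB // database/sql pools connections internally
}

func (c *sqlCache) Init() error {
	// Create the table if it doesn't already exist.
	_, err := c.db.Exec(`CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, data BLOB)`)
	return err
}

func (c *sqlCache) Get(url string) ([]byte, error) {
	var data []byte
	// QueryRow borrows a connection from the pool and releases it
	// once Scan has completed.
	err := c.db.QueryRow(`SELECT data FROM cache WHERE url = ?`, url).Scan(&data)
	if err != nil {
		return nil, err
	}
	return data, nil
}

func (c *sqlCache) Put(url string, data []byte) error {
	_, err := c.db.Exec(`INSERT OR REPLACE INTO cache (url, data) VALUES (?, ?)`, url, data)
	return err
}

func (c *sqlCache) Remove(url string) error {
	_, err := c.db.Exec(`DELETE FROM cache WHERE url = ?`, url)
	return err
}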

Your redisstorage.Storage could then also implement Get(), Put() and Remove() and would satisfy both storage.Storage and cache.Cache interfaces.
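In Go that dual role can be checked at compile time. A sketch, assuming Get/Put/Remove were added to redisstorage.Storage and a cache package with the proposed Cache interface existed:

var (
	_ storage.Storage = (*redisstorage.Storage)(nil)
	_ cache.Cache     = (*redisstorage.Storage)(nil)
)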

— Slight aside: I see you have a lot of stutter in your interface names. It might be better to move (or perhaps copy) them up to the parent package, then you'd have colly.Storage as the interface in the method calls.

@asciimoo (Member)

> And have a SetCache(c cache.Cache) error method on Collector.

Yes.

> I don't need Close() myself... Get(), Put() and Remove() should all be atomic operations

You're right, I've removed it from Storage too, thanks.

> Your redisstorage.Storage could then also implement Get(), Put() and Remove() and would satisfy both storage.Storage and cache.Cache interfaces.

Storage is responsible for storing both visited URLs and cookies, so a simple Get() and Put() isn't enough in this case. Also, robots.txt handling is planned for the Storage interface, which will make it even more complex.

> Slight aside: I see you have a lot of stutter in your interface names. It might be better to move (or perhaps copy) them up to the parent package, then you'd have colly.Storage as the interface in the method calls.

I agree, it is a bit redundant, but the package doesn't use the internals of the colly package, which is quite big. I'm not against moving the Storage/Cache interfaces to colly, especially because it would allow us to use the first proposed interface.

@jimsmart (Author)

> Storage is responsible for storing both visited URLs and cookies, so a simple Get() and Put() isn't enough in this case.

Of course not! :) I wasn't suggesting it would be. I was suggesting that specific implementations of Storage could optionally also implement the cache methods (Get/Put/Remove) as well as the methods on the Storage interface; this would give the user the option of using that store as both a cache and a store for state persistence, or just one or the other.

@vosmith (Collaborator)

vosmith commented Feb 12, 2018

@jimsmart what exactly are you getting/putting in the cache? My imagination is having a hard time going beyond image and CSS files, which aren't downloaded through Colly in any case. If it's just HTML, isn't the IsVisited feature enough? I was initially excited about the idea of a caching backend as well, but I couldn't find a clear enough use case.

@jimsmart (Author)

jimsmart commented Feb 12, 2018

I have several hundred thousand web pages I am processing to extract data.

Currently, if the task fails part way through, I just resume later and rely on the cached data (in the current filesystem cache) getting reprocessed; the crawl simply resumes from where it got to.

But there are so many files, and most are hundreds of KB each. The cache folder is >6 GB after a crawl. Compressed and in SQLite, that same data is a single file of <1.5 GB. It's fast. It's easy to manage.

I've pushed two repos with working prototype code; run go test in each project folder to execute the tests.

https://github.com/jimsmart/collysqlite - implements a cache using SQLite.
https://github.com/jimsmart/collyzstandard - implements a compressing cache wrapper using Zstandard.

Both of these use []byte for the type of data being cached — it was the simplest solution to get a prototype up and running.
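To give a flavour of the wrapper approach, here is a sketch of a compressing cache decorator over the []byte interface, written against the github.com/klauspost/compress/zstd package; zstdCache and its constructor are illustrative, not the actual collyzstandard API:

import "github.com/klauspost/compress/zstd"

// zstdCache wraps another Cache, compressing on Put and
// decompressing on Get.
type zstdCache struct {
	inner Cache
	enc   *zstd.Encoder
	dec   *zstd.Decoder
}

func newZstdCache(inner Cache) (*zstdCache, error) {
	enc, err := zstd.NewWriter(nil)
	if err != nil {
		return nil, err
	}
	dec, err := zstd.NewReader(nil)
	if err != nil {
		return nil, err
	}
	return &zstdCache{inner: inner, enc: enc, dec: dec}, nil
}

func (z *zstdCache) Init() error { return z.inner.Init() }

func (z *zstdCache) Get(url string) ([]byte, error) {
	compressed, err := z.inner.Get(url)
	if err != nil {
		return nil, err
	}
	return z.dec.DecodeAll(compressed, nil)
}

func (z *zstdCache) Put(url string, data []byte) error {
	return z.inner.Put(url, z.enc.EncodeAll(data, nil))
}

func (z *zstdCache) Remove(url string) error { return z.inner.Remove(url) }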

@jimsmart (Author)

Basically, if I'm going to invest the time and resources to download that much data, I'd most certainly like to keep it around for later reruns.

@jimsmart (Author)

Scrapy, the leading Python web scraper, has pluggable caches. Maybe take a look?

@jimsmart (Author)

FWIW: some log output from collyzstandard tests showing compression stats...

2018/02/12 02:40:51 compressed 12930/5251, ratio 2.46, in 2.757465ms https://google.com/
2018/02/12 02:40:51 decompressed in 91.872µs https://google.com/
2018/02/12 02:40:53 compressed 351734/71301, ratio 4.93, in 78.481158ms https://facebook.com/
2018/02/12 02:40:53 decompressed in 1.093598ms https://facebook.com/
2018/02/12 02:40:54 compressed 148439/29126, ratio 5.10, in 54.710252ms https://twitter.com/
2018/02/12 02:40:54 decompressed in 338.833µs https://twitter.com/

@asciimoo (Member)

@jimsmart impressive! Could you please open a PR?

@jimsmart (Author)

Sure, I'll get something together ASAP; I'm just on the tail-end of a bit of a coding marathon, so it won't be today.

I don't think my code will integrate cleanly yet; it needs some more work. But I'm on with that already.

I'm also concerned about 'breaking changes': do you have any API compatibility guarantees, or anything similar?

@jimsmart (Author)

In my local repo I also have an implementation of 99% of the 'storage' API... for SQLite.

@asciimoo (Member)

> Sure, I'll get something together ASAP; I'm just on the tail-end of a bit of a coding marathon, so it won't be today.
> I don't think my code will integrate cleanly yet; it needs some more work. But I'm on with that already.

Awesome, there is no rush. Thanks!

> I'm also concerned about 'breaking changes': do you have any API compatibility guarantees, or anything similar?

Does this change introduce any backward incompatibility?

@jimsmart (Author)

Um, maybe. But I want to minimise that, obviously. In an ideal world there'd be no breaking changes. That's partly why the code needs more work.

I'm new to the party, so I don't know much about any current plans for features/improvements/etc. So I think a slow and gentle approach to this is probably best.

@asciimoo (Member)

> I'm new to the party, so I don't know much about any current plans for features/improvements/etc. So I think a slow and gentle approach to this is probably best.

Perfect!
The project is relatively young, so we don't have strict rules about API changes. Try to avoid breaking changes if possible, and be gentle otherwise, as you mentioned.

@jonathaningram

I wonder if a good solution is to use an interface like the one in go-cloud: https://github.com/google/go-cloud/tree/master/blob

Check out https://github.com/google/go-cloud/blob/master/blob/example_test.go for a "local/file" cache. Similarly, check out the GCS or S3 tests. This way, I could pass my own bucket to colly and it would work with non-local filesystems.
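As a sketch of that idea, a gocloud.dev/blob bucket could be adapted to the []byte Cache interface proposed above; bucketCache is hypothetical, while the blob.Bucket calls are from the go-cloud API:

import (
	"context"

	"gocloud.dev/blob"
)

// bucketCache adapts a go-cloud blob.Bucket to the proposed Cache
// interface, using the page URL as the blob key. Note: some backends
// restrict key characters, so a real implementation might hash the URL.
type bucketCache struct {
	bucket *blob.Bucket
}

func (b *bucketCache) Init() error { return nil }

func (b *bucketCache) Get(url string) ([]byte, error) {
	return b.bucket.ReadAll(context.Background(), url)
}

func (b *bucketCache) Put(url string, data []byte) error {
	return b.bucket.WriteAll(context.Background(), url, data, nil)
}

func (b *bucketCache) Remove(url string) error {
	return b.bucket.Delete(context.Background(), url)
}

The same code would then work against a local directory (via gocloud.dev/blob/fileblob), GCS, or S3, just by opening a different bucket.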

@vosmith (Collaborator)

vosmith commented Sep 13, 2018

@jonathaningram I think the interface used in go-cloud/blob looks like a good model to follow! Having it as a hard dependency would bloat this project, but I think emulating their framework would be a good move.

@asciimoo @jimsmart, thoughts?

@michilu

michilu commented Nov 26, 2018

FYI, my use case: I want to cache the responses of POST requests to an API.
Here is an example of a customizable cache handler:
master...michilu:customizable-cache-handler

@florinutz

I just wanted to drop this here: https://www.nayuki.io/page/self-encrypted-cache-structure

@andreacab

andreacab commented Jun 15, 2020

Hey, I'm thinking of picking this up, but just wondering where this is at?

I should probably describe my use case. I ran into a blocker because we use Heroku, and their dynos' filesystems are ephemeral. If you're not using a cloud storage mechanism (which you can, but it's quite limited under the free tier), you'll lose the cache and have at most a day's worth of caching...

@jimsmart (Author)

Hi, the repos mentioned in #103 (comment) (above) contain my prototype code. Feel free to open issues on those repos if you have any questions — happy to help, as best I can.

I recall there were also changes needed in Colly to make these work, but I don't believe I ever pushed those, sorry. But IIRC the interface was fairly straightforward.

@hsinhoyeh

Hi, just want to catch up on this issue.

@jimsmart I don't see any advantage to using SQLite as a cache here; you could even tar files on top of the filesystem, which could deliver similar performance to SQLite.

@asciimoo If you are seeking backward-compatible API changes, maybe you could replace the filesystem usage (currently via the "os" package) with https://github.com/spf13/afero, so we could easily switch to a memory FS with tar/gzip functionality on top of it.
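A sketch of that afero idea, assuming the cache writes files through an afero.Fs (fsCache and its layout are illustrative, not a patch against colly's actual cache code):

import (
	"crypto/sha1"
	"encoding/hex"
	"path/filepath"

	"github.com/spf13/afero"
)

// fsCache stores cached pages as files under dir, via any afero.Fs.
type fsCache struct {
	fs  afero.Fs
	dir string
}

func newFsCache(fs afero.Fs, dir string) (*fsCache, error) {
	if err := fs.MkdirAll(dir, 0o750); err != nil {
		return nil, err
	}
	return &fsCache{fs: fs, dir: dir}, nil
}

func (c *fsCache) path(url string) string {
	// Hash the URL so it is safe to use as a filename.
	sum := sha1.Sum([]byte(url))
	return filepath.Join(c.dir, hex.EncodeToString(sum[:]))
}

func (c *fsCache) Get(url string) ([]byte, error) {
	return afero.ReadFile(c.fs, c.path(url))
}

func (c *fsCache) Put(url string, data []byte) error {
	return afero.WriteFile(c.fs, c.path(url), data, 0o640)
}

newFsCache(afero.NewOsFs(), dir) would keep today's on-disk behaviour, while newFsCache(afero.NewMemMapFs(), dir) would give a purely in-memory cache.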

Any thoughts?

@jimsmart (Author)

jimsmart commented Nov 19, 2021 via email

@slinso

slinso commented Aug 13, 2022

I really like the colly API for scraping, but having just the filesystem for storage is not nice. I came back to this issue a few times over the years and always went back to my own simple crawler, which basically just uses an HTTP client and goquery.
But I always add caching with compression in some form once the number of pages gets above 10,000.

That's why I would like to bring this topic up again. Is it worth investigating this issue further? I think @jimsmart did some nice prototyping. Is there any chance of getting this included if we have a complete pull request, @hsinhoyeh?
If so, I would like to work on this.

@unjello

unjello commented Apr 20, 2023

Bumping this up. I'd love to see a more pluggable caching/storage system, but I don't believe my go-fu is strong enough to attempt the refactoring. I may need to though, because I would really use it, so I wonder if there are people at least willing to do some solid peer review.

unjello added a commit to Kronika-Polskiej-Demosceny/colly that referenced this issue Apr 29, 2023
Create an interface `Cache`, similar to `storage.Storage`, that
will allow for pluggable caching extensions. The caching subsystem is
created with full access to the Request and Response objects, to potentially
make more complex caching decisions based on both request and response
headers. Because of the need to access those objects, `Cache` is not
created in its own package like `storage`, but in the root `colly` package.

Closes gocolly#103
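Based on that description, the interface presumably has a shape along these lines; this is a guess for illustration, not the actual code in the referenced commit or PR #766:

// Cache, as described in the commit above: methods receive the full
// Request/Response, so implementations can inspect headers when deciding
// what to cache. Method names and signatures here are assumptions.
type Cache interface {
	Init() error
	Get(r *Request) (*Response, error)
	Put(r *Request, resp *Response) error
	Remove(r *Request) error
}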