Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching file modification times. #26

Open
jjeffery opened this issue Feb 1, 2017 · 16 comments
Open

Caching file modification times. #26

jjeffery opened this issue Feb 1, 2017 · 16 comments
Labels

Comments

@jjeffery
Copy link

jjeffery commented Feb 1, 2017

Hi and thanks for the effort you have put into this excellent package.

One enhancement that I would find useful is the caching of file modification times. What I find is that if I run "make generate" on my project the generated vfsgen files can have different modification times for the files even if they have not changed. This is not devastating but I for one would find it handy if the generated vfsgen files did not change unless the input files were in fact different.

I was thinking that one way to achieve this might be to generate a JSON file that was keyed on file name and contained a digest of the file name and a modification time. If the file had not changed then the modification time was used. If the file had changed then the current time would be used and the JSON file would be updated accordingly:

{
    "img/favicon.png": {
        "md5": "123456789abcdef0fedcba987654321",
        "mtime": "2015-11-14T18:56:32"
    },
   "js/app.js": {
        "md5": "fb3456789abcdef0fedcba987654321",
        "mtime": "2015-12-16T11:47:32"
    }
}

Now I realise that this would be a significant addition of functionality and your contribution guidelines indicate that it might be considered out of scope for the package. So before I start serious work on a PR, could I ask for your thoughts on this idea?

@dmitshur
Copy link
Member

dmitshur commented Feb 1, 2017

Hi @jjeffery, thanks for your feedback on vfsgen, and for and creating this issue!

I'd like to discuss this to better understand the problem and solution space, before we get to implementation details.

vfsgen fully supports and respects the modification times in the source virtual filesystem (VFS), and preserves them in the output VFS. Running it multiple times on same input should produce same output (i.e., it's deterministic). It even normalizes the timezone as UTC, so that people running vfsgen in different parts of the world (i.e., where their local timezone is different) would still produce same output.

So, I want to learn more about this:

What I find is that if I run "make generate" on my project the generated vfsgen files can have different modification times for the files even if they have not changed.

What does your make generate do? Why do the modification times change, if the files do not change?

@jjeffery
Copy link
Author

jjeffery commented Feb 2, 2017

Thanks for the reply @shurcooL. I'm keen to discuss and see if perhaps there is a better way to achieve what I am trying to do. Perhaps there is.

Apologies for make generate, I meant to say go generate.

As you say, vfsgen does respect the file modification time, and if that does not change then the output of vfsgen is consistent. I take your point about converting to UTC and notice that I omitted to put a trailing "Z" in my example timestamps above, which would be necessary.

Where I run into issues is when the modification time changes but nothing else changes. In my experience the most common examples of this are:

  • Source code being checked out to different machines will have different modification times
  • CI server that checks out the source from Git each time will have a different modification time for each build run

In these cases I would like vfsgen to recognize that the MD5 (or SHA1, whatever) is the same and assign a previously allocated modification time so that the generated code matches exactly.

The attached screenshot might help explain. I have just checked out one of our repos and run go generate. The new code uses the date/time that I am writing this (2017-02-02 01:46:09Z) because that is the modification time of the files in the code repo that I have just cloned.

screen shot 2017-02-02 at 12 04 40 pm

As an aside, I'm not sure why the previous version is showing fractions of a second and the recently generated code shows to second precision only. As far as I can see the code is just using MarshalText so that's a small mystery at this stage. I'll have to check this out when I'm near a Windows machine, but I suspect that if you run vfsgen on Windows you get a modification time to fractions of a second, but on Unix systems the modification time resolution is one second.

@jjeffery jjeffery closed this as completed Feb 2, 2017
@jjeffery jjeffery reopened this Feb 2, 2017
@dmitshur
Copy link
Member

(Sorry for the delay; I lost track of this issue and forgot to get back to you.)

Thanks for providing that info.

Where I run into issues is when the modification time changes but nothing else changes. In my experience the most common examples of this are:

  • Source code being checked out to different machines will have different modification times
  • CI server that checks out the source from Git each time will have a different modification time for each build run

I see. That makes sense. I run into the source code being checked out causing modtimes to change on disk occasionally too.

In these cases I would like vfsgen to recognize that the MD5 (or SHA1, whatever) is the same and assign a previously allocated modification time so that the generated code matches exactly.

Right, so vfsgen currently has no input aside from the input VFS itself, where it sources the modtimes from. If I understand your proposed solution, it's to add an optional cache that would override the modtimes from input VFS if the cached hashes (and filesizes, I presume) match, right?

Question about that approach, where do you anticipate the cache would be stored? Imagine if I check something in, and another person pulls the same repo on a different computer. If the cache isn't a part of the repo, they're unlikely to be able to access it, right? Also, CI servers are probably not going to get a copy of your cache either.

Here's what I'm thinking so far. The whole approach of potentially using a cache is something that happens before vfsgen produces the output. It's pretty orthogonal from the actual generation process and can be cleanly abstracted out. The code that does it can be a part of vfsgen, or it can be a part of a separate tool.

However, for prototype purposes, I think your best bet would be to prototype it as a separate tool from vfsgen. Here's what you can do, create a tool that caches modtimes and hashes, and then give it the ability to "restore" old cached modtimes. It would literally write to disk and change modtimes of files on disk to the saved values. Then, vfsgen would pick up those new modtimes that are the values you want.

How does such an approach sound to you?

In the end, I think that caching modtimes is just one potential approach to solve this overarching issue, and perhaps there's other ways. I'm open to keep thinking about this problem.

@dmitshur
Copy link
Member

It would literally write to disk and change modtimes of files on disk to the saved values.

Just to follow up that thought, such code/tool doesn't have to write to disk necessarily, it can simply provide a VFS wrapper that overrides modtimes of the underlying source VFS.

@jjeffery
Copy link
Author

jjeffery commented Feb 12, 2017

Thanks for taking the time to consider this. You are absolutely correct in saying that our hypothetically modified vfsgen needs something additional in order to remember the computed hashes, file sizes, etc, and if it is to work the way I am wanting it to, that storage has to be stored in source control in order for it to be useful.

My initial implementation thought was to introduce a file with a special name, perhaps _vfsgen.json. If that file was present in the root directory of the VFS, then its contents would be read, and updated if any of the files had actually changed. If the file was not present, then vfsgen would operate exactly as it does today.

I've already included a pretty simple example of what the contents of _vfsgen.json might look in my initial comment. (Yes, I think that including the file size would be a good idea).

I can well understand that magic filenames are not to everyone's taste, alternative implementations might be to include an additional command line argument to vfsgen.

Your point about this being an orthogonal concern to the main job of vfsgen is absolutely correct, and yes, I should probably just prototype this as a separate utility. My go generate steps are then in two steps, which should be no problem.

I guess that the only real motivation for including this in vfsgen would be if this were a common enough problem for all of its users. The convenience of having it built-in would obviate the need for another go get when setting up the project.

Never mind, I'm getting the impression that perhaps it is not something anyone else has had issue with. Feel free to close this issue, and I'll put together something that restores the timestamps prior to generating with vfsgen. (The wrapper idea is also a good one -- thanks very much. I'll experiment with that idea).

@dmitshur
Copy link
Member

dmitshur commented Feb 12, 2017

Sounds good, thanks!

As I said in my previous comment, I'm okay to keep this open with thinking label and see track any progress we make towards resolving the problem, or at least helping it in some ways.

It'd be great if you could post any updates about your experimentation and success, and tell me how well it works, etc.

If in the end it's something that's worthwhile to integrate into vfsgen, meaning it provides enough value to justify the increased complexity, I would certainly consider integrating it.

That said, I really want to think more and hopefully come up with a simpler solution.

@dc0d
Copy link

dc0d commented Apr 23, 2017

Nice package! Though I've not used it in any app yet. It would be nice if HTTP caching headers could be provided automatically (ETag, Cache-Control and other headers in the league as described - for example - here).

We should be overwrite default behavior or disable it - could be just a function that has access to file body and a KV-Store of file/info,header.

@dmitshur
Copy link
Member

Thanks @dc0d. This issue is about some sort caching of file modify dates, for files that are not modified between invocations of vfsgen. Not app-layer HTTP caching. Although the two might be related in some ways.

Still, if you'd like to discuss that, it's better to open a separate issue about that, to avoid crosstalk.

@dc0d
Copy link

dc0d commented Apr 23, 2017

My mistake; sorry.

@dmitshur
Copy link
Member

No problem! If you create that issue, we can discuss it further there.

@ghostsquad
Copy link

if this were a common enough problem for all of its users

I also think this would be very useful, as I plan on checking in the go generated file into source control, it would be nice to have a build script that is idempotent, and as such be able to run go genenerate, then run git ls-files -m and error if any files have changed.

@ghostsquad
Copy link

@shurcooL what's the usefulness of saving the modtime? Maybe this could be added to options and set using a static value?

@ghostsquad
Copy link

ghostsquad commented Dec 25, 2017

Assuming that file modification time doesn't actually matter, and only the content of the file matters, the above PR allows a user to set the filemod time to the the time zero value.

cc @jjeffery

@dmitshur
Copy link
Member

dmitshur commented Jan 4, 2018

what's the usefulness of saving the modtime?

Mostly to enable HTTP caching. When files have accurate modification times, and a request comes in with a If-Modified-Since: Thu, 05 Jul 2012 15:31:30 GMT header, then it's possible to respond with "304 Not Modified" if a file hasn't been modified since that time.

it would be nice to have a build script that is idempotent, and as such be able to run go genenerate, then run git ls-files -m and error if any files have changed.

I agree that'd be nice, but I still don't know of a great way of making that possible.

@dc0d
Copy link

dc0d commented Jan 5, 2018

I needed Cache-Control headers so created this package. Not sure is it's best in the market :) but it works - might be useful for implementing those headers in case those are going to be added.

@antong
Copy link
Contributor

antong commented Sep 7, 2018

In the stated use case where there is say assets_vfsdata.go that is committed to version control, then that means the necessary info is already contained in that file. I don't think a separate cache is necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants