Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move image copyrights to image #8863

Open
3 of 4 tasks
doofus-01 opened this issue May 12, 2024 · 23 comments
Open
3 of 4 tasks

Move image copyrights to image #8863

doofus-01 opened this issue May 12, 2024 · 23 comments
Labels
CI Issues involving the CI platform. Enhancement Issues that are requests for new features or changes to existing ones. Graphics Issues that involve the graphics engine or assets. Question Issues that are actually questions rather than problem reports.

Comments

@doofus-01
Copy link
Member

doofus-01 commented May 12, 2024

Describe the desired feature

We currently (since 97c8feb) track image copyrights in a CSV, with some support script and CI enforcement. This makes sure we don't end up with images of unknown or forgotten authors/copyrights. But then this involves meta-work and potential for merge conflicts whenever an image is added or modified.

We could use the EXIF tags of the WEBP, PNG, JPEG images to hold all the necessary info, and check that added/updated images have the info. A human-readable summary file (essentially replacing the current CSV) can be printed whenever, no reason for it to create merge conflicts.


EDIT: I'm working on a draft of this in one of my own repositories, because it's not clear to me how to test this locally before incorporating into the Wesnoth repository. This matters most for the CI workflow, the maintenance scripts can probably be a proper PR.

  • Get a basic CI workflow using exiftool to check the tags of an image touched by a PR
  • Get agreement on what needs to be checked
  • Write a script to read the existing copyrights.csv and update the Exif metadata of the existing images
  • Write a script to read and export existing image metadata to a CSV file, probably using the same format as the existing copyrights.csv.

Once the scripts are done, I don't think it's too much to ask that anyone deciding to batch run some image optimizing script also run the Exif metadata update.


EDIT2: This is a functional draft of the maintenance script to read the Exif data from the images and write to CSV
export_copyrights.txt

@doofus-01 doofus-01 added Enhancement Issues that are requests for new features or changes to existing ones. Graphics Issues that involve the graphics engine or assets. CI Issues involving the CI platform. labels May 12, 2024
@cooljeanius
Copy link
Contributor

One problem with this is that some of the tools for keeping image size down will strip EXIF data; I'm not sure if woptipng.py does this, but many similar tools do...

@Elvish-Hunter
Copy link
Contributor

Just like @cooljeanius said, I remember that one of the things that utils/woptipng.py (and probably also utils/optiwebp.py) does to reduce the file size is to remove all those metadata. Most image editors can also edit the EXIF fields, even putting some default values that can overwrite what's already there.
To me, it looks like there are too many ways to accidentally mess up the EXIF metadata, so we shouldn't rely on them.

@Pentarctagon
Copy link
Member

I could also remove it from the CI check and then update everything en-masse as part of the release work for each point release.

@doofus-01
Copy link
Member Author

I'm not convinced by either of those excuses.

  • There are ways to mess up anything, how is that a reason? CI fails if there's an extra white space, it should be able to handle this.

  • Removing the 200 bytes or whatever it is from the EXIF data is useless hyper-optimization.

@doofus-01
Copy link
Member Author

I could also remove it from the CI check and then update everything en-masse as part of the release work for each point release.

That sounds like it would shift the meta-work to you, and doesn't seem right if there is a better way. Maybe there isn't a better way, but I'd like to try.

A side benefit of using the EXIF data is that the artists, Wesnoth, and CC licenses could get marginally more exposure, as the image is grabbed and reused.

@Pentarctagon
Copy link
Member

Just like @cooljeanius said, I remember that one of the things that utils/woptipng.py (and probably also utils/optiwebp.py) does to reduce the file size is to remove all those metadata.

Probably using EXIF data in the webp files would be fine, since optiwebp doesn't do anything besides check if converting from png/jpg to webp results in a meaningful size reduction. It doesn't touch existing webp images or do anything else to reduce file size, though I have no idea if EXIF data would survive the conversion to webp for existing images.

@soliton-
Copy link
Member

I think this is definitely worth exploring. Having author and copyright info in the files themselves would mean they are still there even if someone uses them in their own project without copying the copyright.csv entries in some form.

Currently utils/woptipng.py calls convert -strip which removes nearly all meta data. Didn't test utils/optiwebp.py since you can't even run it on a path you want. cwebp itself appears to only care for meta data when given the -metadata option.

So our tooling would certainly need adjusting. Btw, not stripping meta data at all is not a good option as can be seen in e3ee472 for example where image sizes increase by thousands of percent in size just because of the added meta data.

@doofus-01
Copy link
Member Author

I've got a bit of a learning curve to get through to actually make this happen, but it looks like it should be possible to do this in CI:

  1. Read new/changed images for specific EXIF tags
  2. Fail CI if the specific metadata is missing (and maybe there should be a test that there isn't extra data?)

Then as a manually run script (not necessarily all one script):

  1. Read the existing CSV and update the existing images to have the metadata.
  2. Read through the images and export the metadata to some-human readable file, whether that's the current CSV or some updated format.

@doofus-01
Copy link
Member Author

Well, I'm stuck... Nothing really specific to this image stuff, just git fun and inexperience with github workflows.

I'm trying to get a list of files in a pull request that are different than the merge base - a very basic and common task, one would think. There's an example in the map-diff tool, and a few others in stack overflow, but I can't seem to get this. It doesn't help that there's not really a way to test this locally, AFAICT.

All my attempts to get a diff result in either a clean & empty diff or a fatal error.
How do I reference the thing in the blue box, (triggered by actions/checkout@v4)?
Screenshot_20240519_

git merge-base --fork-point sounds promising, but hasn't helped yet.

Any help appreciated. Thanks.

@Pentarctagon
Copy link
Member

Even if the copyright information is kept in the images themselves, we could still keep a file around with the hashes of all the images files and check against that.

For the actual question: IIRC it might have something to do with the shallow checkout (--depth=1)? I vaguely remember that being a problem when I tried implementing the current python script, which is why it keeps track of the file hashes to avoid needing to clone the entire repository history since that can take multiple minutes by itself.

@doofus-01
Copy link
Member Author

Even if the copyright information is kept in the images themselves, we could still keep a file around with the hashes of all the images files and check against that.

Right, I was trying to say that was part of the deal: there is a bookkeeping file that is human readable, and there are scripts to deal with it. It's just not part of CI; the images and whatever .cfg or .lua references them are all that need be touched in a PR. Storing the image hashes there makes sense, and there is no reason the bookkeeping file can't be automatically updated. Eventually... maybe...

For the actual question: IIRC it might have something to do with the shallow checkout (--depth=1)? I vaguely remember that being a problem when I tried implementing the current python script, which is why it keeps track of the file hashes to avoid needing to clone the entire repository history since that can take multiple minutes by itself.

I hadn't looked at that, it's something to try, thanks. It's very frustrating that the info is right there, it just can't be touched without a whole lot of BS overhead.

@doofus-01
Copy link
Member Author

I finally got out of that github cathole, and have a mostly working metadata check. (My struggle, that completely trashed the commit history of one of my repositories, seems to have more to do with github webhooks than anything purely git related.)

Figuring out how to make a PR for this sounds like another cathole I don't want to squish around in right now, probably best to get something mostly finished first. I'll update the first post when there's progress.

But for now, the specific check I'm looking at is:

  1. Fail if an image is updated or added without Artist or Copyright EXIF tag
  2. Fail if an image has both tags, but the copyright isn't an acceptable one. In the checks linked above, the list is
    • GNU GPL v2+
    • CC BY-SA 4.0
      But other licenses should probably be added. Public Domain and CC0 are obvious candidates, but I'm not sure what that means as far as generative AI, so I'd rather keep that in a separate issue - one that may need to be resolved before this is mainlined, but still not here.
  3. Should the Artist tag be checked against some list, so the check fails if there is an unrecognized contributor? I think so, this would be a flag that the credits/about need to be updated, but maybe this can be added later.
  4. Is there any other tag that should be included?

@doofus-01 doofus-01 added the Question Issues that are actually questions rather than problem reports. label May 26, 2024
@gfgtdf
Copy link
Contributor

gfgtdf commented May 26, 2024

I am a bit worried that this will result in a noticable size increase of the wesnoth github repo since it could mean that we will change the image files more often in the future. Unlike text files, git usually cannot handle binay files very well and afaik any change to them is stored as a full replacement of the image file.

@doofus-01
Copy link
Member Author

I am a bit worried that this will result in a noticable size increase of the wesnoth github repo since it could mean that we will change the image files more often in the future. Unlike text files, git usually cannot handle binay files very well and afaik any change to them is stored as a full replacement of the image file.

How worried are you, really?

I'm not seeing it, you've got to explain.

@Pentarctagon
Copy link
Member

@gfgtdf I assume that'd be pretty easy to test out?

@gfgtdf
Copy link
Contributor

gfgtdf commented May 27, 2024

How worried are you, really?
I'm not seeing it, you've got to explain.

So i'm not really an expert on the git details, so what i'm saying might be wrong, but i think i remember from an earlier discussion that it's a bad idea to change binary files too often in git repos because the (git history) compression cannot really store diffs of them, so that the git history can become very large if binary files get changed often. (a quick search on the internet gave me different contradictory results on this topic so i actually don't know. Some sources say this only applies extra compressed binary files and that compression part of git doesn't really make a diffrence between binary and text files, and other pages say git simply cannot do this for any kind of binary files at all)

I also don't know how often this change will actually change the files, (except once of course to add the first metadata, which would be fine in i any case i guess). My worries were more about cases where we later decide to add extra data, change the format, add authors etc.

@gfgtdf I assume that'd be pretty easy to test out?

Its a bit tricky actually since one needs to make sure to compare compressed git histoy to compressed git history, but sureley possible.

@Pentarctagon
Copy link
Member

Its a bit tricky actually since one needs to make sure to compare compressed git histoy to compressed git history, but sureley possible.

I mean that all you'd need to do would be to make a metadata change and then check how much data changed in the diff, wouldn't you? If the diff only says that 50 bytes where added, then surely git doesn't do a full replacement anyway.

@CelticMinstrel
Copy link
Member

If the diff only says that 50 bytes where added, then surely git doesn't do a full replacement anyway.

I doubt you can make such an assumption. It's certainly possible to do a binary diff which works out very similar in functionality to a text diff, but even if only 50 bytes were added, there's a pretty high chance that a lot more bytes were changed. And there's a point where the diff would end up being bigger than the original file. I'm not sure what it would take to hit that point with image or sound files though.

@Pentarctagon
Copy link
Member

Based on https://stackoverflow.com/a/59346690 and https://stackoverflow.com/a/53648035, the answer seems to be:

  • initially binary files are stored as completely separate copies
  • after git packs them (either every so often automatically or after running git gc), then it will store the changes as a delta between the two binary files

So I don't think ballooning the git history is a reason this can't be done.

@doofus-01
Copy link
Member Author

doofus-01 commented Jun 1, 2024

I also don't know how often this change will actually change the files, (except once of course to add the first metadata, which would be fine in i any case i guess). My worries were more about cases where we later decide to add extra data, change the format, add authors etc.

This was the source of my confusion; I don't understand why we'd suddenly start changing this info frequently, if at all. Moot point, I guess, if what Pentarctagon found is true. BTW, where does one find a repository's resource usage?

If we continue to use the copyrights.csv as the readable export list, it would be the author/artist tag and license/copyright that are used, and no reason a comment/notes tag can't be added where needed. The date, directory, and hash are for tracking, I guess they could still be added to the CSV, but I don't see a reason to add them to the Exif tags.

@Pentarctagon
Copy link
Member

What do you mean by a repository's resource usage, specifically?

@CelticMinstrel
Copy link
Member

Probably something similar to the total size of the repository history on disk?

@doofus-01
Copy link
Member Author

Total resources that Wesnoth is being granted by the Microsoft Corporation, or any subset of that (disk space, data transfer, anything else). I can't find any online meter, but I'm not curious enough to clone and check my disk usage. I was mostly wondering about the context of gfgtdf's objection.


Back on topic, I've got a maintenance script (derived from update_copyrights) to read the Exif tags from all the images and export a CSV. I'm sure it could be better, but it isn't part of CI automation, so as long as the relevant Exif tags aren't changing, the stakes should be pretty low. I'll update the OP.

If no one has any other suggestions, I'll consider it settled that these are the tags:

  • Artist - same as "Author" in the existing CSV. It is checked for in CI, but contents are not evaluated.
  • Copyright - same as "License" in existing CSV. It is checked for in CI, and must be a compatible license.
  • UserComment - same as "Notes" in existing CSV. It is not checked for in CI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI Issues involving the CI platform. Enhancement Issues that are requests for new features or changes to existing ones. Graphics Issues that involve the graphics engine or assets. Question Issues that are actually questions rather than problem reports.
Projects
None yet
Development

No branches or pull requests

7 participants