Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[*chan] Add text-posts opt (generic text-tweets) #3626

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wlritchi
Copy link
Contributor

@wlritchi wlritchi commented Feb 6, 2023

The Twitter extractor has a text-tweets option, which causes it to return Directory results even for tweets with no media. Some other extractors (such as Patreon) behave this way by default. Still others skip non-media posts entirely. This PR adds a text-posts option to (some of?) the extractors that otherwise skip non-media posts, and updates the Twitter extractor to respect this new option alongside the existing Twitter-specific one.

I did a cursory search for common patterns (if not files and the like) that I expected might appear in extractors that skipped non-media posts. The only ones I found were the *chan extractors. The logic change is broadly similar in all of them: move the Directory result inside the loop over posts rather than just one per thread.

This change probably isn't meaningful without corresponding post-processor config for each extractor. I tested with the following config additions:

...
    "postprocessors": [
        {
            "name": "metadata",
            "event": "post",
            "filename": "{no}.post_data.json",
            "whitelist": ["2chan", "4chan", "420chan", "vichan"]
        },
        {
            "name": "metadata",
            "event": "post",
            "filename": "{postId}.post_data.json",
            "whitelist": ["8chan", "lynxchan"]
        }
    ]

420chan appears to be permanently offline, so no test results for that one. For the rest, I found threads with some text-only posts1 and manually tested. Everything seems to work as I expected.

This change was requested in #570 (comment). The ticket itself appears to be about Twitter; should that one be closed as completed? (regardless of this PR)

Footnotes

  1. Which was a rather unpleasant task for some of these sites.

@wlritchi
Copy link
Contributor Author

wlritchi commented Feb 6, 2023

Mmm, further testing suggests that actually the dict update changes in lynxchan and 8chan extractors aren't right; it looks like the threadId is overwriting the postId. Will investigate further. Resolved, a logic change that I thought was necessary in these extractors actually had some incorrect assumptions. Reverted, and squashed/force-pushed to keep git blame clean for the affected lines.

@wlritchi
Copy link
Contributor Author

wlritchi commented Feb 6, 2023

Oh, and if #3447 is merged, that extractor will need a similar change to the ones here.

@Scripter17
Copy link
Contributor

This seems to not work with --write-metadata

@mikf
Copy link
Owner

mikf commented Sep 15, 2023

--write-metadata only write per-file metadata, while the metadata returned here is per-post. You'd need an "event": "post" metadata post processor to make use of it.

@Scripter17
Copy link
Contributor

Sorry for not being clear

Adding --write-metadata stops the post-processor from working. Which I assume is not the intended behaviour

@mikf
Copy link
Owner

mikf commented Sep 17, 2023

--write-metadata will overwrite post-processors defined at the top level of your config, but anything in extractor or a site category should be unaffected.

{
    "postprocessors": [{"name": "mtime"}],
    "extractor": {
        "postprocessors": [{"name": "ugoira"}],
        "4chan": {
            "postprocessors": [{"name": "compare"}]
        }
    }
}
$ gallery-dl -c config.json -v https://boards.4chan.org/tg/thread/15396072/
[4chan][debug] Active postprocessor modules: [ComparePP, UgoiraPP, MtimePP]

$ gallery-dl -c config.json -v --write-metadata https://boards.4chan.org/tg/thread/15396072/
[4chan][debug] Active postprocessor modules: [ComparePP, UgoiraPP, MetadataPP]

@Scripter17
Copy link
Contributor

  1. That should probably be changed/raise a warning
  2. After I got it working, it seems lynxchan posts are all put in different folders instead of one per-thread folder. This happens with and without the postprocessor and --write-metadata

@Wiiplay123
Copy link
Contributor

After I got it working, it seems lynxchan posts are all put in different folders instead of one per-thread folder. This happens with and without the postprocessor and --write-metadata

The extra Directory messages are updating the directory for every post in the thread.

By default, the directory is set to threadId + the text in the post, which wasn't a problem when the directory was only updated for the first post. Now it's making folders with the original thread ID + the text of the current post. The metadata filename for the text post itself gets weird and ends up having the thread ID + the most recent image filename, or just "metadata" if there is none.

@Wiiplay123
Copy link
Contributor

I made an experimental commit for Lynxchan only, but it's just the bare minimum to make it work.
Is there already a way to trigger saving metadata without updating the directory? I uncommented Message.Metadata and made an extra function to do that in my version.

Wiiplay123@db8aece

As with wlritchi's version, it needs the postprocessor event in gallery-dl.conf set to "post". To keep the old metadata from being duplicated, I used this filename: "{postId}{num:?-//} {filename|subject|message[:50]}{extension:?['.'/]/}.json"

For posts with files, the filenames should be identical to the old ones. For posts without files, the format will be like the default folder filename.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants