Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Substack with a custom domain #3244

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

EruditeLying
Copy link

Remove the substack.com domain from the target regex to support Substacks with a custom domain

In detectWeb, manually check for the old target regex first, then for one of the Substack footer buttons in the page DOM. If neither matches, return false

Remove the substack.com domain from the target regex to support Substacks with a custom domain

In detectWeb, manually check for the old target regex first, then for one of the Substack footer buttons in the page DOM. If neither matches, return false
Copy link
Contributor

@alex-ter alex-ter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a couple of inline comments and suggestions, otherwise looks and works (in Scaffold, manually + tests) ok for me (with my suggested change for the archive regex).

There are two test failures due to one-line expected result differences in view of the usual translator test rot (blogTitle changed on both test subjects) - probably worth fixing in this PR? Alternatively, I could update them after this one is merged, together with asyncifying (now that it drew my attention, why not do that).

@@ -2,14 +2,14 @@
"translatorID": "ac3b958f-0581-4117-bebc-44af3b876545",
"label": "Substack",
"creator": "Abe Jellinek",
"target": "^https://[^.]+\\.substack\\.com/(p/|archive)",
"target": "/p/|/archive",
Copy link
Contributor

@alex-ter alex-ter Mar 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is of course quite generic, so this translator will be considered for quite a few URLs I suspect (e.g., many sites plausibly have some "archive"). A couple of suggestions in view of that:

  1. The priority field should be bumped up to "250", I think. See translator priority docs for details. This is accounting for that a.footer-substack-cta check, which I think can be counted as a "unique check in detectWeb".
  2. I'd suggest specializing the target pattern as much as possible and at least adding a (?:$|\?) (non-capturing because performance 🙂) for the archive part (see a proposal and a counterexample below). /p/ is trickier as I'm not sure what symbols Substack could use there, but maybe something along the same lines, i.e., /p/<something-but-not-/>$? Though I see in the test set that there may be a /p/<post name>/comments, so maybe not exactly that - essentially, this is just a suggestion to think twice of any additional options.

As for the archive, below is the proposed alteration and here's the counterexample link I could think of OTMH: IACR ePrint paper version record.

Suggested change
"target": "/p/|/archive",
"target": "/p/|/archive(?:$|\?)",

@@ -49,7 +51,7 @@ function detectWeb(doc, url) {
function getSearchResults(doc, checkOnly) {
var items = {};
var found = false;
var rows = doc.querySelectorAll('a.post-preview-title[href*="/p/"]');
var rows = doc.querySelectorAll('a[data-testid="post-preview-title"][href*="/p/"]');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good one - looks like this part is plain broken right now due to this change in the markup. In view of that, even if the pattern change doesn't get in (e.g., I can imagine maintainers proposing an alternative approach to keep the target specialized due to performance reasons), this one is worthwhile as a standalone fix.

@alex-ter
Copy link
Contributor

BTW, one more thing to note - I see somehow none of your PRs submitted that day got a workflow run (with the usual checks), so maybe something like rebasing on top of current master would be useful too.

@AbeJellinek
Copy link
Member

I don't think this is the right approach. The Substack translator is very basic. On that test case, Embedded Metadata already gets the author name, abstract, and date. The item type is wrong - webpage instead of blogPost - but the only field it's missing is websiteType ("Substack newsletter", which isn't necessarily correct for custom-domain Substack blogs). It would be much better to detect Substack in EM and set the item type correctly than to turn this into a generic translator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants