Support Substack with a custom domain #3244

EruditeLying · 2024-01-30T00:47:49Z

Remove the substack.com domain from the target regex to support Substacks with a custom domain

In detectWeb, manually check for the old target regex first, then for one of the Substack footer buttons in the page DOM. If neither matches, return false

Remove the substack.com domain from the target regex to support Substacks with a custom domain In detectWeb, manually check for the old target regex first, then for one of the Substack footer buttons in the page DOM. If neither matches, return false

alex-ter

There's a couple of inline comments and suggestions, otherwise looks and works (in Scaffold, manually + tests) ok for me (with my suggested change for the archive regex).

There are two test failures due to one-line expected result differences in view of the usual translator test rot (blogTitle changed on both test subjects) - probably worth fixing in this PR? Alternatively, I could update them after this one is merged, together with asyncifying (now that it drew my attention, why not do that).

alex-ter · 2024-03-17T18:47:38Z

Substack.js

@@ -2,14 +2,14 @@
 	"translatorID": "ac3b958f-0581-4117-bebc-44af3b876545",
 	"label": "Substack",
 	"creator": "Abe Jellinek",
-	"target": "^https://[^.]+\\.substack\\.com/(p/|archive)",
+	"target": "/p/|/archive",


This is of course quite generic, so this translator will be considered for quite a few URLs I suspect (e.g., many sites plausibly have some "archive"). A couple of suggestions in view of that:

The priority field should be bumped up to "250", I think. See translator priority docs for details. This is accounting for that a.footer-substack-cta check, which I think can be counted as a "unique check in detectWeb".

I'd suggest specializing the target pattern as much as possible and at least adding a (?:$|\?) (non-capturing because performance 🙂) for the archive part (see a proposal and a counterexample below). /p/ is trickier as I'm not sure what symbols Substack could use there, but maybe something along the same lines, i.e., /p/<something-but-not-/>$? Though I see in the test set that there may be a /p/<post name>/comments, so maybe not exactly that - essentially, this is just a suggestion to think twice of any additional options.

As for the archive, below is the proposed alteration and here's the counterexample link I could think of OTMH: IACR ePrint paper version record.

Suggested change

"target": "/p/|/archive",

"target": "/p/|/archive(?:$|\?)",

alex-ter · 2024-03-17T18:51:42Z

Substack.js

@@ -49,7 +51,7 @@ function detectWeb(doc, url) {
 function getSearchResults(doc, checkOnly) {
 	var items = {};
 	var found = false;
-	var rows = doc.querySelectorAll('a.post-preview-title[href*="/p/"]');
+	var rows = doc.querySelectorAll('a[data-testid="post-preview-title"][href*="/p/"]');


This is a good one - looks like this part is plain broken right now due to this change in the markup. In view of that, even if the pattern change doesn't get in (e.g., I can imagine maintainers proposing an alternative approach to keep the target specialized due to performance reasons), this one is worthwhile as a standalone fix.

alex-ter · 2024-03-17T19:21:12Z

BTW, one more thing to note - I see somehow none of your PRs submitted that day got a workflow run (with the usual checks), so maybe something like rebasing on top of current master would be useful too.

AbeJellinek · 2024-03-19T15:29:52Z

I don't think this is the right approach. The Substack translator is very basic. On that test case, Embedded Metadata already gets the author name, abstract, and date. The item type is wrong - webpage instead of blogPost - but the only field it's missing is websiteType ("Substack newsletter", which isn't necessarily correct for custom-domain Substack blogs). It would be much better to detect Substack in EM and set the item type correctly than to turn this into a generic translator.

alex-ter suggested changes Mar 17, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support Substack with a custom domain #3244

Support Substack with a custom domain #3244

EruditeLying commented Jan 30, 2024

alex-ter left a comment •

edited

alex-ter Mar 17, 2024 •

edited

alex-ter Mar 17, 2024

alex-ter commented Mar 17, 2024

AbeJellinek commented Mar 19, 2024

Support Substack with a custom domain #3244

Are you sure you want to change the base?

Support Substack with a custom domain #3244

Conversation

EruditeLying commented Jan 30, 2024

alex-ter left a comment • edited

Choose a reason for hiding this comment

alex-ter Mar 17, 2024 • edited

Choose a reason for hiding this comment

alex-ter Mar 17, 2024

Choose a reason for hiding this comment

alex-ter commented Mar 17, 2024

AbeJellinek commented Mar 19, 2024

alex-ter left a comment •

edited

alex-ter Mar 17, 2024 •

edited