
Generic webpage translator #1092

Open · dstillman opened this issue Jul 11, 2016 · 56 comments

Labels: New Translator (Pull requests for new translators)

@dstillman (Member)

As suggested on zotero/translation-server#32, and further bolstered by zotero/zotero#1059, we should create a translator that saves the basic data (title, URL, access date) on all webpages.

Some follow-up work will be needed in the client to show the gray icon for this translator ID, and probably some other things.

To allow this to be rolled out to 4.0 clients without causing trouble, we should figure out a way to return a value from detectWeb only in 5.0. Not sure if we make the Zotero version available now, but if we want to avoid that (e.g., for other consumers of translators), we could do some sort of feature detection.
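
For example, the feature detection could be as simple as probing for an API that only exists in 5.0 clients. A rough sketch, where Zotero.isZotero5 is purely a placeholder for whatever capability we'd actually probe:

function detectWeb(doc, url) {
	// Placeholder probe: any API that only 5.0 clients expose would work here
	if (typeof Zotero.isZotero5 === 'undefined') {
		// 4.0 client: stay silent so this translator never triggers
		return false;
	}
	return 'webpage';
}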

(Ideally we could just use a minVersion here, but as far as I know the client won't ignore translators with later minVersions when running detection, which would seem to make a lot of sense.)

@zuphilip (Contributor) commented Jul 11, 2016

How about the other idea of extending the EM translator for this case? It looks to me like the addLowQualityMetadata function in the EM translator is similar to what you want to achieve. Thus, it might already be enough to extend detectWeb in EM to always return webpage as a last resort.

@adam3smith (Collaborator)

+1 to @zuphilip's question. Also, I don't understand why this must be limited to Zotero 5+ -- what am I missing?

@dstillman (Member, Author) commented Jul 11, 2016

How about the other idea of extending the EM translator for this case?

It's what I say in the other thread: "even in the single-save-button era I still think there's value in setting different expectations for EM and <title/>".

I don't understand why this must be limited to Zotero 5+ -- what am I missing?

Without client changes, the color icon would appear on every page, even for title/URL/accessDate, and there'd be a confusingly redundant set of options in the context menu (which hard-codes web-page saving right now).

@zuphilip (Contributor)

It's what I say in the other thread: "even in the single-save-button era I still think there's value in setting different expectations for EM and <title/>".

Well, you explained that the different colors serve some purpose. But if the EM translator can extract some data on a page (and it will also save a snapshot of that page), then I can't think of any use for a lower-quality website translator on that page. I guess we could also somehow color the icon for EM differently if we are in a low-quality-data case, maybe just when detectWeb returns website.

@dstillman (Member, Author)

We can just have the generic translator not show up in the context menu when the EM translator triggers. The point is that we can't distinguish between EM and generic data within a single translator, so they have to be separate.

@dstillman (Member, Author) commented Jul 11, 2016

It's true that the stuff in addLowQualityMetadata blurs the line here a little bit — I didn't realize the EM translator used keywords and description and even tried to do byline author extraction. It's a little odd to do those things when there happen to be other metadata tags but not do them for generic webpage saving, when those things aren't really related to the presence of the more complex metadata. On the other hand, it's possible that site authors are more likely to populate even the very basic meta tags like keywords and description with better data when they also have more complex metadata, whereas in the absence of more complex tags those basic tags might be very low quality (spammy, ignorant SEO stuff).

So, some options:

  1. Add a generic translator but keep it limited to what we do now (title, URL, access date), and keep the gray/color distinction.

  2. Copy some of that logic — stuff we extract from the page in EM but that doesn't alone trigger EM detection (description, keywords) — to the generic translator and keep the gray/color distinction. Some generic pages would start having (potentially very low quality) tags and abstracts.

  3. Trigger EM on all pages and show the gray icon on all pages that return 'webpage'. Despite the gray icon, some saved pages might include very high quality metadata.

  4. Trigger EM on all pages and show the blue icon on all pages that return 'website' (so no more gray icon anywhere, except maybe on non-HTML documents). Despite the blue icon, some saved pages might include nothing other than title/URL/accessDate.

@dstillman (Member, Author)

@simonster points out that the webpage item type doesn't have a lot of metadata available anyway, and the quality is often bad even when EM detects (e.g., here on GitHub). Even with EM, we're pretty much talking author and date at best. So this seems like a decent argument in favor of (3).

Here's what I think (3) would involve:

  1. Renaming "Embedded Metadata" to "Webpage"

  2. Changing init in EM to return 'webpage' as fallback in all cases

  3. Changing the client to show a gray icon for website, at least from EM. Not sure if we would show gray for non-EM translators that return website. I was inclined to say yes, since it highlights that we have specific support for a site/platform, but in an ideal world every site would just embed metadata, and then we'd be left with the same inherently limited website data

  4. Showing a "without snapshot" option in the context menu for this translator, or perhaps rethinking how we handle the "[with/without] snapshot" context menu options in general

One potential future complication: when we support JSON-LD, and specifically multiple JSON-LD blocks on the page, the translator would return multiple, which would remove the webpage option. This probably isn't a huge deal, but it would mean that there wouldn't be a good way of saving a straight webpage in those cases. But this seems sufficiently outweighed by all the other benefits here (e.g., to get webpage saving for free in the bookmarklet and translation server).

@avram (Contributor) commented Jul 12, 2016

Since EM would still be able to detect non-webpage content, I'm not sure renaming it to Webpage makes sense.

@dstillman (Member, Author) commented Jul 12, 2016

Hmm. That's fair, though it's the translator name, so it's saying that it's saving using the Webpage translator (i.e., extracting generic data from the webpage), not that it's saving as a webpage (which the icon indicates). But maybe overly confusing. "Embedded Metadata" is a bit technical to show on all pages, though. Best option might be to just not show anything in parentheses for this translator, since it'll be the default saving mode (and the default icon as well).

@zuphilip (Contributor)

I think your option 3) is a good choice!

@simonster points out that the webpage item type doesn't have a lot of metadata available anyway, and the quality is often bad even when EM detects (e.g., here on GitHub).

Yes, I agree with that. Therefore, it makes sense IMO to show the gray icon for this low-quality data, which is maybe just useful enough for saving some URLs for later reading (bookmark functionality), but usually one has to cite more reliable sources than plain webpages.

One potential future complication: when we support JSON-LD, and specifically multiple JSON-LD blocks on the page, the translator would return multiple, which would remove the webpage option.

Well, we will see this more clearly once we have some ideas for a JSON-LD translator. In general I think it is a good idea to make it possible to use Zotero as a bookmark tool as well, and therefore any handy one-click option to capture the website (as one item) is appreciated.

Best option might be to just not show anything in parentheses for this translator, since it'll be the default saving mode (and the default icon as well).

I.e. simply Save to Zotero (with snapshot) and Save to Zotero (without snapshot). That is a good idea. Alternatively, we could think about names like Save to Zotero using "Website Data", Save to Zotero using "Web Data", Save to Zotero using "Generic", or Save to Zotero using "Default".

@dstillman (Member, Author)

See also #686, which suggests that DOI should go in this too. zotero/zotero#1110 is an interesting test case.

@adomasven (Member) commented Sep 29, 2017

So I'll be working on this, as per @dstillman's comment

  1. Renaming "Embedded Metadata" to "Webpage"

  2. Changing init in EM to return 'webpage' as fallback in all cases

  3. Changing the client to show a gray icon for website, at least from EM. Not sure if we would show gray for non-EM translators that return website. I was inclined to say yes, since it highlights that we have specific support for a site/platform, but in an ideal world every site would just embed metadata, and then we'd be left with the same inherently limited website data

  4. Showing a "without snapshot" option in the context menu for this translator, or perhaps rethinking how we handle the "[with/without] snapshot" context menu options in general

noting the following:

  1. Let's keep the name, but in the connector display "Save to Zotero" without the translator name. The name is more descriptive for translator creators, and users never have to see "Embedded Metadata" if it's the default translator.

  2. Add some additional handling code within the connector to allow saving with and without snapshot for the EM translator.

There have been suggestions to incorporate COinS and DOI into EM, but I would like to leave that to someone else, as there are additional considerations, like what happens with the translators (if any) that use both COinS and EM for initial metadata.

@adomasven (Member) commented Sep 29, 2017

Ok, so a problem with the above approach is that if EM always returns at least webpage, it will always overshadow the DOI translator. We could change the priority of EM back to 400 (see discussion), but it was moved above DOI for a reason, which means that incorporating DOI into EM is inevitable.

I understand that we always return multiple for DOI translation. Is it only to verify the data, or does DOI translation sometimes genuinely have multiple items? Any suggestion on how/whether this could be reasonably handled?

@adomasven (Member) commented Sep 29, 2017

Yep, so at least for some of the DOI test cases the select dialog contains multiple entries, with only one of them corresponding to the actual article being saved. Potential options:

  1. Display a select dialog for saves that include a DOI and ask the user to select the relevant entry, if any. But that is crude and potentially confusing, like a twisted captcha for translation. We might want to disable DOI translation for translation-server.
  2. Keep the DOI translator separate with a lower priority. For pages with DOIs present, users would have to manually select translation with DOI from the context menu.

@adam3smith (Collaborator)

I think 2. is the way to go. The cases where you do want to use DOIs as multiples are often fairly sophisticated uses (e.g. importing all references from an article you're looking at in HTML) -- but as that example shows, it's also a really useful feature.

@zuphilip (Contributor)

Agreed with @adam3smith, and the example http://libguides.csuchico.edu/citingbusiness shows that we are already preferring EM over DOI in "sparse" cases. (Technically, I guess it would also be possible to call the DOI translator from the EM translator when this case happens, but that might be more fragile code...)

@adam3smith (Collaborator)

I thought that was the idea of combining? For single-DOI cases, call DOI in EM with some heuristic for making sure we're looking at the same item, then merge the data. Same for COinS, which can also have multiples.

@dstillman (Member, Author) commented Sep 30, 2017

I'm a bit confused about the argument for (2). DOI being the only available translator is fairly common, so we wouldn't want to start preferring a generic webpage in that case. Even if we kept it separate for multi-DOI cases but integrated it into EM for single-DOI cases, a search results page with multiple DOIs and no other real metadata would start offering a generic webpage as the main option, which is worse than the current behavior. I think the only real solution is to integrate DOI (and COinS, and JSON-LD eventually) into EM and decide what to do based on what's available.

So this is a bit radical, but working through some scenarios and optimal behavior, it seems we need to allow a single translator to provide multiple save options. This is how the EM translator could pass webpage options, including snapshot/no-snapshot, with or without color and before or after its other save options as appropriate. (We could still alter the display order of the snapshot options based on the client pref, but we wouldn't need to do most other special-casing for the EM translator.) There are also various scenarios where the EM translator could intelligently decide which options to offer, whereas relying on multiple translators based on priority is much more limited, would result in redundant, confusing, inferior secondary options (e.g., a "DOI" menu option that only used CrossRef when the save button was already combining data from the page and from CrossRef), and would require special-casing for the placement of various options (e.g., putting the generic webpage options last).

We could allow returning an object (instead of a string) to specify a different label, including an empty one, which, among other things, would avoid the need to special-case the EM translator to remove the label and let us instead intelligently label based on how it was actually doing the save (since "DOI" or "Embedded Metadata" or "COinS" would sometimes be nice to show).

Finally, this could obviate the need for various translator hidden preferences and make those options much more accessible (e.g., Nature Publishing Group (with supplemental data)).

So for EM, detectWeb() could return an array like this:

[
  'journalArticle',
  {
    label: 'DOI',
    icon: 'multiple'
  },
  {
    label: 'Web Page with Snapshot',
    icon: 'webpageGray'
  },
  {
    label: 'Web Page without Snapshot',
    icon: 'webpageGray'
  }
]

which would result in a button with Save to Zotero (Embedded Metadata) and a journal article icon and a menu with Save to Zotero (Embedded Metadata)/journalArticle, Save to Zotero (DOI)/multiple, and two gray webpage options.

doWeb() would be called with the chosen index, including for the snapshot options.
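
For illustration, doWeb() might branch on that index like this; the indexes mirror the detectWeb() array above, and the scrape/save helpers are made up:

function doWeb(doc, url, optionIndex) {
	switch (optionIndex) {
		case 1: // DOI
			return scrapeDOI(doc, url);
		case 2: // Web Page with Snapshot
			return saveWebpage(doc, url, { snapshot: true });
		case 3: // Web Page without Snapshot
			return saveWebpage(doc, url, { snapshot: false });
		default: // index 0: Embedded Metadata
			return scrapeEM(doc, url);
	}
}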

With that in mind, some example scenarios:

Page has a non-generic translator, embedded metadata for non-webpage, no DOIs

Item type icon via non-generic translator, EM item type in menu, EM gray webpage options in menu

Page has a non-generic translator, embedded metadata for webpage, no DOIs

Item type icon via non-generic translator, EM color webpage options in menu

Page has single non-webpage embedded metadata and multiple DOIs

Item type icon, DOI selection in menu, gray webpage options in menu — all from the EM translator. As a single translator, doWeb() could resolve the first DOI and combine metadata from EM and CrossRef.

Page has no embedded metadata but multiple DOIs

Folder icon via EM translator, gray webpage options in menu. In doWeb(), resolve first DOI, if DOI seems to match page, just treat as regular DOI list. Otherwise, first entry in select dialog is current page using generic info (title, URL, access date) and DOI for the rest of results. As a single translator, it allows saving of the generic page info (potential improvement over status quo) and avoids showing a gray webpage icon even though there might be a DOI for the main item on the page (which would be a regression from status quo).

Page has single-item embedded metadata that returns webpage and one DOI

Folder icon via EM translator, color webpage options in menu. In doWeb(), resolve DOI, if DOI seems to match embedded metadata, combine (which probably means using only CrossRef). Otherwise display select list with first entry from embedded metadata and resolved DOI as second entry. (For the first case, a little weird to save straight from a folder, but why show two entries when we know one is worse and why show one entry if we're sure it matches the current page?) As a single translator, it avoids saving a webpage item when there's better metadata available as DOI, which is an improvement from current behavior where EM translator is prioritized over DOI.

Page has single-item embedded metadata that returns something other than webpage and one DOI

Same as previous, but optimistically show an item type icon from the embedded metadata. Combining metadata (when resolved DOI matches embedded metadata) might just mean adding an abstract from the embedded metadata to supplement CrossRef data.

Page has single-item embedded metadata that returns webpage and no DOIs

Color webpage icon, color webpage options (snapshot/no-snapshot) in menu, no gray options

@dstillman (Member, Author)

Another thing we could do: ISBN detection that only ever showed as a folder in the menu and was never offered as a primary method, for the reasons @adam3smith explains in that thread.

@adam3smith (Collaborator)

I'm convinced by that rundown. The only one that's a bit wonky (no metadata, multiple DOIs) is a bit weird currently too, and the proposed solution is a slight improvement. COinS should likely work exactly the same way.

@adomasven (Member)

In doWeb(), resolve DOI, if DOI seems to match embedded metadata, combine (which probably means using only CrossRef). Otherwise display select list with first entry from embedded metadata and resolved DOI as second entry.

Do you have any suggestions for how the "seems to match" check would be performed in JS, considering we only had very low-quality metadata before the DOI lookup? Some sort of fuzzy matching is needed, but that would mean involving a third-party library, and showing false positives first would be a rather bad experience.


In general, one of the reasons we wanted a generic translator (and why I specifically decided to work on this now) was to remove the special-casing for pages without translators in the Zotero, connector, and translation-server codebases, and to leverage the existing code to provide generic saving in all instances. However, the plan outlined above works against at least the simplification goal: it will take a non-trivial amount of time and effort to implement and roll out within the translators and the translate software, and it will make client handling in translation-server more complicated too. Having the above working would be great, but I wouldn't want to commit to a change this big.

Having said that, I propose a less elegant and efficient, but much simpler solution:

  • If EM only contains webpage data, run the DOI translator's detectWeb within EM; if there are DOIs present, return undefined, otherwise return webpage(Gray?).
  • Change the special-case code within the connector to always allow running the EM translator. If EM returns undefined, list the EM options with and without snapshot as the final two options in the context menu; otherwise list the EM translator according to its priority.

If there are DOIs present, EM will not overshadow the DOI translator; otherwise it will take over. If both rich EM data and a DOI are present, the two can coexist. This way we can avoid any changes or special translation handling within Zotero and translation-server and have a translator for every page. It sacrifices code clarity in the intermediate term, but it's a workable solution for the short term until someone has the time and spirit to commit to the bigger change.
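
A rough sketch of the detectWeb() side of this, where detectFromMetaTags and detectDOIs are just stand-ins for EM's existing detection and the DOI translator's detection:

function detectWeb(doc, url) {
	var itemType = detectFromMetaTags(doc); // EM's existing detection, name illustrative
	if (itemType) return itemType;
	// Only webpage-level data here: if DOIs are present, stay silent
	// so the DOI translator isn't overshadowed
	if (detectDOIs(doc, url)) return undefined;
	return 'webpage';
}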

@dstillman (Member, Author)

I think yours is a good interim plan. Mine will let us remove almost all special-casing and also provide better results, but it will definitely take some work to get there, and it might make sense as part of a larger reworking of the translator architecture (e.g., to use promises everywhere). I'll probably work on that at some point.

webpage(Gray?)

webpageGray doesn't exist now, so we'd have to add that, but as long as we're still special-casing, we can just use a gray icon whenever the EM translator returns webpage or undefined, and then we wouldn't risk problems in translation-server or elsewhere before we add proper support for webpageGray. I think we can still use the color icon for non-EM webpage results — even though, as noted above, metadata for webpage is really limited, it's probably worthwhile to show that we're doing something site-specific and that the EM options are still available in the menu.

So the user-facing changes here will be that 1) you'll see the blue webpage icon much less often and 2) the gray icon and webpage menu options will start showing more data. And translation-server will be able to save all webpages.

@dstillman (Member, Author)

From the forums, here's an example where DOI gets better metadata overall but EM gets the PDF, Abstract, and a better date: http://www.sdewes.org/jsdewes/pid6.0223

@zuphilip (Contributor) commented Jan 1, 2019

Actually, it seems that the EM translator is used as a base for 105 other translators, which cover the most important journals.

It might be useful to look closer at these dependent translators: https://github.com/zotero/translators/search?q=951c027d-74ac-47d4-a107-9c3069ab7b48&unscoped_q=951c027d-74ac-47d4-a107-9c3069ab7b48 . I just clicked on a few, and two things became clear (though I suggest looking closer at all these dependent translators, not just a sample):

  • The current EM translator is often a good start for a site. It then may need some small adjustments, e.g. an XPath for the abstract, how to link the PDF, saving constant data about the journal/newspaper like the ISSN, or formatting the date better.
  • Second, the detection of multiples cannot be done with EM, and therefore this is done in an individual translator. Moreover, the detection by EM may be wrong or too generic, which can be improved with information about the specific site.

Thus, EM is currently used as a generic way of extracting bibliographic data from websites (possibly tweaked a little by a specific translator).

If we are using it as a dependency for other translators, it should be as simple and predictable as possible. [...]
So, I would suggest adding another translator that would do all the smart logic. Let's call it for now the ultimate translator. It would be called when site-specific translators fail. That translator wouldn't do any metadata extraction on its own; instead it would call all the other generic translators (EM, COinS, DOI, unAPI, etc.) and intelligently combine metadata while also deciding if the result is single or multiple items.

No, actually I would expect in your scenario that most/some of the dependent translators would then need to be based on this new ultimate/merging translator. This would then be the same as the current situation, but with another intermediate step.

Get metadata for the first DOI and try to match the article

Okay, that sounds fine and can possibly fix some currently problematic cases.

E.g. https://journals.sfu.ca/flr/index.php/journal/article/view/137 returns very poor metadata and doesn't give other translators a way in.

This OJS instance is not providing much more machine-readable/guessable information, although OJS makes this in general very easy: there are mandatory plugins for OpenURL, DublinCore, and MODS, and I guess they just have not enabled the Dublin Core Indexing Plugin.


One main drawback of EM currently is IMO the lack of JSON-LD and other variants of schema.org. I tried to work on these, but currently my time does not permit me to continue here...

As for the order given above:

  • If there is unAPI or a <link> to MARCXML or MODS, then I would expect the quality to be higher than anything else.
  • I am skeptical that we can just rank the different methods for the optimal result. If you look closer at the RDF translator, you see that for each field several options are considered, in the order we expect to be best. In this way it is e.g. possible to take the abstract and PDF from DublinCore but the other fields from HighWire meta tags.

@mrtcode (Member) commented Jan 2, 2019

No, actually I would expect in your scenario that most/some of the dependent translators would then need to be based on this new ultimate/merging translator. This would then be the same as the current situation, but with another intermediate step.

But why would we want to base other translators on this new combined translator? I think the combined translator should only be used when a site-specific translator fails, and never as a dependency (except maybe in that case with multi-domain OJS). I imagine it would be different from what we regularly call translators; it would be more like logic that decides what to do with the web page if there is no site-specific translator (or it failed).

But my point is that making even a small number of translators (or the EM translator in general) more strict to force fallback isn't the right fix, because we still want anything else that the site-specific (by which I just mean non-generic, not that it's tailored to a specific site) translators can provide, which could include a PDF, tags, etc. (In this case it doesn't for some reason, but it could.) (Granted, if this one fell back to DOI, that would also get the PDF, because it's OA, but this problem could happen with a gated journal too.)

I don't think we need to fix site-specific translators' metadata problems by using the combined translator. If a site-specific translator is implemented, it should be better by default, because translator authors should know what they are doing and find the best way to extract metadata, even if they need to additionally get metadata by DOI. And they can use all the same methods that are used in the combined translator.

Also, the output of site-specific translators can be controlled with tests, and problems should be fixed within the same translator. Therefore I agree with @adam3smith that making a few translators stricter could be a solution.

From the forums, here's an example where DOI gets better metadata overall but EM gets the PDF, Abstract, and a better date: http://www.sdewes.org/jsdewes/pid6.0223

The combined translator would successfully extract the correct metadata from that URL. But let's imagine that someone decides to make a translator for that URL. If so, the translator would have to combine metadata from EM and DOI to get the same quality of metadata. And basing it on the combined translator wouldn't be a good idea, because it's going to do too much magic. Therefore, if the translator author sees that EM returned an item with a missing ISSN or an imprecise date, those should be extracted either from the page or by DOI.


Also, we are discussing adding MODS, MARCXML, and JSON-LD to the EM translator, but what if the page has multiple items? EM is single-item only.

@dstillman (Member, Author)

Also, we are discussing adding MODS, MARCXML, and JSON-LD to the EM translator, but what if the page has multiple items? EM is single-item only.

I would think that retrieval based on <link> would go in a separate Linked Metadata translator called from the combined translator, similar to unAPI, not in EM. But in-page JSON-LD might go in EM, in which case it would need to possibly handle multiple items. Do you mean that it'd be a problem in terms of EM being called from other translators that expect a single item?

@mrtcode (Member) commented Jan 3, 2019

EM is currently designed to return only a single item, and all dependent translators also expect a single item. And yeah, I'm thinking about how that influences other translators.

Also, if we start advising people to use MODS/MARCXML, we should expect translators that wrap that Linked Metadata translator and improve some fields, just like EM is used now in other translators.

@dstillman (Member, Author)

all dependent translators also expect a single item

That's not true — it's just a callback on itemDone, which can run more than once (the same way that, say, the Google Scholar translator can call an import translator like RIS and save more than one item). I'm not actually sure what happens now if a child translator calls selectItems(), but there's a good chance it just triggers the usual selection window.
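
For example, a parent can already collect several items from a child import translator, since itemDone simply fires once per item (risText here is assumed to have been fetched by the parent):

var items = [];
var trans = Zotero.loadTranslator('import');
trans.setTranslator('32d59d2d-b65a-4da4-b0a3-bdd3cfb979e7'); // RIS
trans.setString(risText);
trans.setHandler('itemDone', function (obj, item) {
	// Runs once per item in the RIS data
	items.push(item);
	item.complete();
});
trans.translate();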

@mrtcode (Member) commented Jan 3, 2019

all dependent translators also expect a single item

That's not true — it's just a callback on itemDone, which can run more than once (the same way that, say, the Google Scholar translator can call an import translator like RIS and save more than one item). I'm not actually sure what happens now if a child translator calls selectItems(), but there's a good chance it just triggers the usual selection window.

I was thinking about cases like this one, where the translator actually trusts that it gets a single item, because otherwise it would add the same abstract to all items, which wouldn't make sense. Of course, if a website for which a site-specific translator was implemented has only one item, why should it ever return multiple items?

Anyway, if we are adding JSON-LD, which can return multiple items, then logically we should add COinS too, which can also return multiple items. But again, I am trying to understand what the consequences of making EM a multi-item translator would be.

Also, adding JSON-LD and COinS to EM means there must be logic in the EM translator that combines metadata when multiple methods exist. And what if RDF returns a single item but JSON-LD or COinS returns multiple?

@mrtcode (Member) commented Jan 7, 2019

I'm thinking that the "Embedded Metadata" translator name is maybe a little bit confusing and sets our thinking on the wrong path: partially because we always imagined it as a last-resort generic translator that has all the extraction methods inside, and partially because it's used in many other translators and we want it to automatically extract as much metadata as possible, so that translator developers only have to fix up a few remaining fields.

But let's imagine what would happen if the Embedded Metadata translator were renamed to something narrower like "Meta Tags translator". It would then be just like a regular non-site-specific translator, e.g. COinS.

So I think all translators should be separated:

  • Meta tags translator (previously Embedded Metadata) [single]
  • Linked metadata [mostly single, but can be multiple]
  • JSON-LD [multiple]
  • Microdata [multiple]
  • COinS [multiple]
  • unAPI [multiple]
  • DOI [multiple]

A few more reasons not to merge any other translators into the EM translator:

  • Although it would be convenient for translator developers to have a translator that does it all, at the same time they would lose the option to manually choose metadata sources and fields.
  • Combining metadata from multiple sources will look like a "black box" to dependent translator developers.
  • Combining single and multiple metadata is also complicated.
  • There is no reason to merge only some translators (i.e. JSON-LD) into EM and leave the others. Better to keep them all separate and simple.
  • Each translator has its own nuances, e.g. COinS sometimes queries the Crossref OpenURL API.
  • Making the EM translator multi-item can result in unexpected behavior for parent translators that expect only one item.

So my suggestion is to keep all translators separate, use them in site-specific parent translators separately, and then introduce a combined last resort translator that intelligently uses all the previously listed separate translators.

The combined translator wouldn't be used in any other translator, except maybe in multi-domain translators, because they can't control their output quality with tests, yet it would be dangerous for them to block the combined translator. In that case the combined translator could be invoked with the already-extracted metadata, which would be utilized too.

@dstillman (Member, Author)

I think that makes a lot of sense.

Only somewhat related, but one general concern I have is that, traditionally, we've been pretty complacent about the data available on a given site — we've mostly just accepted that what's there is the best we can do, even if some fields are missing. It would be nice to figure out ways to make sure we're getting as much data as possible, even if it means using other services. I don't think it's realistic to solve that purely by convention and tests (e.g., by using the DOI translator as a dependency more liberally, though we can do that too), and I still think we may want to consider certain thresholds or rules that trigger automatic supplementation of the data when possible.

@mrtcode (Member) commented Jan 8, 2019

Well, that sounds similar to what we are trying to do with zotero/zotero#1582.

If we trust that translators are already doing their best to extract metadata from the page, there is no need to perform any additional generic translation for the page. So the only thing left is to utilize identifiers to retrieve metadata from additional sources, which is what we are doing in zotero/zotero#1582:

  1. Resolve an identifier with our resolver API (currently only DOI), if there isn't one already
  2. Get metadata by identifier (other ids besides DOI have limited querying capabilities)
  3. Get metadata from publisher website if we are not translating it already (resolve the publisher URL over doi.org)
  4. Combine metadata

And actually the combined translator will have some similarities with the metadata-update logic in the client, i.e. it gets metadata by an identifier (DOI) and combines metadata. I'm a little bit concerned about duplicated operations in some situations. For example, if the user manually triggers a metadata update in the Zotero client and the combined translator takes over, the metadata will be fetched from the DOI RA and combined twice - once by the combined translator, and again by the metadata-update logic in the client. It would be nice to somehow converge both logics.

We were previously discussing automatically triggering the metadata-update logic when saving items via the Zotero client lookup dialog or the connector, but I think the conclusion was to proceed with manually triggered metadata updating and see how it performs.

We had concerns about leaking our usage stats and querying some identifier APIs too often.

I'm also concerned about Zotero connector/bookmarklet and cross-origin requests. What are our limitations here?

@mrtcode (Member) commented Jan 10, 2019

I'm waiting for suggestions on how we could improve the generic metadata extraction, but if no one opposes, I'm going to start implementing the roadmap below. And of course everyone is welcome to work on any part too.

  1. Update the Embedded Metadata translator:
    • Make sure it's only extracting from meta tags and isn't doing anything beyond its scope, like addLowQualityMetadata
    • If some site-specific translators depend on the addLowQualityMetadata result, fix them
  2. Update the DOI translator:
    • Extract the DOI from the web page URL
    • Return results in the original order
  3. The combined translator:
    • Set its priority higher than any other generic translator, i.e. EM, COinS, unAPI, DOI, etc.
    • Do detection and use generic translators to extract metadata:
      • DOI translator:
        • Optimistically get metadata for the first DOI (from the URL or body) and try to match the article; otherwise get metadata for all other DOIs and try again
        • Utilize Zotero.Utilities.levenshtein plus some additional magic to match DOI metadata with the web page title and maybe some other fields (see the sketch at the end of this comment)
      • Embedded Metadata
      • Linked metadata
      • JSON-LD
      • Microdata
      • COinS
      • unAPI
    • Run addLowQualityMetadata, which would be moved over from EM, plus maybe some additional logic like automatic abstract extraction, etc.
    • Combine metadata:
      • Use DOI metadata as a base
      • Use other translators' metadata to fill empty fields, or replace a field when we can detect that the specific field is better
    • Automatically decide if the final result should be single or multiple
  4. Implement the linked metadata translator:
    • Get metadata from various sources
    • Combine metadata field by field; take inspiration from the RDF translator
  5. Implement a JSON-LD translator
  6. For multi-domain translators, add a fallback to the combined translator

The improvements will be made in steps; to begin with, we basically just want to wrap the DOI and EM translators with the new combined translator.

As soon as the combined translator wraps other translators, I will use its output to collect and compare metadata from all URLs in the translator tests. This will allow us to review how metadata differs between the various translators and should give a better idea of how to combine metadata from different translators.
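
Here's the sketch mentioned in step 3: a toy version of the title check. The normalization and threshold are guesses, not the final logic:

function titlesMatch(pageTitle, doiTitle) {
	function norm(s) {
		return s.toLowerCase().replace(/[^a-z0-9 ]/g, '').replace(/\s+/g, ' ').trim();
	}
	var a = norm(pageTitle), b = norm(doiTitle);
	if (!a || !b) return false;
	var distance = Zotero.Utilities.levenshtein(a, b);
	// Accept if the edit distance is small relative to the longer title
	return distance / Math.max(a.length, b.length) < 0.15;
}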

@zuphilip (Contributor)

I agree that it is cleaner to have separate translators and one combining translator. However, I cannot say which parts of the EM translator (meta tags, microdata, low-quality data, ...) the current 100+ dependent translators depend on, nor what this would mean for future changes. Maybe you can help me answer some questions around that aspect:

Can we do the same things we can currently do in dependent translators?

For a dependent translator I would then still be able to call any of the new separate translators, or possibly more than just one. However, you said that I should usually not call the merged translator, but addLowQualityMetadata lives only in there. If some of the data has to be added manually in my dependent translator as well, then I would possibly have to add steps similar to the addLowQualityMetadata function to my dependent translator. Is that correct? Is this a possible code duplication?

Can we do the same things in a dependent translator with some easy code?

I could imagine that for some website I would need a specific translator for the multiples, and that for most of the metadata I could then use a mixture of JSON-LD, meta tags, and microdata. Then I possibly need to call all three translators, e.g. in a nested way:

function scrape(doc, url) {
	var translatorEM = Zotero.loadTranslator('web');
	translatorEM.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48'); // Embedded Metadata
	translatorEM.setHandler("itemDone", function (obj, itemEM) {
		var translatorJSONLD = Zotero.loadTranslator('web');
		translatorJSONLD.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48-jsonld'); // hypothetical JSON-LD translator ID
		translatorJSONLD.setHandler("itemDone", function (obj, itemJSONLD) {
			var translatorMICRODATA = Zotero.loadTranslator('web');
			translatorMICRODATA.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48-microdata'); // hypothetical microdata translator ID
			translatorMICRODATA.setHandler("itemDone", function (obj, itemMICRODATA) {
				/*
				combine itemEM, itemJSONLD, and itemMICRODATA here
				and/or add some site-specific data, then complete
				only the final combined item
				*/
				itemMICRODATA.complete();
			});
			translatorMICRODATA.translate();
		});
		translatorJSONLD.translate();
	});
	translatorEM.getTranslatorObject(function (trans) {
		trans.itemType = "newspaperArticle";
		trans.doWeb(doc, url);
	});
}

Or is there a much easier way to do the same? Do all these nested calls here work? I remember some problems with EM being called from other translators (sandboxing hell??), but maybe they are solved. Feasibility aside, this code is IMO quite difficult to work with. Could we possibly add some helper functions, maybe in Zotero.Utilities, for such cases?

(I hope it is okay that I play devil's advocate here with my questions. If you think that is not helpful, you can also let me know.)

@zuphilip (Contributor)

There is no reason to merge only some translators (i.e. JSON-LD) into EM and leave the others. Better to keep them all separate and simple.

Some vocabularies like Dublin Core or schema.org can be written as meta tags, microdata, or JSON-LD. The syntax differs and could be handled by separate translators, but the semantics are the same (e.g. assigning DC.title to the title field in Zotero) and should be reused.

@dstillman (Member, Author)

Yeah, I'm not sure removing addLowQualityMetadata from EM makes sense. That includes literal <meta> tags like author and keywords and even some OG tags, which seem like they should be extracted along with the other stuff. The byline extraction based on arbitrary classes (byline and vcard) seems like a potential candidate for moving to a utility function that could be called explicitly by other translators, including the combined translator.
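
Such a utility might look something like this (the selector list is illustrative, based on the byline/vcard classes mentioned above, not the actual EM code):

function extractByline(doc) {
	// Common byline markup; the exact selectors would need tuning
	var node = doc.querySelector('.byline, .vcard .fn, [rel="author"]');
	return node ? node.textContent.trim() : null;
}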

Re: nesting, we're developing this on a branch where we have async/await support in translators (though we still need to figure out how network requests should work), and I'm going to try to make let items = await translatorJSONLD.translate() work for child translators. You should even be able to create multiple translator objects and do something like let itemArrays = await Promise.all(translators.map(t => t.translate())) to benefit from parallel network requests.
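
Under those assumptions, the nested example above could collapse to something like this (a sketch against the in-progress branch, not a finished API):

async function scrape(doc, url) {
	let em = Zotero.loadTranslator('web');
	em.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48'); // Embedded Metadata
	em.setDocument(doc);
	let [item] = await em.translate();
	// combine with other child translators' results and/or
	// add site-specific data here, then complete the item
	item.complete();
}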

@dstillman (Member, Author)

Some vocabularies like Dublin Core or schema.org can be written as meta tags, microdata, or JSON-LD. The syntax differs and could be handled by separate translators, but the semantics are the same (e.g. assigning DC.title to the title field in Zotero) and should be reused.

Specifically, they would all just forward to RDF.js, like EM does now. We discussed this previously in the context of JSON-LD.

@mrtcode (Member) commented Feb 15, 2019

The combined translator (I actually named it "Generic.js") is functioning, and I am currently testing it with journal articles from 3.5k unique publishers.

So the goal is to make this translator intelligent enough to automatically decide whether it's returning single or multiple items. But that's quite challenging to do in a generic way.

In the past, Zotero automatically used DOIs from the page, but the decision was made to change that, because the translator never knows if the DOI belongs to the current article, search results, references, or the next article in the journal. But actually the same problem applies to JSON-LD, COinS, unAPI, and Microdata. You are never sure if the metadata describes the item on the current page or something else.

The following ways are used to detect whether the current web page represents a single item:

  1. There is a single DOI in the URL
  2. There is Embedded Metadata (it's in HEAD and always means a single item)
  3. There is a DOI in the Embedded Metadata result
  4. Linked Metadata (in HEAD; if it were in BODY it would be a different story) also always represents a single item
  5. An item from JSON-LD, DOIs, COinS, unAPI, or Microdata (not implemented yet) is matched with the title from document.title or, in some circumstances, from H1, H2, H3

To put it simply, all metadata in HEAD represents a single item (except JSON-LD and unAPI), and all metadata in BODY can represent single or multiple items - but you never know which.

So not only are the extracted DOI items matched against the page title; so is all the other metadata for which we can't assume that it undeniably represents a single item.

And then the combined translator cross-matches, deduplicates, and combines item metadata from the different translators.

JSON-LD
Now a few thoughts regarding JSON-LD. The translator is working. It transforms JSON-LD to RDF, and it does that without any library, so it supports only the compacted JSON-LD form; but it was working fine with all the websites I encountered, even though that's totally not according to the standard. The json-ld library was 20K lines, and the current JSON-LD-to-RDF code is 50 lines.
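
For context, collecting the raw compacted JSON-LD blocks is the simple part; a minimal sketch (the RDF conversion is omitted here):

function getJSONLDBlocks(doc) {
	var blocks = [];
	var scripts = doc.querySelectorAll('script[type="application/ld+json"]');
	for (var i = 0; i < scripts.length; i++) {
		try {
			blocks.push(JSON.parse(scripts[i].textContent));
		}
		catch (e) {
			// As noted below, invalid JSON-LD exists in the wild; skip it
		}
	}
	return blocks;
}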

I know we are considering recommending that people expose metadata in this format, but I see huge problems with it:

  1. JSON-LD can contain nested metadata with sophisticated relations and many different ways of representing them, while our JSON-LD-to-RDF method is relatively dumb - it just searches all over the RDF for items. Processing JSON-LD according to the nuances of the schema.org vocabulary would be out of this translator's scope.

  2. There can be multiple types for the same web page, or the same page can have multiple items. But again, everything is too dynamic to figure out what belongs to what.

  3. It produces more "noise". Even though it's still relatively rare to encounter this format on publisher websites, it already results in many empty or partially empty items. The more mainstream it becomes, the more noise we will get. More and more data will be exposed, but we are interested only in bibliographic data, which is just a small part of it.

  4. The format is sophisticated, and I already see the trend that it's difficult for website maintainers to produce quality metadata. Some of the JSON-LD is even invalid.

So I think we shouldn't recommend this format. It's a little bit "Wild West". Something that isn't so mainstream and is more targeted at bibliography would be a better choice.

@dstillman (Member, Author) commented Feb 15, 2019

For JSON-LD, I wouldn’t let the size issue affect our decision too much. We can put a library in a separate utility file without putting it in the translator itself. We would probably want to avoid injecting it into every page, but if we can do detection without the library we might be able to inject it dynamically for saving when necessary.

@mrtcode (Member) commented Feb 15, 2019

Good to know, but the missing jsonld.js is totally unrelated to the listed JSON-LD downsides.

@zuphilip (Contributor)

To put it simply, all metadata in HEAD represents a single item (except JSON-LD and unAPI), and all metadata in BODY can represent single or multiple items - but you never know which.

A conservative approach as you describe seems fine to me. I would restrict point 5 to cases where the H1, H2, or H3 is unique within the page. Some unnecessary multiple results shouldn't be too troublesome; in the worst case, the user has to do two more clicks if they are only interested in the main entry.

The json-ld library was 20K lines, and the current JSON-LD-to-RDF code is 50 lines.

It is also possible to think about switching completely from RDF to JSON-LD as our main supported format, i.e. replacing RDF.js. I don't know how feasible that is or how much work it would mean. But RDF.js is always ugly to work with, and some parts are really old, e.g. originating from Tim Berners-Lee. However, we may not want to do this within this PR.

So I think we shouldn't recommend this format. It's a little bit "Wild West". Something that isn't so mainstream and is more targeted at bibliography would be a better choice.

Interesting that you say mainstream has some disadvantages. There is AFAIK only COinS as a dedicated bibliographic format that can be embedded within a website. Every other bibliographic format has to be linked from a website with <link> or unAPI, but we don't see that often. We could try to promote them more? As a website maintainer, I could then choose some meta tags and schema.org to optimize the appearance in search engines, and this would not interfere with the actual bibliographic data.

@dstillman (Member, Author)

Every other bibliographic format has to be linked from a website with <link> or unAPI, but we don't see that often.

Well, unAPI has basically been dead for years, and we never supported <link>. Once we support <link> we can certainly promote that as the obvious choice when you already have BibTeX/RIS/MODS/etc. But most of those formats are fundamentally lossy, and it'd be nice to be able to recommend something that can more reliably represent data, particularly if we're going to support custom item types/fields down the line.

I don't see the quality of existing JSON-LD as particularly relevant to whether it's our recommendation for exposing metadata. We need to support it either way, and it's at least possible for people to expose high-quality metadata for multiple items with it. It seems like the main alternative would be RDF (via a <link>), and that's basically impossible for most people to generate.

@mrtcode (Member) commented Feb 18, 2019

To put it simply, all metadata in HEAD represents a single item (except JSON-LD and unAPI), and all metadata in BODY can represent single or multiple items - but you never know which.

A conservative approach as you describe seems fine to me. I would restrict point 5 to cases where the H1, H2, or H3 is unique within the page. Some unnecessary multiple results shouldn't be too troublesome; in the worst case, the user has to do two more clicks if they are only interested in the main entry.

Actually, I'm now matching with header tags only when:

  1. There is one or more H1
  2. There is only one H2
  3. There is only one H3

Title matching seems to work quite reliably this way, although the title is not always in header tags.
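
An illustrative reading of those rules (a sketch, not the actual Generic.js code): only trust header tags in the unambiguous cases, and fall back otherwise.

function getHeaderTitle(doc) {
	var h1 = doc.querySelectorAll('h1');
	if (h1.length >= 1) return h1[0].textContent.trim();
	var h2 = doc.querySelectorAll('h2');
	if (h2.length === 1) return h2[0].textContent.trim();
	var h3 = doc.querySelectorAll('h3');
	if (h3.length === 1) return h3[0].textContent.trim();
	return null; // title isn't in header tags; fall back to document.title etc.
}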

So I think we shouldn't recommend this format. It's a little bit "Wild West". Something that isn't so mainstream and is more targeted at bibliography would be a better choice.

Interesting that you say mainstream has some disadvantages. There is AFAIK only COinS as a dedicated bibliographic format that can be embedded within a website. Every other bibliographic format has to be linked from a website with <link> or unAPI, but we don't see that often. We could try to promote them more? As a website maintainer, I could then choose some meta tags and schema.org to optimize the appearance in search engines, and this would not interfere with the actual bibliographic data.

Mainstream means people are defining multiple item types in multiple ways, and will do that even more in the future. At first glance, more metadata seems like a good thing - unless you can't distinguish which of the items is actually relevant to you.

The problem is that for some websites we extract many different items, but we can't distinguish whether they are:

  1. Multiple entities with different types representing the same website, i.e. article, dataset, audio, etc.
  2. A search results page containing multiple items
  3. A JSON-LD entity containing other entities that are somehow related

And we just get a flat list of all JSON-LD items, unless we processed everything according to the meaning of the specific vocabulary and its sophisticated relations, which would be very difficult.

Also, Zotero itself supports many different item types. And then suddenly, if we encounter a page with many different item types, which one should we choose?

It's worth investigating COinS more, but there is the same problem of detecting whether a specific COinS record represents the current article or just a related article. We can't trust any metadata that is in BODY.

If making an additional request weren't a problem, linked metadata could be a better choice, because when there is a single item, <link> can be defined in HEAD, and when there are multiple items, multiple <link itemprop="…" /> tags can be defined in BODY (which is now allowed in HTML5).

@mrtcode (Member) commented Feb 18, 2019

I don't see the quality of existing JSON-LD as particularly relevant to whether it's our recommendation for exposing metadata. We need to support it either way, and it's at least possible for people to expose high-quality metadata for multiple items with it. It seems like the main alternative would be RDF (via a <link>), and that's basically impossible for most people to generate.

Sure, we should definitely support JSON-LD, but we should also accept that it will result in more false results than, for example, Embedded Metadata. I already set JSON-LD as the last method for extracting metadata in the combined translator because of the low-quality results.

@mrtcode (Member) commented Feb 28, 2019

The next translator on the list is the Linked Metadata translator, which corresponds to #77.

There were discussions about making it a multi-item translator by extracting <link> tags from the BODY, instead of just getting a single item from the HEAD.

But there are some issues with this approach.
Firstly, a <link> tag in the BODY can only be defined if it has an itemprop or property attribute (in some cases rel too, but that's unrelated to our case).

itemprop:
<link itemprop="url" title="marcxml" type="application/marcxml+xml" href="http://domain.com/123.marcxml" />

And according to the w3.org validator, itemprop must be inside an itemscope, which means this becomes microdata and should probably be processed by an appropriate microdata translator. Also, just defining empty microdata items with a single url property would result in many empty items for the microdata translator.

property:
<link property="url" title="marcxml" type="application/marcxml+xml" href="http://domain.com/123.marcxml" />

This can exist freely in any part of the BODY, but it's still the RDFa way to define a property. So why not use an appropriate RDFa translator?

Another problem with <link> extraction from the BODY is that we can't do proper item selection, because it's just a URL and there is no title to show in the selection dialog, unless we processed the whole item as microdata or RDFa; but then, again, it's not the Linked Metadata translator's job.

So I've started to think that maybe it would be better to restrict Linked Metadata extraction to a single item in the HEAD.

And then to add microdata and RDFa translators. @zuphilip has already integrated a microdata translator into EM, but, as I said previously, it would be better to keep them separate. Instead we should probably rework and utilize @dgerber's microdata translator implementation.

@dstillman (Member, Author)

That all sounds good to me. Supporting <link> in HEAD is much more important anyway.

@mrtcode (Member) commented Mar 4, 2019

I pushed all the code behind this huge generic-translation update. The code still needs some work and more thorough testing, but this is roughly how it looks. Multiple PRs must be merged to support the new translators.

First, the async support (zotero/zotero#1609) must be merged, then the new utilities, and then the translators.

The most important file is Generic.js. It has comments, so it would be nice to get some review.
