TEI translator full rewrite #3245

Open: wants to merge 27 commits into base: master

@glorieux-f glorieux-f commented Jan 30, 2024

New features

  • Rich HTML text in notes is exported as TEI markup
  • Rich text in titles is handled in a way that conforms more closely to Zotero
  • The Extra field is parsed for additional CSL properties
  • The XML/TEI output is indented
  • The TEI output is more conformant

The best way to see the changes is to look at how a fairly complex item is handled.

{
	"version": 3038,
	"itemType": "journalArticle",
	"title": "Not closing <b> or<strong>unknown tag</strong>, kept as text nodes; <i>italic</i>, is an element",
	"date": "1929",
	"language": "fr",
	"callNumber": "piaget1929a02",
	"extra": "Start of text note.\nGenre: Review of\nReviewed Title: Rich text >> <i>italic</i>\nMedium: Paris, <Payot>, 1927\nUnknown CSL property: CLA1929_5\nNote continuing.\nEditorial Director: One || John\nEditorial Director: Two || John",
	"volume": "22",
	"pages": "117-118",
	"publicationTitle": "Archives de psychologie",
	"issue": "85",
	"creators": [
		{
			"firstName": "Jean",
			"lastName": "Piaget",
			"creatorType": "author"
		},
		{
			"name": "Eugène Minkowski",
			"creatorType": "reviewedAuthor"
		},
		{
			"firstName": "J.",
			"lastName": "Contributor",
			"creatorType": "contributor"
		},
		{
			"firstName": "M.",
			"lastName": "Editor",
			"creatorType": "editor"
		}
	],
	"tags": [
		{
			"tag": "AddATag"
		}
	],
	"collections": [
		"U8JSTJZ5"
	],
	"relations": {
		"owl:sameAs": [
			"http://zotero.org/groups/5048422/items/G99GTU5D"
		]
	},
	"dateAdded": "2024-01-29T19:58:13Z",
	"dateModified": "2024-01-30T12:55:46Z",
	"uri": "http://zotero.org/users/8989645/items/RLRXRWYM",
	"attachments": [],
	"notes": [
		{
			"key": "6SE87GSB",
			"version": 3020,
			"itemType": "note",
			"parentItem": "RLRXRWYM",
			"note": "<div data-schema-version=\"8\"><p>A zotero note with some <em>formatting</em>.</p>\n<ol>\n<li>\nHate Numbered List\n</li>\n<li>\nGo to 1.\n</li>\n</ol>\n</div>",
			"tags": [],
			"relations": {},
			"dateAdded": "2024-01-30T12:10:49Z",
			"dateModified": "2024-01-30T12:12:30Z",
			"uri": "http://zotero.org/users/8989645/items/6SE87GSB"
		}
	]
}

What Zotero does with this item in APA 7:

Piaget, J. (1929). Not closing <b> or<strong>unknown tag</strong>, kept as text nodes; italic, is an element [Review of Rich text >> italic, par Eugène Minkowski; Paris, <Payot>, 1927]. Archives de psychologie, 22(85), 117‑118.

The TEI export produced by this pull request:

<biblStruct type="journalArticle" corresp="http://zotero.org/users/8989645/items/RLRXRWYM">
	<analytic xml:lang="fr">
		<author>
		<forename>Jean</forename>
		<surname>Piaget</surname>
		</author>
		<respStmt>
		<resp>contributor</resp>
		<persName>
			<forename>J.</forename>
			<surname>Contributor</surname>
		</persName>
		</respStmt>
		<respStmt>
		<resp>editorial-director</resp>
		<persName>
			<forename>John</forename>
			<surname>One</surname>
		</persName>
		</respStmt>
		<respStmt>
		<resp>editorial-director</resp>
		<persName>
			<forename>John</forename>
			<surname>Two</surname>
		</persName>
		</respStmt>
		<title level="a">Not closing &lt;b&gt; or&lt;strong&gt;unknown tag&lt;/strong&gt;,
 kept as text nodes; <hi rend="italic">italic</hi>, is an element</title>
		<idno type="callNumber">piaget1929a02</idno>
	</analytic>
	<monogr>
		<editor>
		<forename>M.</forename>
		<surname>Editor</surname>
		</editor>
		<title level="j">Archives de psychologie</title>
		<imprint>
		<date when="1929">1929</date>
		<biblScope unit="volume">22</biblScope>
		<biblScope unit="issue">85</biblScope>
		<biblScope unit="page">117-118</biblScope>
		</imprint>
	</monogr>
	<relatedItem type="reviewed">
		<bibl>
		<author>
			<name>Eugène Minkowski</name>
		</author>
		<title>Rich text &gt;&gt; <hi rend="italic">italic</hi></title>
		<edition>Paris, &lt;Payot&gt;, 1927</edition>
		</bibl>
	</relatedItem>
	<note type="extra">Start of text note.
	
	Unknown CSL property: CLA1929_5
	Note continuing.</note>
	<note corresp="http://zotero.org/users/8989645/items/6SE87GSB"><p>A zotero note with some <emph>formatting</emph>.</p>
	<list rend="numbered">
	<item>
	Hate Numbered List
	</item>
	<item>
	Go to 1.
	</item>
	</list>
	</note>
	<note type="tags">
		<term type="tag">AddATag</term>
	</note>
</biblStruct>

The previous TEI export (the original output is not well indented):

<biblStruct type="journalArticle">
	<analytic>
		<title level="a">Not closing <hi rend="bold"> or<strong>unknown tag</strong>,
		 kept as text nodes; <hi rend="italics">italic</hi>, is an element</hi></title>
		<author><forename>Jean</forename><surname>Piaget</surname></author>
		<respStmt>
			<resp>reviewedAuthor</resp>
			<persName><name>Eugène Minkowski</name></persName>
		</respStmt>
		<respStmt>
			<resp>contributor</resp>
			<persName><forename>J.</forename><surname>Contributor</surname></persName>
		</respStmt>
	</analytic>
	<monogr>
		<title level="j">Archives de psychologie</title>
		<idno type="callNumber">piaget1929a02</idno>
		<editor><forename>M.</forename><surname>Editor</surname></editor>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">85</biblScope>
			<biblScope unit="page">117-118</biblScope>
			<date>1929</date>
		</imprint>
	</monogr>
	<note>A zotero note with some formatting.
		
		
		
		
		Hate Numbered List
		
		
		Go to 1.
		
		
	</note>
	<note type="tags"><note type="tag">AddATag</note></note>
</biblStruct>

TEI.js Outdated

/* 2024, Frédéric Glorieux.

// item produced by Zotero
Member

I wish we had test cases for export translators, but this is a lot to put at the top of the translator. We can keep it in the commit history, but we should remove it from the translator before merging, IMO.

Author

What I was attempting was a kind of documentation for end users. TEI is always an interpretation: there is more than one way to encode a given piece of information. To understand how the old translator worked, I had to test various kinds of records and trace their fields through to the TEI output, so I thought it would be a good idea to explain the code a bit more. You are right that a comment in the source code is not the right place, and test cases are a good idea, but a proper home for this documentation still has to be found.

Comment on lines +229 to +230
* Imitated from zotero source code
* https://github.com/zotero/zotero/blob/main/chrome/content/zotero/itemTree.jsx#L2472
Member

FWIW, Zotero is imitating the behavior of Citeproc.js here, so let's link straight to the source too.

Author

Link added. Beyond the spec, @dstillman's remark in #3054 (comment) made me rewrite the parser to handle tricky cases like
Not closing <b> or<strong>unknown tag</strong>, kept as text nodes; <i>italic</i>, is an element
Reading the code helped me make the right choice in each case.
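
The recovery strategy discussed in this thread can be sketched as a small stack-based tokenizer. This is an illustration only, not the code in TEI.js: the names `richTextToTei`, `escapeXml` and `TEI_INLINE`, and the restriction to `<i>` and `<b>`, are assumptions made for the example. Recognized tags that are properly closed become TEI elements; unknown tags, stray closing tags, and tags still open at the end of the string are kept as escaped text.

```javascript
// Hypothetical table: inline tags we accept, with their TEI open/close markup.
const TEI_INLINE = {
	i: ['<hi rend="italic">', '</hi>'],
	b: ['<hi rend="bold">', '</hi>']
};

// Escape the three characters that would otherwise be read as XML markup.
function escapeXml(text) {
	return text.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function richTextToTei(input) {
	const tagRE = /<(\/?)([a-z]+)>/gi;
	// Each frame collects the output produced inside one open tag; frame 0 is the root.
	const stack = [{ tag: null, token: '', parts: [] }];
	let last = 0;
	let m;
	while ((m = tagRE.exec(input)) !== null) {
		const top = stack[stack.length - 1];
		// Text between the previous tag and this one is always plain text.
		top.parts.push(escapeXml(input.slice(last, m.index)));
		last = tagRE.lastIndex;
		const [token, slash, rawName] = m;
		const name = rawName.toLowerCase();
		if (!slash && TEI_INLINE[name]) {
			// Known opening tag: start a new frame.
			stack.push({ tag: name, token, parts: [] });
		}
		else if (slash && top.tag === name) {
			// Closing tag matching the innermost open tag: emit a TEI element.
			stack.pop();
			const [open, close] = TEI_INLINE[name];
			stack[stack.length - 1].parts.push(open + top.parts.join('') + close);
		}
		else {
			// Unknown tag or mismatched closing tag: keep it as text.
			top.parts.push(escapeXml(token));
		}
	}
	stack[stack.length - 1].parts.push(escapeXml(input.slice(last)));
	// Tags never closed: put the opening token back as text, followed by its content.
	while (stack.length > 1) {
		const frame = stack.pop();
		stack[stack.length - 1].parts.push(escapeXml(frame.token) + frame.parts.join(''));
	}
	return stack[0].parts.join('');
}
```

With this sketch, `richTextToTei('Rich text >> <i>italic</i>')` returns `Rich text &gt;&gt; <hi rend="italic">italic</hi>`. Character references already present in the input (for example `&amp;`) would be escaped a second time, which a real implementation would need to handle.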

TEI.js Outdated
let discardedNode = nodeStack.pop();
nodeStack[0].append(discardedMarkup.token, ...discardedNode.childNodes);
}
// return textContent; // lint see it’s not used
Member

?

Author

Old code used for debugging. Deleted.

TEI.js Outdated
if (!html) return;
// import html as dom
let dom = xmlParser.parseFromString(html, "text/html");
let body = dom.getElementsByTagName("body").item(0);
Member

Suggested change
- let body = dom.getElementsByTagName("body").item(0);
+ let body = dom.body;

Author

OK.
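
For readers following the note conversion that this snippet feeds, the mapping visible in the export above (`<em>` to `<emph>`, `<ol>` to `<list rend="numbered">`, `<li>` to `<item>`, wrapper `<div>` dropped) can be approximated as a table-driven tag rename. This is a rough sketch under stated assumptions, not the PR's implementation: the PR works on a parsed DOM, whereas this version works on the string so that it runs outside Zotero, and the names `HTML_TO_TEI` and `noteHtmlToTei` are hypothetical.

```javascript
// Hypothetical rename table, limited to the elements seen in the sample note.
const HTML_TO_TEI = {
	p: { name: 'p' },
	em: { name: 'emph' },
	ol: { name: 'list', attrs: ' rend="numbered"' },
	li: { name: 'item' }
};

function noteHtmlToTei(html) {
	// Match any opening or closing tag, with or without attributes.
	return html.replace(/<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi, (token, slash, rawName) => {
		const rule = HTML_TO_TEI[rawName.toLowerCase()];
		// Tags without a rule (such as the wrapper <div>) are dropped; their text is kept.
		if (!rule) return '';
		return slash ? `</${rule.name}>` : `<${rule.name}${rule.attrs || ''}>`;
	});
}
```

A string-based rename like this does not check that the input is well formed, which is one reason to prefer the DOM approach used in the translator.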

* @param {*} item
* @returns
*/
function parseExtraFields(item) {
Member

Why do we need this? Won't it happen upon saving the item anyway?

Author

I’m sorry, I don’t understand your question. In the context of a translator, do you mean that I can get the item in a state where the Extra field has already been parsed to CSL? I would be glad if extraToCSL() were available through ZU, but it is not.
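
To make the discussion concrete, here is a minimal sketch of the kind of parsing at stake, assuming a simple `Key: Value` line format and `Last || First` creator values as in the sample item. It is not the PR's `parseExtraFields` and not Zotero's `extraToCSL()`; the function name `splitExtra` and the two key sets passed to it are hypothetical.

```javascript
// Split an Extra field into recognized fields, recognized creators,
// and the remaining free text that should stay a note.
function splitExtra(extra, fieldKeys, creatorKeys) {
	const fields = {};
	const creators = [];
	const rest = [];
	for (const line of extra.split('\n')) {
		const m = /^([^:]+):\s*(.*)$/.exec(line);
		// Normalize "Editorial Director" to "editorial-director".
		const key = m ? m[1].trim().toLowerCase().replace(/\s+/g, '-') : null;
		if (key && fieldKeys.has(key)) {
			fields[key] = m[2];
		}
		else if (key && creatorKeys.has(key)) {
			// Two-field names are written "Last || First".
			const [lastName, firstName = ''] = m[2].split('||').map(s => s.trim());
			creators.push({ creatorType: key, lastName, firstName });
		}
		else {
			// Lines without a known key remain part of the note.
			rest.push(line);
		}
	}
	return { fields, creators, note: rest.join('\n') };
}
```

Called with the sample Extra value and the sets `new Set(['genre', 'reviewed-title', 'medium'])` and `new Set(['editorial-director'])`, the unknown line `Unknown CSL property: CLA1929_5` stays in the note text, as it does in the export shown in the description.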

TEI.js Outdated
Comment on lines 684 to 685
// unicode classes seems not supported
// xmlid = xmlid.normalize("NFD").replace(/\p{M}+/u, '');
Member

Use XRegExp:

var nameFormat1RE = new ZU.XRegExp("^\\p{Letter}+\\s\\p{Letter}+\\s\\p{Letter}+$");

This will be supported natively in Z7+.

Author

Thanks for the tip.
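
As a side note for environments without Unicode property escapes, the diacritic stripping that the commented-out line aims for can be sketched with an explicit code point range. The function name `asciiId` and the rules for the remaining characters are assumptions made for the example, not the PR's code. Note also that the commented-out regular expression has no `g` flag, so even where `\p{M}` is supported it would only remove the first run of combining marks.

```javascript
// Build an ASCII-only identifier usable as an xml:id value.
function asciiId(text) {
	let id = text
		// Decompose accented letters into base letter + combining mark.
		.normalize('NFD')
		// Remove the combining diacritical marks block (U+0300 to U+036F).
		.replace(/[\u0300-\u036f]/g, '')
		// Replace anything not allowed in this simplified id with an underscore.
		.replace(/[^A-Za-z0-9_.-]+/g, '_');
	// An XML name cannot start with a digit, a dot or a hyphen.
	if (/^[^A-Za-z_]/.test(id)) id = '_' + id;
	return id;
}
```

The range U+0300 to U+036F covers the common Latin diacritics only; marks from other blocks would need `\p{M}` or XRegExp as suggested above.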

TEI.js Outdated
Comment on lines 728 to 738
let iso = null;
let year = Number(date.year);
if (isNaN(year)) return iso;
iso = String(date.year).padStart(4, '0');
let month = Number(date.month);
if (isNaN(month)) return iso;
// january = 0
iso += '-' + String(date.month + 1).padStart(2, '0');
if (!date.day) return iso;
iso += '-' + String(date.day).padStart(2, '0');
return iso;
Member

Why not strToISO?

Author

Sorry, I had missed ZU.strToISO(item.date). The custom code is deleted.
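
For reference outside Zotero, the partial-date formatting that the removed lines attempted can be sketched as follows. This is only an illustration of the logic, with the hypothetical name `partsToISO`; inside a translator the thread settles on `ZU.strToISO`. The sketch assumes a zero-based month, as the original comment states, and parses each part as an integer so that string values do not get concatenated with `+ 1`.

```javascript
// Format { year, month, day } as YYYY, YYYY-MM or YYYY-MM-DD,
// stopping at the first part that is missing. Month is zero-based.
function partsToISO(date) {
	const year = Number.parseInt(date.year, 10);
	if (Number.isNaN(year)) return null;
	let iso = String(year).padStart(4, '0');
	const month = Number.parseInt(date.month, 10);
	if (Number.isNaN(month)) return iso;
	// January is 0 in the input, 01 in ISO 8601.
	iso += '-' + String(month + 1).padStart(2, '0');
	const day = Number.parseInt(date.day, 10);
	if (Number.isNaN(day)) return iso;
	return iso + '-' + String(day).padStart(2, '0');
}
```

For the sample item, `partsToISO({ year: '1929' })` returns `1929`, which corresponds to the `when="1929"` attribute in the new export.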
