Test removing <meta charset="UTF-8"/> in known reading systems to see if it is really necessary #470

Open
martinpub opened this issue Jun 11, 2021 · 15 comments
Assignees
Labels
validator-revision EPUB 3 / HTML Validator revision: 2020-1

Comments

@martinpub
Collaborator

As can be seen in the quoted part of the RelaxNG schema file, the line <meta charset="UTF-8"/> is strictly required as the first child of content documents' <head>.

At MTM we are experiencing issues with EPUB editors removing that line when the document is processed. Investigating this, I'm starting to think that the strict requirement of this line is perhaps not well motivated in the 2020-1 validator edition:

  1. It is not documented in the guidelines.
  2. It can be considered redundant in an XHTML 5 setting, given that
     a. UTF-8 is the default encoding for HTML5, and
     b. for the XML serialization of HTML5 used in EPUB 3, the character encoding of the document will be recorded in the XML declaration (<?xml version="1.0" encoding="UTF-8"?>).

See also the example given in the HTML5 spec. Also, the EPUB 3.2 Content Doc specification does not mention any requirement to use the meta tag to specify the encoding of the document.
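
To illustrate the redundancy argument for the XML case, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It parses the same small content document with and without the meta line and compares the decoded text. The sample document and the helper names `build_doc` and `paragraph_text` are made up for illustration. It only shows what a conforming XML parser does and says nothing about any particular reading system.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_doc(with_meta: bool) -> bytes:
    """Return a small XHTML content document encoded as UTF-8 bytes."""
    meta = '<meta charset="UTF-8"/>' if with_meta else ""
    text = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<html xmlns="{XHTML_NS}"><head>{meta}<title>Test</title></head>'
        "<body><p>Blåbärsgröt på ön</p></body></html>"
    )
    return text.encode("utf-8")


def paragraph_text(doc: bytes) -> str:
    """Parse the bytes as XML and return the text of the first <p>."""
    root = ET.fromstring(doc)
    return root.find(f".//{{{XHTML_NS}}}p").text


# The XML parser takes the encoding from the XML declaration,
# so the meta element makes no difference to the decoded text.
with_meta = paragraph_text(build_doc(True))
without_meta = paragraph_text(build_doc(False))
```

Both variants decode to the same string when the file is treated as XML.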

If you agree that this is redundant information, my suggestion is to make this optional in the validator for the 2020-1 guidelines. Ping @AndersEkl @kalaspuffar.

<element name="meta">
  <a:documentation>&lt;meta&gt; indicates metadata about the book. An empty
  element that may appear repeatedly only in &lt;head&gt;.</a:documentation>
  <a:documentation>&lt;meta&gt; is the container for the Dublin Core attributes.</a:documentation>
  <attribute name="charset">
    <value>UTF-8</value>
  </attribute>
</element>
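
For readers who want to check their own files, the rule as described in this issue (a `<meta charset="UTF-8"/>` as the first child of `<head>`) can be approximated in a few lines of Python. This is only a sketch of the rule as stated here, not the validator's implementation. The function name `has_leading_charset_meta` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def has_leading_charset_meta(doc: bytes) -> bool:
    """Return True if the first child of <head> is <meta charset="UTF-8"/>."""
    root = ET.fromstring(doc)
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None or len(head) == 0:
        return False
    first = head[0]
    return first.tag == f"{{{XHTML_NS}}}meta" and first.get("charset") == "UTF-8"


def make_doc(head_content: str) -> bytes:
    """Wrap the given <head> content in a minimal XHTML document (illustrative only)."""
    text = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<html xmlns="{XHTML_NS}"><head>{head_content}</head>'
        "<body><p>Text</p></body></html>"
    )
    return text.encode("utf-8")


compliant = make_doc('<meta charset="UTF-8"/><title>Test</title>')
stripped = make_doc("<title>Test</title>")
misplaced = make_doc('<title>Test</title><meta charset="UTF-8"/>')
```

Only `compliant` passes. `stripped` corresponds to what an editor produces when it removes the line.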

@martinpub martinpub added the validator-revision EPUB 3 / HTML Validator revision: 2020-1 label Jun 11, 2021
@AndersEkl
Collaborator

> If you agree that this is redundant information, my suggestion is to make this optional in the validator for the 2020-1 guidelines. Ping @AndersEkl @kalaspuffar.

I don't see any problems with that.

@kalaspuffar
Collaborator

Hi @martinpub and @AndersEkl

Making this change should not be that complicated, and I don't have any strong opinions on the matter. I know from experience that not having explicit definitions of the charset could lead to many headaches for the developers and users of readers. Hopefully, the reading systems using these files are reading the files as XML files instead of HTML, but I don't think there is any guarantee that it would be the case.

So the safest thing to do is to figure out why these metadata tags are removed in the processing. Perhaps other important information is lost as well.

Best regards
Daniel

@martinpub
Collaborator Author

Thanks for your input @kalaspuffar.

I assumed even an HTML(5) parser would interpret the data without this line as UTF-8, but perhaps that's not always true? In that case, I agree with checking the processing tool. Let's leave this open for now.
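
On the question of whether that is always true: an HTML parser that finds no declared encoding does not necessarily assume UTF-8. The encoding sniffing in the HTML standard can end in a locale-dependent fallback, which is windows-1252 for many Western locales. The Python sketch below shows what that fallback would do to UTF-8 text. The sample word is made up, and `cp1252` is Python's name for windows-1252.

```python
text = "Blåbärsgröt"
utf8_bytes = text.encode("utf-8")

# Decoded with the encoding the file was actually written in.
as_utf8 = utf8_bytes.decode("utf-8")

# Decoded with a windows-1252 fallback, as a parser might do
# when it finds no encoding declaration it recognises.
as_cp1252 = utf8_bytes.decode("cp1252")
```

Each non-ASCII character is stored as two bytes in UTF-8, so the fallback turns every one of them into two wrong characters.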

@martinpub
Collaborator Author

Decided at the validation group meeting on June 26 to keep this requirement.

@martinpub
Collaborator Author

This issue was raised in the calibre editor bug reporting system, where the main developer pointed out the redundancy of such a declaration. As this piece of information is not mentioned in the guidelines, I argue that it be removed from the validation ruleset.

@martinpub martinpub reopened this Sep 6, 2021
@martinpub
Collaborator Author

@josteinaj If you have the time, I would be interested in your input on this issue :-)

@josteinaj
Member

josteinaj commented Sep 10, 2021

@martinpub sure

I agree with @kalaspuffar: #470 (comment)

According to the standards, there's no extra information in this meta tag.

I think this issue is mainly about what the purpose of the markup is. For production purposes, the meta tag doesn't give any extra information. However, for compliance with all reading systems, especially older ones, the meta tag might be necessary.

(Whoops, accidentally closed. I reopened it again…)

@josteinaj josteinaj reopened this Sep 10, 2021
@martinpub
Collaborator Author

Thanks @josteinaj! I think we can conclude that the requirement of the meta tag should be removed, as it is redundant.

@josteinaj
Member

I accidentally closed and posted my incomplete comment, sorry.

The issue is what the purpose of the markup is, whether it's meant for production, or also for distribution and compliance with all reading systems.

@martinpub
Collaborator Author

We don't have EPUB 3 in distribution yet, so I don't really know. But it seems to me that if a reading system supports EPUB 3.2, then it should support XHTML(5) content documents. And the XML declaration will be the appropriate place to declare the character encoding.

So either we can:

  1. Test this in known EPUB 3 reading systems, before we remove the requirement.
  2. Remove the requirement until we find some issue, and then consider it again.

I would argue for proceeding with 2, if we currently do not have any known issues with reading systems (or other systems that parse the contents of EPUB 3 packages).
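
For option 1, a test file without the declaration is needed for each reading system. A minimal Python sketch for producing such a variant from an existing content document is shown here. The function name `without_charset_meta` and the sample document are hypothetical, and the regular expression only covers the exact self-closing form discussed in this issue.

```python
import re

# Matches the self-closing declaration plus surrounding indentation and line break.
META_CHARSET_LINE = re.compile(
    r'[ \t]*<meta\s+charset="UTF-8"\s*/>[ \t]*\r?\n?', re.IGNORECASE
)


def without_charset_meta(xhtml: str) -> str:
    """Return the document with the first <meta charset="UTF-8"/> removed."""
    return META_CHARSET_LINE.sub("", xhtml, count=1)


SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    "  <head>\n"
    '    <meta charset="UTF-8"/>\n'
    "    <title>Test</title>\n"
    "  </head>\n"
    "  <body><p>Blåbärsgröt</p></body>\n"
    "</html>\n"
)

variant = without_charset_meta(SAMPLE)
```

The variant keeps the XML declaration and the non-ASCII test text, so any rendering difference between the two files can be attributed to the missing meta line.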

@kalaspuffar
Collaborator

Hi @martinpub

When we have these discussions, I'm always cautious about creating issues for the end-users. In this case, we want to introduce a change that has a marginal impact on the producers and a marginal impact on the size and readability of the EPUB document, but that might introduce an issue for an end-user whose reading system might refuse to read the file outright.

So I would vote for 1. Making a change just for the sake of it is not a good idea, in my opinion.

Best regards
Daniel

@josteinaj
Member

I'd vote for 1 as well.

Reading systems that claim to be compliant with EPUB 3.2 (or any reading system really) will have an HTML rendering engine built in. In many cases this is Chromium or another web engine. If the rendering engine chooses to parse the document as HTML instead of XHTML, then we should have the meta tag. To make sure that the rendering engine uses XHTML and not HTML, we need to at least use the xhtml file extension instead of html, and possibly also declare the XHTML doctype. There might be other requirements for having the rendering engine choose XHTML over HTML as well; I'm not sure.

Some reading systems might even go straight for an HTML rendering engine and assume that it won't cause problems (which in most cases it won't). For instance, I don't know what e-readers (Kindle, Kobo, etc.) or some mobile apps do.
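
To make the HTML-path concern concrete, here is a deliberately simplified Python sketch of how an engine that treats the file as HTML might pick an encoding. It checks for a byte order mark, then looks for a meta charset near the start of the file, and otherwise falls back to a default. The function `sniff_html_encoding`, the 1024-byte window, and the windows-1252 fallback are assumptions for illustration. The sketch ignores the XML declaration and other steps that real engines may apply, and it is not a claim about any specific reading system.

```python
import re

# Assumption: windows-1252 stands in for the engine's default
# when nothing is declared; real defaults vary by engine and locale.
FALLBACK_ENCODING = "windows-1252"

META_CHARSET = re.compile(
    rb'<meta[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE
)


def sniff_html_encoding(doc: bytes) -> str:
    """Guess the encoding the way a simplified HTML-only engine might."""
    if doc.startswith(b"\xef\xbb\xbf"):  # UTF-8 byte order mark
        return "utf-8"
    match = META_CHARSET.search(doc[:1024])  # only inspect the start of the file
    if match:
        return match.group(1).decode("ascii").lower()
    return FALLBACK_ENCODING


with_meta = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<html><head><meta charset="UTF-8"/><title>T</title></head></html>'
)
without_meta = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<html><head><title>T</title></head></html>"
)
```

Under these assumptions, only the document with the meta tag is recognised as UTF-8, which is the risk that testing in real reading systems would confirm or rule out.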

When we distribute an HTML version of our books, we use the html file extension instead of xhtml, as we've had problems with xhtml in the past (it was probably an Internet Explorer thing, I don't quite remember).

@martinpub
Collaborator Author

Thanks for your comments @kalaspuffar @josteinaj. I think I agree; I'm just impatient to get our workflow running smoothly :-) Let's leave this open for now and return to next steps at the validation meeting.

@martinpub martinpub changed the title from “Make <meta charset="UTF-8"/> optional in head of content documents?” to “Test removing <meta charset="UTF-8"/> in known reading systems to see if it is really necessary” Oct 14, 2021
@martinpub
Collaborator Author

Adjusted the title of this issue to match suggestion 1 in my comment #470 (comment).

@martinpub martinpub self-assigned this Oct 15, 2021
@martinpub
Collaborator Author

Decision from validator group meeting on October 15: Martin to test.

4 participants