Test removing `<meta charset="UTF-8"/>` in known reading systems to see if it is really necessary #470

martinpub · 2021-06-11T12:39:31Z

As can be seen in the quoted part of the RelaxNG schema file, the line `<meta charset="UTF-8"/>` is strictly required as the first child of content documents' `<head>`.

At MTM we are experiencing issues with EPUB editors removing that line when the document is processed. Investigating this, I'm starting to think that the strict requirement of this line is perhaps not well motivated in the 2020-1 validator edition:

1. It is not documented in the guidelines.
2. It can be considered redundant in an XHTML 5 setting, given that
   a. UTF-8 is the default encoding for HTML5, and
   b. for the XML serialization of HTML5 used in EPUB 3, the character encoding of the document will be recorded in the XML declaration (`<?xml version="1.0" encoding="UTF-8"?>`).

See also the example given in the HTML5 spec. Also, the EPUB 3.2 Content Doc specification does not mention any requirement to use the meta tag to specify the encoding of the document.
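To illustrate the XML-declaration argument, here is a minimal Python sketch. It assumes the content document is handed to an XML parser (which is not guaranteed for every reading system) and that the document has no `<meta charset>` at all. The helper `read_title` and the sample documents are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

def read_title(doc: bytes) -> str:
    """Parse an XHTML content document as XML and return its <title> text."""
    # The XML parser takes the encoding from the XML declaration;
    # no <meta charset> is consulted.
    root = ET.fromstring(doc)
    return root.findtext(f"{XHTML}head/{XHTML}title")

# Hypothetical sample document without any <meta charset> element.
TEMPLATE = (
    '<?xml version="1.0" encoding="{enc}"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Blåbær</title></head><body/></html>"
)

utf8_doc = TEMPLATE.format(enc="UTF-8").encode("utf-8")
latin1_doc = TEMPLATE.format(enc="ISO-8859-1").encode("iso-8859-1")

print(read_title(utf8_doc))
print(read_title(latin1_doc))
```

Both calls return the same title even though the byte sequences differ, which is the sense in which the declaration alone carries the encoding for an XML parser.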

If you agree that this is redundant information, my suggestion is to make this optional in the validator for the 2020-1 guidelines. Ping @AndersEkl @kalaspuffar.

nordic-epub3-dtbook-migrator/src/main/resources/xml/schema/2020-1/nordic-html5.rng

Lines 531 to 538 in c9e59ff

    
    <element name="meta">
        <a:documentation>&lt;meta&gt; indicates metadata about the book. An empty
            element that may appear repeatedly only in &lt;head&gt;.</a:documentation>
        <a:documentation>&lt;meta&gt; is the container for the Dublin Core attributes.</a:documentation>
        <attribute name="charset">
            <value>UTF-8</value>
        </attribute>
    </element>
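
For readers who want to check documents outside the validator, the rule described in this issue (a `<meta>` with `charset` exactly `UTF-8` as the first child of `<head>`) can be approximated in a few lines of Python. This is only a sketch and not the validator's actual implementation. The function name `has_leading_meta_charset` is hypothetical, and the exact, case-sensitive comparison is an assumption based on the `<value>UTF-8</value>` pattern above.

```python
import xml.etree.ElementTree as ET

XHTML = "{http://www.w3.org/1999/xhtml}"

def has_leading_meta_charset(doc: bytes) -> bool:
    """Return True if the first element child of <head> is <meta charset="UTF-8"/>."""
    root = ET.fromstring(doc)
    head = root.find(f"{XHTML}head")
    if head is None or len(head) == 0:
        return False
    first = head[0]
    # Assumption: the comparison is exact, mirroring the schema's literal value.
    return first.tag == f"{XHTML}meta" and first.get("charset") == "UTF-8"
```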

AndersEkl · 2021-06-11T12:49:06Z

> If you agree that this is redundant information, my suggestion is to make this optional in the validator for the 2020-1 guidelines. Ping @AndersEkl @kalaspuffar.

I don't see any problems with that.

kalaspuffar · 2021-06-15T11:24:08Z

Hi @martinpub and @AndersEkl

Making this change should not be that complicated, and I don't have any strong opinions on the matter. I know from experience that not having explicit definitions of the charset could lead to many headaches for the developers and users of readers. Hopefully, the reading systems using these files are reading the files as XML files instead of HTML, but I don't think there is any guarantee that it would be the case.

So the safest thing to do is to figure out why these metadata tags are removed during processing. Perhaps other important information is lost as well.

Best regards
Daniel

martinpub · 2021-06-15T14:22:38Z

Thanks for your input @kalaspuffar.

I assumed even an HTML(5) parser would interpret the data without this line as UTF-8, but perhaps that's not always true? In that case, I agree with checking the processing tool. Let's leave this open for now.
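
A small Python sketch of what could go wrong if that assumption does not hold. It assumes a parser that sees no encoding declaration and falls back to windows-1252, which is a common legacy default but not something confirmed for any particular reading system. The sample text is hypothetical.

```python
# UTF-8 bytes as they would be stored in the content document.
raw = "Blåbær".encode("utf-8")

# Decoded with the intended encoding.
as_utf8 = raw.decode("utf-8")

# Decoded with an assumed legacy fallback (windows-1252).
as_cp1252 = raw.decode("cp1252")

print(as_utf8)    # the original text
print(as_cp1252)  # mojibake: each non-ASCII letter becomes two characters
```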

martinpub · 2021-08-25T08:39:05Z

Decided at the validation group meeting on June 26 to keep this requirement.

martinpub · 2021-09-06T14:50:56Z

This issue was raised in the calibre editor bug reporting system, where the main developer pointed out the redundancy of such a declaration. As this piece of information is not mentioned in the guidelines, I argue that it be removed from the validation ruleset.

martinpub · 2021-09-09T13:55:55Z

@josteinaj If you have the time, I would be interested in your input on this issue :-)

josteinaj · 2021-09-10T08:26:35Z

@martinpub sure

I agree with @kalaspuffar: #470 (comment)

According to the standards, there's no extra information in this meta tag.

In generic XML, the encoding can be defined in the XML declaration, as mentioned by the Calibre developer you linked to: https://www.w3.org/TR/xml/#NT-XMLDecl
In XHTML, the standard suggests doing it the same way as in generic XML. But for HTML, the meta tag is suggested: https://html.spec.whatwg.org/#charset

I think this issue is mainly about what the purpose of the markup is. For production purposes, the meta tag doesn't give any extra information. However, for compliance with all reading systems, especially older ones, the meta tag might be necessary.
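
As a rough illustration of the two declaration mechanisms, the following Python sketch reports which of them a content document carries. It is a regex-based approximation, not a conforming parser. The function name `encoding_declarations` and the result keys are hypothetical.

```python
import re

# XML declaration at the very start of the file (optionally after a UTF-8 BOM).
XML_DECL = re.compile(rb'^(?:\xef\xbb\xbf)?<\?xml[^>]*?\bencoding=["\']([\w.-]+)["\']')
# HTML-style <meta charset="..."> anywhere in the document.
META_CHARSET = re.compile(rb'<meta\s+charset=["\']([\w.-]+)["\']', re.IGNORECASE)

def encoding_declarations(doc: bytes) -> dict:
    """Report the encoding named by the XML declaration and by <meta charset>, if any."""
    xml_decl = XML_DECL.search(doc)
    meta = META_CHARSET.search(doc)
    return {
        "xml_declaration": xml_decl.group(1).decode("ascii") if xml_decl else None,
        "meta_charset": meta.group(1).decode("ascii") if meta else None,
    }
```

A document that carries both declarations is covered whether it ends up in an XML or an HTML parser, which is the compliance argument made above.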

(Whoops, accidentally closed. I reopened it again…)

martinpub · 2021-09-10T08:31:25Z

Thanks @josteinaj! I think we can conclude that the requirement of the meta tag should be removed, as it is redundant.

josteinaj · 2021-09-10T08:34:37Z

I accidentally closed and posted my incomplete comment, sorry.

The issue is what the purpose of the markup is, whether it's meant for production, or also for distribution and compliance with all reading systems.

martinpub · 2021-09-10T09:08:28Z

We don't have EPUB 3 in distribution yet, so I don't really know. But it seems to me that if a reading system supports EPUB 3.2, then it should support XHTML(5) content documents. And the XML declaration will be the appropriate place to declare the character encoding.

So either we can:

1. Test this in known EPUB 3 reading systems, before we remove the requirement.
2. Remove the requirement until we find some issue, and then consider it again.

I would argue for proceeding with 2, if we currently do not have any known issues with reading systems (or other systems where parsing of the contents of the EPUB 3 packages is at play).
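
For the testing option, test files could be produced with something like the following Python sketch. It copies an EPUB and strips the `<meta charset="UTF-8"/>` element from every `.xhtml` entry, keeping `mimetype` uncompressed and in its original position. The function name `strip_meta_charset` is hypothetical, the regex assumes the element is written as a self-closing tag with `charset` as its only attribute, and the sketch has not been tried against real production files.

```python
import io
import re
import zipfile

# Assumption: the element is self-closing with charset as its only attribute.
META_LINE = re.compile(rb'[ \t]*<meta\s+charset=["\']UTF-8["\']\s*/>[ \t]*\r?\n?')

def strip_meta_charset(epub_in: bytes) -> bytes:
    """Return a copy of the EPUB with <meta charset="UTF-8"/> removed from .xhtml files."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(epub_in)) as src, zipfile.ZipFile(out, "w") as dst:
        # infolist() preserves archive order, so mimetype stays first if it was first.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith(".xhtml"):
                data = META_LINE.sub(b"", data)
            # The mimetype entry must remain uncompressed.
            method = zipfile.ZIP_STORED if info.filename == "mimetype" else zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)
    return out.getvalue()
```

The resulting file can then be opened in each reading system under test and compared with the unmodified original.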

kalaspuffar · 2021-09-10T10:00:39Z

Hi @martinpub

When we have these discussions, I'm always cautious about creating issues for the end-users. However, in this case, we want to introduce a change that has a marginal impact on the producers and marginal impact on the size and readability of the epub document but might introduce an issue for an end-user whose reading system might not want to read the file outright.

So I would vote for 1; making a change just for the sake of it is not a good idea, in my opinion.

Best regards
Daniel

josteinaj · 2021-09-10T11:11:19Z

I'd vote for 1 as well.

Reading systems that claim to be compliant with EPUB 3.2 (or any reading system really) will have an HTML rendering engine built in. In many cases this is Chromium or another web engine. If the rendering engine chooses to parse the document as HTML instead of XHTML, then we should have the meta tag. To make sure that the rendering engine uses XHTML and not HTML, we need to at least use the xhtml file extension instead of html, and possibly also declare the XHTML doctype. There might be other requirements for having the rendering engine choose XHTML over HTML as well, I'm not sure.

Some reading systems might even just go straight for an HTML rendering engine and assume that it won't cause problems (which in most cases it won't). For instance, I don't know what e-readers (Kindle, Kobo, etc.) or some mobile apps do.

When we distribute an HTML version of our books, we use the html file extension instead of xhtml, as we've had problems with xhtml in the past (it was probably an Internet Explorer thing, I don't quite remember).

martinpub · 2021-09-10T13:22:18Z

Thanks for your comments @kalaspuffar @josteinaj, and I think I agree, I'm just impatient getting our workflow going smoothly :-) Let's leave this open for now and return to next steps at the validation meeting.

martinpub · 2021-10-14T11:57:39Z

Adjusted the headline of this issue to suggestion 1 in my comment #470 (comment).

martinpub · 2021-10-15T10:10:54Z

Decision from validator group meeting on October 15: Martin to test.

martinpub added the validator-revision EPUB 3 / HTML Validator revision: 2020-1 label Jun 11, 2021

martinpub closed this as completed Aug 25, 2021

martinpub reopened this Sep 6, 2021

josteinaj closed this as completed Sep 10, 2021

josteinaj reopened this Sep 10, 2021

martinpub changed the title ~~Make <meta charset="UTF-8"/> optional in head of content documents?~~ Test removing <meta charset="UTF-8"/> in known reading systems to see if it is really necessary Oct 14, 2021

martinpub self-assigned this Oct 15, 2021

josteinaj mentioned this issue Oct 18, 2022

Further development, 2023 ("phase two") #523

Open