IPCC Glossary: ToC creation when using multiple HTML files #1236

Open
mrchristian opened this issue Nov 10, 2023 · 21 comments

@mrchristian

Is your feature request related to a problem? Please describe.
Creating ToCs when using multiple HTML files - looking for support pages.

Describe the solution you'd like
We are working on a project to typeset a Linked Open Data copy of the IPCC Glossary; for a pointer to it, see semanticClimate/glossary-sandbox#1

Additional context
There are a few related ToC issues: how to make the main ToC file; how to relate CSS styles to the different HTML files; how to get ToC items to appear in the Vivlio navigator; and how to get the ToCs from the different HTML files into the front ToC on the page. Sorry, there is a lot here. I will list them clearly over on our site: semanticClimate/glossary-sandbox#1

@mrchristian
Author

mrchristian commented Nov 11, 2023

Apologies, I haven't given enough context on the project, and I also need to break down my ToC questions - a simple pointer to your support docs will give me all the answers, I'm sure.

The publishing project is run by a volunteer group whose goal is to make a semantic index of all IPCC Reports. A first-level project is to semantify the IPCC Glossary. We have met with the IPCC and other UN agencies, and they are receptive to this being done. Part of the project would be to create outputs from the semantic source - one of these being a Hyperbook with user enhancements: Wikipedia/data entries etc.

See IPCC Source https://apps.ipcc.ch/glossary/ and an example Vivliostyle output - https://vivliostyle.vercel.app/#src=https://raw.githubusercontent.com/semanticClimate/glossary-demo/main/html/index.html

@mrchristian
Author

My questions about generating ToCs and using multiple HTML files.

We think we want to use both Vivliostyle.js and Vivliostyle CLI. We want the CLI for PDF bookmarks, PoD preparation, and other CLI features.

  1. How do we combine multiple HTML docs into a publication, and what would be the best way to do this? Currently I use this index.html example, borrowed from your Vivlio CLI: https://github.com/semanticClimate/glossary-sandbox/blob/main/index.html
  2. How can we control which ToC items appear in the Vivliostyle viewer's drop-down Navigator, in the publication's main ToC, and on a per-HTML-file ToC page? I will need to add some more content to the publication and create a diagram of the content organisation to show this.

@mrchristian
Author

Re: Question 1. It seems from your documentation that a 'Web publication manifest' is the best route. Do you recommend the W3C or the Readium version? Readium seems more convenient due to its documentation and examples - but I'm happy to use either - https://docs.vivliostyle.org/#/vivliostyle-viewer#web-publications-multi-html-documents

@mrchristian
Author

I had a very basic go at using a Manifest example, just to get things going:

W3C Publication Manifest

https://semanticclimate.github.io/glossary-sandbox/ipccglossary.jsonld

Render

https://vivliostyle.vercel.app/#src=https://semanticclimate.github.io/glossary-sandbox/ipccglossary.jsonld

Tomorrow I'll work on building up a W3C Manifest properly.

It would be nice if you could point me to a good W3C Manifest example that's suitable for copying and building on.

@MurakamiShinyu
Member

Re: Question 1. It seems from your documentation that a 'Web publication manifest' is the best route. Do you recommend the W3C or the Readium version? Readium seems more convenient due to its documentation and examples - but I'm happy to use either - https://docs.vivliostyle.org/#/vivliostyle-viewer#web-publications-multi-html-documents

Yes, you can use Publication Manifest to organize multiple HTML documents into one publication. (we use W3C standards unless there is a particular reason not to)

Vivliostyle.js recognizes ToC that is specified in the publication manifest. See the following sections in Publication Manifest:

A simple example of publication manifest that includes a ToC resource is below:

{
  "@context": [
    "https://schema.org",
    "https://www.w3.org/ns/pub-context"
  ],
  "conformsTo": "https://www.w3.org/TR/pub-manifest/",
  "type": "Book",
  "name": "IPCC Glossary",
  "author": "IPCC",
  "inLanguage": "en",
  "readingOrder": [
    {
      "url": "index.html",
      "rel": "contents"
    },
    "glossary.html",
    "acronyms.html"
  ]
}

In this example, "index.html" is the ToC file.

The table of contents in the ToC file is displayed in the ToC panel of Vivliostyle Viewer.
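For anyone generating such a manifest from a script rather than by hand, here is a minimal Python sketch that builds the same structure as the example above. The function name `build_manifest` and its parameters are hypothetical; only the JSON keys and values come from the example.

```python
import json


def build_manifest(title, author, toc_file, documents, language="en"):
    """Return a W3C Publication Manifest dict with an explicit ToC resource.

    Hypothetical helper: it mirrors the hand-written example in this thread
    and puts the ToC file first in the reading order.
    """
    reading_order = [{"url": toc_file, "rel": "contents"}]
    reading_order.extend(documents)
    return {
        "@context": ["https://schema.org", "https://www.w3.org/ns/pub-context"],
        "conformsTo": "https://www.w3.org/TR/pub-manifest/",
        "type": "Book",
        "name": title,
        "author": author,
        "inLanguage": language,
        "readingOrder": reading_order,
    }


manifest = build_manifest(
    "IPCC Glossary", "IPCC", "index.html", ["glossary.html", "acronyms.html"]
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Writing `manifest_json` to a file next to the HTML documents should give a manifest equivalent to the one shown above.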

Note that when a ToC resource (the item with "rel": "contents") is not found, Vivliostyle.js uses the first item of "readingOrder" as the ToC resource if ToC-like elements (e.g., <nav>) are found in that document. So if the "glossary.html" file contains a table of contents with a <nav> element,

  "readingOrder": [
    "glossary.html",
    "acronyms.html"
  ]

is treated as if "rel": "contents" is specified in the "glossary.html" item, and the table of contents of glossary is displayed in the Vivliostyle Viewer's ToC panel. However, it would be better to specify "rel": "contents" explicitly when you use Publication Manifest.
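The fallback described above can be sketched in Python as follows. This only illustrates the rule as stated in this comment, not Vivliostyle.js's actual code; `pick_toc_resource` and the `looks_like_toc` callback are hypothetical names, and searching "resources" as well as "readingOrder" for an explicit entry is an assumption.

```python
def pick_toc_resource(manifest, looks_like_toc):
    """Return the URL that would serve as the ToC resource, or None.

    looks_like_toc is a hypothetical callback (url -> bool) that reports
    whether the document at that URL contains ToC-like elements such as <nav>.
    """
    # 1. An explicit "rel": "contents" entry wins.
    # Assumption: both readingOrder and resources are searched.
    entries = manifest.get("readingOrder", []) + manifest.get("resources", [])
    for entry in entries:
        if isinstance(entry, dict):
            rel = entry.get("rel", [])
            rels = [rel] if isinstance(rel, str) else rel
            if "contents" in rels:
                return entry["url"]

    # 2. Otherwise fall back to the first readingOrder item, but only if it
    #    contains ToC-like elements.
    reading_order = manifest.get("readingOrder", [])
    if reading_order:
        first = reading_order[0]
        url = first["url"] if isinstance(first, dict) else first
        if looks_like_toc(url):
            return url
    return None
```

With the two-item "readingOrder" above, this returns "glossary.html" only when the callback reports that the file contains a ToC.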

You can also just use the ToC file without publication manifest (this idea is from http://glazman.org/e0/webbook.html). See the Vivliostyle Viewer document: https://docs.vivliostyle.org/#/vivliostyle-viewer#table-of-contents-in-html

When Web publication manifest does not exist, and there are links to other HTML documents in the table of contents in the specified HTML document, those documents are loaded automatically. Vivliostyle treats HTML elements that match the following CSS selector as a table of contents element: [role=doc-toc], [role=directory], nav li, .toc, #toc
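To check generated HTML against that selector before loading it in the viewer, a rough Python sketch using only the standard library could look like this. It approximates the selector list quoted above and is not Vivliostyle's own matching code; `TocDetector` and `has_toc_element` are hypothetical names.

```python
from html.parser import HTMLParser


class TocDetector(HTMLParser):
    """Approximate: [role=doc-toc], [role=directory], nav li, .toc, #toc"""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0   # number of currently open <nav> elements
        self.found = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if (
            attr.get("role") in ("doc-toc", "directory")
            or (tag == "li" and self.nav_depth > 0)   # nav li
            or "toc" in classes                        # .toc
            or attr.get("id") == "toc"                 # #toc
        ):
            self.found = True
        if tag == "nav":
            self.nav_depth += 1

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1


def has_toc_element(html_text):
    """Return True if the HTML text contains an element matching the selector."""
    detector = TocDetector()
    detector.feed(html_text)
    detector.close()
    return detector.found
```

For example, `has_toc_element('<nav><ul><li><a href="a.html">A</a></li></ul></nav>')` is true because of the `nav li` rule, while a plain `<ul>` outside any `<nav>` does not match.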

There are a few advantages of using publication manifest:

  • The ToC resource need not be the first item in the "readingOrder".
  • The order of items in the ToC need not be same as the order in the "readingOrder".
  • It's a W3C standard.

About ToC generation

There is a simple ToC auto-generation option in Vivliostyle CLI. See the Vivliostyle CLI document:
https://docs.vivliostyle.org/#/vivliostyle-cli#creating-a-table-of-contents

However, this feature is very limited: it generates only one ToC link item per HTML document. There has been a feature request to extend it to include every (or selected) heading in the HTML documents.
vivliostyle/vivliostyle-cli#254
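Until then, a ToC file can be generated outside the CLI. The Python sketch below writes a ToC page with a `<nav role="doc-toc">` list from (href, title) pairs; the function name, the page layout, and the sample titles are illustrative assumptions, not output of Vivliostyle CLI.

```python
from html import escape


def build_toc_html(entries, heading="Contents", language="en"):
    """Return a ToC page with one link per (href, title) pair in entries."""
    items = "\n".join(
        f'        <li><a href="{escape(href, quote=True)}">{escape(title)}</a></li>'
        for href, title in entries
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(language, quote=True)}">\n'
        "  <head>\n"
        '    <meta charset="utf-8">\n'
        f"    <title>{escape(heading)}</title>\n"
        "  </head>\n"
        "  <body>\n"
        f"    <h1>{escape(heading)}</h1>\n"
        '    <nav role="doc-toc">\n'
        "      <ul>\n"
        f"{items}\n"
        "      </ul>\n"
        "    </nav>\n"
        "  </body>\n"
        "</html>\n"
    )


# Hypothetical entries; hrefs may also point at headings, e.g. "glossary.html#a".
toc_page = build_toc_html([
    ("glossary.html", "Glossary"),
    ("acronyms.html", "Acronyms"),
])
print(toc_page)
```

Because the entries are arbitrary pairs, the same function can list several headings per document, which is the part the built-in option does not cover.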

@mrchristian
Author

Thank you so much for your assistance here - wonderful. Apologies for my slow reply, but I got ill last week, and now only back to 'full power' as well as catching up on my 'day job' work :-)

Your answers about the ToC functions and using Vivlio CLI are exactly what I needed right now - semanticClimate volunteer colleagues want to prepare a working publication for delegates to use at next week's COP meeting https://unfccc.int/. UNFCCC produces the legal agreements behind COP - they have 200 such docs, available only as PDF. We convert them to Scholarly HTML, then structure them semantically. While colleagues continue to structure the HTML, I can create a publication containing all the content using a manifest and Vivlio CLI, by the looks of it. I'll keep you posted.

And again, thank you - we'll give Vivlio a big credit :-)

@mrchristian
Author

BTW I got the Manifest working on the IPCC Glossary in a very basic way, and will improve it: https://vivliostyle.vercel.app/#src=https://raw.githubusercontent.com/semanticClimate/glossary-demo/main/ipccglossary.jsonld

And now I'll start on the COP docs https://github.com/semanticClimate/unfccc

@mrchristian
Author

I wanted to ask about using CSS styles when I have lots of HTML files to bring together in a publication. At present it's 26, but it may rise to 200.

Currently I've used the CSS override in Vivlio, which works (excuse the style - the HTML and CSS are all mixed up at the moment).

https://vivliostyle.vercel.app/#src=https://raw.githubusercontent.com/semanticClimate/unfccc/main/publication.json&style=https://raw.githubusercontent.com/semanticClimate/unfccc/main/css/appaloosa.css&bookMode=true
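A small Python sketch for assembling viewer URLs of this shape, useful when many publications need links. The `src`, `style` and `bookMode` parameters are the ones visible in the URL above; the helper name `viewer_url` is hypothetical, and the values are joined without URL-encoding, as in the URLs used in this thread.

```python
def viewer_url(src, style=None, book_mode=False,
               viewer="https://vivliostyle.vercel.app/"):
    """Build a Vivliostyle Viewer URL from a source URL and an optional stylesheet."""
    params = [f"src={src}"]
    if style:
        params.append(f"style={style}")      # CSS override applied by the viewer
    if book_mode:
        params.append("bookMode=true")
    return viewer + "#" + "&".join(params)


base = "https://raw.githubusercontent.com/semanticClimate/unfccc/main/"
url = viewer_url(
    base + "publication.json",
    style=base + "css/appaloosa.css",
    book_mode=True,
)
print(url)
```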

  1. Is it possible to get the manifest to apply the style, or do I need to have the style in the first reading-order document?
  2. How are the CSS resources used in the Manifest?

Thanks

@MurakamiShinyu
Member

  1. Is it possible to get the manifest to apply the style, or do I need to have the style in the first reading-order document?

No, the CSS stylesheets need to be specified in each HTML document.

  2. How are the CSS resources used in the Manifest?

Vivliostyle.js uses CSS stylesheets specified in HTML documents, and does not use the CSS resources in the publication manifest. The CSS resources in the publication manifest are meaningless for Vivliostyle.js.
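Since each HTML document has to reference the stylesheet itself, a generation pipeline can add the link automatically. Below is a minimal Python sketch that assumes every document has a `</head>` tag; `add_stylesheet_link` is a hypothetical helper, not part of Vivliostyle.

```python
import re
from html import escape


def add_stylesheet_link(html_text, href):
    """Insert <link rel="stylesheet"> before </head>, unless it is already present."""
    link = f'<link rel="stylesheet" href="{escape(href, quote=True)}">'
    if link in html_text:
        return html_text  # already linked; leave the document unchanged
    match = re.search(r"</head\s*>", html_text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("document has no </head> to insert the stylesheet before")
    return html_text[:match.start()] + link + "\n" + html_text[match.start():]


page = "<html><head><title>Decision 1</title></head><body></body></html>"
styled = add_stylesheet_link(page, "css/appaloosa.css")
print(styled)
```

Running this over every file listed in the manifest's "readingOrder" would give each document the same stylesheet without relying on the viewer's style override.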

@mrchristian
Author

Thank you @MurakamiShinyu really appreciated. Things are moving along now well with the manifest use. I've been wanting to move onto using the manifest approach for a really long time, so happy to be able to use it at last - there's no going back now :-)

For the moment I'll pass the CSS to the Vivlio viewer, as we are automatically generating the HTML files from a PDF extraction pipeline - I could have the CSS linked automatically there, but I'll do that later once we're out of this development round.

https://vivliostyle.vercel.app/#src=https://raw.githubusercontent.com/semanticClimate/unfccc/main/publication.json&style=https://raw.githubusercontent.com/semanticClimate/unfccc/main/css/appaloosa.css&bookMode=true&f=epubcfi(/6!/4/60)

Eventually there will be about 200 HTML files linked into the publication: the higher-level ones in the ToC via the manifest, and the others rendered on the page in a main ToC and then in section sub-ToCs - we'll of course generate these ToC and nav files automatically from here:

https://github.com/petermr/pyamihtml/tree/main/test/resources/unfccc/unfcccdocuments1

@mrchristian
Author

Hi @MurakamiShinyu - we've been progressing well with the project.

I had a question about ToCs generated from the Publication Manifest and using Vivliostyle. My main ToC is rendering at the end of the publication, where I don't want it to be.

I wondered if you could help solve the problem?

Here is the sample publication.json rendered in Vivliostyle Canary.

https://vivliostyle.vercel.app/#src=https://raw.githubusercontent.com/semanticClimate/cma3-test/main/CMA_3/publication.json&f=epubcfi(/20!)

This is the directory in the repository where the publication is created

https://github.com/semanticClimate/cma3-test/tree/main/CMA_3

I have looked at Vivlio's multi-file examples, the W3C docs, and the Vivlio docs - but I can't see a solution.

Thanks

Simon

@MurakamiShinyu
Member

Your publication.json has "toc_ses_dec_res.html" in the "readingOrder" and "toc_toplevel_sum_ses_dec_res.html" in the "resources":

    "readingOrder": [
      "front_cover.html",
      "imprint.html", 
      "toc_ses_dec_res.html", 
      "LEAD/split.html",
      "Decision_1_CMA_3/split.html",
      "Decision_2_CMA_3/split.html",
      "Decision_3_CMA_3/split.html",
      "Decision_4_CMA_3/split.html",
      "back_cover.html"
    ],
    "resources": [
      {
        "type": "LinkedResource",
        "url": "toc_toplevel_sum_ses_dec_res.html",
        "rel": "contents"
      },

Unfortunately, Vivliostyle has a limitation that it cannot hide HTML documents listed in the "resources" in the output.

If you use "toc_ses_dec_res.html" in the "readingOrder" for "contents", you can avoid this problem:

    "readingOrder": [
      "front_cover.html",
      "imprint.html", 
      {
        "url": "toc_ses_dec_res.html",
        "rel": "contents"
      },
      "LEAD/split.html",
      "Decision_1_CMA_3/split.html",
      "Decision_2_CMA_3/split.html",
      "Decision_3_CMA_3/split.html",
      "Decision_4_CMA_3/split.html",
      "back_cover.html"
    ],
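For manifests produced by a script, the same change can be applied programmatically. The Python sketch below marks an existing "readingOrder" entry as the contents; it also drops any "rel": "contents" entry from "resources", which is an assumption made here so that the extra ToC document is no longer rendered. The helper names are hypothetical.

```python
import copy


def entry_url(entry):
    """readingOrder items may be plain URL strings or link objects."""
    return entry["url"] if isinstance(entry, dict) else entry


def use_reading_order_toc(manifest, toc_url):
    """Return a copy of manifest whose ToC resource is the readingOrder item toc_url."""
    result = copy.deepcopy(manifest)
    reading_order = result.get("readingOrder", [])
    if toc_url not in [entry_url(e) for e in reading_order]:
        raise ValueError(f"{toc_url} is not in readingOrder")

    def mark(entry):
        if entry_url(entry) != toc_url:
            return entry
        if isinstance(entry, dict):
            return {**entry, "rel": "contents"}
        return {"url": entry, "rel": "contents"}

    result["readingOrder"] = [mark(e) for e in reading_order]
    if "resources" in result:
        # Assumption: remove the separate contents resource so it is not output.
        result["resources"] = [
            r for r in result["resources"]
            if not (isinstance(r, dict) and r.get("rel") == "contents")
        ]
    return result
```

Called with "toc_ses_dec_res.html" on the manifest quoted above, it yields the "readingOrder" shown in the fix and leaves all other entries untouched.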

@mrchristian
Author

Ah great thank you. Much appreciated - I'll have a go at this now :-)

I'm just writing instructions for my colleague @petermr to auto-generate manifests and ToCs from the Text and Data Mining software Py4ami, so I'm trying to get things done properly on what will be a first trial.

petermr/pyamihtml#9

@mrchristian
Author

We've progressed well and will soon, probably next week, be cleaning things up and adding the CSS and modifications to the HTML we generate, to at least make a proof-of-concept presentation to the UN Climate people.

I wanted to ask a quick question about an issue we have with the ToC reading in Vivliostyle. Apologies in advance, as I think this is us messing up our HTML, but before continuing to troubleshoot I wondered if you could take a quick look: your more knowledgeable eyes will do better than ours, and it might be very obvious what we're getting wrong.

Essentially we're getting the whole ToC doc showing up in the Vivlio menu.

See: https://vivliostyle.vercel.app/#src=https://raw.githubusercontent.com/semanticClimate/cma3-test/main/current/publication.json&style=https://raw.githubusercontent.com/semanticClimate/cma3-test/main/current/css/theme.css

Thank you

@MurakamiShinyu
Member

The current TOC handling in Vivliostyle.js is not good for your HTML structure, unfortunately. Your HTML structure is like this:

<body>
  <div id="sessionpre">
    <img src="../images/UNlogo.jpg" alt="UN logo" id="unlogo">
    <div class="sessionCode">/PA/CMA/2021/10/Add.1</div><div class="contents">
      <div><span>Contents</span></div>
      <div><span>Decisions adopted by the Conference of …</span></div>
      <!-- TOC -->
      <div class="toc">
        <div>
          <span>Decision</span><span>Page</span></a>
        </div>

        <nav role="doc-toc">
          <ul>
            <li>
              <a href="../Decision_1_CMA_3/split.html"><span
                  class="descres-code">1/CMA.3</span><span
                  class="descres-title">Glasgow Climate Pact</span></a>
            </li></ul>
        </nav>
      </div>
    </div>
  </div>
</body>

Vivliostyle.js generates the TOC box (displayed in the TOC panel in the Viewer) from the HTML document, skipping any child element of BODY that does not contain a TOC element. See the code:

case "body-child":
  if (
    !srcElem.querySelector(
      "[role=doc-toc], [role=directory], nav li a, .toc, #toc",
    )
  ) {
    // hide elements not containing TOC.
    computedStyle["display"] = Css.ident.none;
  }
  break;

In your HTML, the BODY has only one child element <div id="sessionpre"> and that has a TOC element, so no elements are skipped. As a result, the whole BODY content is copied to the TOC box.
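To see which BODY children this rule would hide for a given file, the check can be approximated offline. The Python sketch below mimics the `querySelector` test in the snippet above using the standard library's `html.parser`; it is a simplified model with hypothetical names (`BodyChildScanner`, `hidden_body_children`), not Vivliostyle code, and it labels each child by tag name and `id` only.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BodyChildScanner(HTMLParser):
    """Record, for each direct child of <body>, whether a descendant matches
    "[role=doc-toc], [role=directory], nav li a, .toc, #toc"."""

    def __init__(self):
        super().__init__()
        self.stack = []          # names of currently open elements
        self.children = []       # [label, contains_toc] per body child
        self.child_depth = None  # stack depth of the open body child, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self.child_depth is not None:
            # Descendant of a body child: test it against the selector list.
            if self._matches(tag, attr):
                self.children[-1][1] = True
        elif self.stack and self.stack[-1] == "body":
            label = tag + ("#" + attr["id"] if attr.get("id") else "")
            self.children.append([label, False])
            if tag not in VOID_TAGS:
                self.child_depth = len(self.stack) + 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag; ignore it
        while self.stack.pop() != tag:
            pass
        if self.child_depth is not None and len(self.stack) < self.child_depth:
            self.child_depth = None  # the body child has been closed

    def _matches(self, tag, attr):
        classes = (attr.get("class") or "").split()
        in_nav_li = ("nav" in self.stack
                     and "li" in self.stack[self.stack.index("nav") + 1:])
        return (attr.get("role") in ("doc-toc", "directory")
                or (tag == "a" and in_nav_li)
                or "toc" in classes
                or attr.get("id") == "toc")


def hidden_body_children(html_text):
    """Return labels of body children that the quoted rule would hide."""
    scanner = BodyChildScanner()
    scanner.feed(html_text)
    scanner.close()
    return [label for label, contains_toc in scanner.children if not contains_toc]
```

For the structure above, the only BODY child is `div#sessionpre` and it contains the TOC, so the function returns an empty list: nothing is hidden and the whole document ends up in the TOC box.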

Also note that stylesheets are ignored in the TOC box.

If you change the HTML structure as below, the TOC box will be generated better (though still not very well, because of the lack of styling):

<body>
  <div id="sessionpre">
    <img src="../images/UNlogo.jpg" alt="UN logo" id="unlogo">
    <div class="sessionCode">/PA/CMA/2021/10/Add.1</div></div>
  <div class="contents">
    <div><span>Contents</span></div>
    <div><span>Decisions adopted by the Conference of …</span></div>
    <!-- TOC -->
    <div class="toc">
      <div>
        <span>Decision</span><span>Page</span></a>
      </div>

      <nav role="doc-toc">
        <ul>
          <li>
            <a href="../Decision_1_CMA_3/split.html"><span
                class="descres-code">1/CMA.3</span><span
                class="descres-title">Glasgow Climate Pact</span></a>
          </li></ul>
      </nav>
    </div>
  </div>
</body>

@MurakamiShinyu
Member

I am going to fix Vivliostyle.js on these problems:

@mrchristian
Author

Amazing @MurakamiShinyu - appreciate you looking at this :-) Our HTML is the output of a Text and Data Mining process which converts PDF to HTML, running a series of regex normalisation processes when dealing with a specific corpus - in this case the UN FCCC treaty agreements - Kyoto Protocol, Paris Agreement, then all the subsequent COP meetings which are based on these treaties. So our exercise here is to come up with a recommendation for fixes to the PDF-to-HTML conversion that will allow the HTML to work in Vivlio and create Publication Manifests - automagically. We are nearly complete on this prototype, and then we want to present to UN FCCC and get them to organise their documents using the process going forwards. So a big thank you. For demo purposes I'll clean up the HTML in the way you suggest for now.

@MurakamiShinyu
Member

Fixed in #1259 and now it works with your case:

See: https://vivliostyle.vercel.app/#src=https://raw.githubusercontent.com/semanticClimate/cma3-test/main/current/publication.json&style=https://raw.githubusercontent.com/semanticClimate/cma3-test/main/current/css/theme.css

@mrchristian
Author

Amazing. Thank you so much :-) We were working on a workaround on Friday to create further child DIVs, but your fix makes it all work. I'll read up on the details etc. We can now proceed to demo the doc to the UN people, and then, when we get the time, integrate it into the TDM pipeline. We have a couple of weeks of hackathon coming up in India, so this will come in really useful with IPCC content too.
