v0 pdf rendering #364

skmnktl · 2022-09-22T01:45:30Z

Tasks to be done:

cleaning up the template itself
build out the RESTful endpoints for getting the pdf
much more testing. I'm not sure this will work for all the texts we've got. I'm making a lot of assumptions here.
font selection
deployment details-- (including latex packages, etc)
multiple templates; perhaps an alternate template with maybe 3 verses at most per page, so i can annotate while i translate

skmnktl · 2022-09-22T01:48:11Z

@shreevatsa @akprasad Let me know if you have any suggestions for improvement. As @akprasad knows, I'm rather new to the building things business; most of my past work has been analytics, so I'm happy to get feedback!

shreevatsa · 2022-09-22T16:02:24Z

Thanks, this is a start!

One way to put off thinking about deployment etc is to simply generate a .tex file that can be downloaded, and leave turning it into a PDF (by running xelatex or whatever) to the user: this won't be as useful to most users as having an actual PDF, but it's a start, and keeps the initial problem minimal.

In fact, we could put it off for quite a while: could simply generate the .tex and .pdf files offline (doesn't have to run as part of Ambuda), and have them stored in some repo / S3 / whatever.

skmnktl · 2022-09-22T18:02:51Z

Ah yes, that makes sense w.r.t. deployment. I think we can think about the S3 bucket + Overleaf API to do the compilation. But that's for later. I'll start testing this weekend. See if I can get a good number of the text we have up working.

akprasad

My main comment is that there's an easier way to manage the template logic by using Jinja, which is what we use for our HTML. Jinja gives us:

variable substitutions
loops and conditionals
macros

So it's well-suited to this use case. Here's a minimal example of how to use it:

from jinja2 import Environment
env = Environment()
template = env.from_string("This is {{foo}}")
print(template.render(foo="my foo"))

If Jinja's use of {} characters is confusing given the TeX context, you can pass your own overrides:

env = Environment(
    block_start_string='<%',        
    block_end_string='%>',
    variable_start_string='<<',
    variable_end_string='>>',
    comment_start_string='<#',
    comment_end_string='#>')

Docs: https://jinja.palletsprojects.com/en/3.1.x/templates/

render_pdf/tex_files/tufte-handout.cls

akprasad · 2022-09-23T00:34:25Z

render_pdf/tex_files/runscript

@@ -0,0 +1,4 @@
+xelatex template.tex


There might be an option to suppress these output files. I think this is better in case we end up deleting other files called .out, .log, etc.

xelatex itself doesn't have an option for cleaning up, but look into invoking it via say latexmk (it has -c for clean IIRC) or arara: its documentation mentions a directive like

% arara: clean: { extensions: [ aux, log ] }

— other options I haven't looked at are latexrun or (further afield) tectonic.

Yup, and I need xelatex for babel/font support. I'll fix this later; I'd rather not waste time now for something that has a clear/simple solution. And it's also not the end of the world if I leave the log files too. I think

we can process a pdf, store it in a bucket to cache, and

overwrite it for the next run of render_pdf.
They're fairly small files, so it's not too large a concern.

render_pdf/src/parser.py

skmnktl · 2022-09-23T00:45:12Z

Ohhh I didn't realize jinja could be used for this. Thank you!

skmnktl · 2022-09-30T01:53:24Z

@shreevatsa I've converted the code to a jinja template.

How do I suppress those god awful halting latex compile messages? I've tried xelatex and lualatex. Couldn't figure it out...
I support two of the texts on ambuda now. kumarasambhava and the śivopanishad. There's so much variation in texts. Right now I'm just writing a bunch of try escapes to get the fields to parse. Not sure if that's sustainable...

skmnktl · 2022-09-30T01:53:58Z

Should I focus on:

adding texts, or
getting the pdf to look nicer?

akprasad

Not sure if this is ready for review or not, so I kept it fairly high-level

render_pdf/src/make_latex.py

akprasad · 2022-09-30T01:53:57Z

render_pdf/src/parser.py

@@ -0,0 +1,129 @@
+from lxml import etree


One of our transitive dependencies uses lxml, but let's use xml.etree.ElementTree instead, which has 99% of the functionality and is also in the standard lib. grep in ambuda for examples

I'd rather not rewrite; any specific reason why you want to move to lxml? If possible, I'll leave this rewrite for another PR. @akprasad

akprasad · 2022-09-30T01:54:19Z

render_pdf/src/parser.py

+    # God this is ugly # TODO
+    try:
+        return root.find(tagname("head")).text
+    except:


no bare except -- generally good practice to catch a specific exception like except AttributeError

akprasad · 2022-09-30T01:54:23Z

render_pdf/src/parser.py

+                    .find(tagname("fileDesc")) \
+                    .find(tagname("titleStmt"))
+            return tag.find(tagname("title")).text
+        except:


no bare except here

shreevatsa · 2022-09-30T03:38:18Z

@skmnktl About supporting more texts v/s getting the pdf to look nicer, it's up to you (whatever you're more motivated to do), though if you're neutral between the two options, then I'd say that getting the PDF to look nicer can be done indefinitely, after a first version gets in. Also consider a third option, that of cleaning up the PR for production.

I could think of a sequence of PRs like, say:

A Python script that takes in an XML file and generates a .tex file (OK if it doesn't look great or works only on some XML files).
Extending that script to more files.
Offline tasks for generating PDF files by running the scripts, and storing those PDF files somewhere.
A route in the app to point to / download the corresponding (previously generated) PDF file, given a text.

But could also think of doing it in other orders.

BTW what did you mean by "halting latex compile messages"?

skmnktl · 2022-09-30T11:45:37Z

@shreevatsa For the tasks outlined:

A Python script that takes in an XML file and generates a .tex file (OK if it doesn't look great or works only on some XML files).

This is what make_latex.py does. I'm writing the resultant .tex file as output.tex.

Offline tasks for generating PDF files by running the scripts, and storing those PDF files somewhere.

Do we want to wait on this? Should I use AWS? Or perhaps a Google Cloud Storage Bucket? Or perhaps a mongo instance? If the main app will move to a more permanent storage, I can just piggyback off that. If you have preferences on what that storage will be in the future, we can do that as well.

A route in the app to point to / download the corresponding (previously generated) PDF file, given a text.

I'm hesitant to write this since the storage decisions have yet to be made. We could go the render on the fly every time route, and for that I can make a FastAPI endpoint. There's not that many texts right now anyway.

shreevatsa · 2022-09-30T13:37:05Z

@skmnktl Hi, to be clear, by "sequence of PRs" I meant doing each of those things in a separate PR (so to answer your latter two questions, yes definitely wait on storing in the cloud, and don't write the app now) — I was actually talking about removing stuff from this PR (or moving to another one), to keep it minimal and ready to merge. :-) So this PR would have only 3 files:

make_latex.py
parser.py
template.tex

if I understand correctly. (Because output.tex is generated by make_latex.py, and output.pdf is generated by running xelatex/lualatex on output.tex, and tufte-handout.cls / tufte-common.def are already part of TeX Live so don't need to be committed.)

In the meantime I'll download and compile output.tex on my computer, so that I can see what error messages you were talking about…

shreevatsa · 2022-09-30T14:43:46Z

@skmnktl I looked into output.tex locally. I see what you mean about the errors — there are indeed a lot of them, and even if hitting Enter (or putting TeX in batchmode) will proceed with those errors ignored, it's not a good state to get into :)

Let's start with a template where we understand why each line is needed, and which compiles without any errors or warnings, and add things as needed (e.g. the current template has a line or two of math-related stuff that I'm sure we won't need for a long time if ever).

The file compiles fine without any errors or warnings if I replace lines 1–53 of output.tex (or template.tex originally) with the following:

\documentclass{article}

\title{Kumārasaṃbhava}

\author{apauruṣeya}

\usepackage[parfill]{parskip}
\usepackage{fontspec}

\newenvironment{sanskrit}{}{}

or, if you want to keep tufte-handout (from looking around online, it seems to be not a very well-maintained package and has quite a few bugs unfortunately…), this works:

\documentclass[nobib]{tufte-handout}

\title{Kumārasaṃbhava}

\author{apauruṣeya}

\usepackage[parfill]{parskip}
\usepackage{fontspec}

% Fix for bug: see https://tex.stackexchange.com/q/200722
% and https://github.com/Tufte-LaTeX/tufte-latex/issues/64
% Set up the spacing using fontspec features
\renewcommand\allcapsspacing[1]{{\addfontfeature{LetterSpace=15}#1}}
\renewcommand\smallcapsspacing[1]{{\addfontfeature{LetterSpace=10}#1}}

\newenvironment{sanskrit}{}{}

Feel free to add your font choices back in later.

There were also some errors from Babel's Sanskrit setup; we don't seem to need it for this output.tex, and we can also consider using polyglossia instead of babel. (Sorry I haven't looked into the Devanagari version yet; I didn't run make_latex.py and just tried working with the output.tex that was already in the PR.)

shreevatsa

Left some comments and I'm yet to take a closer look atparser.py, but this is actually looking quite reasonable and close to merging after some small changes — thanks for working on this! We can continue improving after having a first version.

render_pdf/src/make_latex.py

render_pdf/src/parser.py

…traction

skmnktl · 2022-10-02T02:48:25Z

@shreevatsa I took your comments above and did a few things with my last commit:

I removed the tufte package entirely.
I removed a bunch of vestigial functions.
I changed the template block and variable markers for jinja.
I have a script in src now that toggles pdf creation, and then moves the pdf to a render_pdf/pdf_outputs titled <text_title>_<author>.pdf.
And all the intermediate files are deleted (including the tex file).
For now, I've suppressed the "sanskrit" environment that I had earlier for babel. But once we start supporting documents transliterated into other scripts, we can set those up I think.
The runscript now uses latexmk rather than xelatex, though it does still call the xelatex engine in the background.

shreevatsa · 2022-10-02T03:12:12Z

@skmnktl Great! I still see the tex_files with output.tex etc in the PR; could you take another look? Maybe they didn't get removed properly…

…traction

akprasad

First pass. Also, feel free to move this to ambuda/render_pdf so that it's easier to reuse and share utils with the rest of the project.

akprasad · 2022-10-02T16:14:51Z

render_pdf/src/make_latex.py

@@ -0,0 +1,80 @@
+from typing import Union
+import parser
+import os


Run isort render_pdf/**/*.py to standardize and reorder imports

render_pdf/src/make_latex.py

render_pdf/src/parser.py

akprasad · 2022-10-02T16:23:58Z

render_pdf/src/parser.py

+    except:
+        try:
+            root = alt_root
+            tag = root.find(tagname("teiHeader"))\


if using ElementTree:

try: return xml.find('./teiHeader/fileDesc/titleStmt').text except AttributeError: return None

render_pdf/src/make_latex.py

render_pdf/src/parser.py

shreevatsa · 2022-10-11T05:17:21Z

By the way, if it helps, feel free to consider splitting this PR further, and checking in just one of the two:

just the parsing code to get "data" out of the TEI XML file,
just the LaTeX template and build scripts, with clear documentation on what data it needs as input.

(E.g. it's ok to merge just (2) for now, and work on (1) in parallel.)

akprasad

High-level comments:

remove the zivopaniSad file from this PR
use xml.etree.ElementTree instead of lxml -- they have basically the same API, but ElementTree is built-in and we'll probably remove lxml in the future.
use xpath (xml.find('./element/anotherElement/thirdElement)) instead of chaining tagname
no bare excepts anywhere
run black over the code

akprasad · 2022-10-20T03:02:49Z

render_pdf/src/runscript

@@ -0,0 +1,5 @@
+python make_latex.py


better name -- maybe create_pdf.sh, give it the .sh extension and add #!/usr/bin/sh at the top of the file, then chmod +x create_pdf.sh to mark it as executable

akprasad · 2022-10-20T03:03:17Z

render_pdf/src/parser.py

+
+if __name__ == "__main__":
+    input_file, output_file = sys.argv[1], sys.argv[2]
+    verses, analysis, title = parse(url=link)


where is link defined?

akprasad · 2022-10-20T03:03:35Z

render_pdf/src/parser.py

+
+
+# probably need to use asyncio here? 
+def parse(url=None, data_location = None):


run black over the file to fix the spacing

akprasad · 2022-10-20T03:04:17Z

render_pdf/src/parser.py

+
+
+# probably need to use asyncio here? 
+def parse(url=None, data_location = None):


there are two entirely different code paths here and it doesn't make sense to make them a single function. Better to have parse_url andparse_file and split them up accordingly

akprasad · 2022-10-20T03:04:41Z

render_pdf/src/parser.py

+def tagname(tag):
+    return f"{{{root_url}}}{tag}"
+
+def parse_from_root(root) -> (dict, dict):


akprasad · 2022-10-20T03:07:02Z

render_pdf/src/parser.py

+                    .find(tagname("titleStmt"))
+            author = tag.find(tagname("author")).text
+    except:
+        return 


better to have explicit return None

akprasad · 2022-10-20T03:07:50Z

render_pdf/src/parser.py

+            author = tag.find(tagname("author")).text
+    except:
+        return 
+    else:


try / except / else is kind of funky. Can you rewrite this?

something like:

author = root.find("./teiHeader/fileDesc/titleStmt/author') if author: return author.text else: return None

akprasad · 2022-10-20T03:08:20Z

render_pdf/src/parser.py

+    return verse_id, phrases
+
+def parse_note(note):
+    corresp = note.get("corresp").replace("#","")


will this fail if corresp is missing?

akprasad · 2022-10-20T03:08:35Z

render_pdf/src/parser.py

+    return parse_lg(note.find(tagname("lg")), corresp)
+
+def parse_div(div):
+    verses = []


return [parse_lg(lg) for lg in div]

akprasad · 2022-10-20T03:08:59Z

render_pdf/src/parser.py

+    analysis = {}
+    for child in root.iter():
+        if child.tag == tagname("lg"):
+            v,txt = parse_lg(child)


spaces after commas. black will do this for you

akprasad · 2022-10-20T03:12:30Z

render_pdf/src/parser.py

+
+
+# CLUNKY Fix this. Closure maybe?
+root_url = ""


I recommend just looping over the document and deleting this root prefix. see _remove_namespace in tei_parser.py for examples

akprasad · 2023-04-07T01:32:23Z

Returning to this PR -- I suggest we scope this PR down to the simplest thing that adds new functionality then iterate on it with follow-up PRs.

skmnktl · 2023-04-07T01:41:47Z

Do you want to meet this weekend to finalize this? I can wrap it up in the next couple of days?

…

On Thu, Apr 6, 2023 at 9:32 PM Arun Prasad ***@***.***> wrote: Returning to this PR -- I suggest we scope this PR down to the simplest thing that adds new functionality then iterate on it with follow-up PRs. — Reply to this email directly, view it on GitHub <#364 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AE2BJBATJKU2CSUW6P4PON3W75VDFANCNFSM6AAAAAAQSTANOE> . You are receiving this because you were mentioned.Message ID: ***@***.***>

akprasad · 2023-04-07T01:47:59Z

Do you want to meet this weekend to finalize this? I can wrap it up in the next couple of days?
…
On Thu, Apr 6, 2023 at 9:32 PM Arun Prasad @.> wrote: Returning to this PR -- I suggest we scope this PR down to the simplest thing that adds new functionality then iterate on it with follow-up PRs. — Reply to this email directly, view it on GitHub <#364 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE2BJBATJKU2CSUW6P4PON3W75VDFANCNFSM6AAAAAAQSTANOE . You are receiving this because you were mentioned.Message ID: @.>

Sure, just scheduled a weekly sync (see #general on Discord) or we can find some time elsewhere

skmnktl added 2 commits September 21, 2022 16:34

Parse XML to DICT

93a56fa

pdf rendering v0

f11de3f

skmnktl marked this pull request as draft September 22, 2022 01:45

akprasad requested changes Sep 23, 2022

View reviewed changes

skmnktl and others added 2 commits September 29, 2022 21:45

pdf rendering v0

7b13f08

Merge branch 'ambuda-org:main' into main

9e4631a

akprasad reviewed Sep 30, 2022

View reviewed changes

shreevatsa reviewed Oct 1, 2022

View reviewed changes

clean up: removing tufte package, creating runscript, fixed author ex…

a0905be

…traction

clean up: removing tufte package, creating runscript, fixed author ex…

5107ccd

…traction

skmnktl marked this pull request as ready for review October 2, 2022 16:04

akprasad reviewed Oct 2, 2022

View reviewed changes

render_pdf/src/parser.py Outdated Show resolved Hide resolved

skmnktl added 2 commits October 19, 2022 11:38

fixes

d75934f

resorting imports

8eddf77

akprasad requested changes Oct 20, 2022

View reviewed changes

akprasad reviewed Oct 20, 2022

View reviewed changes



		# probably need to use asyncio here?
		def parse(url=None, data_location = None):

v0 pdf rendering #364

Are you sure you want to change the base?

v0 pdf rendering #364

Conversation

skmnktl commented Sep 22, 2022

skmnktl commented Sep 22, 2022

shreevatsa commented Sep 22, 2022

skmnktl commented Sep 22, 2022

akprasad left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

skmnktl commented Sep 23, 2022

skmnktl commented Sep 30, 2022

skmnktl commented Sep 30, 2022

akprasad left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

skmnktl Oct 19, 2022 • edited

Choose a reason for hiding this comment

akprasad Sep 30, 2022 • edited

Choose a reason for hiding this comment

akprasad Sep 30, 2022 • edited

Choose a reason for hiding this comment

shreevatsa commented Sep 30, 2022

skmnktl commented Sep 30, 2022

shreevatsa commented Sep 30, 2022

shreevatsa commented Sep 30, 2022

shreevatsa left a comment

Choose a reason for hiding this comment

skmnktl commented Oct 2, 2022

shreevatsa commented Oct 2, 2022

akprasad left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shreevatsa commented Oct 11, 2022

akprasad left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

akprasad commented Apr 7, 2023

skmnktl commented Apr 7, 2023 via email

akprasad commented Apr 7, 2023

skmnktl Oct 19, 2022 •

edited

akprasad Sep 30, 2022 •

edited

akprasad Sep 30, 2022 •

edited