Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v0 pdf rendering #364

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

v0 pdf rendering #364

wants to merge 8 commits into from

Conversation

skmnktl
Copy link

@skmnktl skmnktl commented Sep 22, 2022

Tasks to be done:

  1. cleaning up the template itself
  2. build out the RESTful endpoints for getting the pdf
  3. much more testing. I'm not sure this will work for all the texts we've got. I'm making a lot of assumptions here.
  4. font selection
  5. deployment details-- (including latex packages, etc)
  6. multiple templates; perhaps an alternate template with maybe 3 verses at most per page, so i can annotate while i translate

@skmnktl skmnktl marked this pull request as draft September 22, 2022 01:45
@skmnktl
Copy link
Author

skmnktl commented Sep 22, 2022

@shreevatsa @akprasad Let me know if you have any suggestions for improvement. As @akprasad knows, I'm rather new to the building things business; most of my past work has been analytics, so I'm happy to get feedback!

@shreevatsa
Copy link
Contributor

Thanks, this is a start!

One way to put off thinking about deployment etc is to simply generate a .tex file that can be downloaded, and leave turning it into a PDF (by running xelatex or whatever) to the user: this won't be as useful to most users as having an actual PDF, but it's a start, and keeps the initial problem minimal.

In fact, we could put it off for quite a while: could simply generate the .tex and .pdf files offline (doesn't have to run as part of Ambuda), and have them stored in some repo / S3 / whatever.

@skmnktl
Copy link
Author

skmnktl commented Sep 22, 2022

Ah yes, that makes sense w.r.t. deployment. I think we can think about the S3 bucket + Overleaf API to do the compilation. But that's for later. I'll start testing this weekend. See if I can get a good number of the text we have up working.

Copy link
Contributor

@akprasad akprasad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main comment is that there's an easier way to manage the template logic by using Jinja, which is what we use for our HTML. Jinja gives us:

  • variable substitutions
  • loops and conditionals
  • macros

So it's well-suited to this use case. Here's a minimal example of how to use it:

from jinja2 import Environment
env = Environment()
template = env.from_string("This is {{foo}}")
print(template.render(foo="my foo"))

If Jinja's use of {} characters is confusing given the TeX context, you can pass your own overrides:

env = Environment(
    block_start_string='<%',        
    block_end_string='%>',
    variable_start_string='<<',
    variable_end_string='>>',
    comment_start_string='<#',
    comment_end_string='#>')

Docs: https://jinja.palletsprojects.com/en/3.1.x/templates/

render_pdf/tex_files/tufte-handout.cls Outdated Show resolved Hide resolved
@@ -0,0 +1,4 @@
xelatex template.tex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be an option to suppress these output files. I think this is better in case we end up deleting other files called .out, .log, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xelatex itself doesn't have an option for cleaning up, but look into invoking it via say latexmk (it has -c for clean IIRC) or arara: its documentation mentions a directive like

% arara: clean: { extensions: [ aux, log ] }

— other options I haven't looked at are latexrun or (further afield) tectonic.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, and I need xelatex for babel/font support. I'll fix this later; I'd rather not waste time now for something that has a clear/simple solution. And it's also not the end of the world if I leave the log files too. I think

  1. we can process a pdf, store it in a bucket to cache, and
  2. overwrite it for the next run of render_pdf.
    They're fairly small files, so it's not too large a concern.

render_pdf/src/parser.py Show resolved Hide resolved
@skmnktl
Copy link
Author

skmnktl commented Sep 23, 2022

Ohhh I didn't realize jinja could be used for this. Thank you!

@skmnktl
Copy link
Author

skmnktl commented Sep 30, 2022

@shreevatsa I've converted the code to a jinja template.

  1. How do I suppress those god awful halting latex compile messages? I've tried xelatex and lualatex. Couldn't figure it out...
  2. I support two of the texts on ambuda now. kumarasambhava and the śivopanishad. There's so much variation in texts. Right now I'm just writing a bunch of try escapes to get the fields to parse. Not sure if that's sustainable...

@skmnktl
Copy link
Author

skmnktl commented Sep 30, 2022

Should I focus on:

  1. adding texts, or
  2. getting the pdf to look nicer?

Copy link
Contributor

@akprasad akprasad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is ready for review or not, so I kept it fairly high-level

render_pdf/src/make_latex.py Show resolved Hide resolved
render_pdf/src/make_latex.py Outdated Show resolved Hide resolved
render_pdf/src/make_latex.py Show resolved Hide resolved
render_pdf/src/make_latex.py Outdated Show resolved Hide resolved
@@ -0,0 +1,129 @@
from lxml import etree
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of our transitive dependencies uses lxml, but let's use xml.etree.ElementTree instead, which has 99% of the functionality and is also in the standard lib. grep in ambuda for examples

Copy link
Author

@skmnktl skmnktl Oct 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not rewrite; any specific reason why you want to move to lxml? If possible, I'll leave this rewrite for another PR. @akprasad

# God this is ugly # TODO
try:
return root.find(tagname("head")).text
except:
Copy link
Contributor

@akprasad akprasad Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no bare except -- generally good practice to catch a specific exception like except AttributeError

.find(tagname("fileDesc")) \
.find(tagname("titleStmt"))
return tag.find(tagname("title")).text
except:
Copy link
Contributor

@akprasad akprasad Sep 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no bare except here

@shreevatsa
Copy link
Contributor

@skmnktl About supporting more texts v/s getting the pdf to look nicer, it's up to you (whatever you're more motivated to do), though if you're neutral between the two options, then I'd say that getting the PDF to look nicer can be done indefinitely, after a first version gets in. Also consider a third option, that of cleaning up the PR for production.

I could think of a sequence of PRs like, say:

  • A Python script that takes in an XML file and generates a .tex file (OK if it doesn't look great or works only on some XML files).
  • Extending that script to more files.
  • Offline tasks for generating PDF files by running the scripts, and storing those PDF files somewhere.
  • A route in the app to point to / download the corresponding (previously generated) PDF file, given a text.

But could also think of doing it in other orders.

BTW what did you mean by "halting latex compile messages"?

@skmnktl
Copy link
Author

skmnktl commented Sep 30, 2022

@shreevatsa For the tasks outlined:

A Python script that takes in an XML file and generates a .tex file (OK if it doesn't look great or works only on some XML files).

This is what make_latex.py does. I'm writing the resultant .tex file as output.tex.

Offline tasks for generating PDF files by running the scripts, and storing those PDF files somewhere.

Do we want to wait on this? Should I use AWS? Or perhaps a Google Cloud Storage Bucket? Or perhaps a mongo instance? If the main app will move to a more permanent storage, I can just piggyback off that. If you have preferences on what that storage will be in the future, we can do that as well.

A route in the app to point to / download the corresponding (previously generated) PDF file, given a text.

I'm hesitant to write this since the storage decisions have yet to be made. We could go the render on the fly every time route, and for that I can make a FastAPI endpoint. There's not that many texts right now anyway.

@shreevatsa
Copy link
Contributor

@skmnktl Hi, to be clear, by "sequence of PRs" I meant doing each of those things in a separate PR (so to answer your latter two questions, yes definitely wait on storing in the cloud, and don't write the app now) — I was actually talking about removing stuff from this PR (or moving to another one), to keep it minimal and ready to merge. :-) So this PR would have only 3 files:

  • make_latex.py
  • parser.py
  • template.tex

if I understand correctly. (Because output.tex is generated by make_latex.py, and output.pdf is generated by running xelatex/lualatex on output.tex, and tufte-handout.cls / tufte-common.def are already part of TeX Live so don't need to be committed.)

In the meantime I'll download and compile output.tex on my computer, so that I can see what error messages you were talking about…

@shreevatsa
Copy link
Contributor

@skmnktl I looked into output.tex locally. I see what you mean about the errors — there are indeed a lot of them, and even if hitting Enter (or putting TeX in batchmode) will proceed with those errors ignored, it's not a good state to get into :)

Let's start with a template where we understand why each line is needed, and which compiles without any errors or warnings, and add things as needed (e.g. the current template has a line or two of math-related stuff that I'm sure we won't need for a long time if ever).

The file compiles fine without any errors or warnings if I replace lines 1–53 of output.tex (or template.tex originally) with the following:

\documentclass{article}

\title{Kumārasaṃbhava}

\author{apauruṣeya}

\usepackage[parfill]{parskip}
\usepackage{fontspec}

\newenvironment{sanskrit}{}{}

or, if you want to keep tufte-handout (from looking around online, it seems to be not a very well-maintained package and has quite a few bugs unfortunately…), this works:

\documentclass[nobib]{tufte-handout}

\title{Kumārasaṃbhava}

\author{apauruṣeya}

\usepackage[parfill]{parskip}
\usepackage{fontspec}

% Fix for bug: see https://tex.stackexchange.com/q/200722
% and https://github.com/Tufte-LaTeX/tufte-latex/issues/64
% Set up the spacing using fontspec features
\renewcommand\allcapsspacing[1]{{\addfontfeature{LetterSpace=15}#1}}
\renewcommand\smallcapsspacing[1]{{\addfontfeature{LetterSpace=10}#1}}

\newenvironment{sanskrit}{}{}

Feel free to add your font choices back in later.

There were also some errors from Babel's Sanskrit setup; we don't seem to need it for this output.tex, and we can also consider using polyglossia instead of babel. (Sorry I haven't looked into the Devanagari version yet; I didn't run make_latex.py and just tried working with the output.tex that was already in the PR.)

Copy link
Contributor

@shreevatsa shreevatsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments and I'm yet to take a closer look atparser.py, but this is actually looking quite reasonable and close to merging after some small changes — thanks for working on this! We can continue improving after having a first version.

render_pdf/src/make_latex.py Outdated Show resolved Hide resolved
render_pdf/src/make_latex.py Outdated Show resolved Hide resolved
render_pdf/src/make_latex.py Outdated Show resolved Hide resolved
render_pdf/src/make_latex.py Show resolved Hide resolved
render_pdf/src/parser.py Outdated Show resolved Hide resolved
@skmnktl
Copy link
Author

skmnktl commented Oct 2, 2022

@shreevatsa I took your comments above and did a few things with my last commit:

  1. I removed the tufte package entirely.
  2. I removed a bunch of vestigial functions.
  3. I changed the template block and variable markers for jinja.
  4. I have a script in src now that toggles pdf creation, and then moves the pdf to a render_pdf/pdf_outputs titled <text_title>_<author>.pdf.
  5. And all the intermediate files are deleted (including the tex file).
  6. For now, I've suppressed the "sanskrit" environment that I had earlier for babel. But once we start supporting documents transliterated into other scripts, we can set those up I think.
  7. The runscript now uses latexmk rather than xelatex, though it does still call the xelatex engine in the background.

@shreevatsa
Copy link
Contributor

@skmnktl Great! I still see the tex_files with output.tex etc in the PR; could you take another look? Maybe they didn't get removed properly…

@skmnktl skmnktl marked this pull request as ready for review October 2, 2022 16:04
Copy link
Contributor

@akprasad akprasad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass. Also, feel free to move this to ambuda/render_pdf so that it's easier to reuse and share utils with the rest of the project.

@@ -0,0 +1,80 @@
from typing import Union
import parser
import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run isort render_pdf/**/*.py to standardize and reorder imports

render_pdf/src/make_latex.py Outdated Show resolved Hide resolved
render_pdf/src/parser.py Outdated Show resolved Hide resolved
render_pdf/src/parser.py Outdated Show resolved Hide resolved
render_pdf/src/parser.py Outdated Show resolved Hide resolved
render_pdf/src/parser.py Outdated Show resolved Hide resolved
except:
try:
root = alt_root
tag = root.find(tagname("teiHeader"))\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if using ElementTree:

try:
  return xml.find('./teiHeader/fileDesc/titleStmt').text
except AttributeError:
  return None

render_pdf/src/make_latex.py Outdated Show resolved Hide resolved
render_pdf/src/make_latex.py Show resolved Hide resolved
render_pdf/src/make_latex.py Outdated Show resolved Hide resolved
render_pdf/src/parser.py Outdated Show resolved Hide resolved
@shreevatsa
Copy link
Contributor

By the way, if it helps, feel free to consider splitting this PR further, and checking in just one of the two:

  1. just the parsing code to get "data" out of the TEI XML file,
  2. just the LaTeX template and build scripts, with clear documentation on what data it needs as input.

(E.g. it's ok to merge just (2) for now, and work on (1) in parallel.)

Copy link
Contributor

@akprasad akprasad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level comments:

  • remove the zivopaniSad file from this PR
  • use xml.etree.ElementTree instead of lxml -- they have basically the same API, but ElementTree is built-in and we'll probably remove lxml in the future.
  • use xpath (xml.find('./element/anotherElement/thirdElement)) instead of chaining tagname
  • no bare excepts anywhere
  • run black over the code

@@ -0,0 +1,5 @@
python make_latex.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better name -- maybe create_pdf.sh, give it the .sh extension and add #!/usr/bin/sh at the top of the file, then chmod +x create_pdf.sh to mark it as executable


if __name__ == "__main__":
input_file, output_file = sys.argv[1], sys.argv[2]
verses, analysis, title = parse(url=link)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is link defined?



# probably need to use asyncio here?
def parse(url=None, data_location = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run black over the file to fix the spacing



# probably need to use asyncio here?
def parse(url=None, data_location = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are two entirely different code paths here and it doesn't make sense to make them a single function. Better to have parse_url andparse_file and split them up accordingly

def tagname(tag):
return f"{{{root_url}}}{tag}"

def parse_from_root(root) -> (dict, dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a type

.find(tagname("titleStmt"))
author = tag.find(tagname("author")).text
except:
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to have explicit return None

author = tag.find(tagname("author")).text
except:
return
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try / except / else is kind of funky. Can you rewrite this?

something like:

author = root.find("./teiHeader/fileDesc/titleStmt/author')
if author:
  return author.text
else:
  return None

return verse_id, phrases

def parse_note(note):
corresp = note.get("corresp").replace("#","")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this fail if corresp is missing?

return parse_lg(note.find(tagname("lg")), corresp)

def parse_div(div):
verses = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return [parse_lg(lg) for lg in div]

analysis = {}
for child in root.iter():
if child.tag == tagname("lg"):
v,txt = parse_lg(child)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces after commas. black will do this for you



# CLUNKY Fix this. Closure maybe?
root_url = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend just looping over the document and deleting this root prefix. see _remove_namespace in tei_parser.py for examples

@akprasad
Copy link
Contributor

akprasad commented Apr 7, 2023

Returning to this PR -- I suggest we scope this PR down to the simplest thing that adds new functionality then iterate on it with follow-up PRs.

@skmnktl
Copy link
Author

skmnktl commented Apr 7, 2023 via email

@akprasad
Copy link
Contributor

akprasad commented Apr 7, 2023

Do you want to meet this weekend to finalize this? I can wrap it up in the next couple of days?

On Thu, Apr 6, 2023 at 9:32 PM Arun Prasad @.> wrote: Returning to this PR -- I suggest we scope this PR down to the simplest thing that adds new functionality then iterate on it with follow-up PRs. — Reply to this email directly, view it on GitHub <#364 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE2BJBATJKU2CSUW6P4PON3W75VDFANCNFSM6AAAAAAQSTANOE . You are receiving this because you were mentioned.Message ID: @.>

Sure, just scheduled a weekly sync (see #general on Discord) or we can find some time elsewhere

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants