Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indent HTML lists correctly (Issue 1073) #1170

Open
wants to merge 40 commits into
base: master
Choose a base branch
from

Conversation

lcgeneralprojects
Copy link

@lcgeneralprojects lcgeneralprojects commented May 17, 2024

Fixes #1073
Implements paragraph indentation via adjustments of pdf.x instead of using a natural number of whitespaces.
Breaks up list items into individual paragraphs.

Checklist:

  • The GitHub pipeline is OK (green),
    meaning that both pylint (static code analyzer) and black (code formatter) are happy with the changes of this PR.

  • A unit test is covering the code added / modified by this PR

  • This PR is ready to be merged

  • In case of a new feature, docstrings have been added, with also some documentation in the docs/ folder

  • A mention of the change is present in CHANGELOG.md

Not sure how to actually name the HTML2FPDF.list_pseudo_margin attribute. It is used for determining the height of the \n line created when a <ul> or <ol> starting tag is handled.

Should the re-implementation of paragraph indentation be reflected in a doctstring, even though the tag_indents parameter of write_html() has not been touched? If so, where should it be placed?

By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.

Some debugging and polish needed.
Some variables need tweaking.
Needs testing.
Code reuse unsatisfactory.
Some variables need tweaking.
Needs testing.
Code reuse unsatisfactory.
Potentially significant issues with tests:
1. test_html_ln_outside_p - IndexError: list index out of range.
2. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected.

Code reuse unsatisfactory.
Need feedback for handling <dd> and <blockquote>.

Potentially significant issues with tests:
1. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected.

Need feedback for whether or not the new indentation that contradicts old tests is satisfactory.

Code reuse unsatisfactory.
Need feedback.
Bug present: bullets are made one per line instead of one per paragraph.
Saving progress before introducing a `Bullet` class.
Feature implemented.
Testing and adjustments of tests needed.
Prevented from `Paragraph.top_margin` being added to `pdf.y` of first lines of paragraphs with bullets.
Prevented from `Paragraph.top_margin` being added to `pdf.y` of first lines of paragraphs with bullets.
`<ul>` and `<ol>` tags now cause a creation of a paragraph with the string `\n` being used to generate a fragment of the height `list_pseudo_margin`.
Adjusted defaults for `li_tag_indent`.
fpdf/fpdf.py Outdated Show resolved Hide resolved
Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line.
Dealing with the issue of inappropriately large distance between `<dt>` and their child `<dd>` elements when `Paragraph.top_margin` is 0.
Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line.
@lcgeneralprojects lcgeneralprojects marked this pull request as ready for review May 20, 2024 07:37
@gmischler gmischler changed the title Issue 1073 Indent HTML lists correctly (Issue 1073) May 20, 2024
Copy link
Collaborator

@gmischler gmischler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work so far, but the devil is in the detail...

As I'm sure you've noticed, the interplay between the HTML parser, text regions, line wrapping, and rendering is non-trivial. I've added some pointers of how to fix the parts that don't quite add up yet.

fpdf/line_break.py Outdated Show resolved Hide resolved
fpdf/html.py Outdated Show resolved Hide resolved
fpdf/html.py Show resolved Hide resolved
fpdf/text_region.py Show resolved Hide resolved
fpdf/text_region.py Outdated Show resolved Hide resolved
fpdf/text_region.py Show resolved Hide resolved
fpdf/text_region.py Show resolved Hide resolved
fpdf/html.py Outdated Show resolved Hide resolved
test/html/test_html.py Show resolved Hide resolved
…`list_vertical_margin`.

Removed the `MultiLineBreak.indent` attribute.
Added a test for long `<ol>` bullets.
…ags_and_tl` method and in the `Bullet` class.
Edited `TextRegion.md` to reflect the introduced changes for `Paragraph`s.
Added tests for `Paragraph` generation in `test_html.py`
@lcgeneralprojects
Copy link
Author

Reformatted html.py using black manually.

@gmischler
Copy link
Collaborator

Now that the code has largely settled, let's look at the results.

  • html_blockquote_indent.pdf - I know this is a pre-existing test, but it would probably be good to add a blockquoted paragraph with several lines, just to show the effect, and also to prevent possible future regressions.

  • html_customize_ul.pdf - Small indents.

  • html_li_prefix_color.pdf - Small indents.

  • html_li_tag_indent.pdf - Small indents.

  • html_ln_outside_p.pdf - Small indents.

  • html_ol_ul_line_height.pdf - Small indents.

  • html_ul_type.pdf - Small indents.

  • html_tag_indent.pdf - Besides the width of the indent, this one would also benefit from a bit more (multi-line) context.

  • html_description.pdf - Has received a vertical margin it didn't have before. Is this now more "correct" (HTML specs / typical browser behaviour)?

  • html_features.pdf - Reflects changes already seen in other files.

  • html_long_list_entries - New. Demonstrates multiline list item indent.

  • html_long_ol_bullets - New. Demonstrates bullet left cut-off.

The indents for lists are now much smaller than before. The previous default indent depended on the font size (5 x width of NBSP). This now seems to have changed to 5 "document units". With the default of mm this is too small. With the document units set to eg. inches, it will be way too large. You will have to pick a reasonable size in mm (eg. 8 or 10), and then make sure that if the document units aren't mm, this value gets converted appropriately before being used. The documentation also must clearly state that tag_indents values have to be given in document units.
Warning: I'm unable to test this right now, but this is my conclusion looking at the code. You'll have to test what actually happens when the document units aren't mm, and obviously those tests need to be added to our permanent collection.
Maybe the simplest solution will be to whenever we assign a value from DEFAULT_TAG_INDENTS to self.tag_indents, to systematically convert it from mm to document units.

Note that top and bottom margins of Paragraph()s are also in document units (which the documentation currently doesn't make sufficiently clear).
You'll have to take the same precautions with those as with the indents. In most cases they are defined in terms of font height, which is fine, but there are some hard-coded "magic numbers" present in the code as well. Most of those will have been there before you started (you can probably blame me for some of them 😉), but this is a good opportunity to fix them.
The same applies to the last hardcoded self._ln(2) call within the <li> tag code. This also incorrectly assumes mm, and needs to be converted to document units.

The docstring for Paragraph() looks good. However, the end user will access this functionality through ParagraphCollectorMixin.paragraph(), so maybe it should rather go there?

@lcgeneralprojects
Copy link
Author

lcgeneralprojects commented May 30, 2024

Regarding tests with small indents, do you want me to pass tag_indents values into pdf.write_html() calls in tests where that argument is not used, in addition to adjusting default tag indentation handling?

Regarding the new margins between elements in html_description.pdf - Firefox and Edge both produce the margin with that HTML code.

I might not have time to confidently deal with the 'magic numbers' tonight, so I will likely be pushing the changes tomorrow, and not today. Should I just do the conversion into appropriate units with them? If so, would you prefer for me to intentionally change them a little in order for them to look nicer, or would you prefer a more exact conversion?

EDIT: Going to note that, currently, due to the HTML2FPDF._ln() call when handling <li> start tags, there will be a gap of the relevant size, even if the list is the first visible content of the document, in case there will be a need to eliminate it in the future.

…margin values in `html.py` to the chosen document unit of measurement.

Adjusted default tag indent values.
Moved the `Paragraph` docstring to the `ParagraphCollectorMixin.paragraph()` method.
Changed the `CustomPDF` class in `test_html_customize_ul` to have non-static attributes `li_tag_indent` and `ul_bullet_char`.
Adjusted tests.
fpdf/html.py Outdated Show resolved Hide resolved
Minor changes to HTML contents of `test_html_measurement_units()`.
@@ -84,6 +84,10 @@ For more typographical control, you can use the following arguments. Most of tho
* line_height (float, optional) - factor by which the line spacing will be different from the font height. (default: by region)
* top_margin (float, optional) - how much spacing is added above the paragraph. No spacing will be added at the top of the paragraph if the current y position is at (or above) the top margin of the page. (Default: 0.0)
* bottom_margin (float, optional) - Those two values determine how much spacing is added below the paragraph. No spacing will be added at the bottom if it would result in overstepping the bottom margin of the page. (Default: 0.0)
* indent (float, optional): determines the indentation of the paragraph. (Default: 0.0)
* bullet_rel_x_displacement (float, optional) - determines the relative displacement of the bullet along the x-axis. The distance is between the rightmost point of the bullet to the leftmost point of the paragraph's text. (Default: 2.0)
Copy link
Collaborator

@gmischler gmischler Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "bullet_right_margin" describes more accurately what this does?

@@ -84,6 +84,10 @@ For more typographical control, you can use the following arguments. Most of tho
* line_height (float, optional) - factor by which the line spacing will be different from the font height. (default: by region)
* top_margin (float, optional) - how much spacing is added above the paragraph. No spacing will be added at the top of the paragraph if the current y position is at (or above) the top margin of the page. (Default: 0.0)
* bottom_margin (float, optional) - Those two values determine how much spacing is added below the paragraph. No spacing will be added at the bottom if it would result in overstepping the bottom margin of the page. (Default: 0.0)
* indent (float, optional): determines the indentation of the paragraph. (Default: 0.0)
* bullet_rel_x_displacement (float, optional) - determines the relative displacement of the bullet along the x-axis. The distance is between the rightmost point of the bullet to the leftmost point of the paragraph's text. (Default: 2.0)
* bullet_rel_y_displacement (float, optional) - determines the relative displacement of the bullet along the y-axis. The distance is between the topmost point of the bullet and the topmost point of the paragraph's text. (Default: 0.0)
Copy link
Collaborator

@gmischler gmischler Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And "bullet_top_margin" ?
Edit: I just noticed your earlier suggestion to remove this one entirely. That's a good idea, since it isn't really needed at the moment, and might stand in the way of other possible enhancements later.

…associated class attributes and docstring mentions to `bullet_r_margin`, `bullet_t_margin`, `bullet.r_margin`, `bullet.t_margin`.
fpdf/html.py Outdated
@@ -352,7 +359,7 @@ def __init__(
# "inserted" is a special attribute indicating that a cell has be inserted in self.table_row

if not tag_indents:
tag_indents = {}
tag_indents = {k: v / self.pdf.k for k, v in DEFAULT_TAG_INDENTS.items()}
if dd_tag_indent != DEFAULT_TAG_INDENTS["dd"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to compare between different units. The user supplied argument is supposed to be in document units, but your currently store the default values in points. For the assignment below you then convert document units to points, which also doesn't looks correct.

You should probably compare against tag_indents instead and not convert the user input.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the assignment below you then convert document units to points, which also doesn't looks correct

I'm not seeing that anywhere.

Although, it seems that I made some changes for naught tonight. More specifically, the conversion of li_tag_indent and dd_tag_indent. That was an error, as those values are only relevant if they differ from the defaults, meaning that they shouldn't be subject to this conversion. I'm going back on those changes.

As such, this note seems to not be correct:

Those are still magic numbers (overly large now for mm), which will vary in their effect depending on the current document units

as the generated files evidently do not differ with change in document units provided that those default values in question are used.

You should probably compare against tag_indents instead and not convert the user input

Assuming you are talking about the change in the if blocks for non-default li_tag_indent and dd_tag_indent, yeah, that was an error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing that anywhere.

You actually do. Your next paragraph explains exactly what the problem is. 👍

As such, this note seems to not be correct:

At that time it wasn't clear to me that the value of 30 is actually in points (problem of readability!).
That works for DEFAULT_TAG_INDENTS, but for li_tag_indent and dd_tag_indent the default unit is mm, which would result in ridiculously large indents.
Actually, since those two parameters are deprecated and their default value should never get used, it might make sense to default them to None, which will make it simpler to determine whether they have been set by the user.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You actually do. Your next paragraph explains exactly what the problem is

But there was conversion from points to document units, and not the other way around.
In any case, seems like that matter is resolved now anyway.

That works for DEFAULT_TAG_INDENTS, but for li_tag_indent and dd_tag_indent the default unit is mm, which would result in ridiculously large indents

Considering that we only want to deal with those if the user passes the relevant parameters, and in that case we want to handle them regardless of whether or not they are equal to relevant entries in DEFAULT_TAG_INDENTS, I'm making them have the value of None by default, so that's not going to be an issue going forward anyway.

Actually, since those two parameters are deprecated and their default value should never get used, it might make sense to default them to None

Yeah. And there was another reason to handle them that way anyway.

fpdf/html.py Outdated Show resolved Hide resolved
fpdf/html.py Outdated
self.ol_type = [] # when inside a <ol> tag, can be "a", "A", "i", "I" or "1"
self.bullet = []
if list_vertical_margin is None:
self.list_vertical_margin = 0.3 / self.pdf.k
Copy link
Collaborator

@gmischler gmischler Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This converts 0.3 points (0.12 mm) to document units.

You may have seen self.font_size being converted like this a few times. The pitfall being that self.font_size is derived from pdf.font_size_pt, which actually is in points. We should probably rename self.font_size in this module also to self.font_size_pt to make this more obvious.

Besides the "magical number" problem in this specific instance, there are more cases where you do a similarly incorrect conversion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this conversion incorrect, and in what sense is it a 'magical number' if it's simply the default value for the vertical margin, considering that the conversion is handled?

EDIT: Do you want the default value of HTML2PDF.list_vertical_margin to depend on something else?

Copy link
Collaborator

@gmischler gmischler Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so the conversion is intentional, but then the amount doesn't look like something you'd want.

I think the default here used to be 2 mm (implemented incorrectly without taking document units into account).

If you wanted to specify the same amount in points and convert it to document units for passing on, then 2 * util.get_scale_factor("mm") would result in ~5.67 points, not 0.3.
Your current value amounts to roughly a tenth of a mm, so you could just as well leave the margin away.

And again, a constant like that should come with a comment specifying its unit and how much that is in mm (most people are not aware of how big a typographical point is).

And no, I don't see this depending on any other size. What did you have in mind?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And again, a constant like that should come with a comment specifying its unit and how much that is in mm (most people are not aware of how big a typographical point is)

Given that there were no such comments prior to my intrusion, I assumed that I shouldn't be changing that. Where else should I introduce such comments, if you don't mind me asking?

And no, I don't see this depending on any other size. What did you have in mind?

I am/was confused about why we should make get_scale_factor() calls instead of just using self.pdf.k. Do I understand it correctly now that you want me to return the relevant default values to how they were prior to my changes, and to convert them with * get_scale_factor('mm') / self.pdf.k into the appropriate units?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there were no such comments prior to my intrusion, I assumed that I shouldn't be changing that. Where else should I introduce such comments, if you don't mind me asking?

This module still contains a lot of legacy code of rather low quality. Any improvement to that is welcome. So far we have just fixed the parts we were touching anyway when adding features. I recommend you do the same, in order to keep the focus on your actual goal. At some later point someone will have to go through every line with a fine toothed comb, but that is a separate and rather tedious task, requiring a good grasp of the whole code base.

I've commented on recommended unit use further above.

@lcgeneralprojects
Copy link
Author

Actually, I see an issue with how the default values for li_tag_indent and dd_tag_indent are handled in my code. I am currently unable to dedicate time to fixing it, so that will have to wait until tomorrow.

`bullet_t_margin` and associated variables removed.
`li_tag_indent` and `dd_tag_indent` default values are changed to None.
Changes to `test_html_measurement_units` not included.
fpdf/html.py Outdated
if list_vertical_margin is None:
# Default value of 2 to be multiplied by the conversion factor
# for list_vertical_margin is given in mm
self.list_vertical_margin = 2 * self.default_conversion_factor
Copy link
Collaborator

@gmischler gmischler Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the argument assigned to self.list_vertical_margin if it isn't None?
And since we offer this configurability, there should also be a test verifying that a supplied value actually has the desired effect (which would have revealed this bug even before checking it in).

def __init__(self):
super().__init__()
self.li_tag_indent = 40
self.ul_bullet_char = "\x86"
Copy link
Collaborator

@gmischler gmischler Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of making those two values attributes of the (derived) FPDF instance?
I only see them changed below, but never actually getting used anywhere.

For something like that to at least potentially make sense, you'd have to subclass HTML2FPDF(), not FPDF(), and assign the derived class to pdf.HTML2FPDF_CLASS. But since you supply the relevant values as arguments to pdf.write_html() anyway, there's really no point at all.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent behind that portion of code seems to be to utilise this portion of FPDF.write_html():

        kwargs2 = vars(self)
        # Method arguments must override class & instance attributes:
        kwargs2.update(kwargs)
        html2pdf = self.HTML2FPDF_CLASS(self, *args, **kwargs2)

However, vars() does not return static attributes, meaning that previously the li_tag_indent and ul_bullet_char of the custom FPDF inheritor class did nothing. Now they are bound to individual objects and are returned by vars().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very weird piece of code there, and I'm not sure if it really still serves any purpose (if it ever has). As far as I'm aware, there's no relevant overlap between instance attributes of FPDF() and keyword parameters of HTML2PDF(). Most likely we could just eliminate kwargs2 there without anyone missing it, and I hope this is going to happen soon (edit: just created #1198 with that goal). I also don't see any documentation of why it would be useful to have this.

The reasonable (and much easier to understand) alternative would be to simply pass those two values to write_html() directly, without the weird detour of first stowing them into instance attributes of an otherwise unrelated class. If you absolutely have to prepare them in advance, you can also pack them in a dict, and pass that one as **kwargs.

It seems to be your fate to uncover really obscure and obsolete legacy code in fpdf2... Makes your task a bit confusing, but also gives us a chance of cleaning things up.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very weird piece of code there, and I'm not sure if it really still serves any purpose (if it ever has). As far as I'm aware, there's no relevant overlap between instance attributes of FPDF() and keyword parameters of HTML2PDF().

Not sure what you mean here, but in case you are talking about whether or not the self.li_tag_indent and self.ul_bullet_char are correctly used in later calls of pdf.write_html(), then I did check that and this piece of code now works correctly. Previously, it did not, which I noticed after looking at the indentation of relevant lists in the resulting .pdf files.

Most likely we could just eliminate kwargs2 there without anyone missing it, and I hope this is going to happen soon (edit: just created #1198 with that goal)

Should I leave this for other volunteers (at least for now in case nobody will take it)?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean here

Well, passing arguments that way does technically work, just that it is a horrible idea in terms of software architecture.

Should I leave this for other volunteers (at least for now in case nobody will take it)?

Before actually removing that "feature", let's see what kind of reactions #1198 will receive. But I do recommend to not use it in your tests, or someone will have to fix those later as well.

CHANGELOG.md Outdated
### Changed
* [`FPDF.table()`](https://py-pdf.github.io/fpdf2/Tables.html) now raises an error when a single row is too high to be rendered on a single page
* `HTML2FPDF.handle_starttag()` now generates one `Paragraph` object for every `<li>` HTML element. Margins above lists are now handled with `<ul>` and `<ol>` tags.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changelog is directed at end users, who don't really care about which elements are treated internally as Paragraph()s, so this part could be removed. On the other hand, the way top and bottom margins are now handled could be made more explicit (eg. toplevel vs. nested).

def __init__(self):
super().__init__()
self.li_tag_indent = 40
self.ul_bullet_char = "\x86"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very weird piece of code there, and I'm not sure if it really still serves any purpose (if it ever has). As far as I'm aware, there's no relevant overlap between instance attributes of FPDF() and keyword parameters of HTML2PDF(). Most likely we could just eliminate kwargs2 there without anyone missing it, and I hope this is going to happen soon (edit: just created #1198 with that goal). I also don't see any documentation of why it would be useful to have this.

The reasonable (and much easier to understand) alternative would be to simply pass those two values to write_html() directly, without the weird detour of first stowing them into instance attributes of an otherwise unrelated class. If you absolutely have to prepare them in advance, you can also pack them in a dict, and pass that one as **kwargs.

It seems to be your fate to uncover really obscure and obsolete legacy code in fpdf2... Makes your task a bit confusing, but also gives us a chance of cleaning things up.

Added `test_html_list_vertical_margin`.
Fixed the non-assignment of `HTML2FPDF.list_vertical_margin` when the constructor argument `list_vertical_margin` is not None.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Long HTML list entries are not correctly indented
3 participants