-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indent HTML lists correctly (Issue 1073) #1170
base: master
Are you sure you want to change the base?
Conversation
Some debugging and polish needed.
Some variables need tweaking. Needs testing. Code reuse unsatisfactory.
Some variables need tweaking. Needs testing. Code reuse unsatisfactory.
Potentially significant issues with tests: 1. test_html_ln_outside_p - IndexError: list index out of range. 2. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected. Code reuse unsatisfactory.
Need feedback for handling <dd> and <blockquote>. Potentially significant issues with tests: 1. test_html_ol_ul_line_height - actual distance between lines differs slightly from expected. Need feedback for whether or not the new indentation that contradicts old tests is satisfactory. Code reuse unsatisfactory. Need feedback.
Bug present: bullets are made one per line instead of one per paragraph. Saving progress before introducing a `Bullet` class.
Feature implemented. Testing and adjustments of tests needed.
Prevented from `Paragraph.top_margin` being added to `pdf.y` of first lines of paragraphs with bullets.
Prevented from `Paragraph.top_margin` being added to `pdf.y` of first lines of paragraphs with bullets. `<ul>` and `<ol>` tags now cause a creation of a paragraph with the string `\n` being used to generate a fragment of the height `list_pseudo_margin`. Adjusted defaults for `li_tag_indent`.
Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line. Dealing with the issue of inappropriately large distance between `<dt>` and their child `<dd>` elements when `Paragraph.top_margin` is 0.
Changed `Paragraph.generate_bullet_frag()` into `generate_bullet_frag_and_tl`, and made it also generate the bullet text line.
# Conflicts: # fpdf/html.py
Adjusted old tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work so far, but the devil is in the detail...
As I'm sure you've noticed, the interplay between the HTML parser, text regions, line wrapping, and rendering is non-trivial. I've added some pointers of how to fix the parts that don't quite add up yet.
…`list_vertical_margin`. Removed the `MultiLineBreak.indent` attribute. Added a test for long `<ol>` bullets.
# Conflicts: # CHANGELOG.md
…ags_and_tl` method and in the `Bullet` class.
Edited `TextRegion.md` to reflect the introduced changes for `Paragraph`s. Added tests for `Paragraph` generation in `test_html.py`
Reformatted |
# Conflicts: # CHANGELOG.md # test/html/html_features.pdf
Now that the code has largely settled, let's look at the results.
The indents for lists are now much smaller than before. The previous default indent depended on the font size (5 x width of NBSP). This now seems to have changed to 5 "document units". With the default of mm this is too small. With the document units set to eg. inches, it will be way too large. You will have to pick a reasonable size in mm (eg. 8 or 10), and then make sure that if the document units aren't mm, this value gets converted appropriately before being used. The documentation also must clearly state that Note that top and bottom margins of The docstring for |
Regarding tests with small indents, do you want me to pass Regarding the new margins between elements in I might not have time to confidently deal with the 'magic numbers' tonight, so I will likely be pushing the changes tomorrow, and not today. Should I just do the conversion into appropriate units with them? If so, would you prefer for me to intentionally change them a little in order for them to look nicer, or would you prefer a more exact conversion? EDIT: Going to note that, currently, due to the |
…margin values in `html.py` to the chosen document unit of measurement. Adjusted default tag indent values. Moved the `Paragraph` docstring to the `ParagraphCollectorMixin.paragraph()` method. Changed the `CustomPDF` class in `test_html_customize_ul` to have non-static attributes `li_tag_indent` and `ul_bullet_char`. Adjusted tests.
Minor changes to HTML contents of `test_html_measurement_units()`.
docs/TextRegion.md
Outdated
@@ -84,6 +84,10 @@ For more typographical control, you can use the following arguments. Most of tho | |||
* line_height (float, optional) - factor by which the line spacing will be different from the font height. (default: by region) | |||
* top_margin (float, optional) - how much spacing is added above the paragraph. No spacing will be added at the top of the paragraph if the current y position is at (or above) the top margin of the page. (Default: 0.0) | |||
* bottom_margin (float, optional) - Those two values determine how much spacing is added below the paragraph. No spacing will be added at the bottom if it would result in overstepping the bottom margin of the page. (Default: 0.0) | |||
* indent (float, optional): determines the indentation of the paragraph. (Default: 0.0) | |||
* bullet_rel_x_displacement (float, optional) - determines the relative displacement of the bullet along the x-axis. The distance is between the rightmost point of the bullet to the leftmost point of the paragraph's text. (Default: 2.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe "bullet_right_margin" describes more accurately what this does?
docs/TextRegion.md
Outdated
@@ -84,6 +84,10 @@ For more typographical control, you can use the following arguments. Most of tho | |||
* line_height (float, optional) - factor by which the line spacing will be different from the font height. (default: by region) | |||
* top_margin (float, optional) - how much spacing is added above the paragraph. No spacing will be added at the top of the paragraph if the current y position is at (or above) the top margin of the page. (Default: 0.0) | |||
* bottom_margin (float, optional) - Those two values determine how much spacing is added below the paragraph. No spacing will be added at the bottom if it would result in overstepping the bottom margin of the page. (Default: 0.0) | |||
* indent (float, optional): determines the indentation of the paragraph. (Default: 0.0) | |||
* bullet_rel_x_displacement (float, optional) - determines the relative displacement of the bullet along the x-axis. The distance is between the rightmost point of the bullet to the leftmost point of the paragraph's text. (Default: 2.0) | |||
* bullet_rel_y_displacement (float, optional) - determines the relative displacement of the bullet along the y-axis. The distance is between the topmost point of the bullet and the topmost point of the paragraph's text. (Default: 0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And "bullet_top_margin" ?
Edit: I just noticed your earlier suggestion to remove this one entirely. That's a good idea, since it isn't really needed at the moment, and might stand in the way of other possible enhancements later.
…associated class attributes and docstring mentions to `bullet_r_margin`, `bullet_t_margin`, `bullet.r_margin`, `bullet.t_margin`.
fpdf/html.py
Outdated
@@ -352,7 +359,7 @@ def __init__( | |||
# "inserted" is a special attribute indicating that a cell has be inserted in self.table_row | |||
|
|||
if not tag_indents: | |||
tag_indents = {} | |||
tag_indents = {k: v / self.pdf.k for k, v in DEFAULT_TAG_INDENTS.items()} | |||
if dd_tag_indent != DEFAULT_TAG_INDENTS["dd"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to compare between different units. The user supplied argument is supposed to be in document units, but your currently store the default values in points. For the assignment below you then convert document units to points, which also doesn't looks correct.
You should probably compare against tag_indents
instead and not convert the user input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the assignment below you then convert document units to points, which also doesn't looks correct
I'm not seeing that anywhere.
Although, it seems that I made some changes for naught tonight. More specifically, the conversion of li_tag_indent
and dd_tag_indent
. That was an error, as those values are only relevant if they differ from the defaults, meaning that they shouldn't be subject to this conversion. I'm going back on those changes.
As such, this note seems to not be correct:
Those are still magic numbers (overly large now for mm), which will vary in their effect depending on the current document units
as the generated files evidently do not differ with change in document units provided that those default values in question are used.
You should probably compare against tag_indents instead and not convert the user input
Assuming you are talking about the change in the if
blocks for non-default li_tag_indent
and dd_tag_indent
, yeah, that was an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not seeing that anywhere.
You actually do. Your next paragraph explains exactly what the problem is. 👍
As such, this note seems to not be correct:
At that time it wasn't clear to me that the value of 30 is actually in points (problem of readability!).
That works for DEFAULT_TAG_INDENTS
, but for li_tag_indent
and dd_tag_indent
the default unit is mm, which would result in ridiculously large indents.
Actually, since those two parameters are deprecated and their default value should never get used, it might make sense to default them to None, which will make it simpler to determine whether they have been set by the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You actually do. Your next paragraph explains exactly what the problem is
But there was conversion from points to document units, and not the other way around.
In any case, seems like that matter is resolved now anyway.
That works for
DEFAULT_TAG_INDENTS
, but forli_tag_indent
anddd_tag_indent
the default unit is mm, which would result in ridiculously large indents
Considering that we only want to deal with those if the user passes the relevant parameters, and in that case we want to handle them regardless of whether or not they are equal to relevant entries in DEFAULT_TAG_INDENTS
, I'm making them have the value of None
by default, so that's not going to be an issue going forward anyway.
Actually, since those two parameters are deprecated and their default value should never get used, it might make sense to default them to None
Yeah. And there was another reason to handle them that way anyway.
fpdf/html.py
Outdated
self.ol_type = [] # when inside a <ol> tag, can be "a", "A", "i", "I" or "1" | ||
self.bullet = [] | ||
if list_vertical_margin is None: | ||
self.list_vertical_margin = 0.3 / self.pdf.k |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This converts 0.3 points (0.12 mm) to document units.
You may have seen self.font_size
being converted like this a few times. The pitfall being that self.font_size
is derived from pdf.font_size_pt
, which actually is in points. We should probably rename self.font_size
in this module also to self.font_size_pt
to make this more obvious.
Besides the "magical number" problem in this specific instance, there are more cases where you do a similarly incorrect conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this conversion incorrect, and in what sense is it a 'magical number' if it's simply the default value for the vertical margin, considering that the conversion is handled?
EDIT: Do you want the default value of HTML2PDF.list_vertical_margin
to depend on something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so the conversion is intentional, but then the amount doesn't look like something you'd want.
I think the default here used to be 2 mm (implemented incorrectly without taking document units into account).
If you wanted to specify the same amount in points and convert it to document units for passing on, then 2 * util.get_scale_factor("mm")
would result in ~5.67 points, not 0.3.
Your current value amounts to roughly a tenth of a mm, so you could just as well leave the margin away.
And again, a constant like that should come with a comment specifying its unit and how much that is in mm (most people are not aware of how big a typographical point is).
And no, I don't see this depending on any other size. What did you have in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And again, a constant like that should come with a comment specifying its unit and how much that is in mm (most people are not aware of how big a typographical point is)
Given that there were no such comments prior to my intrusion, I assumed that I shouldn't be changing that. Where else should I introduce such comments, if you don't mind me asking?
And no, I don't see this depending on any other size. What did you have in mind?
I am/was confused about why we should make get_scale_factor()
calls instead of just using self.pdf.k
. Do I understand it correctly now that you want me to return the relevant default values to how they were prior to my changes, and to convert them with * get_scale_factor('mm') / self.pdf.k
into the appropriate units?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that there were no such comments prior to my intrusion, I assumed that I shouldn't be changing that. Where else should I introduce such comments, if you don't mind me asking?
This module still contains a lot of legacy code of rather low quality. Any improvement to that is welcome. So far we have just fixed the parts we were touching anyway when adding features. I recommend you do the same, in order to keep the focus on your actual goal. At some later point someone will have to go through every line with a fine toothed comb, but that is a separate and rather tedious task, requiring a good grasp of the whole code base.
I've commented on recommended unit use further above.
… `dd_tag_indent` in `html.py`.
Actually, I see an issue with how the default values for |
`bullet_t_margin` and associated variables removed. `li_tag_indent` and `dd_tag_indent` default values are changed to None. Changes to `test_html_measurement_units` not included.
fpdf/html.py
Outdated
if list_vertical_margin is None: | ||
# Default value of 2 to be multiplied by the conversion factor | ||
# for list_vertical_margin is given in mm | ||
self.list_vertical_margin = 2 * self.default_conversion_factor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the argument assigned to self.list_vertical_margin
if it isn't None?
And since we offer this configurability, there should also be a test verifying that a supplied value actually has the desired effect (which would have revealed this bug even before checking it in).
def __init__(self): | ||
super().__init__() | ||
self.li_tag_indent = 40 | ||
self.ul_bullet_char = "\x86" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of making those two values attributes of the (derived) FPDF instance?
I only see them changed below, but never actually getting used anywhere.
For something like that to at least potentially make sense, you'd have to subclass HTML2FPDF()
, not FPDF()
, and assign the derived class to pdf.HTML2FPDF_CLASS
. But since you supply the relevant values as arguments to pdf.write_html()
anyway, there's really no point at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent behind that portion of code seems to be to utilise this portion of FPDF.write_html()
:
kwargs2 = vars(self)
# Method arguments must override class & instance attributes:
kwargs2.update(kwargs)
html2pdf = self.HTML2FPDF_CLASS(self, *args, **kwargs2)
However, vars()
does not return static attributes, meaning that previously the li_tag_indent
and ul_bullet_char
of the custom FPDF
inheritor class did nothing. Now they are bound to individual objects and are returned by vars()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very weird piece of code there, and I'm not sure if it really still serves any purpose (if it ever has). As far as I'm aware, there's no relevant overlap between instance attributes of FPDF()
and keyword parameters of HTML2PDF()
. Most likely we could just eliminate kwargs2
there without anyone missing it, and I hope this is going to happen soon (edit: just created #1198 with that goal). I also don't see any documentation of why it would be useful to have this.
The reasonable (and much easier to understand) alternative would be to simply pass those two values to write_html()
directly, without the weird detour of first stowing them into instance attributes of an otherwise unrelated class. If you absolutely have to prepare them in advance, you can also pack them in a dict, and pass that one as **kwargs
.
It seems to be your fate to uncover really obscure and obsolete legacy code in fpdf2... Makes your task a bit confusing, but also gives us a chance of cleaning things up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very weird piece of code there, and I'm not sure if it really still serves any purpose (if it ever has). As far as I'm aware, there's no relevant overlap between instance attributes of
FPDF()
and keyword parameters ofHTML2PDF()
.
Not sure what you mean here, but in case you are talking about whether or not the self.li_tag_indent
and self.ul_bullet_char
are correctly used in later calls of pdf.write_html()
, then I did check that and this piece of code now works correctly. Previously, it did not, which I noticed after looking at the indentation of relevant lists in the resulting .pdf
files.
Most likely we could just eliminate kwargs2 there without anyone missing it, and I hope this is going to happen soon (edit: just created #1198 with that goal)
Should I leave this for other volunteers (at least for now in case nobody will take it)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean here
Well, passing arguments that way does technically work, just that it is a horrible idea in terms of software architecture.
Should I leave this for other volunteers (at least for now in case nobody will take it)?
Before actually removing that "feature", let's see what kind of reactions #1198 will receive. But I do recommend to not use it in your tests, or someone will have to fix those later as well.
CHANGELOG.md
Outdated
### Changed | ||
* [`FPDF.table()`](https://py-pdf.github.io/fpdf2/Tables.html) now raises an error when a single row is too high to be rendered on a single page | ||
* `HTML2FPDF.handle_starttag()` now generates one `Paragraph` object for every `<li>` HTML element. Margins above lists are now handled with `<ul>` and `<ol>` tags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changelog is directed at end users, who don't really care about which elements are treated internally as Paragraph()
s, so this part could be removed. On the other hand, the way top and bottom margins are now handled could be made more explicit (eg. toplevel vs. nested).
def __init__(self): | ||
super().__init__() | ||
self.li_tag_indent = 40 | ||
self.ul_bullet_char = "\x86" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very weird piece of code there, and I'm not sure if it really still serves any purpose (if it ever has). As far as I'm aware, there's no relevant overlap between instance attributes of FPDF()
and keyword parameters of HTML2PDF()
. Most likely we could just eliminate kwargs2
there without anyone missing it, and I hope this is going to happen soon (edit: just created #1198 with that goal). I also don't see any documentation of why it would be useful to have this.
The reasonable (and much easier to understand) alternative would be to simply pass those two values to write_html()
directly, without the weird detour of first stowing them into instance attributes of an otherwise unrelated class. If you absolutely have to prepare them in advance, you can also pack them in a dict, and pass that one as **kwargs
.
It seems to be your fate to uncover really obscure and obsolete legacy code in fpdf2... Makes your task a bit confusing, but also gives us a chance of cleaning things up.
Added `test_html_list_vertical_margin`. Fixed the non-assignment of `HTML2FPDF.list_vertical_margin` when the constructor argument `list_vertical_margin` is not None.
Fixes #1073
Implements paragraph indentation via adjustments of
pdf.x
instead of using a natural number of whitespaces.Breaks up list items into individual paragraphs.
Checklist:
The GitHub pipeline is OK (green),
meaning that both
pylint
(static code analyzer) andblack
(code formatter) are happy with the changes of this PR.A unit test is covering the code added / modified by this PR
This PR is ready to be merged
In case of a new feature, docstrings have been added, with also some documentation in the
docs/
folderA mention of the change is present in
CHANGELOG.md
Not sure how to actually name the
HTML2FPDF.list_pseudo_margin
attribute. It is used for determining the height of the\n
line created when a<ul>
or<ol>
starting tag is handled.Should the re-implementation of paragraph indentation be reflected in a doctstring, even though the
tag_indents
parameter ofwrite_html()
has not been touched? If so, where should it be placed?By submitting this pull request, I confirm that my contribution is made under the terms of the GNU LGPL 3.0 license.