feature request: better <text> elements #56

Open
tantau opened this issue Apr 4, 2016 · 4 comments

@tantau

tantau commented Apr 4, 2016

The <text> elements generated by dvisvgm are suboptimal. Consider the following input:

\begin{document}
Hallo Welt! Dies ist ein längerer Text.
\end{document}

The typical output is:

<text class='f0' x='67.746' y='63.7609'>Hallo<tspan x='93.9505'>W</tspan>
<tspan x='103.321'>elt!</tspan>
<tspan x='121.535'>Dies</tspan>
<tspan x='143.521'>ist</tspan>
<tspan x='157.374'>ein</tspan>
<tspan x='173.377'>l㑿</tspan>
<tspan x='176.116'>angerer</tspan>
<tspan x='211.474'>T</tspan>
<tspan x='217.812'>ext.</tspan>

There are two problems with this:

  1. There are no spaces in the text. As a consequence, when text is selected, copied, and pasted, the result contains no spaces: HalloWelt!Diesisteinl㑿angererText.

It would be more than helpful if spaces were inserted into the output, for instance using a heuristic: if the horizontal advance between two letters exceeds a certain threshold, a space is added.

  2. The output is very long because every tspan element takes up a lot of space, and a tspan is typically inserted every three to four letters.

This is a real problem: the pgfmanual needs about 10 MB as a PDF and about 600 MB (!) as a sequence of SVGs. Admittedly, a lot of this is due to the embedded fonts (addressed in a different feature request), but at least 100 MB is caused by the tspan elements alone.
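As a rough illustration of the threshold heuristic from the first point, here is a minimal Python sketch. The glyph tuples, the `insert_spaces` name, and the threshold value are made up for this example; none of it is dvisvgm code, and it assumes the advance width of each glyph is known.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x position, advance width),
# all in the same length unit.
Glyph = Tuple[str, float, float]


def insert_spaces(glyphs: List[Glyph], threshold: float) -> str:
    """Join glyphs into a string, adding a space wherever the gap between
    the end of one glyph and the start of the next exceeds `threshold`."""
    out = []
    prev_end = None
    for char, x, width in glyphs:
        if prev_end is not None and x - prev_end > threshold:
            out.append(" ")
        out.append(char)
        prev_end = x + width
    return "".join(out)


# Invented positions: a wide gap before "W", a small negative kern before "e".
glyphs = [("H", 0.0, 7.5), ("i", 7.5, 2.8), ("W", 13.6, 10.2), ("e", 22.9, 4.4)]
print(insert_spaces(glyphs, threshold=1.5))
```

A fixed threshold is the simplest choice; a threshold relative to the font size would probably behave better across different fonts.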

I propose the following change (knowing that it is not trivial, but it should be doable):

For each line, use a single tspan and use the dx attribute to set the spacing and kerning for each letter. When there is a font change, use a nested tspan for it; when there is a special or a rect, stop the current tspan and restart afterwards:

<text class='f0' x='67.746' y='63.7609'>
<tspan dx='0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 ... -1.5'>
Hallo Welt! Dies ist ein l㑿angerer Text.</tspan>
<tspan x='67.746' dy='12' dx='0 0 0 0 0 -1 ...'>
Eine zweite Zeile.</tspan>
</text>

The semantics of dx is that you specify an offset for each letter. Naturally, we will have lots of 0s separated by spaces, but it is still more compact than a tspan for every three to four letters (and more easily compressible). Also, the text stays uninterrupted in the XML, which is useful for searching and processing purposes.
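To make the dx idea concrete, here is a small Python sketch that derives the per-letter offsets from target positions. It assumes each glyph's target x position and its natural advance width are known and that the renderer uses the same advance widths; the names `dx_values` and `tspan_with_dx` are hypothetical.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape

# Hypothetical glyph record: (character, target x position, advance width).
Glyph = Tuple[str, float, float]


def dx_values(glyphs: List[Glyph], start_x: float) -> List[float]:
    """Offset of each glyph relative to where the pen would be after the
    previous glyph's natural advance."""
    dx = []
    pen = start_x
    for _char, x, width in glyphs:
        dx.append(round(x - pen, 3))
        pen = x + width  # pen position after drawing this glyph
    return dx


def tspan_with_dx(glyphs: List[Glyph], start_x: float) -> str:
    """Build a single tspan whose dx list reproduces the target positions."""
    text = escape("".join(g[0] for g in glyphs))
    dx = " ".join(f"{v:g}" for v in dx_values(glyphs, start_x))
    return f"<tspan x='{start_x:g}' dx='{dx}'>{text}</tspan>"


# Invented numbers: "e" is kerned 1 unit left, "t" is pushed 0.5 right.
glyphs = [("T", 10.0, 6.0), ("e", 15.0, 4.0), ("x", 19.0, 5.0), ("t", 24.5, 3.0)]
print(tspan_with_dx(glyphs, start_x=10.0))
```

Most entries come out as 0 because most glyphs sit exactly at the natural pen position, which is why the list compresses well.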

@mgieseki
Owner

mgieseki commented Apr 4, 2016

This seems reasonable too. Unfortunately, it's also not that easy to implement because it would require a lot of changes to the current code base. Maybe it's easier to post-process the SVG files with an XSLT stylesheet in order to derive the desired format. I'll have a look.

@mgieseki mgieseki added the feature feature request label Apr 4, 2016
@mgieseki mgieseki self-assigned this Apr 4, 2016
@mgieseki
Owner

mgieseki commented Apr 6, 2016

I probably won't implement this feature as part of dvisvgm in the near future. However, it doesn't seem to be too complicated to create the desired output with an XSLT script. Here is a quick first attempt. It takes the output of dvisvgm --no-merge --no-styles ..., collects all adjacent characters with the same y coordinate, and puts them in a single tspan element. Additionally, spaces are inserted if a given distance is exceeded.
This is just a first draft. It doesn't handle colors and transformations properly yet. Feel free to adapt it to your needs.
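For readers who prefer Python over XSLT, the grouping step described above might look roughly like the following. This is not the linked script. It assumes a simplified input in which every <text> element holds one character with explicit x and y attributes and no XML namespace; real dvisvgm output carries a namespace and more attributes. Since glyph widths are not available in the SVG, the gap is measured between successive x positions.

```python
import xml.etree.ElementTree as ET
from typing import List


def merge_by_line(svg_fragment: str, max_gap: float) -> List[str]:
    """Return one string per run of adjacent <text> elements that share the
    same y value, inserting a space when x jumps by more than `max_gap`."""
    root = ET.fromstring(svg_fragment)
    lines: List[str] = []
    current: List[str] = []
    prev_x = prev_y = None
    for el in root.iter("text"):
        x, y = float(el.get("x")), el.get("y")
        if y != prev_y and current:
            # y changed: close the current line and start a new one
            lines.append("".join(current))
            current = []
        elif current and x - prev_x > max_gap:
            current.append(" ")
        current.append(el.text or "")
        prev_x, prev_y = x, y
    if current:
        lines.append("".join(current))
    return lines


# Simplified, invented input (assumption: one character per <text>).
sample = """<g>
<text x='0' y='10'>H</text><text x='7' y='10'>i</text>
<text x='15' y='10'>y</text><text x='20' y='10'>o</text>
<text x='0' y='22'>O</text><text x='6' y='22'>k</text>
</g>"""
print(merge_by_line(sample, max_gap=7.5))
```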

@zmanji

zmanji commented Aug 23, 2021

I took a look at this issue, and I believe the first problem is impossible to solve within dvisvgm. I looked at the DVI file and saw no set: commands that insert a space character. Based on the discussions on TeX.SE, this seems to be a known issue in PDF output as well. The only solution I have seen is the tagpdf package, which allows 'real spaces' to be produced in the PDF with interwordspace. There is Lua code to do that, but I couldn't get it to work with DVI output.

For the second problem, @mgieseki, do you think it is possible to add that feature? One issue is that it's possible for a single word to be broken up into multiple tspan tags with differing x coordinates, meaning that in Firefox and in Chrome it's not possible to search for that word anymore. If the word were contained in a single tspan with the use of dx, it should be possible to search for words in the SVG, although not for multiple words because of the problem above.

@mgieseki
Owner

It's indeed not easy to detect words and word boundaries from plain DVI data, since spaces are realized by explicit movements of the virtual cursor, which determines the position of the next character (or any other visual object) to be placed. Horizontal movements also occur in case of kerning, stretched letter spacing, inside math formulae, etc. There are some ways to guess whether a horizontal movement denotes a space or something else, e.g. based on the space-related TFM data of a font, but they are not completely reliable.
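A guess based on the space-related TFM data could be sketched as follows. TFM files store the interword space together with its stretch and shrink components; the classification rule, the labels, and the names `SpaceParams` and `classify_movement` are assumptions made for this illustration, and the numbers are cmr10-like values in pt. As noted above, such a guess is not completely reliable.

```python
from dataclasses import dataclass


@dataclass
class SpaceParams:
    """Interword space parameters of a font, as found in its TFM data."""
    space: float    # natural interword space
    stretch: float  # amount the space may grow
    shrink: float   # amount the space may shrink


def classify_movement(dx: float, p: SpaceParams) -> str:
    """Guess what a horizontal cursor movement of `dx` represents."""
    if dx < p.space - p.shrink:
        return "not a space"   # kern, letter spacing, thin math space, ...
    if dx <= p.space + p.stretch:
        return "likely space"  # within the normal range of the glue
    return "wide gap"          # heavily stretched space or something else


# cmr10-like values in pt, for illustration only.
cmr10 = SpaceParams(space=3.333, stretch=1.667, shrink=1.111)
for dx in (-0.28, 1.667, 3.0, 9.0):
    print(dx, classify_movement(dx, cmr10))
```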

> One issue is that it's possible for a single word to be broken up into multiple tspan tags with differing x coordinates, meaning that in Firefox and in Chrome it's not possible to search for that word anymore.

The search issue is not caused by spreading the characters over several tspan elements but by the newlines appended to the closing </tspan> tags, which are interpreted as word boundaries by most SVG renderers. I've fixed this in dvisvgm 2.12. At least in Firefox and Chrome, the search for letter sequences also works across tspan elements now.
However, hyphenated words and ligatures, like ff, ſt, or ffi, are still a problem if the search function doesn't recognize and treat them accordingly.
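The effect of the newlines after </tspan> can be seen by looking at the character data of a <text> element. The Python sketch below is only an illustration on an invented fragment, not the change made in dvisvgm 2.12; `strip_tspan_newlines` is a hypothetical helper for post-processing files produced by older versions.

```python
import xml.etree.ElementTree as ET


def strip_tspan_newlines(text_el: ET.Element) -> None:
    """Remove whitespace-only character data that follows each </tspan>."""
    for tspan in text_el.iter("tspan"):
        if tspan.tail is not None and tspan.tail.strip() == "":
            tspan.tail = None


# Invented fragment in the style of the output quoted at the top of the issue.
src = ("<text x='0' y='10'>Hallo<tspan x='26'>W</tspan>\n"
       "<tspan x='35'>elt!</tspan>\n</text>")
el = ET.fromstring(src)

before = "".join(el.itertext())  # the newline splits "Welt!" into two pieces
strip_tspan_newlines(el)
after = "".join(el.itertext())   # "Welt!" is contiguous again

print(repr(before))
print(repr(after))
```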
