feature request: better <text> elements #56

tantau · 2016-04-04T11:57:42Z

The <text> elements generated by dvisvgm are "suboptimal". Consider the following input:

\begin{document}
Hallo Welt! Dies ist ein längerer Text.
\end{document}

The typical output is:

<text class='f0' x='67.746' y='63.7609'>Hallo<tspan x='93.9505'>W</tspan>
<tspan x='103.321'>elt!</tspan>
<tspan x='121.535'>Dies</tspan>
<tspan x='143.521'>ist</tspan>
<tspan x='157.374'>ein</tspan>
<tspan x='173.377'>l㑿</tspan>
<tspan x='176.116'>angerer</tspan>
<tspan x='211.474'>T</tspan>
<tspan x='217.812'>ext.</tspan>

There are two problems with this:

There are ``no spaces'' in the text. As a consequence, when text is selected, copied, and pasted, there are no spaces in the resulting output: HalloWelt!Diesisteinl㑿angererText.

It would be ``more than helpful'' if spaces were inserted into the output, for instance following a heuristic that if horizontal advance between letters is above a certain threshold, a space is added.

The output is simply ``very long'' because all of the tspan elements need a lot of space and a tspan is typically inserted every three to four letters.

This is a real problem: The pgfmanual needs about 10 MB as a PDF and about 600 MB (!) as a sequence of SVGs. Admittedly, a lot of this is due to the embedded fonts (addressed in a different feature request), but we are talking about at least 100 MB caused just by tspan elements...

I propose the following change (knowing that it is not trivial, but it should be doable):

For each line, use a single tspan (when there is a font change, use a sub-tspan for it; when there is a special or a rect, stop the current tspan and restart afterwards) and use the dx attribute to set the spacing and kerning for each letter:

<text class='f0' x='67.746' y='63.7609'>
<tspan dx='0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 ... -1.5'>
Hallo Welt! Dies ist ein l㑿angerer Text.</tspan>
<tspan x='67.746' dy='12' dx='0 0 0 0 0 -1 ...'>
Eine zweite Zeile.</tspan>
</text>

The semantics of dx is that you specify an offset for each letter. Naturally, we will have lots of 0s separated by spaces, but it is still more compact than a tspan for every three to four letters (and more easily compressible). Also, the text stays uninterrupted in the XML, which is useful for searching and processing purposes.
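To make the dx semantics concrete, here is a minimal Python sketch of how the per-letter offsets could be derived. It assumes that each glyph's absolute x position and natural advance width are known; the tuple layout and the names `dx_offsets` and `tspan` are hypothetical and not part of dvisvgm.

```python
from xml.sax.saxutils import escape

# Hypothetical glyph record: (character, absolute x position, natural advance width)

def dx_offsets(glyphs, precision=3):
    """Return one dx value per glyph.

    Each value is the distance between where the glyph is actually placed
    and where the previous glyph's natural advance would have put it.
    """
    offsets = []
    cursor = None  # x position reached after the previous glyph
    for _, x, advance in glyphs:
        if cursor is None:
            offsets.append(0.0)  # first glyph is positioned by the x attribute
        else:
            offsets.append(round(x - cursor, precision) + 0.0)  # "+ 0.0" avoids -0.0
        cursor = x + advance
    return offsets

def tspan(glyphs):
    """Build a single tspan element for one line of glyphs."""
    text = "".join(ch for ch, _, _ in glyphs)
    dx = " ".join(f"{value:g}" for value in dx_offsets(glyphs))
    return f"<tspan x='{glyphs[0][1]:g}' dx='{dx}'>{escape(text)}</tspan>"

# Made-up example: "e" is kerned 1 unit towards "T".
example = [("T", 10.0, 7.0), ("e", 16.0, 4.5), ("x", 20.5, 5.0)]
```

With these made-up metrics, `tspan(example)` yields a tspan whose dx list is `0 -1 0`. The result only renders as intended if the viewer uses the same advance widths as assumed here.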


mgieseki · 2016-04-04T12:35:24Z

This seems reasonable too. Unfortunately, it's also not that easy to implement because it would require a lot of changes to the current code base. Maybe it's easier to post-process the SVG files with an XSLT stylesheet in order to derive the desired format. I'll have a look.

mgieseki · 2016-04-06T08:09:46Z

I probably won't implement this feature as part of dvisvgm in the near future. However, it doesn't seem to be too complicated to create the desired output with an XSLT script. Here is a quick first attempt. It takes the output of dvisvgm --no-merge --no-styles ..., collects all adjacent characters with the same y coordinate, and puts them in a single tspan element. Additionally, spaces are inserted if a given distance is exceeded.
This is just a first draft. It doesn't handle colors and transformations properly yet. Feel free to adapt it according to your needs.
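
The approach described above can be sketched in Python as follows. This is not the XSLT script itself, only an illustration of the idea under stated assumptions: glyphs arrive in document order as hypothetical `(char, x, y, advance)` tuples, and `space_gap` is an arbitrary threshold chosen for the example.

```python
def merge_lines(glyphs, space_gap=2.0):
    """Collect adjacent glyphs with the same y coordinate into one string.

    glyphs: iterable of (char, x, y, advance) tuples in document order.
    A space is inserted whenever the gap between the end of the previous
    glyph and the start of the next one exceeds space_gap.
    Returns a list of (y, text) pairs, one per run of equal y values.
    """
    lines = []  # each entry: [y, text so far, x position after last glyph]
    for char, x, y, advance in glyphs:
        if lines and lines[-1][0] == y:
            line = lines[-1]
            if x - line[2] > space_gap:
                line[1] += " "
            line[1] += char
            line[2] = x + advance
        else:
            lines.append([y, char, x + advance])
    return [(y, text) for y, text, _ in lines]

# Made-up positions: a gap of 3 units between "i" and "W", then a new line.
sample = [
    ("H", 0, 10, 5), ("i", 5, 10, 2),
    ("W", 10, 10, 6), ("e", 16, 10, 4),
    ("Z", 0, 22, 5),
]
```

A fixed threshold like this will misfire on kerning, letterspacing, and math, so it should be read as a starting point rather than a reliable word detector.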

zmanji · 2021-08-23T22:11:45Z

I took a look at this issue, and I believe the first issue is impossible to solve with dvisvgm. I looked at the output DVI file and saw no set: commands that inserted a space character. Based on the discussions on TeX.SE, it seems that this is a known issue in PDF output as well. The only possible solution I have seen is the tagpdf package, which allows 'real spaces' to be produced in the PDF with interwordspace. There exists Lua code to do that, but I couldn't get it to work with DVI output.

For the second issue, @mgieseki do you think it is possible to add that feature? One issue is that it's possible for a single word to be broken up into multiple tspan tags with differing x coordinates, meaning that in Firefox and in Chrome it's not possible to search for that word anymore. If the word was contained in a single tspan with the use of dx, it should be possible to search for words in the SVG, although not for multiple words because of the problem above.

mgieseki · 2021-08-24T14:26:33Z

It's indeed not easy to detect words and word boundaries from plain DVI data since spaces are realized by explicit movements of the virtual cursor which determines the position of the next character (or any other visual object) to be placed. Horizontal movements also occur in case of kerning, stretched letter spacing, inside math formulae, etc. There are some ways to guess whether a horizontal movement denotes a space or something else, e.g. based on the space-related TFM data of a font, but it's not completely reliable.
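
As a rough illustration of the TFM-based guess, the following Python sketch compares a horizontal gap with the font's interword space parameters. The class and function names are hypothetical, and the example values are approximately those of cmr10 at 10pt (space, stretch, and shrink are TeX's \fontdimen2, \fontdimen3, and \fontdimen4).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpaceParams:
    """Space-related font parameters, all in the same unit as the gaps."""
    space: float    # \fontdimen2: normal interword space
    stretch: float  # \fontdimen3: interword stretch
    shrink: float   # \fontdimen4: interword shrink

def looks_like_space(gap, font, tolerance=0.0):
    """Guess whether a horizontal movement is an interword space.

    The gap counts as a space if it is at least as wide as the interword
    space after full shrinking, minus an optional tolerance.
    """
    return gap >= font.space - font.shrink - tolerance

# Assumed example values, roughly cmr10 at 10pt.
CMR10 = SpaceParams(space=3.33333, stretch=1.66666, shrink=1.11111)
```

Here `looks_like_space(3.0, CMR10)` is true while a small kern such as `0.28` or `-0.83` is rejected. Math spacing and stretched letterspacing can still fall on either side of the threshold, which is why the guess is not completely reliable.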

> One issue is that it's possible for a single word to be broken up into multiple tspan tags with differing x coordinates, meaning that in Firefox and in Chrome it's not possible to search for that word anymore.

The search issue is not caused by spreading the characters over several tspan elements but by the newlines appended to the closing </tspan> tags, which are interpreted as word boundaries by most SVG renderers. I've fixed this in dvisvgm 2.12. At least in Firefox and Chrome, the search for letter sequences also works across tspan elements now.
However, hyphenated words and ligatures, like ﬀ, ﬅ, or ﬃ, are still a problem if the search function doesn't recognize and treat them accordingly.
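
Both effects can be reproduced outside a browser. The Python sketch below uses only the standard library: it shows that a newline after </tspan> ends up inside the text content of the element, and that Unicode compatibility normalization (NFKC) is one way a search tool could fold ligatures back into plain letters. The function names are hypothetical; nothing here is done by dvisvgm itself, and how a given renderer treats the newline depends on its whitespace handling.

```python
import unicodedata
import xml.etree.ElementTree as ET

def text_content(fragment):
    """Concatenate all character data of an XML fragment in document order."""
    return "".join(ET.fromstring(fragment).itertext())

def fold_ligatures(text):
    """Replace ligature code points such as U+FB00 by their component letters."""
    return unicodedata.normalize("NFKC", text)

# The same word, with and without newlines after the closing tspan tags.
with_newlines = (
    "<text>Hallo<tspan x='93.9505'>W</tspan>\n"
    "<tspan x='103.321'>elt!</tspan>\n</text>"
)
without_newlines = (
    "<text>Hallo<tspan x='93.9505'>W</tspan>"
    "<tspan x='103.321'>elt!</tspan></text>"
)
```

`text_content(with_newlines)` contains a newline between "W" and "elt!", whereas `text_content(without_newlines)` is the uninterrupted string "HalloWelt!". Likewise, `fold_ligatures("e\ufb03zient")` returns "effizient".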

mgieseki added the feature feature request label Apr 4, 2016

mgieseki self-assigned this Apr 4, 2016

hchauvet mentioned this issue Apr 18, 2020

Embed fonts (using dvisvgm -v > 2.0) hchauvet/beampy#15
