
64 image minimum is too small #23

Open
ghost opened this issue Aug 16, 2021 · 13 comments

Comments

@ghost

ghost commented Aug 16, 2021

The image facility is cell-based, and can be used for far more than just "images". It can be a fallback for fonts (which could be handy if alt-text or similar is available), font size (e.g. VT100 double-width/double-height), images of course, custom emojis, and much more. Example - this (multiplexed multihead) screen shows at least 20 images comprising the main picture (each text row is a separate image), plus 94 images, one for each CJK glyph.

A "64 images" max might be a "64 CJK glyphs" or "64 emojis". (I know the spec calls these minimums, but for the purpose of this discussion we should assume them to be a maximum.)

I think the minimum system requirement should be:

  • 80x24 = 1920 distinct images on screen (the minimum VT100 screen size)
  • Up to 1920 * {cell width in pixels} * {cell height in pixels} total RGBA8888 pixels in storage: enough to fully fill an 80x24 region of the screen at 32-bit depth.
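The proposed storage floor is easy to sketch numerically. As a rough illustration (the 10x20 pixel cell size below is a made-up example, not something from the spec):

```rust
// Storage needed to fill a cols x rows cell grid with RGBA8888 pixels.
fn min_storage_bytes(cols: u64, rows: u64, cell_w: u64, cell_h: u64) -> u64 {
    cols * rows * cell_w * cell_h * 4 // 4 bytes per RGBA8888 pixel
}

fn main() {
    // 80x24 = 1920 cells; at a hypothetical 10x20 pixel cell that is
    // 1920 * 10 * 20 * 4 = 1,536,000 bytes (~1.5 MiB) of pixel storage.
    println!("{} bytes", min_storage_bytes(80, 24, 10, 20));
}
```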

This also supports the design philosophy that text cell operations work on images in the same way, on a per-cell basis, rather than other protocols' approach of treating images and text as fully distinct entities.

@wez

wez commented Aug 17, 2021

FWIW, the implementation of Sixel, iTerm2 and Kitty Image protocols in wezterm map the incoming image into texture coordinates on cells in the display; those cells reference the same "atomic" image data chunk, but slice into it.

Allowing overlapping placements with differing z-index values requires tracking multiple textures per cell, but even without that, the bare minimum that I think is generally useful would be rows * cols images, and probably more than that (eg: a couple of pages of scrollback for folks that are visualizing complex output).

FWIW, as an implementor, I honestly didn't think about this feature in terms of min or max number of images, cells or pixels that I wanted to support: my take was that the TE should try to display what was asked of it, and if the system runs out of resources then expose that issue to the user and/or via a response to the application so that it can react.

Implementing the kitty protocol was a bit frustrating wrt. the previous paragraph, because it separates transmission from placement. That means there is potential for "unreferenced" data that needs to be garbage collected, and this is the only place where I've put an explicit resource management constraint: notcurses-demo seems to rely on the TE garbage collecting images rather than aggressively deleting them. The constraint is based on the total amount of RAM used by the images rather than the size or quantity of images.

@ghost
Author

ghost commented Aug 17, 2021

> FWIW, the implementation of Sixel, iTerm2 and Kitty Image protocols in wezterm map the incoming image into texture coordinates on cells in the display; those cells reference the same "atomic" image data chunk, but slice into it.

So far as I know you are the only implementation of all three protocols in one terminal. I am curious if any of the following ideas/questions hold true compared to your experience?

  • Going from zero image support to one (iTerm2 in your case) would be decently hard. Getting the second (sixel) would be less hard. Getting the third would be tedious (due to the complexity of the whole spec) but not really stretching the brain.
  • Fundamentally, the same internal data structures that work for one bitmap-type wire protocol can work for any of them. This is the real question for me. I thought mintty used a different conceptual model for sixel vs iTerm2. Very curious whether kitty's model of "images are very distinct from text" made things harder or easier compared to sixel or iTerm2.
  • Does testing one protocol benefit all protocols?
  • If you had to do it over again, would you have gone in a different order?

@wez

wez commented Aug 17, 2021

From zero -> iTerm2 wasn't all that difficult, and I like its relative simplicity compared to the other protocols. I think the biggest potential stumbling block was allowing for arbitrary sized OSC buffers in the parser, which I know some TE maintainers don't like, and I suppose the second is likely decoding the image containers, which is hard if you can't use a pre-existing library of some kind, but easy otherwise; Rust's image crate made this trivial in wezterm. Adding support for animated gifs (and pngs) added a little bit of complexity, but it was all on the render side rather than the data model side. I'm glad that I did that as it also made it easier to reason about the kitty animation features later.

Using that as the foundation for the model made the others reasonably easy: reference the incoming image data and then you can "simply" track (Arc<ImageData>, TextureCoordinate) per cell if you're not hyper-fixated on per-cell memory usage. (WezTerm deals with this by having a pointer to optional additional cell data and placing the image info there, so that the common case is still relatively compact)
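The per-cell slicing described above might look roughly like this. This is a sketch, not wezterm's actual code; the type and field names are illustrative stand-ins:

```rust
use std::sync::Arc;

// Illustrative stand-in for wezterm's shared image data.
struct ImageData {
    _rgba: Vec<u8>, // decoded pixel data, shared by every cell that slices it
}

/// Normalized texture coordinates (0.0..=1.0) into the shared image.
#[derive(Debug, Clone, PartialEq)]
struct TextureCoordinate {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Slice one shared image across a cols x rows grid of cells. Each cell holds
/// a cheap Arc clone (a refcount bump) plus its sub-rectangle; no pixels copy.
fn slice_into_cells(
    img: &Arc<ImageData>,
    cols: u32,
    rows: u32,
) -> Vec<(Arc<ImageData>, TextureCoordinate)> {
    let mut cells = Vec::with_capacity((cols * rows) as usize);
    for row in 0..rows {
        for col in 0..cols {
            cells.push((
                Arc::clone(img),
                TextureCoordinate {
                    x0: col as f32 / cols as f32,
                    y0: row as f32 / rows as f32,
                    x1: (col + 1) as f32 / cols as f32,
                    y1: (row + 1) as f32 / rows as f32,
                },
            ));
        }
    }
    cells
}

fn main() {
    let img = Arc::new(ImageData { _rgba: vec![0; 4] });
    let cells = slice_into_cells(&img, 2, 2);
    println!("{} cells, first = {:?}", cells.len(), cells[0].1);
}
```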

If I'd started with Sixel, I think I would have built things differently, but not in a good way. Starting with the above made it easier to look at Sixel as two stages: 1) parse the sixel data into a bitmap, 2) feed the bitmap into the same slicing logic used for the iTerm2 protocol. If I'd started with Sixel, I might have been inclined to do something like per-cell bitmaps and I think that would probably have been a bit horrible.

When it came to implementing kitty, I opted to also map it to the same data model I used for the others, which meant that the attached image data now turns into an array of (Arc<ImageData>, TextureCoordinate, ZIndex) so that the layering can be respected. This made it easier to reconcile the viewport/scrollback position with the image location. I noted that the version of Kitty shipped with Fedora 33 was a bit quirky wrt. re-synchronizing the image position when resizing the window, and my semi-educated opinion is that this was likely because of the separation of text from images.

Implementing kitty was tedious because the surface area of the protocol is so high: I had to augment my parser to support APC sequences, allow for chunking data across multiple sequences (which meant introducing a buffer in a slightly awkward place, and the logic for re-assembling that), and then implement the (de-)?serialization of the large number of parameters. There are about 2000 lines of code for that, and probably <200 lines for the relevant part of the iTerm implementation.

The image-distinct-from-text-ness of the protocol largely disappears in the wezterm implementation; the difference is dealt with largely at placement time but does mean that some operations that are conceptually O(1) in the protocol (eg: various delete operations) turn into O(image_area_in_cells) and that the terminal needs to maintain a side index of placement id to cell location to achieve that. The other area where the difference is a bit awkward is that over-writing cells explicitly preserves kitty image attachments, but not iTerm or Sixel images, except for operations that are intended to clear the screen, so there's some additional logic to handle that when applying new text to the terminal model.
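The side index being described could be sketched like this (a hypothetical illustration, not wezterm's implementation):

```rust
use std::collections::HashMap;

/// Maps a placement id to every (row, col) cell it covers, so that a
/// protocol-level "delete placement" (O(1) on the wire) can be resolved
/// to the O(image_area_in_cells) set of cells whose attachment must clear.
struct PlacementIndex {
    cells_by_placement: HashMap<u32, Vec<(usize, usize)>>,
}

impl PlacementIndex {
    fn new() -> Self {
        Self { cells_by_placement: HashMap::new() }
    }

    /// Record a placement covering a rows x cols rectangle of cells.
    fn place(&mut self, id: u32, top: usize, left: usize, rows: usize, cols: usize) {
        let mut cells = Vec::with_capacity(rows * cols);
        for r in top..top + rows {
            for c in left..left + cols {
                cells.push((r, c));
            }
        }
        self.cells_by_placement.insert(id, cells);
    }

    /// Remove a placement, returning the cells that need their image cleared.
    fn delete(&mut self, id: u32) -> Vec<(usize, usize)> {
        self.cells_by_placement.remove(&id).unwrap_or_default()
    }
}

fn main() {
    let mut index = PlacementIndex::new();
    index.place(7, 0, 0, 2, 3); // a 2-row, 3-column placement
    println!("cleared {} cells", index.delete(7).len());
}
```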

I would agree that the same data structures work for all of the protocols, if you pick the right ones!

If I were to do this again, I would do it in the same order because the simpler protocol as a starting point led to a simpler internal design than I think I would have built if I'd started with a more complex protocol.

In terms of testing, there is a little bit of core code that is common, but I think the scary bits that would benefit most from testing are around protocol decoding. Sixel has more stuff going on than iTerm2, and some weird stuff too, like the hue in its HSL scheme being rotated away from the common standard hue angle. The kitty protocol has so many parameters with single-character names that have different meanings between different commands (I'm not hating on the protocol: it looks like an honest case of the design evolving that way rather than a deliberate choice) that I'm 100% sure there are at least a handful of issues yet to be discovered in my implementation, beyond obvious dumb things like me just not having implemented some of the various deletion subcommands yet.

I think the biggest issue wrt. testing is that there isn't a great way for an external test suite to run and measure conformance. A TE author can write tests that look at internal state, but, for example, I can't take Kitty's image protocol tests and run them against wezterm. It would be interesting if there were a way to run a TE with a fixed font and a way to capture a bitmap of the display and compare it against known bit-patterns. Additionally/alternatively: defining a test protocol for TEs that exports the display information in a defined format that could then be used to perform assertions. esctest sort of does the latter by taking advantage of a screen region checksumming feature to validate xterm and iTerm2, but it doesn't appear to be actively maintained (general lack of activity, and I have an MR for wezterm that hasn't had a response so far), and the image stuff wouldn't be reflected in that checksum in any case.

Thinking about how all the above might shape GIP: my general inclination is that fewer protocol commands/parameters overall are "better" from a combinatorial explosion perspective, and that consistency and clarity in naming would be nice. For example, there are a lot of single character parameters in kitty's protocol that are easy to confuse or misinterpret; slightly longer names, say 2-3 chars for some of them, would increase clarity a lot without dramatically inflating the bandwidth. It's probably worth making a pass over the GIP spec with that in mind to future-proof it for later versions of GIP.

@dankamongmen

i would love for there to be emphasis on supporting a greater number of smaller bitmaps. the ideal is what i have been calling "mosaics", where bitmaps are treated entirely as cell-sized entities, which i believe y'all are in agreement with. last i checked, Jexer uses wide graphics of one cell height, right @klamonte ? that has its advantages, but going all the way to the sweet land of mosaics would essentially eliminate my most complicated state machines -- there would no longer be a need to "wipe" and "restore" cell-sized regions within a larger graphic.

with that said, since kitty 0.20.0's addition of reflective animation, this is not really a huge issue. a wipe involves transmitting a constant cell's worth of 0-alpha RGBA, constant across all cells. a restore involves a single directive and no data transmission. there is one place where it would still help, though: kitty lets you position graphics on a z-axis, a tristate with regards to glyphs. that z position is graphic-wide, though, so you cannot both (a) print a glyph atop a graphic and (b) print a glyph below a partially transparent cell of that same graphic. mosaics would resolve this last pain in my life neatly.
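The "wipe" described above amounts to transmitting one cell's worth of fully transparent pixels. A minimal sketch (the 10x20 cell size is a placeholder, not anything from kitty):

```rust
/// One cell's worth of fully transparent RGBA8888: every byte zero, so
/// R = G = B = A = 0 for every pixel. Because the data is constant across
/// all wiped cells, a single buffer can be reused for each of them.
fn transparent_cell(cell_w: usize, cell_h: usize) -> Vec<u8> {
    vec![0u8; cell_w * cell_h * 4]
}

fn main() {
    // A hypothetical 10x20 pixel cell needs 800 bytes of zeroed RGBA.
    println!("{} bytes per wiped cell", transparent_cell(10, 20).len());
}
```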

@dankamongmen

> FWIW, as an implementor, I honestly didn't think about this feature in terms of min or max number of images, cells or pixels that I wanted to support: my take was that the TE should try to display what was asked of it, and if the system runs out of resources then expose that issue to the user and/or via a response to the application so that it can react.

@wez out here preaching the Good Word <3 <3 <3

@ghost
Author

ghost commented Aug 17, 2021

@dankamongmen

> last i checked, Jexer uses wide graphics of one cell height, right @klamonte

Jexer permits every text cell to have its own image. On output, it concatenates adjacent images on the same row into a single image and encodes that to whichever protocol that particular user-facing screen is using (sixel, iTerm2, or jexer). All images are only one cell high. (It also caches previously-generated output for performance.)
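The concatenation step can be sketched as grouping image-bearing cells on a row into contiguous strips, each of which would then be encoded once. (Jexer is written in Java; this is just an illustrative sketch, not Jexer's code.)

```rust
/// Given one row of cells (true = cell carries image data), return the
/// contiguous runs as (start_col, length) pairs. Each run would then be
/// concatenated into a single one-cell-high image and encoded once to
/// the active protocol (sixel, iTerm2, or jexer).
fn image_strips(row: &[bool]) -> Vec<(usize, usize)> {
    let mut strips = Vec::new();
    let mut col = 0;
    while col < row.len() {
        if row[col] {
            let start = col;
            while col < row.len() && row[col] {
                col += 1;
            }
            strips.push((start, col - start));
        } else {
            col += 1;
        }
    }
    strips
}

fn main() {
    // Two image cells, a text cell, then one more image cell:
    println!("{:?}", image_strips(&[true, true, false, true]));
}
```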

Since Jexer is both multiplexer and windowing system, it may show image pieces from different terminals or application windows, and any image fragment could be obscured by an overlapping window.

> since kitty 0.20.0's

I've seen a few references to different versions of the kitty protocol, but only one document online. Are these versions fixed and available, and can an application determine which "version" the TE actually complies to?

@dankamongmen

> Since Jexer is both multiplexer and windowing system, it may show image pieces from different terminals or application windows, and any image fragment could be obscured by an overlapping window.

grokked

> I've seen a few references to different versions of the kitty protocol, but only one document online. Are these versions fixed and available, and can an application determine which "version" the TE actually complies to?

the kitty document specifies when various features were added, but only with respect to kitty versions, not terminal-independent versions of the protocol itself. this is going to cause problems moving forward as more terminals pick it up. the next new feature i use will have to be matched against a kitty version for kitty and a wezterm version for wezterm. i'd love to see the protocol versioned instead.

@ghost
Author

ghost commented Aug 17, 2021

> i'd love to see the protocol versioned instead.

It would either have to be versioned as a whole protocol (examples of that: HTTP, VT100 (DA2)), or be able to negotiate specific features of that protocol with graceful fallback (example of that: Kermit).

The thing with these bitmap image protocols (iTerm2, GIP) is that none of them except SIXEL are defined well enough that one could feasibly burn the encoder/decoder in hardware and stick it in a television that will last 15 years, or in an industrial machine that will last 40 years. That's the level of a standard I would like to see someday.

@dankamongmen

> The thing with these bitmap image protocols (iTerm2, GIP) is that none of them except SIXEL are defined well enough that one could feasibly burn the encoder/decoder in hardware and stick it in a television that will last 15 years, or in an industrial machine that will last 40 years. That's the level of a standard I would like to see someday.

eh, i've still never seen a true spec on Sixel. what are the failure modes when too little data is sent for a specified size? the meaning of P2=1 is still unclear. if the bottom rows are entirely transparent, do they still result in scrolling? etc. but yes, the larger protocols certainly introduce more questions.

@ghost
Author

ghost commented Aug 17, 2021

> eh, i've still never seen a true spec on Sixel. what are the failure modes when too little data is sent for a specified size? the meaning of P2=1 is still unclear. if the bottom rows are entirely transparent, do they still result in scrolling? etc. but yes, the larger protocols certainly introduce more questions.

AFAIK there is no failure for "too little" data: the raster attribute just sets the initial background square, it's still fine to draw less than the raster or more than the raster (the image square gets bigger). In practice xterm will fully discard (not even crop) images that exceed 1000 pixels in either direction, so we are stuck with that if we want to be interoperable. STD 070 might have more details. jerch probably knows those answers too. :)

@ghost
Author

ghost commented Dec 23, 2021

@wez @dankamongmen

Inspired by notcurses, I have been coding again and playing with transparency (missing pixels) in a multiplexed environment. The image-as-cell model does in fact work OK for images-over-text; I'm less sure about images-under-text in a multiplexer, though. But who knows, maybe I will be happily surprised yet again. ;-)

I have written up a few more notes and screenshots over here. (Plus a general thank you, including y'all here too. :-) )

It seems hard to obtain sixel images with missing pixels. I couldn't figure out how to get img2sixel or ImageMagick to do it. So I made a couple of small ones and put them over here. If anyone has more such images, or better yet knows how to generate them, I would love to include them.

Anyway, happy holidays! :-)

@christianparpart
Member

Hey guys, hey @klamonte!

Sorry for my sparse presence this year; that's due to some real-life obligations. It seems I'm almost free to resume my work on this protocol (yay).
notcurses also changed my thinking on images a lot, especially with regards to transparent pixels (which I've implemented in my own Sixel implementation, too) and image layers.
I'm glad you found some time to hack on this again.
I hope we can eventually, in 2022, actually make progress on this end here, too.

Happy holidays ;)

@ghost
Author

ghost commented Feb 4, 2022

@wez @dankamongmen @christianparpart

I have added kitty support to Jexer's output. Some notes as they relate to GIP:

Erasure

Erasing a single cell at (x, y) on an image erases the entire image.

This makes the multiplexer case -- especially floating windows -- quite a bit harder, even when images are only one text row high. (You drag a text dialog over part of an image, erase that cell, and other areas are erased.) I resorted to erasing and redrawing all images on a row that:

  1. Do not have image data anymore (because text cannot destroy images).
  2. Or DO have image data but something is different.

I suspect my second option is buggy: I'm probably replacing every image on every frame at the moment, but I will dig into it more later.

xterm's equivalent bugs with sixel are what led me to the horizontal-strips design in the first place, and it works on alacritty+ayosec sixel; but alacritty+ayosec is not working well with notcurses due to its unusual erasure behavior.

The point is that knowledge of how this frame damages images on previous frames is a bit hard to come by when you are composing overlapping windows.

I believe GIP is already unambiguous here.

Base64

Kitty does not recognize all valid base64, which can include line feeds. The default per RFC 4648 is "don't add line feeds unless told to", so Kitty is not in actual error if that is the base64 variant it expects; but since Kitty does not actually refer to any base64 RFC, it isn't clear until you try it and see in Kitty's log why it isn't displaying.
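An application that wants to interoperate can normalize its base64 before transmission. Stripping whitespace turns MIME-style wrapped base64 into the canonical unwrapped form a strict decoder expects (a sketch; Kitty's exact parsing is not specified):

```rust
/// Strip ASCII whitespace (including the CR/LF line breaks that MIME-style
/// base64 inserts every 76 characters) to produce the unwrapped form that
/// a strict RFC 4648 decoder will accept.
fn normalize_base64(wrapped: &str) -> String {
    wrapped.chars().filter(|c| !c.is_ascii_whitespace()).collect()
}

fn main() {
    println!("{}", normalize_base64("SGVsbG8s\r\nIHdvcmxkIQ=="));
}
```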

Other image protocols (e.g. iTerm2) do handle line feeds in base64 cleanly. But base64 encoding within OSC (as used by iTerm2) did have one terminal erroneously handling the line feeds as C0 controls. It was fixed a while ago. As per the VT320 manual:

> Operating system command | OSC | Introduces an operating system command.*
> Privacy message | PM | Introduces a privacy message string.*
> Application program command | APC | Introduces an application program command.*
>
> * The VT320 ignores all following characters, until it receives an ST control character. ESC, CAN, and SUB no longer cancel device control strings.

GIP should be very clear which base64 it is relying on. RFC 4648, or others. (I may have mentioned that elsewhere in GIP, sorry if I did...)

Chunking:

The 4kb chunking is dumb. With the additional "the cursor can't move, nothing else can happen in between chunks" stuff (hmm, where did that come from? 🤔) it's just extra logic on the application's part to accomplish nothing. No CRC, no error recovery, no windowing, no actual protocol: just 10 extra lines in every application rather than 10 extra lines once in the terminal. (If someone had looked around more they might have read about the Kermit protocol's design changes around TCP, and why it's generally better just to let the TCP/IP stack handle that part. 🙄 Or just tried it out, seen what 75-300 kb/sec can do (and that's sixel!), and made some data-driven decisions.)
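The per-application burden being complained about is roughly this: every client carries its own chunk-splitting logic with a continuation flag on each chunk. A sketch of that logic (the framing here is illustrative, not byte-exact kitty escape sequences):

```rust
/// Split a payload into chunks of at most `chunk_size` bytes, pairing each
/// with a kitty-style m flag: true (m=1) while more chunks follow,
/// false (m=0) on the final chunk.
fn chunk_payload(data: &[u8], chunk_size: usize) -> Vec<(bool, Vec<u8>)> {
    let mut out = Vec::new();
    let mut chunks = data.chunks(chunk_size).peekable();
    while let Some(chunk) = chunks.next() {
        let more = chunks.peek().is_some();
        out.push((more, chunk.to_vec()));
    }
    out
}

fn main() {
    // 10 bytes at a 4-byte chunk size -> chunks of 4, 4, and 2 bytes.
    for (more, chunk) in chunk_payload(&[0u8; 10], 4) {
        println!("m={} len={}", more as u8, chunk.len());
    }
}
```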

Off-topic and annoyed... The 4kb chunking is probably why Kitty readily spews garbage on screen when you send it sixel. The whole point of APC + base64 (which is something I suggested to him when he started this, contrasting with Terminology, whose sequences can lead to artifacts on other terminals) was that other terminals should quietly ignore it. It's good that he understood that, as he references it directly in his spec, but not good that he fails to give other protocols the same consideration.

(Plus the little "xterm keyboard protocol is obsolete, go tell the application to change" message in the log is rather cheeky coming from a terminal whose vttest score is lower than Hyperterm's was 20 years ago. It is now a hard design goal to make DOOM playable with stock xterm. If that keyboard protocol makes it into xterm, then great; until then, rot in hell Kitty.)

Someone else can open issues there if they give a shit. I won't.

Bugs

wezterm and kitty are not producing the same output. I will file reports on wezterm in a little while.
