asciicast format version 2 #196
Conversation
CasTTY is an asciicast-compatible recorder and web player that also records and plays back audio. I'd like to keep it compatible with the asciicast format, but also have some of the needs of my tools supported. In some cases (audio, resizing), CasTTY already has a solution, so it would be great to maintain compatibility between the tools.

### Audio

(From an IRC discussion in #asciinema.) There is no interest in supporting PCM audio formats for recorders or players. All audio should be input and output in a format that formally describes the structure of that audio, so PCM audio and its properties shouldn't be part of this spec. However, we should add an attribute specifying a path to the audio data. This should be a URI, to allow for audio stored on non-HTTP endpoints (for example, on a local or network filesystem while recording).

For live streaming, we should consider the complexity of synchronizing events with audio. In particular, events and audio are going to be delivered over separate channels, and the audio must get priority. However, people tuning into the stream in the middle will need some method of synchronization to understand the position of their starting audio with respect to the start of the cast, so that:
The CasTTY player synchronizes events with respect to `audio.currentTime` (because audio gets real-time treatment, whereas `setTimeout` is a joke, especially when events are consistently ~100 ms apart).

### Compatibility & Streaming

v2 players will not be compatible with v1 recordings, given NDJSON, and vice versa. Well, at least v1 definitely can't support v2, but v2 needs a fair bit of back-compat glue around parsing and/or content type to support v1. The two reasons for using NDJSON are flushing each event to avoid issues with crashes, and better support for live streaming from a recording session.

### Handling partial output

The first argument for NDJSON is that JSON requires an entire object to be valid. CasTTY writes to its output stream with no buffer. From the rationale in the ticket mentioned, even outputting a partial recording would have been preferable, and that is possible simply by not buffering output (which is what CasTTY does). However, a partial output file in the v1 format is fundamentally broken, possibly in two ways. The most obvious is that you may have a partial event at the end. The second is that the duration of the video is still unknown.

### Live streaming

Live streaming is assumed to be complicated with a complete JSON object, because you never have a complete JSON object.

### Suggestion

This suggestion allows trivial compatibility of v2 players with v1 format files, and compatibility of v1 players with v2 recordings(!) -- v1 players would only lack support for live streaming and e.g. keyboard-input subtitles. Both cases of the partial output problem can be fixed trivially with a utility that follows this silly algorithm:
For live streaming, we can set up websockets for each event channel. In this case, the event format does not change: each event is still represented as a `[delta, data]` pair. Therefore, I would propose the new protocol look exactly like v1 for a recording, except for the introduction of a new `stdin` attribute:

```json
{
  "version": 2,
  "width": 80,
  "height": 24,
  "duration": 1.515658,
  "command": "/bin/zsh",
  "title": "",
  "env": {
    "TERM": "xterm-256color",
    "SHELL": "/bin/zsh"
  },
  "stdout": [
    [0.248848, "\u001b[1;31mHello \u001b[32mWorld!\u001b[0m\n"],
    [1.001376, "I am \rThis is on the next line."]
  ],
  "stdin": [
    [0.248848, "Hello World!\n"],
    [1.001376, "I am \rThis is on the next line."]
  ]
}
```

v1 players can then play back the v2 recording format, and v2 players can play back the v1 recording format. Live streams would be slightly different:

```json
{
  "version": 2,
  "width": 80,
  "height": 24,
  "command": "/bin/zsh",
  "title": "",
  "env": {
    "TERM": "xterm-256color",
    "SHELL": "/bin/zsh"
  },
  "stdout": "ws://uri.to/stdout",
  "stdin": "ws://uri.to/stdin"
}
```

The data over the websockets would be presented in the same `[delta, data]` form. A v1 player can be modified to detect the type of `stdout` and refuse to play it if it is a string rather than an array. Using websockets here has an additional advantage: multiplexing data from N different event streams (keyboard input, terminal output, audio) into M different outputs where M < N is going to be difficult. I suspect that asciinema's recorder forks at least once to do this.

### Environment

Can we further specify what is taken from the environment? The environment can be sensitive; the examples here only include `TERM` and `SHELL`.

### Input and Pausing

There were some concerns mentioned about input. Though the spec doesn't mention it yet, it was said that input events would be used to show keystrokes during playback as if they were subtitles. The use case is that keystrokes for utilities like GNU screen or tmux might not appear in output. However, other keystrokes that do not appear in output are cases where local echo is off, such as password entry.

### Resizing

CasTTY handles resizing by enforcing that the resize is never larger than the terminal window at the start of recording, or than the dimensions passed to it in its options. I would recommend not making resizing an event encoded in the stream, for a few reasons; in particular, it's hard to get correct. Window resize events are delivered via SIGWINCH. If you really wanted to add this, it would still be much easier as a separate event stream instead of being interleaved -- especially since SIGWINCH is going to be delivered to the process on the master end of the pty.
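The flush-per-event idea behind the NDJSON proposal can be sketched in a few lines. This is a minimal illustration, not the spec's reference implementation; the function name and file name are made up here, and the `[time, code, data]` event shape follows the "o"/"i" event types discussed later in this thread:

```python
import json

# Minimal sketch of crash-tolerant, line-delimited recording:
# a header object on the first line, then one complete JSON array
# per event, flushed immediately. A crash can lose at most the
# partial final line, never the whole file.
def write_events(path, header, events):
    with open(path, "w") as f:
        f.write(json.dumps(header) + "\n")
        f.flush()
        for time, kind, data in events:
            f.write(json.dumps([time, kind, data]) + "\n")
            f.flush()

write_events("demo.cast",
             {"version": 2, "width": 80, "height": 24},
             [(0.248848, "o", "Hello "), (1.001376, "o", "World!\r\n")])
```

Because every line is an independent JSON value, a truncated file is still parseable up to its last complete line, which is exactly the partial-output property argued for above.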
@dhobsd wow, that's a comment!

Re audio: I think we're on the same page. I agree that audio/screen sync is a challenge, and syncing screen updates to the audio clock sounds like a good solution for it (although see my suggestion/question at the end of this comment re using the audio clock). A meta-data attribute with the audio URI, like you describe, sounds good.

Re compatibility & streaming:

Re knowing the duration of the recording: for live streams you obviously don't know it and don't need to know it. For recorded sessions the player can just go through all the stream events and add them up right after loading the file into memory. One important reason for writing to disk in real time is to not buffer the whole recording in memory. I mentioned crashes, but preventing incremental memory usage growth (as the recording proceeds) is also desired. This would enable creating arbitrary-length recording sessions (only disk space is the limit).

Making the player support both v1 and v2 doesn't look like a big problem to me, and in my opinion this extra overhead is worth it given the problems v2 solves (more on why I don't think extending v1 would work below).

I don't like the idea that we may be producing broken JSON files. Some people record sessions automatically when their shell is started (they basically log everything they do in a terminal to some central directory of sessions), and when they reboot one way (soft) or another (hard reboot) they would be getting broken files. This would require extra "fixing" logic in all tools reading asciicasts.

About this example you gave:

```json
{
  "version": 2,
  "width": 80,
  "height": 24,
  "duration": 1.515658,
  "command": "/bin/zsh",
  "title": "",
  "env": {
    "TERM": "xterm-256color",
    "SHELL": "/bin/zsh"
  },
  "stdout": [
    [0.248848, "\u001b[1;31mHello \u001b[32mWorld!\u001b[0m\n"],
    [1.001376, "I am \rThis is on the next line."]
  ],
  "stdin": [
    [0.248848, "Hello World!\n"],
    [1.001376, "I am \rThis is on the next line."]
  ]
}
```

This prevents writing to a file in real time, because now you can't write both stdout and stdin. Of course you could write these two data streams to two separate tmp files during recording, and at the end read them and build a single JSON file. That however leaves you with tmp files in case of crash/reboot.

Re environment: I fully agree about being more specific here. We can consider removing some of it.

Re input and pausing: I love this idea ❤️, but I don't see how this relates to the file/stream format. It seems to be a recorder-only thing to me.

Re resizing: This whole resizing business is tricky, and I don't think there's one good way to solve it. If I understand correctly, when castty receives SIGWINCH it passes the resize down to the slave, clamped to the initial size.

Re total order of events: I believe having a total order of all kinds of events makes things simpler and easier to reason about. We're not dealing here with a high-frequency-trading system, or any other where microseconds make a difference, so multiplexing everything that happens onto a single stream doesn't have practical downsides to me.

I have a feeling many of your suggestions come out of the way you implemented castty. You talk about synchronizing clocks, threads, processes, SIGWINCH delivered to a random process/thread, and coordination problems related to all that. If you create a dedicated thread for writing all sorts of events (not including audio here) into a file/websocket, then this thread can have a single clock, and read events from a buffered, thread-safe queue. All other processes/threads (handling stdin, stdout, all of them trapping SIGWINCH) can just push events to this queue in fire-and-forget fashion. I guess having such a thread-safe buffered queue is fine if you only have threads, and gets tricky when you have multiple processes. But maybe you can just use a pipe here as the buffered queue (not sure how UNIX pipes deal with multiple concurrent writers though...). Then there's audio, but maybe there's no need to use the audio clock for timing non-audio events. What's wrong with dumping the raw audio stream to an audio file, just like that, and using a single clock for everything else?

Thanks for the comment, it definitely opened my eyes to many things!
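The single-writer design described above can be sketched with a thread-safe queue. This is an assumption-laden illustration, not asciinema's actual implementation: producer threads push raw events and never touch the clock, and only the writer thread timestamps and serializes, so there is exactly one clock and no interleaving problem:

```python
import json
import queue
import sys
import threading
import time

# Producers (stdin reader, stdout reader, SIGWINCH handler) push
# (kind, data) pairs here in fire-and-forget fashion.
events = queue.Queue()

def writer(out, start):
    # The only place the clock is read; events are serialized in
    # arrival order, giving a total order over all event kinds.
    while True:
        item = events.get()
        if item is None:          # sentinel: recording finished
            break
        kind, data = item
        t = time.monotonic() - start
        out.write(json.dumps([round(t, 6), kind, data]) + "\n")
        out.flush()

start = time.monotonic()
t = threading.Thread(target=writer, args=(sys.stdout, start))
t.start()
events.put(("o", "hello\r\n"))    # e.g. pushed by the stdout reader
events.put(("i", "q"))            # e.g. pushed by the stdin reader
events.put(None)
t.join()
```

The `queue.Queue` here plays the role of the "buffered, thread-safe queue" from the comment; with multiple processes instead of threads, a pipe would take its place.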
### Audio

I would suggest that, in the case that asciinema.org were to support audio upload, the recording would always reference a local file. The upload functionality would send both the JSON and the audio file referenced from it to the remote server, which would then rewrite the JSON to reference a path to the uploaded audio. To download such an asciicast, you would receive a tarball that extracts both the JSON and the audio into a directory; the JSON would have an audio reference relative to that directory.

### Compatibility

Let's throw out the idea of player compatibility entirely, then. I thought it might be useful, but maybe not. You made a point about having input commands go through a control channel to the output process (which is basically what they do anyway), so I guess NDJSON is fine. But there are still clock problems, and more on that later. It would still be nice if each event were a 3-tuple of time, type, and data.

### Pause, Input Mute, and Format

It doesn't have to do with the format, but I think the spec should warn people implementing an input recorder to consider the security implications of that feature, and that pause / input mute are recommended for that reason.

### Total Order of Events and Clocks

A total order makes the events simpler to think about during playback, but the complexity of linearizing time ends up somewhere. Your point about just sending input events to the output thread over a separate channel is reasonable. That's how I implement commands, and how I was going to solve the clock-sharing issue, but I hadn't considered the implications for the requirements on the stream (in particular, that interleaving stops being a problem then). So, good point.

Regarding using multiple clocks when recording audio: you just can't reasonably do it. Audio is recorded at some fixed sample rate in real time, and the system clock has no real-time guarantees. Every time your read of the system clock is late, you either accumulate that latency, or try to synchronize it with the audio clock, in which case you may as well have just used the audio clock. So what I do right now is effectively what you describe. The audio clock is just the sample rate, which is incremented whenever a frame is queued for writing. The "delta" of an event is defined as the difference between two readings of this clock.

There is still one real problem with this approach, which is that you can (and basically almost always will) have latency reading the clock. Because of this, you still carry accumulating skew every time you read the clock later than before. The delta of two absolute values from a start time does give you a way out of the skew: you can do an exact read and get rid of all previously accumulated latency, or you can do a read with less latency and get rid of some of it. But basically there's some constant factor of clock-read latency present, and any time that fluctuates to a higher value, the only way to counter it is with an exact read, which doesn't necessarily happen often. If instead we recorded these as absolute offsets, we would get that escape from skew for free.

When audio is in the mix, any latency sucks, and I still see drift come and go in longer recordings. From this perspective, I'd really love to go back to what I used to do, which is having offsets recorded as deltas from the start, instead of deltas from the previous event. This makes live streaming harder, but only if your header doesn't include the audio start time from 0 (which probably isn't hard). Live streaming without audio isn't difficult, because you're just going to play whatever events you have whenever you get them, so the delta is basically useless. Live streaming with audio means that you're going to play the events you have whenever you get them, unless they're in the future with respect to the audio.

Playback and seek of a regular recording, with or without audio, is actually easier, because the player doesn't have to keep track of an event duration anywhere: the next event is always the delta between the previous events (modulo the audio time, if that's playing). Soooo, this makes me wonder whether I could convince you to make the event delta a delta from start instead of a delta from previous? The JSON is bigger, but compression helps.
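The two timing schemes being debated here convert into each other with a running sum and a pairwise difference. A small sketch (variable and function names are illustrative, not from any spec):

```python
from itertools import accumulate

def to_absolute(deltas):
    # running sum: each event's time becomes seconds since recording start
    return list(accumulate(deltas))

def to_deltas(absolutes):
    # pairwise difference restores delta-from-previous
    return [b - a for a, b in zip([0.0] + absolutes, absolutes)]

deltas = [0.248848, 1.001376, 0.5]
abs_times = to_absolute(deltas)
# With absolute times, dropping an unsupported event leaves the remaining
# timestamps valid; with deltas, the dropped event's delta must be folded
# into the next event, and clock-read skew accumulates across the sums.
```

Note that round-tripping through floats is only approximate, which is the "float addition skew" point raised in the next comment.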
Let me start by addressing the last thing you wrote, making time absolute (seconds elapsed since the beginning); I'll get to all the other things in following comments.

I would consider this. Let's assume we use a single interleaved stream of events. If we used absolute time, the players/tools which don't understand/use certain events could just filter them out completely, without affecting all subsequent events. With delta-from-prev, as we have now, you still need to take the delta from ignored/unused events and accumulate it into the next ones. That's one (relatively small) argument for absolute time.

Absolute time would also be more precise, not only because it avoids accumulating clock-read skew, but also because it avoids accumulating float-addition skew.

The downsides of absolute time, as I see them today: "time compression" (shortening long idle periods) would require adjusting the time of all following events. The other related thing is that if you want to remove a stdout print event today (because it printed some secret you don't want to disclose, or you just want to remove some part of the recording), you just open the JSON and remove the lines you want. With absolute time you can't just do that: after removing any line you need to adjust the time of all the following events, which isn't feasible for a human; you need a tool for that.

The size of the file would go up slightly, but in a minor way, so that's fine.
I hadn't even thought of the delta-prev issue for dealing with interleaved event streams with possibly unsupported event types. It is indeed a small burden to support the additional delta, but it's an interesting point.

The float skew is another one I hadn't thought about, though I suspect that's going to be relatively minor: events are basically at ms-level accuracy, recorders and players are using IEEE 754 doubles, and the precision of a double is more than enough at that scale.

One additional point is that it actually makes the code simpler. My LOC went up on both the recording and player side to do delta-prev. It's obviously a trivial amount, but deleted code is debugged code.

Time compression isn't a thing I care much about, because I eventually want to support appending to a recording, and one can already compress time by "pausing" during recording. Since I'm mostly concerned with syncing to audio tracks, compressing events would be a poor UX. It also seems like this could be solved on the player side as well, where the person actually watching the video could define a maximum timeout, or watch the video with some timer coefficient.

Removing an event because of secrets I do very much care about, but since I also care about audio, removing an event actually shouldn't shift the times of any subsequent events. Here again, pausing during recording provides a reasonable strategy to avoid the problem entirely. But even if we assume that's not enough, I'm not sure I agree that you can always just remove the event anyway. What if the secret-containing event also has some terminal escape sequences in it that move the cursor? All of a sudden, you have to actually go edit one or more events, and if you don't preserve the terminal sequences, the rest of your recording is wonky. I've considered adding editing tools for this kind of thing (to be able to cut bits of audio and shift bits of the recording along with it). But I really think these should be tools: if you want a time shift, or you want to "mute" a secret over some sequence of events, having a tool that knows how to preserve the rest of your terminal is pretty helpful.
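A trivial form of such an editing tool, in the spirit described above, would mask a secret in output events without shifting any timestamps. This is a hypothetical sketch (the `redact` helper is invented here): masking with a same-length replacement keeps column positions stable, though real escape sequences would need smarter handling, as the comment notes:

```python
# Mask a secret in "o" (stdout) events while leaving every event's
# timestamp untouched, so audio sync is preserved.
def redact(events, secret, mask_char="*"):
    mask = mask_char * len(secret)   # same length: columns stay aligned
    return [[t, kind, data.replace(secret, mask) if kind == "o" else data]
            for t, kind, data in events]

events = [[0.25, "o", "password: hunter2\r\n"],
          [1.0, "o", "ok\r\n"]]
clean = redact(events, "hunter2")
```

A real tool would additionally have to interpret cursor-movement sequences inside the redacted span to keep the rest of the recording from going wonky.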
Summary of points: I consider necessary and non-controversial:
I consider necessary and possibly controversial:
I don't really care about, but I think would be nice from an implementor perspective:
Quick note on resize events: it seems there's a family of CSI sequences for controlling the terminal window, including one to resize it. See here: #198. That's another argument to ditch a separate stream of resize events. The recorder implementation could either do what castty does (not record SIGWINCH, clamp the size for the slave), or insert resizes into the stdout stream as the corresponding control sequence.
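For illustration, the xterm-family window-manipulation sequence for resizing the text area is `CSI 8 ; rows ; cols t`. A sketch of encoding a resize inline in the "o" stream and finding such sequences again on the player side (helper names are made up here):

```python
import re

def resize_seq(rows, cols):
    # xterm window manipulation: CSI 8 ; rows ; cols t
    # resizes the text area to the given character dimensions.
    return f"\x1b[8;{rows};{cols}t"

RESIZE_RE = re.compile(r"\x1b\[8;(\d+);(\d+)t")

def find_resizes(data):
    # scan an output chunk for inline resize sequences
    return [(int(r), int(c)) for r, c in RESIZE_RE.findall(data)]

chunk = "before" + resize_seq(30, 100) + "after"
```

As the next comment points out, the catch is that a player cannot tell whether such a sequence in the stream came from a recorded SIGWINCH or from an application merely requesting a resize.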
Given we're left with only one currently supported event type ("o", print to stdout), and one possibly useful but not yet supported/used ("i", stdin), I'm considering going back to a simpler representation. I have different priorities now than displaying keystrokes (really not sure when I would get to this; it's already been waiting years), and there's always the possibility of making a v3 format in the future. The most important thing for me right now is to make this incremental-writing / appending / streaming friendly. @dhobsd, is stdin recording important to you in the short term?
I realized something re using a control sequence for storing resize events. If this sequence ends up in stdout not as a result of SIGWINCH but because an app printed it (essentially requesting a window resize), and the terminal doesn't support this (doesn't send SIGWINCH back), then a full-screen app (for example vim) doesn't resize, but the JS player would resize (it has no way to tell whether this escape sequence actually resulted in a resize or not). Need to give it more thought...
I don't have any short-term goals for supporting stdin recording. Right now it's playback with audio in the console, and possibly some editing tools. I'm fine with punting on stdin for now, and on window sizing entirely. But can we make it delta-from-start instead of delta-from-prev? :)
I added this comment in #127 (it's certainly not as well thought out as the above comments) but I'm adding it here for posterity. One thing that would be good is to have the start time of the session stored in the initial metadata. This could be extracted by other tooling to provide auditable SSH sessions. Additionally, being able to "inject" metadata might be interesting, so you could potentially tag a created session with arbitrary key/value pairs and have those be exposable in some external UI.
I think custom keys could be useful. It'd be up to the recording tool to determine how to let users set them, but I think it's a good idea. I like the color palette suggestion in asciinema/discussions#8 as well. If automatically detecting the colors used turns out to be a hard problem, we can always expose a CLI option that allows using a pre-set palette by name, or defining the colors to use in some config format.
I'm currently developing a re-implementation of a server for similar session capture (just focused more on auditing than casting), and I was looking into using the asciicast format. Some feedback about the format:
@XANi I think the unix timestamp of the start of the recording could be a top-level field of its own; that's a good idea (also suggested by @josegonzalez). Re the "is input echo on?" example: in this particular case we have this information in the recorded stdout stream, in the form of a non-printable escape sequence, as apps turn echo on/off just by writing to stdout. Almost all of a terminal's internal state is driven by stuff written to stdout, while resizing of the terminal is an event coming from outside the terminal, which is very different in nature from state modified by the apps running within it. I don't know of any other "external force" like that besides resizing.
About collected environment variables: what could work is to have a white-list of env variables saved under `env`. Having this, you could set it to, for example, just the variables you consider safe to share.

UPDATE: I opened a separate issue to discuss env var collection: #222
As for extra, non-environment meta-data in the header: I see the following options:

UPDATE: I opened a separate issue to discuss this in detail: #223
About event timing in v2: I'm convinced to go with absolute time (relative to the start of the recording). /cc @dhobsd

UPDATE: this is already implemented in this branch.
There's also the question of the file extension for v2. We used `.json` for v1. From what I understand, the json-lines spec and the ndjson spec each allow every line to be any JSON value, which means our format of a first header line (object) + subsequent event lines (arrays) conforms to both. So I'd rather go with a dedicated extension. I'd consider going even further with this: using a new media type (content-type) as well.

UPDATE: I opened a separate issue to discuss this in detail: #224
A custom extension/type is definitely a good idea, as it can then be bound to an app under the desktop or browser and "just work" when clicked, as long as it is still parseable (nd)json(-lines).

As for metadata, I think there should just be a single place for user-defined keys. That way, if you want to "anonymize" a cast, you just drop 2-3 keys instead of iterating over the whole structure to drop a bunch of them. That allows the format to be used not just for pretty demos, but also for stuff like auditing and access control.
I opened a separate issue for discussing the color theme shape: #221 (let's discuss this topic there)
I added |
The ideas behind the asciicast v2 format are:

- enable live, incremental writing to a file during recording (with the v1 format the final recording JSON can only be written as a whole after finishing the recording session),
- allow players to start playback as soon as they read the meta-data line (contrary to the v1 format, which requires reading the whole file), without the need to buffer the whole recording in memory,
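The playback-as-soon-as-possible goal above implies a reader that parses only the first line before starting. A minimal sketch under those assumptions (the `read_cast` name is invented here; events use the `[time, code, data]` shape discussed in this thread):

```python
import json

def read_cast(lines):
    # Parse the header from the first line, then yield events lazily,
    # so playback can begin before the whole recording is read and
    # nothing needs to be buffered in memory.
    it = iter(lines)
    header = json.loads(next(it))
    events = (json.loads(line) for line in it if line.strip())
    return header, events

lines = ['{"version": 2, "width": 80, "height": 24}',
         '[0.248848, "o", "hello"]',
         '[1.250224, "o", "world"]']
header, events = read_cast(lines)
```

The same generator works unchanged over a file object or a live stream, since both are just iterables of lines.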
Preview of the doc: https://github.com/asciinema/asciinema/blob/v2/doc/asciicast-v2.md