Suggestion: a convention for multiline statuses (line breaks) #157

Open
dbohdan opened this issue Feb 16, 2021 · 18 comments

@dbohdan
Contributor

dbohdan commented Feb 16, 2021

I would like to propose a convention for multiline status updates (newlines) in the twtxt format. The convention is backwards compatible with clients that do not support it. The convention is: when the client sees a sequence of statuses with the same timestamp, join their text with a newline. A feed following this convention looks reasonable in a client that does not understand it, as long as the client displays statuses with the same timestamp in the order they appear.

For example, twtxt currently renders

1845-01-29T12:00:00Z	Once upon a midnight dreary, while I pondered, weak and weary,
1845-01-29T12:00:00Z	Over many a quaint and curious volume of forgotten lore—
1845-01-29T12:00:00Z	    While I nodded, nearly napping, suddenly there came a tapping,
1845-01-29T12:00:00Z	As of some one gently rapping, rapping at my chamber door.
1845-01-29T12:00:00Z	“’Tis some visitor,” I muttered, “tapping at my chamber door—
1845-01-29T12:00:00Z	            Only this and nothing more.”

as

➤ http://127.0.0.1:8081/poe.txt (175 years ago):
Once upon a midnight dreary, while I pondered, weak and weary,

➤ http://127.0.0.1:8081/poe.txt (175 years ago):
Over many a quaint and curious volume of forgotten lore—

➤ http://127.0.0.1:8081/poe.txt (175 years ago):
While I nodded, nearly napping, suddenly there came a tapping,

➤ http://127.0.0.1:8081/poe.txt (175 years ago):
As of some one gently rapping, rapping at my chamber door.

➤ http://127.0.0.1:8081/poe.txt (175 years ago):
“’Tis some visitor,” I muttered, “tapping at my chamber door—

➤ http://127.0.0.1:8081/poe.txt (175 years ago):
Only this and nothing more.”

If support for this convention were implemented, twtxt could render the same file as

➤ http://127.0.0.1:8081/poe.txt (175 years ago):
Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore—
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
“’Tis some visitor,” I muttered, “tapping at my chamber door—
Only this and nothing more.”

I have implemented the convention in my twtxt.tcl library and GUI feed reader. I have also made a page explaining it (pretty much like this issue does).

What do you think?

@bkil

bkil commented Apr 17, 2023

This requires mandating the yarn.social extension of keeping the feed chronologically sorted past to future. Compare this to the twtxt format:
https://twtxt.readthedocs.io/en/latest/user/twtxtfile.html#format-specification

A specific ordering of the statuses is not mandatory.

Not only was a specific ordering not mandatory, no chronological direction was recommended either (i.e., past to future vs. future to past).

Note that this will also need careful implementation to maintain compatibility with a future scheme that references posts by the feed URL (or follower nick) and the timestamp in combination, since timestamps may no longer be used as a primary key.

@prologic

The added complexity and burden on clients makes this proposal more difficult to adopt than a simple replacement of the Unicode new line code point u2028.

@dbohdan
Contributor Author

dbohdan commented Apr 17, 2023

@bkil:

This requires mandating the yarn.social extension of keeping the feed chronologically sorted past to future.

I am not suggesting that. My proposal only suggests that statuses with the same timestamp be displayed first-to-last-line when they are sequential in the file. This does not affect the overall structure of the file. It is fully backwards compatible.

This is a valid use of the convention:

1999-01-01T00:00:00 Foo
2023-01-01T00:00:00 Bar line 1
2023-01-01T00:00:00 Bar line 2
2014-01-01T00:00:00 Baz
2023-01-01T00:00:00 Qux (not merged with the Bar lines)
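
As a rough illustration of the rule described above, here is a minimal Python sketch that merges only runs of consecutive statuses sharing a timestamp. The function name `merge_same_timestamp` is hypothetical, timestamps are compared as exact strings, and lines without a tab separator are simply skipped; all of these are assumptions of the sketch, not part of the proposal or of any existing client.

```python
from itertools import groupby


def merge_same_timestamp(lines):
    """Join runs of consecutive statuses that share a timestamp.

    Assumption: each status line is "<timestamp>\t<text>", and two
    timestamps are "the same" only if their strings are identical.
    """
    entries = []
    for line in lines:
        timestamp, sep, text = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab separator: not a status line, skip it
        entries.append((timestamp, text))

    merged = []
    # groupby only groups *adjacent* items, which is exactly the rule:
    # a later status with an earlier-seen timestamp starts a new entry.
    for timestamp, group in groupby(entries, key=lambda entry: entry[0]):
        merged.append((timestamp, "\n".join(text for _, text in group)))
    return merged


feed = [
    "1999-01-01T00:00:00\tFoo",
    "2023-01-01T00:00:00\tBar line 1",
    "2023-01-01T00:00:00\tBar line 2",
    "2014-01-01T00:00:00\tBaz",
    "2023-01-01T00:00:00\tQux (not merged with the Bar lines)",
]

for timestamp, text in merge_same_timestamp(feed):
    print(timestamp, repr(text))
```

Because only adjacent lines are grouped, the overall order of the file does not matter, and the final Qux status stays separate from the Bar lines.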

timestamps may no longer be used as a primary key.

Isn't this already the case? Timestamps are not guaranteed to be unique.

@prologic:

The added complexity and burden on clients makes this proposal more difficult to adopt than a simple replacement of the Unicode new line code point u2028.

This proposal's advantage is that it does not require Unicode on both the writer and the reader side, an editor that can preserve \u2028, or a way for the writer to input \u2028. For these reasons it is better suited for retrocomputing and for plain text editing.

In an imperative programming language the client burden amounts to tracking whether the previous line has the same timestamp as the current line when reading a text stream. It is greater than the burden of splitting the status text on \u2028, but not by much.

I'll also note that this is not an either-or thing: a single client can support both this convention and \u2028.

@prologic

This is true; we could support both forms 👌

@dbohdan
Contributor Author

dbohdan commented Apr 17, 2023

I see that the official client truncates status text on \u2028. This seems like a bug: the spec says nothing about \u2028, so it should presumably be passed through like any other character rather than treated as the end of the status.

It does illustrate a problem with \u2028 that this proposal avoids: edge cases around uncommon Unicode characters.

This proposal's advantage is that it does not require Unicode on both the writer and the reader side ...

To be clear, I know the spec mandates UTF-8. I mention this as a practical advantage for partially compatible clients, for example, on old devices. (I meant to write a client for FreeDOS but never got to it.)

@bkil

bkil commented Apr 17, 2023

You may be living in parts of the world where English is the only language spoken. For almost any other language, some part of your alphabet will contain accented characters that can only be represented in Unicode. I do not find it wise to develop software that lacks at least basic support for it.

Note that even for a Linux/BSD command line application, you can just copy the byte stream from the file to the output as the terminal emulator or frame buffer will handle it for you.

If you used a standard library or compiler without UTF-8 support, you could also just split on the 3-byte sequence of e2 80 a8 and then be done with it. Although, I personally recommended the 4-byte sequence of <br> in the past as a viable alternative, as I prefer to keep the text format easily editable and I'm not a fan of invisible markup either.
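
To make the byte-level approach concrete, here is a small Python sketch that splits a raw status on the three bytes e2 80 a8 without decoding the text at all. The names `LINE_SEPARATOR` and `split_status_bytes` are hypothetical and chosen only for this illustration.

```python
# UTF-8 encoding of U+2028 LINE SEPARATOR, as the three raw bytes.
LINE_SEPARATOR = b"\xe2\x80\xa8"


def split_status_bytes(status: bytes, separator: bytes = LINE_SEPARATOR) -> list[bytes]:
    """Split a raw status into lines without any Unicode decoding."""
    return status.split(separator)


raw = "first line\u2028second line".encode("utf-8")
print(split_status_bytes(raw))

# The same function works for a plain-text marker such as <br>.
print(split_status_bytes(b"first line<br>second line", b"<br>"))
```

Since the separator is passed as a parameter, a client could try both markers with the same code path.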

Note that most DOS-based computers lacked TCP/IP and Internet access, so I find this retrocomputing experience oddly anachronistic. Why not develop for some ground-up alternative that is still maintained (BSD, Haiku, Redox, SerenityOS, OpenWrt, MenuetOS, AROS)?

@dbohdan
Contributor Author

dbohdan commented Apr 17, 2023

I agree you should support Unicode on modern systems. On most old systems it is awesome if you do, but it is the norm that you don't. A discussion about what retro and hobby operating systems are better would be out of place here.

If you used a standard library or compiler without UTF-8 support, you could also just split on the 3-byte sequence of e2 80 a8 and then be done with it.

Yep.

Although, I personally recommended the 4-byte sequence of <br> in the past as a viable alternative, as I prefer to keep the text format easily editable and I'm not a fan of invisible markup either.

This would have probably been a better choice than \u2028. The line separator being plain text would avoid problems with existing clients. At worst it might get in the way of people quoting a bit of HTML, but I have never seen that in a twtxt feed.

@dbohdan
Contributor Author

dbohdan commented Apr 17, 2023

@prologic Could you migrate from \u2028 to <br>? For example, recognize both but only emit <br>? It would improve backward compatibility.

@bkil

bkil commented Apr 17, 2023

Note that we may consider implementing a subset of HTML to serialize rich text markup (i.e., line breaks in this case), similar to how it has been done in ActivityPub (and most previous protocols on The Fediverse), Matrix, XMPP and others. Surely it always just boils down to a subset of HTML.

In our case, this subset could be:

  • escape literal < with &lt;
  • escape literal > with &gt;
  • escape literal & with &amp;
  • interpreting <br> as line breaks
  • anything else is on-demand and optional (escaping " with &quot; is also a good idea)

Even a simple hand-made parser can be implemented in a forgiving manner if one forgets to escape. For example, by restricting interpretation to only known tags and without allowing whitespace between < and the tag name.
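
A forgiving decoder for the subset listed above might look like the following Python sketch. It is not the code of any client mentioned in this thread; the function names `decode_status` and `encode_status` are hypothetical. Only the exact string `<br>` and the four named entities are interpreted, and everything else (including stray or unbalanced angle brackets) passes through unchanged.

```python
import re

# One pass over the text: either the literal tag <br> or a known entity.
_TOKEN = re.compile(r"<br>|&(lt|gt|amp|quot);")
_ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"'}


def decode_status(text: str) -> str:
    """Turn <br> into newlines and unescape the known entities."""
    def replace(match: re.Match) -> str:
        if match.group(0) == "<br>":
            return "\n"
        return _ENTITIES[match.group(1)]

    # A single substitution pass avoids double unescaping,
    # so "&amp;lt;" decodes to the text "&lt;", not to "<".
    return _TOKEN.sub(replace, text)


def encode_status(text: str) -> str:
    """Escape the special characters, then write newlines as <br>."""
    escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    return escaped.replace("\n", "<br>")


print(decode_status("a &lt; b<br>1 < 2 and <3"))
print(encode_status("talking about <br> itself\nsecond line"))
```

Unescaped input such as `1 < 2` survives because the pattern only matches known tokens, which is the forgiving behavior described above.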

In my proof of concept enhanced twtxt-client, I actually implement it like this, along with a subset of tags easy to implement that correspond to the most often used ones in CommonMark/gemini (<b>, <i>, etc.) along with \u2028 and a bit of CommonMark for interim compatibility.

It's actually interesting how many GUI widget toolkits support native interpretation of a subset of HTML for rich text formatting without resorting to embedding a full-blown web browser engine:

@prologic

@prologic Could you migrate from \u2028 to <br>? For example, recognize both but only emit <br>? It would improve backward compatibility.

I'm not sure what you mean exactly? 🤔

@dbohdan
Contributor Author

dbohdan commented Apr 17, 2023

I mean to ask, would it be possible for you to switch from the character \u2028 to the string <br> as proposed by bkil in https://dev.twtxt.net/doc/multilineextension.html? It seems like a nice plain text-alternative in harmony with the existing use of @<example http://example.org/twtxt.txt>. It is compatible with the official client, visible in the editor, and easy to type.

@prologic

I mean to ask, would it be possible for you to switch from the character \u2028 to the string <br> as proposed by bkil in https://dev.twtxt.net/doc/multilineextension.html? It seems like a nice plain text-alternative in harmony with the existing use of @<example http://example.org/twtxt.txt>. It is compatible with the official client, visible in the editor, and easy to type.

Ahh, what I'm not sure about is whether we do this in the actual feed itself or in a version of the feed for better backwards compat. The only issue I see with this is it might break the Markdown parser and cause unintended side-effects. I'd have to test this.

As an aside, we're already talking about the merits of providing a twtxt.txt feed that has all Markdown stripped, Subjects stripped, multi-lines replaced with either -- or <br> or whatever you want and truncated to 140 chars. But the problem I have with this is quite simply, what value are we bringing to the table by doing this? What problem are we solving beyond trying to adhere to the original spec in its purest form? At that point we may as well just fork entirely (as doing this would have the same effect either way) or use some other format entirely. -- I worry for example that our Twtxt friends that use tt or jenny would suffer unnecessarily by doing this and no longer have the benefit of linked images, half the context would be missing, threading would be gone, etc.

@bkil

bkil commented Apr 18, 2023

I wouldn't like to comment on the specific forking effort, but generating a frugal feed for each rich yarn.social account sounds like a sensible and easy option. The frugal feed would be much more useful than what you imply above. Let me elaborate about the features of such a frugal feed.

A status may be longer than 140 characters. Unlike on Twitter, no hard limit was ever defined:

A status should consist of up to 140 characters, longer status updates are technically possible but discouraged.

I recommend replacing the blake-hash in the subject with the character-exact timestamp of the message you are replying to. My client supports threading by both blake-hash and by a combination of timestamp & user mention - whichever is present. Adding a user mention of the previous post in the same thread along the root post is also considered good manners, this might benefit from synthesis during conversion.

I'm not sure I'd keep the extended #<tag URL> syntax as it is usually just line noise, especially when abused to refer to post blake-hashes for machines instead of labels meant for users. Hashtags are usually understood to be #HashTag. Just use a bare URL to refer to post permalinks where a @<URL> user mention and a timestamp is not appropriate. Incidentally I normalize ingested feeds just like.

What you seem to be missing without markdown is inline images. But if you still included the bare URL of the image, a client could decide to show it inline anyway or just autolink it. That's precisely the way I do it in my client:

  • If the origin is well known, handle it appropriately
  • If the URL includes a file extension, then embed it appropriately - jpeg, webm, ogg, js, etc.
  • Optional: execute a CORS HEAD request to determine media type and possible alternate format links. This round trip could be optimized by starting to stream the next point.
  • Finally do a CORS request to fetch its content to provide for a preview and summary if available and learn the CMS on the origin. If the file proved to be multimedia, render the fetched content as a blob.
  • Otherwise, without CORS available, try to assume that it is an image or video and try to inline it with a tag and act upon the onerror handler.
  • If all above fails, try to discover the CMS on the origin by probing various URIs (e.g., nodeinfo for The Fediverse or well-known for other commonly hosted apps). If supported, use the CORS endpoints corresponding to the known software to access the given content (also useful for GitLab, Gitea, PeerTube and others)

@bkil

bkil commented Apr 18, 2023

Oh, and just a bit more thought about <br>. How was escaping supposed to work in twtxt originally? I.e., How could one submit a post that is talking about the format itself containing potentially unbalanced angle brackets without interpreting them? This may also occur with ASCII line-art.

Similarly, how is one supposed to type in the corner case of an @-sign at the end of the line (i.e., producing the phrase first line@<br>second line)? This is actually a simple case, because it can be declared that user mentions are only allowed in the following forms:

  • @username
  • @<username URL>
  • @<URL>

Hence @<br> is not valid syntax, so it is sensible to handle it implicitly (i.e., by running the line breaking replacement first, and only searching for mentions afterwards). Escaping the @-sign with &#64; as in HTML is always an option, though.
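
The ordering described here can be sketched in Python as follows. The `MENTION` pattern and the `parse_status` function are hypothetical, and the pattern is a simplification (for instance, it would also treat the tail of an email address as a bare mention); it only covers the three mention forms listed above.

```python
import re

# @<username URL> or @<URL>, otherwise a bare @username.
MENTION = re.compile(r"@<(?:(\w+) )?(\S+?)>|@(\w+)")


def parse_status(text):
    """Split on <br> first, then look for mentions in each line."""
    lines = text.split("<br>")
    mentions = []
    for line in lines:
        for match in MENTION.finditer(line):
            nick, url, bare = match.groups()
            mentions.append((nick or bare, url))
    return lines, mentions


print(parse_status("first line@<br>second line"))
print(parse_status("hi @<joe http://example.org/twtxt.txt> and @bob<br>bye @<http://example.com/lola>"))
```

Applying the mention pattern before the line-break replacement would misread `@<br>` as a mention of the URL `br`, which is why the replacement runs first in this sketch.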

@prologic

I recommend replacing the blake-hash in the subject with the character-exact timestamp of the message you are replying to. My client supports threading by both blake-hash and by a combination of timestamp & user mention - whichever is present. Adding a user mention of the previous post in the same thread along the root post is also considered good manners, this might benefit from synthesis during conversion.

This does not take into consideration the "network". You cannot have a threading model whereby you either have to a) keep a global id somewhere (counter to a decentralised system) or b) a high rate of collisions (such as a timestamp in one feed that collides with timestamps in all other feeds)

@prologic

I'm not sure I'd keep the extended #<tag URL> syntax as it is usually just line noise, especially when abused to refer to post blake-hashes for machines instead of labels meant for users. Hashtags are usually understood to be #HashTag. Just use a bare URL to refer to post permalinks where a @<URL> user mention and a timestamp is not appropriate. Incidentally I normalize ingested feeds just like.

This hasn't been the case for years. Simple #tags are used.

@prologic

Anyway, I'm happy to explore @dbohdan's original suggestion here of using "same-timestamp" as multi-line posts and/or use of <br> or some other marker, the latter of which would require some experimentation. I am also conscious of the fact that the Multiline Extension is already widely adopted by quite a few client implementations, so this is pretty low-priority work for me.

@bkil

bkil commented Apr 18, 2023

@prologic The recommendation of replacing the blake-hash was for the frugal feed you produce for every yarn.social feed. You can do whatever you want within the legacy yarn.social feed. Please also read the fine print again: I told you that I connect threads using both the timestamp and the user mention at the beginning of a post, and there can be no collision with that. Incidentally, this scheme was recommended to you years ago in your issue tracker by users, but you went with blake-hashes anyway and claimed that you liked the idea but couldn't "change" it.

Recall the example I gave you in the past:

http://example.com/joke:

2022-10-31T06:54Z\tWhy do programmers confuse Halloween with Christmas?
2022-10-31T23:00Z\t@<http://example.com/lola> (2022-10-31T11:11Z) @<http://example.com/joke> (2022-10-31T06:54Z) Spot on! Oct 31 = Dec 25

http://example.com/lola:

2022-10-31T11:11Z\t@<http://example.com/joke> (2022-10-31T06:54Z) Something related to eight?

http://example.com/kids:

2022-10-31T22:22Z\t@<http://example.com/joke> (2022-10-31T06:54Z) Beats me

Note that I also support abbreviating user mentions of followers as @joke @lola but that's beside the point here.
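
One way to read the example above is that each post is identified by the pair (feed URL, timestamp), and a reply lists the pairs it refers to at the start of its text. The Python sketch below builds a reply index on that assumption. The names `LEADING_REF`, `leading_refs` and `build_reply_index` are hypothetical, and this is not the code of any client discussed in this thread.

```python
import re
from collections import defaultdict

# A reference at the start of a post: "@<feed URL> (timestamp)".
LEADING_REF = re.compile(r"\s*@<(\S+?)> \((\S+?)\)")

FEEDS = {
    "http://example.com/joke": [
        "2022-10-31T06:54Z\tWhy do programmers confuse Halloween with Christmas?",
        "2022-10-31T23:00Z\t@<http://example.com/lola> (2022-10-31T11:11Z) "
        "@<http://example.com/joke> (2022-10-31T06:54Z) Spot on! Oct 31 = Dec 25",
    ],
    "http://example.com/lola": [
        "2022-10-31T11:11Z\t@<http://example.com/joke> (2022-10-31T06:54Z) Something related to eight?",
    ],
    "http://example.com/kids": [
        "2022-10-31T22:22Z\t@<http://example.com/joke> (2022-10-31T06:54Z) Beats me",
    ],
}


def leading_refs(text):
    """Collect the (feed URL, timestamp) pairs at the start of a post."""
    refs, pos = [], 0
    while (match := LEADING_REF.match(text, pos)):
        refs.append((match.group(1), match.group(2)))
        pos = match.end()
    return refs


def build_reply_index(feeds):
    """Map each referenced post key to the keys of posts that refer to it."""
    replies = defaultdict(list)
    for url, lines in feeds.items():
        for line in lines:
            timestamp, _, text = line.partition("\t")
            for ref in leading_refs(text):
                replies[ref].append((url, timestamp))
    return replies


index = build_reply_index(FEEDS)
root = ("http://example.com/joke", "2022-10-31T06:54Z")
for reply in sorted(index[root], key=lambda key: key[1]):
    print(reply)
```

Because the key combines the feed URL with the timestamp, two feeds posting at the same minute do not collide in this index.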

This does not take into consideration the "network". You cannot have a threading model whereby you either have to a) keep a global id somewhere (counter to a decentralised system) or b) a high rate of collisions (such as a timestamp in one feed that collides with timestamps in all other feeds)
