How to store metadata about a feed #48

Open
Benaiah opened this issue Feb 9, 2016 · 65 comments

@Benaiah

Benaiah commented Feb 9, 2016

A number of different issues and ideas have made clear the need for a place to specify metadata about a twtxt.txt feed. For instance, essentially every idea for notifications so far needs to know where the notifications should go (technical details vary based on the proposal). The question then is how to store metadata.

Discussion in #22 has suggested a general comment character, thus allowing clients to handle individually how the metadata would be stored. I suggest building on this, allowing for general comments, but make the following format specifically for metadata:

# this is a regular comment

# the next line is a metadata entry
# nick = benaiah

This echoes the .ini format of the twtxt config file, which I think gives it a nice consistency.
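A minimal sketch of how a client might tell a metadata entry from a regular comment under this proposal. The regular expression and the name `parse_metadata_line` are assumptions for illustration; the thread does not define an exact grammar.

```python
import re

# Assumed grammar: '#', optional spaces, a key, '=', then a non-empty value.
META_RE = re.compile(r"^#\s*(?P<key>[A-Za-z_][\w.-]*)\s*=\s*(?P<value>.*\S)\s*$")


def parse_metadata_line(line):
    """Return (key, value) for a metadata comment, or None for anything else."""
    match = META_RE.match(line)
    if match is None:
        return None
    return match.group("key"), match.group("value")
```

One consequence of this design is that any free-form comment that happens to look like `# word = something` would be read as metadata.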

The other main suggestion for metadata is to have another file. I dislike this approach because it complicates the protocol, significantly increases how much twtxt has to hit the network, and requires either a second URL for each person (for the metadata file), switching twtxt.txt to hold metadata and having another file hold the feed, or putting a metadata entry in twtxt.txt that points to the metadata file.

@tedder
Contributor

tedder commented Feb 9, 2016

pros of second file:

  • can cache it (if-modified-since, etag).
  • a single file must be fully loaded/parsed to find metadata. huge file could make that painful. easy to only care about the very top or very bottom of a twtxt file.
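The caching point can be sketched with standard HTTP conditional headers. This is only an illustration: `build_conditional_request` is a hypothetical helper, and the ETag and date values would come from a client's own cache of an earlier response.

```python
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that lets the server answer 304 if nothing changed."""
    request = urllib.request.Request(url)
    if etag:
        # Value of the ETag header from the previous response.
        request.add_header("If-None-Match", etag)
    if last_modified:
        # Value of the Last-Modified header from the previous response.
        request.add_header("If-Modified-Since", last_modified)
    return request
```

A 304 Not Modified reply would mean the cached copy is still current and nothing needs to be parsed again.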

One file has the advantage of showing metadata that changes- for instance "added new profile pic on date" or "followed @ on date", if we use syntax that is similar to the non-commented version:

# [date] \t key = value

@reednj

reednj commented Feb 10, 2016

I like the idea of doing this in comments at the top of the file. I think the advantages of having everything in the same file outweigh any added complexity when swapping out the files if they get too big or whatever.

However, I think we would quickly hit limitations with a simple key-value system. How would you easily store a list of follows with this, for example?

A good format could be YAML, I think. It's human-readable and writable, and widely supported. We would just need to strip out the comment character at the start of each line before parsing it.

I imagine the header for twtxt would then look something like this:

# the three dashes indicate the start of the data block, so we know where
# to start converting to yaml
# ---
# username: reednj
# following: 
#  - buckket http://buckket.org/twtxt.txt
#  - xena https://xena.greedo.xeserv.us/files/xena.txt
#  - whatever http://whatever.com/twtxt.txt

Edit: somehow forgot to add the urls to the user list...
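A rough sketch of the "strip the comment character" step, using only the standard library. The function name `extract_yaml_header` is hypothetical, and the rule that the block ends at the first non-comment line is an assumption; the returned text could then be handed to a YAML parser such as PyYAML's `yaml.safe_load` (a third-party package).

```python
def extract_yaml_header(feed_text):
    """Collect commented lines after '# ---' and strip the leading '# ' from each."""
    yaml_lines = []
    in_block = False
    for line in feed_text.splitlines():
        if not line.startswith("#"):
            if in_block:
                break  # assumed: the first non-comment line ends the header
            continue
        body = line[1:]
        if body.startswith(" "):
            body = body[1:]  # drop exactly one space so YAML indentation survives
        if not in_block:
            # Ordinary comments before the '---' marker are skipped.
            in_block = body.strip() == "---"
            continue
        yaml_lines.append(body.rstrip())
    return "\n".join(yaml_lines)
```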

@erlehmann

@tedder consider that if you use a separate file for metadata and it also supports including messages you quickly obsolete the twtxt format as the syndication format of choice. Every client will just use the format that provides more data. Thus, the original twtxt format would be mainly useful as input (like Markdown or ReStructured Text) for scripts generating feeds.

@tedder
Contributor

tedder commented Feb 10, 2016

@erlehmann I said nothing about including messages in a second file.

@reednj I like the idea of metadata at the top, instead of happening anywhere in twtxt. I (personally) like yml, it's extensible in cases like this.

@erlehmann

@tedder to demonstrate: Yeah, you do not have to include messages. But any format that is powerful enough to include the metadata can be utilized for that and then you are back at using a single file. I have written a small shell script that converts a twtxt feed to the format described in RFC 4287, which describes how to convey author name/email, contributor name/email, the time of publication and the last update for a document. Since RFC 4287 also describes how to include messages, I just included them!

Here is the input file: http://daten.dieweltistgarnichtso.net/tmp/docs/twtxt.txt
Here is the output file: http://daten.dieweltistgarnichtso.net/tmp/docs/twtxt.xml

@erlehmann

@reednj RFC 5005 describes a mechanism to link together several physical documents that form one logical document. It is not that hard it seems, as long as the first document contains the metadata about the aggregate.

@erlehmann

@reednj I see a problem with your example as it does not give URLs in the source, only nicknames. In reality, you would need the URL.

@erlehmann

@reednj I am not familiar with yaml. How can you do namespaces in yaml? As far as I see, you would need namespacing for forwards compatibility.

@reednj

reednj commented Feb 11, 2016

So sounds like commented YAML could be the way to go? I wonder if @buckket has an opinion?

Also, please no namespaces, that is the very definition of YAGNI

@erlehmann

reednj could you explain how a format can be extensible if you do not have namespaces without basically ignoring everything in the file that is not in the default namespace? Or is the metadata format you envision a fixed format without any additional semantics, ever?

@otherjoel

Personally I would love to see twtxt either commit to a truly minimalist “no metadata” stance, or simply use Atom as the default format in a single file. Atom has everything you need. It is not the most terse file format; the existing twtxt format is the most terse if that’s what you’re shooting for. But as soon as we start trying to approximate feature-parity with Twitter, it’s likely we’ll just end up reinventing Atom/RSS poorly. Atom is human-readable, it’s a truly well-made and well-defined standard, there’s widespread support for it.

@reednj

reednj commented Feb 12, 2016

You can have meta data about the user at the top of the file, without having any meta data about the messages, which is basically what I'm pushing for.

I don't think we can or should or need to compete with twitter. The appeal of twtxt is its simplicity, and xml is the opposite of that in every way.

@mkody

mkody commented Feb 12, 2016

I second @reednj.

twtxt is a decentralised, minimalist microblogging service for hackers.

The minimalist part here needs to stay. The fact that we can use only one (or two soon?) lines for each tweet makes it simple and clear to use.

@Benaiah
Author

Benaiah commented Feb 12, 2016

You can have meta data about the user at the top of the file, without having any meta data about the messages, which is basically what I'm pushing for.

I agree - we need user data for any sort of network propagation, but the messages themselves should remain as ephemeral and simple as they are currently. I think you hit the nail on the head.

@erlehmann

@mkody as I said, twtxt can be an input format for an already existing representation, like Markdown. Try http://news.dieweltistgarnichtso.net/bin/twtxt2atom out and you might see what I am proposing.

@Benaiah what is “network propagation” ?

@otherjoel

the messages themselves should remain as ephemeral and simple as they are currently

So to be clear, official support for things like replies to chain messages together in conversations are absolutely off the table? If so, then that feels consistent and I can dig it.

@mkody

mkody commented Feb 12, 2016

@erlehmann So you mean that we could keep the twtxt file and make an atom feed from it?
For the atom to have some sort of metadata, it means that our input (the twtxt file) should have them somewhere too.
That feels redundant to use two files for the same purpose. And convert the file every time.

@DracoBlue
Contributor

I like the way @reednj posted!

Advantages:

  • people can add comments without thinking about metadata at all
  • the --- indicates that YAML data follows (that's very common)
  • having one file with also "following" etc resolves the issue of syncing following list

I really like Atom, and especially the Atom sync protocol, but twtxt's simplicity, with posting to your feed being as simple as TIMESTAMP\tmessage, is what makes it a very nice format to host on whatever webspace and post to with whatever client you have.

Everything we add with #, as I suggested in #22, is an extra and should not be mandatory, even though having YAML in twtxt, as @reednj posted, could make the config file nearly unnecessary ;).

@mdom
Contributor

mdom commented Mar 6, 2016

After thinking about this topic for a few days, I'm sure benaiah's first suggestion would be a very good fit for twtxt. If we just use comments like

# follow david http://example.org/david.txt
# unfollow http://example.org/user.txt
# nick mdom
# twturl http://example.org/user.txt 

somewhere in the file, it would be very easy even for the simplest client to read and write metadata in the feed. With things like YAML or INI, by contrast, you couldn't just read the file line by line, and you would probably need a parser to do the work. This format would also allow you to record whom you once followed, or your old twturl, if somebody needs that. As for the argument about needing to parse the whole twtfile just to get the metadata: we currently parse the complete file every time to build the timeline, so I'm not sure this is even an issue.

I have the strong feeling we should just use the easiest and most minimal solution one can think of. I mean, that's what twtxt is all about, right? :)
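To illustrate the line-by-line reading described above, here is a small sketch that replays such comments in file order. The command set is taken from the example; the function name `replay_metadata`, the state layout, and the "last entry wins" rule are assumptions.

```python
def replay_metadata(lines):
    """Apply '# follow', '# unfollow', '# nick' and '# twturl' comments in file order."""
    state = {"nick": None, "twturl": None, "following": {}}
    for line in lines:
        if not line.startswith("#"):
            continue  # ordinary tweets are not metadata
        parts = line[1:].split()
        if not parts:
            continue
        command, args = parts[0], parts[1:]
        if command == "follow" and len(args) == 2:
            nick, url = args
            state["following"][url] = nick
        elif command == "unfollow" and len(args) == 1:
            state["following"].pop(args[0], None)
        elif command in ("nick", "twturl") and len(args) == 1:
            state[command] = args[0]  # assumed: the last entry wins
    return state
```

Because earlier entries stay in the file, a client could also walk the same lines to reconstruct past follows or an old twturl.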

@archusr

archusr commented Mar 6, 2016

mdom's suggestion sounds very reasonable. I also like the log style approach therein.

@mdom
Contributor

mdom commented Mar 6, 2016

We talked a little about it on IRC, and we would also propose adding a timestamp to the comment, so the client can reorder metadata as it sees fit. Some would leave it interspersed in the file and others could move metadata to the top of the file.

@archusr

archusr commented Mar 6, 2016

to still allow for simple sorting by timestamps, irc style commands could be an alternative to # comments:

# 2016-03-06T23:23:23Z  follow user https://example.org/user/twtxt.txt
2016-03-06T23:23:23Z    /follow user https://example.org/user/twtxt.txt

@Lymkwi
Contributor

Lymkwi commented Mar 7, 2016

to still allow for simple sorting by timestamps, irc style commands could be an alternative to # comments

Then tweets cannot start with a '/' (0x2F) character anymore. I don't think it's that much of a bother compared to what metadata storage can do, and I assume it's easier to parse than having to determine that the first character is a '#' and parse date and metadata altogether. Here you can just parse things naturally using the existing methods, and if the first character of the message is a '/', then store that line as metadata, not a tweet.
When I started thinking about storing metadata, I wondered: where would we store it once it's downloaded? Of course I thought of the Cache, but it isn't very generic, it was designed to store tweets, and adding metadata management to it requires some twisting of its current methods...
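A sketch of the parsing rule described above: split on the tab as for any tweet, then treat the entry as metadata if the message starts with '/'. The function name `split_entry` and the returned tuple shape are hypothetical.

```python
def split_entry(line):
    """Classify a feed line as ('metadata', ts, text) or ('tweet', ts, text)."""
    timestamp, separator, text = line.rstrip("\n").partition("\t")
    if not separator:
        raise ValueError("expected TIMESTAMP<TAB>text")
    if text.startswith("/"):
        # IRC-style command: keep everything after the slash as the payload.
        return ("metadata", timestamp, text[1:])
    return ("tweet", timestamp, text)
```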

@mdom
Contributor

mdom commented Mar 7, 2016

Though i still prefer the lines starting with comments, this would be also a fine choice. It's a good point that you wouldn't have to add special syntax. But i wonder how often users want to start tweets with /me or path names and then you need some kind of escaping mechanism... :/

@otherjoel

If this is the approach it would be better to use some uncommon Unicode character (e.g. ☞) instead of a slash.

@Benaiah
Author

Benaiah commented Mar 7, 2016

Maybe a vertical tab would work :P


@mdom
Contributor

mdom commented Mar 8, 2016

Maybe we can use C99 one-line comment syntax. Using // would be visibly distinctive, shouldn't be that common in normal tweets, and it feels like a rather nice fit for a service for hackers.

@archusr

archusr commented Mar 17, 2016

We could define one reserved word, as in:

timestamp     /twtxt action parameters

@DracoBlue
Contributor

If we take IRC, you cannot start your text with a slash, too.

If we need the date of the action, putting it into a normal message and prefixing it with / will work (with the drawbacks mentioned).

If we don't need the timestamp, there is no real reason to integrate it as some kind of special message. So we are at:

#nick dracoblue
TIMESTAMP\tmy post

again ;).

Since I really want to have metadata in the twtxt file, to finish the persistent storage for https://web.twtxt.org, it would be good to have a decision on this. /cc @buckket

@mdom
Contributor

mdom commented Mar 17, 2016

I would really like to have a defined order of metadata. For example, it would be really useful for the follow/unfollow commands, or you could define multiple twturls, where the last should be used for fetching but the other URLs could still be used for collapsing mentions etc.

# timestamp nick dracoblue

again? But I feel we have now iterated through all possible ways to define metadata multiple times ... :)

@adiabatic

If we take IRC, you cannot start your text with a slash, too.

Most mature IRC clients have a way of sending something that starts with a slash to a channel, whether by making the user write two slashes, press control-enter, or write /msg #twtxt /me is the command we're using.

What about

TIMESTAMP action

vs.

TIMESTAMP\tpost

to distinguish actions from posts? Namely, actions and metadata start with a space, while posts start with a tab.

@adiabatic

More ideas on TIMESTAMP action (as opposed to TIMESTAMP\tpost):

For a belt-and-suspenders approach, one could do

2016-03-17T21:16:56Z /PREFERREDNICK katabatic

That is, posts match "{}\t{}" whereas actions match "{} /{}" (in Python str.format() minilanguage)
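The two templates can be checked with one regular expression. This is only a sketch: it assumes timestamps contain no whitespace, and the names `ENTRY_RE` and `classify` are made up for illustration.

```python
import re

# Posts:   TIMESTAMP<TAB>text        (the "{}\t{}" template)
# Actions: TIMESTAMP<SPACE>/command  (the "{} /{}" template)
ENTRY_RE = re.compile(r"^(?P<ts>\S+)(?:\t(?P<post>.*)| /(?P<action>.*))$")


def classify(line):
    """Return ('post', ts, text), ('action', ts, text), or None if neither fits."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    if match.group("post") is not None:
        return ("post", match.group("ts"), match.group("post"))
    return ("action", match.group("ts"), match.group("action"))
```

With this rule a post that happens to begin with a slash stays a post, because only the separator (tab or space) decides the category.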

@archusr

archusr commented Mar 22, 2016

In the above comments are examples of lines to be parsed as ...

  • 0-5, 9 metadata
  • 6-6 indirect speech (IRC /me)
  • 7-8 plain messages
(0) timestamp /action parameters
(1) # timestamp     action parameters
(2) timestamp       /action parameters
(3) timestamp       // action parameters
(4) timestamp       # action parameters
(5) timestamp       /twtxt action parameters
(6) timestamp       /me likes this discussion
(7) timestamp       // drunk — fix later
(8) timestamp       # You are not expected to understand this.
(9) timestamp#action parameters

Looking at these, it seems we could/should identify metadata as 0, 2 or 5, with 5 being most strict? // edited to add 9

@DracoBlue
Contributor

@archusr thanks for summarizing!

I think (2) and (5) are good ways, too.

I implemented (2) in https://web.twtxt.org (and changed my https://dracoblue.net/twtxt.txt accordingly) but it is not a big deal to change it to (5).

@buckket what do you think?

@mdom
Contributor

mdom commented Mar 23, 2016

If we're leaning towards option two or five, I would prefer 5, as we wouldn't have to code special cases to prevent /me from disappearing. I'll change txtnix accordingly. @quite, @DracoBlue would you change your clients too? Could somebody with more Python chops add it to twtxt and send a PR?

@adiabatic

@DracoBlue What do you like about 2 and 5 that you don't like about 0? Because it uses a space instead of a tab, there's no way for a user to accidentally make an action that was supposed to be a post — and I like that.

@mdom
Contributor

mdom commented Mar 23, 2016

Overloading of whitespace is fragile. Look at make. I would even argue, that twtxt shouldn't care what kind and what amount of whitespace is between timestamp and text. Think about all the editors that are autoconverting tabs to spaces. But that's probably an issue for another time... :)

@DracoBlue
Contributor

@mdom Yep!

TIMESTAMP#action param

would be more explicit.

Actually 2+5 would be compatible with current clients.

So we implement

TIMESTAMP\t/twtxt action param1

In the alternative clients and somebody with python skills adds it with a PR to the official client?
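For reference, option (5) with the reserved word could be recognised roughly like this. It is a sketch only; `parse_reserved` and the returned dictionary keys are hypothetical, and a single space between the parts is assumed.

```python
RESERVED_PREFIX = "/twtxt "


def parse_reserved(line):
    """Return the action described by 'TIMESTAMP<TAB>/twtxt action params', else None."""
    timestamp, separator, text = line.partition("\t")
    if not separator or not text.startswith(RESERVED_PREFIX):
        return None  # ordinary tweet, including ones that start with /me
    action, _, params = text[len(RESERVED_PREFIX):].partition(" ")
    return {"timestamp": timestamp, "action": action, "params": params}
```

Clients that do not know the reserved word would simply display such a line as a normal tweet, which is why this variant stays compatible with them.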

@adiabatic

@mdom Makes sense. If you hate

TIMESTAMP /… …

then I'd suggest

#TIMESTAMP\taction

because there's still no way to accidentally make an action.

We could, of course, have one before-the-timestamp marker for actions and another before-the-timestamp marker for comments.

@adiabatic

TIMESTAMP#action

would be great. Are we sure we want to standardize on 2 or 5 for the backwards-compatibility concerns of three clients and six users, all of which can probably be updated in two hours total?

@mdom
Contributor

mdom commented Mar 23, 2016

@adiabatic I'm a big fan of the #TIMESTAMP\taction syntax. I just had the feeling that there was a movement for the irc style metadata. I think we just need to decide for one solution. @DracoBlue, @quite What about TIMESTAMP#action?

@DracoBlue
Contributor

Ok for me, too. Can somebody try how twtxt and current registries behave if this is in the feed?

@mdom
Contributor

mdom commented Mar 23, 2016

Let's find out. I just updated my twtxt.txt with both versions.

@DracoBlue
Contributor

http://twtxt.reednj.com/user/8c8d189d1c6f8810

Handles (0) like a normal "post". The others don't appear.

roster, registry and twtxt-ui ignore all versions in your posts.


@mdom
Contributor

mdom commented Mar 23, 2016

twtxt dies with a stacktrace when parsing (1), but ignores (0) and (9). Separating the timestamp from the metadata with a hash sign seems to be ignored by all clients. And we could still allow any kind of whitespace for normal tweets. 👍

@DracoBlue
Contributor

Ok.

So:

TIMESTAMP#action param1

is the final version?
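A sketch of how a client might read this form. It assumes the timestamp itself never contains '#' or whitespace and that parameters follow the action after a single space; `parse_hash_entry` is a hypothetical name.

```python
import re

HASH_ENTRY_RE = re.compile(
    r"^(?P<ts>[^\s#]+)#(?P<action>\S+)(?: (?P<params>.*))?$"
)


def parse_hash_entry(line):
    """Return (timestamp, action, params) for 'TIMESTAMP#action params', else None."""
    match = HASH_ENTRY_RE.match(line)
    if match is None:
        return None
    return match.group("ts"), match.group("action"), match.group("params") or ""
```

A normal tweet still has a tab right after the timestamp, so a '#' later in its text (a hashtag, say) does not turn it into metadata.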

@archusr

archusr commented Mar 23, 2016

Shall we vote? Until when? (Wait for >50% of 14 participants (=8) in this thread?) https://doodle.com/poll/gh27hhtixvbttvdp Result so far:

2016-03-24T20:15:00+01:00#action

@timofurrer
Contributor

Let's vote in here with the emojis ;)

@DracoBlue
Contributor

I think 4 votes is clear! ;)

@mdom
Contributor

mdom commented Mar 24, 2016

txtnix and twtxt-roster both support the new syntax.

@buckket
Owner

buckket commented Mar 24, 2016

I have a few questions here:

  • Do we always need a timestamp prepending metadata? There are plenty of time-insensitive use cases, e.g. the linkback URL, where a timestamp just doesn’t make sense. Having multiple values for the same key (with different timestamps) would require additional parsing work to figure out which value is the right one (i.e. the most recent). This just makes things more complicated, while providing little or no benefit overall.
  • What are the advantages of specifying time-sensitive actions? Reading through the issue I saw /me and /follow being proposed. I get that this opens up many possibilities, but I’m not a fan of adding something without knowing what we will end up using it for.
    • If you want to do a /me-style message just use *having a good day*, adding a new kind of message type, which then is displayed differently in the client is unnecessary.
    • Announcing one’s followings is something that doesn’t belong in the main twtxt feed. Its sole purpose is delivering twts, while also giving basic information about the source of those twts. Adding this would combine a metadata feed with a content feed. If you really want to share this information, a separate followings file is the place to do that.

After giving it some thought, I’d rather stick with a very simple, yet robust concept:

# Hello, this is a comment, it should be ignored.
# This is my twtxt feed, be welcome!
#
# NICK = buckket
# LINKBACKURL = http://example.org/linkback
# FOLLOWINGS = http://example.org/followings.txt
2016-02-25T18:11:02+01:00   Rather busy this week, will try to resolve some issues with twtxt soon!
2016-02-25T18:11:31+01:00   Especially the metadata situation needs some attention.

This way we can strip all the unnecessary metadata by removing lines starting with #, thus getting all the raw twts without much parsing work. E.g. by using: sed '/^#/ d'. That illustrates the idea and intention behind twtxt very well. Keeping everything so simple that you can modify, extract and use the data with simple shell commands. Other benefits are the easy parsing and the rather clear optical differentiation between content and metadata.
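For clients that are not shell scripts, the same idea might look like the sketch below: one pass that drops `#` lines to get the raw twts and keeps `KEY = VALUE` comments as metadata. The function name `split_feed` and the rule that keys contain no spaces are assumptions.

```python
def split_feed(text):
    """Return (metadata, twts): '# KEY = VALUE' comments and the remaining lines."""
    metadata = {}
    twts = []
    for line in text.splitlines():
        if line.startswith("#"):
            key, separator, value = line[1:].partition("=")
            key = key.strip()
            # Assumed: a metadata key is a single word; other comments are ignored.
            if separator and key and " " not in key:
                metadata[key] = value.strip()
        elif line.strip():
            twts.append(line)  # same effect as: sed '/^#/ d'
    return metadata, twts
```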

Another reason why it might be good having metadata at the top without having to go through the entire file: HTTP Range Requests. If you want to check only the metadata, request only the first x bytes, where x is a number big enough to house all relevant information.
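A sketch of such a request with the standard library. The helper name `build_metadata_request` and the default of 1024 bytes are assumptions; a server that honours the header answers 206 Partial Content, while one that ignores it simply returns the whole file.

```python
import urllib.request


def build_metadata_request(url, max_bytes=1024):
    """Build a GET request for only the first max_bytes bytes of a feed."""
    request = urllib.request.Request(url)
    # Byte ranges are inclusive, so the first N bytes are 0..N-1.
    request.add_header("Range", "bytes=0-{}".format(max_bytes - 1))
    return request
```

The request would then be passed to `urllib.request.urlopen`, and the client would parse whatever leading comment lines arrived in the response body.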

Sorry for not responding sooner.

@mdom
Contributor

mdom commented Mar 24, 2016

On Thu, Mar 24, 2016 at 08:45:55AM -0700, Felix Bayer wrote:

  • Do we always need a timestamp prepending metadata? There are plenty

I think prepending a time stamp makes things easier for twtxt clients as
we can still just append to the twtxt file. If we put the metadata in the
header, i have to rewrite the twtxt file every time metadata changes.
And i probably should flock the twtxt file then. Whereas appending is
atomic on linux up to 4k. And 512 bytes on most unixes.

And if we decide to not add a timestamp and in five weeks we find a
metadata where it would be really useful to add time information, the
ship has sailed. It would be nice to have the most general solution.

Maybe we can have an optional timestamp and in case it's missing we just
assume now() for ordering?
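The append-only workflow could be sketched as follows. The line layout `# TIMESTAMP key value` and the helper name `append_metadata_line` are assumptions, not an agreed format; when no timestamp is passed, the current UTC time is used.

```python
from datetime import datetime, timezone


def append_metadata_line(path, key, value, timestamp=None):
    """Append one '# TIMESTAMP key value' line to the feed without rewriting it."""
    if timestamp is None:
        # Assumed fallback: stamp the entry with the current time.
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    line = "# {} {} {}\n".format(timestamp, key, value)
    # Append mode adds to the end of the file and leaves earlier lines untouched.
    with open(path, "a", encoding="utf-8") as feed:
        feed.write(line)
    return line
```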

  • If you want to do a /me-style message just use *having a good day*, adding a new kind of message type, which then is displayed
    differently in the client is unnecessary.

I don't think anyone proposed that. The discussion was if the /command
syntax would make it impossible to use /me in the beginning of a tweet.
How to display the /me should be up to the client.

After giving it some thought, I’d rather stick with a very simple, yet robust concept:

# FOLLOWINGS = http://example.org/followings.txt
2016-02-25T18:11:02+01:00 Rather busy this week, will try to resolve some issues with twtxt soon!
2016-02-25T18:11:31+01:00 Especially the metadata situation needs some attention.

This way we can strip all the unnecessary metadata by removing lines
starting with #, thus getting all the raw twts without much parsing work. E.g. by using: sed '/^#/ d'.

I always liked the # ts metadata idea. Maybe with an optional ts and
no requirement to add it at the beginning of the file?

Another reason why it might be good having metadata at the top without
having to go through the entire file: HTTP Range Requests. If you want
to check only the metadata, request only the first x bytes, where x is
a number big enough to house all relevant information.

If we just append, we could remember the end of the last request and
only request new lines. But as the order of tweets is not defined and
users can change their twtfiles in the middle, this is not happening...
:)

@archusr

archusr commented Mar 24, 2016

Just a wild idea for now to keep it simple and open:

# This line some random comment.
# @nick mynick
# @nick[2016-03-24] mynick
# @followings url http://twtxt.org/followings.txt
# @followings json [{"url":"http://twtxt.org/twtxt.txt", "nick":"twtxt"}, {..}]
# @follow[2016-03-24T21:33:47+01:00] twtxt @<foo http://foo.bar>, @<eg http://eg.org>

i.e. parameter[optional date/timestamp] literal or datatype and value
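A tentative reading of this syntax as a regular expression. Since "literal or datatype and value" cannot be told apart without a list of known datatypes, the sketch leaves everything after the parameter name as one string; `PARAM_RE` and `parse_param` are hypothetical names.

```python
import re

# '# @name', an optional '[date or timestamp]', whitespace, then the rest.
PARAM_RE = re.compile(
    r"^#\s*@(?P<name>\w+)(?:\[(?P<stamp>[^\]]+)\])?\s+(?P<rest>.+)$"
)


def parse_param(line):
    """Return (name, stamp_or_None, rest) for an '@parameter' comment, else None."""
    match = PARAM_RE.match(line)
    if match is None:
        return None
    return match.group("name"), match.group("stamp"), match.group("rest")
```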

@DracoBlue
Contributor

one vs two files

If we want to put some meta data in an extra file: let's put most of the data in this extra file.

Having

# meta=https://dracoblue.net/twtxt.meta

at the beginning, would allow us to reuse the ini style of twtxts config with its content:

https://dracoblue.net/twtxt.meta:

[twtxt]
nick=dracoblue
twturl=https://dracoblue.net/twtxt.txt
[followings]
buckket=http://buckket.org/twtxt.txt

Having additional #twtxt.nick=dracoblue in the twtxt file to avoid the extra request, would be nice, but not really necessary.
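Since the meta file reuses the INI style, Python's standard `configparser` could read it as-is. A minimal sketch, with `parse_meta` as a hypothetical helper; interpolation is switched off so that '%' characters in URLs are left alone.

```python
import configparser


def parse_meta(text):
    """Parse INI-style twtxt.meta content into {section: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}
```

Note that `configparser` lower-cases option names by default, so nicknames used as keys in `[followings]` would lose their capitalisation unless `optionxform` is overridden.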

The advantage of this approach is that range requests would really be applicable, since the meta head wouldn't change at all, or not that often.

The information about followings and so on would be nice for "displaying" a profile page (like in twtxt-ui) and for having an officially supported way to store the information.

timestamp for meta

If I can see in my timeline at which time one of my followings started to follow somebody, it's quite nice ;). Having /me likes this resolved to * dracoblue likes this is nice in the client, but no problem if the client doesn't have this magic.

@smeagolthellama

are there any conventions about this stuff yet? Or, just in general, any progress?
