Feed schema: Add metadata #4

Open
cyroxx opened this issue Mar 11, 2015 · 28 comments

@cyroxx

cyroxx commented Mar 11, 2015

As a developer of a canteen parser, I would find it useful to include some (optional) metadata about the canteen and the parser itself.

  • Is the canteen publicly accessible? Company canteens usually aren't, so I could give a hint about that.
  • Parser-related info/metadata (this is more technical and might or might not be shown to the actual end-user)
    • Source URL: As a parser developer, I could give a hint about where I got my information from.
    • Parser version
    • Developer info (Name, email address, and/or URL) - or, maybe even better, a URL to a website with more information about the parser or developer (this could be a link to Github). It should give enough information to contact the developer in case something is wrong with the parser. So far it is not possible to get this information via the API.
@mswart
Member

mswart commented Mar 15, 2015

I think this is a good proposal, but I am still thinking about the best way to implement it.

First, it is important to note that these pieces of information do not all have the same scope:

  • Accessibility and source URL are canteen-specific.
  • Parser version is parser-specific, i.e. in most cases the same for a group of canteens.
  • Developer information is developer- or parser-specific.

General developer information (name, email, URL) could be implemented as profile fields. The other developer information is specific to each parser. This information could, but need not, be delivered by the feed. As an author of multiple parsers, I prefer to include it in the feed.

The parser version is very technical metadata and should be included in the feed. As the parser already has the source URL, it is best to put this information in the feed as well.

The canteen accessibility is tricky: both another attribute in the developer admin interface and a feed attribute are reasonable.

If we work on meta information, I would think about a way to provide the current canteen metadata (address, name ...) via the feed. Maybe not by directly overwriting the old data, but at least in a semi-automatic way, with a notification to the developer if the information has changed.

In addition, I think we should start to add the parser as a separate entity to OpenMensa (a canteen is provided by a parser). This makes multiple OpenMensa workflows easier.

Last, we have to decide which information is displayed to the user. The parser version is of no interest to other users, is it? The developer information should be displayed to the user, but we should have approval from the user (especially for the name and email).

@jgraichen
Member

I would also like to think about a push API to allow "parsers" to push data to OpenMensa. On the long run this would free us from needing to implement fetch strategies and update pulls etc. "Parsers" could send data of any kind (meals, status information, meta information, etc.) using, e.g., the public HTTP API.

@mswart
Member

mswart commented Mar 15, 2015

I am strongly against dropping the pull API. The pull approach makes it very easy to write parsers (much of the logic is implemented by OpenMensa), and this should remain our goal.

We could think about an optional push API, to push metadata or trigger a new data fetch, but at the moment I really do not see a real advantage, or any parser that would use it.

@jgraichen
Copy link
Member

I'm not for dropping it now (or soon), just for thinking about adding a push API. Otherwise I agree with you about the information handling. I would additionally highlight the problem of how to display "restricted" canteens on the page. Should they be included in listings for apps by default?

@mswart
Member

mswart commented Mar 16, 2015

Your point "On the long run this would free us from needing to implement fetch strategies and update pulls etc" is only correct if we drop the pull API. And I am against this, even in the long run.

@jgraichen
Member

Then just consider it never said. Do we want to create separate issues for the different fields?

@cmur2
Member

cmur2 commented Mar 16, 2015

I'm generally in favor of accumulating more meta data esp. when it's useful to the users in case of more direct error reporting to the developer etc. (A push API sounds nice btw.)

@mswart
Member

mswart commented Mar 25, 2015

I have spent the last couple of days thinking about the best implementation. My proposal is as follows:

We extend the data model: separate parser information from canteen information.
The canteens table stores only the needed metadata about the canteen (name, position, city, state, but no parser URLs or fetch hour).
The parserInstances table stores the data about the parser instance (e.g. the parser "university of potsdam" for griebnitzsee): main URL, fetch data, author, state, last fetch time (maybe parser version), parser identifier ("Potsdam" in our example).
The canteenData table stores metadata proposals for canteens. Records will be created when the parser returns new metadata or (later) if a user proposes a correction (e.g. of the address). If a new record is created, the author is asked to decide whether the changes should be copied into the canteens table.

Currently I do not think that we need a separate table for the parser itself (for Potsdam, not a specific instance). But I will keep this in mind.

This way we can create canteens for which users ask for a parser (with a state "ask" or something like that). And later we can support multiple parsers per canteen (like a fallback/alternative parser).

Add metadata URL for parsers:
The metadata does not change as often as meal data. To be able to fetch the metadata with a lower frequency or at another time of the day, I would support a different URL for it (which could be the same URL, but need not be). I plan to fetch this metadata once at night. The returned metadata is put into a canteenData record if it is new. The metadata feed should also be able to return the menu URL and the today URL, which are written directly into the parserInstances record.

Add states for many tables:
I think states are helpful for canteens (ask, working, failing, broken, archived) and for parserInstances (new, working, failing, broken, disabled). Failing is a temporary error and broken is a (more or less) permanent error.
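
As a rough illustration of the proposed split, the three tables and their states could be modelled as below. This is only a sketch: the class and field names (`Canteen`, `ParserInstance`, `CanteenData`, `apply_proposal`) are hypothetical and are not taken from the OpenMensa code base; only the state names and the listed columns come from the proposal above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CanteenState(Enum):
    ASK = "ask"            # users asked for a parser, none exists yet
    WORKING = "working"
    FAILING = "failing"    # temporary error
    BROKEN = "broken"      # (more or less) permanent error
    ARCHIVED = "archived"


class ParserInstanceState(Enum):
    NEW = "new"
    WORKING = "working"
    FAILING = "failing"
    BROKEN = "broken"
    DISABLED = "disabled"


@dataclass
class Canteen:
    """Only metadata about the canteen itself, no parser URLs."""
    name: str
    city: str
    state: CanteenState = CanteenState.ASK


@dataclass
class ParserInstance:
    """One parser instance serving one canteen."""
    canteen: Canteen
    parser_identifier: str                    # e.g. "Potsdam"
    main_url: str
    author: str
    state: ParserInstanceState = ParserInstanceState.NEW
    last_parser_version: Optional[str] = None


@dataclass
class CanteenData:
    """A metadata proposal that the author still has to decide on."""
    canteen: Canteen
    changes: dict
    accepted: Optional[bool] = None           # None means undecided


def apply_proposal(proposal: CanteenData) -> None:
    """Copy the proposed values into the canteen only after approval."""
    if proposal.accepted:
        for key, value in proposal.changes.items():
            setattr(proposal.canteen, key, value)
```

Keeping proposals in their own record means a parser can report new metadata every night without silently overwriting what the author has curated.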

What to do with the parser version:
I think the parser version is very helpful information - e.g. I would love to get "the new parser version fixed the errors with Brandenburg" from OpenMensa. I would therefore store it with the created errors. As we do not get a parser version on errors, we need to store the last parser version within the parserInstance record. We could later also support extracting the parser version from a special HTTP header.

Those are my thoughts for now. Any questions or other proposals, @jgraichen, @kaifabian? Kai, you implemented basic address extraction in your parser, didn't you?

@mswart
Member

mswart commented Mar 27, 2015

In any case we need to extend the feed schema (v2). Kai and I discussed how to extend the feed and propose the following.

Example for Potsdam Griebnitzsee:

<openmensa>
  <canteen>
    <name>Mensa Griebnitzsee</name>
    <address>August-Bebel-Str. 89, 14482 Potsdam</address>
    <city>Potsdam</city>
    <contact type="phone">(0331) 977 3749/3748</contact>
    <location latitude="52.3935353446923" longitude="13.1278145313263" />
    <accessibility>privileged</accessibility>
    <feed name="today">
      <!-- cron like schedule information -->
      <schedule dayOfMonth="*" dayOfWeek="*" hour="8-14" retry="30m 1" />
      <url>http://kaifabian.de/om/potsdam/griebnitzsee.xml?today</url>
      <source>http://www.studentenwerk-potsdam.de/mensa-griebnitzsee.html</source>
    </feed>
    <feed name="full">
      <schedule dayOfMonth="*" dayOfWeek="1" hour="8" retry="1h 5 1d" />
      <url>http://kaifabian.de/om/potsdam/griebnitzsee.xml</url>
      <source>http://www.studentenwerk-potsdam.de/speiseplan/</source>
    </feed>
    <!-- day attributes -->
  </canteen>
</openmensa>

Example for Ulf:

<openmensa>
  <version>93.3</version>
  <canteen>
    <name>Ulf's Café (HPI Cafeteria)</name>
    <address>Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam</address>
    <contact type="phone">(0331) 5509-380</contact>
    <city>Potsdam</city>
    <location latitude="52.3932931010875" longitude="13.131183385849" />
    <accessibility>public</accessibility>
    <!-- day attributes -->
  </canteen>
</openmensa>

What do you think about it?
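
To get a feeling for how a consumer might read these elements, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the feed looks exactly like the Griebnitzsee example above, without an XML namespace; the function name `read_canteen_metadata` and the shape of the returned dictionary are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Griebnitzsee example from this thread.
FEED = """\
<openmensa>
  <canteen>
    <name>Mensa Griebnitzsee</name>
    <address>August-Bebel-Str. 89, 14482 Potsdam</address>
    <city>Potsdam</city>
    <contact type="phone">(0331) 977 3749/3748</contact>
    <location latitude="52.3935353446923" longitude="13.1278145313263" />
    <accessibility>privileged</accessibility>
    <feed name="today">
      <schedule dayOfMonth="*" dayOfWeek="*" hour="8-14" retry="30m 1" />
      <url>http://kaifabian.de/om/potsdam/griebnitzsee.xml?today</url>
      <source>http://www.studentenwerk-potsdam.de/mensa-griebnitzsee.html</source>
    </feed>
  </canteen>
</openmensa>
"""


def read_canteen_metadata(xml_text):
    """Collect the proposed metadata elements of the first <canteen>."""
    canteen = ET.fromstring(xml_text).find("canteen")

    location = canteen.find("location")
    position = None
    if location is not None:
        position = (float(location.get("latitude")),
                    float(location.get("longitude")))

    # Feeds are keyed by their name attribute, which serves as identifier.
    feeds = {}
    for feed in canteen.findall("feed"):
        schedule = feed.find("schedule")
        feeds[feed.get("name")] = {
            "url": feed.findtext("url"),
            "source": feed.findtext("source"),
            "schedule": dict(schedule.attrib) if schedule is not None else {},
        }

    return {
        "name": canteen.findtext("name"),
        "address": canteen.findtext("address"),
        "city": canteen.findtext("city"),
        "accessibility": canteen.findtext("accessibility"),
        "contacts": {c.get("type"): c.text for c in canteen.findall("contact")},
        "position": position,
        "feeds": feeds,
    }


meta = read_canteen_metadata(FEED)
```

All elements are optional in this sketch: a missing element simply yields `None` (or an empty dictionary), which matches the idea that the metadata is an extension rather than a requirement.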

@kaifabian
Member

I think it's great! ;)

If we had to implement that in openmensa/openmensa, we could use the following database structure (as Malte and I discussed just now):
[Image: openmensa-databaselayoutformetadata]

@cmur2
Member

cmur2 commented Mar 27, 2015

@mswart do you plan to have any requirements, especially regarding formatting, for the meta info delivered in feeds, like address or phone number? Or is it completely free text? There are dozens of ways humans format phone numbers, but only one I would like to accept (also, prepend 0049 for Germany, yes/no?)...

@jgraichen
Member

I would recommend enforcing E.123, maybe even limited to the international format only.

@jgraichen
Member

I'm not sure about some attribute vs tag usage.

Why address and city tags but <contact type="phone">? It could also be <contact type="address"> or <phone> (<email>?). And what is <feed name=X> used for? Do we want to have more than "today" and "full"? If yes, does the name matter anyway? Is it a title or just an identifier used by the developer, e.g. for debugging?

I'm also unsure about the wording of accessibility if you want to express whether a canteen is e.g. for employees only, as the term accessibility usually "refers to the design of products, devices, services, or environments for people with disabilities" (Barrierefreiheit). Unless you mean this. Then we would need a flag for whether a canteen is open to the general public (<public>?)

Personally I do not like the attribute scheme for <schedule>. I assume it should encode when and how often a feed should be queried? Imho it does not need to be written as so many attributes (humanized?); a simple single cron pattern and ISO 8601 time or period formats would be better. As of now I do not understand how "retry" is encoded.

@jgraichen
Member

If we want to support canteens in other countries it may also be good to have <country> similar to <city> for searching/indexing.

@cmur2
Member

cmur2 commented Mar 27, 2015

Do we want to make this an extension to v2 or call this v3 when it's done? We have to deal with missing data, any ideas?

@mswart
Member

mswart commented Mar 28, 2015

format restrictions: We have thought about restrictions on address and contact, but what is the point? This data is only displayed to the user. If we enforce a specific format, all parser developers are required to parse and reformat the telephone number from whatever format the canteen is using. I would rather have a telephone number in some (maybe only human-readable) format than no telephone number at all. And the address field currently has no restriction either.

So I would not enforce a restriction but recommend one or more formats in the documentation.

Yes, <contact type="email"></contact> should also be possible. And I think an address is special enough that we treat it separately. In addition an address is more like a location than contact information.

accessibility: I am open to a different word for accessibility, but if I implement something like this, I would support at least 3 different states: restricted, public and privileged. And I would prefer to have all metadata as child elements within the canteen, not some as attributes of the canteen tag.

country tag: At some point we should probably add a country flag, but I am not sure whether it is needed now: I mean the website is only in German.

feed name: The name identifier is on the one hand a description for the developer, but more importantly it is an identifier for better merging of new feed data into the current database. The idea was to allow the developer to define how many feeds he provides. Maybe a today feed hourly, the current week daily and the future only once a week. The main point is that a crash while parsing a later day/week should not influence the parsing of the current data.

schedule-tag: For the schedule, I prefer to have an XML representation that a human can easily understand by simply reading it, and the cron format is not that intuitive. In addition we thought about not supporting the minute flag. Can you give an example of what you mean by the "ISO 8601 time or period formats"? The retry attribute lists a time interval (maybe only in seconds without a suffix) and a retry limit. So you can say: retry 5 times in 5 hour intervals but afterwards only daily.

v2 or v3: All changes are extensions, so there is no need to create a v3 version, and the developers would only be a little more confused. I do not see any real problem with missing data. I mean we have to convert the current parsers, but that's all. If we get no metadata, we do not change anything. No problem.

@jgraichen
Member

On 2015-03-28 12:42, Malte Swart wrote:

Yes, <contact type="email"></contact> should also be possible. And I
think an address is special enough that we treat it separately. In
addition an address is more like a location than a contact information.

Still the question why not just <phone> and <email> or should the
type attribute be a free-text field?

accessibility: I am free for a different word for accessibility - but
I have implement something like this, I would support at least 3
different states: restricted, public and privileged.

Can you elaborate what "restricted, public and privileged" means?

feed name: The name identifier is on the one hand a some description
for the developer, but more important it is an identifier for better
merging new feed data into the current database. The idea was to allow
the developer to define how many feeds he provides. Maybe a today feed
hourly, the current week daily and the future only once a week. Main
point is that a crash parsing a later day/week should not influence the
parsing of the current data.

So, the main point is that the attribute text itself is only for the
developer? Or does a specific name implies a specific merge strategy?

schedule-tag: For the schedule: I prefer to have a XML representation
that is easy understandable as human by simple reading it and the cron
format is not that intuitive.

I personally cannot say I understand the attribute above "intuitive" and
I do not see the need for having a human readable XML feed, since it is
almost only read by machines.

In addition we thought about not
supporting the minute flag. Can you give an example what do you mean
with the "ISO 8601 time or period formats".

The ISO 8601 specifies format not only for date and times but also for
time periods, durations [1], intervals [2], repeating intervals [3] etc.
Using them means using a standard.

[1] https://en.wikipedia.org/wiki/ISO_8601#Durations
[2] https://en.wikipedia.org/wiki/ISO_8601#Time_intervals
[3] https://en.wikipedia.org/wiki/ISO_8601#Repeating_intervals

@mswart
Member

mswart commented Mar 28, 2015

On Saturday 28 March 2015 05:06:45 Jan Graichen wrote:

On 2015-03-28 12:42, Malte Swart wrote:

Yes, <contact type="email"></contact> should also be possible. And I
think an address is special enough that we treat it separately. In
addition an address is more like a location than a contact information.

Still the question why not just <phone> and <email> or should the
type attribute be a free-text field?

Because Kai and I both used the contact version. But I have no problem with
<phone> and <email>.

accessibility: I am free for a different word for accessibility - but
I have implement something like this, I would support at least 3
different states: restricted, public and privileged.

Can you elaborate what "restricted, public and privileged" means?

restricted: only for a limited group of people
public: everybody is treated the same - e.g. Ulf
privileged: everybody has access, but at least guests have a different status
(mostly for price) - e.g. most university canteens

feed name: The name identifier is on the one hand a some description
for the developer, but more important it is an identifier for better
merging new feed data into the current database. The idea was to allow
the developer to define how many feeds he provides. Maybe a today feed
hourly, the current week daily and the future only once a week. Main
point is that a crash parsing a later day/week should not influence the
parsing of the current data.

So, the main point is that the attribute text itself is only for the
developer? Or does a specific name implies a specific merge strategy?

It is in no way a merge strategy. It is an identifier, used to match feed tags
from the XML feed with previously created feed records in the database: e.g.
the new data (not meals! metadata like schedule ...) for feed "today" should
override the current "today" feed ...

Of course it is wise to choose descriptive names (from the developer's point of
view) for the feeds.

schedule-tag: For the schedule: I prefer to have a XML representation
that is easy understandable as human by simple reading it and the cron
format is not that intuitive.

I personally cannot say I understand the attribute above "intuitive" and
I do not see the need for having a human readable XML feed, since it is
almost only read by machines.

Of course there is no need. But we also name the canteen tag canteen and not c
or 1. If you do not provide a real reason against a readable version, I will
implement it in a readable way.
By the way: it may not be true for you, but at least parser authors are required
to process the XML feed manually (for debugging / checking the results ...).
If I have to choose between a compact cron-like syntax where I have to think
every time I read it and a descriptive way, I use the descriptive way.

In addition we thought about not
supporting the minute flag. Can you give an example what do you mean
with the "ISO 8601 time or period formats".

The ISO 8601 specifies format not only for date and times but also for
time periods, durations, intervals, repeating intervals etc.
Using them means using a standard.

Of course I love to use a standard, but only if it is applicable. The
standard only mentions absolute intervals (or did I miss something?), and we
need relative intervals. We could use the duration format, but currently I
prefer seconds only for the repeat intervals.
In addition, for me the cron concept is a kind of standard.

That is why I asked for an example! I cannot see how to use the standard
here in a meaningful way. So please get concrete, express the given examples
in your preferred way and stop throwing words and standards at me.

@jgraichen
Member

Therefore I asked for an example!

For example, "repeat 5 times at every hour starting 8:00 UTC" could be coded as R5/08:00Z/P1H. Or for retry 10 times at half a hour: R10/P0.5H.

The retry attributes lists a time interval (many only in seconds without a suffix) and a retry limit.

I'm not sure if I understood the retry attribute. Given an interval of 1h 5 1d: is the 1h 5 "retry 5 times at a 1h delay"? Given hour="8" it would try at 8:00, 9:00, 10:00, 11:00, 12:00 and 13:00? "Retry" means only in case of a failure (e.g. timeout, error), right?

What's the meaning of 1d at the end? If it is a series ("retry daily"), the example would retry on Monday at 8:00, 9:00, 10:00, 11:00, 12:00, 13:00, then Tue 13:00, Wed 13:00, Thu 13:00, Fri 13:00, Sat 13:00 and Sun 13:00 until next Monday 8:00, when a new run will be started? Or is it the "retry limit"? But isn't it already limited by "retry 5 times"?

<schedule dayOfMonth="*" dayOfWeek="*" hour="8-14" retry="30m 1" />

The attribute hour="8-14" implies a new run is started every hour between 08:00 and 14:00 (like R6/08:00Z/P1H)?

So it's like <schedule run="R6/08:00Z/P1H" retry="R1/PT30M" />. I'm still thinking about the second example, but this is at least an example as requested.

Different thing. How is scheduling interpreted? As "will run not before given time", but maybe after time passed?
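
To make the repeating-interval notation concrete, the following sketch expands a spec such as `R6/08:00Z/PT1H` into run times for a given day. Several things here are assumptions: strict ISO 8601 writes time-based durations with a `T` designator (`PT1H`, `PT30M`), so the pattern accepts both that and the looser `P1H` spelling used in this comment; `Rn` is read as "n runs in total, the first one at the start time"; and the function name `expand_runs` is invented. The start-less retry form (`R1/PT30M`) is not handled.

```python
import re
from datetime import date, datetime, time, timedelta, timezone

# R<count>/<HH:MM>Z/P[T]<hours>H<minutes>M, hours and minutes both optional.
_PATTERN = re.compile(r"^R(\d+)/(\d{2}):(\d{2})Z/PT?(?:(\d+)H)?(?:(\d+)M)?$")


def expand_runs(spec, day):
    """Return the UTC datetimes at which a run would start on `day`."""
    match = _PATTERN.match(spec)
    if match is None:
        raise ValueError(f"unsupported schedule spec: {spec!r}")
    count, hour, minute, dur_hours, dur_minutes = match.groups()

    step = timedelta(hours=int(dur_hours or 0), minutes=int(dur_minutes or 0))
    if step <= timedelta(0):
        raise ValueError("the duration must be positive")

    start = datetime.combine(day, time(int(hour), int(minute)),
                             tzinfo=timezone.utc)
    # Assumption: Rn yields n runs, beginning with the start time itself.
    return [start + i * step for i in range(int(count))]


runs = expand_runs("R6/08:00Z/PT1H", date(2015, 3, 30))
```

Under this reading, `R6/08:00Z/PT1H` gives runs at 08:00 through 13:00 UTC; whether the cron-like `hour="8-14"` should also include 14:00 is exactly the kind of detail that would need to be pinned down in the schema documentation.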

@cmur2
Member

cmur2 commented Mar 28, 2015

format restrictions: allowing a parser developer to simply output the raw contact information is a good point, I hadn't thought of that. My general attitude was to have metadata that is as clean as possible, since I imagined that a parser developer would have to extract and hardcode that information manually... @mswart is it plausible that this extraction can be automated given that, as you argued, e.g. the telephone number format is so unpredictable?

accessibility: I wonder whether it is actually useful to have the "privileged" state, since that information is somehow already present in the price information (only on a per-meal basis, ofc). And yes, a better name for this feature is necessary...

schedule-tag: I agree that cron is standard-like enough to be used, and a more verbose structure is always nice since it makes it easier to apply XML Schema constraints right in the definition (like a plausible hour range or '*', in addition to documentation); a combined format string like ISO 8601 is hard to decode at this level. I'm in favor of more verbosity since the overhead is negligible. Also I know cron but don't know ISO 8601 details (yet).

v2 or v3: Right, so we add it to v2.

@mswart
Member

mswart commented Mar 29, 2015

@cmur2 All this metadata is optional. You can still edit it directly for the canteen online. So the only point of serving it with the metadata feed is if you can extract it automatically.
It can be complex enough to find/extract the telephone number. As humans come up with endless ways of formatting one or multiple telephone numbers, I do not want to enforce a specific format, as it adds a new level of complexity (e.g. Google wrote a whole library only to normalize telephone numbers). I would love to have similarly formatted telephone numbers, but I would only recommend a format, not enforce it.

@jgraichen I get your idea. But I still do not see how to express schedule runs that differ from day to day (e.g. only on Mondays ...). If we needed to add an additional dayOfWeek attribute anyway, I prefer to use the cron-like syntax directly.

<schedule dayOfMonth="*" dayOfWeek="1" hour="8" retry="1h 5 1d" />

The idea is that the feed is first retried hourly. After 5 unsuccessful retries, OpenMensa should only retry once a day, as it is likely a permanent error. As no retry limit is given for the daily interval, it retries until the next regular fetch time.

I think two interpretations are reasonable. Yours: fetch at 13:00 on the following days (waiting one day after the last unsuccessful try from the "1h 5" interval). It would also be possible to wait for the next complete 1d duration from the original start time, i.e. fetch at 8:00 on the following days.

Which one to choose is not so important. Although I like the second one in this case, I think the first one is more intuitive and also easier to implement. So I would go with that one.

fetch times: Yes, the idea is that the specified times are the earliest times. So OpenMensa tries to fetch directly at this time, but, e.g. depending on the other fetch tasks, it could be later. This is more or less like today: OpenMensa fetches all required canteens at the full hour, so depending on the workload a canteen could be fetched at 8:25 instead of 8:00.
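
The first interpretation (each retry is scheduled relative to the previous failed attempt) can be sketched as follows. The parsing rules are an assumption drawn from this thread: tokens alternate between an interval and a retry count, intervals are in seconds unless suffixed with `s`, `m`, `h` or `d`, and a trailing interval without a count repeats until the next regular fetch. The function names are hypothetical.

```python
from datetime import datetime, timedelta

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_interval(token):
    """'30m' -> 30 minutes, '1d' -> one day, '90' -> 90 seconds."""
    if token[-1] in _UNIT_SECONDS:
        seconds = int(token[:-1]) * _UNIT_SECONDS[token[-1]]
    else:
        seconds = int(token)
    if seconds <= 0:
        raise ValueError("retry intervals must be positive")
    return timedelta(seconds=seconds)


def parse_retry(spec):
    """Return (interval, count) pairs; count is None for 'no limit'."""
    tokens = spec.split()
    steps = []
    for i in range(0, len(tokens), 2):
        count = int(tokens[i + 1]) if i + 1 < len(tokens) else None
        steps.append((parse_interval(tokens[i]), count))
    return steps


def retry_times(failed_at, spec, next_regular_fetch):
    """List the retry times after a failed fetch, assuming every retry
    fails too, stopping before the next regular fetch takes over."""
    times = []
    current = failed_at
    for interval, count in parse_retry(spec):
        done = 0
        while count is None or done < count:
            current = current + interval
            if current >= next_regular_fetch:
                return times
            times.append(current)
            done += 1
    return times


# Weekly feed: regular fetch on Monday 8:00, retry="1h 5 1d".
monday = datetime(2015, 3, 30, 8, 0)
times = retry_times(monday, "1h 5 1d", monday + timedelta(days=7))
```

For the weekly example this yields five hourly retries on Monday (9:00 to 13:00) followed by one retry at 13:00 on each of the following six days. The second interpretation would instead anchor the daily retries at the original 8:00 start time.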

@mswart
Member

mswart commented Mar 31, 2015

I created a PR (openmensa/doc.openmensa.org#9) for the required feed changes. Please check whether it matches our discussed version.

mswart added a commit that referenced this issue Mar 31, 2015
As discussed in #4 restructure the parser and feed model. This is
also a preparation to support metadata extracted from feeds.
@mswart
Member

mswart commented Apr 2, 2015

I think we missed one important piece of metadata: opening times! It's a bit tricky because we have a few questions to answer.

First question: how do we want to treat canteens with a lunch menu and a dinner menu? Currently they are two canteens. At the moment I recommend keeping it this way, so it is no problem to favorite only the lunch menu but not the dinner menu.

Second question: are opening times normal metadata that is specified centrally per canteen, and/or can they be specified per date within the normal feed?

Third question: do we specify opening times, menu times or both?

At least the general opening times should be central/meta data - e.g.:

<openingTime monday="11-14" tuesday="11-14" wednesday="11-14" thursday="11-14" friday="11-14" saturday="11-13" sunday="" />
<!-- or -->
<times type="opening">
  <weekday name="monday">11-14</weekday>
  <weekday name="tuesday">11-14</weekday>
  <weekday name="wednesday">11-14</weekday>
  <weekday name="thursday">11-14</weekday>
  <weekday name="friday">11-14</weekday>
  <weekday name="saturday">11-13</weekday>
  <weekday name="sunday"></weekday>
</times>

With the closed tag we already have a way to override opening times to some extent. I think the most important aspect is closing a canteen irregularly; do we want to support specifying irregular opening times as well?

@kaifabian @jgraichen opinions? I am very unsure what the best way is, or whether to postpone some or all of the decisions (e.g. no opening time override for now).

@kaifabian
Member

Concerning the opening times, I would recommend an XML element such as

<open>8-14</open>

as a sub-element of a day. Admittedly, this makes feeds contain even more redundant information, but it would also allow feed providers to specify deviations from the usual opening schedule. The user is most likely interested in exactly that information (not when the canteen is usually open, but when it is open on a given date).

Another point in favor of this proposal is that it would fit the style we already set with the

<closed />

element.

@jgraichen
Member

I like both ideas. Having a global <open> tag with e.g. <monday>11-14</monday> allows easy configuration, and an optional <open> tag within the day allows simple overriding. The <times> tag for different kinds of times looks promising too.

I cannot say how hard it would be for parsers and developers but it looks good.
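
A sketch of how the two proposals could combine on the consumer side: a per-day `<closed />` or `<open>` wins, otherwise the global `<times type="opening">` entry for that weekday applies. It assumes that days are written as `<day date="YYYY-MM-DD">` and that hours use the simple `11-14` form from the examples above; the function names are invented, and `None` stands for both "closed" and "unknown" here.

```python
import xml.etree.ElementTree as ET
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

CANTEEN = ET.fromstring("""\
<canteen>
  <times type="opening">
    <weekday name="monday">11-14</weekday>
    <weekday name="tuesday">11-14</weekday>
    <weekday name="friday">11-14</weekday>
    <weekday name="saturday">11-13</weekday>
    <weekday name="sunday"></weekday>
  </times>
  <day date="2015-04-04"><open>8-14</open></day>
  <day date="2015-04-06"><closed /></day>
</canteen>
""")


def parse_hours(text):
    """'11-14' -> (11, 14); empty or missing text -> None."""
    text = (text or "").strip()
    if not text:
        return None
    start, end = text.split("-")
    return int(start), int(end)


def opening_hours(canteen, day):
    """Per-day override first, global weekday entry as fallback."""
    for day_el in canteen.findall("day"):
        if day_el.get("date") == day.isoformat():
            if day_el.find("closed") is not None:
                return None
            override = day_el.find("open")
            if override is not None:
                return parse_hours(override.text)

    times = canteen.find("times[@type='opening']")
    if times is None:
        return None
    name = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
    for weekday in times.findall("weekday"):
        if weekday.get("name") == name:
            return parse_hours(weekday.text)
    return None
```

With this lookup order, parsers that never emit per-day overrides still work, and those that do only need to add `<open>` on the days that deviate from the usual schedule.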

@cmur2
Member

cmur2 commented Apr 3, 2015

Maybe I'm lacking imagination but I don't think that special opening times are so common. I would stick with the global declaration.

mswart added a commit that referenced this issue Apr 5, 2015
As discussed in #4 restructure the parser and feed model. This is
also a preparation to support metadata extracted from feeds.
@mswart
Member

mswart commented Apr 17, 2015

The current state of implementation is as follows:

  • feed is extended (v2.1)
  • update is extended to work with (v2.1) feeds
  • data model is enhanced with parser, source, feed
  • developer can manage parser, sources, feeds manually
  • developer can manage source/feeds by meta data feed and index URL
  • extend canteen meta data via email and telephone number
  • the developer mailer is extended (e.g. with helpful summary subject lines)
  • user can submit error reports for canteens
  • user can propose (meta) data changes
  • user can report canteens that need a parser
  • implement source list synchronization
  • implement source synchronization
  • adjust fetch crontab for new improved scheduling
  • extend parser mailer to messages for source and parser
  • view error reports
  • accept/merge data proposals
  • improve become developer workflow (documentation, plus explicit action)
  • parser info status page for developer
  • error/messages page for developer
  • handle archived canteens (and the belonging sources and feeds) correctly
  • support editing canteens / open canteen from parser#show
  • support developer to provide more information about himself like public email address, URL ...
  • add parser box to canteen#show page to provide information about author and parser (if data are provided/wanted)
  • support marking parser as maintainer wanted ...
  • mark canteens automatically as empty
  • display / edit opening times
  • style text area on report error page
  • build nice interface (js?) to set opening times / support opening times corrections
  • support not to correct/adjust canteen position on new data proposals
  • convert "" to nil on most new fields

@mswart mswart self-assigned this Apr 19, 2015
mswart added a commit that referenced this issue Apr 20, 2015
* Rename error_report to feedback
* Extend developer information by public name, email, info url
* Add maintainer wanted flag to parser
* New parser info box with developer information and if wanted
  maintainer request
@mswart mswart mentioned this issue Jun 10, 2015
27 tasks
@mswart
Member

mswart commented Jul 7, 2015

@cyroxx All proposed metadata is implemented/standardized within the feed v2.1 (availability, source URL, parser version, information about the developer), but currently only the information about the developer is displayed.
