Add correction for padding multiplier in "Verteilung TRL" #5

Open
daimpi opened this issue Jul 1, 2020 · 28 comments
@daimpi

daimpi commented Jul 1, 2020

As far as I understand, the plots in "Verteilung Transmission Risk Level (TRL) in Diagnoseschlüsseln" ("Distribution of Transmission Risk Level (TRL) in diagnosis keys") currently use the number of transmitted keys including the padded fake keys. As long as the padding factor stays the same, this shouldn't be a problem. But the factor will change from tomorrow on (the plan is to bring it down to 1 eventually). Changes in the padding multiplier will distort those graphs, because newer data will receive less weight.

My suggestion would be to use the data that has been corrected for this multiplier, as in the "Geteilte Diagnoseschlüssel von positiv getesteten Personen" ("Shared diagnosis keys from persons who tested positive") section.
@mh- has introduced an automatic detection for the multiplier used in the data set in his parsing tool: corona-warn-app/cwa-server#620 (comment)

@cfritzsche

Is the tool by mh- able to see if the padding change really applied for the whole daily package? The upload numbers for today seem to be quite high. If that is a real increase I am happy ;-)

@mh-

mh- commented Jul 3, 2020

> Is the tool by mh- able to see if the padding change really applied for the whole daily package? The upload numbers for today seem to be quite high. If that is a real increase I am happy ;-)

No, the new multiplier 5 was applied during the day, so for more correct values you would have to use the hourly key packages. 2 or 3 of these still used 10.

@micb25
Owner

micb25 commented Jul 3, 2020

> No, the new multiplier 5 was applied during the day, so for more correct values you would have to use the hourly key packages. 2 or 3 of these still used 10.

It was changed at 11 AM CEST (9 AM UTC; package 9; vide infra)

> Is the tool by mh- able to see if the padding change really applied for the whole daily package? The upload numbers for today seem to be quite high. If that is a real increase I am happy ;-)

Good news! I checked this twice and also uploaded the hourly packages. The number of users seems to be correct:

sum of hourly packages:
3+2+5+8+7+4+3+5 = 37 users


daily package:

37 user(s) found.
They submitted these numbers of keys:
4 user(s): 1 Diagnosis Key(s)
3 user(s): 4 Diagnosis Key(s)
1 user(s): 5 Diagnosis Key(s)
1 user(s): 6 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
1 user(s): 9 Diagnosis Key(s)
25 user(s): 13 Diagnosis Key(s)
80 keys not parsed (16 without padding).
37 / 4*1, 3*4, 1*5, 1*6, 1*7, 1*8, 1*9, 25*13

hourly package 6:

Length: 390 keys
Padding Multiplier detected: 10
3 user(s) found.
They submitted these numbers of keys:
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
3 / 3*13

hourly package 7:

Length: 260 keys
Padding Multiplier detected: 10
2 user(s) found.
They submitted these numbers of keys:
2 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
2 / 2*13

hourly package 9:

Length: 220 keys
Padding Multiplier detected: 5
5 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 4 Diagnosis Key(s)
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
5 / 1*1, 1*4, 3*13

hourly package 11:

Length: 200 keys
Padding Multiplier detected: 5
8 user(s) found.
They submitted these numbers of keys:
1 user(s): Invalid Transmission Risk Profile
2 user(s): 1 Diagnosis Key(s)
1 user(s): 3 Diagnosis Key(s)
2 user(s): 4 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
1 user(s): 13 Diagnosis Key(s)
Old Android app used by 1 user(s).
30 keys not parsed (6 without padding).
8 / 2*1, 1*3, 2*4, 1*8, 1*13 (1 old Android app(s))

hourly package 14:

Length: 345 keys
Padding Multiplier detected: 5
7 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 9 Diagnosis Key(s)
4 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
7 / 1*1, 1*7, 1*9, 4*13

hourly package 15:

Length: 195 keys
Padding Multiplier detected: 5
4 user(s) found.
They submitted these numbers of keys:
1 user(s): 1 Diagnosis Key(s)
1 user(s): 8 Diagnosis Key(s)
2 user(s): 13 Diagnosis Key(s)
20 keys not parsed (4 without padding).
4 / 1*1, 1*8, 2*13

hourly package 16:

Length: 195 keys
Padding Multiplier detected: 5
3 user(s) found.
They submitted these numbers of keys:
3 user(s): 13 Diagnosis Key(s)
0 keys not parsed (0 without padding).
3 / 3*13

hourly package 19:

Length: 155 keys
Padding Multiplier detected: 5
5 user(s) found.
They submitted these numbers of keys:
1 user(s): Invalid Transmission Risk Profile
1 user(s): 1 Diagnosis Key(s)
1 user(s): 5 Diagnosis Key(s)
1 user(s): 7 Diagnosis Key(s)
1 user(s): 13 Diagnosis Key(s)
Old Android app used by 1 user(s).
25 keys not parsed (5 without padding).
5 / 1*1, 1*5, 1*7, 1*13 (1 old Android app(s))
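
The hourly logs above can be cross-checked with a little arithmetic: each package length should equal the detected multiplier times the number of real keys (parsed plus unparsed). The sketch below hard-codes the figures from the logs; the tuple layout and the function names are illustrative only and are not part of the parser.

```python
# Figures copied from the hourly logs above:
# hour -> (length, multiplier, users, {keys_per_user: user_count}, unparsed_without_padding)
HOURLY = {
    6:  (390, 10, 3, {13: 3}, 0),
    7:  (260, 10, 2, {13: 2}, 0),
    9:  (220, 5, 5, {1: 1, 4: 1, 13: 3}, 0),
    11: (200, 5, 8, {1: 2, 3: 1, 4: 2, 8: 1, 13: 1}, 6),
    14: (345, 5, 7, {1: 1, 7: 1, 9: 1, 13: 4}, 0),
    15: (195, 5, 4, {1: 1, 8: 1, 13: 2}, 4),
    16: (195, 5, 3, {13: 3}, 0),
    19: (155, 5, 5, {1: 1, 5: 1, 7: 1, 13: 1}, 5),
}


def real_keys(groups, unparsed):
    """Number of real (unpadded) keys: parsed keys plus unparsed ones."""
    return sum(keys * users for keys, users in groups.items()) + unparsed


def is_consistent(length, multiplier, groups, unparsed):
    """True if the package length matches multiplier * real keys."""
    return length == multiplier * real_keys(groups, unparsed)


total_users = sum(users for _, _, users, _, _ in HOURLY.values())
all_consistent = all(
    is_consistent(length, mult, groups, unparsed)
    for length, mult, _, groups, unparsed in HOURLY.values()
)
print(total_users, all_consistent)  # 37 True
```

The check only confirms that the logged numbers agree with each other; it cannot tell whether the detected multiplier itself was right.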

@Tho-Mat

Tho-Mat commented Jul 3, 2020

I assume the package at 19:00 has only 3 users.
Old Android app submissions should no longer be possible once server version 1.0.9 is online:
corona-warn-app/cwa-server#640

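One user with 13 keys (1 July back to 19 June), one user with 12 keys (1 July back to 19 June, with no key for 24 June), and one user with 6 keys (1 July back to 26 June).

Or 4 users if no gap is allowed:
one user with 13 keys (1 July back to 19 June), one user with 7 keys (1 July back to 25 June), one user with 6 keys (1 July back to 26 June), and one user with 5 keys (23 June back to 19 June).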

@micb25
Owner

micb25 commented Jul 3, 2020

> I assume the package at 19:00 has only 3 users.
> Old Android app submissions should no longer be possible once server version 1.0.9 is online:
> corona-warn-app/cwa-server#640

This also affects package 11: one user with an "Invalid Transmission Risk Profile". So, there might be 2 users fewer for yesterday.

@Tho-Mat

Tho-Mat commented Jul 3, 2020

> It was changed at 11 AM CEST (9 AM UTC; package 9; vide infra)

I think you are right, but:
How do you know that? 2*5 = 10, so the packages at 06:00 and 07:00 could also have a multiplier of 5.

@micb25
Owner

micb25 commented Jul 3, 2020

> How do you know that? 2*5 = 10, so the packages at 06:00 and 07:00 could also have a multiplier of 5.

You are absolutely right. My claim was just based on the inspection of the hourly packages. I don't see any way to improve the estimated numbers for yesterday. Hopefully, we do not see these multiplier changes too frequently.
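
The ambiguity raised here can be illustrated with packages 6 and 7, which contain only 13-key profiles. Assuming padding simply adds `multiplier - 1` look-alike keys per real key, the package length alone fits both candidates. The function name and the candidate list are hypothetical.

```python
def possible_user_counts(length, keys_per_user, candidates=(5, 10)):
    """Return {multiplier: users} for every candidate multiplier that
    explains `length` with a whole number of identical users."""
    result = {}
    for multiplier in candidates:
        padded_keys_per_user = multiplier * keys_per_user
        if length % padded_keys_per_user == 0:
            result[multiplier] = length // padded_keys_per_user
    return result


print(possible_user_counts(390, 13))  # package 6: {5: 6, 10: 3}
print(possible_user_counts(260, 13))  # package 7: {5: 4, 10: 2}
```

Either 3 users at multiplier 10 or 6 users at multiplier 5 would produce the same 390 keys, so the package contents alone cannot settle which reading is correct.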

@Tho-Mat

Tho-Mat commented Jul 3, 2020

> > I assume the package at 19:00 has only 3 users.
> > Old Android app submissions should no longer be possible once server version 1.0.9 is online:
> > corona-warn-app/cwa-server#640
>
> This also affects package 11: one user with an "Invalid Transmission Risk Profile". So, there might be 2 users fewer for yesterday.

For package 11, I get 7 users if gaps are allowed and 8 users if they are not.

micb25 added a commit that referenced this issue Jul 3, 2020
Due to a padding multiplier change from 10 to 5 yesterday, the reported numbers of the daily package were incorrect. These values have been manually corrected by an analysis of the hourly packages.
@kai-truempler

I am not sure if this is the correct place, but you may have seen the Spiegel interview with Mr. Spahn (here, paywalled). He says:

SPIEGEL: How many infections have been entered in the app so far?
Spahn: We assume that around 300 infections have been reported via the app so far. That is the number of encryption codes issued by the hotline in order to warn others. For data protection reasons, we do not know more than that.

Do you think people would go through the trouble of calling the hotline and then not submit, or is there an issue with the padding factor calculation that leads to a result that is off by a factor of two?

@janpf

janpf commented Jul 3, 2020

My guess is that the discrepancy lies in the first day, since no one knows how the package from "2020-06-23" is actually padded.

@micb25
Owner

micb25 commented Jul 3, 2020

> Do you think people would go through the trouble of calling the hotline and then not submit, or is there an issue with the padding factor calculation that leads to a result that is off by a factor of two?

@kai-truempler: Thanks for sharing this. I totally agree and I would rather expect people not to call the hotline in case of a positive test (stigma, time, effort, etc.).

> My guess is that the discrepancy lies in the first day, since no one knows how the package from "2020-06-23" is actually padded.

@janpf: This might be an issue. However, I want to point out that every day there is a significant number of keys that are not parsed (vide infra).

Thus, I would expect the estimates from @mh-'s diagnosis-keys tool to be rather conservative (which I personally prefer). I may add a chart with these unparsed key numbers. In the end, it would be beneficial if these numbers were published by an official institution such as the RKI on a daily basis (in addition to the daily download counts).

2020-06-23.dat:89 keys not parsed (8 without padding).
2020-06-24.dat:30 keys not parsed (3 without padding).
2020-06-25.dat:50 keys not parsed (5 without padding).
2020-06-26.dat:150 keys not parsed (15 without padding).
2020-06-27.dat:250 keys not parsed (25 without padding).
2020-06-28.dat:40 keys not parsed (4 without padding).
2020-06-29.dat:100 keys not parsed (10 without padding).
2020-06-30.dat:160 keys not parsed (16 without padding).
2020-07-01.dat:290 keys not parsed (29 without padding).
2020-07-02.dat:80 keys not parsed (16 without padding).

@janpf

janpf commented Jul 3, 2020

Oh absolutely true, I forgot about those "keys not parsed"

What might be beneficial: on my dashboard I just changed to an hourly analysis, as suggested above by @mh-.
This means I check every hourly package and calculate the padding, number of keys, number of users etc. individually and then sum things up.

This way I'm currently at a total of 218 users and thereby off by a factor of 1.37 @kai-truempler ;)
And if we now consider "keys not parsed" and Mr. Spahn maybe rounding numbers a bit I think it's very hard to get closer to the real number.
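
The per-package approach described here can be sketched with the hourly lengths and detected multipliers quoted earlier in this thread for 2 July. Dividing each package by its own multiplier gives a different key total than applying a single multiplier to the whole day. The variable names are illustrative only.

```python
# (length, detected multiplier) of the hourly packages for 2020-07-02,
# copied from the hourly logs earlier in this thread.
PACKAGES = [(390, 10), (260, 10), (220, 5), (200, 5),
            (345, 5), (195, 5), (195, 5), (155, 5)]

# Per-package: normalise each package with its own multiplier, then sum.
per_package = sum(length // mult for length, mult in PACKAGES)

# Per-day: treat the whole day as if it used multiplier 5.
total_length = sum(length for length, _ in PACKAGES)
single_multiplier = total_length // 5

print(per_package, single_multiplier)  # 327 392
```

The difference of 65 keys comes entirely from the two packages that were still padded with a multiplier of 10.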

> In the end, it would be beneficial if these numbers were published by an official institution such as the RKI on a daily basis (in addition to the daily download counts).

Absolutely.

@Tho-Mat

Tho-Mat commented Jul 3, 2020

By parsing all keys you can get a minimum number of infected persons.
Theoretically, every single key could belong to a different person (the maximum).

If I count the minimum number of users who submitted keys, I get roughly 250 (23.06. to 02.07.).
So there may be 250 users => 300 = round(250;-2)
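
The `round(250;-2)` above is spreadsheet notation for rounding to the nearest hundred, with halves rounded up. A sketch of the same arithmetic in Python follows. Note that Python's built-in `round(250, -2)` returns 200, because it rounds halves to the nearest even multiple, so the helper below (a hypothetical name) rounds half up explicitly.

```python
def round_half_up_to_hundred(n):
    """Round a non-negative integer to the nearest hundred, halves up."""
    return (n + 50) // 100 * 100


min_users = 250  # approximate minimum for 23.06. to 02.07. from the comment above
print(round_half_up_to_hundred(min_users))  # 300
print(round(min_users, -2))                 # 200 (round half to even)
```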

@janpf

janpf commented Jul 3, 2020

And I'm back down to 188 as the parser just got updated: mh-/diagnosis-keys@104388c

@mh-

mh- commented Jul 3, 2020

Ok, maybe I could change the strategy, now that "old Android apps" cannot submit Diagnosis Keys anymore.
For this, it would be nice to understand what information you need from the parsing.

For example, just counting the number of users is very simple now: it only requires counting all keys with TRL 6, because every user will submit exactly one key with that TRL. (And of course dividing by the padding multiplier.)

The harder part is counting the number of keys per user, something that I wanted to do in order to find out whether keys can be linked together (violating the "non-linkability-across-multiple-day" promise).

So what exactly do you want from the parser?
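
A minimal sketch of the counting rule described in this comment, assuming each key has already been decoded to its transmission risk level and that padded keys carry the same level as the real key they pad. The function name and the plain list input are hypothetical and not the interface of the parser script.

```python
def count_users(risk_levels, padding_multiplier):
    """Estimate users as (number of TRL-6 keys) / padding multiplier."""
    sixes = sum(1 for trl in risk_levels if trl == 6)
    if sixes % padding_multiplier:
        raise ValueError("TRL-6 count is not a multiple of the multiplier")
    return sixes // padding_multiplier


# Three users with a 13-key profile, padded x5. Apart from the single
# TRL 6 per user, the profile values are made up for illustration.
profile = [6] + [8] * 3 + [5, 3] + [1] * 7
keys = profile * 3 * 5
print(count_users(keys, 5))  # 3
```

The estimate is only as good as the multiplier passed in; a chain whose TRL-6 key is missing would not be counted at all.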

@janpf

janpf commented Jul 3, 2020

Great idea counting the "6"s!
Gonna change to that later for the overall user count and most likely going to keep your "counting script" as is for the "number of keys published per user".

Update: did change it and now we're back up to ~200.
So still pretty far from the announced 300, but since there are only ~200 "6"s in the database this should be pretty reliable.

janpf added a commit to janpf/ctt that referenced this issue Jul 3, 2020
@mh-

mh- commented Jul 4, 2020

I added the option -n / --new-android-apps-only to the parser script. If you use this, this should decrease the number of unparsed keys.
However, in the near future it might become impossible to do correct counting, see the end of https://github.com/mh-/diagnosis-keys/blob/master/doc/algorithm.md for details.

@cfritzsche

> However, in the near future it might become impossible to do correct counting, see the end of https://github.com/mh-/diagnosis-keys/blob/master/doc/algorithm.md for details.

Just looking at the example you provided there, you can still at least provide the minimum user count. It may in fact be more users, each transmitting only random unconnected days, but if there are more "1"s or "6"s than one user can have, there must be at least two users.
You could collect the minimum user count per risk level (minimum number for the 1s, 6s etc) and then take the max() of them to come to the absolute minimum users generating these keys.
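
The suggestion above can be sketched as follows. The per-level maximum each user can contribute is an assumption here: only "exactly one TRL 6 per user" is stated in this thread, and the other values in `MAX_PER_USER` are placeholders that would need to be replaced with the real profile. The names are hypothetical.

```python
from collections import Counter
from math import ceil

# ASSUMPTION: maximum number of keys a single user can contribute per
# risk level. Only the entry for TRL 6 comes from this thread.
MAX_PER_USER = {6: 1, 8: 3, 5: 1, 3: 1, 1: 7}


def minimum_users(risk_levels, padding_multiplier=1):
    """Lower bound on users: per-level minimum, then the max over levels.

    Raises KeyError for a risk level missing from MAX_PER_USER.
    """
    counts = Counter(risk_levels)
    per_level = [
        ceil(n / padding_multiplier / MAX_PER_USER[trl])
        for trl, n in counts.items()
    ]
    return max(per_level, default=0)


# Two TRL-6 keys force at least two users; so do eight TRL-1 keys
# when one user can have at most seven of them.
print(minimum_users([6, 6, 8, 8, 8] + [1] * 8))  # 2
```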

@mh-

mh- commented Jul 4, 2020

Yes, in the example with the 14 keys, there must have been between 2 and 14 users. This is a wide range, though.

@cfritzsche

Ok, sure, but in almost all cases it will be the minimum number or very close to it. Which is good enough for the kind of analytics most are looking for.

@Tho-Mat

Tho-Mat commented Jul 5, 2020

Note: If you download the hourly/daily packages again, you will notice that their content changes.
It seems keys older than 14 days are deleted. Keys are also moved to other days.
60 keys of 24.06. have now been moved to 23.06., and 44 keys have been deleted from 23.06.
For 23.06., the hourly files have changed from 08, 13, 17 to 10, 15, 18.
So to get the right keys, you have to use the files downloaded on the day they were published.

@Tho-Mat

Tho-Mat commented Jul 5, 2020

I have made an Excel sheet and manually examined the keys.
I have taken into account that a device could be switched off for one or more days.
Nearly every key chain could be assigned.
Only the 23.06. 08:00 keys are not so clear.
01.07. 17:00 is the only one that contains a chain with no "6". I think the 6 was not submitted, or was deleted, because it was too old (17.06.).
In the end I get 219 (minimum) users who submitted keys. The maximum should be 241.

https://github.com/Tho-Mat/corona-stuff/blob/master/%C3%BCberblick.xlsx

@janpf

janpf commented Jul 5, 2020

> Note: If you download the hourly/daily packages again, you will notice that their content changes.
> So to get the right keys, you have to use the files downloaded on the day they were published.

Is there any information on why they would do this?

> In the end I get 219 (minimum) users who submitted keys.

Just by counting "6"s I get 231 with the "new" packages for 23./24. and 226 with the old ones.
And this method is still more of a lower bound, since it misses some, as you correctly pointed out:

> 01.07. 17:00 is the only one that contains a chain with no "6".

Update: I noticed you're doing a "per-key"-padding analysis, while I'm on a "per-package"-basis. That explains the differences. 👍

@Tho-Mat

Tho-Mat commented Jul 5, 2020

> Is there any information on why they would do this?

I think they want to reduce traffic, since it makes no sense to check keys that are older than 14 days.

@micb25
Owner

micb25 commented Jul 5, 2020

> Note: If you download the hourly/daily packages again, you will notice that their content changes.
> It seems keys older than 14 days are deleted. Keys are also moved to other days.

@Tho-Mat: Thanks for your comment. At first, I was already a little bit confused last night, because the old hourly packages were changed. My wrong assumption was that the clean-up of the keys older than 14 days is based on a package level and not on the individual key level.
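
A clean-up at the individual key level, as described here, could look roughly like the sketch below. It is not the cwa-server implementation: it assumes each key carries a rolling start interval number counted in 10-minute intervals since the Unix epoch (as in the Exposure Notification key format) and a 14-day retention period. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

INTERVAL_SECONDS = 600  # one rolling interval = 10 minutes


def interval(dt):
    """Convert a datetime to a rolling interval number."""
    return int(dt.timestamp()) // INTERVAL_SECONDS


def drop_expired(rolling_starts, now, retention_days=14):
    """Keep only keys whose start lies within the retention period."""
    cutoff = interval(now - timedelta(days=retention_days))
    return [r for r in rolling_starts if r >= cutoff]


now = datetime(2020, 7, 5, tzinfo=timezone.utc)
keys = [interval(datetime(2020, 6, day, tzinfo=timezone.utc))
        for day in (19, 21, 23)]
print(len(drop_expired(keys, now)))  # 2 (the 19 June key is dropped)
```

Because the filter looks at each key rather than at the package date, the content of an already published package can shrink over time, which matches the observation that old hourly packages changed.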

@kai-truempler

kai-truempler commented Jul 13, 2020

Just as an update to my previous comment, from Phoenix:

Lothar Wieler: "...[around] 500 teleTANs have been issued."

That looks closer to the estimate than the 300 from Mr. Spahn 10 days ago.

@micb25
Owner

micb25 commented Jul 15, 2020

> That looks closer to the estimate than the 300 from Mr. Spahn 10 days ago.

Fortunately, the RKI is publishing these numbers on a weekly basis. Thus, I have added another diagram for the published teleTANs last night. However, it is a single PDF which gets overwritten every week.

Looking at the number of issued teleTANs:
In one week (06/07-13/07), 125 teleTANs were issued. Over the same period, parse_keys.py counted 102 unique users based on the hourly package data, which gives a ratio (users counted vs. issued teleTANs) of about 82%. I'm very interested in where the larger error comes from (estimated users vs. people who got a teleTAN but did not share their keys). Furthermore, these statistics suggest that the intended way of sharing your keys, based on a lab test combined with a QR code, is insignificant at the moment.

stefanb pushed a commit to sledilnik/ctt that referenced this issue Oct 1, 2020
@daimpi
Author

daimpi commented Oct 23, 2020

I think this issue can be closed now that the padding multiplier is set to one on the server. @micb25 do you agree?


7 participants