is this label correct? on case 00015 #21
Comments
Thanks for pointing this out. It seems that this is a fluid collection in the spleen (your visualization appears to be left-right flipped from the standard view) that was mistaken for a complex renal cyst, and therefore it is incorrectly labeled. Since the data is frozen for the 2019 challenge, we will release a corrected version after July 29.
@svabhishek29 Which software do you use to check the labels?
For this screenshot I used the Papaya online viewer; there is also the AMI viewer that you can use.
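If you'd rather inspect the labels locally instead of in a browser, a minimal sketch like the one below also works (assuming the standard KiTS19 layout of imaging.nii.gz and segmentation.nii.gz per case directory; the slice-first axis order and intensity window are assumptions and may need adjusting for your copy of the data):

```python
# Sketch only: load one KiTS19 case with nibabel and overlay its segmentation.
# Assumes data/case_00015/{imaging,segmentation}.nii.gz and slice-first axis order.
import nibabel as nib
import numpy as np
import matplotlib.pyplot as plt

case_dir = "data/case_00015"
img = nib.load(f"{case_dir}/imaging.nii.gz").get_fdata()
seg = nib.load(f"{case_dir}/segmentation.nii.gz").get_fdata().astype(np.uint8)

# Show the axial slice containing the most labeled voxels.
z = int(np.argmax((seg > 0).sum(axis=(1, 2))))

plt.imshow(img[z], cmap="gray", vmin=-200, vmax=400)   # rough soft-tissue window
plt.imshow(np.ma.masked_where(seg[z] == 0, seg[z]),    # 1 = kidney, 2 = tumor
           cmap="autumn", alpha=0.4, vmin=1, vmax=2)
plt.title(f"case_00015, slice {z}")
plt.axis("off")
plt.show()
```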
Hi @neheller,
@JunMa11 yes, this is allowed. You can find some previous discussion about this on the challenge forum.
Thanks for the question, @mehrtash. I spoke with our clinical chair and we went back to the pathology report. This is a mistake of a different sort -- that large (~10cm) feature that your crosshairs are on is actually a cystic renal tumor with extensive necrosis. It should have a label of "2". Due to the data freeze, this too will be fixed after the challenge.
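(Purely as an illustration, not the organizers' correction workflow: a participant who wants to patch a known error like this locally while waiting for the corrected release could do something along these lines. The case path and bounding box are placeholders to be determined by inspection, and the label convention of 0 = background, 1 = kidney, 2 = tumor is assumed.)

```python
# Illustrative only -- not the official correction workflow.
import nibabel as nib
import numpy as np

case_dir = "data/case_000xx"                      # placeholder for the affected case
seg_nii = nib.load(f"{case_dir}/segmentation.nii.gz")
seg = np.asarray(seg_nii.get_fdata(), dtype=np.uint8)

# Bounding box (slice, row, col) around the mislabeled feature, chosen by eye
# in a viewer; these numbers are made up for illustration.
zs, ys, xs = slice(120, 180), slice(200, 320), slice(150, 280)

roi = seg[zs, ys, xs]        # basic slicing returns a view into `seg`
roi[roi == 1] = 2            # re-assign the kidney/cyst-labeled voxels to tumor

nib.save(nib.Nifti1Image(seg, seg_nii.affine),
         f"{case_dir}/segmentation_patched.nii.gz")
```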
Hi @VariableXX, on quick inspection, I don't see any glaring issues. What about it concerns you?
That's a benign cyst. The solid tumor is correctly labeled. To be sure, I've verified this with the radiology and pathology reports.
Hi there, |
Thanks for the question, @FabianIsensee, I was hoping to have a discussion on this. We're certainly willing to do all that you described (with our existing workflow it won't take long), but is everyone in agreement that this is the best thing to do? Once new labels are out, the live leaderboard submissions will have an explicit advantage over the official leaderboard. The alternative would be to leave them as is until the KiTS20 data is released in March, which will include updated labels for the KiTS19 training images. That way the "fair" comparison would be preserved a little bit longer, and once the 2020 data is out, interest may shift to the new problem anyways. There seem to be pros and cons to each. What do you think?
Hi there,

(LiTS, for example, has a MICCAI leaderboard and an open leaderboard: https://competitions.codalab.org/competitions/17094#results)

My personal opinion is that the data should be updated. That is because I am not sure what your plan with KiTS2020 is, so I don't know how any of this fits into it.

Can you already share some details about KiTS2020? What is the general direction you will be taking? More data, a different segmentation task (explicitly also segmenting other structures)? Is there going to be a new test set? (That would be good, because people can basically use the open leaderboard to continue optimizing toward the current test set.)

If you are interested, I can share the predictions of my five-fold cross-validation. You can use them to double check against the training data as well.

Best,
I agree that the "official" leaderboard should be preserved, and aside from removing a few more teams who never addressed the "major revisions" comments on their manuscripts, I will be leaving that leaderboard alone.

Regarding your first comment, I'm inclined to agree, but there are varying degrees of ground truth errors. Those pointed out in this thread are cases where entire regions are missed (errors of intention, I like to call them), but there's also unavoidable "noise" in every other case due to the fact that these were hand-delineated. We can take steps to reduce this (e.g. through crowdsourcing and aggregation), but I don't think we'll ever be able to eliminate it completely, and I do think that the field should come to terms with this. It's a valid research direction to weigh algorithms operating on "high quality, low quantity" data against those designed for "high quantity, low quality", because this will always be a trade-off on the data collection side.

You're right that allowing teams to modify the data does introduce the possibility for some unfair advantages. We decided to allow this because the alternative is virtually unenforceable, but if you feel differently then please let me know; we're open to new policies for KiTS20.

In 2020, we're planning to (a) add more data, including some from sites other than UMN and a new test set, (b) decompose the tumor class into various subtypes of tumors (e.g. Clear Cell RCC, Papillary RCC, Oncocytoma, etc.; Benign Cyst will be its own class as well), and (c) release some basic demographics along with the imaging (e.g. age, gender, BMI). We think that this will improve the clinical utility of the work as well as make the segmentation problem much more challenging, since even radiologists cannot always determine subtype from CT alone. Further, the plan is to leverage crowdsourcing to improve our annotation throughput and hopefully (with aggregation) reduce label noise. We're also bringing on a radiologist to help ensure that label accuracy isn't compromised. There's a lot of work to do between now and then, and things might change, but these are our plans at the moment.

If you're willing to share your cross-validation predictions, I'd love to see them. I noticed that you pointed out a few cases in your manuscript where the predictions never agreed with the ground truth; I'll be looking into those.

I think you're right that we should make these fixes now. I think it's a nice idea to have "living" ground truth labels where maintainers are willing to address issues as they're raised, or maybe even accept "pull requests", like in open source software. It doesn't seem like this is a common practice yet, though.

Best,
Hi,
When I was talking about 'perfect' GT annotations, I was just referring to fixing the first kind of errors you mentioned (the big ones). There will of course always be a little noise, especially at the borders. I think that this is not a problem at all, though, especially because it will not interfere much with the target metric (Dice score). If the HD95 or something similar were chosen for evaluation, then this would be different. Anyway, when I looked at the data I did not even notice anything, and I am really pedantic when it comes to training data (I bothered the Decathlon team a lot about this :-) ). So yeah, don't worry about the noise. My concerns were about the big mislabeled regions (the distinction of lesions vs. cysts, for example).
I agree that allowing teams to modify the data is necessary (unless you enforce that they make their code public to reproduce results on the original dataset). But if this is allowed, then I think that known errors should be corrected in the KiTS data.

If I understand correctly, a living ground truth is one that is constantly updated when errors are found? I think this is a laudable goal and would very much like to see it. There will, however, be issues arising from that which should be addressed. Here are just some ideas:
1) there should be version numbers for the dataset (a rough sketch of such a version manifest is below),
2) each submission should be tagged with a data version number, and
3) evaluation should always be run on the latest test set for all submissions, regardless of what training data they were trained on (this is necessary because a change in the test set could result in a drop in performance, and algorithms tuned toward the old (wrong) test set would gain an advantage).

In an ideal world you could try to correct all major mistakes that may be present in the dataset at once. I will gladly help and provide my cross-validation niftis for this purpose. Ideally you could also ask other participants for theirs, so that you can identify cases where many of us had bad Dice scores.

ResUNet_modified_dataset.zip
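To make ideas (1) and (2) concrete, a rough sketch of such a version manifest could look like the following (this is not existing KiTS tooling; the file layout, version string, and field names are all assumptions):

```python
# Hedged sketch of a versioned label manifest: one checksum per segmentation,
# plus a version string that submissions could be tagged with.
import hashlib
import json
from pathlib import Path

DATA_ROOT = Path("data")     # assumed layout: data/case_XXXXX/segmentation.nii.gz
VERSION = "1.0.1"            # hypothetical version string; bump on every label fix

manifest = {
    "dataset": "kits19",
    "labels_version": VERSION,
    "cases": {
        case.name: hashlib.sha256(
            (case / "segmentation.nii.gz").read_bytes()
        ).hexdigest()
        for case in sorted(DATA_ROOT.glob("case_*"))
        if (case / "segmentation.nii.gz").exists()   # test cases have no labels
    },
}

Path(f"labels_manifest_{VERSION}.json").write_text(json.dumps(manifest, indent=2))
```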
That indeed sounds like a lot of work until then, but it's really interesting as well! I think adding more clinically relevant questions to the challenge is a great idea for obvious reasons, so I am all for it. If in the following I may sound a little negative, this is just because I am attempting to point you towards potential difficulties that may arise from that :-) Kind regards,
Hi Fabian,

No worries at all! I think the current labels for kidney are quite good (I'm not sure there's much improvement to be had beyond 0.985 Dice between observers), but I think there's ample room for improvement on the tumors. Personally, I would like to see an interobserver agreement of at least 0.95 here (~0.924 currently).
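For reference, a minimal sketch of how this per-class Dice agreement could be computed between two observers' delineations (label values assumed to follow the KiTS19 convention of 1 = kidney and 2 = tumor; the file paths are placeholders):

```python
# Sketch of per-class Dice between two observers' segmentations of one case.
import nibabel as nib
import numpy as np

def dice(a, b, label):
    """2 * |A & B| / (|A| + |B|) for the voxels carrying `label`."""
    a, b = a == label, b == label
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

obs1 = nib.load("observer1/case_00000_seg.nii.gz").get_fdata()  # placeholder paths
obs2 = nib.load("observer2/case_00000_seg.nii.gz").get_fdata()

print("kidney Dice:", dice(obs1, obs2, label=1))
print("tumor  Dice:", dice(obs1, obs2, label=2))
```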
Your understanding of the "living dataset" is what I had in mind, and the issues you raise are good ones. Unfortunately existing infrastructure is not amenable to this. For instance, on grand-challenge.org, changing all submissions to measure against new test data would require enormous manual effort. I'll be sure to include this in the long list of feature requests that I'm preparing for them.
For the KiTS19 training data, I agree completely. But when we release new data for KiTS20, some errors might take longer to catch than others. We're taking more steps to prevent these this time around, but some might still slip through. We will wait longer to "freeze" the data this time, since it's clear that a vast majority of teams don't even download the data until less than a month before the deadline.
Of course! I really do want all the feedback I can get. I think we as challenge organizers should be doing more to solicit community feedback, especially in the design phase.
The annotation process here is three-stage:
This is great feedback! I agree that segmentation+subtyping is more of an instance segmentation problem than semantic segmentation. I thought it might be a good direction because that seems to be the way the computer vision community as a whole is moving (e.g. COCO). I'm not opposed to splitting the challenge into two tasks, but I'm worried that the segmentation task will be trivially easy next year, especially with an expanded dataset. We may have to look into including other structures (e.g. ureter, renal artery/vein, tumor thrombus). Best,
I've finally gotten around to checking on those four cases in the training set that @FabianIsensee's models consistently disagreed with. Here are the verdicts:
Amended labels are on the way for these two errors as well as the other two mentioned previously (15 and 37). Thanks for everyone's patience on this.
The labels are now corrected with respect to all known errors. Feel free to reopen or submit a new issue with any further concerns.
Hi,
I found that in the interpolated scan of case 00015, the mask label 1 signifying kidney is extending beyond the kidney region; please see the image below.
Is this label correct?
Thanks