
is this label correct? on case 00015 #21

Closed
svabhishek29 opened this issue May 2, 2019 · 19 comments

@svabhishek29

hi,

I found that in the interpolated scan of case 00015, the mask label 1 signifying kidney is extending beyond the kidney region; please see the images below.

[Screenshots attached: 2019-05-02, 8:33:28 PM and 8:33:53 PM]

Is this label correct?

thanks

@neheller

neheller commented May 2, 2019

Thanks for pointing this out. It seems that this is a fluid collection in the spleen (your visualization appears to be left-right flipped from the standard view) that was mistaken for a complex renal cyst, and therefore it is incorrectly labeled.

Since the data is frozen for the 2019 challenge, we will release a corrected version after July 29.

neheller self-assigned this on May 2, 2019
neheller added the Label Error label on May 2, 2019
@ShannonZ

@svabhishek29 Which software do you use to check the labels?

@svabhishek29

@svabhishek29 Which software do you use to check the labels?

For this screenshot I used the Papaya online viewer:
http://rii.uthscsa.edu/mango/papaya/

There is also the AMI viewer that you can use:
https://fnndsc.github.io/ami/#viewers_upload
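
As a rough programmatic alternative (an illustrative sketch, not an official tool), the labels can also be inspected with nibabel and matplotlib, assuming the data/case_XXXXX/imaging.nii.gz and segmentation.nii.gz layout of the kits19 repository (0 = background, 1 = kidney, 2 = tumor):

```python
# Sketch: overlay the segmentation on the CT for case_00015 and report label counts.
import nibabel as nib
import numpy as np
import matplotlib.pyplot as plt

case_dir = "data/case_00015"  # path assumed from the kits19 repository layout
img = nib.load(f"{case_dir}/imaging.nii.gz").get_fdata()
seg = nib.load(f"{case_dir}/segmentation.nii.gz").get_fdata().astype(np.uint8)

print("label counts:", dict(zip(*np.unique(seg, return_counts=True))))

# Show the axial slice with the most labeled voxels, mask overlaid in color.
z = int(np.argmax((seg > 0).sum(axis=(1, 2))))
plt.imshow(img[z], cmap="gray")
plt.imshow(np.ma.masked_where(seg[z] == 0, seg[z]), cmap="autumn", alpha=0.4)
plt.title(f"case_00015, slice {z}")
plt.show()
```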

@JunMa11

JunMa11 commented May 31, 2019

Hi @neheller,
Can participants manually refine the label of case 15 and then train a CNN with the refined label?

@neheller

@JunMa11 yes, this is allowed. You can find some previous discussion about this on the challenge forum.
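
(Purely as an illustration of such a manual refinement, assuming the kits19 repository layout; the voxel indices below are placeholders, not the actual erroneous region in case 15.)

```python
# Sketch: clear a mislabeled "kidney" region in the case_00015 mask and save a
# refined copy to train on. The bounding box is hypothetical.
import nibabel as nib
import numpy as np

seg_nii = nib.load("data/case_00015/segmentation.nii.gz")
seg = np.asarray(seg_nii.dataobj).astype(np.uint8)

z0, z1, y0, y1, x0, x1 = 40, 80, 100, 180, 300, 400  # placeholder region (z, y, x)
region = seg[z0:z1, y0:y1, x0:x1]
region[region == 1] = 0  # drop the wrong kidney label there, keep everything else

nib.save(nib.Nifti1Image(seg, seg_nii.affine, seg_nii.header),
         "data/case_00015/segmentation_refined.nii.gz")
```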

@mehrtash

Same question for case 37.

[Screenshot attached]

@neheller

Thanks for the question, @mehrtash. I spoke with our clinical chair and we went back to the pathology report. This is a mistake of a different sort -- that large (~10cm) feature that your crosshairs are on is actually a cystic renal tumor with extensive necrosis. It should have a label of "2". Due to the data freeze, this too will be fixed after the challenge.

@VariableXX

The label of case 206 seems incorrect...

@neheller

neheller commented Jul 8, 2019

Hi @VariableXX, on quick inspection, I don't see any glaring issues. What about it concerns you?

@VariableXX

Hi @VariableXX, on quick inspection, I don't see any glaring issues. What about it concerns you?

[Screenshots 1 and 2 attached]
It looks like a tumor at the position I am pointing to.

@neheller

neheller commented Jul 8, 2019

That's a benign cyst. The solid tumor is correctly labeled. To be sure, I've verified this with the radiology and pathology reports.

@FabianIsensee

Hi there,
I was just wondering when we could expect an update on the training data? There seem to be a few cases in here where a correction of the reference would be quite welcome.
In our paper we also pointed out a few cases where we were not sure what the correct segmentation was (https://arxiv.org/pdf/1908.02182.pdf). I would greatly appreciate it if you could have a look at those.
Lastly, I think it may also be a good idea to analyze the test set and see if there are some cases where the top-ranking teams (top 5?) consistently had bad Dice scores, just to double-check that all the segmentations in the test set are also correct.
I know this is a lot of work but I think it could improve the quality of the dataset even further.
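
(A rough sketch of this kind of check, not the official evaluation code: compute per-case tumor Dice for several teams' predicted masks against the reference and flag cases where everyone scores poorly. The directory names are hypothetical.)

```python
# Sketch: flag test cases where all sampled top teams have low tumor Dice.
import glob
import os
import nibabel as nib
import numpy as np

def dice(a, b):
    """Dice overlap of two boolean masks; defined as 1.0 if both are empty."""
    total = a.sum() + b.sum()
    return 1.0 if total == 0 else 2.0 * np.logical_and(a, b).sum() / total

gt_dir = "test_gt"                           # hypothetical reference-mask directory
team_dirs = ["team_a", "team_b", "team_c"]   # hypothetical top-team prediction dirs

for gt_path in sorted(glob.glob(os.path.join(gt_dir, "case_*.nii.gz"))):
    case = os.path.basename(gt_path).replace(".nii.gz", "")
    gt_tumor = nib.load(gt_path).get_fdata() == 2
    scores = [dice(nib.load(os.path.join(d, case + ".nii.gz")).get_fdata() == 2, gt_tumor)
              for d in team_dirs]
    if max(scores) < 0.5:  # everyone struggled -> worth a manual look at the label
        print(case, ["%.3f" % s for s in scores])
```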
Best,
Fabian

@neheller

Thanks for the question, @FabianIsensee, I was hoping to have a discussion on this.

We're certainly willing to do all that you described (with our existing workflow it won't take long), but is everyone in agreement that this is the best thing to do? Once new labels are out, the live leaderboard submissions will have an explicit advantage over the official leaderboard.

The alternative would be to leave them as is until the KiTS20 data is released in March, which will include updated labels for the KiTS19 training images. That way the "fair" comparison would be preserved a little bit longer, and once the 2020 data is out, interest may shift to the new problem anyways.

There seem to be pros and cons to each. What do you think?

@FabianIsensee

FabianIsensee commented Aug 23, 2019

Hi there,
I think there are several options you could consider. I think it would be a good idea to keep a leaderboard that reflects the MICCAI 2019 challenge standings. I am not sure if you intend to update http://results.kits-challenge.org/miccai2019/ with new submissions, but if not then just link this on the challenge homepage as the official KiTS2019 results page. Then, if you decide to update the data (both train and test), just add a statement to the live leaderboard that the results may not be entirely comparable to the MICCAI leaderboard. You can also encourage participants to rerun their algorithms on the new data and submit the results to the live leaderboard.

(LiTS for example has a MICCAI leaderboard and an open leaderboard: https://competitions.codalab.org/competitions/17094#results)

My personal opinion is that the data should be updated. That is because
a) As a participant I want to be able to feel confident about both my cross-validation score and the test score. If there are mistakes in the gt labels then it gets very difficult to compare algorithms, because you never know whether one algorithm did better because it is actually superior or whether it did better because it coincidentally made the same mistake as is in the ground truth data. Some challenge organizers like to argue that algorithms should be able to deal with faulty training data, but I disagree wholeheartedly. If you have wrongly labelled data, then the algorithm that just happens to hit the sweet spot wrt regularization will be winning. This is not going to advance the field very much, because all it does is encourage people to play with regularization, not actually improve the algorithms.
b) The KiTS rules state that you can modify the training data if you want. So if you are not updating the dataset, the team that has the best medical partners can get an edge over the others because they can manually verify the training data and correct it as needed. This would result in the team with the best data winning the leaderboard, and not the team with the best algorithm. Therefore, I think it would be good to give everybody an equal chance by providing 'perfect' training data.

I am not sure what your plan with KiTS2020 is, so I don't know how any of this fits into it. Can you already share some details about KiTS2020? What is the general direction you will be taking? More data, a different segmentation task (explicitly also segmenting other structures?)? Is there going to be a new test set? (That would be good, because people can basically use the open leaderboard to continue optimizing toward the current test set.)

If you are interested I can share the predictions of my five fold cross-validation. You can use them to double check against the training data as well.

Best,
Fabian

@neheller

I agree that the "official" leaderboard should be preserved, and aside from removing a few more teams who never addressed the "major revisions" comments on their manuscripts, I will be leaving that leaderboard alone.

Regarding your first comment, I'm inclined to agree, but there are varying degrees of ground truth errors. Those pointed out in this thread are cases where entire regions are missed (errors of intention, I like to call them), but there's also unavoidable "noise" in every other case due to the fact that these were hand-delineated. We can take steps to reduce this (e.g. through crowdsourcing and aggregation) but I don't think we'll ever be able to eliminate it completely, and I do think that the field should come to terms with this. It's a valid research direction to weigh algorithms operating on "high quality, low quantity" data against those designed for "high quantity, low quality", because this will always be a trade-off on the data collection side.

You're right that allowing teams to modify the data does introduce the possibility for some unfair advantages. We decided to allow this because the alternative is virtually unenforceable, but if you feel differently then please let me know, we're open to new policies for KiTS20.

In 2020, we're planning to (a) add more data, including some from sites other than UMN and a new test set, (b) decompose the tumor class into various subtypes of tumors (e.g. Clear Cell RCC, Papillary RCC, Oncocytoma, etc. Benign Cyst will be its own class as well) and (c) release some basic demographics along with the imaging (e.g. age, gender, BMI). We think that this will improve the clinical utility of the work as well as make the segmentation problem much more challenging, since even radiologists cannot always determine subtype from CT only. Further, the plan is to leverage crowdsourcing to improve our annotation throughput and hopefully (with aggregation) reduce label noise. We're also bringing on a radiologist to help ensure that the label accuracy isn't compromised. There's a lot of work to do between now and then, and things might change, but these are our plans at the moment.

If you're willing to share your cross-validation predictions, I'd love to see them. I noticed that you pointed out a few cases in your manuscript that never agreed with the predictions. I'll be looking into those.

I think you're right that we should make these fixes now. I think it's a nice idea to have "living" ground truth labels where maintainers are willing to address issues as they're raised, or maybe even accept "pull requests", like in open source software. It doesn't seem like this is a common practice yet, though.

Best,
Nick

@FabianIsensee

Hi,
apologies. I was quite busy these last few days and I try to not work on weekends ;-)

Regarding your first comment, I'm inclined to agree, but there are varying degrees of ground truth errors. Those pointed out in this thread are cases where entire regions are missed (errors of intention, I like to call them), but there's also unavoidable "noise" in every other case due to the fact that these were hand-delineated.

When I was talking about 'perfect' gt annotations I was just referring to fixing the first kind of errors you mentioned (the big ones). There will of course always be a little noise, especially at the borders. I think that this is however not a problem at all, especially because it will not interfere much with the target metric (Dice score). If the HD95 or something similar were chosen for evaluation then this would be different. Anyway, when I looked at the data I did not even notice anything, and I am really pedantic when it comes to training data (I bothered the Decathlon team a lot about this :-) ). So yeah, don't worry about the noise. My concerns were about the big mislabeled regions (the distinction of lesions vs cysts, for example).

You're right that allowing teams to modify the data does introduce the possibility for some unfair advantages. We decided to allow this because the alternative is virtually unenforceable, but if you feel differently then please let me know, we're open to new policies for KiTS20.

I agree that allowing the teams to modify the data is necessary (unless you enforce that they make code public to reproduce results on the original dataset). But if this is allowed then I think that known errors should be corrected in the KiTS data.

If I understand that correctly, a living ground truth is one that is constantly updated if errors are found? I think this is a laudable goal and would very much like to see this. There will however be issues arising from that that should be addressed. Here are just some ideas: 1) there should be version numbers of the dataset 2) each submission should be tagged with a data version number 3) evaluation should always be on the latest test set for all submissions, regardless of what training data they were trained on (this is necessary because a change in the test set could result in a drop in performance and algorithms using the old (wrong) test set would gain an advantage).
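
(One minimal way such versioning could look, purely as an illustration rather than an existing KiTS tool: a manifest recording a version string and a checksum per case file, which each submission could then reference.)

```python
# Sketch: write a versioned manifest with one SHA-256 checksum per segmentation file.
import glob
import hashlib
import json

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

manifest = {
    "dataset": "kits19",
    "version": "1.0.1",  # hypothetical version string
    "files": {p: sha256(p) for p in sorted(glob.glob("data/case_*/segmentation.nii.gz"))},
}
with open("kits19_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```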

In an ideal world you could try to correct all major mistakes that may be present in the dataset at once. I will gladly help and provide my cross-validation niftis for this purpose. Ideally you could also ask other participants for theirs so that you can identify cases where many of us had bad dice scores.
Here is a link to my files. I uploaded them for both the original version of the dataset as well as the dataset that I modified for our submission

ResUNet_modified_dataset.zip
UNet_Baseline_modified_dataset.zip
UNet_Baseline_original_dataset.zip

In 2020, we're planning to (a) add more data, including some from sites other than UMN and a new test set, (b) decompose the tumor class into various subtypes of tumors (e.g. Clear Cell RCC, Papillary RCC, Oncocytoma, etc. Benign Cyst will be its own class as well) and (c) release some basic demographics along with the imaging (e.g. age, gender, BMI). We think that this will improve the clinical utility of the work as well as make the segmentation problem much more challenging, since even radiologists cannot always determine subtype from CT only. Further, the plan is to leverage crowdsourcing to improve our annotation throughput and hopefully (with aggregation) reduce label noise. We're also bringing on a radiologist to help ensure that the label accuracy isn't compromised.

That indeed sounds like a lot of work until then, but it's really interesting as well! I think adding more clinically relevant questions to the challenge is a great idea for obvious reasons, so I am all for it. If in the following I may sound a little negative, this is just because I am attempting to point you towards potential difficulties that may arise from that :-)
If radiologists already have trouble distinguishing between the tumor subtypes (and rarely also make mistakes with cyst vs tumor), how will you leverage potentially unskilled crowdsourcing workers?
To me, the distinction of the tumor subtypes does not sound like a problem that is best addressed with semantic segmentation. For example, I think it will be quite common for segmentation methods to create segmentations where parts of a lesion are one subtype and other parts another. This will make it fairly hard to generate a subtype label for an entire lesion. This sounds more like a problem that could be addressed with instance segmentation/detection methods (such as Mask R-CNN) rather than semantic segmentation methods. I would also expect many teams to come up with a two-step pipeline where they first segment the tumor, then try to distinguish the subtypes with some kind of radiomics-based approach.
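
(A minimal sketch of that two-step direction, with entirely hypothetical toy inputs; a real pipeline would use a trained segmentation model and proper radiomics features rather than the random data and three features below.)

```python
# Sketch: split a binary tumor mask into lesions via connected components, then
# classify each lesion from simple per-lesion features.
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestClassifier

def lesion_features(image, tumor_mask):
    """One feature vector (voxel count, mean and std intensity) per connected lesion."""
    labels, n = ndimage.label(tumor_mask)
    feats = []
    for i in range(1, n + 1):
        vals = image[labels == i]
        feats.append([vals.size, vals.mean(), vals.std()])
    return np.array(feats)

# Toy data standing in for a CT volume and a predicted tumor mask.
rng = np.random.default_rng(0)
image = rng.normal(size=(64, 128, 128))
mask = rng.random((64, 128, 128)) > 0.999
X = lesion_features(image, mask)
y = rng.integers(0, 3, size=len(X))  # fake subtype labels, just to show the API
clf = RandomForestClassifier().fit(X, y)
print(clf.predict(X[:5]))
```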
My personal belief is that the popularity of the challenge could be reduced if you make the subtype segmentation the only task of the challenge. Instance segmentation is much less common (and popular) than semantic segmentation. In my experience, anything that goes further than the semantic segmentation part gets specialized quite quickly, resulting in a drop in the number of participants (see BraTS survival prediction vs segmentation). So my recommendation would be to split the challenge into two tasks: 1) semantic segmentation just like this year (tumor vs kidney (vs cyst?)), 2) tumor subtype classification (or segmentation). That way you get the best of both worlds.

Kind regards,
Fabian

@neheller

Hi Fabian,

No worries at all!

I think the current labels for kidney are quite good (I'm not sure there's much improvement to be had beyond 0.985 Dice between observers), but I think there's ample room for improvement on the tumors. Personally, I would like to see an interobserver agreement of at least 0.95 here (~0.924 currently).

If I understand that correctly, a living ground truth is one that is constantly updated if errors are found? I think this is a laudable goal and would very much like to see this. There will however be issues arising from that that should be addressed. Here are just some ideas: 1) there should be version numbers of the dataset 2) each submission should be tagged with a data version number 3) evaluation should always be on the latest test set for all submissions, regardless of what training data they were trained on (this is necessary because a change in the test set could result in a drop in performance and algorithms using the old (wrong) test set would gain an advantage).

Your understanding of the "living dataset" is what I had in mind, and the issues you raise are good ones. Unfortunately existing infrastructure is not amenable to this. For instance, on grand-challenge.org, changing all submissions to measure against new test data would require enormous manual effort. I'll be sure to include this in the long list of feature requests that I'm preparing for them.

In an ideal world you could try to correct all major mistakes that may be present in the dataset at once. I will gladly help and provide my cross-validation niftis for this purpose. Ideally you could also ask other participants for theirs so that you can identify cases where many of us had bad dice scores.

For the KiTS19 training data, I agree completely. But when we release new data for KiTS20, some errors might take longer to catch than others. We're taking more steps to prevent these this time around, but some might still slip through. We will wait longer to "freeze" the data this time, since it's clear that a vast majority of teams don't even download the data until less than a month before the deadline.

That indeed sounds like a lot of work until then, but it's really interesting as well! I think adding more clinically relevant questions to the challenge is a great idea for obvious reasons, so I am all for it. If in the following I may sound a little negative, this is just because I am attempting to point you towards potential difficulties that may arise from that :-)

Of course! I really do want all the feedback I can get. I think we as challenge organizers should be doing more to solicit community feedback, especially in the design phase.

If radiologists already have trouble distinguishing between the tumor subtypes (and rarely also make mistakes with cyst vs tumor), how will you leverage potentially unskilled crowdsourcing workers?

The annotation process here is three-stage:

  1. A radiologist quickly marks the rough boundaries of the tumor--we'll call this "guidance"
  2. Crowd workers use the imaging+guidance to make full voxel-wise annotations
  3. A medical student matches each annotated tumor to its description in the surgical pathology report, which includes its gold-standard subtype

To me, the distinction of the tumor subtypes does not sound like a problem that is best addressed with semantic segmentation. For example, I think it will be quite common for segmentation methods to create segmentations where parts of a lesion are one subtype and other parts another. This will make it fairly hard to generate a subtype label for an entire lesion. This sounds more like a problem that could be addressed with instance segmentation/detection methods (such as Mask R-CNN) rather than semantic segmentation methods. I would also expect many teams to come up with a two-step pipeline where they first segment the tumor, then try to distinguish the subtypes with some kind of radiomics-based approach.

My personal belief is that the popularity of the challenge could be reduced if you make the subtype segmentation the only task of the challenge. Instance segmentation is much less common (and popular) than semantic segmentation. In my experience, anything that goes further than the semantic segmentation part gets specialized quite quickly, resulting in a drop in the number of participants (see BraTS survival prediction vs segmentation). So my recommendation would be to split the challenge into two tasks: 1) semantic segmentation just like this year (tumor vs kidney (vs cyst?)), 2) tumor subtype classification (or segmentation). That way you get the best of both worlds.

This is great feedback! I agree that segmentation+subtyping is more of an instance segmentation problem than semantic segmentation. I thought it might be a good direction because that seems to be the way the computer vision community as a whole is moving (e.g. COCO).

I'm not opposed to splitting the challenge into two tasks, but I'm worried that the segmentation task will be trivially easy next year, especially with an expanded dataset. We may have to look into including other structures (e.g. ureter, renal artery/vein, tumor thrombus).

Best,
Nick

@neheller

neheller commented Dec 5, 2019

I've finally gotten around to checking on those four cases in the training set that @FabianIsensee's models consistently disagreed with. Here are the verdicts:

  • 23: Label Error
  • 68: Correct
  • 125: Label Error
  • 133: Correct

Amended labels are on the way for these two errors as well as the other two mentioned previously (15 and 37). Thanks for everyone's patience on this.

@neheller

The labels are now corrected with respect to all known errors. Feel free to reopen or submit a new issue with any further concerns.
