
Can you provide the metrics evaluation scripts for the WORD and FLARE datasets? #14

Open
OeslleLucena opened this issue Oct 13, 2023 · 4 comments

Comments

@OeslleLucena

OeslleLucena commented Oct 13, 2023

I am trying to reproduce the STU-Net results and would like to have the same evaluation scripts you used to compute the Dice score for the WORD and FLARE datasets. I would appreciate it if the authors could make these scripts available.
Best

@Ziyan-Huang
Collaborator

Dear @OeslleLucena,

Thank you for reaching out. We primarily calculated the Dice Similarity Coefficient (DSC) for each class. You should be able to find relevant code for this quite easily. For instance, the official FLARE repository contains scripts for computing the DSC. We recommend checking there as a starting point.
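
For reference, here is a minimal sketch of per-class DSC computation. It is not the official FLARE script; it assumes the ground truth and prediction are integer label maps of the same shape, loaded with nibabel, and the file names are placeholders.

# Minimal per-class Dice sketch (not the official FLARE evaluation script).
import numpy as np
import nibabel as nib

def dice_per_class(gt, pred, labels):
    """Return {label: DSC} for each requested integer label."""
    scores = {}
    for lab in labels:
        gt_mask = gt == lab
        pred_mask = pred == lab
        denom = gt_mask.sum() + pred_mask.sum()
        if denom == 0:
            scores[lab] = float("nan")  # label absent from both volumes
            continue
        scores[lab] = 2.0 * np.logical_and(gt_mask, pred_mask).sum() / denom
    return scores

# hypothetical file names, shown only to illustrate usage
gt = nib.load("case_0001_gt.nii.gz").get_fdata().astype(np.int64)
pred = nib.load("case_0001_pred.nii.gz").get_fdata().astype(np.int64)
print(dice_per_class(gt, pred, labels=range(1, 14)))  # e.g. the 13 FLARE22 classes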

Best regards,

Ziyan Huang

@OeslleLucena
Author

Dear @Ziyan-Huang

Thank you for your response. Apologies on my side, I don't think I made myself clear enough. What I meant is that STU-Net outputs segmentations for all labels from the TotalSegmentator dataset, and I would like to know how these labels were selected and merged when comparing against the ground truth for the WORD and FLARE datasets. For example, the WORD dataset has 16 labels; some of them, such as the liver, are easy to match, but the rest differ somewhat from the TotalSegmentator ones. Hope that is clear enough. Many thanks in advance,

@blueyo0
Contributor

blueyo0 commented Oct 25, 2023

Hi, @OeslleLucena

For WORD, we selected the 13 out of 16 classes that overlap with TotalSegmentator for inference and metric calculation; for FLARE22, all 13 categories were evaluated. You can refer to the appendix of our arXiv paper for details. To clarify the details and help you conduct experiments and reproduce the results, we will release the code for direct inference soon.

Here are two simple Python dicts showing which categories are selected; more details will be clarified soon 😉.

# WORD: 13 of 16 classes are kept; classes 11-13 (intestine, adrenal, rectum) are excluded
Task560_WORD_sys = {
    "1": "liver",
    "2": "spleen",
    "3": "kidney_left",
    "4": "kidney_right",
    "5": "stomach",
    "6": "gallbladder",
    "7": "esophagus",
    "8": "pancreas",
    "9": "duodenum",
    "10": "colon",
    # "11": "intestine",
    # "12": "adrenal",
    # "13": "rectum",
    "14": "urinary_bladder",
    "15": "femur_left",
    "16": "femur_right",
}

# FLARE22: all 13 classes are evaluated
FLARE22_sys = {
    "1": "liver",
    "2": "right kidney",
    "3": "spleen",
    "4": "pancreas",
    "5": "aorta",
    "6": "IVC",
    "7": "RAG",
    "8": "LAG",
    "9": "gallbladder",
    "10": "esophagus",
    "11": "stomach",
    "12": "duodenum",
    "13": "left kidney",
}
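
For illustration only, here is a hedged sketch of how such a mapping could be applied to remap a prediction volume into the ground-truth label space before computing Dice. The TotalSegmentator class indices below are placeholders, not the real ones; only the remapping pattern is the point.

import numpy as np

# hypothetical mapping: predicted (TotalSegmentator) label id -> WORD ground-truth label id
TOTALSEG_TO_WORD = {
    101: 1,   # liver        -> WORD "1" (placeholder source id)
    102: 2,   # spleen       -> WORD "2" (placeholder source id)
    103: 3,   # kidney_left  -> WORD "3" (placeholder source id)
    104: 4,   # kidney_right -> WORD "4" (placeholder source id)
    # ... remaining selected classes
}

def remap_prediction(pred, mapping):
    """Map predicted class ids onto the ground-truth label space; unmapped ids become background (0)."""
    out = np.zeros_like(pred)
    for src, dst in mapping.items():
        out[pred == src] = dst
    return out

The remapped volume can then be scored per class against the WORD or FLARE22 ground truth with any standard per-class Dice routine.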

Hope my answer can help you.

@OeslleLucena
Author

Hi @blueyo0, thank you so much for the details. Looking forward to the code for direct inference. Best!
