Prognostic Studies and ASReview #1547
-
Hi,

I will soon conduct a systematic review of prognostic studies, an area that, regrettably, faces the challenge of non-standardized terminology when describing prognostic methods and outcomes. I would like to know if anyone is aware of scholarly papers that have compared error rates when using machine learning (ASReview) across various types of studies, such as diagnostic accuracy, RCTs, and prognosis. I am asking because ASReview clusters similar words together, which could pose a significant challenge for prognostic studies given the lack of terminological uniformity. Any guidance would be immensely appreciated.

Regards,
Emanuel
-
Hi @Emanuel-1986!

We are currently working on publishing such a paper! I will update you as soon as the preprint is available.

As for your question, how words cluster depends on your choice of feature extractor. While it is true that TF-IDF clusters similar words together, this is not the case for doc2vec or sBERT: these models are much more context-dependent and have proven resistant to divergent terminology.

To simplify a little (a lot), these models analyze the contextual usage of every word relative to every other word. This means that if two terms exist for the same concept and are used in similar contexts, they will end up close together in the embedding space. This is in contrast to simpler techniques like TF-IDF, which rely on the frequency of individual terms and may not capture such nuances.

In practical terms, if you are dealing with a corpus that has a lot of domain-specific jargon or synonyms, more advanced models like doc2vec or sBERT may provide more accurate representations of the underlying semantic structure. These models can capture the semantic similarity between terms that are contextually similar, even if the terms themselves look nothing alike.

I hope this clarifies your question. Feel free to reach out if you have further inquiries.

JT
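To make the TF-IDF limitation concrete, here is a minimal, self-contained sketch (pure Python, with hypothetical toy "abstracts" I made up for illustration). It shows that two texts describing the same prognostic concept in divergent terminology get a low TF-IDF cosine similarity, because only the literally shared tokens contribute. This is a generic illustration of the point, not ASReview's actual pipeline.

```python
import math
from collections import Counter

# Hypothetical toy abstracts: same prognostic concept, divergent terminology.
doc_a = "predictors of survival outcome in cohort"
doc_b = "prognostic factors for mortality in cohort"

def tfidf_vectors(docs):
    """Compute smoothed TF-IDF vectors over the shared vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency of each vocabulary term.
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append([
            (tf[w] / len(toks)) * (math.log((1 + n) / (1 + df[w])) + 1)
            for w in vocab
        ])
    return vecs

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vec_a, vec_b = tfidf_vectors([doc_a, doc_b])
sim = cosine(vec_a, vec_b)
# Low similarity: only the shared tokens "in" and "cohort" contribute,
# even though the two texts describe the same concept.
print(f"TF-IDF cosine similarity: {sim:.2f}")
```

A contextual sentence-embedding model (e.g. via the sentence-transformers package) would typically score such a pair much higher, because phrases like "predictors of survival" and "prognostic factors for mortality" occur in similar contexts in its training data.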