-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concept Mapping / Aliases #107
Comments
Thanks for starting this conversation Jim. Consolidating concepts is something that we've discussed a fair amount, but haven't yet addressed. We actually left an empty 'conceptid' column in the d_items table to act as a placeholder for the merged itemids. @parisni will be looking at consolidating concepts as part of his PhD and I think @mghassem knows of others working on it too. If you would like to make a start on the merging the concepts immediately, one option would be to create a new folder in the MIMIC Code Repository called something like "conceptmaps", and then add an SQL script to generate a table with itemids mapped to a set of consolidated ids. Versioning is an issue that we need to address across the repo, but I think adding a "last updated" column to the concept map would be a good start. |
@jeblundell @tompollard as per our email exchange, I've also begun this effort as well with the help of my collaborator Zack Lipton from UCSD and some folks at both Children's Hospital LA and Stanford. A TL;DR summary of my comments:
Dave |
@jeblundell @tompollard regarding contaminating the schema, versioning, etc.: I was leaning toward keeping my effort SEPARATE from the core MIMIC3 store. I'd like a community of researchers to be able to agree on and use a benchmark without imposing it on everyone else. That said, perhaps there is a way to have our cake and eat it, too. Maybe "outside" efforts creating task-specific mappings can just forge on ahead as they see fit but then a sanctioned "internal" effort (which might include some of the same people) can synthesize feedback from those efforts and create a canonical mapping that goes into the MIMIC3 store itself. |
Myself and @nickopotamus can certainly help from a physician point of view and some of the grunt work of getting some mappings done. I was planning on submitting a pull request with provisional approach and some input from @nickopotamus that could be built on/rejected by the team, so can do that and see where we go? |
@jeblundell happy to take a look at whatever you have. If you have a branch or pull request # or something on github, send it my way (I can only look -- no actual power over MIMIC repos). |
Some additional thoughts:
|
Thanks for the spreadsheet - looks like an epic amount of work you've done, so good job! Very much agree with your thoughts re: simple format and toolchain neutrality. It looks like we're gravitating towards CSV as straightforward as 2 columns of concept name and itemid. What've you got as your general schema at the moment for generating the data behind that spreadsheet? Our first pass is with an alias_id, but I can always adapt our SQL scripts to in such a way that the CSV is still useful, i.e. we'll try to get our alias name's to match up to whatever names you have in your spreadsheet so you don't have to put unnecessary work in (and will let you know if we need to make changes). Few of things from medicine POV:
I'll get working with @nickopotamus on finishing up our set of mappings and we can then try and merge them with yours to get a definitive first version list (then sort out the various discrepancies between us). Rather nice that we've effectively done the standard >1 person systematic review approach to make sure we cover all the bases properly! |
Btw, what does the mean/stdev correspond to? Definitely very sensible to check for crazy outliers (misbehaved units!), though am a little concerned that there's even a single patient with a pH <1! |
Thanks @jeblundell for including me in this thread; such a mapping was looking a mammoth task for my own work using MIMIC, but as a relative SQL novice I can probably contribute more to the clinical side here. Personally I'd handle venous and arterial values independently; there are subtle and not so subtle differences between central and mixed venous, capillary, and arterial gases. Once they're identified as such it would be straightforward for any clinical application which treats them as equivalent to then strip a value of pH from the available gases. The same applies to blood pressure measurements: flagging NIBP and IBP as independent variables allows comparisons between the two, which can be interesting in its own right, and it would be trivial to extract "a blood pressure" if all potential IBP and NIBP aliases are known. |
Hi, all! This sounds very similar to something I have been looking to get started on. And it seems like you have already put in a lot of good work. :) What kind of assistance could you use? "Grunt work" sounds like what I would be useful for at this stage. I do have access to Wikipedia, and perhaps also Google... Cheers, |
@aruberutou @nickopotamus if you have preferred Google Drive accounts, send them to me and I can give you access to my spreadsheet. |
@turambar Thanks! I have just sent an email to your listed account. Cheers, |
@turambar @nickopotamus @aruberutou I'd be strongly in favor of the first mapping being simple variable lookups, and I'm happy to be an additional reviewer on variables/item_id mappings. Side note - there's a rich literature on mapping signals to existing ontologies, and I can think of several people who would be very interested in contributing once the project moves that direction. |
I'm gonna give replying via email a shot. Let's proceed with editing my If you guys want to "unbundle" variables (invasive and noninvasive BP, pH, I will try to add medications, inputs, etc., to this in the next day or two. Please do start identifying new variables! I'm hoping you guys can help me My workflow tends to be to use text search (either browser or the more Two other steps:
Note that for my work, I also use minValue and maxValue to rescale all FYI, credit where credit is due: the terminology and structure of that spreadsheet is due to David Ledbetter and folks in the VPICU at Children's Hospital LA.
Dave |
Cheers Dave. I've just started looking through the work you've already done - a mammoth undertaking! I've come up with a couple of issues looking at bilirubin that it might be worth clarifying for continuity and ease of reference:
|
@nickopotamus thanks! By the way, you've taught me something new for Google Sheets (i.e., use of scripts and filters). re: bilirubin, good to know. I'm pretty sure I made similar mistakes on other labs, e.g., WBC. Would LOVE to be better understand the different types and our options for handling them (i.e., can we integrate constituent parts into total?). I think that given enough data, my model of choice (neural nets) could learn functions for combining them or, say, estimating the total Bilirubin even with some constituents missing. Likewise for, say, estimating a latent "true pH" from just venous or arterial pH. On a related note, I'd like to understand the nuances of measuring labs from the different body fluids. For example, what's the difference between blood, serum, pleural, ascites, urine, joint fluid, other body fluid, etc. re: nomenclature, do whatever you think is best for now. We can iterate from there. I'd err on the side of being descriptive. I know I used some abbreviations (a convention I adopted by the Physionet Challenge 2012 data set, which I used as a target list of variables), but I'm going to change these back to full names, I think. re: marking things as "not to be used" I suggest putting a "n" in the USE column. I've likewise changed "x" to "y." Blank entries are not yet decided. I also added a NOTE column at the end for adding explanations. |
@turambar I'm happy to trade medical information for code any time ;) I'll send you an email with a bit more about bilirubin over the weekend... I'll start going through the things you've already mapped in ernest next week before starting on new variables. Is it worth us also filling in the |
You MUST differentiate bili-serum and bili-CSF. They cannot be mixed up! |
Absolutely @rgmark; I suppose the question I'm asking is what's the best way to do this. Do we give each alias a descriptive name including the measurement and fluid (e.g. 'bilirubin-total-serum' and 'bilirubin-csf') or do we provide a fluid flag for each alias (e.g. name=bilirubin, fluid=blood and name=bilirubin, fluid=csf)? |
use LOINC codes for all labs |
Hi all, sorry for coming late to this game but I've been pondering a the problem for a while now. The best I could come up with is a tree structure that would help with aliasing. The issues outlined above with regards to PaO2, NIBP/BP, Bilirubin...etc should be solvable if we nest the aliases. E.g. top level for PaO2 could split into venous PaO2, arterial PaO2, central PaO2, Pulmonary PaO2 or if the researcher wishes to completely ignore the differences, they can just use the top level code with the aliases "flattened". Thoughts on this approach? |
Hi @ngageorange sound sensible, but using a tree raises some confusion as to what point you do the branching (using the bilirubin example, do you branch at total/direct/indirect, or which fluid?). I think the best approach is using LOINC codes as @rgmark suggested, which would then allow such trees to be generated as users wish or using @turambar's scripts. See https://search.loinc.org/ for examples. |
@rgmark @nickopotamus @ngageorange no objections to using LOINC codes for labs. No reason to re-invent an ontology where there already exists one -- assuming it is easy to use. Of course, learning such structure from the data would probably work even better (see https://arxiv.org/abs/1602.05568), but that's for future work! In seriousness, from my POV, I'm trying not to let the perfect be the enemy of the good here. My collaborators want to release a task benchmark data set ASAP, so I'm going to take what we have so far and codify it in an initial version. But I'll accept whatever updates you guys can make in the near future! |
If anyone on this thread would like an invitation to my private repository where I'm developing the python code to build benchmark data sets, let me know. |
Yes please @turambar! I've been meaning to catch up on this conversation... |
Yes please @turambar - I'd be happy to help also if you are still working on this! |
Is this still an active area? Or has the itemid mapping effort died done somewhat? |
@rustyBilges @Saqibm128 we've taken the initiative on this, resulting in the benchmark data set described in this manuscript and whose code is hosted here and maintained by the YerevaNN lab. We are very interested in having other folks contribute by helping us expand and enrich the benchmark. If you'd like to get involved head over to the benchmark repository and start a thread or send us a PR. In particular, here are areas of need:
|
@turambar Hi! I have browsed almost of the codes and paper that you mentioned before. And i have been focusing on the solution of the confusion of ID_ITEMS table which is definately a tedious and important work. So It looks like a great amount of work you've done and good job! |
I was wondering whether people had opinions on the contribution of supplementary data. I've started to try to analyse chartevents and it's a little frustrating to find out that PaO2 and PaCO2 have about 5 or 6 different item names each. It would be particularly helpful if we could come up with a way of aliasing them.
I would obviously very much prefer to not contaminate the mimiciii schema and keep that purely for the original MIMIC data, but perhaps we could add a contrib schema for data such as these? (Maybe contribiii to make it clear that it goes with mimiciii?)
contrib.aliases
alias_id
aliasname
contrib.d_items_aliases
row_id
alias_id (references contrib.aliases.alias_id)
itemid (references d_items.itemid)
This does introduce a whole host of potential issues, again relating to database versioning: I can hopefully do a sensible first-pass of various measures (PaO2, PaCO2, pH etc.), but there may come a point where another intensivist may turn around and say that actually "SpO2 XYZ" is actually not really the same as "SpO2 ABC" and "SpO2".
Would a possible solution in this case be to add some sort of version column to aliases? So that if someone decides to do their analysis with my incredibly flawed versions of aliases, then they can still replicate their data or (hopefully) choose the later version.
Another issue to consider with this sort of schema is whether there should be a unique mapping of items to aliases, i.e. could someone theoretically create a view that does SELECT * FROM chartevents and tacks on a column representing aliasname? I'm tempted to argue against this as it would potentially preclude versioning/other alias interpretations etc.
Thoughts?
The text was updated successfully, but these errors were encountered: