Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make ETL run in Google Big Query #65

Open
vojtechhuser opened this issue May 5, 2020 · 7 comments
Open

make ETL run in Google Big Query #65

vojtechhuser opened this issue May 5, 2020 · 7 comments

Comments

@vojtechhuser
Copy link
Contributor

Current code is Postgres SQL flavor specific.
(e.g., ::integer) in code.

To run on other platforms, notes how to do port this is needed.

@spfohl
Copy link
Contributor

spfohl commented May 5, 2020

The most straightforward way to do this might be through the BigQuery DBAPI: https://googleapis.dev/python/bigquery/latest/dbapi.html since it will likely allow for significant code reuse without having to write a lot of BigQuery "Standard SQL". However, I don't know the details of the ETL or the DBAPI well enough though to know how much additional work is required to make this happen. It may be just as much work to write the ETL from scratch to accommodate BigQuery

@vojtechhuser
Copy link
Contributor Author

Stanford team has done some work with this ETL and GBQ. Stay tuned for more details.

@spfohl
Copy link
Contributor

spfohl commented May 20, 2020

I'm going to add myself to that list of Stanford collaborators. I have worked on the ETL in the past, have the converted MIMIC-OMOP data in BigQuery, and have been working on ETL+modeling tools

@tompollard
Copy link
Member

Thanks @spfohl, we're looking to coordinate a single mapping, preferably within this repository.

The plan is then for the contributors to submit a well-described version of the output dataset to PhysioNet, to (1) allow the specific version used in a study to be clearly cited and (2) to avoid users having to build the OMOP version themselves.

We are in the process of identifying a technical lead for the work who can take responsibility for managing the development process (i.e. overseeing development work, code review, testing framework, etc). It would be good to chat if you have thoughts on this!

@spfohl
Copy link
Contributor

spfohl commented May 20, 2020

I'll sync with the others working with the BigQuery pipeline and see where I can best contribute and then follow up. I don't currently have the bandwidth to take on a leadership role here, but I am highly interested and motivated in broadly improving the quality and usability of this ETL since I am involved several on-going research efforts that would benefit from that

@tompollard
Copy link
Member

tompollard commented May 20, 2020

Sounds good, thanks @spfohl. We'll post updates as things develop.

We've found it difficult to decide how best to manage different multiple SQL dialects for other projects. If whoever becomes lead decides that BigQuery syntax is best, then maybe we just port this whole repo.

It's an interesting thought that there may be multiple MIMIC to OMOP mappings already out there and being used. If so, it would be an interesting study to explore how the choice of mapping contributes to the output of an analysis.

@spfohl
Copy link
Contributor

spfohl commented May 28, 2020

I worked with @jdposada and @PriyaDesai70 on a proof of concept to see how long it would take us to convert one SQL script to BigQuery syntax. It took about 90 minutes to convert the procedure_occurence script, but I imagine that further tables would be much faster. There are just a few simple patterns that need to substituted, and some like SELECT DISTINCT ON that required some more complicated logic. See here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants