Clinical Trial Data Analysis

The clinicaltrials.gov website provides bulk data exports in XML which are difficult to query or manipulate with modern frameworks. These tools transform it into more modern/favorable formats like JSON and PARQUET for further analysis.

Usage

Install dependencies with pipenv install
Download data from clinicaltrials.gov
Run the script ct-xml-to-json.py to convert the XML data into JSON
Run the script ct-json-to-parquet.py to convert the JSON data into PARQUET

python3 ct-xml-to-json.py 
    --schema ./schemas/public.xsd 
    --src ./data/AllPublicXML 
    --dst ./data/OutputJSON

python3 ct-json-to-parquet.py 
    --src ./data/OutputJSON
    --dst ./data/OutputPARQUET

Configuring AWS Athena for Querying

Push the PARQUET data to an S3 bucket with aws s3 sync ./data/OutputPARQUET/ s3://your-bucket-here/ct-data/
Configure a table in AWS Athena/Glue that points to your S3 location and infers the schema from the data.
You should get a new table with a schema that looks like the one saved here.
Query!
- Since the data is with structs/arrays instead of columns, there is some nuance to querying. See the AWS docs for more info.
- Optional fields within structs/arrays can be wrapped with the try function (try(X[1].Y)) to nullify errors.

SELECT 
    clinical_study.id_info.nct_id, 
    clinical_study.brief_title, 
    clinical_study.overall_status
FROM <your-table-name> 
LIMIT 10;

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
schemas		schemas
.gitignore		.gitignore
LICENSE		LICENSE
Pipfile		Pipfile
README.md		README.md
ct-json-to-parquet.py		ct-json-to-parquet.py
ct-xml-to-json.py		ct-xml-to-json.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

schemas

schemas

.gitignore

.gitignore

LICENSE

LICENSE

Pipfile

Pipfile

README.md

README.md

ct-json-to-parquet.py

ct-json-to-parquet.py

ct-xml-to-json.py

ct-xml-to-json.py

Repository files navigation

Clinical Trial Data Analysis

Usage

Configuring AWS Athena for Querying

About

Languages

License

nolanshah/clinical-trial-data-analysis

Folders and files

Latest commit

History

Repository files navigation

Clinical Trial Data Analysis

Usage

Configuring AWS Athena for Querying

About

Resources

License

Stars

Watchers

Forks

Languages