Repository for the ISARIC 3.0 Pipeline project.
The ISARIC 3.0 FHIR resources are derived from the fhir.resources package.
FHIR resources can be initialised using a data dictionary:
from fhir.resources.patient import Patient
data = {
    "id": "f001",
    "name": [{"text": "Micky Mouse"}],
    "gender": "male",
    "deceasedBoolean": False,
    "address": [{"country": "Switzerland"}],
    "birthDate": "1996-05-30",
}
patient = Patient(**data)
or in bulk from a FHIR export as an .ndjson file.
from fhir.resources.patient import Patient
patients = Patient.fhir_bulk_import("patient_export.ndjson")
Once initialised, FHIR resources can be transformed to FHIRflat files using the to_flat() method:
patient.to_flat("patient_flat.parquet")
which produces a Parquet file that can be read with pandas, giving a dataframe with the following structure:
resourceType | id | gender | birthDate | deceasedBoolean |
---|---|---|---|---|
Patient | f001 | male | 1996-05-30 | False |
Alternatively, a FHIRflat file can be generated directly from a FHIR .ndjson export file.
from fhir.resources.patient import Patient
Patient.fhir_file_to_flat("patient_export.ndjson")
will create a "patient_export.parquet" FHIRflat file. This first initialises a Patient data class for each row to make use of the Pydantic data validation, then creates a FHIRflat file.
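As a rough illustration of the row-by-row step described above, the sketch below parses an .ndjson stream (one JSON object per line) using only the standard library. The `read_ndjson` helper and the sample records are hypothetical, and the `resourceType` check is only a stand-in for the Pydantic validation; this is not how `fhir_file_to_flat` is implemented.

```python
import io
import json


def read_ndjson(stream):
    """Yield one parsed resource per non-empty line of an .ndjson stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        resource = json.loads(line)
        # Stand-in for the Pydantic validation step: only checks the type tag.
        if resource.get("resourceType") != "Patient":
            raise ValueError(f"line {line_number}: expected a Patient resource")
        yield resource


# In-memory stand-in for "patient_export.ndjson" (illustrative data only).
export = io.StringIO(
    '{"resourceType": "Patient", "id": "f001", "gender": "male"}\n'
    '{"resourceType": "Patient", "id": "f002", "gender": "female"}\n'
)
patients = list(read_ndjson(export))
print([p["id"] for p in patients])  # ['f001', 'f002']
```

Validating each line separately means a malformed record can be reported with its line number instead of failing the whole export without context.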
FHIR resources can also be created directly from FHIRflat files
Patient.from_flat("patient_flat.parquet")
which will return either a single Patient resource, or a list of Patient resources if the Parquet file contains multiple rows of data.
The FHIRflat structure closely follows that of FHIR, and simply flattens nested columns in a manner similar to pd.json_normalize(). Some fields are excluded, either because they are used only for convenience within a FHIR server, because they contain information not relevant to ISARIC clinical data, or because they would contain personally identifiable information (PII). These fields can be accessed and edited for each resource using the flat_exclusions property.
There are a few specifics to FHIRflat that differ from simply normalising a FHIR structure, noted below.
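The basic flattening can be pictured with a small standard-library sketch. The `flatten` helper and its `exclusions` argument are hypothetical names used for illustration only; the excluded fields (`name`, `address`) are an assumption chosen to mirror the Patient table above, not the actual contents of flat_exclusions.

```python
def flatten(resource, exclusions=(), parent=""):
    """Flatten nested dicts into dot-separated keys, skipping excluded paths."""
    flat = {}
    for key, value in resource.items():
        path = f"{parent}.{key}" if parent else key
        if path in exclusions:
            continue  # dropped, e.g. because the field may hold PII
        if isinstance(value, dict):
            flat.update(flatten(value, exclusions, path))
        else:
            flat[path] = value  # lists and scalars are kept as-is in this sketch
    return flat


patient = {
    "resourceType": "Patient",
    "id": "f001",
    "name": [{"text": "Micky Mouse"}],
    "gender": "male",
    "birthDate": "1996-05-30",
    "address": [{"country": "Switzerland"}],
    "managingOrganization": {"reference": "Organization/f001"},
}

# Assumed exclusions, for illustration only.
row = flatten(patient, exclusions={"name", "address"})
print(row)
```

Here the nested `managingOrganization` object becomes a single `managingOrganization.reference` column, while the excluded fields never reach the flat row.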
- CodeableConcepts
  CodeableConcepts are converted into two lists, one of codes and one of the corresponding text. Each coding is compressed into a single string with the format system|code. The ‘|’ symbol was chosen as it is the standard way to query codes in FHIR servers. Thus a JSON snippet containing a CodeableConcept:

  "code": {
      "coding": [
          {
              "system": "http://loinc.org",
              "code": "3141-9",
              "display": "Body weight Measured"
          },
          {
              "system": "http://snomed.info/sct",
              "code": "27113001",
              "display": "Body weight"
          }
      ]
  }

  is coded as two fields

  code.code | code.text |
  ---|---|
  ["http://loinc.org\|3141-9", "http://snomed.info/sct\|27113001"] | ["Body weight Measured", "Body weight"] |

  Note that the external coding label is removed.
- References
  A reference is a string giving the name of the resource and its ID, separated by a forward slash.

  "subject": {
      "reference": "Patient/f001",
      "display": "Donald Duck"
  }

  becomes

  subject.reference |
  ---|
  "Patient/f001" |

  The display text is not converted, due to the risk of revealing identifying information (e.g., a patient's name).
- Extensions
  The base FHIR schema can be extended to meet the needs of individual implementations using extension fields. FHIRflat displays these with the extension url as part of the column name. For example

  "extension": [
      {
          "url": "timingPhase",
          "valueCodeableConcept": {
              "coding": [
                  {
                      "system": "http://snomed.info/sct",
                      "code": "278307001",
                      "display": "on admission"
                  }
              ]
          }
      },
      {
          "url": "relativePeriod",
          "extension": [
              {"url": "relativeStart", "valueInteger": 2},
              {"url": "relativeEnd", "valueInteger": 5}
          ]
      }
  ]

  becomes

  extension.timingPhase.code | extension.timingPhase.text | extension.relativePeriod.relativeStart | extension.relativePeriod.relativeEnd |
  ---|---|---|---|
  "http://snomed.info/sct\|278307001" | "on admission" | 2 | 5 |

  Complex (nested) extensions such as relativePeriod also omit the internal extension labels.
- 0..* cardinality fields
  Fields which can contain an unspecified number of entries are handled according to the number of entries present. Lists of length 1 are expanded out as above, while any longer lists are kept in a single column, with the data in its original nested structure and _dense appended to the end of the field name. These fields are not expected to be queried regularly in standard analyses.

  For example, the diagnosis field of the Encounter resource has 0..* cardinality. If a single diagnosis is present, the field is expanded out:

  "diagnosis": [
      {
          "condition": [{"reference": {"reference": "Condition/stroke"}}],
          "use": [
              {
                  "coding": [
                      {
                          "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
                          "code": "AD",
                          "display": "Admission diagnosis"
                      }
                  ]
              }
          ]
      }
  ]

  becomes

  diagnosis.condition.reference | diagnosis.use.code | diagnosis.use.text |
  ---|---|---|
  Condition/stroke | "http://terminology.hl7.org/CodeSystem/diagnosis-role\|AD" | Admission diagnosis |

  whereas if two different diagnoses are present

  "diagnosis": [
      {
          "condition": [{"reference": {"reference": "Condition/stroke"}}],
          "use": [
              {
                  "coding": [
                      {
                          "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
                          "code": "AD",
                          "display": "Admission diagnosis"
                      }
                  ]
              }
          ]
      },
      {
          "condition": [{"reference": {"reference": "Condition/f201"}}],
          "use": [
              {
                  "coding": [
                      {
                          "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
                          "code": "DD",
                          "display": "Discharge diagnosis"
                      }
                  ]
              }
          ]
      }
  ]

  becomes

  encounter.diagnosis_dense |
  ---|
  "[{"condition": [{"reference"...}]}]" |
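The rules above can be sketched in a few lines of plain Python. All three helper names (`flatten_codeable_concept`, `flatten_reference`, `flatten_repeating`) are hypothetical and written only for illustration; they are not part of the FHIRflat API, and the `performer` entries are made-up sample data. The dense column is kept as a Python list here, which may differ from how the library stores it.

```python
def flatten_codeable_concept(name, concept):
    """Split a CodeableConcept into parallel code and text lists."""
    codings = concept.get("coding", [])
    return {
        f"{name}.code": [f"{c['system']}|{c['code']}" for c in codings],
        f"{name}.text": [c.get("display") for c in codings],
    }


def flatten_reference(name, reference):
    """Keep only the reference string; the display text is dropped."""
    return {f"{name}.reference": reference["reference"]}


def flatten_repeating(name, entries, expand):
    """0..* rule: expand a single entry, keep longer lists in one dense column."""
    if len(entries) == 1:
        return expand(name, entries[0])
    return {f"{name}_dense": entries}


code = {
    "coding": [
        {"system": "http://loinc.org", "code": "3141-9", "display": "Body weight Measured"},
        {"system": "http://snomed.info/sct", "code": "27113001", "display": "Body weight"},
    ]
}
subject = {"reference": "Patient/f001", "display": "Donald Duck"}

row = {}
row.update(flatten_codeable_concept("code", code))
row.update(flatten_reference("subject", subject))
print(row)

# Made-up repeating field: one entry is expanded, two are kept dense.
performers = [{"reference": "Practitioner/f005"}, {"reference": "Practitioner/f006"}]
single = flatten_repeating("performer", performers[:1], flatten_reference)
several = flatten_repeating("performer", performers, flatten_reference)
print(single)
print(several)
```

Passing the expansion function as an argument keeps the cardinality rule separate from the per-type rules, so the same length check can wrap references, CodeableConcepts, or any other element type.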