
spark-dicom

Spark DICOM connector in Scala

How to use

Once the connector is on the classpath of your Spark cluster, you can load DICOM data in Spark using the dicomFile format as follows:

val df = spark.read.format("dicomFile").load("/some/hdfs/path").select("PatientName", "StudyDate", "StudyTime")

You can select DICOM attributes by the keyword under which they appear in the data element registry of the official DICOM standard (for example PatientName).

Each attribute is written to a column whose Spark data type depends on its VR (Value Representation). The mapping (VR → Spark data type) is as follows:

  • AE, AS, AT, CS, DS, DT, IS, LO, LT, SH, ST, UC, UI, UR, UT → String
  • PN → struct {"Alphabetic": String, "Ideographic": String, "Phonetic": String}
  • FL, FD → [Double]
  • SL, SS, US, UL → [Integer]
  • SV, UV → [Long]
  • DA → String (formatted as DateTimeFormatter.ISO_LOCAL_DATE)
  • TM → String (formatted as DateTimeFormatter.ISO_LOCAL_TIME)
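
For example, PN struct fields can be selected directly, and DA strings can be parsed into Spark dates; a minimal sketch, reusing the PatientName and StudyDate attributes shown above:

import org.apache.spark.sql.functions.{col, to_date}

val df = spark.read.format("dicomFile").load("/some/hdfs/path")
// PN columns are structs: select the Alphabetic representation
val names = df.select(col("PatientName.Alphabetic"))
// DA columns are ISO_LOCAL_DATE strings ("yyyy-MM-dd"): parse them into Spark dates
val dates = df.select(to_date(col("StudyDate"), "yyyy-MM-dd").as("studyDate"))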

Pixel Data

The PixelData attribute in a DICOM file can be very large and may cause Spark to run out of memory, so reading it is disabled by default. To be able to select the PixelData column, turn the includePixelData option on:

spark.read.format("dicomFile").option("includePixelData", true).load("/some/hdfs/path").select("PixelData")

Other columns

  • isDicom: true if the file was read as a DICOM file, false otherwise (see the example below)
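
For instance, a minimal sketch keeping only the rows that correspond to files successfully read as DICOM:

import org.apache.spark.sql.functions.col

val df = spark.read.format("dicomFile").load("/some/hdfs/path")
// Drop rows for files that could not be parsed as DICOM
val dicomOnly = df.filter(col("isDicom"))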

De-identification

A DICOM dataframe can be de-identified according to the Basic Confidentiality Profile of the DICOM standard. To use the de-identifier, do the following in Scala:

import ai.kaiko.spark.dicom.deidentifier.DicomDeidentifier._

var df = spark.read.format("dicomFile").load("/some/hdfs/path")
df = deidentify(df)

In the resulting dataframe, columns are dropped, emptied, or replaced with dummy values according to the actions prescribed by the profile.
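
For example, a minimal sketch (using only the API above) to see which columns the de-identification drops entirely:

import ai.kaiko.spark.dicom.deidentifier.DicomDeidentifier._

val raw = spark.read.format("dicomFile").load("/some/hdfs/path")
val deid = deidentify(raw)
// Columns removed by the profile no longer appear in the de-identified schema
val droppedColumns = raw.columns.toSet -- deid.columns.toSet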

To perform the de-identification with any of the supported options, pass a configuration map:

import ai.kaiko.spark.dicom.deidentifier.DicomDeidentifier._
import ai.kaiko.spark.dicom.deidentifier.options._

val config: Map[DeidOption, Boolean] = Map(
  CleanDesc -> true,
  RetainUids -> true
)

var df = spark.read.format("dicomFile").load("/some/hdfs/path")
df = deidentify(df, config)

Current limitations of the de-identification are:

  • Tags with SQ VR should be de-identified; currently they are ignored.
  • Private tags should be de-identified; currently they are ignored.
  • The U action should pseudonymize the value; currently it replaces the value with ToPseudonimize.
  • The C action should clean the value of PHI/PII; currently it replaces the value with ToClean.

Development

Development shell

A reproducible development environment is provided using Nix.

$ nix-shell

This provides the JDK, sbt, and all other required tools.
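
For example, to run the test suite from inside the development shell:

$ nix-shell
$ sbt test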

Build with Nix

Build the JAR artifact:

$ nix-build

Updating dependencies

When changing sbt build dependencies, update depsSha256 in default.nix as instructed in that file.

CI

CI is handled by GitHub Actions, using Nix for dependency management, testing, building, and caching (with Cachix).

Note: for CI to run tests, the Nix build must run them in its checkPhase.

You can run the CI locally using act (provided in the Nix shell).
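
For example, from inside the Nix shell (by default act simulates a push event):

$ act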

Release

Creating a release is done with the help of the sbt-sonatype, sbt-pgp and sbt-release plugins.

Before starting, set the Sonatype credentials as environment variables: SONATYPE_USERNAME and SONATYPE_PASSWORD. In addition, make sure the gpg utility is installed and the release GPG key is available in your keyring.

Then, start sbt from the Nix shell and run the release command at the sbt prompt:

$ nix-shell
$ sbt
> release

You will be prompted for the "release version", the "next version", and the GPG key passphrase. Make sure versions follow the SemVer scheme. If all went well, the new release should be available on Maven Central within about 10 minutes.