Refactor/tag info codegen #169

bpeake-illuscio · 2020-12-30T23:37:25Z

Hello!

I have taken a crack here at #147. I'll break this summary into sections where I discuss the reasoning behind certain choices.

Overview of Changes

Innolitics Git Submodule

Issue #147 mentions using the JSON dump found in the Innolitics Dicom Standard repo. Rather than downloading and copy/pasting the attributes.json, I decided to add the entire repo as a git submodule.

The submodule lives in /pkg/tag/dicom-standard. Here are the tradeoffs of this approach, IMO:

Pros

The source of our data is very explicit, down to the exact commit we are pulling from.
Updating to the latest version becomes as easy as updating the submodule.

Cons

Git submodules are not super-ubiquitous, so it does add a small amount of complexity.
The Innolitics repo is a little heavy. However, this is offset by submodules not being fetched on clone by default, so only developers who are updating the codegen feature will need to fully fetch it.

Overall I feel the pros are worth the cons.

Codegen migrated from Python to Go

I took the opportunity to move the codegen feature from Python to Go. This lowers the barrier to entry for developers who want to make PRs around the codegen feature but may not know Python. Go compiles quickly enough that I do not see build time being an issue, especially given how rarely the code will need to be regenerated.

The codegen tool now lives in /pkg/tag/codgen.

I'll go into more detail about the structure of the new codegen below.

Added Keyword-Indexed Map

Added tagDictByKeyword, which indexes tag.Info objects by Info.Name. This lets tag.FindByName do a single map lookup rather than iterating over tagDict.
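To illustrate the idea, here is a minimal, self-contained sketch of a keyword index. The Tag and Info types are simplified stand-ins, findByName and buildKeywordIndex are hypothetical helper names, and the index is built at startup here purely for brevity. It is not the generated code from this PR.

```go
package main

import "fmt"

// Tag and Info are simplified stand-ins for the library's types (assumption).
type Tag struct {
	Group, Element uint16
}

type Info struct {
	Tag  Tag
	Name string // holds the DICOM keyword, as described above
	VR   string
}

// tagDict is keyed by tag, mirroring the existing dictionary.
var tagDict = map[Tag]Info{
	{0x0010, 0x0010}: {Tag{0x0010, 0x0010}, "PatientName", "PN"},
	{0x0010, 0x0020}: {Tag{0x0010, 0x0020}, "PatientID", "LO"},
}

// tagDictByKeyword is a second index over the same entries, keyed by Info.Name.
var tagDictByKeyword = buildKeywordIndex(tagDict)

// buildKeywordIndex is a hypothetical helper that derives the keyword index.
func buildKeywordIndex(dict map[Tag]Info) map[string]Info {
	index := make(map[string]Info, len(dict))
	for _, info := range dict {
		index[info.Name] = info
	}
	return index
}

// findByName does one map lookup instead of scanning every entry of tagDict.
func findByName(name string) (Info, error) {
	info, ok := tagDictByKeyword[name]
	if !ok {
		return Info{}, fmt.Errorf("could not find tag with name %q", name)
	}
	return info, nil
}

func main() {
	info, err := findByName("PatientID")
	fmt.Println(info.Tag, info.VR, err)
}
```

The tradeoff is a second map held in memory in exchange for constant-time lookups by keyword.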

Added Codegen to Makefile

You can now type make codegen to run go generate. Since there were already make commands for tests, I figured I would add one for this too.

Challenges

The main challenge was that the Innolitics JSON dump only contains tags for the DICOM file spec, and does not include tags for group 0x0000, which is used by the DIMSE / Dicom-Net spec.

For this PR, I have retained this group from the original CSV currently being used for code generation, and pared that CSV down to just the items in group 0x0000. However, this meant reading from multiple sources of truth, and the architecture of the code reflects that.

New Codegen Architecture

Interfaces

The new codegen module relies on two interfaces to do the heavy lifting:

TagReader

// Interface to implement for reading tags from a source.
type TagReader interface {
	// Name of reader for error messages.
	Name() string
	// Yield next tag or io.EOF if done.
	Next() (TagInfo, error)
	// Implements io.Closer() for closing underlying reader.
	Close() error
}

This interface reads tag information from a source file. A master implementation of TagReader is defined to pull from multiple TagReaders and return parsed tags until all child readers are exhausted. This means that if we add more sources, we can just define a new TagReader and register it.

CodeWriter

// CodeWriter is an interface for writing to a dicom tag codegen file.
type CodeWriter interface {
	// Name to use in error messages related to this writer.
	Name() string

	// WriteLeading writes the opening part of a file. Called once before any calls to
	// WriteTag()
	WriteLeading() error

	// WriteTag writes codegen for a given tag. It may be called many times between
	// WriteLeading and WriteTrailing. WriteTag should return io.EOF when all tags
	// are successfully written.
	WriteTag(info TagInfo) error

	// WriteTrailing writes all codegen after WriteTag calls are complete.
	WriteTrailing() error

	// Close closes any underlying open resources.
	Close() error
}

The CodeWriter interface is responsible for writing the codegen for a single Go file. A master implementation of this interface distributes each event to all child CodeWriters.
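A fan-out writer along these lines might look like the sketch below. The masterWriter and bufferWriter types and the single-field TagInfo are hypothetical, and this version stops at the first child error, which is one possible policy rather than necessarily what this PR does.

```go
package main

import (
	"fmt"
	"strings"
)

// TagInfo is reduced to one field for this sketch (assumption).
type TagInfo struct {
	Keyword string
}

// CodeWriter matches the interface shown above.
type CodeWriter interface {
	Name() string
	WriteLeading() error
	WriteTag(info TagInfo) error
	WriteTrailing() error
	Close() error
}

// bufferWriter is a hypothetical child writer that records output in memory.
type bufferWriter struct {
	name string
	out  strings.Builder
}

func (w *bufferWriter) Name() string { return w.name }

func (w *bufferWriter) WriteLeading() error {
	w.out.WriteString("begin\n")
	return nil
}

func (w *bufferWriter) WriteTag(info TagInfo) error {
	w.out.WriteString(info.Keyword + "\n")
	return nil
}

func (w *bufferWriter) WriteTrailing() error {
	w.out.WriteString("end\n")
	return nil
}

func (w *bufferWriter) Close() error { return nil }

// masterWriter is a hypothetical CodeWriter that forwards each event to all children.
type masterWriter struct {
	children []CodeWriter
}

// each applies fn to every child, stopping at the first error.
func (m *masterWriter) each(step string, fn func(CodeWriter) error) error {
	for _, child := range m.children {
		if err := fn(child); err != nil {
			return fmt.Errorf("%s on writer %q: %w", step, child.Name(), err)
		}
	}
	return nil
}

func (m *masterWriter) Name() string { return "master" }

func (m *masterWriter) WriteLeading() error {
	return m.each("WriteLeading", CodeWriter.WriteLeading)
}

func (m *masterWriter) WriteTag(info TagInfo) error {
	return m.each("WriteTag", func(w CodeWriter) error { return w.WriteTag(info) })
}

func (m *masterWriter) WriteTrailing() error {
	return m.each("WriteTrailing", CodeWriter.WriteTrailing)
}

func (m *masterWriter) Close() error {
	return m.each("Close", CodeWriter.Close)
}

func main() {
	consts := &bufferWriter{name: "consts"}
	dict := &bufferWriter{name: "dict"}
	master := &masterWriter{children: []CodeWriter{consts, dict}}

	_ = master.WriteLeading()
	_ = master.WriteTag(TagInfo{Keyword: "PatientName"})
	_ = master.WriteTrailing()
	_ = master.Close()

	fmt.Print(consts.out.String())
}
```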

Basic Flow

The above interfaces allow the main flow to be simple:

A master TagReader is created with child readers for each source file we are parsing.
A master CodeWriter is created with child writers for each target codegen file we are writing.
The master CodeWriter writes any leading code its children need to put down before tag information is written.
We iterate over the master TagReader, passing each tag to the master CodeWriter as it is parsed to generate the code for that tag.
When iteration is complete, the master writer writes any trailing code needed to complete the codegen files and closes all resources.

This architecture allows us to add new generated files or new tag metadata sources by defining and registering new readers and writers.
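The steps above can be sketched as a single driver function. The generate function, the listReader and commentWriter types, and the single-field TagInfo are hypothetical stand-ins. For brevity this version ignores errors from Close, which a real tool would likely want to report.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// TagInfo is reduced to one field for this sketch (assumption).
type TagInfo struct {
	Keyword string
}

type TagReader interface {
	Name() string
	Next() (TagInfo, error)
	Close() error
}

type CodeWriter interface {
	Name() string
	WriteLeading() error
	WriteTag(info TagInfo) error
	WriteTrailing() error
	Close() error
}

// listReader is a hypothetical in-memory TagReader.
type listReader struct {
	tags []TagInfo
}

func (r *listReader) Name() string { return "list" }

func (r *listReader) Next() (TagInfo, error) {
	if len(r.tags) == 0 {
		return TagInfo{}, io.EOF
	}
	next := r.tags[0]
	r.tags = r.tags[1:]
	return next, nil
}

func (r *listReader) Close() error { return nil }

// commentWriter is a hypothetical CodeWriter that emits one comment per tag.
type commentWriter struct {
	out strings.Builder
}

func (w *commentWriter) Name() string { return "comments" }

func (w *commentWriter) WriteLeading() error {
	w.out.WriteString("package tag\n")
	return nil
}

func (w *commentWriter) WriteTag(info TagInfo) error {
	fmt.Fprintf(&w.out, "// %s\n", info.Keyword)
	return nil
}

func (w *commentWriter) WriteTrailing() error {
	w.out.WriteString("// end\n")
	return nil
}

func (w *commentWriter) Close() error { return nil }

// generate runs the flow: leading code, one WriteTag per tag, trailing code.
func generate(r TagReader, w CodeWriter) error {
	defer r.Close() // Close errors are ignored in this sketch
	defer w.Close()

	if err := w.WriteLeading(); err != nil {
		return fmt.Errorf("%s: writing leading code: %w", w.Name(), err)
	}
	for {
		info, err := r.Next()
		if errors.Is(err, io.EOF) {
			break // all sources exhausted
		}
		if err != nil {
			return fmt.Errorf("%s: reading tag: %w", r.Name(), err)
		}
		if err := w.WriteTag(info); err != nil {
			return fmt.Errorf("%s: writing %s: %w", w.Name(), info.Keyword, err)
		}
	}
	return w.WriteTrailing()
}

func main() {
	r := &listReader{tags: []TagInfo{{"PatientName"}, {"PatientID"}}}
	w := &commentWriter{}
	if err := generate(r, w); err != nil {
		fmt.Println("codegen failed:", err)
		return
	}
	fmt.Print(w.out.String())
}
```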

Open Questions

I noticed the original also had a number of attributes like ACR_NEMA_2C_CoefficientsSDDN. These are not present in the Innolitics dump and I am afraid I am not quite sure where they come from. I can copy and paste them all from the old CSV into the current dimse.csv, but I would like to understand more about what those values are, why their keywords conform to a different convention, etc.
I have not done any sort of in-depth diff of any other consts that may have disappeared. Are we worried about breaking changes if there are missing values from the old CSV that are not in the current?

Looking Forward

If you like this approach, I think I would like to start adding some more information into the tag.Info struct that is currently not being captured or fully parsed. Some possibilities include:

Pydicom also collates a collection of well-known private tags from this source. We could do the same.
More granular ValueMultiplicity information, à la Feature/Value Multiplicity Info #166.
Right now retired tags are skipped, but many users may be working with older DICOM files, so I would propose adding a Retired bool field to tag.Info and including them.
Some tags have multiple possible Value Representations. For instance, (0018,9810) ZeroVelocityPixelValue has a VR of 'US or SS'. Right now I am pulling the second value, as that aligned with what the previous CSV did, but I think we should consider capturing both for completeness, especially if users want to use this lib to validate DICOM writes.
Right now we are using the Name field to hold the Keyword defined for the tag, but the spec defines both a machine-usable Keyword and a human-readable Name. It would be nice to capture both.
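To make a few of these proposals concrete, here is a hedged sketch of what an extended struct and a multi-VR parser could look like. The Info layout shown (Keyword, Name, VRs, Retired) and the splitVR helper are hypothetical; they are not part of the current tag.Info and the eventual design may well differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Info is a hypothetical extension of tag.Info covering the ideas above.
type Info struct {
	Keyword string   // machine-usable keyword, e.g. "ZeroVelocityPixelValue"
	Name    string   // human-readable name from the spec
	VRs     []string // every VR the spec allows, in the order listed
	Retired bool     // true for tags the spec has retired
}

// splitVR turns a raw VR string such as "US or SS" into its alternatives.
func splitVR(raw string) []string {
	parts := strings.Split(raw, " or ")
	vrs := make([]string, 0, len(parts))
	for _, part := range parts {
		if part = strings.TrimSpace(part); part != "" {
			vrs = append(vrs, part)
		}
	}
	return vrs
}

func main() {
	info := Info{
		Keyword: "ZeroVelocityPixelValue",
		Name:    "Zero Velocity Pixel Value",
		VRs:     splitVR("US or SS"),
		Retired: false,
	}
	fmt.Printf("%+v\n", info)
}
```

Keeping every alternative in a slice would let a writer or validator accept any VR the spec permits instead of committing to one at codegen time.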

Let me know what you think of this PR. If you like it, I think all of the above will be pretty low-hanging fruit and I am happy to knock them out in a series of PRs that build on this one.

suyashkumar · 2023-04-25T02:09:47Z

Thank you @bpeake-illuscio @peake100, apologies for this slipping through the cracks. I think much of this is on the right track. I'm going to take a closer look, and may fork your branch to add commits to make various changes and updates in the next couple of weeks (unless you'd like to make them yourself, but I figure you may be busy, so no pressure at all).

Thank you again for all your contributions!

peake100 · 2023-04-25T02:17:26Z

Hey! Thanks so much for reaching out. I'm no longer at Illuscio, so this is my current user.

I probably don't have the bandwidth for this right now, but I appreciate you reaching out and am happy to field questions about what I did if you have any!

suyashkumar · 2023-04-25T02:22:59Z

Sounds good, and totally understood! Will do :) Thanks again!

peake100 added 6 commits December 28, 2020 15:20

added innolitics dicom-spec as git submodule

175eb11

golang-based generation foundation laid down

c11e50f

added new dict file

5cefbe7

tests passing

26a1bcb

comments

270e157

better code comments

20b4051

suyashkumar mentioned this pull request Jan 4, 2021

Feature/Value Multiplicity Info #166

Closed

suyashkumar self-requested a review January 4, 2021 01:45

suyashkumar self-assigned this Jan 4, 2021

suyashkumar added the enhancement New feature or request label Jan 4, 2021

better comments

b338a0d

bpeake-illuscio mentioned this pull request Mar 23, 2021

dicom.Write fails on 'SmallestImagePixelValue' with VR of 'US' #190

Open

bpeake-illuscio marked this pull request as draft April 6, 2021 03:22

suyashkumar mentioned this pull request Apr 20, 2023

add EncapsulatedDocumentLength #271

Open

suyashkumar mentioned this pull request May 25, 2024

dicom.Write: The Pixel data only supports OW type #299

Open
