Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SWC io: DFS-sort nodes, optionally do not reindex #110

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

clbarnes
Copy link
Collaborator

@clbarnes clbarnes commented Nov 23, 2022

Previously, nodes were sorted by parent index, which would ensure that root nodes appear before non-roots but did not guarantee that all children are defined after their parent.
Now, nodes are sorted in depth-first order (addressing their children in order of their index).

Additionally, added the option of return_node_map=None, which does not re-index nodes before writing.
This is useful when exporting SWC files of several neurons where node IDs must persist their global uniqueness.

Finally, adds a label "9" for nodes which have both pre- and
post-synapses.

@schlegelp
Copy link
Collaborator

Out of curiosity: have you experienced issues with the previous sorting when loading the SWC files elsewhere?

Also: Did you do any benchmarking, i.e. does this impact performance for writing large numbers (say >1000) of SWC?

@clbarnes
Copy link
Collaborator Author

clbarnes commented Nov 23, 2022

Out of curiosity: have you experienced issues with the previous sorting when loading the SWC files elsewhere?

I haven't experienced any issues with the previous sorting but I don't think I use any SWC readers which are actually dependent on the order (and wrote a rust CLI for sanitising any SWC files we're sending elsewhere, because of course I did...). But I'm pretty sure the previous implementation, sorting by parent ID, doesn't ensure parents are defined before children: if you trace from dendrite to soma and then sort by parent node ID you'll have the root first, but the second row will be the end with a parent which isn't defined yet.

We could either drop the convention that they are sorted (which doesn't matter much, at least in the SWC implementations I've used/contributed to), and not sort at all, or sort them in a way which does what it needs to, for which I think the DFS is best suited. It does do some possibly-unnecessary sorts internally, so that each node's children is addressed in a predictable order; could cut a little time out by addressing them in arbitrary order, or reverse insertion order.

I haven't done any benchmarking but can give it a go. My intuition is that it will slow down make_swc_table a bit but that that may not matter so much when it's rolled in to the whole process of file IO.

For me, main thrust of the PR is allowing persisting node IDs, the sorting is just something I noticed along the way.

@clbarnes
Copy link
Collaborator Author

clbarnes commented Nov 23, 2022

The DFS sort costs about 9ms for a randomly-shuffled SWC file from navis.data compared to the current sort. The internal sorts cost about 0.3ms. make_swc_table using the other changes in this PR but with the current sort takes ~64ms (so changing to the DFS sort would increase runtime by 14%). write_swc for a single SWC file takes 100ms (so changing to the DFS sort would increase runtime by 9%).

@schlegelp
Copy link
Collaborator

Thanks for testing this! I think 9% slow down isn't the end of the world.

I guess there are actually three scenarios to consider here:

  1. Leave node IDs as is (which may be useful when working with e.g. CATMAID data but could cause troubles with some parsers)
  2. Re-index nodes such that their IDs are sequential but don't worry about re-ordering them (i.e. half-compliant)
  3. Re-index AND re-order using DFS

It'd be nice if we could cater for all cases but not super hung up on it. I wonder if this warrants changing/clarifying the parameter though? Perhaps something like this:

reindex_nodes :   'DSF' | bool
                    If 'DSF' (default), will use a depth-first search to reindex and reorder nodes
                    such that each parent is guaranteed to be defined before its childs. This comes
                    at the cost of a small overhead but makes the file fully compliant with the SWC
                    format (although most modern parsers don't strictly require this).
                    If True, will re-index nodes but not re-order the nodes (slightly faster) and good
                    enough for most parsers.
                    If False, will not reindex nodes (i.e. their current IDs will be written).

Returns 
--------
node_id_map :    dict
                   If reindex_nodes is not False will return a dictionary mapping the old to the re-indexed node IDs.

If you think scenario 2 is unnecessary and complicates things, I'd be happy to drop it but I think changing the parameter name could be helpful. Thoughts?

@clbarnes
Copy link
Collaborator Author

Yes, those are the cases I arrived at for the rust tool too. I guess probably best handled as two bools, "sort" (maybe topo_sort) and "reindex"? Or a single IntEnum, but they're not very clear/common. I would always return the index mapping on reindex (if we're generating it anyway, no reason not to return it). I'd be inclined to return None in its place if no reindexing happened; easier to do type annotations that way, but that would be a breaking change (although so would changing the parameter name).

Previously, nodes were sorted by parent index, which would ensure that
root nodes appear before non-roots but did not guarantee that all
children are defined after their parent.
Now, nodes are sorted in depth-first order (addressing their children in
order of their index).

Additionally, added the option of return_node_map=None, which does not
re-index nodes before writing.
This is useful when exporting a SWC files of several neurons along with
where node IDs must persist their global uniqueness.

Finally, adds a label "9" for nodes which have both pre- and
post-synapses.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants