Making `PhySortingExtractor` more efficient #2197

DradeAW · 2023-11-13T14:37:15Z

No description provided.

DradeAW · 2023-11-13T14:38:14Z

Right now the sorting loads all .tsv and .csv in the __init__ by default, which is very slow.

I think by default we should not do that, and only load what's strictly necessary.

alejoe91 · 2023-11-13T14:54:47Z

What about loading all by default, but give the option to opt out? I think that the average user would want to read all properties, and you are not the average user ;)

DradeAW · 2023-11-13T14:59:55Z

> you are not the average user ;)

I know ^^'
But I'm going with Sam's philosophy that __init__ should be as light as possible and as fast as possible
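To make the two positions above concrete (load everything by default with an opt-out, versus keeping `__init__` as light as possible), here is a minimal sketch of deferred property loading. `LazyTsvProperties` and its `load_all` flag are hypothetical names made up for this illustration; they are not part of spikeinterface and not what this PR implements.

```python
import csv
import tempfile
from pathlib import Path


class LazyTsvProperties:
    """Hypothetical helper: parses cluster_*.tsv files only when asked."""

    def __init__(self, folder, load_all=False):
        self.folder = Path(folder)
        self._cache = {}
        # Listing the files is cheap; parsing them is the part we defer.
        self._files = {p.stem: p for p in sorted(self.folder.glob("*.tsv"))}
        if load_all:
            # Eager behaviour: read every property file right away.
            for name in self._files:
                self.get(name)

    def get(self, name):
        # Parse the file on first access, then serve it from the cache.
        if name not in self._cache:
            with open(self._files[name], newline="") as f:
                rows = list(csv.reader(f, delimiter="\t"))
            _header, body = rows[0], rows[1:]
            self._cache[name] = {row[0]: row[1] for row in body}
        return self._cache[name]

    @property
    def loaded(self):
        return sorted(self._cache)


# Small demo on throwaway files (made-up contents).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "cluster_group.tsv").write_text("cluster_id\tgroup\n0\tgood\n1\tnoise\n")
    Path(tmp, "cluster_KSLabel.tsv").write_text("cluster_id\tKSLabel\n0\tgood\n1\tmua\n")

    lazy = LazyTsvProperties(tmp)            # nothing parsed yet
    loaded_before = lazy.loaded
    group = lazy.get("cluster_group")        # parsed on first access
    loaded_after = lazy.loaded
    eager_loaded = LazyTsvProperties(tmp, load_all=True).loaded
```

With `load_all=False` the constructor only lists the files, so its cost does not grow with the number or size of the property files; with `load_all=True` it behaves like the eager loading described above.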

h-mayorquin · 2023-11-13T15:11:28Z

src/spikeinterface/extractors/phykilosortextractors.py

@@ -178,6 +179,13 @@ def __init__(

        self.add_sorting_segment(PhySortingSegment(spike_times_clean, spike_clusters_clean))

+        # Caching spike vector for faster computation.


h-mayorquin · 2023-11-13

You can use the self.to_spike_vector method with cache=True to avoid duplication.

DradeAW · 2023-11-13

If I call to_spike_vector, it's not going to compute it efficiently.
Here I'm using the data that's already loaded from the files!

h-mayorquin · 2023-11-13 (edited)

Makes sense. Can you add a comment along those lines so nobody stumbles like I did in the future? Something like:

# Using properties available at __init__ to build a cached spike_vector and avoid computation.
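For readers following this thread, a minimal sketch of what "using the data that's already loaded" could look like: the spike vector is assembled directly from the spike time and cluster arrays instead of being recomputed per unit. `build_spike_vector` is a hypothetical helper, not the code in this PR, and the field names (`sample_index`, `unit_index`, `segment_index`) are an assumption about the spike vector layout.

```python
import numpy as np

# Assumed spike vector layout; treat these field names as an assumption.
SPIKE_DTYPE = [("sample_index", "int64"), ("unit_index", "int64"), ("segment_index", "int64")]


def build_spike_vector(spike_times, spike_clusters, unit_ids):
    """Build a single-segment spike vector from arrays already in memory.

    Assumes unit_ids is non-empty and contains no duplicates.
    """
    spike_times = np.asarray(spike_times).ravel()
    spike_clusters = np.asarray(spike_clusters).ravel()
    unit_ids = np.asarray(unit_ids)

    # Map each spike's cluster id to its position in unit_ids.
    id_order = np.argsort(unit_ids)
    sorted_ids = unit_ids[id_order]
    pos = np.clip(np.searchsorted(sorted_ids, spike_clusters), 0, len(sorted_ids) - 1)

    # Drop spikes whose cluster is not one of the requested units.
    keep = sorted_ids[pos] == spike_clusters

    vector = np.zeros(int(keep.sum()), dtype=SPIKE_DTYPE)
    vector["sample_index"] = spike_times[keep]
    vector["unit_index"] = id_order[pos[keep]]
    # segment_index stays 0: only one segment in this sketch.

    # Return the spikes ordered by time.
    return vector[np.argsort(vector["sample_index"], kind="stable")]


# Made-up data: cluster 9 is not in unit_ids, so its spike is dropped.
sv = build_spike_vector([10, 5, 20, 7], [3, 1, 3, 9], unit_ids=[1, 3])
```

The whole vector is built in one vectorized pass, so the cost is dominated by a single sort over the spikes.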

h-mayorquin · 2023-11-13T15:16:23Z

This PR, though, makes Phy less efficient at initialization by doing a computation. If you pickle the file, for example in a long chain like the ones that you like to use, this will make the loading slower.

What are you improving? Have you profiled?

Remember the rules of optimization:
https://wiki.c2.com/?RulesOfOptimization

DradeAW · 2023-11-13T15:18:43Z

> This PR, though, makes Phy less efficient at initialization by doing a computation.

I'm planning on removing some computation done beforehand, hence the fact that it's still a draft :)

h-mayorquin · 2023-11-13T15:21:06Z

Makes sense, the draft escaped me. Don't forget to profile.

DradeAW · 2023-11-29T15:05:51Z

I benchmarked the fastest way to load a tsv file by trying the following:

# Benchmark using timeit
pd.read_csv("cluster_group.tsv", delimiter='\t')                  # ~ 7.08s for 10,000 iterations
dict(csv.reader(open('cluster_group.tsv', 'r'), delimiter='\t'))  # ~ 0.27s for 10,000 iterations
list(csv.reader(open('cluster_group.tsv', 'r'), delimiter='\t'))  # ~ 0.25s for 10,000 iterations

On top of that, importing pandas is slower than importing csv, and csv is in the standard library, so I think we should drop pandas for the PhySortingExtractor. This will remove a dependency and, hopefully, make things faster!
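For anyone who wants to reproduce this kind of measurement, here is a minimal, standard-library-only sketch of the two `csv` variants timed with `timeit`. The file contents are made up, the helper names are hypothetical, and no timings are claimed here since they depend on the machine and the file; a `pd.read_csv` call could be timed the same way if pandas is installed.

```python
import csv
import tempfile
import timeit
from pathlib import Path


def read_tsv_as_dict(path):
    # dict() only works because each row has exactly two columns.
    with open(path, newline="") as f:
        return dict(csv.reader(f, delimiter="\t"))


def read_tsv_as_list(path):
    with open(path, newline="") as f:
        return list(csv.reader(f, delimiter="\t"))


with tempfile.TemporaryDirectory() as tmp:
    # Throwaway stand-in for cluster_group.tsv.
    tsv_path = Path(tmp) / "cluster_group.tsv"
    tsv_path.write_text("cluster_id\tgroup\n0\tgood\n1\tmua\n2\tnoise\n")

    as_dict = read_tsv_as_dict(tsv_path)
    as_list = read_tsv_as_list(tsv_path)

    n = 1000
    t_dict = timeit.timeit(lambda: read_tsv_as_dict(tsv_path), number=n)
    t_list = timeit.timeit(lambda: read_tsv_as_list(tsv_path), number=n)
    print(f"dict(csv.reader): {t_dict:.3f}s for {n} iterations")
    print(f"list(csv.reader): {t_list:.3f}s for {n} iterations")
```

Note that the `dict` variant keeps the header row as an ordinary entry (`"cluster_id" -> "group"`), so the caller has to skip it.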

DradeAW · 2023-11-29T15:15:58Z

There is a question on how to retrieve the unit_ids for this extractor:
If we retrieve them from the spike vector (meaning from spike_clusters.npy), then it will remove the empty units (which is not done by default in the current version).
However, if we retrieve the unit_ids from the tsv information (current version), then we run the risk of a tsv file not having all the ids, which deletes all units not in that tsv file (I actually encountered this bug once in the past, where some units were not loaded).

What do you think is best?
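To illustrate the trade-off, here is a small sketch with made-up data comparing the two sources of unit ids. The lists stand in for `spike_clusters.npy` and for the rows of a tsv file; the function names are hypothetical and this is not the extractor's actual code.

```python
def unit_ids_from_spikes(spike_clusters):
    # Only clusters that have at least one spike show up here.
    return sorted(set(spike_clusters))


def unit_ids_from_tsv(rows):
    # Only clusters listed in the tsv file show up here.
    return sorted(int(cluster_id) for cluster_id, _label in rows)


# Stand-in for spike_clusters.npy: cluster 1 has no spikes.
spike_clusters = [0, 0, 2, 2, 2, 5]
# Stand-in for a tsv file: cluster 5 is missing from it.
tsv_rows = [("0", "good"), ("1", "noise"), ("2", "mua")]

from_spikes = unit_ids_from_spikes(spike_clusters)
from_tsv = unit_ids_from_tsv(tsv_rows)

# Units that exist in the tsv but are empty: dropped when using the spikes.
empty_units = sorted(set(from_tsv) - set(from_spikes))
# Units that have spikes but no tsv row: dropped when using the tsv.
missing_from_tsv = sorted(set(from_spikes) - set(from_tsv))

print("from spikes:", from_spikes)
print("from tsv:", from_tsv)
print("empty units:", empty_units)
print("missing from tsv:", missing_from_tsv)
```

Either source silently drops something: the spike-based ids lose empty units, while the tsv-based ids lose any unit whose row is absent from the file.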

zm711 · 2023-12-04T14:09:43Z

I use the PhyExtractor sometimes (so I have an interest here). Could you clarify two things for me?

> then it will remove the empty units (which is not done by default in the current version).

Is this a problem? If a unit is empty, is there a reason I would want to keep it around? Isn't this a weird KS artifact that happens sometimes?

> which deletes all units not in that tsv file

You mean it masks those units, right? It's not that they are actually deleted, it's just that they are not analyzed in that session because they are not loaded. I would be curious about the exact time you encountered this. Since I use this extractor sometimes, I might want to go back and double check myself.

DradeAW · 2023-12-04T14:16:10Z

> is this a problem? If a unit is empty is there a reason I would want to keep it around? Isn't this a weird KS artifact that happens sometimes?

I myself don't want to keep an empty unit around, but I want to make sure that everybody is on the same page!
Yes, it happens with Kilosort, for some reason.

> You mean like masks those units right? It's not that they are actually deleted

They're not deleted from the phy folder, but they are not in the sorting object.

> I would be curious about the exact time you encountered this

I made a tsv file but with the unit_id instead of the cluster_id, and instead of having weird values for this entry I had almost half of the units I was supposed to have in the PhySortingExtractor ^^'

zm711 · 2023-12-04T15:03:31Z

Cool, thanks for the response. I would also agree to not load the empty units (even though that was the old behavior), but if we want to stay consistent I could understand maintaining the old behavior.

Caching PhySortingExtractor spike vector

57a18be

h-mayorquin reviewed Nov 13, 2023


alejoe91 added the enhancement, extractors, and performance labels and removed the enhancement label on Nov 14, 2023

Merge branch 'main' into fast_phy_extractor

2d65c7f
