Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making PhySortingExtractor more efficient #2197

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

DradeAW
Copy link
Contributor

@DradeAW DradeAW commented Nov 13, 2023

No description provided.

@DradeAW
Copy link
Contributor Author

DradeAW commented Nov 13, 2023

Right now the sorting loads all .tsv and .csv in the __init__ by default, which is very slow.

I think by default we should not do that, and only load what's strictly necessary.

@alejoe91
Copy link
Member

What about loading all by default, but give the option to opt out? I think that the average user would want to read all properties, and you are not the average user ;)

@DradeAW
Copy link
Contributor Author

DradeAW commented Nov 13, 2023

you are not the average user ;)

I know ^^'
But I'm going with Sam's philosophy that __init__ should be as light as possible and as fast as possible

@@ -178,6 +179,13 @@ def __init__(

self.add_sorting_segment(PhySortingSegment(spike_times_clean, spike_clusters_clean))

# Caching spike vector for faster computation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use the self.to_spike_vector method with cache=True to avoid duplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I call to_spike_vector, it's not going to compute it efficiently.
Here I'm using the data that's already loaded in the files!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Can you add a comment along those lines so nobody stumbles like myself in the future. Like:

# Using properties available at __init__ to build the a cached spike_vector and avoid computation. 

@h-mayorquin
Copy link
Collaborator

This PR though, makes Phy less eficient at initialization by doing a computation. If you pickle the file for example in a long chain like the ones that you like the use this will make the loading slower.

What are you improving? Have you profiled?

Remember the rules of optimization:
https://wiki.c2.com/?RulesOfOptimization

@DradeAW
Copy link
Contributor Author

DradeAW commented Nov 13, 2023

This PR though, makes Phy less eficient at initialization by doing a computation.

I'm planning on removing some computation done beforehand, hence the fact that it's still a draft :)

@h-mayorquin
Copy link
Collaborator

Makes sense, the draft escaped me. Don't forget to profile.

@alejoe91 alejoe91 added enhancement New feature or request extractors Related to extractors module performance Performance issues/improvements and removed enhancement New feature or request labels Nov 14, 2023
@DradeAW
Copy link
Contributor Author

DradeAW commented Nov 29, 2023

I benchmarked the fastest way to load a tsv file by trying the following:

# Benchmark using timeit
pd.read_csv("cluster_group.tsv", delimiter='\t')                  # ~ 7.08s for 10,000 iterations
dict(csv.reader(open('cluster_group.tsv', 'r'), delimiter='\t'))  # ~ 0.27s for 10,000 iterations
list(csv.reader(open('cluster_group.tsv', 'r'), delimiter='\t'))  # ~ 0.25s for 10,000 iterations

Plus the fact that importing pandas is slower than importing csv, and that csv is in the standard library, I think we should drop pandas for the PhySortingExtractor. This will remove a dependency and, hopefully, make things faster!

@DradeAW
Copy link
Contributor Author

DradeAW commented Nov 29, 2023

There is a question on how to retrieve the unit_ids for this extractor:
If we retrieve them from the spike vector (meaning from spike_clusters.npy), then it will remove the empty units (which is not done by default in the current version).
However if we retrieve the unit_ids from the tsv information (current version), then we run the risk of a tsv file not having all the ids, which deletes all units not in that tsv file (I actually encountered this bug once in the past where some units where not loaded).

What do you think is best?

@zm711
Copy link
Collaborator

zm711 commented Dec 4, 2023

I use the PhyExtractor sometimes (so have an interest here), could I have you clarify two things for me:

then it will remove the empty units (which is not done by default in the current version).

is this a problem? If a unit is empty is there a reason I would want to keep it around? Isn't this a weird KS artifact that happens sometimes?

which deletes all units not in that tsv file

You mean like masks those units right? It's not that they are actually deleted, it's just that they are not analyzed in that session because they are not loaded. I would be curious about the exact time you encountered this. Since I used this extractor sometimes I might want to go back and double check myself.

@DradeAW
Copy link
Contributor Author

DradeAW commented Dec 4, 2023

is this a problem? If a unit is empty is there a reason I would want to keep it around? Isn't this a weird KS artifact that happens sometimes?

I myself don't want to keep an empty unit around, but I want to make sure that everybody is on the same page!
Yes it happens for Kilosort, for some reason.

You mean like masks those units right? It's not that they are actually deleted

There's not deleted from the phy folder, but they are not in the sorting object

I would be curious about the exact time you encountered this

I made a tsv file but with the unit_id instead of the cluster_id, and instead of having weird values for this entry I had almost half of the units I was supposed to have in the PhySortingExtractor ^^'

@zm711
Copy link
Collaborator

zm711 commented Dec 4, 2023

Cool thanks for the response. I would also agree to not load the empty units (even though that was old behavior), but if we want to stay consistent I could understand maintaining the old behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extractors Related to extractors module performance Performance issues/improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants