Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor find peaks 1d #3196

Open
wants to merge 6 commits into
base: RELEASE_next_minor
Choose a base branch
from

Conversation

CSSFrancis
Copy link
Member

@CSSFrancis CSSFrancis commented Jul 19, 2023

Description of the change

Refactor find_peaks1D to copy the find_peaks2D functionality

  • Add support for lazy peak finding
  • Removed custom dtype (for lazy plotting etc)
  • Renamed function for more inline

Progress of the PR

  • Change implemented (can be split into several points),
  • update docstring (if appropriate),
  • update user guide (if appropriate),
  • add an changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • Check formatting changelog entry in the readthedocs doc build of this PR (link in github checks)
  • add tests,
  • ready for review.

Minimal example of the bug fix or the new feature

import hyperspy.api as hs
import numpy as np
s = hs.signals.Signal1D(np.arange(10))
peaks = s.find_peaks()

Note that this example can be useful to update the user guide.

@CSSFrancis CSSFrancis changed the base branch from RELEASE_next_minor to RELEASE_next_major July 19, 2023 16:43
@CSSFrancis CSSFrancis mentioned this pull request Jul 19, 2023
57 tasks
@codecov
Copy link

codecov bot commented Jul 19, 2023

Codecov Report

Attention: 17 lines in your changes are missing coverage. Please review.

Comparison is base (56a5db3) 81.30% compared to head (641f48a) 81.38%.
Report is 670 commits behind head on RELEASE_next_major.

Files Patch % Lines
hyperspy/utils/peakfinders1D.py 72.58% 10 Missing and 7 partials ⚠️
Additional details and impacted files
@@                  Coverage Diff                   @@
##           RELEASE_next_major    #3196      +/-   ##
======================================================
+ Coverage               81.30%   81.38%   +0.08%     
======================================================
  Files                     176      177       +1     
  Lines                   24406    24461      +55     
  Branches                 5681     5688       +7     
======================================================
+ Hits                    19843    19908      +65     
+ Misses                   3258     3246      -12     
- Partials                 1305     1307       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jlaehne
Copy link
Contributor

jlaehne commented Sep 5, 2023

Good idea, but how does the ragged signal ease access to e.g. the array of peak-centre values? The data object now contains an array of 3 elements at every pixel. And even a multidimensional array if maxpeak>1.

In the ideal case, I'd like to access the peak parameters in a similar way like with the parameters of a fit results, e.g. Gaussian.center.map['values'] to get the array of center values of a Gaussian component. So something like peaks[0]['center'] for the center values of the first peak found.

@CSSFrancis
Copy link
Member Author

CSSFrancis commented Sep 6, 2023

Good idea, but how does the ragged signal ease access to e.g. the array of peak-centre values? The data object now contains an array of 3 elements at every pixel. And even a multidimensional array if maxpeak>1.

In the ideal case, I'd like to access the peak parameters in a similar way like with the parameters of a fit results, e.g. Gaussian.center.map['values'] to get the array of center values of a Gaussian component. So something like peaks[0]['center'] for the center values of the first peak found. As a general rule directly working with the data attribute can cause some

@jlaehne That's a good point. I would prefer to avoid the custom dtype implementations in numpy, especially when all of the column types are equivalent. You end up losing a fair bit of functionality such as slicing the data using a boolean array which is necessary if you want to say exclude some peak. There is a fairly big speed hit as well. Not being able to select some column using integers can start to be a little frustrating and it's not always very clear what the names of the indexes are. Additionally, when doing things like creating markers from a list of vectors the custom dtype starts to become problematic.

In any case custom dtypes don't really work terribly well in hyperspy. For example to get the centers at position 10,15 you would have to do peaks.data[15,10]["center"] as peaks.inav[10,15]["center"] returns an error because peaks.inav[10,15] returns a ragged signal. I'm not the biggest fan of directly dealing with the data attribute as it can cause weird things with the axes_manager

I understanding the desire to slice the data using strings rather than indexes in some cases. Unfortunately, this is currently a little difficult in hyperspy. Ragged signals are a big headache and to be honest I've never really figured out how to properly deal with them.

#3055 might help...

@jlaehne
Copy link
Contributor

jlaehne commented Sep 6, 2023

Well, the old implementation is not practical at all either. So I'm happy to change it. I like the custom dtype for the components, where you can get a whole map as it is defined for the overall array. But for the peak-finder result, the custom dtype is defined at every position, which definitely is bogus. However, if we are already breaking the api, I would like to have an easy to use solution - and we should add some examples to the documentation on how to access the data. I thought keywords of a custom dtype might help, as slicing the multi-dimensional dataset is not straightforward. I would be happy with another solution if it is well documented. Currently, I struggle with both the old and new implementation to get the information out.

What spontaneously comes to my mind is that we should be able to extract:

  • Peak parameters at a certain position
  • 1st ... nth peak if maxpeaks>1
  • Maps of individual peak parameters for one of multiple peaks found
    While allowing for slicing with boolean arrays

@CSSFrancis
Copy link
Member Author

CSSFrancis commented Sep 6, 2023

Well, the old implementation is not practical at all either. So I'm happy to change it. I like the custom dtype for the components, where you can get a whole map as it is defined for the overall array. But for the peak-finder result, the custom dtype is defined at every position, which definitely is bogus. However, if we are already breaking the api, I would like to have an easy to use solution - and we should add some examples to the documentation on how to access the data. I thought keywords of a custom dtype might help, as slicing the multi-dimensional dataset is not straightforward. I would be happy with another solution if it is well documented. Currently, I struggle with both the old and new implementation to get the information out.

Yea that part frustrates me as well. I find it very ambiguous about what you are actually looking at.

What spontaneously comes to my mind is that we should be able to extract:

  • Peak parameters at a certain position
  • 1st ... nth peak if maxpeaks>1
  • Maps of individual peak parameters for one of multiple peaks found
    While allowing for slicing with boolean arrays

Additionally I would like to add:

  • Information about pixel vs calibrated values
  • simple method for annotation (i.e. to peaks method)
  • Implicit casting from the find_peaks method.

I've played around with this a lot as well/ looked for different implementations in other packages and haven't really seen anything that I've actually liked. Honestly the fastest way to handle vectors is to flatten everything into a long sorted list, but in that case you can't really have lazy lists of vectors and the data is a little awkward.

One thing that I am going to do with pyxem is add the alias "diffraction_signal" to the DiffractionVector class which should make it so that the find_peaks method returns a DiffractionVector rather than a BaseSignal. At that point my plan is to just make all of the above options available through custom methods.

We could add a VectorSignal class if we want to share some methods:

class VectorSignal(BaseSignal):

    """A Vector signal is a Ragged array of Vectors.  At each navigation position there is some list of
    n x m vectors where n is variable but m is fixed and the dimensions of m are defined by the `.axes_manager.vector_axes`
    attribute"""
    
    def __init__(self,):
        self.ivec = SpecialVectorSlicer(self) # for slicing along the ragged dimension using the map function

class SpecialVectorSlicer:

    def __init__(self,signal):
        self.signal=signal
    
    def __getitem__(value):
        # allow getting an item based on the signal.axes_manager.vector_signals indexes (i.e. s.ivec["center"])
        # allow for more complex slicing as well?  s.ivec["center", "width" ]
        # Boolean operations?  s.ivec["center>5", "width<1"]
        # Filtering based vector slicing? s[s.ivec["center>5", "width<1"]]
        return self.signal.map(getitem_vector, value, inplace=False)

I've implemented something fairly similar a couple of times but it's never really been clean enough to add to hyperspy.

I realize that isn't the best answer but every implementation I've written has been too fragile for a good generic addition to hyperspy. That's why I've moved towards trying to shift the implementation to pyxem. At least there I can have a bit better control and as long as upstream in hyperspy:

  1. The signal axes are saved in the metadata
  2. The signal is automatically cast to a pyxem DiffractionVector Signal

Then it should be easy to handle things downstream in pyxem and maybe eventually we can move a more stable implementation back to hyperspy.

@jlaehne
Copy link
Contributor

jlaehne commented Sep 6, 2023

Just thinking loud, could find_peaks return a model, where each peak is a registered component. It would not result from a fit, but the type of data is pretty similar so that it would be nice to have a similar datastructure and way to access it.

@CSSFrancis
Copy link
Member Author

So the map function could return a model or a list of components but I don't know if it would work the same way that a model would in hyperspy.

I've considered this in the past as I would like the ability to use the map function for model fitting if you already have preset idea of the fitting parameters. The problem seems to be that the model class is not really designed with multidimensionality in mind. I'm not necessarily sure that using the Model class is the exact right way to go because of that.

Another thing to consider is that Components are separate and for peaks we necessarily have all of the same components. As a result Components would unnecessarily complicate things and make things slow to manipulate and operate on as a result.

Copy link
Member

@ericpre ericpre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the structure that it should return, what's about returning separate ragged signals, one for each characteristic: position, width, and height?
I am not sure if we would expect the find_peak function to be use for mapping purposes, or at least not directly, maybe find the finds on a sum/average data and parse these to know the feature of interest of the dataset and use them to create a model or to get some maps.

Comment on lines +1278 to +1279
peaks.metadata.add_node("Peaks") # add information about the signal Axes
peaks.metadata.Peaks.signal_axes = deepcopy(self.axes_manager.signal_axes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that the data are in "pixels"? Why not use the axes information to convert to calibration values? If in some scenarios, it is convenient to have the data in "pixel", maybe add an argument to make it optional, with the default of returning calibrated values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's just how the find_peaks (2D) method works so I copied it. We should have the option in returning real units in both cases. In any case saving the axes_manager in the metadata is a good backup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, it would be very good to have both (1D and 2D) returning similar output!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I do would want to get calibrated units without a detour through the axis manager. For me that would be the actually be the expected default behavior.

@CSSFrancis
Copy link
Member Author

Regarding the structure that it should return, what's about returning separate ragged signals, one for each characteristic: position, width, and height?

@ericpre I don't think that is a great idea. 1st returning multiple outputs isn't (currently) supported from the map function. I've got some "workingish" code that would allow multiple outputs but we would have to add some things like a hs.compute() function to merge the task graphs so things run efficiently.

I am not sure if we would expect the find_peak function to be use for mapping purposes, or at least not directly, maybe find the finds on a sum/average data and parse these to know the feature of interest of the dataset and use them to create a model or to get some maps.

I'm not sure either. I'm mostly expecting what is returned from this function to get pushed to the sub packages. For pyxem there are so many things we want to do with vectors and it's better to just handle it there and make sure all of the information is maintained.

@ericpre
Copy link
Member

ericpre commented Sep 24, 2023

If there is consensus on what would be the API, I think that it would be good (and easily feasible) to get it done for the 2.0 release as this will be the API. @jlaehne, @CSSFrancis, what do you think?

@CSSFrancis
Copy link
Member Author

@ericpre I'm all for getting this in before the 2.0.0 release :) I'd love some suggestions on what to do here.

What do you think about creating a new ragged signal?

I think we want:

  1. units for each column
  2. name for each column
  3. if each column is in pixels or calibrated units.

With some functions to:

  1. convert from pixel units--> real units
  2. get the values from some column

@ericpre
Copy link
Member

ericpre commented Sep 24, 2023

Sorry, I didn't see your message above! Not as easy then... 😅

In principle, the calibration information should go in s.metadata.Signal.quantity:
https://hyperspy.org/hyperspy-doc/dev/reference/metadata.html#signal and in the gain (https://hyperspy.org/hyperspy-doc/dev/reference/metadata.html#variance-linear-model)? Currently, as this is for "intensity", it may not work well for 2D signal...

Maybe we need to leave to leave for after the 2.0 release... in this case, we could have the old and the new function living side by side with the old one deprecated and they would be fairly standalone.

@ericpre
Copy link
Member

ericpre commented Sep 25, 2023

The Signal2D.find_peaks method has a get_intensity argument, which add a column to the ragged array. Maybe we should still use this approach for now, this is already an improvement on the current situation and will be consistent with the Signal2D counterpart?

Regarding utilities to convert units, etc. maybe it would be best to leave for later, as this will most likely not be an API break?

@CSSFrancis
Copy link
Member Author

@ericpre this is fine. I think that I will probably start to write a VectorSignal class in pxyem and then once we figure out some of the bugs we can start to move some of the implementation upstream. As long as I set the signal_type to "diffraction" the results of the find_peaks function returns a pyxem signal. I'll just need to add a 0-D Signal for each of the signals.

The Signal2D.find_peaks method has a get_intensity argument, which add a column to the ragged array. Maybe we should still use this approach for now, this is already an improvement on the current situation and will be consistent with the Signal2D counterpart?

So what would be action here before the 2.0.0 release?

Not to crush the hopes of getting the real units to work, but currently that would be helped significantly by #3031 and #3055 as they add in features like converting arrays of points to calibrated values for all axes types. Otherwise you would have to rewrite that part.

Regarding utilities to convert units, etc. maybe it would be best to leave for later, as this will most likely not be an API break?

Yea I think this requires a better handle on the Axes classes in hyperspy. :) So I think we have come full circle.

@ericpre
Copy link
Member

ericpre commented Sep 25, 2023

So what would be action here before the 2.0.0 release?

I would suggest:

  • use the same approach as the Signal2D.find_peaks: returns ragged array with 1, 2 or 3 columns (?) as defined by an argument. The main difference is to return a ragged array.
  • (good to have) Add an argument to use "pixel" or "calibrated values"? This doesn't break the API, so it can well be done later.

@ericpre
Copy link
Member

ericpre commented Oct 31, 2023

@CSSFrancis, what are your thoughts for this PR?

@CSSFrancis
Copy link
Member Author

CSSFrancis commented Nov 1, 2023

@ericpre I don't know if I'll have time to come back to this in the next couple of days or if we have a great answer for how to handle this kind of data.

I don't really need this for anything I am planning on doing so it's a little lower priority for me :) I would say just let this slide and then we can deprecate and replace it

@ericpre
Copy link
Member

ericpre commented Nov 1, 2023

Okay, let's park this for now then, particularly if you think that it would be benefit from other PRs, like #3031 and #3055. In any case, the deprecation cycle will be simple to handle as we can simply add Signal1D.find_peaks and keep Signal1D.find_peaks1D_ohaver.

@ericpre ericpre removed this from the v2.0 Split milestone Nov 1, 2023
@CSSFrancis CSSFrancis mentioned this pull request Nov 21, 2023
10 tasks
@CSSFrancis CSSFrancis deleted the branch hyperspy:RELEASE_next_minor December 22, 2023 14:03
@CSSFrancis CSSFrancis closed this Dec 22, 2023
@ericpre
Copy link
Member

ericpre commented Dec 22, 2023

Re-opening because this has been closed automatically by mistake!

@ericpre ericpre reopened this Dec 22, 2023
@ericpre ericpre changed the base branch from RELEASE_next_major to RELEASE_next_minor December 22, 2023 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants