Online quantization algorithm for gudhi #536

Draft
wants to merge 18 commits into base: master

Conversation

tlacombe
Contributor

@tlacombe tlacombe commented Oct 15, 2021

Provide a quantization algorithm to "summarize" a collection of persistence diagrams.

(At least) one thing that may be discussed:

  • I put the code in the python/gudhi/wasserstein/ directory because it has a "Wasserstein metric" flavor (we minimize a quantity expressed in terms of the Wasserstein distance between persistence diagrams). However, it does not rely on POT as the other functions in that directory do; we never actually need to compute a Wasserstein distance or matching explicitly. Perhaps it should live directly under gudhi/ instead?

Also TODO:

  • Check quantization.py: is the copyright correct?

tlacombe and others added 7 commits November 10, 2021 16:26
Co-authored-by: Vincent Rouvreau <10407034+VincentRouvreau@users.noreply.github.com>
Co-authored-by: Vincent Rouvreau <10407034+VincentRouvreau@users.noreply.github.com>
Co-authored-by: Vincent Rouvreau <10407034+VincentRouvreau@users.noreply.github.com>

@tlacombe
Contributor Author

tlacombe commented Dec 1, 2021

Corrected.
Any thoughts on whether this code belongs in the /wasserstein module? From a theoretical perspective, the goal is to solve a minimization problem with respect to the Wasserstein distance between persistence diagrams (so it makes sense to put it there), but from a practical (code) perspective, it does not rely on POT, unlike the other functions in the /wasserstein module (because we can solve our optimization problem without explicitly computing such distances/matchings).
My feeling is that it can stay there: writing gudhi.wasserstein.quantization makes it clear that we are quantizing something with respect to the Wasserstein distance. I am open to discussion, of course.

Member

@mglisse mglisse left a comment

I guess keeping it in wasserstein/ is ok.

different tori with some additional noise.
Starting from an initial codebook ``c0``, centroids are iteratively updated as new diagrams are provided.
As we use the standard metrics between persistence diagrams (denoted here by :math:`\mathrm{OT}_2`), points in the
diagrams that are close to the diagonal do not interfere in the codebook update process.
Member

So it is the same as having an implicit point on the diagonal in the codebook?

Contributor Author

More precisely, having a point in the codebook that represents "all the points on the diagonal" (or, formally, looking at the quotient space where you identify the points on the diagonal).
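
To make the "implicit diagonal point" concrete, here is a minimal sketch, assuming a Euclidean ground metric. The helper name `assign_with_diagonal` is hypothetical and is not the function used in this PR; it only illustrates treating the diagonal as one extra codebook entry when assigning points.

```python
import numpy as np


def assign_with_diagonal(points, codebook):
    """Return, for each point, the index of its nearest centroid,
    or len(codebook) when the diagonal is closer (hypothetical helper)."""
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    codebook = np.asarray(codebook, dtype=float).reshape(-1, 2)
    # Euclidean distances between every point and every centroid.
    dist = np.linalg.norm(points[:, None, :] - codebook[None, :, :], axis=2)
    # Euclidean distance from (b, d) to the diagonal {b == d} is |d - b| / sqrt(2).
    to_diag = np.abs(points[:, 1] - points[:, 0]) / np.sqrt(2)
    # Append the diagonal as one extra "centroid" column.
    full = np.hstack([dist, to_diag[:, None]])
    return np.argmin(full, axis=1)


codebook = np.array([[0.0, 4.0], [1.0, 6.0]])
diagram = np.array([[0.1, 3.8], [2.0, 2.1], [1.2, 5.5]])
labels = assign_with_diagonal(diagram, codebook)
```

Points whose label equals `len(codebook)` are absorbed by the diagonal and would not move any centroid.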

@tlacombe
Contributor Author

tlacombe commented May 18, 2022

I just realized that I never made the last requested modifications (my local build was broken for some reason at the time).
I have finally done them.
As I'm working on a new machine, I hope I handled the fork/branching/etc. correctly.

PS: and one day later I realize that I forgot to post this comment... 😴

Member

@mglisse mglisse left a comment

The algorithm is presented as an online algorithm. So it should be normal to give it some data, look at the codebook at that point, pass it more data, look at the updated codebook, etc. The init parameter could be used towards that goal, but the number of diagrams (or batches) already processed is forgotten, and indeed t (the step counter that sets the learning rate) is reset to 0 at every call.
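
One way to support this usage could be an object that keeps both the codebook and the step counter between calls. The sketch below is only an illustration under stated assumptions: the class `OnlineQuantizer` and its method `partial_fit` are hypothetical names (not gudhi API), the ground metric is Euclidean, and the update is a simplified version of the `c - grad / (t + 1)` step quoted in this review, with `grad` taken as the centroid minus the mean of the batch points assigned to it.

```python
import numpy as np


class OnlineQuantizer:
    """Hypothetical wrapper that remembers the step count between calls."""

    def __init__(self, init_codebook):
        self.codebook = np.array(init_codebook, dtype=float).reshape(-1, 2)
        self.t = 0  # number of batches processed so far

    def partial_fit(self, batch):
        """Update the codebook with one batch (a list of (n_i, 2) diagrams)."""
        diags = [np.asarray(d, dtype=float).reshape(-1, 2) for d in batch]
        points = np.concatenate(diags) if diags else np.empty((0, 2))
        if len(points) and len(self.codebook):
            # Distances to the centroids, plus one column for the diagonal.
            dist = np.linalg.norm(points[:, None, :] - self.codebook[None, :, :], axis=2)
            to_diag = np.abs(points[:, 1] - points[:, 0]) / np.sqrt(2)
            labels = np.argmin(np.hstack([dist, to_diag[:, None]]), axis=1)
            for j in range(len(self.codebook)):
                cell = points[labels == j]
                if len(cell):
                    # Simplified gradient: centroid minus the mean of its cell.
                    grad = self.codebook[j] - cell.mean(axis=0)
                    # Decreasing step size; t is NOT reset between calls.
                    self.codebook[j] -= grad / (self.t + 1)
        self.t += 1
        return self.codebook


q = OnlineQuantizer([[0.0, 10.0]])
q.partial_fit([np.array([[0.0, 8.0]])])   # first batch, step size 1
q.partial_fit([np.array([[0.0, 10.0]])])  # second batch, step size 1/2
```

Because `t` lives on the object, inspecting `q.codebook` between calls does not restart the learning-rate schedule.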

(the two loops generating the tori).

.. figure::
./img/quantiz.gif
Member

On the one hand, the GIF is cool. On the other hand, I have trouble reading the doc with that thing moving on my screen...

if withdiag:
    a = np.argmin(M[:-1, :], axis=1)
else:
    a = np.argmin(M[:-1, :-1], axis=1)
Member

It feels a bit strange to call _build_dist_matrix, whose main difference with cdist is that it adds the diagonal, just to drop the diagonal immediately... But I don't think it really matters.

    X_batch = np.concatenate(list_of_non_empty_diags)
    return X_batch
else:
    return np.array([])
Member

It is sometimes useful to force the shape of empty arrays, to (0,2) for instance. I don't know if that's the case here.
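
For illustration, `np.array([])` has shape `(0,)`, so column indexing such as `X[:, 0]` fails on it, whereas `np.empty((0, 2))` keeps the two-column layout. The helper below is a hypothetical sketch (the name `stack_diagrams` is not from this PR) of returning a `(0, 2)` array when the batch has no points.

```python
import numpy as np


def stack_diagrams(diags):
    """Concatenate diagrams into one (n, 2) array (hypothetical helper)."""
    non_empty = [np.asarray(d, dtype=float).reshape(-1, 2) for d in diags if len(d) > 0]
    if non_empty:
        return np.concatenate(non_empty)
    # Keep two columns even when there are no points at all.
    return np.empty((0, 2))


batch = stack_diagrams([[], np.empty((0, 2))])
# Column arithmetic still works on the empty result.
persistence = batch[:, 1] - batch[:, 0]
```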

:param internal_p: Ground metric to assess centroid assignment. Default is ``2.``.
:type internal_p: ``float``

:returns: The final codebook obtained after going through the whole pdiagset.
Member

:rtype: kx2 numpy array?


def _init_c(pdiagset, k, internal_p=2):
    """
    A naive heuristic to initialize a codebook: we take the k points with largest distances to the diagonal
Member

What if the first diagram has fewer than k points?
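
One possible way to handle that case, sketched under assumptions: pool points from successive diagrams until at least k are available, and fail with an explicit error otherwise. The function `init_codebook` below is hypothetical and is not the PR's `_init_c`. It ranks points by `|d - b|`, which orders them by distance to the diagonal for any p-norm ground metric, since that distance is proportional to `|d - b|`.

```python
import numpy as np


def init_codebook(pdiagset, k):
    """Pick the k points farthest from the diagonal, pooling diagrams as needed."""
    pooled = np.empty((0, 2))
    for diag in pdiagset:
        pooled = np.concatenate([pooled, np.asarray(diag, dtype=float).reshape(-1, 2)])
        if len(pooled) >= k:
            break
    if len(pooled) < k:
        raise ValueError("need at least %d points, got %d" % (k, len(pooled)))
    # Rank by persistence |d - b|, largest first, and keep the top k.
    persistence = np.abs(pooled[:, 1] - pooled[:, 0])
    keep = np.argsort(persistence)[::-1][:k]
    return pooled[keep]


pdiagset = [np.array([[0.0, 1.0]]), np.array([[0.0, 5.0], [1.0, 4.0]])]
c0 = init_codebook(pdiagset, k=2)
```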

Comment on lines +122 to +123
:param batch_size: Size of batches used during the online exploration of the ``pdiagset``.
Default is ``1``.
Member

As a user, should I stick to the default value of 1? If I already have all the diagrams, I may think that I don't need an online algorithm, which is for when data appears progressively, and consider using one huge batch under the impression that it disables the "online" stuff and gets the best result.

    # stochastic-gradient-descent like approach (decreasing learning rate).
    c_current[j] = c_current[j] - grad / (t + 1)
else:
    raise NotImplementedError('Order = %s is not available yet. Only order=2. is valid' % order)
Member

I think you could error out earlier (or not provide this option at all and just say that it is W2).
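
A minimal sketch of erroring out earlier: validate the order once at function entry, before any data is processed. The helper name `check_order` is hypothetical. Note that the exception class for this case is `NotImplementedError`; `NotImplemented` is a constant meant for binary operator methods and is not an exception.

```python
def check_order(order):
    """Reject unsupported orders up front (hypothetical helper)."""
    if order != 2:
        raise NotImplementedError(
            "order=%s is not available yet; only order=2 is supported" % order
        )


check_order(2.0)  # passes silently; any other value raises immediately
```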

@mglisse mglisse marked this pull request as draft July 8, 2023 19:08