
Fix problem in selection of the principal directions in Tangent PCA. #1878

Open
wants to merge 3 commits into main
Conversation

YannCabanes
Collaborator

This PR is related to PR #1513 in which a solution is proposed.

The current Tangent PCA algorithm does not necessarily select the axis which corresponds to the largest variance.
The problem is that the current Tangent PCA does not take the metric at the mean into account when sorting the principal directions.

To show this problem, I have created an almost Euclidean manifold (e488adb).
This manifold is the Euclidean space of dimension 2 whose metric has been modified to give more weight to the second axis. Its infinitesimal line element is defined by: ds^2 = dx1^2 + 10000 dx2^2.
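Concretely, this line element corresponds to the constant diagonal metric matrix diag(1, 10000). As a plain-numpy illustration (this is not the code of commit e488adb, which defines the full manifold class), the induced squared norm of a tangent vector can be computed as:

import numpy as np

# Constant metric matrix encoding ds^2 = dx1^2 + 10000 * dx2^2 on R^2.
metric_matrix = np.diag([1.0, 10000.0])

def squared_norm(tangent_vec):
    # <v, v> under the modified metric; the base point does not matter
    # since the metric matrix is constant.
    v = np.asarray(tangent_vec, dtype=float)
    return float(v @ metric_matrix @ v)

print(squared_norm([10.0, 0.0]))  # 100.0
print(squared_norm([0.0, 1.0]))   # 10000.0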
Then, if we run the following code (e488adb):

# Imports from geomstats; AlmostEuclideanDimTwo is the manifold defined in commit e488adb.
import geomstats.backend as gs
from geomstats.learning.pca import TangentPCA

almost_euclidean_dim_two = AlmostEuclideanDimTwo()
tpca = TangentPCA(space=almost_euclidean_dim_two, n_components=1)
point_a = gs.array([10, 0], dtype=float)
point_b = gs.array([0, 1], dtype=float)
X = gs.stack([point_a, -point_a, point_b, -point_b], axis=0)
mean = gs.array([0, 0], dtype=float)
tpca.fit(X=X, base_point=mean)
variance_axis_1 = almost_euclidean_dim_two.metric.squared_dist(mean, point_a)  # variance_axis_1 = 100
variance_axis_2 = almost_euclidean_dim_two.metric.squared_dist(mean, point_b)  # variance_axis_2 = 10000
assert variance_axis_2 >= variance_axis_1  # True
axis_1 = gs.array([1, 0])
axis_2 = gs.array([0, 1])
assert gs.all(tpca.components_ == axis_1)  # True with the current code, which is the bug

The tangent PCA should return axis_2 as the principal axis, since it is the axis which corresponds to the larger variance, but it currently returns axis_1 instead.
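To see where the ranking comes from, here is a plain-numpy illustration (not geomstats code) comparing the per-axis variance of the log vectors with and without the metric weighting; since the space is flat and the mean is the origin, the log vectors are simply the data points themselves:

import numpy as np

# Log vectors at the mean (equal to the points, since the space is flat
# and the mean is the origin).
logs = np.array([[10.0, 0.0], [-10.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
sqrt_metric = np.diag([1.0, 100.0])  # square root of diag(1, 10000)

print(logs.var(axis=0))                  # [50.   0.5]  -> axis_1 ranked first (current behavior)
print((logs @ sqrt_metric).var(axis=0))  # [  50. 5000.]  -> axis_2 ranked first (expected behavior)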

To solve this problem, we can multiply the log vectors at the mean by the square root of the metric matrix to select the correct principal directions (see PR #1513).
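A minimal sketch of this idea, reusing the objects of the example above (the actual implementation is the one proposed in PR #1513; the calls below assume the standard geomstats metric API):

import geomstats.backend as gs

# Rescale the tangent vectors at the mean by the matrix square root of the
# metric matrix, so that a standard Euclidean PCA on the rescaled vectors
# ranks the directions by their Riemannian variance.
metric_matrix = almost_euclidean_dim_two.metric.metric_matrix(mean)
sqrt_metric = gs.linalg.sqrtm(metric_matrix)
tangent_vecs = almost_euclidean_dim_two.metric.log(X, base_point=mean)
rescaled_vecs = gs.einsum("ij,nj->ni", sqrt_metric, tangent_vecs)
# A Euclidean PCA is then fitted on rescaled_vecs; its components have to be
# mapped back through the inverse square root to give tangent directions.

In the example above the rescaled vectors become [10, 0], [-10, 0], [0, 100], [0, -100], so the PCA on the rescaled vectors ranks axis_2 first, as expected.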

@YannCabanes
Collaborator Author

Hello @luisfpereira and @ninamiolane,
Do you agree that the current behavior of the tangent PCA described above is not the expected one?
(In the example in dimension 2, it does not rank first the direction with the largest variance.)

@codecov

codecov bot commented Jul 26, 2023

Codecov Report

Merging #1878 (12b3266) into master (6386ade) will increase coverage by 9.00%.
The diff coverage is 93.34%.

@@            Coverage Diff             @@
##           master    #1878      +/-   ##
==========================================
+ Coverage   82.59%   91.58%   +9.00%     
==========================================
  Files         136      141       +5     
  Lines       13577    13645      +68     
==========================================
+ Hits        11213    12496    +1283     
+ Misses       2364     1149    -1215     
Flag Coverage Δ
numpy 87.38% <93.34%> (?)
pytorch 82.47% <4.45%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
geomstats/learning/pca.py 94.00% <93.34%> (+77.60%) ⬆️

... and 45 files with indirect coverage changes


@luisfpereira self-requested a review July 26, 2023 14:46
@YannCabanes
Collaborator Author

The tests pass except Linting, Testing / build (ubuntu-latest, 3.11, autograd), and DeepSource (these errors are not related to this PR).
