Accessing the intermediate SOM states during the training #182

Closed
fransua opened this issue Mar 30, 2021 · 30 comments

@fransua commented Mar 30, 2021

Is your feature request related to a problem? Please describe.
I trained my SOM for 2000 epochs and would like to store intermediate results (e.g. every 500 epochs), something like:

using Serialization, DelimitedFiles  # for serialize and writedlm

datainfo = loadCSVSet(:test, files, header=false)
som = initGigaSOM(datainfo, 20, 20, seed=seed)
radius_list = [10, 8.9, 7.8, 6.7, 5.6, 4.5, 3.4, 2.3, 1.2, 0.5, 0.1]
for i in 1:10
    som = trainGigaSOM(som, datainfo, rStart=radius_list[i], rFinal=radius_list[i+1],
                       epochs=200, radiusFun=linearRadius)
    e = embedGigaSOM(som, datainfo)
    e2 = distributed_collect(e)
    writedlm(string("GigaSOM_iker_1400k_embed_seed", seed, "_epochs", 200i, ".tsv"), e2, '\t')
    open(f -> serialize(f, som), "partly_trained_$(i).jls", "w")
end

I want to assess whether I did enough training. The problem is that with this strategy I can only use linearRadius, or do a very ugly hack inputting a specific radius function.

Describe the solution you'd like
Perhaps one could pass a starting/ending epoch (iteration) to the train function (at the `for j = 1:epochs` loop).

Describe alternatives you've considered
Allow calling a function every X epochs, in order to serialize the som object or save the coordinates.

Additional context
none

@exaexa (Collaborator) commented Mar 30, 2021

Hello! Would a per-epoch callback work for you? Very roughly like:

trainGigaSOM(som, di, ...,
  beforeEachEpochDo = (epoch, data, som) ->
    if epoch % 20 == 0
      e = embedGigaSOM(som, data)
      writedlm(...e...)
    end)

Anyway, 2000 epochs is a bit of overkill for normal use, so I guess you're basically trying to visualize how the SOM training progresses?

EDIT: one good way to check whether your SOM has been trained well is to plot your datapoints (cells) in a normal 2D scatterplot and then plot the som.codes over them -- the distribution of the codes should roughly match that of the datapoints, although it will be much sparser. If the SOM is under-trained, you'll likely see that the "corners" of the data do not get enough "coverage" by som.codes.

@fransua (Author) commented Mar 31, 2021

Hello,
thanks for the quick reply.
Sorry, I don't really know Julia; I think what you are proposing is to create a function that gets called every X epochs (20 in the example) to store data or do whatever else. This would indeed help: it would allow me to visualize the progress of the training.
One side question: can I create a GigaSOM object from an embedded array, or do I need to serialize the som instance? In your experience, is serialization of these huge GigaSOM objects a good (possible) option?

Another part of my problem: if I train my data for 1000 epochs, stop the process, and then want to resume the training (after the 1000 epochs). As your epoch count starts at 1 (the `for j = 1:epochs` loop), the expRadius function will restart its exponential decay if I resume the training after 1000 epochs.
My solution so far is to create a radius function that hacks the rStart and rFinal parameters to account for the extra rounds of epochs:

function myRadius(initRadius::Float64, finalRadius::Float64, iteration::Int64, epochs::Int64)
    if (iteration + convert(Int, finalRadius)) <= convert(Int, initRadius)
        val = [expRadius(100.0)(20.0, 0.1, i, 5000) for i in 1:5000][iteration + convert(Int, finalRadius)]
    else
        val = [1 - 1 / 200 * i for i in 1:200][iteration + convert(Int, finalRadius) - convert(Int, initRadius)]
    end
end

(I know... super ugly)
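For what it's worth, the resume trick above can be written more generically as a wrapper that shifts the epoch counter by a fixed offset, so a resumed run continues the original schedule instead of restarting it. A minimal Python sketch of the idea; both `exp_radius` and `with_epoch_offset` are hypothetical names for illustration, not GigaSOM's API, and the decay formula is a generic one, not necessarily GigaSOM's exact expRadius:

```python
def exp_radius(r_start, r_final, epoch, epochs):
    """Generic exponentially decaying radius (illustrative only):
    multiply by a constant factor each epoch so the radius reaches
    r_final exactly at the last epoch."""
    decay = (r_final / r_start) ** (1 / (epochs - 1))
    return r_start * decay ** (epoch - 1)

def with_epoch_offset(radius_fun, offset, total_epochs):
    """Wrap a radius function so that local epoch i of a resumed run
    is mapped to global epoch i + offset of the original schedule."""
    def shifted(r_start, r_final, epoch, _epochs):
        return radius_fun(r_start, r_final, epoch + offset, total_epochs)
    return shifted

# Resuming after 1000 of 2000 planned epochs: local epoch 1 continues
# the decay at global epoch 1001 instead of restarting at r_start.
resumed = with_epoch_offset(exp_radius, 1000, 2000)
print(resumed(20.0, 0.1, 1, 500))  # continues the decay mid-schedule
```

This keeps the hack in one place instead of hard-coding the offset into rStart/rFinal.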
So perhaps adding extra parameters to the trainGigaSOM function would make it easier to split and control the training.

About the 2000 epochs: between 500 and 1000 epochs I still see important movements (same seed), so we are now trying 2000 epochs. And as this is very long (days), we (and the planet :P) would rather start from the already pre-trained som... The data is huge: millions of datapoints, thousands of dimensions. And by the way, we are very thankful for GigaSOM and to your team!

About the check you propose, I tried it with a random example:

using GigaSOM
using Plots
using Random

Random.seed!(1)
d = randn(30000, 5) .+ rand(0:1, 30000, 5) .* 10
som = initGigaSOM(d, 20, 20)
anim = @animate for train in 1:100
    radius = (100 - train)^2 / 1000 + 0.1
    println("$train: $radius")
    som = trainGigaSOM(som, d, epochs=1, rStart=radius, rFinal=radius)
    e = embedGigaSOM(som, d)
    e2 = embedGigaSOM(som, som.codes)
    scatter(e2[:, 1], e2[:, 2], ms=4, legend=false, alpha=0.5, color="grey")
    scatter!(e[:, 1], e[:, 2], ms=1, legend=false)
    annotate!(9, 21, text("epoch-$train (r:$(round(radius, digits=1)))", 9))
    xlims!(0, 20)
    ylims!(0, 21)
    savefig("test_$train.png")
end

gif(anim, "anim_fps15_1-100_r10-0.1.gif", fps = 15)

which gives me:

[animation: anim_fps15_1-100_r10-0.1.gif]

But I am not sure whether the datapoints surrounding the map are due to a lack of training, or rather to a too-large radius.

For instance, if I do the same keeping the radius fixed at 1.0, I get:

[animation: anim_fps15_1-100_r1.0.gif]

My interpretation is that after very few epochs with a small radius, you get a more or less homogeneous distribution of datapoints per codebook vector.

The way I wanted to check the training was to visualize (measure?) the amount of change between epochs: if the datapoints and codebooks are still moving a lot, then I would do more training.
Does this strategy sound good to you?

thanks a lot for the answer, and for the software in general :)

@exaexa (Collaborator) commented Mar 31, 2021

OK, adding the function callback to the TODO list, I'll hopefully get to it soon.

Regarding the SOM training: there's no good way to detect whether the SOM is "optimal" -- you can have metrics like MQE etc., but these are equally well optimized by plain k-means and by a SOM with a small radius (it's useful to think a bit about the k-means/SOM correspondence).

With SOMs, the beginning of the training with a large radius forces the SOM to become smooth and topologically correct (while ignoring complicated details, such as MQE). During the smooth shift to smaller radii, the learned "global" shape hopefully stays as intact as possible while the optimization gradually works on more and more "raw" quantization error, as k-means does.

The optimal training is, say, an equilibrium of those: if you do too much of the first phase, you'll basically waste the computation and the output's gonna be chunky; if you don't do enough of it, you will get a nice-looking output whose global structure won't make much sense. If you omit the "middle", you will have a good global structure, but medium-size details within clusters will be mixed up. (Cytometry example: it's gonna be able to separate CD4s from CD8s, but T helpers will e.g. be split into 2 clusters because there was no "zoom level" at which this would be optimized.)
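This trade-off is also why an exponential radius schedule is preferable to a linear one: for the same number of epochs, it spends far more of them at small radii, i.e. in the k-means-like fine-tuning phase. A small Python sketch that counts epochs spent below a given radius under each schedule (generic formulas for illustration, not GigaSOM's exact linearRadius/expRadius implementations):

```python
def linear_radius(r_start, r_final, epoch, epochs):
    # straight line from r_start (epoch 1) to r_final (last epoch)
    return r_start + (r_final - r_start) * (epoch - 1) / (epochs - 1)

def exp_radius(r_start, r_final, epoch, epochs):
    # constant multiplicative decay from r_start to r_final
    decay = (r_final / r_start) ** (1 / (epochs - 1))
    return r_start * decay ** (epoch - 1)

def epochs_below(radius_fun, threshold, epochs=100, r_start=10.0, r_final=0.1):
    """Count how many of the epochs run with radius below `threshold`."""
    return sum(1 for e in range(1, epochs + 1)
               if radius_fun(r_start, r_final, e, epochs) < threshold)

# epochs (out of 100) spent with radius < 1, the fine-tuning phase
print(epochs_below(linear_radius, 1.0), epochs_below(exp_radius, 1.0))  # → 9 50
```

Under these assumptions the exponential schedule spends half the run fine-tuning, the linear one less than a tenth.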

For your visualizations:

  • embedding the SOM codes seems to give an interesting result; actually I never thought about it this way (it's basically "embedding the codes on themselves", which doesn't make much sense as a projection, but it's cool nevertheless)
  • you may want to plot a simple 2D projection of the points (like, just the first 2 dimensions) together with a 2D projection of the codes (also unembedded), so that you see how the SOM behaves in the "real space" throughout the learning process

@fransua (Author) commented Mar 31, 2021

I understand your point about the optimization; thank you for taking the time to answer!

... and indeed it is a much better visualization
[animation: anim_fps15_1-100_r10-0.1_umap.gif]

(umap)

thanks a lot!

@exaexa (Collaborator) commented Mar 31, 2021

UMAP doesn't convey much about what's actually happening geometrically :]

I wanted to show something like this (this is on 2D data for clarity but feel free to play with it):
[animation: anim_fps15_1-100_r10-0.1.gif]

The point: you can see what the radius choice causes the SOM to do. Notably, the "precise mapping" of the actual clusters happens all the way at the end, in the last several steps. We implemented the exponential radius decay precisely to force the SOM to spend some time in this phase.

Also, the initialization is not optimal; this would get solved with a few more rounds at a high radius (btw, this is usually an issue only in 2D).

(EDIT: compare with your original animation -- in the beginning, all datapoints are "far away" from the SOM borders, thus displayed on the borders of the embedding.)

Code is like yours, except:

d = randn(30000, 2) .+ rand(0:2, 30000, 2) .* 10

and

    scatter(som.codes[:, 1], som.codes[:, 2], ms = 4, legend = false, alpha = 0.5, color = "grey")
    scatter!(d[:, 1], d[:, 2], ms = 1, legend = false)

@fransua (Author) commented Mar 31, 2021

Nice! I see, thanks a lot!
I will also give this direct representation a try; the problem is that with ~6000 dimensions I am afraid that taking only the first 2 would be misleading.
Anyway, this discussion helped me a lot (and produced many gifs for future "SOM" google image searches :P )

@exaexa (Collaborator) commented Mar 31, 2021

Ah, 6000 dimensions... I suggest squashing it down a bit, either by oldschool PCA, or (usually better) random projections, or (even better) random radial basis functions. What's the data?

@fransua (Author) commented Mar 31, 2021

OK, yes, we were going for the random projection (we did not know about random radial basis functions, thanks!).
The data are protein-protein interaction (PPI) probabilities (110x110) in millions of subsamples of a global PPI network...

@exaexa (Collaborator) commented Mar 31, 2021

okay wow, that sounds cool. Let us know if you get some nice graphics, we're always interested in user stories. :]

@exaexa (Collaborator) commented Apr 1, 2021

Hi, can you try the code from #183? The new callback parameter for trainGigaSOM is called eachEpoch; it gets called after each epoch with parameters (epoch, radius, codes), and also once before the first epoch starts, with epoch = 0.

@laurentheirendt (Member): Leaving this issue open until checked

@fransua (Author) commented Apr 1, 2021

Hi,
thanks a lot!
got an error, I guess here:

https://github.com/LCSB-BioCore/GigaSOM.jl/blob/develop/src/analysis/core.jl#L161

epoch should be j, I guess (or the other way around :P).
Also, as it is, if I want to do the embedding in the eachEpoch function, I currently have to create a new Som object from the codes. I guess that is OK (using the square root of the length of the codes to get xdim/ydim, assuming the grid is square), but it sounds a bit weird. Wouldn't it be easier to directly pass the Som instead of the codes?
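For reference, that square-grid workaround amounts to the following (a Python sketch; `infer_grid` is a hypothetical helper mirroring the trick described above, not part of GigaSOM):

```python
import math

def infer_grid(n_codes):
    """Recover (xdim, ydim) from the number of codebook vectors,
    assuming the SOM grid is square; fails for non-square grids."""
    side = math.isqrt(n_codes)
    if side * side != n_codes:
        raise ValueError("grid is not square; pass dimensions explicitly")
    return side, side

print(infer_grid(400))  # a 20x20 SOM has 400 codes → (20, 20)
```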
thanks!

@exaexa (Collaborator) commented Apr 1, 2021

I see, sorry, my testing pipeline totally failed :D There's already a fix in #184.

Regarding som -- you can just put the codes back into the original Som object (that should be okay). I'm not sure how the variable aliasing will work; I will have a deeper look at it.

@fransua (Author) commented Apr 1, 2021

I saw your Travis uses Julia 1.5; the error is raised only with Julia 1.6 (I don't know why this part of the code is not checked in Julia 1.5 until it is called with a non-`nothing` function).
OK, I will plug the new codes into the existing som, true.

@exaexa (Collaborator) commented Apr 1, 2021

I just rewrote it a bit to actually work with a copy of the SOM (which should have been done from the beginning anyway). Pushing in 5 minutes. :]

@exaexa (Collaborator) commented Apr 1, 2021

(the code with the tests (running now, hopefully working correctly) is here: https://github.com/exaexa/GigaSOM.jl/tree/develop )

@fransua (Author) commented Apr 1, 2021

In Julia 1.6 I got this:
[screenshot of the error]

... sorry, I can't test further; I started with Julia this week and still don't know how to Pkg.add a local repo xD

@exaexa (Collaborator) commented Apr 1, 2021

I'm battling the test framework in parallel; hopefully I'll get it into some shape in a few minutes :]

@fransua (Author) commented Apr 1, 2021

Thanks a lot!
(There is really no rush)

@exaexa (Collaborator) commented Apr 1, 2021

> no rush

I broke it, so now I feel a moral obligation to fix it asap :D :D

@exaexa (Collaborator) commented Apr 1, 2021

Anyway, the branch seems to have no problems anymore (https://github.com/exaexa/GigaSOM.jl/runs/2245496108 passed); we'll try to merge it to develop ASAP and push a version with this.

@fransua (Author) commented Apr 1, 2021

OK, this is working nicely!
It takes about the same time, but uses much less RAM than our hack... and our code is now much cleaner!
thanks a lot!

and again thanks for the tool, this is game changing for us! :)

@exaexa (Collaborator) commented Apr 1, 2021

OK, great to hear! The change will (slowly) bubble through to the official package repo; it should eventually be available in 0.6.5 if you need to depend on it reliably.

Thank you for the idea (and the pictures :D )!

@exaexa exaexa closed this as completed Apr 1, 2021
@fransua (Author) commented Apr 1, 2021

thanks!
pictures are so satisfying... :)

[animation: anim_fps15_1-100_r20-0.1.gif]

@exaexa (Collaborator) commented Apr 2, 2021

Btw, could you share the 3D plotting code? I failed to put it together properly (with the 3-parameter scatter I only got empty plots), and I have some PRIME material for making animations (https://bioinfo.uochb.cas.cz/embedsom/vignettes/bones.html).

@fransua (Author) commented Apr 2, 2021

This is amazing!!

here is the code:

using GigaSOM
using Random
using Plots

function eachEpoch(epoch::Int64, radius::Float64, som_result::Som)
    if epoch % 1 == 0  # plot every epoch; increase the divisor to subsample
        scatter3d(datapoints[:, 1], datapoints[:, 2], datapoints[:, 3], ms=1, label="")
        scatter3d!(som_result.codes[:, 1], som_result.codes[:, 2], som_result.codes[:, 3],
            ms=4, alpha=0.5, color="grey", fg_legend=:transparent, label="",
            legendtitlefontsize=8,
            legendtitle="epoch: " * lpad(epoch, 3, " "))  # annotate! not working in 3D :P
        Plots.frame(anim)
    end
end

Random.seed!(1)
datapoints = randn(30000, 3) .+ rand(0:1, 30000, 3) .* 10
som = initGigaSOM(datapoints, 20, 20, seed=13)  # yes, I tried a couple to get the visually perpendicular start
anim = Plots.Animation()
trainGigaSOM(som, datapoints, eachEpoch=eachEpoch, epochs=100)

gif(anim, "3d-anim_fps15_1-100_r20-0.1.gif", fps = 15)

[animation: 3d-anim_fps15_1-100_r20-0.1.gif]

Note that I am using the new eachEpoch parameter, which has one downside in this case: I cannot use the @animate macro, and this slows the run a bit (~30%).

@exaexa (Collaborator) commented Apr 2, 2021

Great, thanks a lot! Also nice to have it here as a reference

@exaexa (Collaborator) commented Apr 2, 2021

aaaaaaaaaaaaaaaaaaand we finally have a decent illustration of how the whole thing works! :D
[image: sidebyside]

@fransua (Author) commented Apr 2, 2021

very nice!!!
I bet this is going to be in many intro slides!!
(mine for sure xD)

@exaexa exaexa pinned this issue Apr 6, 2021
@exaexa (Collaborator) commented Apr 6, 2021

Pinning & renaming this for better future reference.

@exaexa exaexa changed the title Option to resume training at a given stage Accessing the intermediate SOM states during the training Apr 6, 2021