
Calculation of LvR and correlation coefficients of ground state #14

Open · neworderofjamie opened this issue Mar 25, 2020 · 26 comments

@neworderofjamie

Thanks to your awesome help with #12, I've been comparing some of our simulation runs with the 33fb5955558ba8bb15a3fdce49dfd914682ef3ea dataset and am having some issues, especially with the populations with low firing rates. To try to narrow down where this is coming from, I tried re-using the analysis code from the repository on the FEF spike data from the 33fb5955558ba8bb15a3fdce49dfd914682ef3ea dataset as follows:

from correlation_toolbox import helper as ch
import numpy as np

# **NOTE** this is heavily based on the analysis code from the paper
def load(filename, duration_s):
    tmin = 500.       # discard initial transient [ms]
    subsample = 2000  # number of neurons to subsample
    resolution = 1.   # histogram bin width [ms]

    spikes = np.load(filename)
    ids = np.unique(spikes[:, 0])

    # Sort spikes into per-neuron spike trains for the first
    # subsample + 1000 neuron IDs
    dat = ch.sort_gdf_by_id(spikes, idmin=ids[0], idmax=ids[0]+subsample+1000)

    # Bin each spike train into 1 ms single-cell spike histograms
    bins, hist = ch.instantaneous_spike_count(dat[1], resolution, tmin=tmin, tmax=duration_s * 1000.0)

    # Drop silent neurons, then take the first `subsample` remaining trains
    rates = ch.strip_binned_spiketrains(hist)[:subsample]

    # Mean of the off-diagonal pairwise correlation coefficients,
    # with NaNs (from zero-variance trains) set to zero
    cc = np.corrcoef(rates)
    cc = np.extract(1 - np.eye(cc[0].size), cc)
    cc[np.where(np.isnan(cc))] = 0.
    return np.mean(cc)


# Values from the json file on gnode
dataset_fef_6e_corr = 0.0004061593782540619
dataset_fef_5i_corr = 0.0020706629333541817

duration_s = 10.5

# Population sizes
num_fef_5i = 3721
num_fef_6e = 16128

# Load data
nest_fef_5i_corr = load("33fb5955558ba8bb15a3fdce49dfd914682ef3ea-spikes-FEF-5I.npy", duration_s)
nest_fef_6e_corr = load("33fb5955558ba8bb15a3fdce49dfd914682ef3ea-spikes-FEF-6E.npy", duration_s)

print("FEF 5I corr coeff - NEST:%f, Dataset:%f" % (nest_fef_5i_corr, dataset_fef_5i_corr))
print("FEF 6E corr coeff - NEST:%f, Dataset:%f" % (nest_fef_6e_corr, dataset_fef_6e_corr))

The output shows fairly significant differences from the values in the json file in the same directory:

FEF 5I corr coeff - NEST:0.001910, Dataset:0.002071
FEF 6E corr coeff - NEST:0.000156, Dataset:0.000406

Am I doing something dumb or missing something?

Thanks in advance for your help!

@albada (Collaborator) commented Mar 26, 2020

Could it be the condition that neurons must have spiked at least once where this goes wrong, so that you are still including some silent neurons in the calculation?

@neworderofjamie (Author)

Well spotted! rates ends up with fewer than 2000 neurons that spike after calling strip_binned_spiketrains. However, I don't quite understand how this ever worked: the paper says that the ground-state simulations were run for 10.5 seconds, and the 10.5-second spike trains in the repository don't have enough data.
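
For illustration, a minimal sketch (with a hypothetical number of surviving trains; hist stands in for the stripped histogram matrix) of why the shortfall is silent:

import numpy as np

# Hypothetical: only 1800 non-silent spike trains survive
# strip_binned_spiketrains, fewer than the requested 2000
hist = np.random.poisson(0.01, size=(1800, 10000))
subsample = 2000

# Slicing past the end of an array does not raise an error; it simply
# returns all 1800 rows, so the "subsample of 2000" silently shrinks
rates = hist[:subsample]
print(rates.shape)  # (1800, 10000)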

@neworderofjamie (Author)

So it's working, just not leaving 2000 neurons.

@albada (Collaborator) commented Mar 26, 2020

Yes, that sounds right. I agree that the relevant wording in the paper is ambiguous at best...

@neworderofjamie (Author)

So were the stats calculated using a longer simulation?

@albada (Collaborator) commented Mar 26, 2020

No, I don't think so. The paper states that for chi=1, the simulation duration was 10.5 s. I think the calculation started with 2000 neurons, from which the ones that spiked at least once were taken, leaving fewer than 2000 neurons in some cases.

@neworderofjamie (Author)

Ah, OK - so how come our recalculation of the stats, using the same code and (presumably) the same data, doesn't match? I don't see any non-determinism in that code.

@mschmidt87 (Collaborator)

I think that @albada is right in saying that for some populations, there were probably fewer than 2000 neurons entering the calculation, simply because these populations are small in the first place and then have low rates.

For what it's worth, I wouldn't call these deviations significant. Keep in mind that these values are very low, 10^-3 to 10^-4, on a scale of 0 to 1. If your simulations do not produce exactly the same spikes, these deviations can easily occur, but they're not significant, in my opinion.

I think that unless you're using the exact same configuration for your compute system (MPI processes, threads) and the same NEST version (plus some other dependencies that influence the random numbers), it's unlikely that you can reproduce the same spikes.

@neworderofjamie (Author)

Hey @mschmidt87 - thanks for looking at this. My concern is that, as a test, we're calculating these metrics from the published spiking data using the published code, and we don't get the published correlation coefficients.

@mschmidt87 (Collaborator)

Okay, sorry, I missed that point; I thought you had run your own simulations. Let me take a look at it tomorrow, and I'll get back to you.

@neworderofjamie (Author)

Thank you so much!

@mschmidt87 (Collaborator) commented Mar 27, 2020

I can reproduce your finding for FEF, 6E: I am getting the same value as you for the cross-correlation coefficient. I can also see a deviation for the LvR value (0.2683 from recalculating vs. 0.4178 in the json file). However, I can reproduce the population rates from the json files, which makes me conclude that the spike data is the correct one and that I didn't use other data for the calculation of the json files.

I couldn't find the problem in the analysis code (I tried changing the part where we subsample, etc., but with no success). Since I produced the json files with the code that is in the repo and the file hasn't been modified, I suspect that this might be a version problem with the dependencies.

I am now using Python 3.8 and numpy 1.18.1, as well as the latest master of correlation_toolbox.
Unfortunately, I didn't record the exact dependencies at the time I produced the data in the data repo, so I can't investigate this easily.

@neworderofjamie (Author)

Glad to hear you can reproduce it. Wouldn't it be possible to end up with the same mean rates but different correlation and irregularity (in the extreme case, you could generate spike trains from populations of Poisson sources with the same mean rates)?

Nonetheless, I can try to investigate older versions today. correlation_toolbox doesn't look to have changed a huge amount, at least. The data was pushed on 8/1/2018 so, assuming you created it around then, I can try to bisect numpy versions.
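
As a minimal sketch of the point above (hypothetical parameters: 100 neurons, 10 s in 1 ms bins, ~5 Hz), two independent Poisson realisations with matched mean rates still give different correlation estimates:

import numpy as np

rng = np.random.default_rng(1234)

num_neurons, num_bins, rate_hz = 100, 10000, 5.0
p_spike = rate_hz / 1000.0  # expected spikes per 1 ms bin

counts_a = rng.poisson(p_spike, size=(num_neurons, num_bins))
counts_b = rng.poisson(p_spike, size=(num_neurons, num_bins))

def mean_offdiag_corr(counts):
    # Mean of the off-diagonal pairwise correlation coefficients,
    # mirroring the np.extract(1 - np.eye(...), cc) step above
    cc = np.corrcoef(counts)
    return np.nanmean(np.extract(1 - np.eye(cc.shape[0]), cc))

# Mean rates agree closely, but the two correlation estimates differ
print(counts_a.mean() * 1000.0, counts_b.mean() * 1000.0)  # ~5.0 Hz each
print(mean_offdiag_corr(counts_a), mean_offdiag_corr(counts_b))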

@neworderofjamie (Author) commented Mar 27, 2020

I've tried using numpy 1.13.3 and 1.11.3 (which seem like good guesses for era-appropriate bleeding edge and less bleeding edge) on both Python 3 and Python 2, and using the older version of correlation_toolbox.sort_gdf_by_id (https://github.com/INM-6/correlation-toolbox/blob/f91d9f2fff2a1b27c1ace4cd771464e57c109648/correlation_toolbox/helper.py#L44-L79), as this seems to be the only change with much chance of affecting this. The results remain unchanged, so I'm a little stumped.

@mschmidt87 (Collaborator)

Thanks for your tests. Of course, it is possible to produce the exact same rates with different 2nd-order statistics, but to achieve that with two different runs of the same simulation (with different RNG seeds), which is what I would have suspected, is extremely unlikely, i.e. it can be excluded.

I'll think about it a little more.

@neworderofjamie (Author)

That is very true

@neworderofjamie (Author)

Hey @mschmidt87, any thoughts?

@neworderofjamie (Author)

I've done a little bit of code archaeology and found a change in the LvR calculation. If I calculate the LvR with and without this change:

from correlation_toolbox import helper as ch
import numpy as np

# **NOTE** this is heavily based on the analysis code from the paper
def load(filename, duration_s, num, check):
    tmin = 500.  # discard initial transient [ms]
    tref = 2.0   # refractory period used by the LvR measure [ms]

    spikes = np.load(filename)

    # Restrict to spikes within [tmin, tmax) (assumes spikes sorted by time)
    i_min = np.searchsorted(spikes[:, 1], tmin)
    i_max = np.searchsorted(spikes[:, 1], duration_s * 1000.0)
    data_array = spikes[i_min:i_max]

    # Revised local variation (LvR) per neuron, computed from the
    # inter-spike intervals of each neuron that spiked in the window
    LvR = np.array([])
    for i in np.unique(data_array[:, 0]):
        intervals = np.diff(data_array[np.where(data_array[:, 0] == i)[0], 1])
        if intervals.size > 1:
            val = np.sum((1. - 4 * intervals[0:-1] * intervals[1:] / (intervals[0:-1] + intervals[
                         1:]) ** 2) * (1 + 4 * tref / (intervals[0:-1] + intervals[1:])))
            LvR = np.append(LvR, val * 3 / (intervals.size - 1.))
        else:
            LvR = np.append(LvR, 0.0)
    # CHANGE HERE: pad the result with zeros for the neurons that never
    # spiked, up to the population size num
    if check and len(LvR) < num:
        LvR = np.append(LvR, np.zeros(num - len(LvR)))
    return np.mean(LvR)


# Values from the json file on gnode
dataset_fef_6e_lvr = 0.4178813296671444
dataset_fef_5i_lvr = 0.9737456740769203

duration_s = 10.5

# Population sizes
num_fef_5i = 3721
num_fef_6e = 16128

# Load data
nest_fef_5i_lvr = load("33fb5955558ba8bb15a3fdce49dfd914682ef3ea-spikes-FEF-5I.npy", duration_s, num_fef_5i, False)
nest_fef_6e_lvr = load("33fb5955558ba8bb15a3fdce49dfd914682ef3ea-spikes-FEF-6E.npy", duration_s, num_fef_6e, False)

print("FEF 5I LvR - NEST:%f, Dataset:%f" % (nest_fef_5i_lvr, dataset_fef_5i_lvr))
print("FEF 6E LvR - NEST:%f, Dataset:%f" % (nest_fef_6e_lvr, dataset_fef_6e_lvr))

With check=True:

FEF 5I LvR - NEST:0.973484, Dataset:0.973746
FEF 6E LvR - NEST:0.268638, Dataset:0.417881

With check=False:

FEF 5I LvR - NEST:0.973746, Dataset:0.973746
FEF 6E LvR - NEST:0.473715, Dataset:0.417881

The check=False results are significantly closer to the values in the published JSON. As this change was made after the original submission date, might the published LvR data have been calculated prior to it?

@mschmidt87 (Collaborator) commented Apr 22, 2020

I think you are probably right that this padding of the results with 0.0 values for silent neurons was not used for the publication and the published data.

The manuscript says in the methods:

"Spike-train irregularity is quantified for each population by the revised local variation LvR [82] averaged over a subsample of 2000 neurons. The cross-correlation coefficient is computed with bin width 1 ms on single-cell spike histograms of a subsample of 2000 neurons per population with at least one emitted spike per neuron. Both measures are computed on the entire population if it contains fewer than 2000 neurons."

This does not mention the padding, and your calculations indicate that at least the new results are closer to the published ones with check=False (in your code). Unfortunately, I can't recall why I added the padding at some point.

I guess it's a question of definition how silent neurons should be accounted for in the calculations, but I would say now that assigning them a value of 0 does not make much sense.
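
As a toy illustration of the effect (hypothetical LvR values and population size):

import numpy as np

# Hypothetical per-neuron LvR values for the neurons that actually spiked
lvr = np.array([0.9, 1.0, 1.1, 0.8])
num = 8  # hypothetical population size, including silent neurons

# Without padding: average over spiking neurons only
print(np.mean(lvr))  # 0.95

# With padding: silent neurons enter as 0.0 and drag the mean down
padded = np.append(lvr, np.zeros(num - len(lvr)))
print(np.mean(padded))  # 0.475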

I think that @akorgor applied your code to more populations and can confirm your findings.

@neworderofjamie (Author) commented Apr 22, 2020

Thanks for looking into this some more, @mschmidt87! It's still weird that the value for FEF 5I fits exactly but the one for FEF 6E doesn't, though - again, that seems unlikely to be caused by two different runs of the same simulation.

@neworderofjamie (Author)

@akorgor has kindly shared the full set of spike trains with me, so I thought I'd share this (very bad) plot of the distribution of LvR metrics for 6E:

[Figure: two-panel plot of the distribution of LvR values for FEF 6E]

@akorgor (Collaborator) commented Apr 26, 2020

I found that if, in the code snippet above on the LvR calculation, check is chosen to be True and num is chosen to be len(np.unique(spikes[:, 0])), the results from the code and from gnode match up to a difference of ~1e-15 to 1e-16. Thus, for computing the average LvR value, only the neurons that spiked at least once during the simulation contribute a 0; completely silent neurons are excluded, contrary to what the definition of num in the code snippet (the full population size) suggests. I could verify this for the areas FEF, V1, and V2 for the populations in all layers in the chi=1.0, 10.5 s simulation.
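
In terms of the code snippet above, that corresponds to something like this (a sketch reusing the load function and one of the dataset files from earlier in the thread):

import numpy as np

spikes = np.load("33fb5955558ba8bb15a3fdce49dfd914682ef3ea-spikes-FEF-6E.npy")

# Pad with zeros only up to the number of neurons that spiked at least
# once anywhere in the recording, not up to the full population size
num_ever_spiked = len(np.unique(spikes[:, 0]))
lvr = load("33fb5955558ba8bb15a3fdce49dfd914682ef3ea-spikes-FEF-6E.npy",
           duration_s=10.5, num=num_ever_spiked, check=True)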

@akorgor (Collaborator) commented Apr 29, 2020

For the correlation coefficient calculation, however, I have not yet been able to find the reason for the deviations. The gnode values are larger than the recalculated ones for the majority of the populations (15/24). The differences range from -0.00025 to +0.00025 and do not seem to depend on the total number of neurons in the population or on the number of spiking neurons. Except for FEF 6E, there are always more than subsample=2000 neurons left after ch.strip_binned_spiketrains(hist).
Since I found that this calculation of the correlation coefficients is in general sensitive to subsampling in terms of neurons and time, I tried setting tmin=0, tmax=max(np.unique(spikes[:, 1])), and subsample=3000, but none of these three variations yielded the gnode values. @albada @mschmidt87, I would be happy to run further tests if you have ideas where the deviations could come from.
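
To illustrate that sensitivity in isolation, a sketch on surrogate data (hypothetical Gaussian trains with a weak common signal, not the actual spike data):

import numpy as np

rng = np.random.default_rng(42)

# Independent noise plus a weak shared signal gives small positive
# pairwise correlations between the surrogate trains
num_neurons, num_bins = 3000, 5000
shared = rng.normal(size=num_bins)
trains = rng.normal(size=(num_neurons, num_bins)) + 0.05 * shared

def mean_offdiag_corr(x):
    cc = np.corrcoef(x)
    return np.nanmean(np.extract(1 - np.eye(cc.shape[0]), cc))

# The estimate shifts with the choice of subsample, even though the
# underlying population statistics are identical
print(mean_offdiag_corr(trains[:2000]))
print(mean_offdiag_corr(trains[:3000]))
print(mean_offdiag_corr(trains[1000:3000]))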

@mschmidt87 (Collaborator)

I found that if, in the code snippet above on the LvR calculation, check is chosen to be True and num is chosen to be len(np.unique(spikes[:, 0])), the results from the code and from gnode match up to a difference of ~1e-15 to 1e-16. Thus, for computing the average LvR value, only the neurons that spiked at least once during the simulation contribute a 0; completely silent neurons are excluded, contrary to what the definition of num in the code snippet (the full population size) suggests. I could verify this for the areas FEF, V1, and V2 for the populations in all layers in the chi=1.0, 10.5 s simulation.

Thank you, @akorgor, for that detective work (and obviously sorry for the misleading code and the confusion it created). I think these choices are definitely debatable; however, since the deviations from the data shown in the paper are not too large, I would suggest just editing the code in the repo so that the results match, and sticking to the algorithm used for the paper.

@mschmidt87 (Collaborator)

@akorgor has kindly shared the full set of spike trains with me, so I thought I'd share this (very bad) plot of the distribution of LvR metrics for 6E:

What does the lower panel show? I think it can't be the values from the published paper, because the values for 6E are mostly well below 1 (Fig. 3F).

@neworderofjamie (Author) commented May 5, 2020

@akorgor has kindly shared the full set of spike trains with me, so I thought I'd share this (very bad) plot of the distribution of LvR metrics for 6E:

What does the lower panel show? I think it can't be the values from the published paper, because the values for 6E are mostly well below 1 (Fig. 3F).

The lower panel shows the values from the json file on gnode (from the simulation where chi=1.9).
