
Stability dataset clustering data loss #10

Open
franzigeiger opened this issue Apr 27, 2022 · 3 comments

@franzigeiger

Hi FLIP authors,
I have been working with the data split routine you applied to the Meltome Atlas data and found some irregularities. You create the train and test splits based on clusters from MMseqs2, but the notebook routine seems off (in collect_flip/2_meltome_atlas.ipynb).
When creating the mixed dataset from the clusters, you remove a cluster center datapoint from the set as soon as it is encountered in the full protein list, which I think makes the output datasets incorrect:

Cell 30, last 20 LOC:

    if key in train:  # <-- current datapoint is a cluster center
        clustered_set.append({
            'sequence': protein.get('sequence'),
            'target': protein.get('meltingPoint'),
            'set': 'train'
        })
        train.remove(key)  # <-- HERE
    elif key in test:  # <-- current datapoint is a cluster center
        clustered_set.append({
            'sequence': protein.get('sequence'),
            'target': protein.get('meltingPoint'),
            'set': 'test'
        })

        mixed_set.append({
            'sequence': protein.get('sequence'),
            'target': protein.get('meltingPoint'),
            'set': 'test'
        })
        test.remove(key)  # <-- HERE

While removing the key is fine for the test set (only the cluster center points are used there anyway), for the training set it holds out every sequence of that cluster that the loop processes after the cluster center.
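
To make the failure mode concrete, here is a toy illustration (hypothetical keys, not the notebook's actual data), assuming `train` holds cluster-center keys and each datapoint is matched via its cluster's center:

    # Toy cluster 'repA' with three members; the center is processed first
    train = ['repA']  # cluster 'repA' was assigned to train
    cluster_of = {'repA': 'repA', 'm2': 'repA', 'm3': 'repA'}  # member -> center

    kept = []
    for member, center in cluster_of.items():
        if center in train:
            kept.append(member)
            train.remove(center)  # the problematic removal

    print(kept)  # ['repA'] -- 'm2' and 'm3' are silently dropped;
                 # without the remove() call, all three members land in train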
Upon fixing this I get a training set of 67361 datapoints + 3134 test datapoints (compared to the 24817 training datapoints reported in the paper).

Am I misunderstanding something here? 67361 is also roughly 80% of the full clustered dataset (84030 entries), which makes more sense given the setup: the mixed set should end up with 80% of all data in train plus only the cluster centers in test, and the latter are obviously far fewer than 20% of all data.
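
As a quick sanity check on those numbers:

    # Sanity check on the split sizes quoted above
    train_size, test_size, total = 67361, 3134, 84030
    print(train_size / total)  # ~0.80 -> consistent with 80% of all clustered data in train
    print(test_size / total)   # ~0.04 -> far below 20%, as expected when test keeps
                               #          only the cluster centers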

I haven't checked whether the same error occurs in the other datasets, but I would recommend doing so.

@franzigeiger (Author)

I rewrote the algorithm as follows, which is where my numbers come from:

    import math

    import numpy as np
    from tqdm import tqdm

    # `proteins`, `sequence_clusters`, `train`, and `test` come from earlier cells
    filtered_proteins = dict()
    for protein in proteins:
        # Keep only entries with a sequence, a valid melting point, and a run name
        if protein.get('sequence') \
                and protein.get('meltingPoint') \
                and not math.isnan(protein.get('meltingPoint')) \
                and protein.get('runName'):
            key = f"{protein.get('uniprotAccession')}_{'_'.join(protein.get('runName').split(' '))}"
            filtered_proteins[key] = {
                'sequence': protein.get('sequence'),
                'target': protein.get('meltingPoint'),
            }

    mixed_set = list()
    for key, content in tqdm(filtered_proteins.items()):
        hits = sequence_clusters.loc[key].values
        cluster_rep_key = hits[0]
        if isinstance(cluster_rep_key, np.ndarray):
            # For some reason two entries end up being arrays
            cluster_rep_key = cluster_rep_key[0]
        if cluster_rep_key in train:
            # Train cluster: keep every member
            mixed_set.append({
                **content,
                'set': 'train'
            })
        elif cluster_rep_key in test:
            # Test cluster: keep only the cluster representative itself
            if key == cluster_rep_key:
                mixed_set.append({
                    **content,
                    'set': 'test'
                })

@sacdallago (Collaborator)

Hi @franzigeiger, thanks for bringing this up! It took me some time to get into the code again, and indeed it could be that instead of grabbing all sequences from a cluster selected for training, we only pick the cluster representatives. In other words, it's cluster representatives for train vs. cluster representatives for test, while in the paper we discussed whole clusters for train vs. representatives for test.
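
To spell out the difference, here is a minimal, self-contained sketch with made-up keys (the `cluster_rep` mapping and member keys are hypothetical; `train`/`test` hold representative keys):

    # Made-up clusters: 'a1' represents {'a1', 'a2'}, 'b1' represents {'b1', 'b2'}
    cluster_rep = {'a1': 'a1', 'a2': 'a1', 'b1': 'b1', 'b2': 'b1'}
    train, test = {'a1'}, {'b1'}

    def paper_rule(key):
        """Intended: whole clusters go to train; only representatives go to test."""
        rep = cluster_rep[key]
        if rep in train:
            return 'train'
        if rep in test and key == rep:
            return 'test'
        return None  # non-representative members of test clusters are held out

    def representatives_only(key):
        """What the notebook effectively does, per the discussion above."""
        rep = cluster_rep[key]
        if key != rep:
            return None
        return 'train' if rep in train else 'test' if rep in test else None

    print([paper_rule(k) for k in cluster_rep])            # ['train', 'train', 'test', None]
    print([representatives_only(k) for k in cluster_rep])  # ['train', None, 'test', None]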

Before moving on to changing the data, it might be useful to run the baselines again and see whether we need to update our interpretations and results. I don't expect this change to affect performance significantly, but I may be wrong :)

@sacdallago (Collaborator)

P.S.: feel free to open a PR for the proposed changes!
