Differences on the Cora dataset #24

Open
hechtlinger opened this issue Feb 8, 2018 · 3 comments

@hechtlinger

The labels here at keras-gcn do not seem to correspond with the labels of the gcn repository when you load the data. They have the same indices, but not the same values.
Also, if you sum y_train here, there are not 20 labels per class.

Are the two datasets actually different, as it seems?
What is the reason for that, and which one should we use to replicate the paper results?

@tkipf (Owner) commented Feb 8, 2018

Thanks for commenting and apologies for the confusion. Only the ‘gcn’ repository was intended to reproduce the results of the paper and hosts the dataset splits that we used (which were introduced in the Planetoid paper).

This repository (keras-gcn) is not intended to replicate the results of the paper (some subtleties in Keras did not allow me to re-implement the model with the exact same details as in the paper). Also, the dataset loader here does not load the splits from the Planetoid paper, but instead the original version of the Cora dataset; splits are then generated at random.

I will update the description to make this (important) point a bit clearer. Thanks for catching this!

@haczqyf commented Feb 19, 2018

Hi Thomas,

Thanks for providing such a well-written implementation of GCN. I have been studying GCN for a while; I am new to this area and have learned a lot from your paper and code.

Regarding the dataset loader in keras-gcn, I would also like to add a few remarks about points that confused me and that I figured out after some experiments, in case others have the same questions.

I was wondering whether the training, validation, and test sets are always the same subsets of the whole Cora dataset, i.e., whether the training set differs between runs of the code. You mentioned that the splits here are generated at random. In fact, the training set always takes the first 140 samples of Cora, which means it is fixed across runs. I therefore think it would be better to clarify the phrase 'splits are then generated at random' to avoid ambiguity. If I have misunderstood this, please correct me.

Another remark is about a detail in the function 'encode_onehot' in utils.py. Because 'classes' is a Python set, its iteration order can differ between interpreter runs (for example, due to string hash randomization), so 'enumerate' may assign different one-hot vectors to the classes each time the code is run. To make this clearer, I have provided an example below.

# First time
classes = {'Case_Based',
 'Genetic_Algorithms',
 'Neural_Networks',
 'Probabilistic_Methods',
 'Reinforcement_Learning',
 'Rule_Learning',
 'Theory'}

for i, c in enumerate(classes):
    print(i)
    print(c)

# Output
0
Genetic_Algorithms
1
Theory
2
Probabilistic_Methods
3
Reinforcement_Learning
4
Case_Based
5
Neural_Networks
6
Rule_Learning

# Second time
classes = {'Case_Based',
 'Genetic_Algorithms',
 'Neural_Networks',
 'Probabilistic_Methods',
 'Reinforcement_Learning',
 'Rule_Learning',
 'Theory'}
for i, c in enumerate(classes):
    print(i)
    print(c)

# Output
0
Probabilistic_Methods
1
Rule_Learning
2
Neural_Networks
3
Genetic_Algorithms
4
Theory
5
Case_Based
6
Reinforcement_Learning

In this case, if we want to calculate the frequency distribution of each class in the training set, this can cause inconvenience. One potential solution I found for fixing the one-hot vectors of the classes is to slightly modify one line of code in the 'encode_onehot' function:

import numpy as np

def encode_onehot(labels):
    # classes = set(labels)              # original: nondeterministic iteration order
    classes = sorted(list(set(labels)))  # sorted: fixed, alphabetical class order
    classes_dict = {c: np.identity(len(classes))[i, :] for i, c in enumerate(classes)}
    labels_onehot = np.array(list(map(classes_dict.get, labels)), dtype=np.int32)
    return labels_onehot
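
With the sorted class order, the per-class counts in the training set can then be read off directly. A minimal sketch of my own (not code from the repository), assuming 'labels' is the raw label array read by the loader and the training set is the first 140 rows:

import numpy as np

labels_onehot = encode_onehot(labels)   # full one-hot label matrix
classes = sorted(set(labels))           # same alphabetical order as inside encode_onehot
y_train = labels_onehot[:140]           # the fixed 140 training rows

for c, n in zip(classes, y_train.sum(axis=0)):
    print(c, int(n))                    # number of training samples per class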

Finally, I am also thinking about how we could randomly split the dataset into training, validation, and test sets, e.g., how to randomly draw 140 samples from the whole dataset as the training set each time the code is run. The current 'get_splits' function splits the dataset as defined below; the training set always takes the first 140 rows of the dataset.

idx_train = range(140)
idx_val = range(200, 500)
idx_test = range(500, 1500)

I feel that the frequency distribution of the classes in the training set might affect the prediction accuracy, so it would make more sense to split the training, validation, and test sets randomly from the whole dataset when running the code several times and averaging the accuracy; a sketch of such a random split is given below.
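
A minimal sketch of such a random split (my own illustration, not code from the repository), reusing the original split sizes and assuming 'y' is the full one-hot label matrix:

import numpy as np

n = y.shape[0]                    # total number of nodes (2708 for Cora)
rng = np.random.RandomState(42)   # fix a seed if reproducible splits are wanted
perm = rng.permutation(n)

idx_train = perm[:140]            # 140 training samples, drawn at random
idx_val = perm[140:440]           # 300 validation samples (same size as range(200, 500))
idx_test = perm[440:1440]         # 1000 test samples (same size as range(500, 1500))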

Best,
Yifan

@tkipf (Owner) commented Feb 26, 2018

Hi Yifan, thanks for looking into this. I agree with all of your points. As mentioned previously, this data loader is only meant as a 'quick and dirty' example to show how data can be loaded into the model. For the reproducible dataset splits used in our paper, please have a look at https://github.com/tkipf/gcn

I have updated the project readme with a big warning to hopefully avoid confusion about this in the future.
