
Better Train, test, val split for graphs #158

Open
Dsantra92 opened this issue Jul 15, 2022 · 3 comments

@Dsantra92
Collaborator

Problems

  1. Splitting graph data is more complicated than splitting ordinary data: a dataset can be split at the node level, the edge level, or the level of whole graphs. We should be able to support all of these with minimal alterations. Currently, graph splits in MLDatasets.jl follow no specification.
  2. The API for accessing the train, test and val splits is inconsistent across datasets.

Proposed solution

I propose the following API:

  1. Splits should be present in the metadata. We can access splits using splits = dataset.metadata.split
  2. The split is a named tuple with train, test and val entries; it is empty if no predefined split is present.
  3. Each split is a named tuple/struct containing the type of split (one of node, edge or graph) and the value of the split.
  4. A dataset can define two or more splits depending on the feature being split on (there are such examples in OGBDatasets), so it makes sense to make the value of the split a dict or named tuple.

This should ideally handle most cases. We only support static (non-dynamic) graphs for now; I have not worked with dynamic graphs, but most likely their splits can be supported with little to no changes.
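For concreteness, here is a minimal sketch of what such a metadata entry could look like (the field names and values are illustrative only, not an existing MLDatasets.jl layout):

# Hypothetical layout for dataset.metadata.split (illustrative only).
# Each split records its type (:node, :edge or :graph) and its value; the value
# is itself a named tuple so that multiple split features can coexist.
splits = (
    train = (type = :node, value = (default = [1, 2, 3, 5],)),
    val   = (type = :node, value = (default = [4, 6],)),
    test  = (type = :node, value = (default = [7, 8],)),
)

splits.train.type           # :node
splits.train.value.default  # indices (or a boolean mask) of the training nodes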

This recursive structure of splits can be confusing for the end-user. We can support APIs in MLUtils.jl like:

train_data, test_data, val_data = split(dataset)
# If no split is present in the dataset, all data becomes training data (or we error out).

train_data = split(dataset, :train)

# If we want to choose edge split
train_data = split(dataset, :train, :edge)

# Get training split for a feature
train_data = split(dataset, :train, :feature_name)
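
As a rough sketch of how this could be implemented on top of the metadata layout proposed above (not part of MLUtils.jl today; in a package the function would be namespaced to avoid clashing with Base.split):

# Sketch only: assumes the hypothetical dataset.metadata.split layout from above.
function split(dataset, which::Symbol = :train, feature::Symbol = :default)
    splits = dataset.metadata.split
    isempty(splits) && error("dataset defines no predefined splits")
    s = getproperty(splits, which)       # e.g. the :train entry
    idx = getproperty(s.value, feature)  # mask/indices for the chosen feature
    # Graph-level splits can index the dataset directly; node/edge-level splits
    # just return the mask to be applied to the model outputs.
    return s.type === :graph ? dataset[idx] : idx
end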

Keeping this as a discussion for now.

cc: @CarloLucibello

@CarloLucibello
Member

Right now, for graph-level tasks, dataset.metadata contains the train/test masks.

For node- and edge-level tasks we store the masks in the metadata of the single graph. Notice that for node/edge-level tasks we are not really splitting a graph in two; we just have masks that we apply to the model outputs.

The same dataset can define multiple splits; we just have to define entries in the dataset's or graph's metadata, e.g. "edge_train_mask", "node_train_mask".

So it seems to me we basically already have what you ask for; maybe we just have to write down a naming convention for the masks, and also specify the mask types in the docs (I think we have boolean masks for most graph datasets, the alternative being a list of indices).
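For example, with a single-graph node-level dataset, using the mask amounts to plain indexing (a sketch; I'm assuming the field names of the current Cora loader):

using MLDatasets

# Sketch: node-level boolean masks stored in the single graph's node_data
# (layout as in the current Cora loader, if I recall it correctly).
dataset = Cora()
g = dataset[1]                          # the single graph
mask = g.node_data.train_mask           # boolean vector over nodes
X_train = g.node_data.features[:, mask] # features of the training nodes
y_train = g.node_data.targets[mask]     # labels of the training nodes
# During training the same mask is applied to the model outputs; the graph
# itself is never split in two.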

To make graph datasets consistent with non-graph datasets, for graph-level tasks (where the dataset contains multiple graphs) we should allow selection of the split at construction time, as we do with MNIST(split=:train).

@Dsantra92
Collaborator Author

Yes, a lot of this is already in place. We just need to document it and solidify it.

On another note, is MNIST(split=:train) the best way to go? As you mentioned, the train, test and val splits are just masks. So if we do something like MovieLens("100K", split=:train), we will compute the whole dataset and then apply the training mask; loading three different splits would therefore repeat the same computation three times. Instead, split(dataset, mask) makes more sense here.

MNIST(split=:train) will make sense once we store pre-processed splits separately in Parquet files and load them individually.
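To illustrate the concern (both the split keyword on MovieLens and the split function are proposals from this thread, not existing API):

# Hypothetical: three constructions, each redoing the full preprocessing
# before applying its mask.
train_data = MovieLens("100K", split = :train)
test_data  = MovieLens("100K", split = :test)
val_data   = MovieLens("100K", split = :val)

# Preferred: preprocess once, then slice with the stored masks.
dataset = MovieLens("100K")
train_data, test_data, val_data = split(dataset)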

@CarloLucibello
Member

For MNIST the splits are actually separate files, so there is no shared computation, but for other datasets this is not the case and that's not ideal. Storing processed datasets would be the path forward. The thing is we already went through some breaking changes recently and I really want to avoid others in the near future, so let's see in which cases the split=:train API is really a huge inconvenience and eventually mitigate that.
