AttributeError: 'list' object has no attribute 'local_scope' #7292

Closed
tyccc22 opened this issue Apr 10, 2024 · 6 comments

tyccc22 commented Apr 10, 2024

🐛 Bug

When I run dgl\examples\pytorch\graphsage\dist\train_dist.py on GPUs as described in README.md, it works fine, but when I change the network layers of the model, the following problem occurs:

AttributeError: 'list' object has no attribute 'local_scope'

To Reproduce

Steps to reproduce the behavior:

  1. The model trains fine when running the following command. The code in the workspace is copied from dgl\examples\pytorch\graphsage\dist\.
/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 1 --backend nccl"
  2. Change the network layers in dgl\examples\pytorch\graphsage\dist\train_dist.py as follows:
# GAT
class DistGAT(nn.Module):
    def __init__(
        self, in_feats, n_hidden, n_classes, heads
        # n_layers, activation, dropout
    ):
        super().__init__()
        self.gat_layers = nn.ModuleList()
        # two-layer GAT
        self.gat_layers.append(
            dglnn.GATConv(
                in_feats,
                n_hidden,
                heads[0],
                feat_drop=0.6,
                attn_drop=0.6,
                activation=F.elu,
            )
        )
        self.gat_layers.append(
            dglnn.GATConv(
                in_feats * heads[0],
                n_classes,
                heads[1],
                feat_drop=0.6,
                attn_drop=0.6,
                activation=None,
            )
        )

    def forward(self, g, inputs):
        h = inputs
        for i, layer in enumerate(self.gat_layers):
            h = layer(g, h)
            if i == 1:  # last layer
                h = h.mean(1)
            else:  # other layer(s)
                h = h.flatten(1)
        return h

def run(args, device, data):
    ...
    # Define model and optimizer
    model = DistGAT(
        in_feats,
        args.num_hidden,
        n_classes,
        heads=[8, 1]
    )
        # args.num_layers,
        # F.relu,
        # args.dropout,
    # )
    ...

Execute:

/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 gat-2-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"

The cluster starts as expected, and then the following error occurs:

Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...
(the same traceback is printed again by another trainer process)
  3. GCN is probably more similar to SAGE; making the same kind of change produces the same error.
# GCN
class DistGCN(nn.Module):
    def __init__(self, in_size, hid_size, out_size):
        super().__init__()
        self.layers = nn.ModuleList()
        # two-layer GCN
        self.layers.append(
            dglnn.GraphConv(in_size, hid_size, activation=F.relu)
        )
        self.layers.append(dglnn.GraphConv(hid_size, out_size))
        self.dropout = nn.Dropout(0.5)

    def forward(self, g, features):
        h = features
        for i, layer in enumerate(self.layers):
            if i != 0:
                h = self.dropout(h)
            h = layer(g, h)
        return h

def run(args, device, data):
    ...
    # Define model and optimizer
    # model = GCN(
    #     in_feats,
    #     args.num_hidden,
    #     n_classes,
    #     args.num_layers,
    #     F.relu,
    #     args.dropout,
    # )
    model = DistGCN(in_feats, 16, n_classes).to(device)
    ...

Execute:

/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 gcn-dist-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"

The output is:

Traceback (most recent call last):
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
    main(args)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
    run(args, device, data)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
    batch_pred = model(blocks, batch_inputs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
    h = layer(g, h)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
    with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...
(the same traceback is printed again by another trainer process)
Client[0] in group[0] is exiting...

Expected behavior

Distributed training can be applied to other models, e.g. GAT, GCN, and GIN.

Environment

  • DGL Version (e.g., 1.0): DGL 2.1.0
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1.0
  • OS (e.g., Linux): Ubuntu 20.04
  • How you installed DGL (conda, pip, source): conda
  • Build command you used (if compiling from source):
  • Python version: Python 3.9.18
  • CUDA/cuDNN version (if applicable): cuda_12.1.0_530.30.02_linux
  • GPU models and configuration (e.g. V100): one machine has a GeForce RTX 2060 SUPER and the other a GeForce GTX 1660 SUPER
  • Any other relevant information: I train the above models on a local cluster of two computers (each with a different GPU) and have not migrated to the cloud yet.

Additional context

After reviewing the documentation on docs.dgl.ai, I am still unclear on how to resolve the following error:

AttributeError: 'list' object has no attribute 'local_scope'

The code in dgl/examples/pytorch/graphsage/dist is quite enlightening, and I am interested in extending it to additional models. Any guidance you could offer would be greatly appreciated.

The training command above has a few more parameters and full paths than the command in README.md because of the following problems:

  1. Probably because I installed DGL in a conda virtual environment, if I don't use the full path to python3, I get
ModuleNotFoundError: No module named 'numpy'

or

ModuleNotFoundError: No module named 'dgl'
  2. If I use the default --backend value, gloo, it fails with
 [E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with [/opt/conda/conda-bld/pytorch_1695392035629/work/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error

I have no idea how to solve this.

Once again, thank you for your exceptional work!

@BarclayII
Collaborator

@Rhett-Ying do we have DistDGL examples?

@Rhett-Ying
Collaborator

Please refer to the non-distributed versions of the GAT/GCN models, such as https://github.com/dmlc/dgl/tree/master/examples/pytorch/gat, to make sure your model code is runnable. The model code should be the same in DistDGL and in the non-distributed setting.

A better option for running various models with distributed training/inference is GraphStorm, which offers high-level APIs.
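
For what it's worth, the traceback above comes from passing the whole blocks list into a single conv layer, which then calls graph.local_scope() on a Python list. Below is a minimal sketch of a blocks-aware forward, modelled on the mini-batch GAT examples linked above; it targets the DistGAT snippet from this issue and is an illustration, not the official example code.

# Hypothetical replacement for DistGAT.forward; `blocks` comes from the
# sampler in train_dist.py as a list of message-flow graphs, one per layer.
def forward(self, blocks, inputs):
    h = inputs
    for i, (layer, block) in enumerate(zip(self.gat_layers, blocks)):
        # Each GATConv receives its own block, not the whole list.
        h = layer(block, h)
        if i == len(self.gat_layers) - 1:  # last layer: average the heads
            h = h.mean(1)
        else:  # hidden layers: concatenate the heads
            h = h.flatten(1)
    return h

Note also that the second GATConv in the snippet would normally take n_hidden * heads[0] input features (the first layer's concatenated output), not in_feats * heads[0].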

@tyccc22
Author

tyccc22 commented Apr 15, 2024

Thanks for your advice. Since the "Gloo connectFullMesh failed with..." error is still unresolved, I am trying to train some models from https://github.com/dmlc/dgl/tree/master/examples/pytorch/ on two machines.

Also, I would like to ask about dataset partitioning. When partitioning the dataset with https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/dist/partition_graph.py, the memory required is several times the size of the dataset. Are there any corresponding optimisations for memory, or are other tools provided?

@Rhett-Ying
Collaborator

Are there any corresponding optimisations for memory, or are other tools provided?

Unfortunately, there is not much optimization available for the partition stage. dgl.distributed.partition_graph() is the most convenient API available for now. But we also support partitioning the graph with a distributed pipeline if you have multiple machines with small CPU RAM; please refer to here for more details. This partition pipeline requires some additional preprocessing.
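
As a rough illustration of that single-machine path, here is a minimal sketch assuming the ogbn-products loader from the ogb package; the graph name and output directory simply mirror the data2-ogb-product/ogb-product.json layout used in this issue.

import dgl
from ogb.nodeproppred import DglNodePropPredDataset

# Load the OGB products graph; loading and METIS partitioning both happen in
# CPU RAM, which is why peak memory is several times the raw dataset size.
data = DglNodePropPredDataset(name="ogbn-products")
g, _ = data[0]

dgl.distributed.partition_graph(
    g,
    graph_name="ogb-product",
    num_parts=2,            # one partition per machine
    out_path="data2-ogb-product",
    num_hops=1,             # keep 1-hop halo nodes in each partition
)

The example's partition_graph.py also stores labels and train/val/test masks in g.ndata before calling partition_graph(), so that they are split along with the graph.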

Rhett-Ying self-assigned this Apr 15, 2024

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

@frozenbugs
Collaborator

Hi, I am closing this issue assuming you are happy with our response. Feel free to follow up and reopen the issue if you have more questions regarding our response.
