A possible solution to OOM in Metattack. #128

Open
Leirunlin opened this issue Feb 5, 2023 · 11 comments

Comments
@Leirunlin

Hi!
As mentioned in issues #90 and #127, OOM occurs when running Metattack with newer versions of PyTorch.
I checked the source code in metattack.py and found that the function get_adj_score() seems to be the cause.

def get_adj_score(self, adj_grad, modified_adj, ori_adj, ll_constraint, ll_cutoff):
    adj_meta_grad = adj_grad * (-2 * modified_adj + 1)
    # Make sure that the minimum entry is 0.
    adj_meta_grad -= adj_meta_grad.min()
    # Filter self-loops
    adj_meta_grad -= torch.diag(torch.diag(adj_meta_grad, 0))
    # Set entries to 0 that could lead to singleton nodes.
    singleton_mask = self.filter_potential_singletons(modified_adj)
    adj_meta_grad = adj_meta_grad * singleton_mask
    if ll_constraint:
        allowed_mask, self.ll_ratio = self.log_likelihood_constraint(modified_adj, ori_adj, ll_cutoff)
        allowed_mask = allowed_mask.to(self.device)
        adj_meta_grad = adj_meta_grad * allowed_mask
    return adj_meta_grad

I tried substituting lines 128 and 130 with explicit (out-of-place) subtraction, and that avoids the OOM for me, i.e. using
adj_meta_grad = adj_meta_grad - adj_meta_grad.min()
and
adj_meta_grad = adj_meta_grad - torch.diag(torch.diag(adj_meta_grad, 0))
In fact, I found it is enough to replace only line 128.
I think something goes wrong when "-=" and ".min()" are used together.
It would be really helpful if anyone could offer an explanation for this.
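
For reference, this is what the whole function looks like with both substitutions applied (a sketch; only the two subtraction lines differ from the code above):

def get_adj_score(self, adj_grad, modified_adj, ori_adj, ll_constraint, ll_cutoff):
    adj_meta_grad = adj_grad * (-2 * modified_adj + 1)
    # Make sure that the minimum entry is 0 (out-of-place instead of "-=").
    adj_meta_grad = adj_meta_grad - adj_meta_grad.min()
    # Filter self-loops (out-of-place instead of "-=").
    adj_meta_grad = adj_meta_grad - torch.diag(torch.diag(adj_meta_grad, 0))
    # Set entries to 0 that could lead to singleton nodes.
    singleton_mask = self.filter_potential_singletons(modified_adj)
    adj_meta_grad = adj_meta_grad * singleton_mask
    if ll_constraint:
        allowed_mask, self.ll_ratio = self.log_likelihood_constraint(modified_adj, ori_adj, ll_cutoff)
        allowed_mask = allowed_mask.to(self.device)
        adj_meta_grad = adj_meta_grad * allowed_mask
    return adj_meta_grad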

@pqypq

pqypq commented Feb 9, 2023

Hi! I've encountered the same CUDA out of memory problem when using the following environment on Ubuntu 20.04.5 LTS:

numpy==1.21.6
scipy==1.7.3
torch==1.13.1
torch_geometric==2.2.0
torch_scatter==2.1.0+pt113cu116
torch_sparse==0.6.16+pt113cu116

I have already made the changes that you suggested. Could you please help me solve this problem?
Thanks

@Leirunlin
Author

Hi! I'm sorry, I don't know why the changes don't work in your environment.
Here is my environment and the detailed setup that works for me:

numpy==1.23.4
scipy==1.9.3
torch==1.13.0
torch_geometric==2.2.0
torch_scatter==2.1.0
torch_sparse==0.6.15

I'm running the code on a GPU with 98 GB of memory. With no changes, I hit CUDA out of memory just like you after generating two or three attacked graphs with Metattack. (I guess gradients or other intermediate tensors are not being freed from the GPU.) After making the changes I mentioned, the memory cost on Cora is about 3000-4000 MB, which is acceptable to me.
Could you provide more information about your problem, such as how you made the changes and what the memory cost is in your case?
Thanks.

@pqypq

pqypq commented Feb 12, 2023

Hi, thanks for your reply!
I'm working on Metattack for graph data, running the Cora dataset on a GPU with 10.76 GB of memory.
I tried to replace

adj_meta_grad -= adj_meta_grad.min()
adj_meta_grad -= torch.diag(torch.diag(adj_meta_grad, 0))

with

adj_meta_grad = adj_meta_grad - adj_meta_grad.min()
adj_meta_grad = adj_meta_grad - torch.diag(torch.diag(adj_meta_grad, 0))

When the process reaches around 45%, I encounter the OOM error; the error message is shown below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB 
(GPU 1; 10.76 GiB total capacity; 9.90 GiB already allocated; 11.56 MiB free; 9.91 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Do you have any suggestions for me?

Thanks!

@Leirunlin
Author

Hi!
That is quite similar to what I see when no changes are made: the OOM occurs in the middle of training.
To debug, I suggest you first check which part of Metattack leads to the problem in your environment. In issue #90, someone mentioned that the inner_train() function could also be problematic. While that function works fine for me, I suggest you check it as well; one way to narrow this down is sketched below.
Anyway, I will try to reproduce the bug and the fix on other devices, but I'm afraid it may take some time.
Maybe someone else can provide more samples of the problem.
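
A minimal, self-contained sketch of the idea: print the CUDA allocator statistics between stages. The matrix multiplication below is only a stand-in for one stage of the attack; in practice you would call report() right after inner_train() and right after get_adj_score() inside the attack loop:

import torch

def report(tag):
    # memory_allocated(): memory currently held by live tensors on the GPU.
    # max_memory_allocated(): peak usage since the last reset_peak_memory_stats().
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"peak={torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")

if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    report("start")
    x = torch.randn(2000, 2000, device="cuda", requires_grad=True)
    loss = (x @ x).sum()   # stand-in for one stage of the attack
    report("after forward")
    loss.backward()
    report("after backward")

If the "allocated" number keeps growing from one perturbation step to the next, something (for example an autograd graph) is being kept alive between steps.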

@pqypq

pqypq commented Feb 14, 2023

Hi, thanks for your suggestions!

I tried to create an environment with the details below:

numpy==1.18.1
scipy==1.4.1
pytorch==1.8.0
torch_scatter==2.0.8
torch_sparse==0.6.12

Under these settings, I can successfully run the datasets 'cora', 'cora_ml', 'citeseer', and 'polblogs'. But the 'pubmed' dataset still hits the OOM problem. Have you ever encountered this?

Thanks!

@Leirunlin
Author

Leirunlin commented Feb 15, 2023

Hi!
You can try MetaApprox from mettack.py; it is an approximated version of Metattack, and in ProGNN, PubMed is attacked using MetaApprox. A rough usage sketch is below.
If it still does not work for you, maybe you should try more scalable attacks.
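
The sketch loosely follows the Metattack example shipped with DeepRobust (examples/graph/test_mettack.py); the argument names are taken from that example and may differ between DeepRobust versions, so treat it as a starting point rather than the exact API:

import numpy as np
import torch
from deeprobust.graph.data import Dataset
from deeprobust.graph.defense import GCN
from deeprobust.graph.global_attack import MetaApprox

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load PubMed and the train/val/test split.
data = Dataset(root='/tmp/', name='pubmed')
adj, features, labels = data.adj, data.features, data.labels
idx_train, idx_val, idx_test = data.idx_train, data.idx_val, data.idx_test
idx_unlabeled = np.union1d(idx_val, idx_test)

# Train the surrogate GCN that the attack is computed against.
surrogate = GCN(nfeat=features.shape[1], nclass=labels.max().item() + 1,
                nhid=16, with_relu=False, device=device).to(device)
surrogate.fit(features, adj, labels, idx_train)

# MetaApprox is the approximated (cheaper) variant of Metattack mentioned above.
model = MetaApprox(model=surrogate, nnodes=adj.shape[0], feature_shape=features.shape,
                   attack_structure=True, attack_features=False,
                   device=device, lambda_=0).to(device)

n_perturbations = int(0.05 * (adj.sum() // 2))  # perturb ~5% of the edges
model.attack(features, adj, labels, idx_train, idx_unlabeled,
             n_perturbations, ll_constraint=False)
modified_adj = model.modified_adj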

@nowyouseemejoe

I found an efficient implementation in GreatX, hope it helps you.

@pqypq

pqypq commented Feb 17, 2023

Hi @Leirunlin @nowyouseemejoe , thanks for your advice, I will try that!

@ChandlerBang
Collaborator

ChandlerBang commented Feb 26, 2023

Thank you all for the great discussion and suggestions! We are a bit shorthanded right now, so you may want to open a pull request directly if you find any bugs.

For the OOM issue, mettack is very memory-consuming; we need a ~30 GB GPU to run it on Pubmed with MetaApprox. I have just added a scalable global attack, PRBCD.

pip install deeprobust==0.2.7

You may want to try python examples/graph/test_prbcd.py or take a look at test_prbcd.py.

@EnyanDai

Hi,
This problem can be solved by revising line 126 as:
adj_meta_grad = adj_grad.detach() * (-2 * modified_adj.detach() + 1)
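
In context, the top of get_adj_score() would then look roughly like this (a sketch of the suggested revision; the remaining lines of the function stay as they are):

def get_adj_score(self, adj_grad, modified_adj, ori_adj, ll_constraint, ll_cutoff):
    # detach() returns tensors that share storage but are cut off from autograd,
    # so the score computed here no longer holds a reference to the large
    # autograd graph behind adj_grad / modified_adj (a likely source of the
    # memory growth across perturbation steps).
    adj_meta_grad = adj_grad.detach() * (-2 * modified_adj.detach() + 1)
    # ... rest of the function unchanged ...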


@pqypq

pqypq commented Mar 12, 2023

Hi Enyan,

Thank you for your suggestion! I tried to modify the code as you described, but it still doesn't work on my device. May I ask how much GPU memory your device has?

Thanks
