problem with torch.util.tensorboard add_graph() #24157

Closed
GinSoda opened this issue Aug 11, 2019 · 61 comments
Labels
module: tensorboard triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@GinSoda

GinSoda commented Aug 11, 2019

code

import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Sequential(     #input_size=(1*28*28)
            nn.Conv2d(1, 6, 5, 1, 2),
            nn.ReLU(),      #(6*28*28)
            nn.MaxPool2d(kernel_size=2, stride=2),  #output_size=(6*14*14)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(6, 16, 5),
            nn.ReLU(),      #(16*10*10)
            nn.MaxPool2d(2, 2)  #output_size=(16*5*5)
        )
        self.fc1 = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU()
        )
        self.fc2 = nn.Sequential(
            nn.Linear(120, 84),
            nn.ReLU()
        )
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size()[0], -1)
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x

dummy_input = torch.rand(13, 1, 28, 28)
model = LeNet()
with SummaryWriter(comment='Net', log_dir='/output') as w:
    w.add_graph(model, (dummy_input, ))

🐛 Bug

The log_dir is correct, but TensorBoard shows nothing.
Has anyone encountered the same problem?

To Reproduce

Expected behavior

Environment

  • PyTorch Version (e.g., 1.0): 1.2.0
  • OS (e.g., Linux): ubuntu16.04
  • How you installed PyTorch (conda, pip, source): pip
  • Build command you used (if compiling from source): pip install torch -U
  • Python version: 3.6
  • CUDA/cuDNN version: CUDA 10.0.130
  • GPU models and configuration: GTX1080Ti
  • Any other relevant information:

Additional context

@LittlePea13

LittlePea13 commented Aug 11, 2019

You need to close the writer or flush it.

    w.flush()
    w.close()

I faced the same problem and posted it on Stack Overflow.
That will generate the log, but in my case TensorBoard is still unable to load it, and the browser console shows this error when loading the graph:

Unhandled Promise Rejection: TypeError: null is not an object (evaluating 'Fa.node')

I have tried graphs generated in TensorFlow and they worked; it only fails with PyTorch ones, even the one provided in the PyTorch TensorBoard tutorial (the one using torchvision). The log file does contain the graph, as I can see in its contents, and the script doesn't complain when saving it; the failure happens only when visualising it in TensorBoard. Let me know if you are able to visualise your graph.

I can confirm that the Unhandled Promise Rejection issue does not happen in 1.1: I downgraded to 1.1 and it worked, and the graph now shows in TensorBoard. Weirdly, the graph generated by 1.1 has only 124 elements, while the one generated by 1.2 has 507. This is shown when verbose is True, and I am attaching the output generated by both as txt files.
verbose_graph_1.1.txt
verbose_graph_1.2.txt
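
For readers who want to check the first half of this diagnosis (whether the writer produced a log at all, as opposed to TensorBoard failing to render it), one option is to list the event files in the log directory. This is a minimal sketch using only the standard library. `find_event_files` is a hypothetical helper name, and the sketch assumes the usual `events.out.tfevents.*` file naming.

```python
import os


def find_event_files(log_dir):
    """Return a sorted list of (path, size_in_bytes) for event files under log_dir."""
    found = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            # Assumption: event files follow the "events.out.tfevents.*" naming pattern.
            if name.startswith("events.out.tfevents."):
                path = os.path.join(root, name)
                found.append((path, os.path.getsize(path)))
    return sorted(found)
```

Calling `find_event_files('/output')` after the writer has been closed should return at least one entry with a non-zero size. An empty list points to a wrong log_dir or an unflushed writer rather than to a rendering problem.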

@bintonto

Still blank after refreshing

@bintonto

import torch
import torchvision.models as models
from torch.utils.tensorboard import SummaryWriter
resnet18 = models.resnet18(pretrained=True)
x = torch.randn(1, 3, 224, 224)
writer = SummaryWriter()
writer.add_graph(resnet18, x)
writer.close()

Collecting environment information...
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.14.6
GCC version: Could not collect
CMake version: version 3.13.4

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.15.4
[conda] blas 1.0 mkl
[conda] mkl 2019.4 233
[conda] mkl_fft 1.0.12 py37h5e564d8_0
[conda] mkl_random 1.0.2 py37h27c97d8_0
[conda] pytorch 1.2.0 py3.7_0 pytorch
[conda] pytorch-nightly 1.2.0.dev20190629 py3.7_0 pytorch
[conda] pytorch-transformers 1.0.0 pypi_0 pypi
[conda] torchaudio 0.3.0 py37 pytorch
[conda] torchsummary 1.5.1 pypi_0 pypi
[conda] torchvision 0.4.0 py37_cpu pytorch

@LittlePea13

LittlePea13 commented Aug 12, 2019

If you check your log file, you will see it contains the graph; this is a different error from the first one mentioned by GinSoda.

Do you get a graph page in TensorBoard but the graph doesn't load? If you check your browser console, does it say

Unhandled Promise Rejection: TypeError: null is not an object (evaluating 'Fa.node')

If so, it is the same problem I am facing. I had to downgrade to v1.1, but the graph there is much simpler and doesn't contain the whole model.

@pietern pietern added the triaged label Aug 13, 2019
@maximiliense

Hi all,
@LittlePea13 in my case I have exactly the same problem as you (the tab shows up, but the model does not load and I get the same browser error)! I haven't downgraded to v1.1 yet though.

@rfejgin
Contributor

rfejgin commented Aug 26, 2019

I too am getting a graph page that is empty. I did flush and close the SummaryWriter.
Attaching screenshot including Chrome's console which shows an error that may be related.

Note:

  • I do see the textual graph being dumped to the command line console and it seems correct there.

Configuration:

  • PyTorch 1.2.0
  • TensorBoard 1.14.0
  • Python 3.5.2

image

@rfejgin
Contributor

rfejgin commented Aug 26, 2019

Note that I too get the same error when copying the example given in the PyTorch documentation:
https://pytorch.org/docs/stable/tensorboard.html

Only difference is that I am not using TensorBoard nightly, but the released TensorBoard 1.14.0.

@StuvX

StuvX commented Aug 27, 2019

I'm having the same issue - any luck on tracing the issue?

@alqbib

alqbib commented Aug 27, 2019

Me too.
My environment information:
windows 10
python3.6
pytorch 1.2 Is CUDA available: No
tensorboard 1.14 or tb-nightly 1.14.0a20190614 or tb-nightly 1.15.0a20190826
tensorflow 1.14
tensorboardX 1.8
numpy 1.17.0
blank

@ErenBalatkan

Getting the same problem as @alqbib and @rfejgin; I'm running the tutorial code located at https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html

@rfejgin
Contributor

rfejgin commented Aug 28, 2019

@apaszke @orionr, @lanpa - any idea? Thanks!

@orionr
Contributor

orionr commented Aug 28, 2019

My hunch is that this is due to the TensorBoard compat (non-TensorFlow case) issue we saw, where the log directory doesn't update correctly. It was fixed in tensorflow/tensorboard#2342. Unfortunately the fix didn't make it out for TensorBoard 1.14, so you have three options: (1) use TensorBoard nightly with the fix, (2) install TensorFlow to leverage that code path in TensorBoard, or (3) restart TensorBoard periodically for it to pick up the changes.

Please let us know if one of those options takes care of it.
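
Because these options depend on which of TensorBoard, tb-nightly and TensorFlow are actually installed, it can help to print the installed versions before reporting back. A minimal sketch, assuming Python 3.8+ for `importlib.metadata`; `installed_versions` and the `PACKAGES` tuple are hypothetical names chosen for illustration, and the package list is just the set of distributions mentioned in this thread.

```python
from importlib import metadata

# Distribution names mentioned in this thread (an illustrative selection).
PACKAGES = ("torch", "tensorboard", "tb-nightly", "tensorflow", "tensorboardX", "future")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name:14s} {version or 'not installed'}")
```

Pasting this output into a comment gives a consistent record of the combination being tested.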

@rfejgin
Contributor

rfejgin commented Aug 28, 2019

Same issue with TB nightly. I don't think it's (3) because this happens even when I restart TB after the graph dump is complete. Will try (2).

@rfejgin
Contributor

rfejgin commented Aug 28, 2019

Same issue with TensorFlow 1.14.0

@rfejgin
Contributor

rfejgin commented Aug 28, 2019

Possibly related - see screenshot of error in the Chrome console above
#24157 (comment)

@orionr
Contributor

orionr commented Aug 28, 2019

Interesting - you're right that Chrome console output is unusual. I wonder if our graph proto is somehow wrong in this case. @lanpa can you confirm the tutorial code works for you? Thanks.

@rfejgin
Contributor

rfejgin commented Aug 30, 2019

@lanpa: I wasn't sure what you meant by the thumbs-up - does the tutorial code work for you?
I've seen this problem (graph not displayed) both with the tutorial code and my own models. Given that others have observed the same, something seems broken in the graph functionality...

@orionr
Contributor

orionr commented Aug 30, 2019

cc @sanekmelnikov @natalialunova

@richard-vock

Can confirm the same issue with tutorial code as well as custom model.
Verbose output looks fine, graph does not load with console error in chromium:

(index):24242 Uncaught (in promise) TypeError: Cannot read property 'node' of null
    at (index):24242
    at arrayEach ((index):13920)
    at Function.forEach ((index):14082)
    at B.buildSubhierarchy ((index):24242)
    at new B ((index):24229)
    at HTMLElement.<anonymous> ((index):25062)
    at Object.d.time ((index):24285)
    at HTMLElement._buildRenderHierarchy ((index):25061)
    at HTMLElement._buildNewRenderHierarchy ((index):25061)
    at Object.runMethodEffect [as fn] ((index):3714)

tb-nightly (1.15.0a20190902)
pytorch (newest torch package via pip)

@JianhuanZhuo

Stuck by the same issue, any news?

@orionr
Contributor

orionr commented Sep 6, 2019

It seems like some graphs cause this issue. A potential fix is at #25599 but we're still confirming. If you're willing to apply those changes locally and confirm it fixes your issue that would be great.

@maximiliense

Hi @orionr, I can confirm that tensorboard does show the graph now! thanks!

@ulisesbussi

> Hi @orionr, I can confirm that tensorboard does show the graph now! thanks!

Me too! thanks!

@orionr
Contributor

orionr commented Sep 6, 2019

In that case, landing the changes so they'll be in pytorch-nightly. We'll then add more robust testing around these cases. Thank you!

@rfejgin
Contributor

rfejgin commented Sep 6, 2019

Works here too, thanks for the fix.

@orionr
Contributor

orionr commented Sep 6, 2019

Fix landed. Please confirm fixed in pytorch-nightly after the build tonight, but closing.

@dnovischi

Sorry, the torch 1.3.0 nightly did not fix the issue on Python 3.7. I will downgrade to Python 3.5 later this week to see if it's still a problem.

@orionr
Contributor

orionr commented Sep 26, 2019

@dnovischi, thanks for letting us know 1.3 doesn't work. Can you post a piece of sample code that shows the issue? cc @lanpa @sanekmelnikov

@dnovischi

dnovischi commented Sep 26, 2019

@orionr Here you go and thanks for the quick response.

example-torch-nightly.zip

Update:
Installing the future package solved the issue for the following setup:
torch 1.3.0.dev2019091
tensorboard 1.14.0
python 3.6
ubuntu 16.04

However, I now get a warning when launching the tensorboard server:
"FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; ..."
Of course, this is a tensorflow issue.

Also note that in the sample code above, I forgot to close the summary writer with tb.close().

@akashb95

akashb95 commented Sep 28, 2019

Was having the same problem.

I think it's necessary to have tensorboard-2.0.0. I wasn't able to get it to work with tensorboard-1.14 and pytorch nightly build.

Edit: Does now work with Python 3.6, tensorboard-2.0.0, pytorch-1.3.0dev20190925, Mac OS 10.14.6.

@wy171205

wy171205 commented Oct 3, 2019

@orionr Thank you for your guidance, I've just solved this problem.
version:
torch 1.3.0.dev20191002
tensorboard 1.14.0
Python 3.7

image

@willprice
Contributor

Also working with

  • pytorch 1.3.0.dev20190917
  • tensorboard from tf 2.0.0

@clefourrier
Contributor

I updated to

  • torch nightly 1.3.0.dev20191003
  • still using tensorflow 1.14.0
  • python 3.7.2

and it's still not working for me. (With the same web console log).

image

During the graph creation, I get the following trace

.../MEDeA/medea/models/transformer/cells.py:17: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  mask[i, :tensor.size(0)] = 1
.../MEDeA/medea/models/transformer/cells.py:131: TracerWarning: Converting a tensor to a Python index might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  x = x + self.pe[:x.size(0), :, :x.size(-1)]
.../MEDeA/medea/models/transformer/cells.py:64: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert e == self.input_dim, f'Input dim ({e}) should match layer input dim ({self.input_dim})'
.../MEDeA/medea/models/transformer/cells.py:83: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(key_dim)  # matrix multi and scale
.../MEDeA/medea/models/transformer/decoder.py:79: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  decoder_outputs = [torch.tensor(first_item).float().view(self.batch_size, -1)]  # first item is not predicted
.../MEDeA/medea/models/transformer/decoder.py:108: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  prev_predictions = torch.tensor([target_lang_token] * self.batch_size).long().view(self.batch_size, -1)
.../MEDeA/medea/models/transformer/decoder.py:109: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  eow = torch.tensor([eow_token] * self.batch_size).long().view(self.batch_size)
.../MEDeA/medea/models/transformer/decoder.py:111: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  while not torch.all(torch.eq(prev_predictions[:, -1], eow)) and i < memory.shape[1]:
.../MEDeA/medea/models/transformer/cells.py:85: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  indices = torch.triu_indices(key_dim, key_dim, offset=1)
.../MEDeA/medea/models/transformer/cells.py:17: TracerWarning: There are 2 live references to the data region being modified when tracing in-place operator copy_ (possibly due to an assignment). This might cause the trace to be incorrect, because all other views that also reference this data will not reflect this change in the trace! On the other hand, if all other views use the same memory chunk, but are disjoint (e.g. are outputs of torch.split), this might still be safe.
  mask[i, :tensor.size(0)] = 1
.../MEDeA/medea/models/transformer/cells.py:86: TracerWarning: There are 2 live references to the data region being modified when tracing in-place operator index_put_. This might cause the trace to be incorrect, because all other views that also reference this data will not reflect this change in the trace! On the other hand, if all other views use the same memory chunk, but are disjoint (e.g. are outputs of torch.split), this might still be safe.
  scores[:, :, indices[0], indices[1]] = -1e-32
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:37: UserWarning: Unknown op ConstantFill in domain `ai.onnx`.
  handler.ONNX_OP, handler.DOMAIN or "ai.onnx"))
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:37: UserWarning: Unknown op ImageScaler in domain `ai.onnx`.
  handler.ONNX_OP, handler.DOMAIN or "ai.onnx"))
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:34: UserWarning: Fail to get since_version of IsInf in domain `` with max_inclusive_version=9. Set to 1.
  handler.ONNX_OP, handler.DOMAIN, version))
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:34: UserWarning: Fail to get since_version of Mod in domain `` with max_inclusive_version=9. Set to 1.
  handler.ONNX_OP, handler.DOMAIN, version))
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:37: UserWarning: Unknown op Range in domain `ai.onnx`.
  handler.ONNX_OP, handler.DOMAIN or "ai.onnx"))
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:34: UserWarning: Fail to get since_version of Resize in domain `` with max_inclusive_version=9. Set to 1.
  handler.ONNX_OP, handler.DOMAIN, version))
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:34: UserWarning: Fail to get since_version of ReverseSequence in domain `` with max_inclusive_version=9. Set to 1.
  handler.ONNX_OP, handler.DOMAIN, version))
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:37: UserWarning: Unknown op Round in domain `ai.onnx`.
  handler.ONNX_OP, handler.DOMAIN or "ai.onnx"))
.../builds/onnx-tensorflow/onnx_tf/common/handler_helper.py:34: UserWarning: Fail to get since_version of ThresholdedRelu in domain `` with max_inclusive_version=9. Set to 1.
  handler.ONNX_OP, handler.DOMAIN, version))
W1003 20:30:03.961869 4491834816 deprecation.py:323] From .../builds/onnx-tensorflow/onnx_tf/handlers/backend/reshape.py:26: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W1003 20:30:03.964736 4491834816 deprecation.py:323] From .../builds/onnx-tensorflow/onnx_tf/handlers/backend/reshape.py:31: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W1003 20:30:04.010443 4491834816 deprecation.py:323] From .../builds/onnx-tensorflow/onnx_tf/handlers/backend_handler.py:182: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
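
The TracerWarning lines in this trace are emitted because add_graph traces the model, and tracing records only the operations executed for the example input. The sketch below illustrates that general behaviour with `torch.jit.trace` on a toy module (`Branchy` is a hypothetical example, unrelated to the model in the log). It shows why data-dependent Python control flow gets frozen into the trace; it does not claim to reproduce the blank graph page.

```python
import warnings

import torch
import torch.nn as nn


class Branchy(nn.Module):
    """Toy module whose output depends on a Python `if` over tensor values."""

    def forward(self, x):
        if x.sum() > 0:  # converting a tensor to a Python bool triggers a TracerWarning
            return x + 1
        return x - 1


model = Branchy()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the expected TracerWarning for this demo
    traced = torch.jit.trace(model, torch.ones(3))  # records only the `x + 1` branch

neg = -torch.ones(3)
eager_out = model(neg)    # eager mode takes the `x - 1` branch: all values are -2
traced_out = traced(neg)  # the trace replays `x + 1`: all values are 0
print(eager_out, traced_out)
```

A graph produced this way reflects only the path taken for the dummy input, which is worth keeping in mind when reading a graph of a model with loops or assertions over tensor values.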

@orionr
Contributor

orionr commented Oct 3, 2019

@clefourrier can you try installing TensorBoard (not necessarily TensorFlow) v2.0 and see if that fixes things for you?

@ysono

ysono commented Oct 4, 2019

As mentioned by others, I think we still need py 3.6. This worked for me:

  • python 3.6.9
  • tensorboard 2.0.0
  • torch 1.3.0.dev20191003

py 3.5 works. py 3.7 doesn't work.
Didn't try tensorboard 1.14.0

@orionr
Contributor

orionr commented Oct 4, 2019

@sanekmelnikov and @lanpa can we try py 3.7? Thanks.

@orionr orionr reopened this Oct 4, 2019
@clefourrier
Contributor

clefourrier commented Oct 4, 2019

> @clefourrier can you try installing TensorBoard (not necessarily TensorFlow) v2.0 and see if that fixes things for you?

@orionr I should have mentioned that I'm running tensorboard nightly, sorry (tb-nightly - 2.0.0a20190915 )

@orionr
Contributor

orionr commented Oct 4, 2019

Just tried with py3.7 and tb 2.0 locally with Mac and the example on https://pytorch.org/docs/stable/tensorboard.html worked for me. @clefourrier and @ysono can you try and isolate your respective errors? Maybe try the simple ResNet example above to see if that works for you. At this point it's unlikely we can get any fix in for the PyTorch 1.3 release coming soon, but happy to fix anything in the nightly once we've isolated things.

@willprice
Contributor

willprice commented Oct 4, 2019

I used 3.7 too and didn't have issues. Perhaps there is a specific op that is causing the issue?

@SuperShinyEyes

SuperShinyEyes commented Oct 4, 2019

For me, Tensorboard is not the problem but PyTorch IS.

Tested with Pytorch 1.3.0.dev20190917 and it renders the graph in horizontal mode.
Screenshot_2019-10-04_23-32-16

Pytorch 1.1 renders the same architecture as,
Screenshot_2019-10-04_23-36-38

The older version is what we expect(?) and is easier to read. With 1.3, it was impossible to read the ResNet graph.

I get the same rendering for both summary files with two different Tensorboard versions

  • 1.14.0
  • 2.1.0a2019100

@orionr
Contributor

orionr commented Nov 1, 2019

Thanks for the details. @lanpa, @J0Nreynolds and @sanekmelnikov are looking to improve this visualization with #26639 in 1.4

@shayan113

@orionr Thank you for your guidance, I've just solved this problem.
on Windows
version:
torch 1.3.1
tensorboard 2.0.1
Python 3.7.4

@jonas154

jonas154 commented Dec 2, 2019

@shayan113 Which problem have you solved?
Regarding the visualization problem with ResNet, I still get plots that are quite hard to read:
image

Ubuntu with
Torch: 1.3.1
Tensorboard: 2.0.2
Python: 3.7.4

@shschong

Got it working. I had to install tensorboardX and import SummaryWriter from there. Also, I installed everything via conda.
allVersions
graph

@orionr
Contributor

orionr commented Dec 18, 2019

Too bad that you needed to use tensorboardX instead of torch.utils.tensorboard, but happy you were able to unblock. @jonas154 did you try with 1.4? Thanks.

@jonas154

@shschong Thanks for sharing your solution!

@orionr 1.4 isn't released yet, is it? I wanted to wait for the release of the latest version. For now I'm using the HiddenLayer tool https://github.com/waleedka/hiddenlayer

@AceEviliano

Thank you all. I think everything is settled with PyTorch, including the visualization now being top-to-bottom instead of left-to-right. These are my specs:

Ubuntu 18.04.3 LTS
torch 1.5.0.dev20200113
tensorboard 2.0.1
Python 3.7

@BrambleXu

BrambleXu commented Jan 18, 2020

Update to pytorch 1.4 and tensorboard 2.1.0 with python 3.6, works well.

@Hsulet

Hsulet commented Jan 23, 2020

I am using
torch 1.4.0
tensorboard 2.0.2
Python 3.7.4
TensorBoard still displays two rectangles for the graph.

@sdsy888

sdsy888 commented Jan 31, 2020

Hi @orionr @ptrblck
I just encountered another issue with add_graph.

My packages:

pytorch==1.4.0
python==3.7.6

When I use

...
train_data_sample, _, _ = next(iter(dataloader_train))
writer.add_graph(model, train_data_sample)
...

the following error occurs:

*** RuntimeError: Cannot insert a Tensor that requires grad as a constant. Consider making it a parameter or input, or detaching the gradient

The traceback clearly shows that the error happens in the call to add_graph. I also tried changing input_to_model to a FloatTensor or a LongTensor; neither works.

Just confirmed: it has something to do with DataParallel. If I call add_graph before moving the model to all 4 of my GPUs, there's no such bug. So, is there any way to avoid this issue while still using DataParallel?

[Problem solved]

I found that wrapping the model in DataParallel for multiple GPUs is what causes the problem. add_graph needs the underlying model (model.module), not the DataParallel wrapper.

So here's the method for those who encounter the same issue:

# set up the summary writer
train_data_sample, label_sample = next(iter(dataloader_train))
writer = SummaryWriter(args.summary_path, flush_secs=120)

with writer:
    writer.add_graph(model.module, train_data_sample.to(device))  # model graph, with input
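
If the same code path has to handle both wrapped and unwrapped models, the `model.module` access can be made conditional. This is a minimal sketch; `unwrap_model` is a hypothetical helper name, not part of PyTorch, and it relies only on the `.module` attribute that `nn.DataParallel` and `DistributedDataParallel` expose.

```python
import torch.nn as nn


def unwrap_model(model):
    """Return the wrapped module for DataParallel-style wrappers, else the model itself."""
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        return model.module
    return model


base = nn.Linear(4, 2)
wrapped = nn.DataParallel(base)  # only constructed here; no forward pass is run

print(unwrap_model(wrapped) is base)  # the wrapper is peeled off
print(unwrap_model(base) is base)     # a plain model passes through unchanged
```

With this helper, the call above could be written as `writer.add_graph(unwrap_model(model), train_data_sample.to(device))`, whether or not the model has been wrapped.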

@jonas154

jonas154 commented Feb 3, 2020

> Update to pytorch 1.4 and tensorboard 2.1.0 with python 3.6, works well.

I can confirm - after an update to pytorch 1.4 everything works for me.

@orionr
Contributor

orionr commented Feb 26, 2020

Looks like we are at a good spot with PyTorch 1.4, so closing. Please open a new issue if you continue to have problems and thanks.

@orionr orionr closed this as completed Feb 26, 2020