
DataLoader workers deadlocked #1595

Closed
esube opened this issue May 19, 2017 · 15 comments

Comments

@esube

esube commented May 19, 2017

I have the same issue (#804) with two workers on an AWS P2 instance with 241 GB of shared memory (less than 1% used). Has anyone run into this issue without Docker?

@aa88kk

aa88kk commented May 20, 2017

I also encountered the same problem. This issue may be related to #1579 and #1355.

@zym1010
Contributor

zym1010 commented May 21, 2017

I'm the author of #1355. I tried setting pin_memory to False, and that seems to have solved the problem.

I came up with this idea by reading pytorch/examples#56
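For anyone trying the same workaround, a minimal sketch (the dataset, shapes, and batch size here are placeholders for illustration, not from this thread):

import torch
import torch.utils.data

class ToyDS(object):  # hypothetical dataset, for illustration only
    def __getitem__(self, idx):
        return torch.rand(3, 224, 224)
    def __len__(self):
        return 1000

# pin_memory=False is the workaround being described; it is also
# the constructor's default value.
loader = torch.utils.data.DataLoader(ToyDS(), batch_size=32,
                                     num_workers=2, pin_memory=False)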

@fmassa
Member

fmassa commented May 21, 2017

I think I have a reproduction:

import torch
import torch.utils.data

class DS(object):
    def __getitem__(self, idx):
        # Each sample is ~4.9 MB (3 * 640 * 640 float32 values).
        return torch.rand(3, 640, 640)
    def __len__(self):
        return 2000

ds = DS()
# With batch_size=500 each batch is ~2.46 GB, past the 2**31 - 1 byte mark.
it = torch.utils.data.DataLoader(ds, batch_size=500, num_workers=1)

for i, data in enumerate(it):
    print(i)

On my machine, it got stuck at recv_bytes.
Note that the total size of a batch is greater than 2 GB in this case (500 × 3 × 640 × 640 float32 values ≈ 2.46 GB), and there might be a limitation that queue payloads must be < 2^31 bytes, or that pickle can't handle objects larger than 2^31 bytes.

@apaszke do you see a solution for this problem?

@apaszke
Contributor

apaszke commented May 22, 2017

Not really. This pickle limitation shouldn't cause problems here - the data is copied into shared memory and only a very small handle is sent to another process.
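The mechanism he's describing can be poked at directly (a minimal sketch; the shape is arbitrary):

import torch

t = torch.rand(3, 640, 640)
print(t.is_shared())  # False: ordinary process-private memory

# Workers move batch storage into shared memory like this, so only
# a small handle has to be pickled and sent across the queue.
t.share_memory_()
print(t.is_shared())  # True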

@apaszke
Contributor

apaszke commented May 22, 2017

But there might be some more overflows happening inside Python's multiprocessing library 😕 maybe it includes a segment size along with the fd?

@fmassa
Member

fmassa commented May 22, 2017

OK, I just tried running the snippet I sent on our cluster and it didn't crash as I was expecting (I came up with that example on my local machine).
This is the stack trace I got on the server:

Traceback (most recent call last):
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 38, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/queues.py", line 355, in put
    self._writer.send_bytes(obj)
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

I'll try to come up with a better MWE.
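For reference, the struct error at the bottom of that trace can be reproduced on its own (a standalone sketch, independent of PyTorch): this Python version packs the payload length into a signed 32-bit header before sending, so any length past 2**31 - 1 fails.

import struct

struct.pack("!i", 2**31 - 1)  # fine: largest value a signed 32-bit int holds
struct.pack("!i", 2**31)      # struct.error: 'i' format requires
                              # -2147483648 <= number <= 2147483647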

@apaszke
Contributor

apaszke commented May 22, 2017

I also can't reproduce the error on my machine using your snippet. But the stack trace really seems to show that we're trying to send some large blob. That's surprising. Can you try messing around with multiprocessing and see what's going on? (You can modify /home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/connection.py by adding prints etc.)
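One way to do that without editing the installed file is to monkeypatch the send path from the top of the script (a sketch; Connection._send_bytes is a CPython internal, so the attribute may differ across versions):

import multiprocessing.connection as mpc

_orig_send_bytes = mpc.Connection._send_bytes

def _send_bytes_logged(self, buf):
    # Print every payload size crossing the pipe, to spot blobs
    # approaching the 2**31 - 1 byte header limit.
    print("send_bytes: %d bytes" % len(buf))
    return _orig_send_bytes(self, buf)

mpc.Connection._send_bytes = _send_bytes_logged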

@fmassa
Member

fmassa commented May 22, 2017

OK, I now have a repro for the issue I was facing on the cluster.
The problem was that I was indeed passing large blobs as numpy arrays (by overriding collate_fn to return an np.array). So an MWE that triggers my error message is the following:

import torch.utils.data
import numpy as np

class DS(object):
    def __getitem__(self, idx):
        # Each sample is a float64 array of ~7.4 MB.
        return np.zeros((3, 640, 480))
    def __len__(self):
        return 8000

ds = DS()
# The identity collate_fn returns a list of numpy arrays, which have
# to be pickled in full (~3.7 GB per batch of 500) to cross the queue.
it = torch.utils.data.DataLoader(ds, batch_size=500, num_workers=1,
                                 collate_fn=lambda x: x)

for i, data in enumerate(it):
    print(i)

This is expected, as numpy arrays are pickled entirely, while torch tensors are passed through shared memory.
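The difference is easy to see by pickling directly (a quick sketch): a numpy pickle carries the entire data buffer, so 500 such samples per batch overshoot the 32-bit header limit from the trace above.

import pickle
import numpy as np

arr = np.zeros((3, 640, 480))
payload = pickle.dumps(arr)
# The pickle embeds the full buffer, so it is at least arr.nbytes
# (3 * 640 * 480 * 8 bytes ≈ 7.4 MB) long; 500 of these ≈ 3.7 GB.
print(arr.nbytes, len(payload))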

And the issue I was having on my local machine with the script I sent earlier was probably insufficient shared memory (I had 4 GB, and one batch was > 2 GB).

Maybe we should write a summary of DataLoader gotchas? I'll start:

  • Always return a tensor from your collate_fn, not an np.array (see the sketch below).
  • If you are facing deadlocks, try increasing the amount of shared memory.

Any others? :)
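For the first gotcha, a collate_fn along these lines keeps everything in tensors (a sketch assuming the samples are equally-shaped numpy arrays):

import numpy as np
import torch

def collate(batch):
    # Convert each numpy sample to a tensor, then stack into one batch
    # tensor; tensors cross the worker queue via shared memory instead
    # of being pickled in full.
    return torch.stack([torch.from_numpy(sample) for sample in batch])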

@esube
Author

esube commented May 22, 2017

The issue on my side is most likely not due to shared memory, as I described above, and I am using tensors just like the other dataset examples. I still get the deadlock, and it happens at random times.

@zym1010
Contributor

zym1010 commented May 22, 2017

@esube have you tried turning off pin_memory?

@esube
Author

esube commented May 22, 2017

@zym1010 Actually, pin_memory is turned off by default as a constructor arg. Also, from the discussion you mentioned (pytorch/examples#56), it seems turning it on is useful when you use more than one GPU.

Mind you, my dataset is SVHN 32x32 with a batch size of 128. I don't think this is a large batch for the setup: an AWS P2 instance with 8 K80s, >400 GB of memory (241 GB shared, physical not virtual), and 32 CPU cores.

@zym1010
Contributor

zym1010 commented May 22, 2017

@esube Alright. In my experience, pinning or not doesn't make much difference, even with multiple GPUs. But not using pinning completely solves my problem.

@esube
Author

esube commented May 22, 2017

@zym1010 Thanks for your suggestion. Unfortunately, with or without pinning, I have the same hangup issue. The worker hangs querying the index_queue and is essentially deadlocked after that, as the worker's loop line r = index_queue.get() is a blocking call.
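To see that blocking behavior in isolation (a standalone sketch, not the actual DataLoader worker loop):

import multiprocessing
import queue

q = multiprocessing.Queue()

# get() with no timeout blocks forever if no producer ever calls
# put(); that is the state a deadlocked worker sits in.
try:
    q.get(timeout=2)  # a bounded wait raises instead of hanging
except queue.Empty:
    print("nothing arrived within 2 seconds")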

@zym1010
Contributor

zym1010 commented May 22, 2017

@esube Yeah, I also get stuck at index_queue.get(). There must be some race condition going on.

@apaszke
Contributor

apaszke commented May 24, 2017

I'm going to close this, because it's a duplicate of #1355. Please comment in the original issue.

@apaszke apaszke closed this as completed May 24, 2017