
DataLoader workers deadlocked #1595

Closed
esube opened this issue May 19, 2017 · 15 comments

Comments

@esube

esube commented May 19, 2017

I have the same issue (#804) with two workers on an AWS P2 instance with 241 GB of shared memory (less than 1% used). Has anyone run into this issue without Docker?

@aa88kk

aa88kk commented May 20, 2017

I also encountered the same problem. This issue may be related to #1579 and #1355.

@zym1010
Contributor

zym1010 commented May 21, 2017

I'm the author of #1355. I tried setting pin_memory to False, and that seems to have solved the problem.

I came up with this idea by reading pytorch/examples#56
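For anyone trying the same workaround, a minimal sketch (the dataset, shapes, and batch size here are placeholders for illustration, not from this thread):

import torch
import torch.utils.data

class ToyDS(object):  # hypothetical dataset, for illustration only
    def __getitem__(self, idx):
        return torch.rand(3, 224, 224)
    def __len__(self):
        return 1000

# pin_memory=False is the workaround being described; it is also
# the constructor's default value.
loader = torch.utils.data.DataLoader(ToyDS(), batch_size=32,
                                     num_workers=2, pin_memory=False)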

@fmassa
Member

fmassa commented May 21, 2017

I think I have a reproduction:

import torch
import torch.utils.data

class DS(object):
    def __getitem__(self, idx):
        # Each sample is ~4.9 MB (3 * 640 * 640 float32 values).
        return torch.rand(3, 640, 640)
    def __len__(self):
        return 2000

ds = DS()
# With batch_size=500 each batch is ~2.46 GB, past the 2**31 - 1 byte mark.
it = torch.utils.data.DataLoader(ds, batch_size=500, num_workers=1)

for i, data in enumerate(it):
    print(i)

On my machine, it got stuck at recv_bytes.
Note that the total size of a batch is greater than 2 GB in this case (500 × 3 × 640 × 640 float32 values ≈ 2.46 GB), and there might be a limitation that queue payloads must be < 2^31 bytes, or that pickle can't handle objects larger than 2^31 bytes.

@apaszke do you see a solution for this problem?

@apaszke
Contributor

apaszke commented May 22, 2017

Not really. This pickle limitation shouldn't cause problems here - the data is copied into shared memory and only a very small handle is sent to another process.
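The mechanism he's describing can be poked at directly (a minimal sketch; the shape is arbitrary):

import torch

t = torch.rand(3, 640, 640)
print(t.is_shared())  # False: ordinary process-private memory

# Workers move batch storage into shared memory like this, so only
# a small handle has to be pickled and sent across the queue.
t.share_memory_()
print(t.is_shared())  # True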

@apaszke
Contributor

apaszke commented May 22, 2017

But there might be some more overflows happening inside Python's multiprocessing library 😕 maybe it includes a segment size along with the fd?

@fmassa
Member

fmassa commented May 22, 2017

OK, I just tried running the snippet I sent on our cluster and it didn't crash as I was expecting (I came up with that example on my local machine).
This is the stack trace I got on the server:

Traceback (most recent call last):
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 38, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/queues.py", line 355, in put
    self._writer.send_bytes(obj)
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

I'll try to come up with a better MWE.
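For reference, the struct error at the bottom of that trace can be reproduced on its own (a standalone sketch, independent of PyTorch): this Python version packs the payload length into a signed 32-bit header before sending, so any length past 2**31 - 1 fails.

import struct

struct.pack("!i", 2**31 - 1)  # fine: largest value a signed 32-bit int holds
struct.pack("!i", 2**31)      # struct.error: 'i' format requires
                              # -2147483648 <= number <= 2147483647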

@apaszke
Contributor

apaszke commented May 22, 2017

I also can't reproduce the error on my machine using your snippet. But the stack trace really seems to show that we're trying to send some large blob. That's surprising. Can you try messing around with multiprocessing and see what's going on? (You can modify /home/fmassa/sandbox/conda/envs/devenv/lib/python3.6/multiprocessing/connection.py by adding prints etc.)
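One way to do that without editing the installed file is to monkeypatch the send path from the top of the script (a sketch; Connection._send_bytes is a CPython internal, so the attribute may differ across versions):

import multiprocessing.connection as mpc

_orig_send_bytes = mpc.Connection._send_bytes

def _send_bytes_logged(self, buf):
    # Print every payload size crossing the pipe, to spot blobs
    # approaching the 2**31 - 1 byte header limit.
    print("send_bytes: %d bytes" % len(buf))
    return _orig_send_bytes(self, buf)

mpc.Connection._send_bytes = _send_bytes_logged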

@fmassa
Member

fmassa commented May 22, 2017

OK, I now have a repro for the issue I was facing on the cluster.
The problem was that I was indeed passing large blobs as numpy arrays (by overriding collate_fn to return an np.array). So an MWE that triggers my error message is the following:

import torch.utils.data
import numpy as np

class DS(object):
    def __getitem__(self, idx):
        # Each sample is a float64 array of ~7.4 MB.
        return np.zeros((3, 640, 480))
    def __len__(self):
        return 8000

ds = DS()
# The identity collate_fn returns a list of numpy arrays, which have
# to be pickled in full (~3.7 GB per batch of 500) to cross the queue.
it = torch.utils.data.DataLoader(ds, batch_size=500, num_workers=1,
                                 collate_fn=lambda x: x)

for i, data in enumerate(it):
    print(i)

This is expected, as numpy arrays are pickled entirely, while torch tensors are passed through shared memory.
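The difference is easy to see by pickling directly (a quick sketch): a numpy pickle carries the entire data buffer, so 500 such samples per batch overshoot the 32-bit header limit from the trace above.

import pickle
import numpy as np

arr = np.zeros((3, 640, 480))
payload = pickle.dumps(arr)
# The pickle embeds the full buffer, so it is at least arr.nbytes
# (3 * 640 * 480 * 8 bytes ≈ 7.4 MB) long; 500 of these ≈ 3.7 GB.
print(arr.nbytes, len(payload))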

And the issue I was having on my local machine with the script I sent earlier was probably insufficient shared memory (I had 4 GB, and one batch was > 2 GB).

Maybe we should write a summary of DataLoader gotchas? I'll start:

  • Always return a tensor from your collate_fn, not an np.array (see the sketch below).
  • If you are facing deadlocks, try increasing the amount of shared memory.

Any others? :)
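For the first gotcha, a collate_fn along these lines keeps everything in tensors (a sketch assuming the samples are equally-shaped numpy arrays):

import numpy as np
import torch

def collate(batch):
    # Convert each numpy sample to a tensor, then stack into one batch
    # tensor; tensors cross the worker queue via shared memory instead
    # of being pickled in full.
    return torch.stack([torch.from_numpy(sample) for sample in batch])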

@esube
Author

esube commented May 22, 2017

The issue on my side is most likely not due to shared memory, as I described above, and I am using tensors just like the other dataset examples. I still get the deadlock, and it happens at random times.

@zym1010
Contributor

zym1010 commented May 22, 2017

@esube have you tried turning off pin_memory?

@esube
Author

esube commented May 22, 2017

@zym1010 Actually, pin_memory is turned off by default as a constructor arg. Also, from the discussion you mentioned (pytorch/examples#56), it seems turning it on is useful when you use more than one GPU.

Mind you, my dataset is SVHN 32x32 with a batch size of 128. I don't think this is a large batch for the setup: an AWS P2 instance with 8 K80s, >400 GB of memory (241 GB shared, physical not virtual), and 32 CPU cores.

@zym1010
Contributor

zym1010 commented May 22, 2017

@esube Alright. In my experience, pinning or not doesn't make much difference, even with multiple GPUs. But not using pinning completely solves my problem.

@esube
Author

esube commented May 22, 2017

@zym1010 Thanks for your suggestion. Unfortunately, with or without pinning, I have the same hangup issue. The worker hangs querying the index_queue and is essentially deadlocked after that, as the worker's loop line r = index_queue.get() is a blocking call.
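To see that blocking behavior in isolation (a standalone sketch, not the actual DataLoader worker loop):

import multiprocessing
import queue

q = multiprocessing.Queue()

# get() with no timeout blocks forever if no producer ever calls
# put(); that is the state a deadlocked worker sits in.
try:
    q.get(timeout=2)  # a bounded wait raises instead of hanging
except queue.Empty:
    print("nothing arrived within 2 seconds")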

@zym1010
Contributor

zym1010 commented May 22, 2017

@esube Yeah, I also get stuck at index_queue.get(). There must be some race condition going on.

@apaszke
Contributor

apaszke commented May 24, 2017

I'm going to close this, because it's a duplicate of #1355. Please comment in the original issue.

@apaszke apaszke closed this as completed May 24, 2017