Replies: 3 comments
-
In TorchSharp, Niklas has been looking at these issues a lot, and the same lessons will apply to DiffSharp. He adds an explicit GC.Collect() in the training loop, e.g. here: https://github.com/xamarin/TorchSharp/blob/master/src/FSharp.Examples/AlexNet.fs#L124. And yes, using batch GC will surely help too. He also adds explicit …
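For anyone who wants the shape of the pattern without opening the link, here is a minimal sketch of a training loop that forces a collection on each batch. This is not the actual AlexNet.fs code; names like model, optimizer, loader, and criterion are illustrative placeholders standing in for whatever your TorchSharp/DiffSharp setup provides.

```fsharp
// Hypothetical training loop; all identifiers below are placeholders,
// assumed to come from your own TorchSharp/DiffSharp model setup.
for epoch in 1 .. epochs do
    for (input, target) in loader do
        optimizer.zero_grad ()
        let output = model.forward input
        let loss = criterion output target
        loss.backward ()
        optimizer.step ()
        // Force the .NET GC to run each batch so the managed wrappers
        // around native (GPU) tensors are finalized promptly, instead
        // of accumulating until the GC decides to run on its own.
        System.GC.Collect ()
```

The key line is the GC.Collect() at the end of the batch loop: with server GC the runtime may otherwise wait a long time between collections, during which dead tensors keep holding GPU memory.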
-
Thank you for the kind GC.Collect() tip. Doing it on each batch loop iteration works. For future users: GC.Collect() does not cause a visible drop in the GPU memory usage shown in Windows Task Manager, but it does work, and it prevents out-of-memory errors on new allocations.
-
Cool, good to know. We should update the DiffSharp samples to include this as a matter of course as well.
-
Summary
Should there be a recommendation to avoid the .NET server garbage collection setting when using DiffSharp with a GPU?
I typically use "System.GC.Server": true with fsi for parallel CPU performance, but with this setting unused tensors do not get collected very often. That's a problem for the GPU: I kept getting CUDA out-of-memory errors while training a model. The problem goes away when I use the (default) workstation garbage collection setting.
Code to reproduce the issue is below.
Further related reading for anybody else who finds themselves in this situation: https://github.com/xamarin/TorchSharp/blob/master/docfx/articles/memory.md
Reproduction
The code below runs with "workstation" garbage collection and crashes with "server" garbage collection.
To switch between garbage collection settings, you need to modify fsi.runtimeconfig.json. On Windows I find this file in C:\Program Files\dotnet\sdk\5.0.300-preview.21258.4\FSharp.
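For reference, a minimal sketch of what the relevant part of fsi.runtimeconfig.json looks like with server GC disabled. The real file on your machine will contain additional entries (framework version, other runtime options); the only setting that matters here is System.GC.Server, and this fragment is an assumption about its shape, not a verbatim copy of the shipped file.

```json
{
  "runtimeOptions": {
    "configProperties": {
      "System.GC.Server": false
    }
  }
}
```

Setting the value to true restores server GC (and the CUDA out-of-memory behaviour described above); false selects the default workstation GC.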