Errors encountered during training #43

joe880923 · 2023-07-23T09:02:34Z

Hello, when I run train.py, the following error is raised in dcls_arch.py:

    File "/home/wu/DCLS/codes/config/DCLS/models/modules/dcls_arch.py", line 87, in forward
    clear_features[:, i:i+1, :, :] = clear_feature_ch[:, :, ks:-ks, ks:-ks]

    RuntimeError: expand(torch.cuda.FloatTensor{[64, 1, 64, 64, 2, 2]}, size=[64, 1, 64, 64]): the number of sizes provided (4) must be greater or equal to the number of dimensions in the tensor (6)

The relevant source code is here:

    # Imports required by this excerpt
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CLS(nn.Module):
        def __init__(self, nf, reduction=4):
            super().__init__()

            self.reduce_feature = nn.Conv2d(nf, nf//reduction, 1, 1, 0)

            self.grad_filter = nn.Sequential(
                nn.Conv2d(nf//reduction, nf//reduction, 3),
                nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(nf//reduction, nf//reduction, 3),
                nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(nf//reduction, nf//reduction, 3),
                nn.AdaptiveAvgPool2d((3, 3)),
                nn.Conv2d(nf//reduction, nf//reduction, 1),
            )

            self.expand_feature = nn.Conv2d(nf//reduction, nf, 1, 1, 0)

        def forward(self, x, kernel):
            cls_feats = self.reduce_feature(x)
            kernel_P = torch.exp(self.grad_filter(cls_feats))
            kernel_P = kernel_P - kernel_P.mean(dim=(2, 3), keepdim=True)
            clear_features = torch.zeros(cls_feats.size()).to(x.device)
            print(clear_features.shape)  # debug print
            ks = kernel.shape[-1]
            dim = (ks, ks, ks, ks)
            feature_pad = F.pad(cls_feats, dim, "replicate")
            for i in range(feature_pad.shape[1]):
                feature_ch = feature_pad[:, i:i+1, :, :]
                print(feature_ch.shape)  # debug print
                clear_feature_ch = get_uperleft_denominator(feature_ch, kernel, kernel_P[:, i:i+1, :, :])
                print(clear_feature_ch.shape)  # debug print
                # The RuntimeError above is raised on the next line
                clear_features[:, i:i+1, :, :] = clear_feature_ch[:, :, ks:-ks, ks:-ks]

            x = self.expand_feature(clear_features)

            return x

In this code, the line clear_feature_ch = get_uperleft_denominator(feature_ch, kernel, kernel_P[:, i:i+1, :, :]) calls the function get_uperleft_denominator.
That function is as follows:

    def get_uperleft_denominator(img, kernel, grad_kernel):
        ker_f = convert_psf2otf(kernel, img.size())  # discrete Fourier transform of kernel
        ker_p = convert_psf2otf(grad_kernel, img.size())  # discrete Fourier transform of grad_kernel

        denominator = inv_fft_kernel_est(ker_f, ker_p)

        numerator = torch.fft.fftn(img, dim=(-3, -2, -1))
        numerator = torch.stack((numerator.real, numerator.imag), -1)
        #numerator = torch.fft.ifft2(torch.complex(img[..., 0], img[..., 1]), dim=(-3, -2, -1))

        deblur = deconv(denominator, numerator)
        return deblur

The convert_psf2otf function used there is as follows:

    def convert_psf2otf(ker, size):
        psf = torch.zeros(size).cuda()
        # circularly shift
        centre = ker.shape[2]//2 + 1
        psf[:, :, :centre, :centre] = ker[:, :, (centre-1):, (centre-1):]
        psf[:, :, :centre, -(centre-1):] = ker[:, :, (centre-1):, :(centre-1)]
        psf[:, :, -(centre-1):, :centre] = ker[:, :, : (centre-1), (centre-1):]
        psf[:, :, -(centre-1):, -(centre-1):] = ker[:, :, :(centre-1), :(centre-1)]
        # compute the otf
        #otf = torch.rfft(psf, 3, onesided=False)
        #otf = torch.fft.ifft2(torch.complex(psf[..., 0], psf[..., 1]), dim=(-3, -2, -1))

        otf = torch.fft.fftn(psf, dim=(-3, -2, -1))
        otf = torch.stack((otf.real, otf.imag), -1)
        return otf

I printed .shape for clear_features, feature_ch, and clear_feature_ch, and their dimensions are, respectively:

    torch.Size([64, 16, 64, 64])
    torch.Size([64, 1, 106, 106])
    torch.Size([64, 1, 106, 106, 2, 2])

Why does this happen, and how should this part be modified? Thank you!
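
The extra trailing dimensions can be reproduced in isolation. `torch.fft.fftn` returns a complex tensor, and `torch.stack((t.real, t.imag), -1)` appends one axis of size 2, which matches the layout that the removed `torch.rfft` used to produce. A second axis of size 2 would appear if the inverse-transform step were ported the same way, that is, by stacking again instead of collapsing the existing axis. This is an assumption: `deconv` and `inv_fft_kernel_est` are not shown in the thread, and the fix that was eventually applied is not recorded. The sketch below uses two hypothetical helpers, `fft_as_real_pairs` and `ifft_from_real_pairs`, to show a round trip that ends with the original number of dimensions:

```python
import torch

def fft_as_real_pairs(x):
    """Forward FFT over the last three dims, returned in the legacy (..., 2) layout."""
    return torch.view_as_real(torch.fft.fftn(x, dim=(-3, -2, -1)))

def ifft_from_real_pairs(pairs):
    """Collapse the trailing (real, imag) axis, invert the FFT, keep the real part."""
    spectrum = torch.view_as_complex(pairs.contiguous())
    return torch.fft.ifftn(spectrum, dim=(-3, -2, -1)).real

torch.manual_seed(0)
x = torch.randn(2, 1, 8, 8)             # small stand-in for one padded feature channel
pairs = fft_as_real_pairs(x)            # one extra trailing axis of size 2
restored = ifft_from_real_pairs(pairs)  # back to a 4-D real tensor

print(tuple(x.shape), tuple(pairs.shape), tuple(restored.shape))
```

If the deconvolution step returned a 4-D tensor like `restored`, the slice `clear_feature_ch[:, :, ks:-ks, ks:-ks]` would have the 4-D shape that the assignment on line 87 expects.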


joe880923 · 2023-07-23T18:45:39Z

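Hello, the problem above has been resolved and no error is raised anymore, but the following situation now occurs.

After running python3 train.py -opt=options/setting1/train/train_setting1_x4.yml, training ends up in this state: no error appears, but no training progress appears either. It just hangs here.

Do you know how to resolve this? Thank you.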

Algolzw · 2023-07-23T20:22:16Z

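Hi, this warning should be harmless. The main console currently does not print the training progress; all logs are saved under the log directory. So it is best to run the training code in the background and then follow the training progress with tail -f log/DCLSx4_setting1/train_xxxx.log -n 100.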
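
Where `tail` is not available, a one-shot snapshot of the end of the log can be taken from Python instead. This is only a minimal sketch: `last_lines` is a hypothetical helper, it does not keep following the file the way `tail -f` does, and the demo writes its own temporary file rather than reading a real training log.

```python
import tempfile
from collections import deque
from pathlib import Path

def last_lines(path, n=100):
    """Return the last n lines of a text file, similar to `tail -n`."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as f:
        # A bounded deque keeps only the newest n lines while streaming the file.
        return list(deque((line.rstrip("\n") for line in f), maxlen=n))

# Demo on a throwaway file standing in for a training log.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "train_demo.log"
    demo.write_text("\n".join(f"iter {i}" for i in range(1, 6)) + "\n", encoding="utf-8")
    tail_demo = last_lines(demo, n=2)

print(tail_demo)  # ['iter 4', 'iter 5']
```

For a real run, the path argument would be the log file under log/DCLSx4_setting1/.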

joe880923 · 2023-07-24T13:31:41Z

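Hello, I used tail -f log/DCLSx4_setting1/train_xxxx.log -n 100 to watch the training progress, but the log stops at the very beginning and does not move, and no error is raised either. I am using a single 3090 GPU; could it be that one card alone cannot handle the load? Thank you for your answer!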

Algolzw · 2023-07-24T14:16:19Z

That is possible too. You could try reducing the training batch size and the patch size.

joe880923 · 2023-07-24T15:16:25Z

By patch size, do you mean GT_size and LR_size in the yml file?
The defaults are GT_size=256 and LR_size=64. Can I try changing them to GT_size=256 and LR_size=16?
Or must GT_size=LR_size*4 always hold (because this is x4 training)?

Algolzw · 2023-07-24T15:29:31Z

GT size must be LR size * 4, to guarantee 4x super-resolution.

joe880923 · 2023-07-24T16:04:08Z

I set batch size = 2, GT_size=64, and LR_size=16.
A validation error occurred during training, as follows:

Algolzw · 2023-07-24T22:01:30Z

LR size 16 is probably too small; I suggest at least 40.
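
The two constraints mentioned in this thread (GT_size equal to LR_size times the scale, and an LR_size of at least about 40) can be checked before launching a run. A minimal sketch follows; `check_patch_sizes` is a hypothetical helper that is not part of the repository, and the minimum of 40 is taken from the suggestion above rather than from any documented limit.

```python
def check_patch_sizes(gt_size, lr_size, scale=4, min_lr_size=40):
    """Return a list of problems with a GT/LR patch-size pair (empty list means OK)."""
    problems = []
    if gt_size != lr_size * scale:
        problems.append(
            f"GT_size must equal LR_size * {scale} = {lr_size * scale}, got {gt_size}"
        )
    if lr_size < min_lr_size:
        problems.append(
            f"LR_size {lr_size} is below the suggested minimum of {min_lr_size}"
        )
    return problems

print(check_patch_sizes(256, 64))  # default setting: no problems
print(check_patch_sizes(256, 16))  # scale mismatch and LR patch too small
print(check_patch_sizes(64, 16))   # scale is consistent, but LR patch too small
print(check_patch_sizes(160, 40))  # smallest pair that satisfies both constraints
```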

joe880923 · 2023-07-25T05:28:03Z

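After adjusting the patch size and training for a while, iter has been stuck at 22000. The current train.log looks like this:

Adding val.log as well. Looking at val.log, training appears to be running normally.

From train.log I noticed that the validation crash is avoided only when psnr > 20, and the model is saved only when psnr > 26, so I looked at the code in train.py, shown below:

May I ask why it is done this way? Thank you!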

Algolzw · 2023-07-27T12:28:32Z

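Did it look normal before this point? This looks like training collapse. Reducing the learning rate can mitigate the problem; for details, see here.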

joe880923 · 2023-07-28T12:28:14Z

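Thank you for your reply!

I would also like to ask a few questions about testing. I am new to the blind SR field and want to confirm a few points. Sorry for the trouble!

1. In test_setting.yml, suppose I want to test with the Set14 dataset, where dataroot_GT and dataroot_LQ are set as follows:
dataroot_GT: /data/dataset/research/setting1/Set14/x4HRblur.lmdb
dataroot_LQ: /data/dataset/research/setting1/Set14/x4LRblur.lmdb
As I understand it, the input should be LR and the GT should be HR. Why are HRblur and LRblur used here instead of HR and LR?

2. When testing with a dataset such as Set14, which already ships HR and LR images at various scales, do I still need to process the HR images of each scale first (using generate_mod_blur_LR_bic.py to produce Bic, HR, kernel, LR, and LRblur)?

3. The train and test config files only use the LRblur and HRblur (I am not sure whether this is simply HR) produced by generate_mod_blur_LR_bic.py. What are bicubic, Kernel, and LR used for?

4. In the yml config files, does HRblur mean HR? As I understand it, HR is the high-resolution image and should serve as the GT, so why is there an HRblur?

Thank you!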

Algolzw · 2023-08-01T14:24:28Z

Hello,

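1. HRblur in the settings is actually just the HR image; my naming may not be very clear = =
2. Yes, the LR images always need to be generated with the generate_mod_blur_LR_bic.py script, because blind SR has an extra blur step compared with ordinary SR. Common datasets such as Set14 are obtained by simple downsampling only, without any blurring.
3. In general, LR is obtained by first blurring HR with the kernel and then applying bicubic downsampling.
4. Yes, HRblur is the HR image.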
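
The degradation described above (blur HR with a kernel, then bicubic downsampling) can be sketched in a few lines of PyTorch. This is an illustration under stated assumptions, not a reproduction of generate_mod_blur_LR_bic.py: the isotropic Gaussian kernel, its size and sigma, and the helper names `gaussian_kernel` and `degrade` are all hypothetical, and the script's own bicubic implementation may give slightly different pixel values than `F.interpolate`.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=21, sigma=2.0):
    """Isotropic Gaussian blur kernel normalized to sum to 1 (illustrative values)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    return kernel / kernel.sum()

def degrade(hr, kernel, scale=4):
    """Blur an HR batch (N, C, H, W) with one kernel, then bicubic-downsample by `scale`."""
    channels = hr.shape[1]
    pad = kernel.shape[-1] // 2
    # One copy of the kernel per channel, applied as a depthwise convolution.
    weight = kernel[None, None].repeat(channels, 1, 1, 1)
    padded = F.pad(hr, (pad, pad, pad, pad), mode="replicate")
    blurred = F.conv2d(padded, weight, groups=channels)
    return F.interpolate(blurred, scale_factor=1 / scale, mode="bicubic", align_corners=False)

hr = torch.rand(1, 3, 256, 256)   # random stand-in for an HR image
lr = degrade(hr, gaussian_kernel())
print(tuple(lr.shape))            # spatial size is 256 / 4 = 64
```

`F.conv2d` computes a cross-correlation, which coincides with convolution here only because the Gaussian kernel is symmetric; an asymmetric kernel would need to be flipped first.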

Best regards

ese-ouyang mentioned this issue Apr 26, 2024

Hello, the problem above has been resolved and no error is raised, but the following situation occurs #57

Open
