
PyTorch Debugging Notes

박서현 2022. 2. 24. 11:19

I trained a deep learning model with PyTorch for the first time. Since it was my first attempt, I ran into all sorts of problems, from simple bugs to ones I still can't fully understand or solve. For future reference, I'm trying to write down as many of them as possible. The basic format is bug (error message) >> solution.

 

ValueError: Unknown resampling filter (256). Use Image.NEAREST (0), Image.LANCZOS (1), Image.BILINEAR (2), Image.BICUBIC (3), Image.BOX (4) or Image.HAMMING (5)
>>> transforms.Resize((h, w)) expects the target size as a tuple in its first argument. The second argument controls interpolation, so if the size is not passed as a tuple and h and w end up being read as two separate arguments, the error above occurs.
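A minimal sketch of the wrong and right calls (the sizes are placeholders):

from torchvision import transforms

# transforms.Resize(256, 256)           # wrong: the second 256 is read as a resampling filter
resize = transforms.Resize((256, 256))  # right: the size goes in as a single tuple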


Do I really have to increase the number of samples to train a segmentation model? Why doesn't PyTorch augmentation increase the sample count?
>>> But the greater question is what are you trying to accomplish with transformations? If you just want more random samples then train your networks for a longer number of epochs. Every epoch your dataset will be transformed differently (since you have random transformations there) so you’ll get a fresh set of 150 images for your network to chew on.
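A minimal sketch of that idea (the dataset class, paths, and transforms are assumptions): because the random transforms are re-sampled on every __getitem__ call, each epoch effectively sees a freshly augmented copy of the same images.

import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class AugmentedDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths
        self.transform = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.RandomRotation(15),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')
        return self.transform(img)   # randomness is re-drawn on every call, i.e. every epoch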


Labeling for multiclass segmentation
>>> https://discuss.pytorch.org/t/multiclass-segmentation/54065/2
>>> https://discuss.pytorch.org/t/error-in-python-s-multiprocessing-library/31355/6 
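A minimal sketch following the first thread (the shapes and class count are placeholders): the target mask is a LongTensor of class indices with shape (N, H, W), not a one-hot tensor.

import torch
import torch.nn as nn

num_classes = 4                                    # assumed number of classes
logits = torch.randn(2, num_classes, 64, 64)       # model output (N, C, H, W)
mask = torch.randint(0, num_classes, (2, 64, 64))  # class-index mask, dtype long
loss = nn.CrossEntropyLoss()(logits, mask)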


cv2 works with ndarrays internally.
When using albumentations, the image, mask, and keypoints can all be transformed. If the mask is produced in the order cv2.fillPoly -> transform, the polygon edges come out looking like a staircase; producing it in the order transform (image, polygon points) -> cv2.fillPoly avoids this.
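A minimal sketch of that order, assuming an albumentations pipeline with keypoint support (the image, polygon, and transforms below are placeholders):

import cv2
import numpy as np
import albumentations as A

image = np.zeros((1024, 1024, 3), dtype=np.uint8)       # placeholder image
polygon_points = [(200, 300), (400, 310), (380, 600)]   # placeholder polygon vertices

transform = A.Compose(
    [A.Resize(512, 512), A.HorizontalFlip(p=0.5)],
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=True),
)

out = transform(image=image, keypoints=polygon_points)
poly = np.array(out["keypoints"], dtype=np.int32)
mask = np.zeros(out["image"].shape[:2], dtype=np.uint8)
if len(poly) > 0:                                        # keypoints can vanish outside the image
    cv2.fillPoly(mask, [poly], 1)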

 

 

cv2.error: OpenCV(4.5.5) /io/opencv/modules/imgproc/src/drawing.cpp:2396: error: (-215:Assertion failed) p.checkVector(2, CV_32S) >= 0 in function 'fillPoly'
>>>Another reason may be that the array is not contiguous. Making it contiguous would also solve the issue
    image = np.ascontiguousarray(image, dtype=np.uint8)
>>>https://stackoverflow.com/questions/23830618/python-opencv-typeerror-layout-of-the-output-array-incompatible-with-cvmat
>>> (In my case) poly was [] in cv2.fillPoly(mask, [poly], lb, cv2.LINE_AA) -> the coordinates sometimes disappeared while going through the transforms (see the log below: once VerticalFlip is applied the keypoints vanish completely, and once ColorJitter is applied they come back again...).
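A minimal sketch combining the fixes above (the polygon points and label are placeholders): make the image contiguous, force the points to int32, and skip the call when the transform has dropped every point.

import cv2
import numpy as np

transformed_points = [(100, 120), (200, 130), (180, 250)]  # assumed keypoints after the transform
lb = 1                                                      # assumed class label

mask = np.zeros((512, 512), dtype=np.uint8)
mask = np.ascontiguousarray(mask, dtype=np.uint8)   # guard against non-contiguous arrays
poly = np.array(transformed_points, dtype=np.int32) # fillPoly needs integer (CV_32S) points
if poly.size > 0:                                   # transforms can drop all keypoints
    cv2.fillPoly(mask, [poly], lb, cv2.LINE_AA)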

2022-02-23 08:55:43,766 - Epoch [1/100] | Batch [200/467] train loss: 0.04039009287953377 (epoch avg 0.09668449312448502)
        ../../data/피부염/C1_감염성피부염/CYT_D_C1_002249.jpg
        CYT_D_C1_002249.jpg in get_data: [[2035, 1154], [2013, 1170], [2025, 1331], [1985, 1406], [1995, 1580], [2033, 1600], [2035, 1154]]
        ../../data/피부염/C1_감염성피부염/CYT_D_C1_002249.jpg
        0.5041380659431209
        Resize/t(512, 512, 3)/t[(508.75, 359.3965936739659), (503.25, 364.3795620437956), (506.25, 414.5206812652068), (496.25, 437.87834549878346), (498.75, 492.0681265206813), (508.25, 498.2968369829684), (508.75, 359.3965936739659)]
        Resize+RandomRotate90/t(512, 512, 3)/t[(2.25, 151.60340632603408), (7.75, 146.6204379562044), (4.75, 96.4793187347932), (14.75, 73.12165450121654), (12.25, 18.93187347931871), (2.75, 12.703163017031613), (2.25, 151.60340632603408)]
        Resize+RandomRotate90+Rotate/t(512, 512, 3)/t[(367.3089240516385, 5.6423821864378185), (372.1164748182134, 11.296356841627528), (422.32713493438484, 9.874580672856599), (445.35878714220314, 20.604141475918684), (499.6003837704073, 19.809432457204764), (506.12475131079833, 10.509999021726177), (367.3089240516385, 5.6423821864378185)]
        Resize+RandomRotate90+Rotate+HorizontalFlip/t(512, 512, 3)/t[(509.7539347754548, 154.0850044211807), (504.3029281593547, 149.0484877036511), (507.79313412458214, 98.9391043598495), (498.02203596247836, 75.48476320186683), (501.05185926907484, 21.322021968321017), (510.6123179570756, 15.186513538041204), (509.7539347754548, 154.0850044211807)]
        Resize+RandomRotate90+Rotate+HorizontalFlip+VerticalFlip/t(512, 512, 3)/t[]
        Resize+RandomRotate90+Rotate+HorizontalFlip+VerticalFlip+ColorJitter/t(512, 512, 3)/t[(147.6639231456087, 507.097784580284), (142.76739275179443, 501.5206918002438), (92.58556250514624, 503.73784084177396), (69.38679934507354, 493.3745466603199), (15.164603160853858, 495.02857452019214), (8.788397364483501, 504.4302144657196), (147.6639231456087, 507.097784580284)]
        Resize+RandomRotate90+Rotate+HorizontalFlip+VerticalFlip+ColorJitter+Normalize/t(512, 512, 3)/t[(123.28706114246694, 495.18688632402774), (118.96667802060045, 489.15246311588146), (68.81185355324635, 486.3910469167784), (46.75293956649551, 473.78212974782076), (123.28706114246694, 495.18688632402774)]
        Resize+Normalize/t(512, 512, 3)/t[(508.75, 359.3965936739659), (503.25, 364.3795620437956), (506.25, 414.5206812652068), (496.25, 437.87834549878346), (498.75, 492.0681265206813), (508.25, 498.2968369829684), (508.75, 359.3965936739659)]
        polygons_2d
                 [[2035 1154][2013 1170][2025 1331][1985 1406][1995 1580][2033 1600][2035 1154]]
        transformed_polygons_2d
                 [[509 359][503 364][506 415][496 438][499 492][508 498][509 359]]
        transformed_polygons_3d
                 [array([[509, 359],[503, 364],[506, 415],[496, 438],[499, 492],[508, 498],[509, 359]])]

Doing polygon coords -> fillPoly -> transform instead of polygon coords -> transform -> fillPoly removes the error, but at the same stage as above the mask image contains only background.

Applying the transform to that image with different random seeds showed that the small labeled region at the edge of the image had been pushed outside the image by the transform...

random seed = 0
random seed = 1
random seed = 2: with Flip + Rotation applied, you can see that the green lesion region has disappeared.

 


OSError: image file is truncated (75 bytes not processed)

>> Occurs when the image is corrupted as below. Since the ground-truth region lies in the grayed-out part, such images were not used for training the segmentation model.

>> With the setting below, the broken image can still be loaded:

from PIL import Image, ImageFile
import numpy as np

ImageFile.LOAD_TRUNCATED_IMAGES = True   # let PIL load truncated image files

img = Image.open('dog.png')
img = np.asarray(img)

 


RuntimeError: expected object of scalar type Long but got scalar type Double for argument #2 'target' in call to _thnn_nll_loss2d_forward
>>> In PyTorch's cross entropy, the target must be of type long.
>>>https://discuss.pytorch.org/t/runtimeerror-expected-object-of-scalar-type-long-but-got-scalar-type-float-when-using-crossentropyloss/30542/2
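A minimal sketch (the shapes are placeholders): casting the target mask to long before the loss call avoids the error.

import torch
import torch.nn as nn

logits = torch.randn(2, 3, 64, 64)                    # model output (N, C, H, W)
target = torch.zeros(2, 64, 64)                       # a float/double mask triggers the error
loss = nn.CrossEntropyLoss()(logits, target.long())   # cast the target to long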



multi loss
>>> https://stackoverflow.com/questions/53994625/how-can-i-process-multi-loss-in-pytorch/53997034
backward(retain_graph=True)
>>> https://jh-bk.tistory.com/13
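A minimal sketch of the two options (toy tensors only): sum the losses and call backward() once, or backward each loss separately with retain_graph=True so the graph survives for the next call.

import torch

x = torch.randn(4, 8, requires_grad=True)
loss1 = x.pow(2).mean()
loss2 = x.abs().mean()

# option 1: one backward pass on the combined loss
(loss1 + loss2).backward()

# option 2: separate backward passes; keep the graph alive for the second call
# loss1.backward(retain_graph=True)
# loss2.backward()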


Applying a Korean font to matplotlib on Linux
>>> https://teddylee777.github.io/visualization/matplotlib-ubuntu-%ED%95%9C%EA%B8%80%ED%8F%B0%ED%8A%B8%EC%84%A4%EC%B9%9
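A minimal sketch, assuming a Korean font such as NanumGothic has already been installed (the linked post covers the installation):

import matplotlib.pyplot as plt

plt.rcParams['font.family'] = 'NanumGothic'   # use the installed Korean font
plt.rcParams['axes.unicode_minus'] = False    # keep the minus sign from breaking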


When NLLLoss returns negative values
>>> The input to NLLLoss must be log probabilities, i.e. the output of LogSoftmax.
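A minimal sketch (toy shapes): apply log_softmax to the raw logits before NLLLoss, or use CrossEntropyLoss, which combines the two.

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)                 # raw model outputs
target = torch.tensor([0, 2, 1, 0])
loss = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)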




* multi gpu processing tips
>>> https://pytorch.org/tutorials/beginner/dist_overview.html
>>> https://medium.com/daangn/pytorch-multi-gpu-%ED%95%99%EC%8A%B5-%EC%A0%9C%EB%8C%80%EB%A1%9C-%ED%95%98%EA%B8%B0-27270617936b
>>> https://blog.si-analytics.ai/12
>>> torch.nn.parallel.DistributedDataParallel : https://algopoolja.tistory.com/56

* Error where execution hangs at dist.init_process_group
>>> Happens when, for example, there are 2 GPUs but a value greater than 2 is passed for world_size. It will hang unless you pass in nprocs=world_size to mp.spawn(). In other words, it's waiting for the "whole world" to show up, process-wise.
https://stackoverflow.com/questions/66498045/how-to-solve-dist-init-process-group-from-hanging-or-deadlocks
>>> When passing a TCP address to init_method, using the container's IP as the master_address (in a Docker container environment) and an arbitrary free port as the master_port, the function did not run properly; setting the IP to "localhost" made it work. (Added: team lead's comment) The URL used in distributed training is for communication between machines in a multi-machine setup, so in a single-GPU environment you can just use 'localhost'.
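A minimal sketch of that setup, assuming a single machine with 2 GPUs (the port is an arbitrary free one): nprocs passed to mp.spawn matches world_size, and 'localhost' is used in init_method.

import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://localhost:23456",   # assumed free port
        rank=rank,
        world_size=world_size,
    )
    # ... build the model, wrap it with DistributedDataParallel, train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2                             # number of GPUs on this machine
    mp.spawn(worker, args=(world_size,), nprocs=world_size)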

* When the error below occurs during distributed GPU training (using DDP)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
>>> The GPUs may be running out of memory; check GPU memory usage. It is essentially the same situation as a CUDA out of memory error.
>>> Apart from that, adding torch.cuda.set_device(global_rank) before dist.init_process_group is said to resolve it in some cases.
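A minimal sketch of that second suggestion (rank/world_size handling is assumed to follow the spawn example above): pin each process to its own GPU before creating the process group.

import torch
import torch.distributed as dist

def setup(rank, world_size):
    torch.cuda.set_device(rank)                # bind this process to its own GPU first
    dist.init_process_group("nccl", init_method="tcp://localhost:23456",
                            rank=rank, world_size=world_size)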

* AttributeError: Can't pickle local object
>>> This error occurs when pickle is given a local object. In that case, declare it as global.
https://www.pythonpool.com/cant-pickle-local-object/
ex)
target = 0            # defined at module (global) scope
def func():
    global target     # refer to the module-level object instead of a local one
    target = 1

** Training on multiple GPUs and loading on a single GPU
* When torch.save-ing the state_dict() of a model trained with multi-GPU processing, every key in state_dict.keys() is prefixed with 'module.' (e.g. module.Conv1.conv.0.weight).
So model.load_state_dict(torch.load('model_state_dict.pth')) raises missing key errors.
>>> In that case, use a new_state_dict built by stripping 'module.' from state_dict.keys(), as below.
# original saved file with DataParallel
state_dict = torch.load('myfile.pth.tar')
# create new OrderedDict that does not contain `module.`
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:] # remove `module.`
    new_state_dict[name] = v
# load params
model.load_state_dict(new_state_dict)
https://discuss.pytorch.org/t/solved-keyerror-unexpected-key-module-encoder-embedding-weight-in-state-dict/1686/3
>>> Alternatively, when saving a model trained on multiple GPUs, use torch.save(model.module.state_dict(), save_path) in the first place.
https://jangjy.tistory.com/323


Problem where loss.backward() does not proceed when using torch.nn.DataParallel with NLLLoss
>>> The loss function also needs to be processed in parallel.
https://aigong.tistory.com/186
https://hwijeen.github.io/2018-11-05/Data-parallelism,-multi-GPU/
>>> Use the Horovod framework.
https://blog.si-analytics.ai/22
https://techandlife.tistory.com/28 (Here, to make things behave similarly to using batch size 32 on a single machine, it is generally necessary to scale the learning rate appropriately by multiplying it by the number of distributed machines, because distributed training updates the network with the averaged gradient.)
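A minimal sketch of that linear scaling rule (the model and numbers are placeholders):

import torch
import torch.nn as nn

base_lr = 1e-3
world_size = 4                                   # assumed number of distributed workers
model = nn.Linear(10, 2)                         # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr * world_size)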


When Out of memory occurs at torch.cuda.empty_cache()
>>> https://discuss.pytorch.org/t/out-of-memory-when-i-use-torch-cuda-empty-cache/57898/2
>>> The reason is that torch.cuda.empty_cache() writes about 500 MB of data to gpu0 by default.
When I hit this problem, my gpu0 was fully occupied, so this works:
    with torch.cuda.device('cuda:1'):
        torch.cuda.empty_cache()

Training with DDP on two GPUs and batch size 6 >> I don't fully understand this one yet; it runs fine with batch size 8.
A size error occurred at global_avg_pool(x), the image-pooling branch of DeepLab's ASPP.
At that point x.size() was [1, 2048, 32, 32], so the last batch appears to have contained only one sample.
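A guess at a workaround, not from the original post: dropping the final incomplete batch keeps a single-sample batch from ever reaching BatchNorm or the ASPP pooling branch.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(13, 3, 512, 512))    # placeholder dataset
loader = DataLoader(dataset, batch_size=6, shuffle=True, drop_last=True)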


torch multiprocessing and logging
>>> https://stackoverflow.com/questions/64752343/pytorch-why-logging-fails-in-ddp


Why does my model return nan? While training with focal cross entropy, the lr was 0.02. After one step every model output was nan. Among the fixes below, changing the lr to 2e-5 solved it. (>> Update: the reason nan values later appeared in custom losses such as CE, FCE, ICE, DL, GDL, and IDL was that 0 was being passed into torch.log.)
>>>There are many potential reasons. Most likely exploding gradients. The two things to try first:
    Normalize the inputs
    Lower the learning rate
>>> Hello, there is another possibility: if the output contains some large values (abs(value) > 1e20), then nn.LayerNorm(output) might return an all-nan vector.
>>> https://discuss.pytorch.org/t/why-my-model-returns-nan/24329/5
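A minimal sketch of the torch.log(0) issue from the update: clamp the probabilities (or add a small epsilon) before taking the log inside a custom loss.

import torch

eps = 1e-8
p = torch.tensor([0.0, 0.3, 0.7])
bad = torch.log(p)                    # -inf at p == 0, which later propagates into nan
safe = torch.log(p.clamp(min=eps))    # stays finite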


ValueError: Expected more than 1 value per channel when training
>>> Occurs in BatchNorm when the batch size is 1.
>>> Patrick's answer: Most likely you have a nn.BatchNorm layer somewhere in your model, which expects more than 1 value to calculate the running mean and std of the current batch.
In case you want to validate your data, call model.eval() before feeding the data, as this will change the behavior of the BatchNorm layer to use the running estimates instead of calculating them for the current batch.
If you want to train your model and can’t use a bigger batch size, you could switch e.g. to InstanceNorm.
>>> https://discuss.pytorch.org/t/error-expected-more-than-1-value-per-channel-when-training/26274
>>> Another answer: There is one solution. If you encounter batch_size = 1, then simply add a dummy tensor of zeros, and before computing the loss simply eliminate that tensor so it does not affect the gradients. But this would require a computational overhead. In this case, though, it seems it would still affect the batch-normalization statistics...
>>> https://github.com/Tramac/Fast-SCNN-pytorch/issues/5
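A minimal sketch of the first suggestion (toy layer only): in train mode BatchNorm fails when it sees a single value per channel (e.g. a batch of 1 after global pooling), but in eval mode the running estimates are used instead.

import torch
import torch.nn as nn

layer = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.BatchNorm2d(8))
x = torch.randn(1, 8, 32, 32)          # batch size 1 -> pooled to (1, 8, 1, 1)

# layer.train(); layer(x)              # ValueError: Expected more than 1 value per channel
layer.eval()
with torch.no_grad():
    out = layer(x)                     # running statistics are used, so this works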


Premature end of JPEG file
>>> This is a low level warning, and is not caught by python asserts or cv2 loading errors, so these files will all be used for training.
>>> https://github.com/ultralytics/yolov5/issues/916#issuecomment-791753121

>>> (Solved it this way) $ mogrify -set comment 'Image rewritten with ImageMagick' *.jpg
>>> This command changes a property of the file leaving the image data untouched. However, the image is loaded and resaved, eliminating the extra information that causes the corruption error.
>>> https://stackoverflow.com/questions/9131992/how-can-i-catch-corrupt-jpegs-when-loading-an-image-with-imread-in-opencv
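A rough way to flag such files before training (my own assumption, not from the linked answers): a well-formed JPEG ends with the EOI marker 0xFFD9, so files that don't end with it are likely truncated.

from pathlib import Path

def looks_truncated(path):
    data = Path(path).read_bytes()
    return not data.rstrip(b'\0').endswith(b'\xff\xd9')   # JPEG should end with the EOI marker

bad = [p for p in Path('images').glob('*.jpg') if looks_truncated(p)]   # 'images' is a placeholder dir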