Various errors when training scales=320 #415
I have the same problem with the COCO dataset, and sometimes it happens at test time too, not only while training. |
And what's more, when I reduced the |
I am quite confused about ROI_CANONICAL_SCALE: 90. How can I get this value? |
@moyans You can use |
@daquexian Sorry, I didn't explain it clearly. I know where it is. The original value is 224. I'm just wondering how to calculate this number. |
@moyans I calculated it by 224 * 320 / 800. |
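The calculation above just rescales Detectron's default canonical scale (224, defined for the default 800-pixel training scale) proportionally to the new training scale. A minimal sketch of that arithmetic; the helper name is made up for illustration:

```python
# Rescale the default ROI canonical scale (224 at TRAIN.SCALES = 800)
# proportionally to a new training scale. Constants come from the
# discussion above; canonical_scale() is a hypothetical helper.
def canonical_scale(new_scale, default_scale=800, default_canonical=224):
    return int(round(default_canonical * new_scale / float(default_scale)))

print(canonical_scale(320))  # 224 * 320 / 800 = 89.6, rounded to 90
```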
Did you find a solution for your problem? I got something similar when I reduced the image scale for my own dataset to 360x480. I think it is related to cython_nms, since the model trains and evaluates (with worse results, of course) if I configure TRAIN/TEST.RPN_NMS_THRESH: 0.0.

I also tried to debug this, but I still couldn't figure out the problem. Using the CPU NMS (without Cython) from the old py-faster-rcnn repo, I found that the NMS sometimes divides by zero (when two boxes both have area 0). That should be fixed by setting TRAIN/TEST.RPN_MIN_SIZE > 0, but it seems this is not the only problem.

Could you please try switching off the RPN NMS (set RPN_NMS_THRESH to 0.0) and see if it works then? It could also be that the number of anchors/proposals is too small when we run NMS on the regressed boxes generated from very small feature maps (due to the reduced input image size). |
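The divide-by-zero described above can be reproduced with a plain-Python reimplementation of the pairwise IoU used inside NMS. This is an illustrative sketch, not Detectron's actual code: two zero-area boxes make the union term zero, and a guard avoids the crash.

```python
def iou(a, b):
    # Pairwise IoU with the (x2 - x1 + 1) area convention used by
    # py-faster-rcnn / Detectron; boxes are [x1, y1, x2, y2].
    area_a = (a[2] - a[0] + 1.0) * (a[3] - a[1] + 1.0)
    area_b = (b[2] - b[0] + 1.0) * (b[3] - b[1] + 1.0)
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]) + 1.0)
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]) + 1.0)
    inter = iw * ih
    union = area_a + area_b - inter
    # Guard: two zero-area boxes give union == 0, and an unguarded
    # division would raise ZeroDivisionError (or produce nan in C).
    return inter / union if union > 0 else 0.0

# A degenerate box with x2 = x1 - 1 has zero area under this convention.
degenerate = [10.0, 10.0, 9.0, 9.0]
print(iou(degenerate, degenerate))  # 0.0 rather than a crash
```

Setting RPN_MIN_SIZE > 0 filters such degenerate proposals out before NMS sees them, which is why it masks this particular failure mode.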
@pfollmann Thanks for your information! I may try it tomorrow. Does the bug still exist even when TRAIN/TEST.RPN_MIN_SIZE > 0? |
Yes, unfortunately even with TRAIN/TEST.RPN_MIN_SIZE > 0 I still got errors at random iterations in the above-mentioned style. |
I think that I found the problem: it is in detectron/utils/cython_nms.pyx, in this line:

cdef np.ndarray[np.int_t, ndim=1] order = scores.argsort()[::-1]

The numpy argsort function seems to be buggy at this point (no clue why). I replaced it with the Cython argsort implementation from https://github.com/jcrudy/cython-argsort/blob/master/cyargsort/argsort.pyx. To make it work, the following changes are necessary:
- Place argsort.pyx in detectron/utils.
- Change line 13 in argsort.pyx to ctypedef cnp.float32_t FLOAT_t.
- Register the file in setup.py (similar to cython_nms.pyx and cython_bbox.pyx).
- Include it in detectron/utils/cython_nms.pyx, i.e. change the file as follows:

import utils.argsort as argsort
...
cdef np.ndarray[np.int_t, ndim=1] order = np.empty((ndets), dtype=np.intp)
argsort.argsort(-scores, order)

- Run 'make' in detectron to compile the Cython modules again.

For me the training now runs fine for 20k iterations, and inference had no more seg-faults (including a little speed-up ;-) ). The open question is why cython_nms.pyx worked fine for other configurations of TRAIN/TEST.SCALE. In my experience the problem was not the image scale itself but the size of objects, which become very small when images are rescaled to small sizes. Hope that helps!

PS: My current Detectron version is quite far from master, therefore I'm not sure I'll find time to open a PR soon. |
@pfollmann Wow! So cool! I will try it soon. Thanks! |
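For reference, the replacement relies on argsort of the negated scores producing the same descending order as scores.argsort()[::-1]. A quick NumPy check of that equivalence, in plain Python outside the Cython build:

```python
import numpy as np

scores = np.array([0.3, 0.9, 0.1, 0.9], dtype=np.float32)

# Pattern used in cython_nms.pyx: indices that sort scores descending.
order_a = scores.argsort()[::-1]

# Pattern used with the cython-argsort replacement: argsort of -scores.
order_b = np.argsort(-scores)

# Both yield the scores in descending order; only the tie-breaking
# between equal scores may differ, which NMS tolerates.
assert (scores[order_a] == scores[order_b]).all()
```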
@pfollmann It works! Thanks! I'd like to keep this issue open because the patch has not been merged into master. Looking forward to your PR :) You might want to fetch master and modify the corresponding files; the steps on master are no different from those you pointed out above. |
@pfollmann Thanks, saved my day! |
import detectron.utils.cython_nms as cython_nms |
@shenghsiaowong I think you should use |
...I have changed it, but it does not work. I know this is a small issue, but I have no idea. |
What is the meaning of this? Thank you. |
@shenghsiaowong sorry I haven't met this error. @pfollmann do you have time to send a PR for your excellent solution so that every user can benefit from it seamlessly? :) |
@pfollmann @daquexian Hi, thanks for the solution, but when I tried your advice, another error happened: Error in `python': free(): invalid next size (fast) |
@shenghsiaowong |
@karenyun I am meeting the same problem here; did you figure it out? |
@karenyun @StepOITD I met the same problem and solved it as follows:
to
Hope it helps. |
Expected results
Training runs correctly at any proper size.
Actual results
Training runs correctly for some iterations, then ends at a random time. I have disabled the shuffling of the dataset by modifying _shuffle_roidb_inds in lib/roi_data/loader.py and tried on VOC twice; the program crashed at different iterations in each run. What's more, the error messages differ between runs. Sometimes it is
and sometimes it is
Detailed steps to reproduce
In an existing config, modify TRAIN.SCALES to (320,) and TRAIN.MAX_SIZE to 500. Since I was using an FPN config, I also modified FPN.RPN_ANCHOR_START_SIZE to 16 and ROI_CANONICAL_SCALE to 90. I have tested on COCO and VOC; both fail.
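A sketch of those overrides as a YAML config fragment. The key paths are my assumption based on Detectron's config.py, where the canonical-scale key sits under FAST_RCNN; check them against your config version before use:

```yaml
TRAIN:
  SCALES: (320,)
  MAX_SIZE: 500
FPN:
  RPN_ANCHOR_START_SIZE: 16
FAST_RCNN:
  ROI_CANONICAL_SCALE: 90
```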
System information
- PYTHONPATH environment variable: null
- python --version output: Python 2.7.12