PSDK crashes when trying to load a neural network #145

Open
kkishore9891 opened this issue Feb 27, 2024 · 17 comments

Comments

@kkishore9891

Hello!
In my code I have PSDK subscription callbacks which store images, GPS, velocity, altitude, etc. from the Matrice M30T drone.
I am using a Jetson Xavier based payload computer to run the PSDK. When I start another process in parallel which loads a YOLO object detection network and starts running inference with it, there is a chance that loading the neural network crashes the Payload SDK, and the callbacks stop receiving the latest image, widget and other values. These are the error messages that I receive before the PSDK crashes:

[87.903][linker]-[Error]-[DjiCommand_SendAsyncHandle:906) Command async send error 0
[87.903][infor]-[Warn]-[DjiAircraftInfo_CheckPingStatusAsyncCallback:265) connect status async timeout

I am not sure why this occurs, but it seems to happen when there is a high computational load. Also, the PSDK crash is random. Sometimes the neural network is successfully loaded and starts performing inference, and other times the PSDK crashes. I tried to look into the source code of the Payload SDK but I could not find much. Any help would be appreciated.
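
A minimal sketch of one thing that could be tried if CPU contention is part of the problem, which the report above does not establish. It lowers the priority of the neural-network process and keeps it off one core so that the PSDK process has room to service its serial link. The helper names `partition_cpus` and `deprioritise_current_process` are hypothetical; `os.sched_getaffinity`, `os.sched_setaffinity` and `os.nice` are standard-library calls available on Linux only.

```python
import os


def partition_cpus(available, reserved_count=1):
    """Split a set of CPU ids into (reserved, remaining).

    The lowest-numbered CPUs are reserved for other processes.
    """
    ordered = sorted(available)
    if reserved_count < 0 or reserved_count >= len(ordered):
        raise ValueError("reserved_count must leave at least one CPU")
    return set(ordered[:reserved_count]), set(ordered[reserved_count:])


def deprioritise_current_process(reserved_count=1, niceness=10):
    """Restrict this process to the non-reserved CPUs and lower its priority.

    Assumption: this is called at the start of the neural-network process,
    before the model is loaded. Linux only.
    """
    _, remaining = partition_cpus(os.sched_getaffinity(0), reserved_count)
    os.sched_setaffinity(0, remaining)  # pid 0 means the calling process
    os.nice(niceness)                   # higher niceness means lower priority
    return remaining
```

Whether this helps depends on where the stall actually comes from; it is a cheap experiment rather than a fix.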

@kkishore9891
Author

Here are more error messages that I obtained in another crash, which caused the Jetson to shut down when the model was loaded:
psdk_error.txt

@dji-dev
Contributor

dji-dev commented Feb 28, 2024

Agent comment from Leon in Zendesk ticket #101560:

Dear Developer,

Hello, thank you for contacting DJI.

Thank you for your patience. I have read your description and suspect that the abnormality is caused by excessive load on your serial link. You can try raising the baud rate to the highest supported value. If you subscribe to topics, please try to reduce the number of topics subscribed, or reduce their subscription frequency.
Your error is a communication timeout, and there are UART-related errors in the provided file. These errors are caused by serial communication being blocked.

Thank you for your support of DJI products.

Best Regards,
DJI Dajiang innovation SDK technical support

°°°
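
The advice above (raise the baud rate, subscribe to fewer topics, lower their frequency) can be sanity-checked with a rough budget calculation. The sketch below assumes 8N1 framing, so 10 bits on the wire per payload byte, and ignores protocol overhead such as packet headers and acknowledgements. The per-topic payload sizes in the usage note are hypothetical placeholders, not PSDK values.

```python
def uart_bytes_per_second(baud_rate, bits_per_byte=10):
    """Raw byte throughput of a UART; 10 bits per byte assumes 8N1 framing."""
    return baud_rate / bits_per_byte


def link_utilisation(baud_rate, topics):
    """Fraction of the link used by subscriptions.

    topics is an iterable of (payload_bytes, frequency_hz) pairs.
    """
    demand = sum(size * hz for size, hz in topics)
    return demand / uart_bytes_per_second(baud_rate)
```

For example, `link_utilisation(1000000, [(100, 50), (50, 10)])` describes two made-up topics on a 1000000 baud link. A result close to or above 1.0 would suggest the subscriptions alone saturate the link; a small result would point away from raw bandwidth as the cause.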

@kkishore9891
Author

Hello! Thank you for your reply. I did increase the baud rate to 1000000, and I think there are some positive effects. But I would like to know if I can reduce the frequency of the image subscription too. The liveview sample works slightly differently from the DJI subscription topics. If I could subscribe to the images at a slower rate, that could help reduce the load as well. We are planning to subscribe to both RGB and thermal images for our multimodal neural network, which would fuse the two images for inference. I would like to know if that is possible as well.

@dji-dev
Contributor

dji-dev commented Mar 1, 2024

Agent comment from Leon in Zendesk ticket #101560:

Dear Developer,

Hello, thank you for contacting DJI.

Sorry, regarding reducing the frequency of image subscription: could you tell us the specific interface function you are referring to, so that there is no misunderstanding in our communication? Thank you very much.

Thank you for your support of DJI products.

Best Regards,
DJI Dajiang innovation SDK technical support

°°°

@kkishore9891
Author

image
In this particular function, where we associate a callback with a particular video stream, the callback receives the images at a certain frequency. I would like to know if we can reduce the rate at which it receives the images.
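
Independently of what the SDK allows, frames can be dropped on the consumer side before they reach the neural network. The sketch below is a simple rate limiter; `FrameThrottle` is a hypothetical helper, not part of the PSDK. It only limits how many frames are passed on for processing and does not change the rate at which the callback itself is invoked.

```python
import time


class FrameThrottle:
    """Accept at most max_hz frames per second and drop the rest."""

    def __init__(self, max_hz, clock=time.monotonic):
        if max_hz <= 0:
            raise ValueError("max_hz must be positive")
        self._min_interval = 1.0 / max_hz
        self._clock = clock          # injectable for testing
        self._last_accepted = None   # time of the last accepted frame

    def accept(self):
        """Return True if the current frame should be processed."""
        now = self._clock()
        if (self._last_accepted is None
                or now - self._last_accepted >= self._min_interval):
            self._last_accepted = now
            return True
        return False
```

In the image handler, a call such as `if throttle.accept(): process(frame)` (with `process` standing in for the application's own function) would forward only the accepted frames.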

@kkishore9891
Author

Hello! There are more updates regarding this issue. After running the code several times, we found that increasing the baud rate still doesn't solve the issue. But there are some interesting insights:

  1. This issue occurs even when running DJI PSDK's sample programs like the liveview test demo, or even low-bandwidth programs like the widget test demo.

  2. Every time the Nvidia computer is freshly booted, running the PSDK code and then loading the neural network onto the GPU works without any issue. But keeping the PSDK running and restarting just the neural network code, without making any changes to the code, seems to cause the same UART failure errors mentioned above.

Given that the PSDK's .c files have been compiled into binary libraries, it is a bit hard to debug the underlying issues.

@dji-dev
Contributor

dji-dev commented Mar 4, 2024

Agent comment from Leon in Zendesk ticket #101560:

Dear Developer,

Hello, thank you for contacting DJI.

Regarding your previous question about the frequency at which pictures are received: this cannot be changed.
We suspect that the problem you mentioned is caused by a third-party library. Is the neural network program you use open source? If yes, where can we find it?

Thank you for your support of DJI products.

Best Regards,
DJI Dajiang innovation SDK technical support

°°°

@kkishore9891
Author

kkishore9891 commented Mar 4, 2024

Hello,
I am sorry but we can't show you our code. However, I can tell you that we are using the YOLOv7-tiny model, which is loaded with the PyTorch framework using CUDA. There are some ROS 2 subscribers and publishers inside the code.

@dji-dev
Contributor

dji-dev commented Mar 6, 2024

Agent comment from Leon in Zendesk ticket #101560:

Dear Developer,

Hello, thank you for contacting DJI.

Thank you for your patience. I was on vacation yesterday, so my response was delayed.
We currently suspect compatibility issues with third-party libraries. Could you please send us the library files you use? Your source code is not required. Without the library files, it will be difficult for us to investigate further.

Thank you for your support of DJI products.

Best Regards,
DJI Dajiang innovation SDK technical support

°°°

@kkishore9891
Author

Hello, here are some updates.

  1. For the neural network we are not using C++; we are using Python's PyTorch library. So we only have .py files, meaning we can only send source code. But here is the part of the code that causes the issue. This PyTorch code loads the neural network onto the GPU:
def load_model(self):
        logging.debug("Loading yolov7 model")
        
        self.weights, self.view_img, self.save_txt, self.imgsz, self.trace = self.config.weights_path, self.config.view_img, self.config.save_txt, self.config.img_size, not self.config.no_trace
        self.save_img = True
        # Directories
        # save_dir = Path(increment_path(Path(self.config.project) / self.config.name, exist_ok=self.config.exist_ok))  # increment run
        # (save_dir / 'labels' if save_txt else save_dir).mkdir(parents=True, exist_ok=True)  # make dir

        
        
        # Initialize
        set_logging()
        self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 
        logging.warning(f"The device selected is: {self.device}") #select_device(self.config.device)
        self.half = self.device.type != 'cpu'  # half precision only supported on CUDA

        # Load model
        logging.info(f"Attempting loading the model")
        # NB: just temporary for debugging
        import sys      
        sys.path.insert(0, "/home/nvidia/yolov7/yolov7/")

        self.model = attempt_load(self.weights, map_location=self.device)  # load FP32 model
        logging.info("Model loaded")
        sys.path.remove("/home/nvidia/yolov7/yolov7/")

        
        self.stride = int(self.model.stride.max())  # model stride
        self.imgsz = check_img_size(self.imgsz, s=self.stride)  # check img_size

        logging.info(f"Model stride: {self.stride}")
        if self.trace:
            self.model = TracedModel(self.model, self.device, self.config.img_size)
        logging.info(f"Model traced: {self.trace}")

        
        logging.info(f"Setting half precision: {self.half}")
        if self.half:
            self.model.half()  # to FP16
        logging.info(f"Model half precision: {self.half}")
        
        # Second-stage classifier
        self.classify = False
        logging.info(f"Setting classify: {self.classify}")
        if self.classify:
            logging.info(f"Loading classifier")
            modelc = load_classifier(name='resnet101', n=2)  # initialize
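            # load_state_dict() does not return the module, so move it and switch to eval mode separately
            modelc.load_state_dict(torch.load('weights/resnet101.pt', map_location=self.device)['model'])
            modelc.to(self.device).eval()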
        
        # Set Dataloader
        logging.debug("Setting dataloader")
        vid_path, vid_writer = None, None
        logging.debug(f"Setting view_img: {self.view_img}")
        if self.view_img:
            self.view_img = check_imshow()
        cudnn.benchmark = True  # set True to speed up constant image size inference
        logging.debug("Setting dataset")

        # Get names and colors
        self.class_names = self.model.module.names if hasattr(self.model, 'module') else self.model.names
        self.colors = [[random.randint(0, 255) for _ in range(3)] for _ in self.class_names]
        
        # Run inference
        logging.debug("Running inference")
        if self.device.type != 'cpu':
            self.model(torch.zeros(1, 3, self.imgsz, self.imgsz).to(self.device).type_as(next(self.model.parameters())))  # run once
        
        old_img_w = old_img_h = self.imgsz
        old_img_b = 1
        
        # Warmup
        # if self.device.type != 'cpu' and (old_img_b != img.shape[0] or old_img_h != img.shape[2] or old_img_w != img.shape[3]):
        #     old_img_b = img.shape[0]
        #     old_img_h = img.shape[2]
        #     old_img_w = img.shape[3]
        #     for i in range(3):
        #         self.model(img, augment=self.config.augment)[0]
        
        
        # Tracker
        logging.debug("Loading tracker if selected")
        if self.tracker_type is not None:
            logging.debug("Loading Tracker")
            self.tracker_module = load_tracker(
                config_path = self.config_path,
                tracker_type= self.tracker_type,
                tracker_config_path= self.tracker_config_path
            )
  2. While watching the serial monitor, we noticed that serial communication is blocked for a while when the neural net is being loaded into memory. So loading the NN after the PSDK is launched crashes the PSDK, we believe because some timeout in the PSDK is exceeded. But this still doesn't explain why everything works when the PyTorch model code is run for the first time after boot, yet the PSDK crashes when that code is restarted.

  3. The communication does get unblocked eventually, once the NN has loaded. So running the PSDK layer after the neural network is loaded seems to alleviate this issue. Even then the PSDK might crash the first time, but running the PSDK again seems to work. This behaviour is a bit hard to explain.
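
To narrow down which part of the model loading coincides with the blocked serial link, each stage can be timed and compared against the serial monitor timestamps. This is a minimal sketch; `timed_stage` is a hypothetical helper and the stage names in the usage note are only examples.

```python
import logging
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(name, durations, clock=time.monotonic):
    """Record how long the wrapped block takes, in seconds, under `name`."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        durations[name] = elapsed
        logging.info("stage %s took %.3f s", name, elapsed)
```

Inside `load_model`, the calls of interest would be wrapped one at a time, for example `with timed_stage("attempt_load", durations): self.model = attempt_load(...)`, and likewise for the tracing step, the half-precision conversion and the first warm-up inference.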

@kkishore9891
Author

I would like to add one thing:

When we start the NN program first and then run the PSDK code, we notice that the PSDK crashes whenever the following warning message appears in its output:

[1.147][linker]-[Warn]-[DjiCommand_SendAsyncHandle:894) Command async send retry: index = 0, retryTimes = 3, cmdSet = 0, cmdId = 135

I would like to know if there is any PSDK function we can call that would indicate whether DjiCommand_SendAsyncHandle is sending a retry message, or something similar, so that we could automatically terminate the program with a return code of -1 and rerun it to work around the issue.
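
Until such a function is confirmed, one workaround is to supervise the PSDK process from outside and restart it when the warning shows up in its output. The sketch below assumes the PSDK application prints its log to stdout or stderr; `run_with_restart` and the command in the usage note are hypothetical, and matching on the log text is an assumption, not an official interface.

```python
import re
import subprocess
import sys

# Text taken from the warning quoted above.
RETRY_WARNING = re.compile(r"Command async send retry")


def is_retry_warning(line):
    """True if a log line contains the async send retry warning."""
    return RETRY_WARNING.search(line) is not None


def run_with_restart(cmd, max_attempts=3):
    """Run cmd, restarting it whenever the retry warning is printed.

    Returns (attempts_used, exit_code); exit_code is None if every
    attempt was stopped because of the warning. The child's output is
    consumed here; echo it if the log is still needed.
    """
    for attempt in range(1, max_attempts + 1):
        saw_warning = False
        with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True) as proc:
            for line in proc.stdout:
                if is_retry_warning(line):
                    saw_warning = True
                    proc.terminate()  # sends SIGTERM; the child must honour it
                    break
        if not saw_warning:
            return attempt, proc.returncode
    return max_attempts, None
```

A call such as `run_with_restart(["./psdk_app"])`, with `./psdk_app` standing in for the real binary, would give the application up to three chances to start cleanly.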

@dji-dev
Contributor

dji-dev commented Mar 7, 2024

Agent comment from Leon in Zendesk ticket #101560:

Dear Developer,

Hello, thank you for contacting DJI.

Could you please attach the complete PSDK log? This will allow us to better confirm your problem, and we will confirm for you whether there is any way to know if the serial port communication is abnormal.

Thank you for your support of DJI products.

Best Regards,
DJI Dajiang innovation SDK technical support

°°°

@kkishore9891
Author

Hello! I am attaching the following three files and I hope that would help you debug the issue.

  1. Terminal output of Sample program that streams camera output: PSDK terminal output.txt

  2. Terminal output of Neural network code for which I uploaded the code snippet in the previous reply: Yolov7 terminal output.txt

  3. DJI sample code log file: DJI_0035_20240307_12-39-58.log

@dji-dev
Contributor

dji-dev commented Mar 8, 2024

Agent comment from Leon in Zendesk ticket #101560:

Dear Developer,

Hello, thank you for contacting DJI.

Thank you for the information you provided. We will take a look at your log first. If there is any progress, we will update you again.

Thank you for your support of DJI products.

Best Regards,
DJI Dajiang innovation SDK technical support

°°°

@kkishore9891
Author

Hello!
Are there any updates regarding this? A fix would greatly help improve the reliability of our product.

@dji-dev
Contributor

dji-dev commented Mar 18, 2024

Agent comment from Leon in Zendesk ticket #101560:

Dear Developer,

Hello, thank you for contacting DJI.

Regarding your question, our internal team is still analyzing it and has not yet reached a final conclusion.

In the meantime, you can try the following and see if it helps:

  1. Change the serial port connection link. If you are using the onboard serial port, please change it to USB-TTL and try it.
  2. If you are using USB-TTL, you can try to use FT232 or CH340 chips.
  3. If you are using DuPont wire, please try to replace it with a DuPont wire of the same length.

Thank you for your support of DJI products.

Best Regards,
DJI Dajiang innovation SDK technical support

°°°

@kkishore9891
Author

Hello! We are using a single-board computer to execute the PSDK scripts. This one to be precise. This is an Nvidia Jetson Xavier board and I believe it uses the SDK round ribbon cable to communicate with the Matrice M30T's PSDK port. So I am not sure if the points you mentioned above are relevant to our situation.

One thing I believe would be useful for us: if there is a timeout in the PSDK's serial communication, some way to re-establish communication would be great. Every time the PSDK crashes, I have to restart the PSDK, and this would be hard to do mid-air. And even when I restart, it takes a while for the PSDK to run reliably without showing the "Command async send retry: index = 0, retryTimes = 3, cmdSet = 0" warning message. This is affecting our application a lot. We are also optimising on our end to ensure the serial communication is never blocked. But the code base is growing as the project grows, and it would be beneficial to have some redundancy on the PSDK's side of things.

Some kind of function that I can call in the PSDK to restart the connection with the drone, or some self-correcting mechanism, would be very useful.
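
In the absence of such a mechanism, the application can at least detect that the callbacks have gone quiet. The sketch below is a staleness watchdog; `CallbackWatchdog` is a hypothetical helper, and what to do once the link is judged stale (for example restarting the PSDK process) is left to the application, since the thread does not confirm any reconnect API.

```python
import time


class CallbackWatchdog:
    """Track when data last arrived and report when it has gone stale."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock              # injectable for testing
        self._last_seen = self._clock()  # start counting from creation

    def feed(self):
        """Call this from every subscription or image callback."""
        self._last_seen = self._clock()

    def is_stale(self):
        """True if no callback has fired within the timeout."""
        return self._clock() - self._last_seen > self._timeout_s
```

A periodic check of `watchdog.is_stale()` from the main loop would then trigger whatever recovery step the application chooses.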
