Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yarpmanager gets (almost) stuck in case the connection to yarpserver is delayed #3104

Open
S-Dafarra opened this issue Apr 16, 2024 · 7 comments

Comments

@S-Dafarra
Copy link
Contributor

S-Dafarra commented Apr 16, 2024

Describe the bug

We have been largely exploiting yarpmanager to run demos involving several applications running on different PCs using different OS. So far, everything is great. Some difficulty arises when yarpserver is placed in a very delayed network. In this case, clicking the wrong button might render the system unusable. To give an example, if we press the refresh all button, it is not possible to do anything else in the application until all the applications, connections, and resources have been checked. The problem is that in case of a delayed network, this operation might require minutes. The problem gets worst if you try to close the application that got stuck as the entire yarpmanager gets completely stuck

To Reproduce
One possible way to reproduce it is to set yarp conf to an IP not reachable, and then open yarpmanager without running yarpserver. Then, open the application with the largest number of applications and press on the refresh all button. Then, try to close the blocked application

Expected behavior
When attempting to run another action while another one is running, yarpmanager should simply stop the previous one. At the same time, if the user wants to close the open application, it should close without blocking the entire yarpmanager.

Screenshots
In the following I just tried to open an application and then I try to close it immediately after. In this specific case, the issue is that yarpserver is not running, but the same happens when yarpserver is in a delayed network

YARP.module.manager.2024-04-16.18-26-33.mp4

Configuration (please complete the following information):

  • OS: Ubuntu/Windows
  • yarp version: 3.9.0
  • compiler:

Additional context
Add any other context about the problem here.

cc @randaz81 @traversaro @Nicogene @GiulioRomualdi

@S-Dafarra
Copy link
Contributor Author

I dug a bit into the code. When pressing the "refresh all" button the following code is executed

void MainWindow::onRefresh()
{
QWidget *w = ui->mainTabs->currentWidget();
if(!w){
return;
}
yarp::manager::NodeType type = ((GenericViewWidget*)w)->getType();
if(type == yarp::manager::APPLICATION){
auto* ww = (ApplicationViewWidget*)w;
ww->refresh();
}
if(type == yarp::manager::RESOURCE){
auto* ww = (ResourceViewWidget*)w;
ww->refresh();
}
}

which in turn calls

void ApplicationViewWidget::refresh()
{
selectAllConnections(true);
selectAllModule(true);
selectAllResources(true);
onRefresh();
}

and then

bool ApplicationViewWidget::onRefresh()
{
if (safeManager.busy()) {
return false;
}
std::vector<int> modulesIDs;
std::vector<int> connectionsIDs;
std::vector<int> resourcesIDs;
for(int i=0;i<ui->moduleList->topLevelItemCount();i++) {
QTreeWidgetItem *it = ui->moduleList->topLevelItem(i);
if (it->data(0,Qt::UserRole) == APPLICATION) {
refreshNestedApplication(it,&modulesIDs);
} else {
if (it->isSelected()) {
modulesIDs.push_back(it->text(1).toInt());
safeManager.updateExecutable(it->text(1).toInt(),
it->text(4).toLatin1().data(),
it->text(3).toLatin1().data(),
it->text(5).toLatin1().data(),
it->text(6).toLatin1().data(),
it->text(7).toLatin1().data());
it->setText(2,"waiting");
it->setIcon(0,QIcon(":/refresh22.svg"));
it->setForeground(2,QColor("#000000"));
}
}
}
for(int i=0;i<ui->connectionList->topLevelItemCount();i++) {
updateConnection(i,connectionsIDs);
}
for(int i=0;i<ui->resourcesList->topLevelItemCount();i++) {
QTreeWidgetItem *it = ui->resourcesList->topLevelItem(i);
if (it->isSelected()) {
resourcesIDs.push_back(it->text(1).toInt());
it->setText(3,"waiting");
it->setIcon(0,QIcon(":/refresh22.svg"));
it->setForeground(3,QColor("#000000"));
}
}
safeManager.safeRefresh(modulesIDs,
connectionsIDs,
resourcesIDs);
yarp::os::SystemClock::delaySystem(0.1);
selectAllConnections(false);
selectAllModule(false);
selectAllResources(false);
return true;

The juicy bit is

safeManager.safeRefresh(modulesIDs,
connectionsIDs,
resourcesIDs);

that spawns a new thread

void SafeManager::safeRefresh(std::vector<int>& MIDs,
std::vector<int>& CIDs,
std::vector<int>& RIDs)
{
if (busy()) {
return;
}
WAIT_SEMAPHOR();
modIds = MIDs;
conIds = CIDs;
resIds = RIDs;
action = MREFRESH;
POST_SEMAPHOR();
if (!yarp::os::Thread::isRunning()) {
yarp::os::Thread::start();
}
}

After this, safeManger appears busy until the action is done. This means that in all subsequent actions, the lines

if (safeManager.busy()) {
return false;
}

will always return false. This appears the reason why it is not possible to do anything else on the application until the previous action is done.

Instead, when closing the tab, the following code is executed:

bool MainWindow::onTabClose(int index)
{
auto* w = (GenericViewWidget*)ui->mainTabs->widget(index);
if(!w){
return false;
}
if(w->getType() == yarp::manager::APPLICATION){
ApplicationViewWidget *aw = ((ApplicationViewWidget*)w);
if(aw && aw->isRunning()){
QMessageBox msgBox;
msgBox.setIcon(QMessageBox::Icon::Warning);
msgBox.setWindowTitle(QString("Closing %1").arg(ui->mainTabs->tabText(index)));
msgBox.setText(tr("You have some running module. After closing the application window you might not be able to recover them. Are you sure?"));
QPushButton* noButton = msgBox.addButton(tr("No"), QMessageBox::NoRole);
QPushButton* pstopAndClose = msgBox.addButton(tr("Yes and Stop"), QMessageBox::YesRole);
msgBox.addButton(tr("Yes"), QMessageBox::YesRole);
msgBox.setDefaultButton(noButton);
msgBox.exec();
if (msgBox.clickedButton() == noButton)
{
return false;
}
else if(msgBox.clickedButton() == pstopAndClose)
{
onStop();
}
}
QString fileOfCurrentApp = aw->getFileName();
if(aw->isModified() && aw->isEditingMode()){
QMessageBox::StandardButton btn = QMessageBox::question(this,"Save",QString("%1 has been modified\nDo you want to save it before closing?").arg(aw->getAppName().toLatin1().data()),
QMessageBox::Yes | QMessageBox::No | QMessageBox::Cancel);
if(btn == QMessageBox::Yes){
bool ret = aw->save();
if(ret){
onReopenApplication(aw->getAppName(),fileOfCurrentApp);
ui->mainTabs->removeTab(index);
delete w;
return true;
}else{
QMessageBox::critical(this,"Error",QString("Error Saving the file"));
return false;
}
}
if(btn == QMessageBox::Cancel){
return false;
}
if (btn == QMessageBox::No){
QFile file(fileOfCurrentApp);
if(!file.exists()){
ui->entitiesTree->setCurrentItem(ui->entitiesTree->getWidgetItemByFilename(fileOfCurrentApp));
ui->entitiesTree->onRemove();
aw->closeManager();
ui->mainTabs->removeTab(index);
delete w;
return true;
}
}
}
aw->closeManager();
}
ui->mainTabs->removeTab(index);
delete w;
return true;
}

which calls

void ApplicationViewWidget::closeManager() {
safeManager.close();
}

and then it calls

void SafeManager::close() {
yarp::os::Thread::stop();
WAIT_SEMAPHOR();
eventReceiver = nullptr;
POST_SEMAPHOR();
}

This gets stuck in

yarp::os::Thread::stop();

since it is blocking until the separate thread finishes the run method.
Since this is happening in the GUI thread, the whole application gets stuck and the OS tags it as "Not Responding"

A possible idea

In the end, the main issue is that the run method is blocking until everything has been done

void SafeManager::run()
{
WAIT_SEMAPHOR();
ThreadAction local_action = action;
std::vector<int> local_modIds = modIds;
std::vector<int> local_conIds = conIds;
std::vector<int> local_resIds = resIds;
POST_SEMAPHOR();
switch(local_action){
case MRUN:{
std::vector<double> waitVec;
for (int local_modId : local_modIds)
{
Executable * exec = Manager::getExecutableById(local_modId);
if (exec)
{
waitVec.push_back(exec->getPostExecWait());
}
}
double minWait=*std::min_element(waitVec.begin(), waitVec.end());
for (int local_modId : local_modIds)
{
Executable * exec = Manager::getExecutableById(local_modId);
if (exec)
{
exec->setPostExecWait(exec->getPostExecWait() - minWait);
}
Manager::run(local_modId, true);
}
/*
for(unsigned int i=0; i<local_modIds.size(); i++)
Manager::waitingModuleRun(local_modIds[i]);
for(unsigned int i=0; i<local_conIds.size(); i++)
{
if(Manager::connected(local_conIds[i]))
{
if(eventReceiver) eventReceiver->onConConnect(local_conIds[i]);
}
else
{
if(eventReceiver) eventReceiver->onConDisconnect(local_conIds[i]);
}
refreshPortStatus(local_conIds[i]);
}
for(unsigned int i=0; i<local_resIds.size(); i++)
{
if(Manager::exist(local_resIds[i]))
{
if(eventReceiver) eventReceiver->onResAvailable(local_resIds[i]);
}
else
{
if(eventReceiver) eventReceiver->onResUnAvailable(local_resIds[i]);
}
}*/
break;
}
case MSTOP:{
std::vector<double> waitVec;
for (int local_modId : local_modIds)
{
Executable * exec = Manager::getExecutableById(local_modId);
if (exec)
{
waitVec.push_back(exec->getPostStopWait());
}
}
double minWait=*std::min_element(waitVec.begin(), waitVec.end());
for (int local_modId : local_modIds)
{
Executable * exec = Manager::getExecutableById(local_modId);
if (exec)
{
exec->setPostStopWait(exec->getPostStopWait() - minWait);
}
Manager::stop(local_modId, true);
}
/*for(unsigned int i=0; i<local_modIds.size(); i++)
Manager::waitingModuleStop(local_modIds[i]);
for(unsigned int i=0; i<local_conIds.size(); i++)
{
if(Manager::connected(local_conIds[i]))
{
if(eventReceiver) eventReceiver->onConConnect(local_conIds[i]);
}
else
{
if(eventReceiver) eventReceiver->onConDisconnect(local_conIds[i]);
}
refreshPortStatus(local_conIds[i]);
}
for(unsigned int i=0; i<local_resIds.size(); i++)
{
if(Manager::exist(local_resIds[i]))
{
if(eventReceiver) eventReceiver->onResAvailable(local_resIds[i]);
}
else
{
if(eventReceiver) eventReceiver->onResUnAvailable(local_resIds[i]);
}
}*/
break;
}
case MKILL:{
for (int local_modId : local_modIds) {
Manager::kill(local_modId, true);
}
/*for(unsigned int i=0; i<local_modIds.size(); i++)
Manager::waitingModuleKill(local_modIds[i]);
for(unsigned int i=0; i<local_conIds.size(); i++)
{
if(Manager::connected(local_conIds[i]))
{
if(eventReceiver) eventReceiver->onConConnect(local_conIds[i]);
}
else
{
if(eventReceiver) eventReceiver->onConDisconnect(local_conIds[i]);
}
refreshPortStatus(local_conIds[i]);
}
for(unsigned int i=0; i<local_resIds.size(); i++)
{
if(Manager::exist(local_resIds[i]))
{
if(eventReceiver) eventReceiver->onResAvailable(local_resIds[i]);
}
else
{
if(eventReceiver) eventReceiver->onResUnAvailable(local_resIds[i]);
}
}*/
break;
}
case MCONNECT:{
for(int local_conId : local_conIds)
{
refreshPortStatus(local_conId);
if(Manager::connect(local_conId))
{
if (eventReceiver) {
eventReceiver->onConConnect(local_conId);
}
}
else
{
if (eventReceiver) {
eventReceiver->onConDisconnect(local_conId);
}
}
}
break;
}
case MDISCONNECT:{
for(int local_conId : local_conIds)
{
refreshPortStatus(local_conId);
if(Manager::disconnect(local_conId))
{
if (eventReceiver) {
eventReceiver->onConDisconnect(local_conId);
}
}
else
{
if (eventReceiver) {
eventReceiver->onConConnect(local_conId);
}
}
}
break;
}
case MREFRESH:{
busyAction = true;
for(int local_modId : local_modIds)
{
if(Manager::running(local_modId))
{
if (eventReceiver) {
eventReceiver->onModStart(local_modId);
}
}
else //if(Manager::suspended(local_modIds[i]))
{
if (eventReceiver) {
eventReceiver->onModStop(local_modId);
}
}
}
for(int local_conId : local_conIds)
{
refreshPortStatus(local_conId);
if(Manager::connected(local_conId))
{
if (eventReceiver) {
eventReceiver->onConConnect(local_conId);
}
}
else
{
if (eventReceiver) {
eventReceiver->onConDisconnect(local_conId);
}
}
}
for(int local_resId : local_resIds)
{
if(Manager::exist(local_resId))
{
if (eventReceiver) {
eventReceiver->onResAvailable(local_resId);
}
}
else
{
if (eventReceiver) {
eventReceiver->onResUnAvailable(local_resId);
}
}
}
busyAction = false;
break;
}
case MREFRESH_CNN:{
for(int local_conId : local_conIds)
{
refreshPortStatus(local_conId);
if(Manager::connected(local_conId))
{
if (eventReceiver) {
eventReceiver->onConConnect(local_conId);
}
}
else
{
if (eventReceiver) {
eventReceiver->onConDisconnect(local_conId);
}
}
}
break;
}
case MATTACHSTDOUT:{
for (int local_modId : local_modIds) {
Manager::attachStdout(local_modId);
}
break;
}
case MDETACHSTDOUT:{
for (int local_modId : local_modIds) {
Manager::detachStdout(local_modId);
}
break;
}
case MLOADBALANCE:{
busyAction = true;
Manager::loadBalance();
if (eventReceiver) {
eventReceiver->onLoadBalance();
}
busyAction = false;
break;
}
default:
break;
};
if (eventReceiver) {
eventReceiver->onError();
}
}

One possibility would be to use a flag (an atomic boolean for example) to interrupt prematurely the execution of the different for loops to make sure that no action causes the entire yarpmanager to get stuck.

@randaz81
Copy link
Member

This is a well known behaviour but it is exremily difficult to improve. Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time. So, if the network ping time is 100ms, each connection to verify might require 4x100ms. If you have 10 connections, these are 4 seconds of delay. No escape. The obvious solution is to keep your ping time to yarp server in the 10ms range or run the yarpserver elsewhere. Ps: fixing the network delay is your preferred option because a responsive yarp server with delayed communication is very bad too (and not equally detactable).

@S-Dafarra
Copy link
Contributor Author

Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.

Is it possible to prematurely interrupt a ping without waiting for the timeout time?

@randaz81
Copy link
Member

Regarding the possibility of interrupting an ongoing operation, instead, I think that this is certainly doable (between two subsequent checks).

@S-Dafarra
Copy link
Contributor Author

Regarding the possibility of interrupting an ongoing operation, instead, I think that this is certainly doable (between two subsequent checks).

Yeah exactly, that's what I had in mind. In principle, also in case of connection checks, the maximum waiting time should be 2 seconds, so it might still be a reasonable time to wait.

@randaz81
Copy link
Member

Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.

Is it possible to prematurely interrupt a ping without waiting for the timeout time?

No, this is not possible.

@S-Dafarra
Copy link
Contributor Author

Checking if a connection is active or not, involves the establishment of multiple (4?) back and forth communications with the server ( the same of 'yarp ping' command') each of them requiring a ping time.

Is it possible to prematurely interrupt a ping without waiting for the timeout time?

No, this is not possible.

I had a further look on this. The timeout value seems to be used here

bool NetworkBase::exists(const std::string& port, const ContactStyle& style, bool checkVer)
{
bool silent = style.quiet;
Contact address = NetworkBase::queryName(port);
if (!address.isValid()) {
if (!silent) {
yCInfo(NETWORK, "Address of port %s is not valid", port.c_str());
}
return false;
}
Contact address2(address);
if (style.timeout >= 0) {
address2.setTimeout((float)style.timeout);

which sets an internal value that is get in
openTimeout.set(address.getTimeout());
timeout = &openTimeout;
}
int result = connector.connect(stream, addr, timeout, ACE_Addr::sap_any, 1);

So in the end, the timeout value is given directly to ACE. By reading the documentation of connect, I noticed that it returns a different error when it cannot perform a connection given the timeout (namely EWOULDBLOCK). At the same time, it is possible to call the complete method, that is exactly meant to continue a connection attempt that timed out.

This means that in principle it may be possible to use a fraction of the TIMEOUT value when calling connect and then call complete several times until the user-specified timeout value is reached. In principle, this could give us control to stop this process at an earlier stage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants