
Best method for handling reconnections to large numbers of PVs #79

Open
kathryn-baker opened this issue Nov 22, 2022 · 7 comments
@kathryn-baker

Hello,

In our group we are using pvapy for a number of applications but primarily to host a number of PvaServers. We then have a separate application that connects to the PVs on these servers using the Channel client to monitor them for changes.

So far, the functionality of both applications has worked well, but we are now facing a number of issues with reconnection protocols. I have tried two different methods to account for PV reconnections so far and both have created their own problems. I'm not sure if these are a result of how I'm doing the reconnection using the library or something in our system...

The methods I've tested so far are:

  1. I create the Channel object for the client; if the PV is down and the connection times out, we subscribe the onChanges function anyway and simply wait for the channel to connect and start receiving messages at some point in the future.

This method works well if the PvaServer is stopped cleanly, but encounters problems if the PvaServer crashes unexpectedly. In this case we stop receiving updates for the PVs through the assigned onChanges function, even though we can see the PVs updating with a pvmonitor on the command line.

  2. I create the Channel object for the client and assign a connection callback using the setConnectionCallback() method. In the connection callback we subscribe the onChanges function and start monitoring the PV; when the channel disconnects, we stop monitoring it.

This second option feels to me to be the cleaner of the two. However, we noticed alternating crashes between the PvaServer and the client application. We haven't yet worked out which of the two programs is responsible for the crashes. We also encountered some segmentation faults using this method.

In your opinion, what is the best way to handle these situations where potentially lots of PVs become disconnected at once? Do you have any advice for how to handle reconnections on the client side, and is there anything you might be able to suggest for the server side to improve the closure of PV connections?

Thanks for your help!

@sveseli
Collaborator

sveseli commented Nov 22, 2022

Hi,

Any segmentation fault either on a client or a server side is most certainly a bug. What version/package/OS of pvapy are you using? How many PVs are we actually talking about?

I am also not quite sure I understood the use case. The PVs that you host on PvaServers, have they been created on those servers, or are they actually originating on a separate set of IOCs? If you had a small snippet of the code that illustrates the application, that would really help.

@kathryn-baker
Author

Thanks for the quick response!

I realise it might be easier to explain the use case in a diagram so I've included that here:

[diagram: three PvaServer applications, each reading variables from a PLC over CIP, with a separate client application monitoring the PVs on all three servers]

We currently have three instances of an application that creates its own PvaServer, reads the variables that are available on a specific PLC, and adds them as records to the server. Then, as the variables in the PLC update and the changes are communicated through the CIP protocol, the server application reads the values and updates the PV records on the server asynchronously. So the PVs are created by the PvaServer, but the values originate from a PLC. I'm afraid I don't have a simplified version of this code to include at the moment, but if you need one I can get someone else to summarise it. These applications use pvapy=5.0.0 in a python:3.10.8-slim (debian) container in a Docker swarm.

The second application is one that connects to and monitors all of the PVs across all three servers for changes, and propagates the change to a different service in our docker swarm. In total there are about 500 PVs across the three servers. We do this using a python object that inherits from the Channel class in pvapy (code below). As we use the onChanges function and monitoring here, we do not use async. This application also uses pvapy=5.0.0 but uses a python:3.9-slim (debian) docker container. The seg fault was only seen in this application as far as I'm aware.

class PV(Channel):
    def __init__(self, name, other_app) -> None:
        super().__init__(name)
        self.subscribe('onChanges', self.onChanges)
        self.other_app = other_app

    def onChanges(self, value: PvObject):
        value_dict = value.toDict()
        self.other_app.notify(value_dict)

# elsewhere in code
pv = PV('test:name', other_app)
pv.startMonitor('field(value,timeStamp)')  # etc.

Thanks again for your help.

@sveseli
Collaborator

sveseli commented Nov 22, 2022

Thanks for the image.

This method works well if the PvaServer is stopped cleanly, but encounters problems if the PvaServer crashes unexpectedly. In this case we stop receiving updates for the PVs through the assigned onChanges function, even though we can see the PVs updating with a pvmonitor on the command line.

This should just work: your channel monitor should simply reconnect once the PvaServer comes back online after a restart, assuming the old TCP connections are gone. This seems to be a fairly simple application as far as pvapy is concerned. How does the PvaServer "crash unexpectedly"? Do you use the PvaServer.update() method for updating records? Were you able to get a core dump by any chance, and do you know whether the crash is caused by something in the pvapy code or elsewhere (e.g., in the code that gets values from the PLCs)?

I create the Channel object for the client and assign a connection callback using the setConnectionCallback() method. In this method we subscribe the onChanges function and start monitoring the PV. When it disconnects, we stop monitoring the channel.

This should work as well, but I am not sure whether you have a separate connection callback method. In other words, I would create a connectionCallback() method which either invokes monitor(onChanges) when the connection comes online or stopMonitor() when it goes away.
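For what it's worth, the reconnect logic described here can be sketched in plain Python. The FakeChannel base class below is a stand-in so the control flow is visible and testable; with pvapy installed you would subclass pvapy.Channel instead, and its real setConnectionCallback(), monitor(), and stopMonitor() calls would replace the stubs (the exact callback signature is an assumption here):

    # Sketch of the connection-callback pattern, with a stand-in Channel class.
    class FakeChannel:
        def __init__(self, name):
            self.name = name
            self.monitoring = False
            self._connection_cb = None

        def setConnectionCallback(self, cb):
            self._connection_cb = cb

        def monitor(self, cb, request):
            self.monitoring = True  # real pvapy would start the monitor here

        def stopMonitor(self):
            self.monitoring = False

        # test helper: simulate the server going up or down
        def _simulateConnection(self, is_connected):
            self._connection_cb(is_connected)


    class ReconnectingPV(FakeChannel):
        def __init__(self, name):
            super().__init__(name)
            # start/stop monitoring strictly from the connection callback
            self.setConnectionCallback(self.connectionCallback)

        def connectionCallback(self, isConnected):
            if isConnected:
                self.monitor(self.onChanges, 'field(value,timeStamp)')
            else:
                self.stopMonitor()

        def onChanges(self, pv):
            pass  # propagate the update elsewhere

The point of the pattern is that monitoring is started and stopped in exactly one place, so a flood of simultaneous disconnects cannot leave channels in a half-monitored state.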

One approach that might also help with the crashes would be to insert a MirrorServer instance between the three PvaServers and your client class. The MirrorServer would isolate the client from server crashes, and would delete the original channels and re-establish them as a PvaServer goes away and comes back online. I must say I have not yet tested MirrorServer with more than 10-20 channels, so I cannot say for sure whether 500 channels would be a problem. It works really well with large PV update rates, though.

@sveseli
Collaborator

sveseli commented Nov 22, 2022

Another thing I forgot to mention: I recently did some work on the PvaServer class, and as of release 5.1.0 it does not start a separate callback thread when one is not needed, which simplifies what happens internally, so it may be worth trying one of the later pvapy releases. Note that the latest release uses EPICS base 7.0.7.

@Monarda

Monarda commented Nov 23, 2022

Hello, I'm @kathryn-baker's colleague, working on the PvaServer code that communicates with the PLCs via the CIP protocol and to which her client code connects. I've only recently taken over from someone else, so there's plenty about the code I'm still learning.

The CIP PvaServer code does use the PvaServer.update() method to update the records, called via an asyncio task and queues.

When one of these containers is restarted manually, we usually see @kathryn-baker's client code reconnect all PVs successfully, but not every time. Our current experience is that we need to check whether her client also needs restarting.

The unexpected crash we were experiencing occurred when the connection to the PLC caused an AssertionError in the CIP library (cpppo) we are using. This crashed the container, Docker would then automatically restart it, and a new instance would be populated with the same PVs, causing a short outage of a few seconds in the availability of the PVs. Obviously none of this is an issue with the pvapy code! However, when the container restarted after this kind of crash, none of the PVs in @kathryn-baker's client code reconnected successfully.

@sveseli
Collaborator

sveseli commented Nov 23, 2022

Thanks for the info. It sounds like the only pvapy-related issue is really the monitor not reconnecting after the crash, not the crash itself.

I presume you can recreate the problem fairly easily by killing one of your containers manually. Could you pick a PV, start a mirror server, monitor the mirrored channel, and see whether the mirrored channel reconnects after you kill the container? You can do this with a couple of commands:

# Terminal 1
$ pvapy-mirror-server --channel-map "(pv:mirror,pv:original,pva)"

# Terminal 2
$ pvget -m pv:mirror

I'd like to know if this works.

@sveseli
Collaborator

sveseli commented Nov 23, 2022

Forgot to mention: the following just works when using pvapy-ad-sim-server as a source that gets killed and restarted.

>>> from pvapy import *
>>> c = Channel('ad:image')
>>> def echo(pv):
...     print(pv['uniqueId'])
...
>>> c.monitor(echo, '')
>>> 42
43
44
45
46
47
48
49
0
1
2
3
...
