[BUG] Ability to restart subsystems related to logging in #6166

Open
thorst opened this issue Apr 16, 2024 · 13 comments
Labels
bug Something isn't working Internal-Issue-Created An issue has been created in NextGen's internal issue tracker RS-12749 triaged

Comments

@thorst

thorst commented Apr 16, 2024

Is your feature request related to a problem? Please describe.
Our problem is that nine out of ten service restarts are done because our developers can no longer log into the mirth ide. The channels are up and running but we cannot log in to access them. We then restart the services and then we can log in to access the gui. I did confirm with support that there is no way to currently restart that portion of the system, and that I would need to put in a feature request.

Describe your use case
We frequently need to access the gui to look at transactions, or make coding changes, so we need to be able to access the system at all times.

Describe the solution you'd like
I would like to be able to restart the login service if it gets gummed up. Something like a mcaccess restart command.

Describe alternatives you've considered
The current alternative is to restart mcservice which is a heavy-handed approach to enable users to log back in

Additional context
We are on 4.2.0 and hope to upgrade to the latest and greatest asap.

@thorst thorst added the enhancement New feature or request label Apr 16, 2024
@jonbartels
Contributor

The best solution would be to find out what the login problem is. Restarting single modules is a band-aid for whatever the underlying problem is.

I did confirm with support that there is no way to currently restart that portion of the system, and that I would need to put in a feature request.

This sounds like a polite "we don't know" answer from support.

Can you share any more details about logs, debugging, or other data you shared with support?
Do you have any customizations to Mirth such as JARs in server-launcher-lib or custom-lib?
Are you using any authentication extensions like LDAP or MFA?
What DB engine are you running on?

To the best of my knowledge, the only parts of Mirth Connect that can be managed independently are the channels themselves. Core Mirth, even plugins, are not separately deployable. What support is suggesting is a massive architectural change to the application.

@thorst
Author

thorst commented Apr 16, 2024

The issue is that it's not reproducible, but it happens like 2x a week. It happens primarily on our prod server, heaviest use, and then occasionally on our poc (proof of concept) server, as this is where people are more likely to make a code mistake.

We do not have any jars. We use ldap and postgres.

@jonbartels
Contributor

If the LDAP plugin is being used, you may have to depend on NextGen support. The LDAP plugin from NextGen is closed-source.

I would suggest capturing some data when this problem happens:

  1. The contents of mirth.log
  2. Run the MC admin client from the admin launcher with "Show Java Console" set to "Yes". There may be some logspam there, but there could be a useful error
  3. Take a thread dump from the Java server process to see if there are login or LDAP related calls
  4. Logs from the LDAP/AD server would also be informative; learning what authn calls Mirth is making to the LDAP server could show the issue
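
For step 3, a thread dump can get long, so it may help to filter it down to the threads whose stacks mention login or LDAP. The following is only a rough sketch: it assumes the dump is plain text in the usual layout where each thread block starts with a quoted thread name and blocks are separated by blank lines. `ThreadDumpFilter` and `matchingThreads` are hypothetical names, and the sample dump in `main` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not part of Mirth Connect.
final class ThreadDumpFilter {

    // Returns the header line of every thread block that mentions the keyword (case-insensitive).
    static List<String> matchingThreads(String dump, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<String> headers = new ArrayList<>();
        // Thread blocks are assumed to be separated by one or more blank lines.
        for (String block : dump.split("\\R\\s*\\R")) {
            String trimmed = block.strip();
            // Only blocks that start with a quoted thread name are treated as threads.
            if (trimmed.startsWith("\"") && trimmed.toLowerCase(Locale.ROOT).contains(needle)) {
                headers.add(trimmed.lines().findFirst().orElse(""));
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        // Made-up sample; a real dump would come from the server's Java process.
        String sample = String.join("\n",
                "\"qtp-1 login\" #12 prio=5",
                "   java.lang.Thread.State: RUNNABLE",
                "\tat com.example.ldap.Lookup.search(Lookup.java)",
                "",
                "\"worker\" #13 prio=5",
                "\tat java.lang.Object.wait(Native Method)",
                "");
        matchingThreads(sample, "ldap").forEach(System.out::println);
    }
}
```

Running the filter once for "ldap" and once for "login" against dumps taken while logins are failing would show whether any threads are stuck in those calls.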

@thorst
Author

thorst commented Apr 17, 2024

Today I was finally able to generate an error that we get a lot and also crash the login. If you hit the client API a decent amount, it'll cause this, so it's possible that several of my issues are all one giant issue.

Errors like the one below, which aren't tied to a channel, make it hard to track down the cause of the issue.

java.lang.NullPointerException: Cannot invoke "String.length()" because "s" is null

In this scenario today, I have a webpage that makes 2 calls basically simultaneously. It hits /codeTemplateLibraries?includeCodeTemplates=true and /channels/idsAndNames. So the above error will happen on the idsAndNames api call after it has been pushed a little. To reproduce, I can reload the page about 10 times (the first 9 times are successful). It'll then throw this error. I can reload another 10 times or so and then it'll crash the api altogether. We then get a lot of pop-ups in the mirth IDE, where it'll kick you out, and we cannot log in. The only way to recover from this is to restart mcservice.
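
The reproduction steps above (two calls fired at the same time, repeated until one fails) could be scripted instead of reloading a page by hand. This is a minimal sketch under assumptions: `PairedCallRepro` and `firstFailingRound` are hypothetical names, each call is represented as a `Supplier<Integer>` that returns an HTTP status code, and the actual HTTP requests to `/codeTemplateLibraries?includeCodeTemplates=true` and `/channels/idsAndNames` (including authentication) are left to the caller. The `main` method uses fake calls only.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical harness; not part of Mirth Connect or its client API.
final class PairedCallRepro {

    // Each round starts both calls at the same time and waits for both to finish.
    // Returns the 1-based round in which a call first failed, or -1 if every round succeeded.
    static int firstFailingRound(Supplier<Integer> callA, Supplier<Integer> callB, int rounds) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            for (int round = 1; round <= rounds; round++) {
                CompletableFuture<Boolean> a = CompletableFuture.supplyAsync(() -> succeeded(callA), pool);
                CompletableFuture<Boolean> b = CompletableFuture.supplyAsync(() -> succeeded(callB), pool);
                boolean okA = a.join();
                boolean okB = b.join();
                if (!okA || !okB) {
                    return round;
                }
            }
            return -1;
        } finally {
            pool.shutdown();
        }
    }

    // A call counts as successful only if it returns status 200 without throwing.
    private static boolean succeeded(Supplier<Integer> call) {
        try {
            return call.get() == 200;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Fake calls for illustration: the second one returns 500 on its tenth invocation.
        AtomicInteger calls = new AtomicInteger();
        Supplier<Integer> alwaysOk = () -> 200;
        Supplier<Integer> failsOnTenth = () -> calls.incrementAndGet() == 10 ? 500 : 200;
        System.out.println(firstFailingRound(alwaysOk, failsOnTenth, 20)); // prints 10
    }
}
```

Knowing the round at which the first failure shows up, and whether it is stable across runs, would make it easier to line the failure up with timestamps in mirth.log.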

PS. thanks @jonbartels for mentioning ldap. We do have other errors related to ldap that I need to rectify, but this github issue is related to logging in, and I believe it's because I'm crashing the client api.

So the question becomes, how can I make the client api more resilient?

@jonbartels
Contributor

Hmmm. Now this problem is getting interesting.

Are there more lines before and after that error message? The whole stack trace would narrow down where the problem is.

@thorst
Author

thorst commented Apr 17, 2024

Not really,

ERROR 2024-04-09 16:19:28.603 [ConfigurationServlet Thread (Get channel tags) < qtp1955460554-1495] com.mirth.connect.model.converters.ObjectXMLSerializer: com.mirth.connect.donkey.util.DonkeyElement$DonkeyElementException: java.lang.NullPointerException: Cannot invoke "String.length()" because "s" is null

and

ERROR 2023-10-24 13:01:13.087 [ChannelGroupServlet Thread (Get channel groups) < qtp231198585-109214] com.mirth.connect.model.converters.ObjectXMLSerializer: com.mirth.connect.donkey.util.DonkeyElement$DonkeyElementException: java.lang.NullPointerException

@lmillergithub
Collaborator

@thorst We looked at this issue and think this should be a bug instead of an idea. We will change the issue. We have created an internal ticket. Thanks.

@lmillergithub lmillergithub added bug Something isn't working triaged Internal-Issue-Created An issue has been created in NextGen's internal issue tracker RS-12749 and removed enhancement New feature or request labels Apr 17, 2024
@lmillergithub lmillergithub changed the title [IDEA] Ability to restart subsystems related to logging in [BUG] Ability to restart subsystems related to logging in Apr 17, 2024
@jonbartels
Contributor

jonbartels commented Apr 17, 2024

DonkeyElementException is only instantiated and thrown from three places in the open-source Mirth code. Since non-open extensions are involved it could be coming from somewhere else.

The declaration is here, then find usages on its constructor shows the only three places where the exception can originate.

public static class DonkeyElementException extends Exception {

Then the interesting one is here:

new StringReader(xml) is a java.io class:

public StringReader(String s) {
    this.str = s;
    this.length = s.length();
}

So I think that xml is somehow null; Mirth is trying to serialize an empty XML element to XML.
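
The behavior described above is easy to confirm in isolation: constructing a `java.io.StringReader` from a null string throws a `NullPointerException` from the constructor itself. The sketch below is not Mirth code. `NullXmlDemo`, `throwsNpe`, `readerOrEmpty`, and `firstChar` are hypothetical names, and treating null as empty input is just one possible guard, shown as an assumption rather than the project's fix.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class; none of these names come from the Mirth codebase.
final class NullXmlDemo {

    // Returns true if constructing a StringReader from the given value throws an NPE.
    static boolean throwsNpe(String xml) {
        try {
            new StringReader(xml);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // One possible guard (an assumption): treat a null string as empty input.
    static Reader readerOrEmpty(String xml) {
        return new StringReader(xml == null ? "" : xml);
    }

    // Reads the first character through the guard; -1 means end of input.
    static int firstChar(String xml) {
        try {
            return readerOrEmpty(xml).read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsNpe(null));   // prints true
        System.out.println(throwsNpe("<a/>")); // prints false
        System.out.println(firstChar(null));   // prints -1
    }
}
```

A guard like this would only hide the symptom. It would not explain why the value is null on some requests and not others, so logging the null case is probably more useful than silently replacing it.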

@thorst
Author

thorst commented Apr 17, 2024

So I have ticket #6058 related to donkey. I was told I should put in a request to add logging, but it sounds like a simple if to check whether it's null would fix this. However, like I said, it works 9 times out of 10 and then randomly errors, so I'm not sure why it would be throwing an error when the data should be there.

@tonygermano
Collaborator

Try looking in the actual mirth.log file instead of the log in the admin client dashboard. Sometimes it has more information about the error.

@jonbartels
Contributor

So the above error will happen on the idsAndNames api call after it has been pushed a little. To reproduce, I can reload the page about 10 times (the first 9 times are successful). It'll then throw this error. I can reload another 10 times or so and then it'll crash the api altogether.

Can you run netstat when the crashing condition exists? I wonder if the calling app is holding connections open and exhausting a connection pool in Mirth.

@thorst
Author

thorst commented Apr 17, 2024

Try looking in the actual mirth.log file instead of the log in the admin client dashboard. Sometimes it has more information about the error.

That is from the server log, not the dashboard. That's why I've been tracking this issue for like a year: it's hard to reproduce when there is a lot going on (as we always have) because there's so little info. Today was the first time I was able to reproduce it consistently. Before, I had never actually witnessed the API returning the error; I only saw the anemic error in the log.

So the above error will happen on the idsAndNames api call after it has been pushed a little. To reproduce, I can reload the page about 10 times (the first 9 times are successful). It'll then throw this error. I can reload another 10 times or so and then it'll crash the api altogether.

Can you run netstat when the crashing condition exists? I wonder if the calling app is holding connections open and exhausting a connection pool in Mirth.

I will do that tomorrow. The team is already pissed at me for taking test down several times today. I'm also working on standing up another box to help with dev and troubleshooting, where I can break things.

@thorst
Author

thorst commented Apr 25, 2024

So, I'm still researching this but will give this update on where I'm at. I was using .NET's HttpClient, which does have the possibility to exhaust connections, depending on how it's configured. I set up a test environment under the old code, and then wrote new code to allow it to pool connections and reuse them (dependency injection and IHttpClientFactory).

Interestingly, both had 6 connections max at any given time, no matter how many requests I made. I used netstat and grep to filter the results based on things hitting port 8443. Under the old code I expected to see 1 connection per request, so the 20 or 30 requests in my testing should have created 20 or 30 connections. That wasn't the case though.

What I am experiencing is this: for example, I make 20 calls rapidly (I tied it to a button click and clicked the button with my mouse 20 times). The first 10 calls happen very quickly, and a response is given. Then mirth is under more load, so the next 5 will finish but take much longer than one would expect if they were simply queued up and then responded to in order. So the system's net time, because things were queued, is much longer than if I just fired one request per second for 20 seconds. That would finish quicker than firing 20 in one second. To finish my scenario, the last five will completely time out at the .NET HttpClient level. Obviously I could have just adjusted the timeout on the HttpClient, but any timeout would likely still be hit.

Then I can run netstat and see the 6 connections peel off slowly as you would expect. So again, I expected to see all 20 calls with 20 connections in order to reach the port exhaustion, and I'm not, I'm only seeing 6, but that may just be a misunderstanding on my part on what port exhaustion looks like.
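
One client-side option for keeping a burst like the 20 rapid calls above from hitting the server all at once is to cap the number of requests in flight. The sketch below is written in Java rather than .NET, to match the server-side code quoted earlier in this thread, and it is only an illustration of the idea. `BoundedCaller`, `call`, `peakInFlight`, and `runBurst` are hypothetical names, and the request is simulated with a short sleep instead of a real HTTP call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical client-side limiter; not part of Mirth Connect or .NET's HttpClient.
final class BoundedCaller {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    BoundedCaller(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Waits for a free permit, runs the request, then releases the permit.
    <T> T call(Supplier<T> request) {
        permits.acquireUninterruptibly();
        try {
            // Track the highest number of simultaneous requests seen so far.
            peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            return request.get();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    int peakInFlight() {
        return peak.get();
    }

    // Fires `calls` simulated requests at once and returns the peak concurrency observed.
    static int runBurst(int calls, int maxConcurrent) {
        BoundedCaller caller = new BoundedCaller(maxConcurrent);
        ExecutorService pool = Executors.newFixedThreadPool(calls);
        try {
            CompletableFuture<?>[] tasks = new CompletableFuture<?>[calls];
            for (int i = 0; i < calls; i++) {
                tasks[i] = CompletableFuture.runAsync(() -> caller.call(() -> {
                    sleepQuietly(20); // stands in for a real HTTP round trip
                    return 200;
                }), pool);
            }
            CompletableFuture.allOf(tasks).join();
            return caller.peakInFlight();
        } finally {
            pool.shutdown();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // 20 simultaneous callers, at most 3 requests in flight at any time.
        System.out.println(runBurst(20, 3) <= 3); // prints true
    }
}
```

This does not make the server any more resilient. It only spaces out the load from one client, which may be enough to avoid the timeouts while the underlying error is investigated.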

In any case, whatever my issue is, and as I continue to work to reuse connections and reduce calls, this ticket is still relevant: we need to be able to stop and start the login portion of the system. There will always be some error happening with this or that, and having the ability to restart the login features without a complete service restart is ideal.

4 participants