New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Threads hang on Kubernetes instance #1792
Comments
How many times of calling can cause the hanging? I am interested in trying
this on a regular Metacat instance.
…On Thu, Feb 1, 2024 at 10:44 AM Matthew B ***@***.***> wrote:
This is for the metacat deployment in k8s. It may also happen in legacy
deployments, but hitherto gone undetected? The example below is for
replication; however, it is assumed that other operations that create
threads would also be affected (e.g. changing system metadata).
STEPS TO REPRODUCE
Basically, you need to start a bunch of threads...
1. Set up a metacat instance with a test node cert, to be a MN.
2. Enable settings for replication with the sandbox CN; for example
(values.yaml override):
## DataONE Member Node (MN) Parameters
dataone.certificate.fromHttpHeader.enabled: true
dataone.autoRegisterMemberNode: 2024-01-25
D1Client.CN_URL: https://cn-sandbox.test.dataone.org/cn
dataone.nodeId: "urn:node:TestBROOKELT"
dataone.subject: "CN=urn:node:TestBROOKELT,DC=dataone,DC=org"
dataone.nodeName: Test BROOKE LT Metacat Node
dataone.nodeDescription: Dev cluster Test BROOKE LT Metacat Node
dataone.contactSubject: http://orcid.org/0000-0002-1472-913X
dataone.nodeSynchronize: true
dataone.nodeReplicate: true
dataone.replicationpolicy.default.numreplicas: "1"
1. Retrieve a valid sysmeta file from the CN like this one, and save
it to a file named systemmeta.xml, in the same directory where you
will execute the following step:
<ns3:systemMetadata xmlns:ns2="http://ns.dataone.org/service/types/v1" xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
<serialVersion>31</serialVersion>
<identifier>testReplicate.1706572592156</identifier>
<formatId>eml://ecoinformatics.org/eml-2.0.1</formatId>
<size>12960</size>
<checksum algorithm="MD5">e9eba01da2e921f03c0239a3632e70ac</checksum>
<submitter>CN=urn:node:TestBROOKELT,DC=dataone,DC=org</submitter>
<rightsHolder>CN=urn:node:TestBROOKELT,DC=dataone,DC=org</rightsHolder>
<accessPolicy>
<allow>
<subject>public</subject>
<permission>read</permission>
</allow>
</accessPolicy>
<replicationPolicy replicationAllowed="true" numberReplicas="1">
<preferredMemberNode>urn:node:TestBROOKELT</preferredMemberNode>
</replicationPolicy>
<archived>false</archived>
<dateUploaded>2024-01-29T23:56:35.102+00:00</dateUploaded>
<dateSysMetadataModified>2024-01-29T23:56:35.102+00:00</dateSysMetadataModified>
<originMemberNode>urn:node:mnSandboxUCSB1</originMemberNode>
<authoritativeMemberNode>urn:node:mnSandboxUCSB1</authoritativeMemberNode>
<replica>
<replicaMemberNode>urn:node:mnSandboxUCSB1</replicaMemberNode>
<replicationStatus>completed</replicationStatus>
<replicaVerified>2024-01-29T23:57:11.228+00:00</replicaVerified>
</replica>
<replica>
<replicaMemberNode>urn:node:cnSandbox</replicaMemberNode>
<replicationStatus>completed</replicationStatus>
<replicaVerified>2024-01-29T23:57:11.309+00:00</replicaVerified>
</replica>
</ns3:systemMetadata>
1. run this curl command several times (eg in a loop)
curl -E /etc/dataone/client/private/urn_node_cnSandboxUCSB1.pem -F ***@***.***" -F "sourceNode=urn:node:mnSandboxUCSB1" -X POST https://metacat-dev.test.dataone.org/metacat/d1/mn/v2/replicate
EXPECTED
MNResourceHandler logs a message like this:
metacat 20240201-18:05:45: [DEBUG]: sourceNode: urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler:replicate:1112]
It then starts this thread
<https://github.com/NCEAS/metacat/blob/57d400af212a6a9d7e62616cfee405bd2f786441/src/edu/ucsb/nceas/metacat/restservice/v2/MNResourceHandler.java#L1123>,
which should make a call to MNodeService.replicate()
<https://github.com/NCEAS/metacat/blob/c74c522e9809b414a399c5876b03fcdad816a8d5/src/edu/ucsb/nceas/metacat/dataone/MNodeService.java#L922>,
resulting in log output like this:
metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters:
Session.Subject = CN=urn:node:cnSandboxUCSB1,DC=dataone,DC=org
identifier = testReplicate.1706727183679
Source NodeReference =urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.dataone.MNodeService:replicate:923]
OBSERVED
MNResourceHandler logs a message like this:
metacat 20240201-18:05:45: [DEBUG]: sourceNode: urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler:replicate:1112]
However, it appears that this thread
<https://github.com/NCEAS/metacat/blob/57d400af212a6a9d7e62616cfee405bd2f786441/src/edu/ucsb/nceas/metacat/restservice/v2/MNResourceHandler.java#L1123>,
which should make a call to MNodeService.replicate()
<https://github.com/NCEAS/metacat/blob/c74c522e9809b414a399c5876b03fcdad816a8d5/src/edu/ucsb/nceas/metacat/dataone/MNodeService.java#L922>
is never started, because there is no log output like this:
metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters:[...etc]
WORKAROUND
Restarting the pod seems to restore normality
—
Reply to this email directly, view it on GitHub
<#1792>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5QQDCXHZ5DF5T3DE4WJS3YRPO7RAVCNFSM6AAAAABCVNCXYSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGEYTGMRQHAYDCOI>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
I don't have a good answer. 1000 calls didn't trigger it for me. Maybe just keep going until it breaks? :-) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
This is for the metacat deployment in k8s. (It may also happen in legacy deployments, but has hitherto gone undetected?) The example below is for replication; however, it is assumed that other operations that create threads would also be affected (e.g. changing system metadata).
STEPS TO REPRODUCE
Basically, you need to start a bunch of threads...
cn-sandbox-ucsb-1.test.dataone.org
systemmeta.xml
, in the same directory where you will execute the following step:EXPECTED
MNResourceHandler
logs a message like this:It then starts this thread, which should make a call to MNodeService.replicate(), resulting in log output like this:
metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters: Session.Subject = CN=urn:node:cnSandboxUCSB1,DC=dataone,DC=org identifier = testReplicate.1706727183679 Source NodeReference =urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.dataone.MNodeService:replicate:923]
OBSERVED
MNResourceHandler
logs a message like this:However, it appears that this thread, which should make a call to MNodeService.replicate() is never started, because there is no log output like this:
metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters:[...etc]
WORKAROUND
Restarting the pod seems to restore normality
The text was updated successfully, but these errors were encountered: