Threads hang on Kubernetes instance #1792

Open
artntek opened this issue Feb 1, 2024 · 2 comments
Labels: bug (Something isn't working), k8s (Kubernetes/Helm Related)

Contributor

artntek commented Feb 1, 2024

This is for the metacat deployment in k8s. (It may also happen in legacy deployments, but has hitherto gone undetected?) The example below is for replication; however, it is assumed that other operations that create threads would also be affected (e.g. changing system metadata).

NOTE: I could not reproduce this using the steps below, even when looping 1000 times.
The actual bug was observed while testing replication in the dev cluster. We noticed the symptoms in the OBSERVED section below, and surmised the threads were not starting. It's unclear how the container first got into this state, though...

STEPS TO REPRODUCE

Basically, you need to start a bunch of threads...

  1. Prerequisite: you need sudo access to the sandbox CN, cn-sandbox-ucsb-1.test.dataone.org
  2. Set up a metacat instance with a test node cert, to act as an MN.
  3. Enable settings for replication with the sandbox CN; for example (values.yaml override):
  ## DataONE Member Node (MN) Parameters
  dataone.certificate.fromHttpHeader.enabled: true
  dataone.autoRegisterMemberNode: 2024-01-25
  D1Client.CN_URL: https://cn-sandbox.test.dataone.org/cn
  dataone.nodeId: "urn:node:TestBROOKELT"
  dataone.subject: "CN=urn:node:TestBROOKELT,DC=dataone,DC=org"
  dataone.nodeName: Test BROOKE LT Metacat Node
  dataone.nodeDescription: Dev cluster Test BROOKE LT Metacat Node
  dataone.contactSubject: http://orcid.org/0000-0002-1472-913X
  dataone.nodeSynchronize: true
  dataone.nodeReplicate: true
  dataone.replicationpolicy.default.numreplicas: "1"
  4. Retrieve a valid sysmeta file from the CN, like this one, and save it as systemmeta.xml in the directory where you will run the next step:
<ns3:systemMetadata xmlns:ns2="http://ns.dataone.org/service/types/v1" xmlns:ns3="http://ns.dataone.org/service/types/v2.0">
<serialVersion>31</serialVersion>
<identifier>testReplicate.1706572592156</identifier>
<formatId>eml://ecoinformatics.org/eml-2.0.1</formatId>
<size>12960</size>
<checksum algorithm="MD5">e9eba01da2e921f03c0239a3632e70ac</checksum>
<submitter>CN=urn:node:TestBROOKELT,DC=dataone,DC=org</submitter>
<rightsHolder>CN=urn:node:TestBROOKELT,DC=dataone,DC=org</rightsHolder>
<accessPolicy>
<allow>
<subject>public</subject>
<permission>read</permission>
</allow>
</accessPolicy>
<replicationPolicy replicationAllowed="true" numberReplicas="1">
<preferredMemberNode>urn:node:TestBROOKELT</preferredMemberNode>
</replicationPolicy>
<archived>false</archived>
<dateUploaded>2024-01-29T23:56:35.102+00:00</dateUploaded>
<dateSysMetadataModified>2024-01-29T23:56:35.102+00:00</dateSysMetadataModified>
<originMemberNode>urn:node:mnSandboxUCSB1</originMemberNode>
<authoritativeMemberNode>urn:node:mnSandboxUCSB1</authoritativeMemberNode>
<replica>
<replicaMemberNode>urn:node:mnSandboxUCSB1</replicaMemberNode>
<replicationStatus>completed</replicationStatus>
<replicaVerified>2024-01-29T23:57:11.228+00:00</replicaVerified>
</replica>
<replica>
<replicaMemberNode>urn:node:cnSandbox</replicaMemberNode>
<replicationStatus>completed</replicationStatus>
<replicaVerified>2024-01-29T23:57:11.309+00:00</replicaVerified>
</replica>
</ns3:systemMetadata>
  5. Run this curl command several times (e.g. in a loop):
  # replace with your MN host and context
  HOST_CONTEXT="https://metacat-dev.test.dataone.org/metacat"

  # change upper bound of seq as needed
  for i in $(seq 1 100); do
    echo "$i"
    sudo curl -E /etc/dataone/client/private/urn_node_cnSandboxUCSB1.pem \
              -F "sysmeta=@systemmeta.xml" -F "sourceNode=urn:node:mnSandboxUCSB1" \
              -X POST "$HOST_CONTEXT/d1/mn/v2/replicate"
  done

EXPECTED

MNResourceHandler logs a message like this:

metacat 20240201-18:05:45: [DEBUG]: sourceNode: urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler:replicate:1112]

It then starts this thread, which should make a call to MNodeService.replicate(), resulting in log output like this:

metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters:
	Session.Subject      = CN=urn:node:cnSandboxUCSB1,DC=dataone,DC=org
	identifier           = testReplicate.1706727183679
	Source NodeReference =urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.dataone.MNodeService:replicate:923]

OBSERVED

MNResourceHandler logs a message like this:

metacat 20240201-18:05:45: [DEBUG]: sourceNode: urn:node:mnSandboxUCSB1 [edu.ucsb.nceas.metacat.restservice.v2.MNResourceHandler:replicate:1112]

However, it appears that this thread, which should make a call to MNodeService.replicate(), is never started, because there is no log output like this:

metacat 20240201-18:05:45: [INFO]: MNodeService.replicate() called with parameters:[...etc]
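
One way to check for this symptom is to compare how many replicate requests the handler logged against how many actually reached MNodeService.replicate(). The following is a rough sketch only: the function name check_replicate_log is made up for illustration, the match strings are taken from the log lines quoted above, and it assumes DEBUG logging is enabled so that the handler line is written at all.

```shell
# check_replicate_log LOGFILE   (hypothetical helper, not part of metacat)
# Prints both counts and returns non-zero when they differ.
check_replicate_log() {
  log="$1"
  # Requests received by the REST handler (the DEBUG line quoted above).
  received=$(grep -c -F 'MNResourceHandler:replicate:' "$log" || true)
  # Calls that actually reached MNodeService.replicate() (the INFO line).
  started=$(grep -c -F 'MNodeService.replicate() called' "$log" || true)
  echo "received=$received started=$started"
  [ "$received" -eq "$started" ]
}
```

To use it, save the pod log to a file first (for example with kubectl logs, using your own pod name) and pass that file as the argument. A mismatch right after a request may only mean the thread has not logged yet, so it is worth re-checking after a short wait.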

WORKAROUND

Restarting the pod seems to restore normal behavior.
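
For reference, a restart could look roughly like the sketch below. It assumes the pod is managed by a controller (such as a StatefulSet or Deployment) that recreates it after deletion. The helper name restart_pod and the DRY_RUN switch are made up for illustration, and the namespace and pod name are placeholders for your own.

```shell
# restart_pod NAMESPACE POD   (hypothetical helper)
# Deletes the pod so that its controller recreates it.
# Set DRY_RUN=1 to print the command instead of running it.
restart_pod() {
  ns="$1"
  pod="$2"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "kubectl delete pod -n $ns $pod"
  else
    kubectl delete pod -n "$ns" "$pod"
  fi
}
```

Since it is still unclear how the container gets into this state, it may be worth saving the pod log before restarting, because the restart discards the evidence.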

Contributor

taojing2002 commented Feb 1, 2024 via email

@artntek artntek added bug Something isn't working k8s Kubernetes/Helm Related labels Feb 1, 2024
Contributor Author

artntek commented Feb 13, 2024

I don't have a good answer. 1000 calls didn't trigger it for me. Maybe just keep going until it breaks? :-)

@artntek artntek added this to the 3.1.0 milestone Feb 13, 2024