
[BACKPORT][v1.6.2][BUG] share-manager-pvc appears to be leaking memory #8426

Closed
github-actions bot opened this issue Apr 24, 2024 · 3 comments
Labels
area/upstream Upstream related like tgt upstream library
area/volume-rwx Volume RWX related
investigation-needed Need to identify the case before estimating and starting the development
kind/backport Backport request
kind/bug
priority/0 Must be fixed in this release (managed by PO)
require/backport Require backport. Only used when the specific versions to backport have not been defined.
require/qa-review-coverage Require QA to review coverage
Milestone: v1.6.2

Comments

@github-actions

backport #8394

@github-actions github-actions bot added area/upstream Upstream related like tgt upstream library area/volume-rwx Volume RWX related investigation-needed Need to identify the case before estimating and starting the development kind/backport Backport request kind/bug priority/0 Must be fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Apr 24, 2024
@github-actions github-actions bot added this to the v1.6.2 milestone Apr 24, 2024
@derekbit
Member

The issue can be reproduced by

  • Create a RWX volume.
  • Create a workload A using the volume. The running workload A ensures the share-manager pod keeps running.
  • Repeatedly attach and detach a workload B using the volume. The memory usage of nfs-ganesha (cat /proc/<PID of nfs-ganesha>/status | grep VmRSS) increases over time.
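
The VmRSS check above can be wrapped in a small helper that pulls the resident-set size out of /proc/<pid>/status-style output. A minimal sketch (parse_vmrss is an illustrative name, not part of Longhorn):

```shell
#!/bin/sh
# Print the VmRSS value (in kB) from /proc/<pid>/status-style text on stdin.
# Hypothetical helper for sampling nfs-ganesha memory over time.
parse_vmrss() {
    grep '^VmRSS:' | awk '{print $2}'
}
```

Sampling in a loop (e.g. `parse_vmrss < /proc/29/status` every few seconds) makes the upward trend easy to log and compare.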

@longhorn-io-github-bot

longhorn-io-github-bot commented May 6, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
  1. Create a 3-node cluster.
  2. Create the first workload with a RWX volume using https://github.com/longhorn/longhorn/blob/master/examples/rwx/rwx-nginx-deployment.yaml
  3. Create a second workload with the same RWX volume.
  4. Scale the second workload down and back up repeatedly, 100 times.
  5. Find the PID of nfs-ganesha in the share-manager pod with ps aux.
  6. Observe the VmRSS of nfs-ganesha in the share-manager pod with cat /proc/<nfs-ganesha PID>/status | grep VmRSS.
  7. VmRSS in Longhorn v1.6.1 is significantly larger than the value after applying the fix.
  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc.) (including backport-needed/*)?
    The PRs are at:

longhorn/nfs-ganesha#13
longhorn/longhorn-share-manager#204

  • Which areas/issues this PR might have potential impacts on?
    Area: RWX volume, memory leak, upstream
    Issues
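
Step 5 above greps the nfs-ganesha PID out of ps aux output. A minimal parsing helper for that step (find_ganesha_pid is an illustrative name; it assumes the process command line contains "ganesha"):

```shell
#!/bin/sh
# From `ps aux` output on stdin, print the PID (column 2) of the first
# process whose command line mentions "ganesha". The [g] bracket trick
# keeps the filter from matching its own pipeline when run against live ps.
find_ganesha_pid() {
    awk '/[g]anesha/ {print $2; exit}'
}
```

Inside the share-manager pod this would typically be used as `ps aux | find_ganesha_pid` before reading /proc/<PID>/status.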

@roger-ryao roger-ryao self-assigned this May 7, 2024
@roger-ryao

Verified on v1.6.x-head 20240507

The test steps
#8394 (comment)

  1. Create the first workload with a RWX volume using https://github.com/longhorn/longhorn/blob/master/examples/rwx/rwx-nginx-deployment.yaml
  2. Scale the replicas up to 3.
  3. Check that all 3 workload pods are in the "Running" state.
  4. Scale the replicas down to 1.
  5. Check that the remaining workload pod is in the "Running" state.
    Steps 2-5 can be run with the following shell script.
deployment_rwx_test.sh
```shell
#!/bin/bash

# Scale an RWX test deployment between 3 and 1 replicas in a loop to
# exercise repeated attach/detach of the shared volume.
DEPLOYMENT_NAME="rwx-test"
KUBECONFIG_PATH="/home/ryao/Desktop/note/longhorn-tool/ryao-161.yaml"

for ((i=1; i<=100; i++)); do
    # Scale deployment up to 3 replicas
    kubectl --kubeconfig=$KUBECONFIG_PATH scale deployment $DEPLOYMENT_NAME --replicas=3

    # Wait for the deployment to have 3 ready replicas
    until [[ "$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "3" ]]; do
        ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
        echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
        sleep 1
    done

    # Check that all pods are in the "Running" state
    while [[ $(kubectl --kubeconfig=$KUBECONFIG_PATH get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "Running Running Running" ]]; do
        echo "Not all pods are in the 'Running' state. Waiting..."
        sleep 5
    done

    # Scale deployment down to 1 replica
    kubectl --kubeconfig=$KUBECONFIG_PATH scale deployment $DEPLOYMENT_NAME --replicas=1

    # Wait for the deployment to have 1 ready replica
    until [[ "$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "1" ]]; do
        ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
        echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
        sleep 1
    done

    # Check that the remaining pod is in the "Running" state
    while [[ $(kubectl --kubeconfig=$KUBECONFIG_PATH get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "Running" ]]; do
        echo "Not all pods are in the 'Running' state. Waiting..."
        sleep 5
    done
done
```
  6. Find the PID of nfs-ganesha in the share-manager pod with ps aux
  7. Observe the VmRSS of nfs-ganesha in the share-manager pod with cat /proc/<nfs-ganesha PID>/status | grep VmRSS
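
Deciding whether the observed VmRSS growth counts as a leak can be sketched as a simple threshold comparison between two samples. Both the helper name (leaked) and the 4096 kB default threshold are illustrative, not from the issue:

```shell
#!/bin/bash
# Return success (exit 0) if VmRSS grew by more than threshold_kb between
# two samples; both values are in kB. Hypothetical helper and threshold.
leaked() {
    local before=$1 after=$2 threshold_kb=${3:-4096}
    [ $(( after - before )) -gt "$threshold_kb" ]
}
```

With the numbers reported below, `leaked 41604 47192` would flag the v1.6.1 growth, while comparable samples on v1.6.x-head would not.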

Result Passed

  1. We were also able to reproduce this issue on v1.6.1.
  2. After executing the script, the output for v1.6.1 is as follows:
Every 2.0s: cat /proc/29/status | grep VmRSS                     share-manager-pvc-119d403e-ae17-4f4f-aa7f-06e7bf40fca2: Tue May  7 09:54:38 2024

VmRSS:     47192 kB

For the v1.6.x-head

Every 2.0s: cat /proc/29/status | grep VmRSS                    share-manager-pvc-f22c2fdf-330e-4c22-aea2-45a10c570cbf: Tue May  7 10:09:11 2024

VmRSS:     41604 kB
