
[BACKPORT][v1.6.2][BUG] share-manager-pvc appears to be leaking memory #8426

Closed
github-actions bot opened this issue Apr 24, 2024 · 3 comments
Labels
area/upstream Upstream related like tgt upstream library
area/volume-rwx Volume RWX related
investigation-needed Need to identify the case before estimating and starting the development
kind/backport Backport request
kind/bug
priority/0 Must be fixed in this release (managed by PO)
require/backport Require backport. Only used when the specific versions to backport have not been defined.
require/qa-review-coverage Require QA to review coverage
Milestone: v1.6.2

Comments

@github-actions

backport #8394

@github-actions github-actions bot added area/upstream Upstream related like tgt upstream library area/volume-rwx Volume RWX related investigation-needed Need to identify the case before estimating and starting the development kind/backport Backport request kind/bug priority/0 Must be fixed in this release (managed by PO) require/backport Require backport. Only used when the specific versions to backport have not been defined. require/qa-review-coverage Require QA to review coverage labels Apr 24, 2024
@github-actions github-actions bot added this to the v1.6.2 milestone Apr 24, 2024
@derekbit
Member

The issue can be reproduced by

  • Create a RWX volume.
  • Create a workload A using the volume. The running workload A ensures the share-manager pod keeps running.
  • Repeatedly attach and detach a workload B using the volume. The memory usage of nfs-ganesha (cat /proc/<PID of nfs-ganesha>/status | grep VmRSS) increases over time.
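
The VmRSS check above can be wrapped in a small helper that pulls the resident-set size out of /proc/<pid>/status-style output. A minimal sketch (parse_vmrss is an illustrative name, not part of Longhorn):

```shell
#!/bin/sh
# Print the VmRSS value (in kB) from /proc/<pid>/status-style text on stdin.
# Hypothetical helper for sampling nfs-ganesha memory over time.
parse_vmrss() {
    grep '^VmRSS:' | awk '{print $2}'
}
```

Sampling in a loop (e.g. `parse_vmrss < /proc/29/status` every few seconds) makes the upward trend easy to log and compare.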

@longhorn-io-github-bot

longhorn-io-github-bot commented May 6, 2024

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:
  1. Create a 3-node cluster.
  2. Create the first workload with a RWX volume using https://github.com/longhorn/longhorn/blob/master/examples/rwx/rwx-nginx-deployment.yaml
  3. Create a second workload with the same RWX volume.
  4. Scale the second workload down and back up repeatedly, 100 times.
  5. Find the PID of nfs-ganesha in the share-manager pod with ps aux.
  6. Observe the VmRSS of nfs-ganesha in the share-manager pod with cat /proc/<nfs-ganesha PID>/status | grep VmRSS.
  7. VmRSS in Longhorn v1.6.1 is significantly larger than the value after applying the fix.
  • Does the PR include the explanation for the fix or the feature?

  • Has the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc.) (including backport-needed/*)?
    The PRs are at:

longhorn/nfs-ganesha#13
longhorn/longhorn-share-manager#204

  • Which areas/issues this PR might have potential impacts on?
    Area: RWX volume, memory leak, upstream
    Issues
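
Step 5 above greps the nfs-ganesha PID out of ps aux output. A minimal parsing helper for that step (find_ganesha_pid is an illustrative name; it assumes the process command line contains "ganesha"):

```shell
#!/bin/sh
# From `ps aux` output on stdin, print the PID (column 2) of the first
# process whose command line mentions "ganesha". The [g] bracket trick
# keeps the filter from matching its own pipeline when run against live ps.
find_ganesha_pid() {
    awk '/[g]anesha/ {print $2; exit}'
}
```

Inside the share-manager pod this would typically be used as `ps aux | find_ganesha_pid` before reading /proc/<PID>/status.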

@roger-ryao roger-ryao self-assigned this May 7, 2024
@roger-ryao

Verified on v1.6.x-head 20240507

The test steps
#8394 (comment)

  1. Create the first workload with a RWX volume using https://github.com/longhorn/longhorn/blob/master/examples/rwx/rwx-nginx-deployment.yaml
  2. Scale the replicas up to 3.
  3. Check that all 3 workload pods are in the "Running" state.
  4. Scale the replicas down to 1.
  5. Check that the remaining workload pod is in the "Running" state.
    Steps 2-5 can be run with the following shell script.
deployment_rwx_test.sh
```shell
#!/bin/bash

# Scale an RWX test deployment between 3 and 1 replicas in a loop to
# exercise repeated attach/detach of the shared volume.
DEPLOYMENT_NAME="rwx-test"
KUBECONFIG_PATH="/home/ryao/Desktop/note/longhorn-tool/ryao-161.yaml"

for ((i=1; i<=100; i++)); do
    # Scale deployment up to 3 replicas
    kubectl --kubeconfig=$KUBECONFIG_PATH scale deployment $DEPLOYMENT_NAME --replicas=3

    # Wait for the deployment to have 3 ready replicas
    until [[ "$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "3" ]]; do
        ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
        echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
        sleep 1
    done

    # Check that all pods are in the "Running" state
    while [[ $(kubectl --kubeconfig=$KUBECONFIG_PATH get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "Running Running Running" ]]; do
        echo "Not all pods are in the 'Running' state. Waiting..."
        sleep 5
    done

    # Scale deployment down to 1 replica
    kubectl --kubeconfig=$KUBECONFIG_PATH scale deployment $DEPLOYMENT_NAME --replicas=1

    # Wait for the deployment to have 1 ready replica
    until [[ "$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')" == "1" ]]; do
        ready_replicas=$(kubectl --kubeconfig=$KUBECONFIG_PATH get deployment $DEPLOYMENT_NAME -o=jsonpath='{.status.readyReplicas}')
        echo "Iteration #$i: $DEPLOYMENT_NAME has $ready_replicas ready replicas"
        sleep 1
    done

    # Check that the remaining pod is in the "Running" state
    while [[ $(kubectl --kubeconfig=$KUBECONFIG_PATH get pods -l=app=$DEPLOYMENT_NAME -o=jsonpath='{.items[*].status.phase}') != "Running" ]]; do
        echo "Not all pods are in the 'Running' state. Waiting..."
        sleep 5
    done
done
```
  6. Find the PID of nfs-ganesha in the share-manager pod with ps aux
  7. Observe the VmRSS of nfs-ganesha in the share-manager pod with cat /proc/<nfs-ganesha PID>/status | grep VmRSS
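
Deciding whether the observed VmRSS growth counts as a leak can be sketched as a simple threshold comparison between two samples. Both the helper name (leaked) and the 4096 kB default threshold are illustrative, not from the issue:

```shell
#!/bin/bash
# Return success (exit 0) if VmRSS grew by more than threshold_kb between
# two samples; both values are in kB. Hypothetical helper and threshold.
leaked() {
    local before=$1 after=$2 threshold_kb=${3:-4096}
    [ $(( after - before )) -gt "$threshold_kb" ]
}
```

With the numbers reported below, `leaked 41604 47192` would flag the v1.6.1 growth, while comparable samples on v1.6.x-head would not.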

Result Passed

  1. We were also able to reproduce this issue on v1.6.1.
  2. After executing the script, the output for v1.6.1 is as follows:
Every 2.0s: cat /proc/29/status | grep VmRSS                     share-manager-pvc-119d403e-ae17-4f4f-aa7f-06e7bf40fca2: Tue May  7 09:54:38 2024

VmRSS:     47192 kB

For the v1.6.x-head

Every 2.0s: cat /proc/29/status | grep VmRSS                    share-manager-pvc-f22c2fdf-330e-4c22-aea2-45a10c570cbf: Tue May  7 10:09:11 2024

VmRSS:     41604 kB
