memory usage of data coordinator keeps increasing #31226
-
Hello, I have a question about the memory usage of the data coordinator. We are running a data pipeline for Milvus (v2.3.10) that repeats the process below.
P.S. The segment count of a single collection is about 1700 ~ 1800. The data pipeline starts every day at 4 AM, and after repeating this for a week, it seems like the memory usage of the data coordinator is gradually increasing. There are some questions I want to ask.
Below are the log and pprof for the data coordinator. Milvus log link: https://sendanywhe.re/L6K266NZ
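For reference, here is a minimal sketch of how such heap profiles can be captured periodically for comparison across pipeline runs. It assumes the datacoord pod exposes the standard Go pprof handlers on its metrics port (9091 by default) and `DATACOORD_HOST` is only a placeholder, not a value from this thread.

```python
# Minimal sketch: download a heap profile from datacoord's HTTP metrics port.
# Assumption: the datacoord pod exposes Go pprof endpoints on its metrics port
# (9091 by default) and is reachable at DATACOORD_HOST (e.g. via port-forward).
import time
import urllib.request

DATACOORD_HOST = "localhost"   # placeholder
METRICS_PORT = 9091            # assumed default Milvus metrics port

def dump_heap_profile(out_dir: str = ".") -> str:
    """Fetch /debug/pprof/heap and save it with a timestamp for later diffing."""
    url = f"http://{DATACOORD_HOST}:{METRICS_PORT}/debug/pprof/heap"
    path = f"{out_dir}/datacoord-heap-{int(time.time())}.pb.gz"
    urllib.request.urlretrieve(url, path)
    return path

if __name__ == "__main__":
    # Take one snapshot per pipeline run, then compare them with `go tool pprof`.
    print(dump_heap_profile())
```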
-
The material you provided is rich; let us first read the log and profile to investigate.
-
Looks like the cluster has:
-
I have another question. To check the question below, I manually triggered an OOM of the active datacoord using the stress-ng command.
But it seems like the two datacoord pods keep switching between active and standby forever and never work normally. Below is the log file.
-
Is there a reason we have so many segments? Did we do a manual flush or call num_entities?
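For context, both of these seal the currently growing segments on the server side, so a pipeline that performs them per batch can inflate the segment count. A minimal pymilvus sketch follows; the connection parameters and collection name are placeholders, and whether `num_entities` also triggers a flush depends on the pymilvus version.

```python
# Minimal sketch: operations that seal segments and can inflate segment count.
# Host/port and collection name are placeholders, not taken from this thread.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
col = Collection("my_collection")

# Explicit flush seals the currently growing segments into new sealed segments.
col.flush()

# In some pymilvus versions, reading num_entities also triggers a flush first,
# so calling it after every insert batch has the same segment-inflating effect.
print(col.num_entities)
```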
-
After reading the log, I noticed some strange points:
Is there another cluster running with the same etcd server?
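One way to check this is to look at the top-level key prefixes in etcd, since each Milvus cluster writes its metadata under its configured rootPath (`by-dev` by default). Below is a minimal sketch using the python-etcd3 client; the endpoint and the prefix interpretation are assumptions, not values from this thread.

```python
# Minimal sketch: see which Milvus rootPath prefixes exist in a shared etcd,
# to spot a second cluster writing into the same etcd server.
# The endpoint and the default rootPath "by-dev" are assumptions.
import etcd3
from collections import Counter

client = etcd3.client(host="localhost", port=2379)

prefixes = Counter()
for _value, meta in client.get_all():
    key = meta.key.decode("utf-8", errors="replace")
    prefixes[key.split("/", 1)[0]] += 1  # first path segment ~= Milvus rootPath

# A single-cluster etcd usually shows one rootPath (e.g. "by-dev");
# multiple rootPaths here suggest multiple clusters sharing this etcd.
for root, count in prefixes.most_common():
    print(f"{root}: {count} keys")
```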
-
From the pprof:
UpdateDropChannelSegmentInfo -> should we do a clone? Also, I believe the question could be:
-
@yiwangdr
-
@xiaofan-luan That looks like the cause. We need logs to confirm it. @Jung-JongHyuk, is it possible for you to get a datacoord log that covers the "drop-then-spike" period (e.g. 03/11 14:00-18:00)? Meanwhile, we will try to reproduce it.
-
Hello, after some trials to reproduce, we found the root cause. The main reason was that the object storage we used was not 100% compatible with S3 / MinIO. When Milvus performs a compaction or drops a collection, the segment files were deleted from object storage, but their directories remained. Those empty directories were included in the scan target of the datacoord garbage collector, so the time spent scanning storage kept increasing, as in the log below.
When we reinstalled Milvus with MinIO as the object storage, both the scanning time and the datacoord memory usage stayed stable.
We will keep watching to check whether there is another leak in other components. Thanks for helping us!
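For anyone hitting the same issue, here is a minimal sketch of how such leftover directory markers can be spotted, using the MinIO Python SDK; the endpoint, credentials, bucket name, and the `files/` root path are placeholders, not values from this thread.

```python
# Minimal sketch: look for leftover "directory" objects under the Milvus root path
# after compaction / drop collection. Endpoint, credentials, bucket name and the
# "files/" root path are placeholders, not values from this thread.
from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

BUCKET = "milvus-bucket"   # placeholder; use the bucket from milvus.yaml (minio.bucketName)
ROOT_PATH = "files/"       # placeholder; use minio.rootPath

total, empty_dirs = 0, 0
for obj in client.list_objects(BUCKET, prefix=ROOT_PATH, recursive=True):
    total += 1
    # Some non-S3-compatible backends keep zero-byte keys ending with "/" after
    # the real segment files are deleted; the datacoord GC still has to list and
    # skip them, so the scan keeps growing over time.
    if obj.object_name.endswith("/") and (obj.size or 0) == 0:
        empty_dirs += 1

print(f"objects listed: {total}, leftover directory markers: {empty_dirs}")
```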
-
We are actually working on improving the scan efficiency. If I understand correctly, you are actually using a file system rather than object storage? We believe it would be great if we could implement an fs chunk manager.