memory usage of data coordinator keeps increasing #31226
-
Hello, I have a question about the memory usage of the data coordinator. We are running a data pipeline for Milvus (v2.3.10) that repeats the process below.
P.S. The segment count of a single collection is about 1700 ~ 1800. The data pipeline starts every day at 4 AM, and after repeating this for a week, it seems like the memory usage of the data coordinator is gradually increasing. There are some questions I want to ask.
Below are the log and pprof for the data coordinator. Milvus log link: https://sendanywhe.re/L6K266NZ
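For reference, here is a minimal sketch of how such heap profiles can be captured periodically for comparison across pipeline runs. It assumes the datacoord pod exposes the standard Go pprof handlers on its metrics port (9091 by default) and `DATACOORD_HOST` is only a placeholder, not a value from this thread.

```python
# Minimal sketch: download a heap profile from datacoord's HTTP metrics port.
# Assumption: the datacoord pod exposes Go pprof endpoints on its metrics port
# (9091 by default) and is reachable at DATACOORD_HOST (e.g. via port-forward).
import time
import urllib.request

DATACOORD_HOST = "localhost"   # placeholder
METRICS_PORT = 9091            # assumed default Milvus metrics port

def dump_heap_profile(out_dir: str = ".") -> str:
    """Fetch /debug/pprof/heap and save it with a timestamp for later diffing."""
    url = f"http://{DATACOORD_HOST}:{METRICS_PORT}/debug/pprof/heap"
    path = f"{out_dir}/datacoord-heap-{int(time.time())}.pb.gz"
    urllib.request.urlretrieve(url, path)
    return path

if __name__ == "__main__":
    # Take one snapshot per pipeline run, then compare them with `go tool pprof`.
    print(dump_heap_profile())
```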
-
The material you provided is rich; let us first read the log and profile to investigate.
-
Looks like the cluster has:
-
I have another question. To check the question below, I manually triggered an OOM of the active datacoord using the stress-ng command.
But it seems like the two datacoord pods keep switching between active and standby forever and never work normally. Below is the log file.
-
Is there a reason we have so many segments? Did we do a manual flush or call num_entities?
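For context, both of these seal the currently growing segments on the server side, so a pipeline that performs them per batch can inflate the segment count. A minimal pymilvus sketch follows; the connection parameters and collection name are placeholders, and whether `num_entities` also triggers a flush depends on the pymilvus version.

```python
# Minimal sketch: operations that seal segments and can inflate segment count.
# Host/port and collection name are placeholders, not taken from this thread.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
col = Collection("my_collection")

# Explicit flush seals the currently growing segments into new sealed segments.
col.flush()

# In some pymilvus versions, reading num_entities also triggers a flush first,
# so calling it after every insert batch has the same segment-inflating effect.
print(col.num_entities)
```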
-
After reading the log, I noticed some strange points:
Is there another cluster running with the same etcd server?
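One way to check this is to look at the top-level key prefixes in etcd, since each Milvus cluster writes its metadata under its configured rootPath (`by-dev` by default). Below is a minimal sketch using the python-etcd3 client; the endpoint and the prefix interpretation are assumptions, not values from this thread.

```python
# Minimal sketch: see which Milvus rootPath prefixes exist in a shared etcd,
# to spot a second cluster writing into the same etcd server.
# The endpoint and the default rootPath "by-dev" are assumptions.
import etcd3
from collections import Counter

client = etcd3.client(host="localhost", port=2379)

prefixes = Counter()
for _value, meta in client.get_all():
    key = meta.key.decode("utf-8", errors="replace")
    prefixes[key.split("/", 1)[0]] += 1  # first path segment ~= Milvus rootPath

# A single-cluster etcd usually shows one rootPath (e.g. "by-dev");
# multiple rootPaths here suggest multiple clusters sharing this etcd.
for root, count in prefixes.most_common():
    print(f"{root}: {count} keys")
```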
-
From the pprof:
UpdateDropChannelSegmentInfo -> should we do a clone? Also, I believe the question could be:
-
@yiwangdr
-
@xiaofan-luan That looks like the cause. We need logs to confirm it. @Jung-JongHyuk, is it possible for you to get a datacoord log that covers the "drop-then-spike" period (e.g. 03/11 14:00-18:00)? Meanwhile, we will try to reproduce it.
-
Hello, after some trials to reproduce, we found the root cause. The main reason was that the object storage we used was not 100% compatible with S3 / MinIO. When Milvus performs a compaction or drops a collection, the segment files were deleted from object storage, but their directories remained. Those empty directories were included in the scan target of the datacoord garbage collector, so the time spent scanning storage kept increasing, as in the log below.
When we reinstalled Milvus with MinIO as the object storage, both the scanning time and the datacoord memory usage stayed stable.
We will keep watching to check whether there is another leak in other components. Thanks for helping us!
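For anyone hitting the same issue, here is a minimal sketch of how such leftover directory markers can be spotted, using the MinIO Python SDK; the endpoint, credentials, bucket name, and the `files/` root path are placeholders, not values from this thread.

```python
# Minimal sketch: look for leftover "directory" objects under the Milvus root path
# after compaction / drop collection. Endpoint, credentials, bucket name and the
# "files/" root path are placeholders, not values from this thread.
from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

BUCKET = "milvus-bucket"   # placeholder; use the bucket from milvus.yaml (minio.bucketName)
ROOT_PATH = "files/"       # placeholder; use minio.rootPath

total, empty_dirs = 0, 0
for obj in client.list_objects(BUCKET, prefix=ROOT_PATH, recursive=True):
    total += 1
    # Some non-S3-compatible backends keep zero-byte keys ending with "/" after
    # the real segment files are deleted; the datacoord GC still has to list and
    # skip them, so the scan keeps growing over time.
    if obj.object_name.endswith("/") and (obj.size or 0) == 0:
        empty_dirs += 1

print(f"objects listed: {total}, leftover directory markers: {empty_dirs}")
```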
-
We are actually working on improving the scan efficiency. If I understand correctly, you are actually using a file system rather than object storage? We believe it would be great if we could implement an fs chunk manager.