
Memory usage grows but never shrinks when running find command #674

Open
daltschu22 opened this issue Dec 11, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@daltschu22

daltschu22 commented Dec 11, 2023

Mountpoint for Amazon S3 version

1.3.1

AWS Region

us-east-1

Describe the running environment

t3.xlarge EC2

Rocky Linux 8.8

x86

Mounting bucket cross account

Mountpoint options

/usr/bin/mount-s3 --read-only --allow-other --file-mode 0555 --dir-mode 0555

What happened?

Mountpoint memory usage starts small, around 0.1% of system memory when watched like so:

watch -n 3 -d 'ps aux | grep /usr/bin/mount-s3 | awk '\''{sum += $4} END {print sum}'\'
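
For reference, a variant that reports absolute resident memory in MiB rather than a percentage (assuming the process is named mount-s3):

ps -o rss= -C mount-s3 | awk '{s+=$1} END {printf "%.1f MiB\n", s/1024}'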

Running a find across the bucket, the memory usage slowly ticks upwards. After stopping the find, Mountpoint sits at around 5.6% continuously and never releases it back.

This is also observed on another host where the trigger is unclear: we see 18.3 percent memory usage but no active processes using the mount, and the number does not decrease over time.

Relevant log output

No response

daltschu22 added the bug label Dec 11, 2023
@daltschu22
Author

Additionally, we are now seeing issues with the mount getting OOM-killed:

[4150713.051972] Out of memory: Killed process 3302846 (mount-s3) total-vm:14585776kB, anon-rss:11176204kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:23028kB oom_score_adj:0
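
(For anyone else tracking this down: OOM kill events like the one above can be pulled from the kernel log with something like dmesg -T | grep -i 'killed process', or from journalctl -k if the dmesg ring buffer has already rolled over.)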

We see the same behavior as described in #630.

@daltschu22
Author

Additional debug info:

Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: fuse.io_size[type=read]: n=4: min=3184 p10=3199 p50=16511 avg=26494.00 p90=70143 p99=70143 p99.9=70143 max=70143
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: fuse.op_latency_us[op=flush]: n=3: min=14 p10=14 p50=14 avg=15.00 p90=17 p99=17 p99.9=17 max=17
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: fuse.op_latency_us[op=getattr]: n=1: min=63232 p10=63487 p50=63487 avg=63360.00 p90=63487 p99=63487 p99.9=63487 max=63487
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: fuse.op_latency_us[op=lookup]: n=8: min=22912 p10=23039 p50=65023 avg=62632.00 p90=95231 p99=95231 p99.9=95231 max=95231
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: fuse.op_latency_us[op=open]: n=3: min=24448 p10=24575 p50=64255 avg=54037.33 p90=73727 p99=73727 p99.9=73727 max=73727
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: fuse.op_latency_us[op=read]: n=4: min=73728 p10=74239 p50=93183 avg=137536.00 p90=255999 p99=255999 p99.9=255999 max=255999
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: fuse.op_latency_us[op=release]: n=3: min=13 p10=13 p50=17 avg=16.33 p90=19 p99=19 p99.9=19 max=19
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: fuse.total_bytes[type=read]: 105584 (n=4)
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: prefetch.out_of_order: 1
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: prefetch.part_queue_starved_us: n=4: min=73728 p10=74239 p50=93183 avg=137536.00 p90=255999 p99=255999 p99.9=255999 max=255999
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_auto_default_network_io: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_auto_ranged_copy_network_io: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_auto_ranged_get_network_io: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_auto_ranged_put_network_io: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_requests_being_prepared: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_requests_being_processed: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_requests_stream_queued_waiting: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_requests_streaming_response: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.num_total_network_io: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.client.request_queue_size: 0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.failures[op=head_object,status=404]: 2 (n=2)
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.first_byte_latency_us[op=get_object]: n=4: min=71680 p10=72191 p50=91135 avg=135360.00 p90=253951 p99=253951 p99.9=253951 max=253951
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.first_byte_latency_us[op=head_object]: n=12: min=14592 p10=21375 p50=46335 avg=44189.33 p90=65535 p99=93695 p99.9=93695 max=93695
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.first_byte_latency_us[op=list_objects]: n=11: min=15680 p10=18175 p50=60671 avg=54205.09 p90=70143 p99=74751 p99.9=74751 max=74751
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.throughput_mibs[op=get_object,size=1-16MiB]: n=3: min=4 p10=4 p50=8 avg=8.00 p90=12 p99=12 p99.9=12 max=12
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.throughput_mibs[op=get_object,size=<1MiB]: n=1: min=0 p10=0 p50=0 avg=0.00 p90=0 p99=0 p99.9=0 max=0
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.total_latency_us[op=get_object]: n=4: min=71680 p10=72191 p50=91647 avg=135872.00 p90=254975 p99=254975 p99.9=254975 max=254975
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.total_latency_us[op=head_object]: n=12: min=14592 p10=21375 p50=46335 avg=44189.33 p90=65535 p99=93695 p99.9=93695 max=93695
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests.total_latency_us[op=list_objects]: n=11: min=15744 p10=18175 p50=60671 avg=54210.91 p90=70143 p99=74751 p99.9=74751 max=74751
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests[op=get_object]: 4 (n=4)
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests[op=head_object]: 12 (n=12)
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.meta_requests[op=list_objects]: 11 (n=11)
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests.failures[op=head_object,type=Default,status=404]: 2 (n=2)
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests.first_byte_latency_us[op=get_object,type=Default]: n=4: min=68096 p10=68607 p50=79871 avg=121664.00 p90=239615 p99=239615 p99.9=239615 max>
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests.first_byte_latency_us[op=head_object,type=Default]: n=12: min=11392 p10=16767 p50=43519 avg=40701.33 p90=60671 p99=90623 p99.9=90623 max=9>
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests.first_byte_latency_us[op=list_objects,type=Default]: n=11: min=12736 p10=14271 p50=56831 avg=49186.91 p90=64255 p99=70655 p99.9=70655 max=>
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests.total_latency_us[op=get_object,type=Default]: n=4: min=71680 p10=72191 p50=91135 avg=135488.00 p90=253951 p99=253951 p99.9=253951 max=2539>
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests.total_latency_us[op=head_object,type=Default]: n=12: min=14464 p10=21247 p50=46079 avg=44072.00 p90=65535 p99=93695 p99.9=93695 max=93695
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests.total_latency_us[op=list_objects,type=Default]: n=11: min=15680 p10=18175 p50=60671 avg=54181.82 p90=70143 p99=74751 p99.9=74751 max=74751
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests[op=get_object,type=Default]: 4 (n=4)
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests[op=head_object,type=Default]: 12 (n=12)
Dec 11 22:17:12 hostname.corp.local mount-s3[3336253]: [INFO] mountpoint_s3::metrics: s3.requests[op=list_objects,type=Default]: 11 (n=11)

@daltschu22
Author

One behavior I have noticed: under heavy use the memory usage does fluctuate up and down, but as soon as the running job has completed, the process holds on to whatever memory it was using at that point.

It's currently sitting at 19.9441 GB and remains there without reducing.

@passaro
Contributor

passaro commented Dec 13, 2023

Hi @daltschu22, thank you for reporting the issue. Would you be able to share a few more details to help us reproduce it? In particular:

  • what does your find command look like? Is it just looking at file names (or other attributes) or also at file content?
  • could you give a rough estimate of the number of files involved?

@andreyzhelnin-st

andreyzhelnin-st commented Dec 19, 2023

Same issue for me: memory usage can go up to 12-19 GB and never comes down.

@daltschu22
Author

Thanks @passaro

The find is simply find /mountpoint/mount/path, just listing all the contents.

Current object count for the bucket I was testing is 5030736.

@dannycjones
Contributor

Hey @daltschu22, we'd like to investigate a bit more here. We weren't able to reproduce the issue when we last took a look (sorry for not sharing this earlier), but maybe there are some specifics that will help.

What does a typical key look like? Can you share one? For example, a/b/c.txt.

  • Do they all have a similar number of slashes or does it vary?
  • What's a typical length for one of these keys? Does it vary?

dannycjones changed the title from "Memory usage grows but never shrinks" to "Memory usage grows but never shrinks when running find command" Feb 14, 2024
@dannycjones
Contributor

The references to find are a bit confusing - I don't think we've identified the actual trigger for the high usage. For example, I see some metrics related to file reads in the supplied logs.

For the find command, we're interacting with metadata only. Mountpoint builds up a view of the S3 bucket which it then uses for subsequent operations such as metadata lookups, opening files for reading, etc. Mountpoint does some cleanup based on signals from the kernel, but otherwise we don't currently consider this to be a risk to the resources on hosts. If you are seeing high usage due to metadata operations impacting your workload, we really want to learn more so we know whether time needs investing here.
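
If anyone wants to experiment with that cleanup path, asking the kernel to drop its dentry/inode caches should send the corresponding forget signals to Mountpoint, which is a rough way to check how much of the resident memory is reclaimable metadata (note: this affects the whole host and is sketched here purely as a diagnostic):

sync; echo 2 | sudo tee /proc/sys/vm/drop_caches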

Reading from files is where we expect higher memory usage. Mountpoint prefetches a subsection of the file and keeps that ready to be read by applications to allow for performant sequential reads. Each file handle will store that prefetched portion of the file. If you have a use case that is opening a large number of files for reading concurrently, we'd like to learn more to help us prioritize mechanisms or configurations that can best manage memory usage during reading.
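
As a rough way to gauge that concurrency, counting the handles currently open against the mount (assuming lsof is available, and using the example mount path from earlier in the thread) can help:

sudo lsof /mountpoint/mount/path 2>/dev/null | tail -n +2 | wc -l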

Managing memory usage more effectively when reading from files is something we've been thinking about. When we have more information, we'll share more details on GitHub.

@daltschu22
Author

Thanks Danny. To be clear, I don't think this is a find-specific thing. Our users were seeing this behavior during regular workloads, which could be any number of things. I ran a find as an easy way to replicate it, but it's not specific to find.

I understand what you are saying; it is curious that metadata operations alone are triggering this.

We have moved away from relying on Mountpoint for now until it becomes more mature, but let me see if I can get a test instance spun up to replicate this.
I will try to get some examples of the makeup of the buckets.

Thanks for digging into this!

@daltschu22
Author

daltschu22 commented Feb 21, 2024

I can get more detailed info, but as an example, one of the buckets used for testing is 5+ petabytes in size, with 100+ million objects.
