Harvester memory leakage [Bug] #16925

Open
PaulKryptex opened this issue Nov 26, 2023 · 19 comments

Comments

@PaulKryptex

What happened?

The harvester crashes after running for a few days. It looks like a memory leak.

Resource-Exhaustion-Detector:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: start_harvester.exe (5716) consumed 20,333,797,376 bytes

It started after enabling compressed plot farming.

Weakest PC info:
Windows 10, 6 GB RAM, GTX 1660, 23 GB free on C: after reboot

The problem persists on 3 different PCs

Current config:

CHIA_ALERTS_PUBKEY: ***
chia_ssl_ca:
  crt: config/ssl/ca/chia_ca.crt
  key: config/ssl/ca/chia_ca.key
daemon_max_message_size: 50000000
daemon_port: 55400
daemon_ssl:
  private_crt: config/ssl/daemon/private_daemon.crt
  private_key: config/ssl/daemon/private_daemon.key
farmer:
  full_node_peer:
    host: localhost
    port: 8444
  harvester_peer:
    host: localhost
    port: 8448
  logging: &id001
    log_filename: log/debug.log
    log_level: WARNING
    log_maxfilesrotation: 7
    log_stdout: false
    log_syslog: false
    log_syslog_host: localhost
    log_syslog_port: 514
  network_overrides: &id002
    config:
      mainnet:
        address_prefix: xch
        default_full_node_port: 8444
      testnet0:
        address_prefix: txch
      testnet1:
        address_prefix: txch
      testnet2:
        address_prefix: txch
      testnet3:
        address_prefix: txch
      testnet4:
        address_prefix: txch
      testnet7:
        address_prefix: txch
        default_full_node_port: 58444
    constants:
      mainnet:
        GENESIS_CHALLENGE: ***
        GENESIS_PRE_FARM_FARMER_PUZZLE_HASH: ***
        GENESIS_PRE_FARM_POOL_PUZZLE_HASH: ***
        NETWORK_TYPE: 0
      testnet0:
        GENESIS_CHALLENGE: ***
        GENESIS_PRE_FARM_FARMER_PUZZLE_HASH: ***
        GENESIS_PRE_FARM_POOL_PUZZLE_HASH: ***
        MIN_PLOT_SIZE: 18
        NETWORK_TYPE: 1
      testnet2:
        DIFFICULTY_CONSTANT_FACTOR: 10052721566054
        GENESIS_CHALLENGE: ***
        GENESIS_PRE_FARM_FARMER_PUZZLE_HASH: ***
        GENESIS_PRE_FARM_POOL_PUZZLE_HASH: ***
        MIN_PLOT_SIZE: 18
        NETWORK_TYPE: 1
      testnet3:
        DIFFICULTY_CONSTANT_FACTOR: 10052721566054
        GENESIS_CHALLENGE: ***
        GENESIS_PRE_FARM_FARMER_PUZZLE_HASH: ***
        GENESIS_PRE_FARM_POOL_PUZZLE_HASH: ***
        MEMPOOL_BLOCK_BUFFER: 10
        MIN_PLOT_SIZE: 18
        NETWORK_TYPE: 1
      testnet4:
        DIFFICULTY_CONSTANT_FACTOR: 10052721566054
        DIFFICULTY_STARTING: 30
        EPOCH_BLOCKS: 768
        GENESIS_CHALLENGE: ***
        GENESIS_PRE_FARM_FARMER_PUZZLE_HASH: ***
        GENESIS_PRE_FARM_POOL_PUZZLE_HASH: ***
        MEMPOOL_BLOCK_BUFFER: 10
        MIN_PLOT_SIZE: 18
        NETWORK_TYPE: 1
      testnet5:
        DIFFICULTY_CONSTANT_FACTOR: 10052721566054
        DIFFICULTY_STARTING: 30
        EPOCH_BLOCKS: 768
        GENESIS_CHALLENGE: ***
        GENESIS_PRE_FARM_FARMER_PUZZLE_HASH: ***
        GENESIS_PRE_FARM_POOL_PUZZLE_HASH: ***
        MEMPOOL_BLOCK_BUFFER: 10
        MIN_PLOT_SIZE: 18
        NETWORK_TYPE: 1
      testnet7:
        DIFFICULTY_CONSTANT_FACTOR: 10052721566054
        DIFFICULTY_STARTING: 30
        EPOCH_BLOCKS: 768
        GENESIS_CHALLENGE: ***
        GENESIS_PRE_FARM_FARMER_PUZZLE_HASH: ***
        GENESIS_PRE_FARM_POOL_PUZZLE_HASH: ***
        MEMPOOL_BLOCK_BUFFER: 50
        MIN_PLOT_SIZE: 18
        NETWORK_TYPE: 1
  pool_public_keys: !!set
    ***: null
  pool_share_threshold: 1000
  port: 8447
  rpc_port: 8559
  selected_network: mainnet
  ssl:
    private_crt: config/ssl/farmer/private_farmer.crt
    private_key: config/ssl/farmer/private_farmer.key
    public_crt: config/ssl/farmer/public_farmer.crt
    public_key: config/ssl/farmer/public_farmer.key
  start_rpc_server: true
  xch_target_address: ***
full_node:
  database_path: db/blockchain_v2_CHALLENGE.sqlite
  db_sync: auto
  dns_servers:
  - dns-introducer.chia.net
  enable_profiler: false
  enable_upnp: true
  exempt_peer_networks: []
  farmer_peer:
    host: localhost
    port: 8447
  introducer_peer:
    host: introducer.chia.net
    port: 8444
  log_sqlite_cmds: false
  logging: *id001
  max_inbound_farmer: 10
  max_inbound_timelord: 5
  max_inbound_wallet: 20
  network_overrides: *id002
  peer_connect_interval: 30
  peer_connect_timeout: 30
  peer_db_path: db/peer_table_node.sqlite
  port: 8444
  recent_peer_threshold: 6000
  rpc_port: 8555
  sanitize_weight_proof_only: false
  selected_network: mainnet
  send_uncompact_interval: 0
  short_sync_blocks_behind_threshold: 20
  simulator_database_path: sim_db/simulator_blockchain_v1_CHALLENGE.sqlite
  simulator_peer_db_path: sim_db/peer_table_node.sqlite
  ssl:
    private_crt: config/ssl/full_node/private_full_node.crt
    private_key: config/ssl/full_node/private_full_node.key
    public_crt: config/ssl/full_node/public_full_node.crt
    public_key: config/ssl/full_node/public_full_node.key
  start_rpc_server: true
  sync_blocks_behind_threshold: 300
  target_outbound_peer_count: 8
  target_peer_count: 80
  target_uncompact_proofs: 100
  timelord_peer:
    host: localhost
    port: 8446
  wallet_peer:
    host: localhost
    port: 8449
  weight_proof_timeout: 360
harvester:
  chia_ssl_ca:
    crt: config/ssl/ca/chia_ca.crt
    key: config/ssl/ca/chia_ca.key
  decompressor_thread_count: 1
  farmer_peer:
    host: localhost
    port: 8447
  logging: *id001
  network_overrides: *id002
  num_threads: 30
  parallel_decompressor_count: 1
  parallel_read: true
  plot_directories:
  - D:\
  - G:\
  - C:\
  - E:\
  plots_refresh_parameter:
    batch_size: 300
    batch_sleep_milliseconds: 1
    interval_seconds: 120
    retry_invalid_seconds: 1200
  port: 8448
  private_ssl_ca:
    crt: config/ssl/ca/private_ca.crt
    key: config/ssl/ca/private_ca.key
  rpc_port: 8560
  selected_network: mainnet
  ssl:
    private_crt: config/ssl/harvester/private_harvester.crt
    private_key: config/ssl/harvester/private_harvester.key
  start_rpc_server: true
inbound_rate_limit_percent: 100
introducer:
  host: localhost
  logging: *id001
  max_peers_to_send: 20
  network_overrides: *id002
  port: 8445
  recent_peer_threshold: 6000
  selected_network: mainnet
  ssl:
    public_crt: config/ssl/full_node/public_full_node.crt
    public_key: config/ssl/full_node/public_full_node.key
logging: *id001
min_mainnet_k_size: 32
network_overrides: *id002
outbound_rate_limit_percent: 30
ping_interval: 120
pool:
  logging: *id001
  network_overrides: *id002
  pool_list:
  - authentication_public_key: ***
    launcher_id: '***'
    owner_public_key: '***'
    p2_singleton_puzzle_hash: '***'
    payout_instructions: ***
    pool_url: https://pool.findchia.com
    target_puzzle_hash: '***'
  - authentication_public_key: ***
    launcher_id: '***'
    owner_public_key: '***'
    p2_singleton_puzzle_hash: '***'
    payout_instructions: ***
    pool_url: ''
    target_puzzle_hash: '***'
  selected_network: mainnet
  xch_target_address: ***
private_ssl_ca:
  crt: config/ssl/ca/private_ca.crt
  key: config/ssl/ca/private_ca.key
selected_network: mainnet
self_hostname: localhost
timelord:
  fast_algorithm: false
  full_node_peer:
    host: localhost
    port: 8444
  logging: *id001
  max_connection_time: 60
  network_overrides: *id002
  port: 8446
  sanitizer_mode: false
  selected_network: mainnet
  ssl:
    private_crt: config/ssl/timelord/private_timelord.crt
    private_key: config/ssl/timelord/private_timelord.key
    public_crt: config/ssl/timelord/public_timelord.crt
    public_key: config/ssl/timelord/public_timelord.key
  vdf_clients:
    ip:
    - localhost
    - localhost
    - 127.0.0.1
    ips_estimate:
    - 150000
  vdf_server:
    host: localhost
    port: 8000
timelord_launcher:
  host: localhost
  logging: *id001
  port: 8000
  process_count: 3
ui:
  daemon_host: localhost
  daemon_port: 55400
  daemon_ssl:
    private_crt: config/ssl/daemon/private_daemon.crt
    private_key: config/ssl/daemon/private_daemon.key
  logging: *id001
  network_overrides: *id002
  port: 8222
  rpc_port: 8555
  selected_network: mainnet
  ssh_filename: config/ssh_host_key
wallet:
  database_path: wallet/db/blockchain_wallet_v1_CHALLENGE_KEY.sqlite
  db_sync: auto
  enable_profiler: false
  full_node_peer:
    host: localhost
    port: 8444
  initial_num_public_keys: 100
  initial_num_public_keys_new_wallet: 5
  introducer_peer:
    host: introducer.chia.net
    port: 8444
  logging: *id001
  network_overrides: *id002
  num_sync_batches: 50
  peer_connect_interval: 60
  port: 8449
  recent_peer_threshold: 6000
  rpc_port: 9256
  selected_network: mainnet
  short_sync_blocks_behind_threshold: 20
  ssl:
    private_crt: config/ssl/wallet/private_wallet.crt
    private_key: config/ssl/wallet/private_wallet.key
    public_crt: config/ssl/wallet/public_wallet.crt
    public_key: config/ssl/wallet/public_wallet.key
  start_height_buffer: 100
  starting_height: 0
  target_peer_count: 5
  testing: false
  trusted_peers:
    trusted_node_1: config/ssl/full_node/public_full_node.crt
  wallet_peers_path: wallet/db/wallet_peers.sqlite

Version

2.1.1

What platform are you using?

Windows

What ui mode are you using?

GUI

Relevant log output

No response

@PaulKryptex PaulKryptex added the bug Something isn't working label Nov 26, 2023
@wjblanke
Contributor

Can you provide logs? Does this happen consistently? Are all plots healthy (no lookup failures due to bad drives, etc.)?

@harold-b
Contributor

In addition to the log, can you give us a bit more info about your farm as well?

How many plots, compression levels, if you have a mix of classic (non-compressed) and compressed, etc.

@PaulKryptex
Author

  1. Two logs attached: log.zip is from the PC running the full node, and log2.zip is from a harvester.
    log.zip
    log2.zip
  2. No faulty drives right now.
  3. It is a mix of different plots. I was replotting to compressed plots when the issue appeared.

For example, on the PC with the full node:

K32: 116 plots (92%)
  Compression:
  C0: 51 (44%)
  C3: 48 (41%)
  C4: 17 (15%)

K33: 10 plots (8%)
  Compression:
  C0: 10 (100%)

The total is about 60 TiB, distributed across three PCs (one full node plus two harvesters). I am experiencing the same issue on all three. One of the harvesters has 32 GB of RAM. The other PCs have 6 GB.

@karlestira

karlestira commented Dec 13, 2023

Same problem when farming compressed plots on Windows 10.
Once farming starts, committed memory increases (about 0.1 GB/min), and it is not released when farming stops.
Both GUI and CLI modes cause this memory leak. I'm sure GPU plotting causes the leak, and CPU plotting may also cause it.
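
To put the reported growth rate in perspective, here is a rough back-of-the-envelope sketch. It assumes the roughly 0.1 GB/min figure stays constant, which nobody in this thread has confirmed. The function name and the 20 GB headroom are made up for illustration.

```python
def hours_until_exhausted(headroom_gb: float, growth_gb_per_min: float) -> float:
    """Hours until `headroom_gb` of commit space is used up at a constant growth rate."""
    if growth_gb_per_min <= 0:
        raise ValueError("growth rate must be positive")
    return headroom_gb / growth_gb_per_min / 60.0


# Rate taken from the comment above; the 20 GB headroom is a hypothetical value.
rate = 0.1
print(f"growth per hour: {rate * 60:.1f} GB")
print(f"hours to fill 20 GB: {hours_until_exhausted(20, rate):.2f}")
```

If the real rate is lower or varies over time, the time to exhaustion stretches accordingly, which would be consistent with other reports here of crashes only after a day or more.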

@JamesKrolak

I'm experiencing this same issue, but only with my remote harvester, and I have a large footprint. It's a Dell server with 2x 8-core 2.1 GHz CPUs and 384 GB of RAM, with 40,708 C05 plots. It doesn't seem to matter what I set the swapfile to. After less than a day, I get the virtual memory exhaustion warning somewhere in the System log and the node stops providing any plots to my main farmer. start_harvester.exe continues to run most of the time; it just does nothing. So my PowerShell script that monitors it doesn't trigger an alert.

I had previously run all of these plots off of my main harvester, which also had about 37,000 more plots. It didn't have virtual memory issues like this. It did have rather high harvester latency, which is why I moved half of them to a remote harvester.

@JamesKrolak

I found the solution to this problem! At least, this fixed it on both my farmer and my remote harvester. It looks like I had encountered this problem on my original farmer 6 months ago, but had forgotten what the fix was. I had modified a PowerShell script that I'd found online to fix it. After running this, I've not had any low virtual memory errors.

The issue is that Windows Defender keeps trying to scan or block some of the files involved. (I'm not sure if it's the executables or the plots.) I assume that it gets in the middle of the plot checks, and the harvester has to abandon them and redo the checks but doesn't release the memory used. Over time, this creates a problem on the system.

So, I modified this script to exclude the chia directories, the key executables, and all .plot files. For some reason, I can't find where I originally got the PowerShell script off the web. Attached is my version if anyone else wants to use it. Note that my file has a .txt extension, but you'll need to change that to .ps1 and run it within PowerShell.
Defender_Exclusions.txt
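
For readers who cannot open the attachment, the idea can be sketched roughly as follows. This is not the attached script. It is a minimal illustration that builds `Add-MpPreference` commands for directory, process, and extension exclusions. The directory paths are hypothetical placeholders, and `start_harvester.exe` is used only because it is the process named earlier in this thread.

```python
import subprocess

# Hypothetical values: adjust to your own install location and plot drives.
EXCLUDED_DIRS = [r"C:\Users\me\.chia", r"D:\plots"]
EXCLUDED_PROCESSES = ["start_harvester.exe"]
EXCLUDED_EXTENSIONS = [".plot"]


def build_exclusion_commands(dirs, processes, extensions):
    """Return one PowerShell Add-MpPreference command per exclusion."""
    def quote(value):
        # PowerShell single-quoted string; embedded quotes are doubled.
        return "'" + value.replace("'", "''") + "'"

    commands = [f"Add-MpPreference -ExclusionPath {quote(d)}" for d in dirs]
    commands += [f"Add-MpPreference -ExclusionProcess {quote(p)}" for p in processes]
    commands += [f"Add-MpPreference -ExclusionExtension {quote(e)}" for e in extensions]
    return commands


def apply_exclusions(commands):
    """Run each command in PowerShell; needs an elevated (administrator) session."""
    for command in commands:
        subprocess.run(["powershell", "-NoProfile", "-Command", command], check=True)


# Print only; call apply_exclusions(...) yourself to actually change Defender settings.
for command in build_exclusion_commands(
    EXCLUDED_DIRS, EXCLUDED_PROCESSES, EXCLUDED_EXTENSIONS
):
    print(command)
```

Printing the commands first makes it easy to review exactly what would be excluded before anything is applied.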

@karlestira

karlestira commented Jan 9, 2024

Thank you for your solution, but I always disable MSWD (fxxk MSWD) as soon as Windows is installed, so I don't think this will solve my problem. I will try it anyway. And I will want to beat the shit out of MSWD if the problem is truly caused by it.

My workaround is to use the GUI instead of the CLI. The GUI farmer still leaks, but for some reason memory usage levels off when it gets close to the maximum virtual memory and stops growing, even if you close the GUI and restart it without restarting the OS. However, the GUI starts acting strangely after running for a long time (about 2 weeks; this is a long-standing problem that existed before the compressed-plot version of chia), and then you need to restart it.

With the CLI, though, memory usage grows each time you call a chia process (like "chia farm summary" when you want to check the farmer), and this eventually leads to OOM.

My farmer server only runs chia and uTorrent (for PT downloads, very little memory used), so I can tolerate a memory leak that has an upper limit, and for now it works well. However, I would still like to see a final fix for this problem.

@JamesKrolak

Darn. I hoped that what I'd found would help someone else. Are you running any other antivirus or anti-malware on your system?

I experienced the OOM error even when I ran the full node GUI on this system, as well--though not as often as when running only the harvester. The Windows Defender exclusions fixed it both for the full GUI and the harvester.

@karlestira

Maybe MSWD did not actually shut down. I will check it later.

@PaulKryptex
Author

My workaround is a daily harvester reboot using Task Scheduler. The memory leak is not very fast. But it is not an elegant solution, of course.
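
A scheduled restart like this could be set up along the following lines. This is only a sketch: the task name is made up, and `chia start -r harvester` assumes the chia CLI is on PATH and that restarting the harvester service (rather than the whole PC) is enough to release the memory.

```python
import subprocess


def build_daily_task(task_name: str, command: str, start_time: str = "04:00") -> list:
    """Build a schtasks command line that runs `command` once a day at `start_time`."""
    return [
        "schtasks", "/Create",
        "/TN", task_name,    # task name shown in Task Scheduler
        "/TR", command,      # command the task runs
        "/SC", "DAILY",      # schedule type
        "/ST", start_time,   # start time, 24-hour HH:MM
        "/F",                # overwrite an existing task with the same name
    ]


def register_task(args: list) -> None:
    """Create the task for real (Windows only)."""
    subprocess.run(args, check=True)


# Hypothetical task name and restart command; nothing is registered until
# register_task(args) is called.
args = build_daily_task("RestartChiaHarvester", "chia start -r harvester")
print(" ".join(args))
```

Swapping the command for `shutdown /r /t 0` would reboot the whole machine instead, if restarting only the service turns out not to be sufficient.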

@Egorgod123

I have a similar problem, but I don't use compressed plots. The swap file grows at a high rate, and I also have to reboot the PC daily.
20db7e1e-6953-4266-a9bf-21a9f416dca0
b620760a-d46b-409d-a9fd-98ec8f015068

@wjblanke
Contributor

Can you guys make sure Windows Defender and other antivirus software are out of the picture (use exclusions or, worst case, disable them)? Plots are big files, so I can see how they might cause issues with scanners and memory usage. Let us know if you can diagnose the issue further. It seems like this is system specific, since it happens only on certain machines.

@Egorgod123

I've tried various solutions. I turned off the antivirus, added exclusions, uninstalled antivirus software, and disabled various background applications, but there is still a leak when the harvester is running. If chia is not started, there is no leak. I don't understand what the reason might be.

@wjblanke
Contributor

Can you check the task list and see if the chia harvester is allocating the memory or if it is some other process? We appreciate the screenshots, and I've been using Google Translate on my phone to look at them, but we don't see the process name listed on the memory displays. Thanks for your report! The fact that you are not using compressed plotting greatly reduces the search surface, so we would be interested in hearing your results. Also, which patched version of Windows 10 are you running? Which CPU? We still think this is machine specific.
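
One way to get a per-process view with names attached is sketched below. It assumes PowerShell's `Get-Process` is available and treats `PrivateMemorySize64` as a reasonable proxy for committed memory. The sample data is made up and only shows the shape of the output.

```python
import json
import subprocess

# PowerShell pipeline that reports per-process memory counters as JSON.
# PrivateMemorySize64 is private committed memory; WorkingSet64 is what is resident in RAM.
PS_COMMAND = (
    "Get-Process | Select-Object Name, Id, PrivateMemorySize64, WorkingSet64 "
    "| ConvertTo-Json"
)


def parse_processes(raw_json: str) -> list:
    """Parse ConvertTo-Json output and sort by private bytes, largest first."""
    data = json.loads(raw_json)
    if isinstance(data, dict):  # ConvertTo-Json emits a bare object for a single process
        data = [data]
    return sorted(data, key=lambda p: p.get("PrivateMemorySize64") or 0, reverse=True)


def top_committers(limit: int = 5) -> list:
    """Query the local machine (Windows only) and return the biggest committers."""
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", PS_COMMAND],
        capture_output=True, text=True, check=True,
    )
    return parse_processes(result.stdout)[:limit]


# Made-up sample data, not real measurements.
sample = json.dumps([
    {"Name": "explorer", "Id": 100,
     "PrivateMemorySize64": 80_000_000, "WorkingSet64": 120_000_000},
    {"Name": "start_harvester", "Id": 5716,
     "PrivateMemorySize64": 19_000_000_000, "WorkingSet64": 130_000_000},
])
for proc in parse_processes(sample):
    print(f'{proc["Name"]:<16} private={proc["PrivateMemorySize64"] / 2**30:6.2f} GiB')
```

Running `top_committers()` a few hours apart and comparing the results would show whether the growth belongs to the harvester or to something else.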

@Egorgod123

Egorgod123 commented Feb 24, 2024

I am attaching a screenshot, but it is not informative enough. It does not show how much memory is allocated. When chia is not running, the allocated memory does not grow, but as soon as chia is started, it begins to grow gradually. Interesting fact: yesterday the chia harvester was running all day, but at the end of the day I noticed that the allocated memory had not increased. There was no leak all day, and I did nothing special. The leak started again today. Windows 11 Pro 23H2, Ryzen 5700X processor.
2024-02-24_12-49-40

@PaulKryptex
Author

The resource monitor data might be more helpful.

Computer 1
start_harvester.exe
Commit set 19331152 KB (!!!)
Working set 135508 KB
Shareable 4652 KB
Private 129736 KB
Windows version 21H1 build 19043.1320
Capture

Computer 2
start_harvester.exe
Commit set 4094628 KB
Working set 26840 KB
Shareable 4796 KB
Private 22044 KB
Windows version 22H2 build 19045.4046
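
To make the gap between commit and working set easier to see, here is a small arithmetic sketch using the figures above. It assumes the KB values Resource Monitor reports are multiples of 1024 bytes.

```python
KIB_PER_GIB = 1024 * 1024

# Figures copied from the Resource Monitor readings above (values in KB).
readings = {
    "Computer 1": {"commit_kb": 19_331_152, "working_set_kb": 135_508},
    "Computer 2": {"commit_kb": 4_094_628, "working_set_kb": 26_840},
}


def summarize(commit_kb: int, working_set_kb: int) -> tuple:
    """Return (commit in GiB, commit-to-working-set ratio)."""
    return commit_kb / KIB_PER_GIB, commit_kb / working_set_kb


for name, r in readings.items():
    commit_gib, ratio = summarize(r["commit_kb"], r["working_set_kb"])
    print(f"{name}: commit {commit_gib:.2f} GiB, about {ratio:.0f}x the working set")
```

On both machines the committed size is over a hundred times the working set, which suggests the growth is in memory that is committed but not resident in RAM. That would fit a low virtual memory warning rather than ordinary RAM pressure, though it does not by itself identify the cause.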

@Egorgod123

Egorgod123 commented Feb 28, 2024

I continue to study this bug. I left the PC to farm chia overnight. The following indicators increased overnight: Committed memory, cache memory and paged pool.

2024-02-28_08-47-51

I looked at it in RAMMap, and I think chia is not handling the plot files and the blockchain file itself correctly. As I understand it, Windows caches them in RAM.

2024-02-28_08-44-44
2024-02-28_08-47-10

In RAMMap, you can clear this cache by clicking "Empty Standby List". The memory is cleared instantly and there is no need to restart the PC.

2024-02-28_08-48-33
2024-02-28_08-49-55

There is even a small program on GitHub that can do this. You can configure it to run from the Windows Task Scheduler, for example, once an hour. Apparently we can only wait for the developers to fix it.
redacted

@PaulKryptex
Author

Empty standby list only cleans up RAM. It does not help with a huge swap.

@Egorgod123

Yes, it really only cleans up RAM, but the swap file grows exactly when RAM fills up. After I added EmptyStandbyList to the Windows Task Scheduler to run every hour, the swap size began to decrease gradually (after several forced PC reboots over the course of several days), and now it is only 2 GB.
https://github.com/stefanpejcic/EmptyStandbyList/blob/master/EmptyStandbyList.exe
https://www.youtube.com/watch?v=uiZ9f5Eo9Js
