
[BUG] Farmer faulted 2 times #500

Closed
etemiz opened this issue Oct 30, 2020 · 37 comments
Labels
bug Something isn't working

Comments

@etemiz

etemiz commented Oct 30, 2020

Describe the bug
Farmer has been unstable and closes abruptly.

To Reproduce
Steps to reproduce the behavior:

  1. npm run electron &
  2. cat ~/.chia/.../log/debug.log

Expected behavior
Run indefinitely

Screenshots

Screenshot from 2020-10-29 19-37-44
Screenshot from 2020-10-30 14-18-27

Desktop (please complete the following information):

  • OS: Ubuntu Desktop
  • OS Version/Flavor: 18.04.2
  • CPU: TR 1950x


@etemiz etemiz added the bug Something isn't working label Oct 30, 2020
@hoffmang9
Member

I want to confirm two things:

  1. This is Beta 17?
  2. You're using both a remote and local harvester?

@etemiz
Author

etemiz commented Oct 30, 2020

Yes, beta 17. Yes, I have 1 local harvester, 5 remote harvesters, and 1 local farmer.

@mmulsow

mmulsow commented Oct 31, 2020

I hit a similar issue with my farmer crashing unexpectedly. I have two nodes that are each running their own full node, farmer, and harvester. The full node and harvester continued to run but the farmer on both machines died after several days of running fine.

The farmer seems to have died a while back and I didn't immediately notice. All messages from the farmer disappeared after the crash:

$ grep -c farmer .chia/beta-1.0b17/log/debug.log.*
.chia/beta-1.0b17/log/debug.log.1:0
.chia/beta-1.0b17/log/debug.log.2:0
.chia/beta-1.0b17/log/debug.log.3:0
.chia/beta-1.0b17/log/debug.log.4:14575
.chia/beta-1.0b17/log/debug.log.5:15057
.chia/beta-1.0b17/log/debug.log.6:11195

And here are the last mentions of the string farmer in debug.log.4 on the first machine:

05:43:48.318 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('119.253.42.49', 60355)
05:43:48.318 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('108.211.109.173', 8444)
05:43:48.511 farmer farmer_server              : INFO     -> header_signature to peer ('127.0.0.1', 8444)
05:43:48.512 farmer farmer_server              : INFO     <- respond_signature from peer ('127.0.0.1', 45136)
05:43:48.671 full_node full_node_server        : INFO     <- header_signature from peer ('127.0.0.1', 38466)
05:43:48.673 full_node full_node_server        : INFO     <- respond_unfinished_block from peer ('24.186.76.230', 8444)
05:43:48.676 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('85.6.133.56', 8444)
05:43:48.677 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('85.149.90.80', 8444)
05:43:48.678 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('173.28.194.176', 55229)
05:43:48.678 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('12.181.169.147', 39878)
05:43:50.118 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('12.181.169.147', 39878)
05:43:50.119 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('108.211.109.173', 8444)
05:43:50.120 full_node full_node_server        : INFO     <- ping from peer ('108.211.109.173', 8444)
05:43:50.121 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('78.183.106.140', 52854)
05:43:50.122 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('85.6.133.56', 8444)
05:43:50.123 full_node full_node_server        : INFO     <- request_unfinished_block from peer ('24.186.76.230', 8444)
05:43:50.124 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('119.253.42.49', 60355)
05:43:50.125 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('108.211.109.173', 8444)
05:43:50.126 full_node full_node_server        : INFO     -> pong to peer ('108.211.109.173', 8444)
05:43:51.576 full_node src.full_node.full_node : INFO     New best unfinished block at height 16155
05:43:51.581 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('85.6.133.56', 8444)
05:43:51.582 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('85.149.90.80', 8444)
05:43:51.582 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('119.253.42.49', 60355)
05:43:51.583 full_node full_node_server        : INFO     <- new_unfinished_block from peer ('78.183.106.140', 52854)
05:43:51.584 full_node full_node_server        : INFO     -> respond_unfinished_block to peer ('173.28.194.176', 55229)
05:43:51.586 full_node full_node_server        : INFO     -> request_unfinished_block to peer ('174.89.217.214', 8444)
05:43:52.074 harvester harvester_server        : INFO     Received EOF from ('127.0.0.1', 8447), closing connection.
05:43:52.078 daemon src.daemon.server          : INFO     Websocket exception. Closing websocket with chia_farmer code = 1006 (connection closed abnormally [internal]), no reason Traceback (most recent call last):
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 827, in transfer_data
    message = await self.read_message()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 895, in read_message
    frame = await self.read_data_frame(max_size=self.max_size)
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 971, in read_data_frame
    frame = await self.read_frame(max_size)
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 1047, in read_frame
    frame = await Frame.read(
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/framing.py", line 105, in read
    data = await reader(2)
  File "/usr/lib/python3.8/asyncio/streams.py", line 721, in readexactly
    raise exceptions.IncompleteReadError(incomplete, n)
asyncio.exceptions.IncompleteReadError: 0 bytes read on a total of 2 expected bytes

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/chia-blockchain/src/daemon/server.py", line 132, in safe_handle
    async for message in websocket:
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 439, in __aiter__
    yield await self.recv()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 509, in recv
    await self.ensure_open()
  File "/home/ubuntu/chia-blockchain/venv/lib/python3.8/site-packages/websockets/protocol.py", line 803, in ensure_open
    raise self.connection_closed_exc()
websockets.exceptions.ConnectionClosedError: code = 1006 (connection closed abnormally [internal]), no reason

05:43:52.280 full_node src.full_node.full_node : INFO     Block is slow, expected 10717 seconds, waiting
05:43:52.280 full_node src.full_node.full_node : INFO     New best unfinished block at height 16155
05:43:52.281 full_node src.full_node.full_node : INFO     VDF will finish too late 47324 seconds, so don't propagate
05:43:52.282 full_node src.full_node.full_node : INFO     VDF will finish too late 9565 seconds, so don't propagate
05:43:52.283 full_node full_node_server        : INFO     Received EOF from ('127.0.0.1', 38466), closing connection.

The same error message appears on the second machine, but it looks like it happened a couple of hours earlier, at 03:33:42.197, alongside New best unfinished block at height 16125.

@mmulsow

mmulsow commented Nov 1, 2020

Found this in dmesg on both machines:

Machine 1:

[309239.876342] show_signal_msg: 18 callbacks suppressed
[309239.876345] chia_farmer[45874]: segfault at 0 ip 00007ffb795f3293 sp 00007ffd5307ab00 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7ffb795a1000+106000]
[309239.876354] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

Machine 2:

[302029.128623] show_signal_msg: 18 callbacks suppressed
[302029.128625] chia_farmer[4407]: segfault at 0 ip 00007f9903ec58b8 sp 00007ffcabd63190 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f9903e73000+106000]
[302029.128632] Code: 81 c4 a8 01 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 48 8b 44 24 30 48 8b 4c 24 28 ba 01 00 00 00 4c 89 e7 41 b8 20 00 00 00 <c6> 00 01 48 89 c6 e8 4d 3a fd ff 44 89 fa 41 83 ff 08 0f 83 b0 00

@ToovaTech

I've had the same farming failure described in this issue. I had 3 machines on b15: Machine A ran the full stack, Machine B ran the full stack, and Machine C (a Pi) ran just a harvester pointed at Machine B. This setup ran perfectly for a long time. I upgraded all 3 machines to b17 and wanted to check the full stack on the Pi, so I had the full stack running there; my config for the Pi, however, was still pointing to Machine B. Since then I've had several farming failures. I've shut down the Pi farmer, now have 2 machines running the full stack, and will monitor for failures.

@ghost

ghost commented Nov 2, 2020

Same issue with my setup.
1 Farmer (running GUI)
Multiple Harvesters
All on B17

These are the last few entries in dmesg on the farmer:
[182826.347682] chia_farmer[496215]: segfault at 0 ip 00007f7580a03293 sp 00007fff05031bc0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f75809b1000+106000]
[182826.347693] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[193835.893856] chia_farmer[719983]: segfault at 0 ip 00007fba71efb293 sp 00007ffcd6c09250 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7fba71ea9000+106000]
[193835.893863] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[212038.981592] perf: interrupt took too long (4901 > 4900), lowering kernel.perf_event_max_sample_rate to 40750
[239036.993418] chia_farmer[939247]: segfault at 0 ip 00007faafcbb3293 sp 00007ffff114d5d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7faafcb61000+106000]
[239036.993424] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

@ghost

ghost commented Nov 2, 2020

I ran "chia start farmer" on the latest run, instead of using the GUI. Same issue.

Not sure if it has any value, but these are the chia processes still running (and defunct):

(venv) user1@farmer01:~$ ps -ef |grep chia
user1 1162542 1 0 13:15 pts/2 00:00:06 /home/user1/chia-blockchain/venv/bin/python3.8 /home/user1/chia-blockchain/venv/bin/chia run_daemon
user1 1162665 1162542 0 13:15 pts/2 00:01:02 chia_harvester
user1 1162666 1162542 10 13:15 pts/2 00:28:08 [chia_farmer] <defunct>
user1 1162667 1162542 6 13:15 pts/2 00:18:13 chia_full_node
user1 1162668 1162542 0 13:15 pts/2 00:00:43 chia_wallet

@wjblanke
Contributor

wjblanke commented Nov 3, 2020

nm -D -n blspy.cpython-38-x86_64-linux-gnu.so with offset 0x52293 (the faulting ip 0x00007ffb795f3293 minus the mapping base 0x7ffb795a1000 from the dmesg line) says it's in

00000000000520f0 T _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj

gdb blspy.cpython-38-x86_64-linux-gnu.so

Dump of assembler code for function _ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj:
***** DeriveChildSk 0x00000000000520f0 <+0>: push %r15
0x00000000000520f2 <+2>: push %r14
0x00000000000520f4 <+4>: push %r13
0x00000000000520f6 <+6>: push %r12
0x00000000000520f8 <+8>: mov %edx,%r12d
0x00000000000520fb <+11>: push %rbp
0x00000000000520fc <+12>: mov %rsi,%rbp
0x00000000000520ff <+15>: bswap %r12d
0x0000000000052102 <+18>: push %rbx
0x0000000000052103 <+19>: sub $0x1a8,%rsp
0x000000000005210a <+26>: mov 0x2b4dd7(%rip),%rbx # 0x306ee8
0x0000000000052111 <+33>: mov %rdi,0x38(%rsp)
0x0000000000052116 <+38>: mov $0x20,%edi
0x000000000005211b <+43>: callq *(%rbx)
0x000000000005211d <+45>: mov $0x4,%edi
0x0000000000052122 <+50>: mov %rax,0x40(%rsp)
0x0000000000052127 <+55>: callq *(%rbx)
0x0000000000052129 <+57>: mov $0x20,%edi
0x000000000005212e <+62>: mov %rax,%r15
0x0000000000052131 <+65>: mov %rax,(%rsp)
0x0000000000052135 <+69>: callq *(%rbx)
0x0000000000052137 <+71>: mov $0x20,%edi
0x000000000005213c <+76>: mov %rax,%r13
0x000000000005213f <+79>: callq *(%rbx)
0x0000000000052141 <+81>: mov $0x1fe0,%edi
0x0000000000052146 <+86>: mov %rax,%r14
0x0000000000052149 <+89>: mov %rax,0x10(%rsp)
0x000000000005214e <+94>: callq *(%rbx)
0x0000000000052150 <+96>: mov $0x1fe0,%edi
0x0000000000052155 <+101>: mov %rax,0x18(%rsp)
0x000000000005215a <+106>: callq *(%rbx)
0x000000000005215c <+108>: mov %r12d,(%r15)
0x000000000005215f <+111>: mov %r13,%rsi
0x0000000000052162 <+114>: mov %rbp,%rdi
0x0000000000052165 <+117>: mov %rax,0x20(%rsp)
0x000000000005216a <+122>: callq 0x25850 _ZNK3bls10PrivateKey9SerializeEPh@plt
0x000000000005216f <+127>: lea 0x10(%r14),%rax
0x0000000000052173 <+131>: cmp %rax,%r13
0x0000000000052176 <+134>: jae 0x52185 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+149>
0x0000000000052178 <+136>: lea 0x10(%r13),%rax
0x000000000005217c <+140>: cmp %rax,%r14
0x000000000005217f <+143>: jb 0x52a88 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2456>
0x0000000000052185 <+149>: movdqu 0x0(%r13),%xmm1
0x000000000005218b <+155>: pcmpeqd %xmm0,%xmm0
0x000000000005218f <+159>: mov 0x10(%rsp),%rax
0x0000000000052194 <+164>: pxor %xmm0,%xmm1
0x0000000000052198 <+168>: movups %xmm1,(%rax)
0x000000000005219b <+171>: movdqu 0x10(%r13),%xmm3
0x00000000000521a1 <+177>: pxor %xmm3,%xmm0
0x00000000000521a5 <+181>: movups %xmm0,0x10(%rax)
0x00000000000521a9 <+185>: mov $0x20,%edi
0x00000000000521ae <+190>: mov $0x1,%ebp
0x00000000000521b3 <+195>: callq *(%rbx)
0x00000000000521b5 <+197>: mov (%rsp),%rcx
0x00000000000521b9 <+201>: mov $0x20,%edx
0x00000000000521be <+206>: mov %r13,%rsi
0x00000000000521c1 <+209>: mov $0x4,%r8d
0x00000000000521c7 <+215>: mov %rax,%rdi
0x00000000000521ca <+218>: mov %rax,%r14
0x00000000000521cd <+221>: callq 0x26310 md_hmac@plt
0x00000000000521d2 <+226>: mov $0x20,%edi
0x00000000000521d7 <+231>: callq *(%rbx)
0x00000000000521d9 <+233>: mov $0x1,%edi
0x00000000000521de <+238>: mov %rax,%r15
0x00000000000521e1 <+241>: callq *(%rbx)
0x00000000000521e3 <+243>: mov $0x21,%edi
0x00000000000521e8 <+248>: mov %rax,0x30(%rsp)
0x00000000000521ed <+253>: callq *(%rbx)
0x00000000000521ef <+255>: mov %r13,0x50(%rsp)
0x00000000000521f4 <+260>: mov %rax,%r12
0x00000000000521f7 <+263>: xor %eax,%eax
0x00000000000521f9 <+265>: mov %r14,0x28(%rsp)
0x00000000000521fe <+270>: mov %rbp,%r14
0x0000000000052201 <+273>: mov %r12,%r13
0x0000000000052204 <+276>: mov %rax,%rdx
0x0000000000052207 <+279>: mov %r15,%r12
0x000000000005220a <+282>: jmp 0x52245 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+341>
0x000000000005220c <+284>: nopl 0x0(%rax)
0x0000000000052210 <+288>: and $0x4,%r15d
0x0000000000052214 <+292>: jne 0x52a70 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2432>
0x000000000005221a <+298>: test %edx,%edx
0x000000000005221c <+300>: je 0x5222f <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+319>
0x000000000005221e <+302>: movzbl (%r12),%ecx
0x0000000000052223 <+307>: mov %cl,0x0(%rbp)
0x0000000000052226 <+310>: test $0x2,%dl
0x0000000000052229 <+313>: jne 0x52ac8 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+2520>
0x000000000005222f <+319>: cmp $0xff,%r14
0x0000000000052236 <+326>: je 0x52315 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+549>
0x000000000005223c <+332>: mov 0x48(%rsp),%rdx
0x0000000000052241 <+337>: add $0x1,%r14
0x0000000000052245 <+341>: mov $0x1fe0,%r15d
0x000000000005224b <+347>: mov $0x20,%ecx
0x0000000000052250 <+352>: sub %rdx,%r15
0x0000000000052253 <+355>: cmp $0x20,%r15
0x0000000000052257 <+359>: cmova %rcx,%r15
0x000000000005225b <+363>: lea (%r15,%rdx,1),%rax
0x000000000005225f <+367>: mov %rax,0x48(%rsp)
0x0000000000052264 <+372>: mov 0x18(%rsp),%rax
0x0000000000052269 <+377>: lea (%rax,%rdx,1),%rbp
0x000000000005226d <+381>: cmp $0x1,%r14
0x0000000000052271 <+385>: je 0x528a0 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+1968>
0x0000000000052277 <+391>: mov 0x28(%rsp),%rcx
0x000000000005227c <+396>: mov $0x21,%edx
0x0000000000052281 <+401>: mov %r13,%rsi
0x0000000000052284 <+404>: mov %r12,%rdi
0x0000000000052287 <+407>: movdqu (%r12),%xmm2
0x000000000005228d <+413>: mov $0x20,%r8d
***** SegFault 0x0000000000052293 <+419>: movups %xmm2,0x0(%r13)
0x0000000000052298 <+424>: movdqu 0x10(%r12),%xmm3
0x000000000005229f <+431>: mov %r14b,0x20(%r13)
0x00000000000522a3 <+435>: movups %xmm3,0x10(%r13)
***** md_hmac 0x00000000000522a8 <+440>: callq 0x26310 md_hmac@plt
0x00000000000522ad <+445>: mov %r15d,%edx
0x00000000000522b0 <+448>: cmp $0x8,%r15d
0x00000000000522b4 <+452>: jb 0x52210 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+288>
0x00000000000522ba <+458>: mov (%r12),%rdx

Guessed possible call stack, based on the md_hmac call:

DeriveChildSk
ParentSkToLamportPK
IKMToLamportSk
ExtractExpand
Extract or Expand

static void Extract(uint8_t* prk_output, const uint8_t* salt, const size_t saltLen, const uint8_t* ikm, const size_t ikm_len) {
    // assert(saltLen == 4);  // Used for EIP2333 key derivation
    // assert(ikm_len == 32);  // Used for EIP2333 key derivation
    // Hash256 used as the hash function (sha256)
    // PRK Output is 32 bytes (HashLen)
    md_hmac(prk_output, ikm, ikm_len, salt, saltLen);
}

static void Expand(uint8_t* okm, size_t L, const uint8_t* prk, const uint8_t* info, const size_t infoLen) {
    assert(L <= 255 * HASH_LEN); // L <= 255 * HashLen
    assert(infoLen >= 0);
    size_t N = ceil((double)L / HASH_LEN);
    size_t bytesWritten = 0;

    uint8_t* T = Util::SecAlloc<uint8_t>(HASH_LEN);
    uint8_t* hmacInput1 = Util::SecAlloc<uint8_t>(infoLen + 1);
    uint8_t* hmacInput = Util::SecAlloc<uint8_t>(HASH_LEN + infoLen + 1);

    assert(N >= 1 && N <= 255);

    for (size_t i = 1; i <= N; i++) {
        if (i == 1) {
            memcpy(hmacInput1, info, infoLen);
            hmacInput1[infoLen] = i;
            md_hmac(T, hmacInput1, infoLen + 1, prk, HASH_LEN);
        } else {
            memcpy(hmacInput, T, HASH_LEN);
            memcpy(hmacInput + HASH_LEN, info, infoLen);
            hmacInput[HASH_LEN + infoLen] = i;
            md_hmac(T, hmacInput, HASH_LEN + infoLen + 1, prk, HASH_LEN);
        }
        size_t to_write = L - bytesWritten;
        if (to_write > HASH_LEN) {
            to_write = HASH_LEN;
        }
        assert (to_write > 0 && to_write <= HASH_LEN);
        memcpy(okm + bytesWritten, T, to_write);
        bytesWritten += to_write;
    }
    Util::SecFree(T);
    Util::SecFree(hmacInput1);
    Util::SecFree(hmacInput);
    assert(bytesWritten == L);
}


static void ExtractExpand(uint8_t* output, size_t outputLen,
                          const uint8_t* key, size_t keyLen,
                          const uint8_t* salt, size_t saltLen,
                          const uint8_t* info, size_t infoLen) {
    uint8_t* prk = Util::SecAlloc<uint8_t>(HASH_LEN);
    HKDF256::Extract(prk, salt, saltLen, key, keyLen);
    HKDF256::Expand(output, outputLen, prk, info, infoLen);
    Util::SecFree(prk);
}

Could that N double + ceil be causing mischief? Or maybe a sodium_malloc issue?

mov %r13,%rsi means movups %xmm2,0x0(%r13) is crashing on a reference to the second parameter to md_hmac, which is ikm (passed in as key) or hmacInput1/hmacInput (which could have problems if N were wrong).

Maybe memcpy is having issues with the zero-length info from here:

    const uint8_t info[1] = {0};
    HKDF256::ExtractExpand(outputLamportSk, 32 * 255, ikm, ikmLen, salt, saltLen, info, 0);
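
Going back to the "N double + ceil" question above, here is a quick standalone check of the arithmetic for this particular call (my own sketch, not part of the library), where L = 32 * 255 and HASH_LEN = 32:

#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdio>

int main() {
    const size_t HASH_LEN = 32;
    const size_t L = 32 * 255;  // outputLen passed by the ExtractExpand call above
    // Same computation as Expand(): N = ceil(L / HASH_LEN)
    size_t N = (size_t)std::ceil((double)L / HASH_LEN);
    printf("N = %zu\n", N);  // prints 255, exactly at the assert(N <= 255) boundary
    // Walk the same write-length logic as the Expand() loop
    size_t bytesWritten = 0;
    for (size_t i = 1; i <= N; i++) {
        size_t to_write = L - bytesWritten;
        if (to_write > HASH_LEN) to_write = HASH_LEN;
        bytesWritten += to_write;
    }
    assert(bytesWritten == L);  // holds for this call
    return 0;
}

8160 / 32 is exact even as a double, so N lands on 255 and exactly L bytes get written; at least for this call the ceil arithmetic itself looks benign.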

All just guesses

@wjblanke
Contributor

wjblanke commented Nov 3, 2020

0x000000000005226d <+381>: cmp $0x1,%r14 // if (i == 1)
0x0000000000052271 <+385>: je 0x528a0 <_ZN3bls7CoreMPL13DeriveChildSkERKNS_10PrivateKeyEj+1968> // if (i == 1)
so i != 1
} else {
memcpy(hmacInput, T, HASH_LEN);
memcpy(hmacInput + HASH_LEN, info, infoLen);
hmacInput[HASH_LEN + infoLen] = i;
md_hmac(T, hmacInput, HASH_LEN + infoLen + 1, prk, HASH_LEN);
}
0x0000000000052277 <+391>: mov 0x28(%rsp),%rcx // prk (md_hmac param - RCX)
0x000000000005227c <+396>: mov $0x21,%edx
0x0000000000052281 <+401>: mov %r13,%rsi // hmacInput (md_hmac param - RSI)
0x0000000000052284 <+404>: mov %r12,%rdi // T (md_hmac param - RDI)
0x0000000000052287 <+407>: movdqu (%r12),%xmm2 // Move first 16 bytes of T to xmm2
0x000000000005228d <+413>: mov $0x20,%r8d // 32 (HASHLEN md_hmac param = R8D)
***** SegFault 0x0000000000052293 <+419>: movups %xmm2,0x0(%r13) // Write first 16 bytes of T to hmacInput
0x0000000000052298 <+424>: movdqu 0x10(%r12),%xmm3 // Move second 16 bytes of T to xmm3
0x000000000005229f <+431>: mov %r14b,0x20(%r13) // hmacInput[HASH_LEN + infoLen] = i;
0x00000000000522a3 <+435>: movups %xmm3,0x10(%r13) // Write second 16 bytes of T to hmacInput
***** md_hmac 0x00000000000522a8 <+440>: callq 0x26310 md_hmac@plt

It isn't happy copying T to hmacInput and fails on the first write. Since hmacInput allocates 33 bytes using sodium_malloc, the memory will be unaligned. Memcpy shouldn't care about that though (it is using unaligned movups). infoLen = 0 has been optimized out.
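
A quick way to see that alignment behaviour in isolation (my own sketch; it only needs libsodium):

#include <sodium.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    if (sodium_init() < 0) return 1;
    // sodium_malloc() places the buffer right before a guard page, so a 33-byte
    // allocation generally comes back unaligned (address not a multiple of 16).
    void* p = sodium_malloc(33);
    if (p == nullptr) return 1;
    printf("sodium_malloc(33) -> %p, address mod 16 = %zu\n",
           p, (size_t)((uintptr_t)p % 16));
    sodium_free(p);
    return 0;
}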

uint8_t* hmacInput = Util::SecAlloc<uint8_t>(HASH_LEN + infoLen + 1);

hmmm. this must be failing. maybe we are running into this:

jedisct1/libsodium#901

it might be worth trying setrlimit
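
For reference, "trying setrlimit" could look something like the sketch below, run (or called) before the farmer starts allocating. One assumption on my part: that the relevant limit is RLIMIT_MEMLOCK; neither this thread nor the linked issue spells out which resource limit is being hit, so treat this purely as an illustration of the setrlimit call.

#include <sys/resource.h>
#include <cstdio>

int main() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) { perror("getrlimit"); return 1; }
    printf("RLIMIT_MEMLOCK soft=%llu hard=%llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    // Raise the soft limit as far as the hard limit allows.
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) { perror("setrlimit"); return 1; }
    return 0;
}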

@wjblanke
Contributor

wjblanke commented Nov 3, 2020

OK, a different spot from the one above. This one crashed at offset 0x528B8:

Machine 2:

[302029.128623] show_signal_msg: 18 callbacks suppressed
[302029.128625] chia_farmer[4407]: segfault at 0 ip 00007f9903ec58b8 sp 00007ffcabd63190 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f9903e73000+106000]
[302029.128632] Code: 81 c4 a8 01 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 90 48 8b 44 24 30 48 8b 4c 24 28 ba 01 00 00 00 4c 89 e7 41 b8 20 00 00 00 <c6> 00 01 48 89 c6 e8 4d 3a fd ff 44 89 fa 41 83 ff 08 0f 83 b0 00

The code above branched to <+1968> when i == 1:
{
memcpy(hmacInput1, info, infoLen); // optimized out
hmacInput1[infoLen] = i;
md_hmac(T, hmacInput1, infoLen + 1, prk, HASH_LEN);
}
0x00000000000528a0 <+1968>: mov 0x30(%rsp),%rax
0x00000000000528a5 <+1973>: mov 0x28(%rsp),%rcx
0x00000000000528aa <+1978>: mov $0x1,%edx
0x00000000000528af <+1983>: mov %r12,%rdi
0x00000000000528b2 <+1986>: mov $0x20,%r8d
**** SegFault 0x00000000000528b8 <+1992>: movb $0x1,(%rax) // hmacInput1[infoLen] = i;
0x00000000000528bb <+1995>: mov %rax,%rsi
0x00000000000528be <+1998>: callq 0x26310 md_hmac@plt

This is the i == 1 case. Here it is trying to write into hmacInput1, which is much the same situation. Does this allocation fail?

uint8_t* hmacInput1 = Util::SecAlloc<uint8_t>(infoLen + 1);

@arvidn
Contributor

arvidn commented Nov 3, 2020

Here are a few observations, just from lightly browsing the code at: https://github.com/Chia-Network/bls-signatures/blob/d1e8f892d1941ff38da08a85cf17fa2e40f4ea2a/src/hkdf.hpp#L46

  • Util::SecAlloc() should probably throw an exception if it fails to allocate memory. There's no checking for null right now.
  • md_hmac() can throw an exception if it fails to allocate stack space (here). If it does, we may leak all allocations made with SecAlloc(), so those allocations should really be held by a std::unique_ptr<> (see the sketch after this list). (It's a bit hard to say exactly what happens, actually; RLC_THROW is defined here.)
  • each call to SecAlloc() (when configured to use libsodium) is a call to sodium_malloc(), which allocates 3 or 4 pages. Which seems quite excessive. Given that these allocations are just used in this stack frame, it seems like they could be merged into a single allocation, presumably saving time, physical memory and virtual address space.
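
A minimal sketch of the unique_ptr idea from the second bullet (my own illustration, not the actual bls-signatures patch). It uses libsodium's sodium_malloc / sodium_free directly, which is what Util::SecAlloc / Util::SecFree wrap when libsodium is enabled (per the third bullet):

#include <sodium.h>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

struct SodiumFree {
    void operator()(void* p) const { sodium_free(p); }
};
using SecBuf = std::unique_ptr<uint8_t, SodiumFree>;

static SecBuf sec_alloc(size_t n) {
    auto* p = static_cast<uint8_t*>(sodium_malloc(n));
    if (p == nullptr) throw std::bad_alloc();  // unlike SecAlloc(), fail loudly on NULL
    return SecBuf(p);
}

int main() {
    if (sodium_init() < 0) return 1;
    SecBuf T = sec_alloc(32);          // HASH_LEN
    SecBuf hmacInput = sec_alloc(33);  // HASH_LEN + infoLen + 1
    // ... use T.get() / hmacInput.get() in the HMAC calls as before ...
    // Both buffers are released automatically even if something in between throws,
    // so nothing leaks on the error path.
    return 0;
}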

@wjblanke
Contributor

wjblanke commented Nov 3, 2020

I am now thinking this is caused by a sodium_malloc leak. Stock Ubuntu will only tolerate 15000 of these leaked sodium allocations, and it looks like BNWrapper may have a leak.
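
To make that concrete, here is a small sketch (mine, with my own assumptions) of why a sodium_malloc leak eventually kills the process: each leaked guarded allocation pins a few kernel memory mappings (data plus guard pages), so a steady leak walks toward the per-process mapping limit (vm.max_map_count, 65530 by default on Linux), after which further sodium_malloc calls fail. The "15000 leaks" figure is consistent with a handful of mappings per allocation, but that ratio is my estimate, not something measured here.

#include <sodium.h>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// Count the current number of memory mappings for this process.
static size_t mapping_count() {
    std::ifstream maps("/proc/self/maps");
    size_t n = 0;
    for (std::string line; std::getline(maps, line); ) ++n;
    return n;
}

int main() {
    if (sodium_init() < 0) return 1;
    std::cout << "mappings before: " << mapping_count() << "\n";
    for (int i = 0; i < 1000; i++)
        (void)sodium_malloc(33);  // deliberately leaked, mimicking the suspected leak
    std::cout << "mappings after 1000 leaked sodium_malloc(33) calls: "
              << mapping_count() << "\n";
    return 0;
}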

@hoffmang9
Member

@wjblanke has a potential fix for the crashing farmer. Can folks try the farm-crash-test branch? https://github.com/Chia-Network/chia-blockchain/tree/farm-crash-test

git fetch; git checkout farm-crash-test; git pull; sh install.sh, etc.

I can make a Windows installer available too if anyone has seen this problem on Windows.

@ghost

ghost commented Nov 4, 2020

I ran the farm-crash-test branch on the farmer overnight and it still crashed. (It ran from 20:50 to 05:49.)
I only updated the farmer; do I need to get the harvesters on the same branch?

[12580.795642] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[18779.585991] show_signal_msg: 28 callbacks suppressed
[18779.585993] chia_farmer[3537]: segfault at 0 ip 00007f8e5e207293 sp 00007ffca4505760 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f8e5e1b5000+106000]
[18779.585999] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[20597.406646] perf: interrupt took too long (3167 > 3162), lowering kernel.perf_event_max_sample_rate to 63000
[36877.859927] perf: interrupt took too long (3959 > 3958), lowering kernel.perf_event_max_sample_rate to 50500
[193217.234802] audit: type=1400 audit(1604494840.413:40): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=249015 comm="cups-browsed" capability=23 capname="sys_nice"
[214122.784870] chia_farmer[239625]: segfault at 0 ip 00007fbd423f1503 sp 00007fff8a01b2d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7fbd4239f000+106000]
[214122.784876] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

@hoffmang9
Member

You actually need to update the bls library - blspy is its Python name. The version that probably fixes this is 0.2.5 - https://pypi.org/project/blspy/#history

I modified setup.py in the branch above to pull that in when you run install.sh. If you wait about 20 minutes, I'll have one more possible fix in version 0.2.6 of blspy.

@hoffmang9
Member

Ok, I've updated the farm-crash-test branch with the newest blspy, 0.2.6. Please pull and run that one.

@ghost

ghost commented Nov 5, 2020

I have been running the new version since 11:30 this morning (6.5 hours ago) with no crash yet. I will let it run through the night and report back tomorrow morning.

@ghost

ghost commented Nov 5, 2020

Unfortunately it crashed again. Let me know if I need to post any logs or dmesg output.

@ToovaTech

Plots are winding down now, so I will try out farm-crash-test today. One farmer did fail on regular b17 again, this time with just two plotters running the full stack.

@etemiz
Author

etemiz commented Nov 5, 2020

beta-1.0b18.dev19 (0.2.5?) crashed after 24 hours of running.

@wjblanke
Contributor

wjblanke commented Nov 5, 2020

Yes, please post the dmesg crash info.

@ghost

ghost commented Nov 5, 2020

This is the dmesg of my latest crash:

[12580.795642] perf: interrupt took too long (2530 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
[18779.585991] show_signal_msg: 28 callbacks suppressed
[18779.585993] chia_farmer[3537]: segfault at 0 ip 00007f8e5e207293 sp 00007ffca4505760 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f8e5e1b5000+106000]
[18779.585999] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[20597.406646] perf: interrupt took too long (3167 > 3162), lowering kernel.perf_event_max_sample_rate to 63000
[36877.859927] perf: interrupt took too long (3959 > 3958), lowering kernel.perf_event_max_sample_rate to 50500
[193217.234802] audit: type=1400 audit(1604494840.413:40): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=249015 comm="cups-browsed" capability=23 capname="sys_nice"
[214122.784870] chia_farmer[239625]: segfault at 0 ip 00007fbd423f1503 sp 00007fff8a01b2d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7fbd4239f000+106000]
[214122.784876] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8
[225548.021245] perf: interrupt took too long (4951 > 4948), lowering kernel.perf_event_max_sample_rate to 40250
[258089.360167] chia_farmer[472880]: segfault at 0 ip 00007f61f9cf3dc3 sp 00007fffd21f0dc0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f61f9ca1000+107000]
[258089.360173] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

@wjblanke
Contributor

wjblanke commented Nov 5, 2020

Hmm, same spot. We've found another leak; more info soon. I wonder why it picks this particular spot (well, and one other we know about) to crash, though.

@eFishCent
Contributor

I also get the same kind of crash with chia start harvester:

[141362.172271] chia_harvester[89830]: segfault at 0 ip 00007f4e20523293 sp 00007ffde54b62d0 error 6 in blspy.cpython-38-x86_64-linux-gnu.so[7f4e204d1000+106000]
[141362.172281] Code: 48 8d 2c 10 49 83 fe 01 0f 84 29 06 00 00 48 8b 4c 24 28 ba 21 00 00 00 4c 89 ee 4c 89 e7 f3 41 0f 6f 14 24 41 b8 20 00 00 00 <41> 0f 11 55 00 f3 41 0f 6f 5c 24 10 45 88 75 20 41 0f 11 5d 10 e8

@hoffmang9
Member

I've just pushed a potentially fixed blspy. Please git pull the farm-crash-test branch and run it when you can.

@ghost

ghost commented Nov 8, 2020

Apologies for the delay in reporting back; I ended up reinstalling Linux on three of the harvesters to sort out the Python issues (these were upgrades from Ubuntu 19.x to 20.04).

I have had the farm-crash-test branch running on my farmer and all the harvesters for the last 9 hours, and it is still going without any errors. (It typically crashed around 6 hours in.) I will keep an eye on it during the day.

@ToovaTech

I've got two machines running strong on dev20 farm-crash-test for 3+ days. I've just installed dev21 on a third machine that had only been plotting.

@ghost

ghost commented Nov 9, 2020

Nearly 24 hours on the latest farm-crash-test branch and everything is still working fine. I initially ran the farmer from the command line and then via the GUI; both ran without issues.

@etemiz
Author

etemiz commented Nov 9, 2020

1.0b18.dev21 blspy 0.2.7 running for 47 hours without a problem.

@hoffmang9
Member

I merged this into the dev branch. If anyone comes to this issue new, that's the branch you now need, not the farm-crash-test branch. Feel free to keep running your most recent farm-crash-test build, however.

I'm going to leave this open for another couple of days to make sure we don't see any more of this over longer time periods.

@eFishCent
Contributor

eFishCent commented Nov 12, 2020

I have been running farm-crash-test for 5 days straight with no segfault on the harvester. Peak memory usage is 983 MB for me, and current usage is 928 MB. What is everyone seeing as peak and current memory usage for their harvester?
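
For anyone wanting to compare numbers: one way to read peak and current memory for a running harvester is the VmHWM / VmRSS lines in /proc/<pid>/status (an assumption about how to measure it, not necessarily how the figures above were taken), e.g. with a tiny helper like this:

#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    // Pass the chia_harvester PID as the first argument, or default to this process.
    std::string pid = (argc > 1) ? argv[1] : "self";
    std::ifstream status("/proc/" + pid + "/status");
    for (std::string line; std::getline(status, line); ) {
        // VmHWM = peak resident set size, VmRSS = current resident set size
        if (line.rfind("VmHWM", 0) == 0 || line.rfind("VmRSS", 0) == 0)
            std::cout << line << "\n";
    }
    return 0;
}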

@hoffmang9
Member

Having heard no further reports of failure, I'm closing this as fixed for the next release.

@hoffmang9 hoffmang9 mentioned this issue Dec 3, 2020
@sonpython

Hi,

My version: 1.1.7

Fresh install on Ubuntu without the GUI; I run only a harvester, farming to a remote farmer.

It crashes every couple of hours. When I check dmesg I get these messages:

[T7 Thg 6 19 06:07:37 2021] chia_harvester[1315021]: segfault at 7f034413b1f8 ip 00007f03a7555922 sp 00007f03526ae7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f03a74e8000+b8000]
[T7 Thg 6 19 06:07:37 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
[T7 Thg 6 19 15:33:38 2021] chia_harvester[1498469]: segfault at 7fa4741a9a17 ip 00007fa49ef0a922 sp 00007fa48e7ee7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7fa49ee9d000+b8000]
[T7 Thg 6 19 15:33:38 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
[T7 Thg 6 19 20:21:22 2021] chia_harvester[1583997]: segfault at 7f76c0137d47 ip 00007f772c483922 sp 00007f76d9fed7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f772c416000+b8000]
[T7 Thg 6 19 20:21:22 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
[CN Thg 6 20 00:34:44 2021] chia_harvester[1631373]: segfault at 7f50a81a4e25 ip 00007f50d40c4922 sp 00007f50d0d7c7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f50d4057000+b8000]
[CN Thg 6 20 00:34:44 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d

@hoffmang9
Member

Almost always a bad plot.

@arvidn
Contributor

arvidn commented Jun 19, 2021

Our harvester should tolerate bad plots, though...

@hoffmang9
Member

Indeed it should. @sonpython could you figure out which plot is bad and upload it to us?

@sonpython

sonpython commented Jun 20, 2021

@hoffmang9

Sorry for the late reply.

It took me a couple of hours to upload the 100 GB file to Google Drive:
https://drive.google.com/file/d/1N5ehFWhv-cRibJpN90y6r5sh4jmfd72r/view?usp=sharing

There is only 1 bad plot according to chia plots check. But after I renamed it to .plot.bad, the harvester still crashes and reports a segfault in dmesg:

[CN Thg 6 20 10:11:00 2021] chia_harvester[1747413]: segfault at 7f16a80d07ee ip 00007f16d578f922 sp 00007f168fff17f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f16d5722000+b8000]
[CN Thg 6 20 10:11:00 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
[CN Thg 6 20 15:42:43 2021] chia_harvester[1768453]: segfault at 7f3de810e403 ip 00007f3e7db21922 sp 00007f3dfe7ee7f8 error 4 in chiapos.cpython-38-x86_64-linux-gnu.so[7f3e7dab4000+b8000]
[CN Thg 6 20 15:42:43 2021] Code: d9 48 8b 7c 24 c8 49 89 d3 0f bd c0 83 f0 1f 8d 74 c8 29 e9 fb fc ff ff 48 85 c9 0f 84 5a ff ff ff 48 8d 5a 08 48 89 54 24 e0 <0f> b6 74 0a ff 48 89 5c 24 e8 48 83 f9 07 0f 86 5a 02 00 00 4c 8d
