
Storage node overloads and returns 504, Very urgent !!! #1175

Open
NewDund opened this issue Dec 19, 2018 · 19 comments

NewDund commented Dec 19, 2018

@mocchira
My cluster had been running normally, but today something strange happened. Between 17:58 and 18:18 my storage node came under very heavy load and many of my requests returned 504. When I ran `leofs-adm mq-stats storage-node`, I found that `leo_sync_obj_with_dc_queue` had grown very large, yet I hadn't changed anything today and I don't know why.
Here are my crash.log and error logs.
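As an aside, backlogged queues are easier to spot if the `mq-stats` output is filtered down to non-zero queues. A minimal sketch, assuming a pipe-separated table layout; the sample rows and the `123456` backlog below are made up for illustration, not taken from this cluster (in practice you would pipe the real `leofs-adm mq-stats <storage-node>` output instead of the heredoc):

```shell
# Print only queues whose "number of msgs" column is non-zero.
# The heredoc is a hypothetical sample of mq-stats output.
awk -F'|' 'NR > 2 && $3 + 0 > 0 { gsub(/ /, "", $1); print $1, $3 + 0 }' <<'EOF'
              id                |    state    | number of msgs | batch of msgs
--------------------------------+-------------+----------------+---------------
 leo_sync_obj_with_dc_queue     |   running   | 123456         | 1600
 leo_rebalance_queue            |   idling    | 0              | 1600
EOF
```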

crash.log

2018-12-19 18:03:05 =ERROR REPORT====
** Generic server <0.18428.2146> terminating
** Last message in was {get,2,{http_req,#Port<0.36308029>,ranch_tcp,keepalive,<0.18061.2146>,<<"GET">>,'HTTP/1.1',{{10,0,1,104},58701},<<"10.0.1.104">>,undefined,80,<<"/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>,undefined,<<>>,[],[],[{<<"host">>,<<"10.0.1.104">>},{<<"user-agent">>,<<"aws-sdk-php/1.6.2 PHP/5.6.27 curl/7.49.1 openssl/1.0.2h">>},{<<"accept">>,<<"*/*">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"referer">>,<<"http://10.0.1.104/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>},{<<"content-type">>,<<"application/x-www-form-urlencoded">>},{<<"date">>,<<"Wed, 19 Dec 2018 10:02:56 GMT">>},{<<"authorization">>,<<"AWS ea925e5b9fbc24e48e46:NwMk+eIcPXQlHOFAebcTiBu6mew=">>},{<<"content-length">>,<<"0">>}],[],undefined,[],waiting,<<>>,undefined,false,waiting,[],<<>>,undefined},{metadata_3,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,197420687858415348216969267030332246516,46,6917037,<<>>,0,5242880,2,0,87005854909,1545211399859741,63712430599,150382645841396063523320865937486611814,355900593,undefined,0,0,0,0,0,0}}
** When Server state == {state,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,{transport_record,ranch_tcp,#Port<0.36308029>,5242880},false,undefined}
** Reason for termination ==
** {enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:05 =CRASH REPORT====
  crasher:
    initial call: leo_large_object_get_handler:init/1
    pid: <0.18428.2146>
    registered_name: []
    exception exit: {{enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},[{gen_server,terminate,7,[{file,"gen_server.erl"},{line,812}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.18061.2146>]
    messages: []
    links: [<0.18061.2146>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1079
  neighbours:
    neighbour: [{pid,<0.18061.2146>},{registered_name,[]},{initial_call,{cowboy_protocol,init,4}},{current_function,{gen,do_call,4}},{ancestors,[]},{messages,[{cowboy_req,resp_sent}]},{links,[<0.13843.1>,<0.18428.2146>,#Port<0.36308029>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,2586},{stack_size,52},{reductions,3412}]
2018-12-19 18:03:05 =ERROR REPORT====
Ranch listener leo_gateway_s3_api had connection process started with cowboy_protocol:start_link/4 at <0.18061.2146> exit with reason: {enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:05 =ERROR REPORT====
** Generic server <0.18437.2146> terminating
** Last message in was {get,2,{http_req,#Port<0.36308099>,ranch_tcp,keepalive,<0.18239.2146>,<<"GET">>,'HTTP/1.1',{{10,0,1,104},58748},<<"10.0.1.104">>,undefined,80,<<"/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>,undefined,<<>>,[],[],[{<<"host">>,<<"10.0.1.104">>},{<<"user-agent">>,<<"aws-sdk-php/1.6.2 PHP/5.6.27 curl/7.49.1 openssl/1.0.2h">>},{<<"accept">>,<<"*/*">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"referer">>,<<"http://10.0.1.104/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>},{<<"content-type">>,<<"application/x-www-form-urlencoded">>},{<<"date">>,<<"Wed, 19 Dec 2018 10:02:58 GMT">>},{<<"authorization">>,<<"AWS ea925e5b9fbc24e48e46:qUPBHb8o9VjV+ZxzCAH2YLfQvr0=">>},{<<"content-length">>,<<"0">>}],[],undefined,[],waiting,<<>>,undefined,false,waiting,[],<<>>,undefined},{metadata_3,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,197420687858415348216969267030332246516,46,6917037,<<>>,0,5242880,2,0,87005854909,1545211399859741,63712430599,150382645841396063523320865937486611814,355900593,undefined,0,0,0,0,0,0}}
** When Server state == {state,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,{transport_record,ranch_tcp,#Port<0.36308099>,5242880},false,undefined}
** Reason for termination ==
** {enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:05 =CRASH REPORT====
  crasher:
    initial call: leo_large_object_get_handler:init/1
    pid: <0.18437.2146>
    registered_name: []
    exception exit: {{enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},[{gen_server,terminate,7,[{file,"gen_server.erl"},{line,812}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.18239.2146>]
    messages: []
    links: [<0.18239.2146>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1079
  neighbours:
    neighbour: [{pid,<0.18239.2146>},{registered_name,[]},{initial_call,{cowboy_protocol,init,4}},{current_function,{gen,do_call,4}},{ancestors,[]},{messages,[{cowboy_req,resp_sent}]},{links,[<0.13843.1>,<0.18437.2146>,#Port<0.36308099>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,2586},{stack_size,52},{reductions,3412}]
2018-12-19 18:03:05 =ERROR REPORT====
Ranch listener leo_gateway_s3_api had connection process started with cowboy_protocol:start_link/4 at <0.18239.2146> exit with reason: {enotconn,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:27 =ERROR REPORT====
** Generic server <0.18451.2146> terminating
** Last message in was {get,2,{http_req,#Port<0.36308192>,ranch_tcp,keepalive,<0.18370.2146>,<<"GET">>,'HTTP/1.1',{{10,0,1,104},58805},<<"10.0.1.104">>,undefined,80,<<"/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>,undefined,<<>>,[],[],[{<<"host">>,<<"10.0.1.104">>},{<<"user-agent">>,<<"aws-sdk-php/1.6.2 PHP/5.6.27 curl/7.49.1 openssl/1.0.2h">>},{<<"accept">>,<<"*/*">>},{<<"accept-encoding">>,<<"gzip, deflate">>},{<<"referer">>,<<"http://10.0.1.104/xz-img-01/51%2C0%2C81%2C14929%2C4222%2C2000%2C854ec528.jpg">>},{<<"content-type">>,<<"application/x-www-form-urlencoded">>},{<<"date">>,<<"Wed, 19 Dec 2018 10:03:02 GMT">>},{<<"authorization">>,<<"AWS ea925e5b9fbc24e48e46:xacTyFtJ5aMFCiTyV0RzW+yCL+c=">>},{<<"content-length">>,<<"0">>}],[],undefined,[],waiting,<<>>,undefined,false,waiting,[],<<>>,undefined},{metadata_3,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,197420687858415348216969267030332246516,46,6917037,<<>>,0,5242880,2,0,87005854909,1545211399859741,63712430599,150382645841396063523320865937486611814,355900593,undefined,0,0,0,0,0,0}}
** When Server state == {state,<<"xz-img-01/51,0,81,14929,4222,2000,854ec528.jpg">>,{transport_record,ranch_tcp,#Port<0.36308192>,5242880},false,undefined}
** Reason for termination ==
** {closed,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-12-19 18:03:27 =CRASH REPORT====
  crasher:
    initial call: leo_large_object_get_handler:init/1
    pid: <0.18451.2146>
    registered_name: []
    exception exit: {{closed,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},[{gen_server,terminate,7,[{file,"gen_server.erl"},{line,812}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.18370.2146>]
    messages: []
    links: [<0.18370.2146>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 1307
  neighbours:
    neighbour: [{pid,<0.18370.2146>},{registered_name,[]},{initial_call,{cowboy_protocol,init,4}},{current_function,{gen,do_call,4}},{ancestors,[]},{messages,[{cowboy_req,resp_sent}]},{links,[<0.13843.1>,<0.18451.2146>,#Port<0.36308192>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,2586},{stack_size,52},{reductions,3412}]
2018-12-19 18:03:27 =ERROR REPORT====
Ranch listener leo_gateway_s3_api had connection process started with cowboy_protocol:start_link/4 at <0.18370.2146> exit with reason: {closed,[{leo_large_object_get_handler,handle_loop,3,[{file,"src/leo_large_object_get_handler.erl"},{line,247}]},{leo_large_object_get_handler,handle_call,3,[{file,"src/leo_large_object_get_handler.erl"},{line,117}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}

error

[W]	gateway_01@10.0.1.103	2018-12-19 18:10:06.832107 +0800	1545214206	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:07.718168 +0800	1545214207	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:08.496338 +0800	1545214208	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:10.385112 +0800	1545214210	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:11.262133 +0800	1545214211	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:11.331273 +0800	1545214211	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:11.403194 +0800	1545214211	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:11.602119 +0800	1545214211	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:13.252119 +0800	1545214213	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:14.570133 +0800	1545214214	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:15.556191 +0800	1545214215	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:17.700117 +0800	1545214217	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:20.974151 +0800	1545214220	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_04@10.0.1.109'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:25.51340 +0800	1545214225	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:28.598102 +0800	1545214228	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:30.52759 +0800	1545214230	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:31.908043 +0800	1545214231	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:32.98143 +0800	1545214232	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:33.599150 +0800	1545214233	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:34.440171 +0800	1545214234	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_04@10.0.1.109'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:35.54134 +0800	1545214235	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:35.506103 +0800	1545214235	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:36.406227 +0800	1545214236	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:38.80066 +0800	1545214238	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:38.497026 +0800	1545214238	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:38.600112 +0800	1545214238	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:39.239158 +0800	1545214239	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_04@10.0.1.109'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:40.809148 +0800	1545214240	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:40.827144 +0800	1545214240	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:41.165104 +0800	1545214241	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:41.332036 +0800	1545214241	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:41.407166 +0800	1545214241	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:42.171189 +0800	1545214242	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:44.150284 +0800	1545214244	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:45.628375 +0800	1545214245	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:46.16232 +0800	1545214246	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:46.408126 +0800	1545214246	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:47.697231 +0800	1545214247	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:47.701054 +0800	1545214247	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:48.628117 +0800	1545214248	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:49.530327 +0800	1545214249	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:52.173096 +0800	1545214252	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:54.254144 +0800	1545214254	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:55.508230 +0800	1545214255	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:10:58.399195 +0800	1545214258	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:11:05.507265 +0800	1545214265	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:11:35.508243 +0800	1545214295	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:11:59.501286 +0800	1545214319	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:12:07.716355 +0800	1545214327	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:12:12.825171 +0800	1545214332	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:12:19.533378 +0800	1545214339	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:12:29.502258 +0800	1545214349	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:12:37.717278 +0800	1545214357	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:12:42.826205 +0800	1545214362	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:12:49.534201 +0800	1545214369	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:25:49.532172 +0800	1545215149	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:25:58.744361 +0800	1545215158	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:03.263182 +0800	1545215163	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:09.941075 +0800	1545215169	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:14.36295 +0800	1545215174	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_05@10.0.1.110'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:19.533228 +0800	1545215179	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:28.745348 +0800	1545215188	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:33.264141 +0800	1545215193	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:39.942148 +0800	1545215199	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:44.37257 +0800	1545215204	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:49.534314 +0800	1545215209	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:26:58.747222 +0800	1545215218	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:27:03.265327 +0800	1545215223	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:27:09.943436 +0800	1545215229	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:27:14.38267 +0800	1545215234	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_directory},{method,find_by_parent_dir},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:31:22.92283 +0800	1545215482	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2018-12-19 18:31:52.594265 +0800	1545215512	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_04@10.0.1.109'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]

If you need any information, please contact me.
Because this is a production environment, I hope for a reply as soon as possible. Thank you.

mocchira (Member) commented

@NewDund The `enotconn` error seen in crash.log occurs when the TCP connection is no longer established, so I suspect some network trouble happened around that time and some connections were forcibly closed by network switches between LeoGateway and LeoStorage. I'd recommend contacting your infra team (or, if you run LeoFS on a public cloud, your provider) to confirm whether anything went wrong in that window.

> If you need any information, please contact me.

Just in case, please send us the error log files from LeoStorage if they exist.

mocchira (Member) commented

Any network-related system metrics would also be helpful, if your system gathers that information regularly.
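If no metrics collector is in place, a generic first pass on Linux hosts (not LeoFS-specific, and assuming `/proc/net/snmp` is available) is to read the kernel's TCP counters on both the gateway and the storage nodes; `RetransSegs` or `EstabResets` growing quickly during the incident window points at packet loss or forcibly closed connections:

```shell
# Print selected TCP counters from /proc/net/snmp (Linux only).
# Run twice a few seconds apart and compare the deltas.
grep '^Tcp:' /proc/net/snmp | awk '
  NR == 1 { for (i = 1; i <= NF; i++) h[i] = $i }
  NR == 2 { for (i = 1; i <= NF; i++)
              if (h[i] == "RetransSegs" || h[i] == "EstabResets" || h[i] == "AttemptFails")
                print h[i] "=" $i }'
```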


NewDund commented Dec 20, 2018

@mocchira
This is part of the storage node's error log that I extracted; please see whether it helps.

[W]	storage_01@10.0.1.106	2018-12-19 18:08:22.866362 +0800	1545214102	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:23.62127 +0800	1545214103	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:24.547035 +0800	1545214104	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,115042698958933531525465566678727684218},{key,<<"xz-img-01/51,0,96,15055,1500,2000,87fc2486.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:24.567591 +0800	1545214104	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,85522489140579400833319482861472971630},{key,<<"xz-img-01/51,0,6,14972,3038,2000,d559c8f4.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:29.645343 +0800	1545214109	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:30.740829 +0800	1545214110	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,301825275647700918725340957232212760431},{key,<<"xz-img-01/51,0,3,15066,2666,2000,74b05a3c.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:31.836745 +0800	1545214111	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:31.840704 +0800	1545214111	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:31.935687 +0800	1545214111	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:32.722483 +0800	1545214112	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,302704449271088706685579725248233407195},{key,<<"xz-img-01/51,0,47,15060,800,800,842d3c77.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:34.658964 +0800	1545214114	leo_storage_read_repairer:compare/4	174	[{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:34.661988 +0800	1545214114	leo_storage_read_repairer:compare/4	174	[{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:34.662048 +0800	1545214114	leo_storage_read_repairer:compare/4	174	[{node,'storage_03@10.0.1.108'},{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{clock,1545214100177830},{cause,secondary_inconsistency}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:35.173043 +0800	1545214115	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,303555783980173236865626863861593103855},{key,<<"xz-img-01/51,0,40,15027,1024,682,54cd19b0.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:35.193046 +0800	1545214115	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,303555783980173236865626863861593103855},{key,<<"xz-img-01/51,0,40,15027,1024,682,54cd19b0.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:35.362758 +0800	1545214115	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:35.699131 +0800	1545214115	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,132627485593750080073349362187704947564},{key,<<"xz-img-01/51,0,95,14931,1800,1200,dcbe253e.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:35.945801 +0800	1545214115	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:35.950816 +0800	1545214115	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:35.964784 +0800	1545214115	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:38.419244 +0800	1545214118	leo_storage_read_repairer:compare/4	174	[{node,'storage_02@10.0.1.107'},{addr_id,274799762605490884010058104327502159603},{key,<<"xz-img-01/51,0,79,15227,3000,2000,aa740cb9.jpg">>},{clock,1545214072066214},{cause,timeout}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:38.433349 +0800	1545214118	leo_storage_read_repairer:compare/4	174	[{node,'storage_02@10.0.1.107'},{addr_id,274799762605490884010058104327502159603},{key,<<"xz-img-01/51,0,79,15227,3000,2000,aa740cb9.jpg">>},{clock,1545214072066214},{cause,timeout}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:38.758215 +0800	1545214118	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,139114591507114234479358787829134935318},{key,<<"xz-img-01/51,0,11,14976,132,132,5030263d.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:38.799226 +0800	1545214118	leo_storage_read_repairer:compare/4	174	[{node,'storage_05@10.0.1.110'},{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{clock,1545214103039150},{cause,secondary_inconsistency}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:39.25278 +0800	1545214119	leo_storage_read_repairer:compare/4	174	[{node,'storage_03@10.0.1.108'},{addr_id,46555256417576372219622714676382478038},{key,<<"xz-img-01/51,0,58,3146,1334,1334,34132f53.jpg">>},{clock,1543114132553194},{cause,timeout}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:39.494405 +0800	1545214119	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:44.970543 +0800	1545214124	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:44.971314 +0800	1545214124	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,200868615888498000502667266991659758280},{key,<<"xz-img-01/51,0,57,14944,1500,2000,500e72e8.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:45.526425 +0800	1545214125	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:45.540390 +0800	1545214125	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:45.553651 +0800	1545214125	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:45.566184 +0800	1545214125	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:45.574858 +0800	1545214125	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,218849011693645702606343196622539469801},{key,<<"xz-img-01/51,0,86,15030,1500,2000,bf93d6ac.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:48.393474 +0800	1545214128	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,306816981848119513338965244292149792627},{key,<<"xz-img-01/51,0,27,15141,1500,2000,01499e8b.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:48.844323 +0800	1545214128	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:48.844409 +0800	1545214128	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:48.844456 +0800	1545214128	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:48.845286 +0800	1545214128	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,38608667592519813831701633028787826052},{key,<<"xz-img-01/51,0,67,14931,132,132,53f5edda.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:48.875203 +0800	1545214128	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,313736640671212921313340909969055426615},{key,<<"xz-img-01/51,0,81,14944,3000,2000,f8b2db6a.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:49.205864 +0800	1545214129	leo_storage_read_repairer:compare/4	174	[{node,'storage_05@10.0.1.110'},{addr_id,322931588457243899315235525521163488474},{key,<<"xz-img-01/51,0,58,14908,1616,1080,4b3b45e9.jpg">>},{clock,1545214109131558},{cause,primary_inconsistency}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:49.506328 +0800	1545214129	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,218849011693645702606343196622539469801},{key,<<"xz-img-01/51,0,86,15030,1500,2000,bf93d6ac.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:49.636818 +0800	1545214129	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,45662221158025527418605584447911674797},{key,<<"xz-img-01/51,0,37,15070,1500,2000,c8d4e3e2.jpg">>},{cause,sending_data_to_remote}]
[W]	storage_01@10.0.1.106	2018-12-19 18:08:50.353193 +0800	1545214130	leo_sync_remote_cluster:defer_stack_1/1	86	[{addr_id,132627485593750080073349362187704947564},{key,<<"xz-img-01/51,0,95,14931,1800,1200,dcbe253e.jpg">>},{cause,sending_data_to_remote}]

@NewDund (Author) commented Dec 20, 2018

@mocchira
Here is my traffic-limiting rule, set with tc:
/usr/sbin/tc qdisc add dev eno16777984 root tbf rate 150Mbit latency 60ms burst 15000
This limit is a safe amount for my intranet traffic, but I don't know how much bandwidth LeoFS needs. Can you give me some advice?

@mocchira (Member)

@NewDund I can see many error lines containing the word sending_data_to_remote, so let me confirm: under normal circumstances, does this error appear at such a high rate? sending_data_to_remote means "LeoStorage is trying to do MDCR with the remote cluster and cannot take on another replication task right now, so it will retry after a certain period of time", which suggests MDCR has become a bottleneck, and that might have put LeoStorage into a high-load situation.

One possible reason MDCR became a bottleneck is your 150Mbit bandwidth setting, so please check the bandwidth actually in use at the time the incident happened. If the network bandwidth was saturated, you may have to consider raising the limit to a larger value, such as 200-300Mbit.
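
The bandwidth check suggested above can be scripted. Below is a minimal sketch, assuming a Linux host (it reads /proc/net/dev); the function names are illustrative, and "lo" is used only so the snippet runs anywhere — on the storage node you would pass the NIC from the tc rule (e.g. eno16777984):

```python
import time

def rx_tx_bytes(iface):
    # Parse /proc/net/dev; after the colon, the first field is cumulative
    # RX bytes and the ninth field is cumulative TX bytes.
    with open("/proc/net/dev") as f:
        for line in f:
            name, sep, rest = line.partition(":")
            if sep and name.strip() == iface:
                fields = rest.split()
                return int(fields[0]), int(fields[8])
    raise ValueError("interface %r not found" % iface)

def throughput_mbit(iface, interval=1.0):
    # Sample the counters twice and convert the byte delta to Mbit/s.
    rx1, tx1 = rx_tx_bytes(iface)
    time.sleep(interval)
    rx2, tx2 = rx_tx_bytes(iface)
    return ((rx2 - rx1) * 8 / 1e6 / interval,
            (tx2 - tx1) * 8 / 1e6 / interval)

if __name__ == "__main__":
    rx, tx = throughput_mbit("lo")  # replace "lo" with the real NIC
    print("lo: RX %.1f Mbit/s, TX %.1f Mbit/s" % (rx, tx))
```

Comparing the reported TX rate against the 150Mbit tc cap during an incident shows whether the link is saturated.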

@NewDund (Author) commented Dec 20, 2018

@mocchira
You mean that the bandwidth limits for both my local and remote clusters should be increased, right?

@NewDund (Author) commented Dec 20, 2018

@mocchira
I checked as you suggested and found that there may have been network fluctuation between our local and remote clusters at that time. (Of course, I'm not entirely sure.)

The main point I want to make is that the local and remote clusters should, in theory, be independent of each other: nothing that happens to one cluster should affect the other. That is, even if a sudden communication failure between the two clusters prevents objects from being replicated or a cluster from being recovered, it should not affect normal access requests to the other cluster.

I think this should be a focus of optimization. NEEDS A FIX ~

@mocchira (Member)

@NewDund

I checked as you suggested and found that there may have been network fluctuation between our local and remote clusters at that time. (Of course, I'm not entirely sure.)

Got it. Then the cause is probably network-related.

The main point I want to make is that the local and remote clusters should, in theory, be independent of each other: nothing that happens to one cluster should affect the other. That is, even if a sudden communication failure between the two clusters prevents objects from being replicated or a cluster from being recovered, it should not affect normal access requests to the other cluster.

Right. The retry mechanism we currently use for asynchronous tasks, including MDCR, isn't sufficiently optimized (there is room for improvement), so we are going to improve it in a future release.

@NewDund (Author) commented Dec 20, 2018

Well, anyway. I hope to see the related issues addressed in the release notes of future versions.

@mocchira (Member)

Yes, we will. First, I will file an issue about making MDCR cope with the situation in which the network bandwidth between the two clusters is unstable.

@mocchira (Member)

@NewDund I will also share a recommended configuration that might make this kind of incident less likely to happen than with the default settings. Please wait for a while.

@mocchira (Member)

@NewDund The setting mdc_replication.size_of_stacked_objs in leo_storage.conf, explained at https://leo-project.net/leofs/docs/admin/settings/leo_storage/, controls how much data is transferred to the remote cluster at once. The default is around 32MB. That said, this number is probably too large for your environment (capped at 150Mbit), so I'd recommend lowering it, for example to 16MB (16777216). This might prevent the same kind of incident from happening again.
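
As a concrete illustration of that recommendation (a sketch of the one setting named above, not an official tuning guide), the line in leo_storage.conf would look like:

```
## Cap the amount of data stacked per MDCR transfer at 16MB (value in bytes).
## The default is roughly 32MB; halving it halves the burst pushed through
## the 150Mbit-limited link in one shot.
mdc_replication.size_of_stacked_objs = 16777216
```

For rough scale: a 32MB batch needs about 1.8 seconds at a full 150 Mbit/s, while a 16MB batch needs about 0.9 seconds, leaving more headroom for client traffic between transfers.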

@NewDund (Author) commented Dec 20, 2018

Okay, thank you, but we're going to stop MDCR first and use a self-written program to synchronize the two clusters manually.

Then, if the official project has a better solution, we will test it again.

@mocchira (Member)

@NewDund

Okay, thank you, but we're going to stop MDCR first and use a self-written program to synchronize the two clusters manually.

OK. A little bit sad for us, but your service's stability should come before anything else.

Then, if the official project has a better solution, we will test it again.

Thanks! I'm sure we will be able to improve MDCR stability thanks to your feedback, so once we are confident we can meet your requirements, we'll get back to you.

@NewDund (Author) commented Jan 9, 2019

@mocchira
My cluster has encountered another thorny problem. My LB has returned a lot of 503s and 504s, even though both my LB and my gateway node have very large timeouts configured.

Let's take a look at my log.

The following is the output of leofs-adm mq-stats:

==========
storage_01
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 365            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 24             | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_02
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 245            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 10             | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_03
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 336            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 11             | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_04
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 858            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_05
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 178            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 8              | 100            | 1000           | requests of synchronizing objects w/remote-node
==========
storage_06
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 100            | 1000           | requests of removing objects asynchronously
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 100            | 1000           | requests of comparing metadata w/remote-node
 leo_delete_dir_queue_1         |      idling       | 0              | 100            | 1000           | requests of removing buckets #1
 leo_delete_dir_queue_2         |      idling       | 0              | 100            | 1000           | requests of removing buckets #2
 leo_delete_dir_queue_3         |      idling       | 0              | 100            | 1000           | requests of removing buckets #3
 leo_delete_dir_queue_4         |      idling       | 0              | 100            | 1000           | requests of removing buckets #4
 leo_delete_dir_queue_5         |      idling       | 0              | 100            | 1000           | requests of removing buckets #5
 leo_delete_dir_queue_6         |      idling       | 0              | 100            | 1000           | requests of removing buckets #6
 leo_delete_dir_queue_7         |      idling       | 0              | 100            | 1000           | requests of removing buckets #7
 leo_delete_dir_queue_8         |      idling       | 0              | 100            | 1000           | requests of removing buckets #8
 leo_per_object_queue           |      running      | 228            | 100            | 1000           | requests of fixing inconsistency of objects
 leo_rebalance_queue            |      idling       | 0              | 100            | 1000           | requests of relocating objects
 leo_recovery_node_queue        |      idling       | 0              | 100            | 1000           | requests of recovering objects of the node (incl. recover-consistency)
 leo_req_delete_dir_queue       |      idling       | 0              | 100            | 1000           | requests of removing directories
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects by vnode-id
 leo_sync_obj_with_dc_queue     |      idling       | 0              | 100            | 1000           | requests of synchronizing objects w/remote-node

And these are the error logs of my gateway node:

[W]	gateway_01@10.0.1.103	2019-01-09 23:13:36.664776 +0800	1547046816	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{309428713511234830042129322128762777919,<<"xz-img-01/51,0,12,9422,2697,2000,3cc29bb8.jpg">>},-1,-1,false,13855437},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:36.881613 +0800	1547046816	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13458583},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:36.973674 +0800	1547046816	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13458675},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:37.312840 +0800	1547046817	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:37.396117 +0800	1547046817	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_50004_2,{get,{123201639718953384550134771689223183842,<<"xz-img-01/51,0,30,18212,2666,2000,695775a4.jpg">>},-1,-1,false,14269474},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:37.668072 +0800	1547046817	leo_gateway_rpc_handler:handle_error/5	298	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,put},{cause,"Replicate failure"}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:37.921751 +0800	1547046817	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13459623},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:38.378746 +0800	1547046818	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,head},{cause,timeout}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:38.380149 +0800	1547046818	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_01@10.0.1.106'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_50004_2,{get,{123201639718953384550134771689223183842,<<"xz-img-01/51,0,30,18212,2666,2000,695775a4.jpg">>},-1,-1,false,14270458},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:38.571475 +0800	1547046818	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13460273},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:41.169871 +0800	1547046821	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_03@10.0.1.108'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{309428713511234830042129322128762777919,<<"xz-img-01/51,0,12,9422,2697,2000,3cc29bb8.jpg">>},-1,-1,false,13859942},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:42.192968 +0800	1547046822	leo_gateway_rpc_handler:handle_error/5	298	[{node,'storage_02@10.0.1.107'},{mod,leo_storage_handler_object},{method,put},{cause,"Replicate failure"}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:42.378423 +0800	1547046822	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13464080},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:42.479124 +0800	1547046822	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_1,{get,{221241018722809845335291404478752879363,<<"xz-img-01/51,0,31,5897,2000,1333,42564455.jpg">>},-1,-1,false,13464180},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:43.77648 +0800	1547046823	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13464779},30000]}}}}]
[W]	gateway_01@10.0.1.103	2019-01-09 23:13:44.67854 +0800	1547046824	leo_gateway_rpc_handler:handle_error/5	303	[{node,'storage_06@10.0.1.105'},{mod,leo_storage_handler_object},{method,get},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_0_2,{get,{278625753006254488613486616696364700270,<<"xz-img-01/51,0,50,5948,2999,2000,433c6594.jpg">>},-1,-1,false,13465769},30000]}}}}]

Finally, the error log of my storage node:

[W]	storage_01@10.0.1.106	2019-01-09 23:26:04.334265 +0800	1547047564	leo_storage_read_repairer:loop/6	122	[{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:04.334730 +0800	1547047564	leo_storage_read_repairer:compare/4	174	[{node,'storage_03@10.0.1.108'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:04.334769 +0800	1547047564	leo_storage_read_repairer:compare/4	174	[{node,'storage_02@10.0.1.107'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:04.596170 +0800	1547047564	leo_storage_read_repairer:loop/6	122	[{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:04.596268 +0800	1547047564	leo_storage_read_repairer:compare/4	174	[{node,'storage_05@10.0.1.110'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:04.596306 +0800	1547047564	leo_storage_read_repairer:compare/4	174	[{node,'storage_02@10.0.1.107'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[E]	storage_01@10.0.1.106	2019-01-09 23:26:04.676252 +0800	1547047564	gen_server:call	0	gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:call/2 line 204
[E]	storage_01@10.0.1.106	2019-01-09 23:26:04.676440 +0800	1547047564	gen_server:call	0	CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:terminate/7 line 812
[E]	storage_01@10.0.1.106	2019-01-09 23:26:04.676641 +0800	1547047564	gen_server:call	0	Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.22727.13> exit with reason no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in context child_terminated
[W]	storage_01@10.0.1.106	2019-01-09 23:26:07.231076 +0800	1547047567	leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"xz-img-01/51,0,46,3074,2666,2000,f0394644.jpg">>},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:08.218021 +0800	1547047568	leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"xz-img-01/51,0,46,3074,2666,2000,f0394644.jpg">>},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:08.654064 +0800	1547047568	leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:08.832096 +0800	1547047568	leo_storage_read_repairer:compare/4	174	[{node,'storage_02@10.0.1.107'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:08.832555 +0800	1547047568	leo_storage_read_repairer:loop/6	122	[{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:08.832622 +0800	1547047568	leo_storage_read_repairer:compare/4	174	[{node,'storage_03@10.0.1.108'},{addr_id,28594113901002211803075798585761706491},{key,<<"xz-img-01/51,0,37,696,100,100,9fb285bb.jpg">>},{clock,1542114352087639},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:09.96855 +0800	1547047569	leo_storage_read_repairer:compare/4	174	[{node,'storage_05@10.0.1.110'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,{'EXIT',{timeout,{gen_server,call,[leo_object_storage_read_1_0,{head_with_check_avs,{327593916421263477271857669055578618711,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},check_header,14339443},30000]}}}}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:09.97249 +0800	1547047569	leo_storage_read_repairer:loop/6	122	[{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:09.97303 +0800	1547047569	leo_storage_read_repairer:compare/4	174	[{node,'storage_02@10.0.1.107'},{addr_id,327593916421263477271857669055578618711},{key,<<"xz-img-01/51,0,75,18627,2666,2000,ddb8492a.jpg">>},{clock,1545889560965447},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:09.525111 +0800	1547047569	leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[E]	storage_01@10.0.1.106	2019-01-09 23:26:09.677330 +0800	1547047569	gen_server:call	0	gen_server leo_rpc_client_manager terminated with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:call/2 line 204
[E]	storage_01@10.0.1.106	2019-01-09 23:26:09.678023 +0800	1547047569	gen_server:call	0	CRASH REPORT Process leo_rpc_client_manager with 0 neighbours exited with reason: no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in gen_server:terminate/7 line 812
[E]	storage_01@10.0.1.106	2019-01-09 23:26:09.678264 +0800	1547047569	gen_server:call	0	Supervisor leo_rpc_client_sup had child leo_rpc_client_manager started with leo_rpc_client_manager:start_link(5000) at <0.22827.13> exit with reason no such process or port in call to gen_server:call(<0.29704.0>, raw_status) in context child_terminated
[E]	storage_01@10.0.1.106	2019-01-09 23:26:10.36222 +0800	1547047570	null:null	0	{module,"leo_mq_consumer"},{function,"consume/4"},{line,535},{body,[{module,leo_storage_mq},{id,leo_sync_obj_with_dc_queue},{cause,{timeout,{gen_server,call,[leo_object_storage_read_10000_0,{get,{229460670587452068026601146147356987503,<<"xz-img-01/51,0,62,24332,132,132,6922bea2.jpg">>},-1,-1,false,15022113},30000]}}}]}
[W]	storage_01@10.0.1.106	2019-01-09 23:26:10.200066 +0800	1547047570	leo_storage_handler_object:read_and_repair_3/3	1434	[{key,<<"xz-img-01/51,0,86,12634,1800,1200,76c06fc3.jpg">>},{cause,timeout}]
[W]	storage_01@10.0.1.106	2019-01-09 23:26:11.965959 +0800	1547047571	leo_storage_read_repairer:compare/4	174	[{node,'storage_02@10.0.1.107'},{addr_id,28903003486663116147292874696914478445},{key,<<"xz-img-01/51,0,24,14729,1656,1242,e953b005.jpg">>},{clock,1545127149393637},{cause,timeout}]
[E]	storage_01@10.0.1.106	2019-01-09 23:26:12.154656 +0800	1547047572	null:null	0	{module,"leo_pod_manager"},{function,"start_child/2"},{line,324},{body,{connection_error,{connection_error,econnrefused}}}
[E]	storage_01@10.0.1.106	2019-01-09 23:26:12.155043 +0800	1547047572	null:null	0	Supervisor leo_rpc_client_storage_14_at_13077_sup had child leo_rpc_client_storage_14_at_13077 started with leo_pod_manager:start_link(leo_rpc_client_storage_14_at_13077, 8, 8, leo_rpc_client_conn, [storage_14,"10.16.1.74",13077,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[E]	storage_01@10.0.1.106	2019-01-09 23:26:12.155201 +0800	1547047572	null:null	0	CRASH REPORT Process <0.23087.13> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[W]	storage_01@10.0.1.106	2019-01-09 23:26:12.155398 +0800	1547047572	null:null	0	{module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{body,{{shutdown,{failed_to_start_child,leo_rpc_client_storage_14_at_13077,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_storage_14_at_13077,{leo_pod_sup,start_link,[leo_rpc_client_storage_14_at_13077,8,8,leo_rpc_client_conn,[storage_14,"10.16.1.74",13077,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
[E]	storage_01@10.0.1.106	2019-01-09 23:26:12.194692 +0800	1547047572	null:null	0	{module,"leo_pod_manager"},{function,"start_child/2"},{line,324},{body,{connection_error,{connection_error,econnrefused}}}
[E]	storage_01@10.0.1.106	2019-01-09 23:26:12.194994 +0800	1547047572	null:null	0	Supervisor leo_rpc_client_storage_14_at_13077_sup had child leo_rpc_client_storage_14_at_13077 started with leo_pod_manager:start_link(leo_rpc_client_storage_14_at_13077, 8, 8, leo_rpc_client_conn, [storage_14,"10.16.1.74",13077,0], #Fun<leo_rpc_client_sup.0.73440599>) at undefined exit with reason {connection_error,{connection_error,econnrefused}} in context start_error
[E]	storage_01@10.0.1.106	2019-01-09 23:26:12.195140 +0800	1547047572	null:null	0	CRASH REPORT Process <0.23091.13> with 1 neighbours exited with reason: {connection_error,{connection_error,econnrefused}} in gen_server:init_it/6 line 344
[W]	storage_01@10.0.1.106	2019-01-09 23:26:12.195310 +0800	1547047572	null:null	0	{module,"leo_rpc_client_sup"},{function,"start_child/3"},{line,106},{body,{{shutdown,{failed_to_start_child,leo_rpc_client_storage_14_at_13077,{connection_error,{connection_error,econnrefused}}}},{child,undefined,leo_rpc_client_storage_14_at_13077,{leo_pod_sup,start_link,[leo_rpc_client_storage_14_at_13077,8,8,leo_rpc_client_conn,[storage_14,"10.16.1.74",13077,0],#Fun<leo_rpc_client_sup.0.73440599>]},permanent,10000,supervisor,[leo_pod_sup]}}}
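
Since most of the warnings above are `gen_server:call` timeouts attributed to specific storage nodes, tallying the timeouts per node can show whether one node (or the local read processes) is the bottleneck. A rough helper for that, assuming the log format shown above (this is my own sketch, not part of LeoFS):

```python
import re
from collections import Counter

# Matches the {node,'storage_01@10.0.1.106'} tuples in LeoFS log lines.
NODE_RE = re.compile(r"\{node,'([^']+)'\}")

def count_timeouts(lines):
    """Count timeout-related log lines per remote node.

    Lines mentioning a timeout but no {node,...} tuple (e.g. local
    leo_object_storage_read_* timeouts) are grouped under "(local)".
    """
    counts = Counter()
    for line in lines:
        if "timeout" not in line:
            continue
        m = NODE_RE.search(line)
        counts[m.group(1) if m else "(local)"] += 1
    return counts
```

Feeding the error log through this quickly shows whether the timeouts cluster on one node or are spread across the cluster.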


NewDund commented Jan 9, 2019

@mocchira
This is my production environment. Please reply as soon as possible. Thank you very much.


NewDund commented Jan 10, 2019

@mocchira
Moreover, I found that 500 errors occasionally appear, which I have never seen before. Can you help me analyze why the 500s appear?


NewDund commented Jan 11, 2019

@mocchira
Also, I found a very strange problem.

For one of my PUT requests, haproxy returned 200, but my leo_gateway never received the request. Is this possible?

If so, how can I verify whether a PUT actually succeeded?
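
For reference, one client-side way to confirm a PUT really landed (a minimal sketch, not LeoFS-specific): compare the ETag in the PUT response with the locally computed MD5 of the body, then issue a HEAD for the key. For single-part (non-multipart) PUTs, S3-compatible stores return the hex MD5 of the body as the ETag. The `s3.put_object`/`head_object` calls in the comments are boto3-style and purely illustrative:

```python
import hashlib

def expected_etag(body: bytes) -> str:
    """Hex MD5 of the body -- what S3-compatible stores return as the
    ETag for a single-part (non-multipart) PUT."""
    return hashlib.md5(body).hexdigest()

# Sketch of a verified PUT (client object, bucket, and key are
# hypothetical; any S3 client exposing the response ETag works):
#   resp = s3.put_object(Bucket="xz-img-01", Key=key, Body=body)
#   assert resp["ETag"].strip('"') == expected_etag(body)
#   s3.head_object(Bucket="xz-img-01", Key=key)  # raises if the object is missing
```

A 200 from haproxy with no matching request at leo_gateway would fail both the ETag check and the follow-up HEAD, so this distinguishes a genuinely stored object from a response that never reached the backend.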

@mocchira

@NewDund Sorry for the long delay. I can spare time to look into this tonight.
