storage_io_error while deploying cluster #75
Comments
Do you mean it works sometimes and sometimes not? |
Do you know what this could be @glommer ? |
The first node of a 3-node Scylla cluster will spin up and reach a Ready state,
but will continue to spam storage_io_error messages.
While writing this reply and looking at the simple-cluster-statefulset-1 logs to post here, it seems that it is now struggling with malformed sstables rather than the storage_io_error from my previous run.
Interestingly, on my secondary testing cluster, which has the same setup (Rook-Ceph as storage backend, Kubernetes 1.18.0, Ubuntu 18.04.4 LTS), just smaller, everything works fine. |
Can you test with developerMode: false? |
So, running with developerMode: false in the cluster.yaml, I get the following logs:
|
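For reference, a minimal sketch of where this flag lives; apart from developerMode itself, the field names and values below are assumed from the operator's example cluster.yaml rather than taken from this thread:

```yaml
# Hypothetical cluster.yaml excerpt -- only developerMode is the setting
# under discussion here; the surrounding fields and values are illustrative.
apiVersion: scylla.scylladb.com/v1alpha1
kind: Cluster
metadata:
  name: simple-cluster
  namespace: scylla
spec:
  version: 4.0.rc1
  developerMode: false   # the flag being toggled in the comment above
  datacenter:
    name: dc1
    racks:
      - name: rack1
        members: 3
        storage:
          capacity: 100Gi
```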
That looks completely broken now if the commitlog is not found. Can you recreate the cluster from scratch? |
Can you check the logs of the rook agent? There is an issue that could perhaps explain it, but the logs need to be checked to be sure. |
Off the top of my head I don't know what in the commitlog could return EINTR. Maybe your fellow Swede Calle knows? However, just as a hunch, I'll say that maybe this is because the storage is not on XFS. I don't know if it is or it isn't, but my bet is that it isn't, and this is some fs-specific behavior. We very seldom test outside of XFS. |
My rook ceph deployment doesn't make use of agents. As for whether I can recreate the cluster, that would be the very last thing I'd try once all my other options are exhausted.
I set up rook-ceph as our storage backend to deploy the OSDs on drives or partitions formatted as XFS.
I'm also investigating whether the root cause of my issue might be a wonky rook-ceph cluster after my Kubernetes upgrades. I initially dismissed that thought because other stores like MinIO and Elasticsearch, which use it, are working fine, but Scylla might be more sensitive to issues with rook-ceph. |
So, after tearing down and recreating a stable rook-ceph cluster, I redeployed the latest scylla-operator.
I only get the storage_io_error when using scylla-operator version v0.1.2.
|
@MicDeDuiwel I was having a similar issue starting a Scylla cluster on AKS. I switched to using Scylla 3.3.0 and was able to successfully deploy a Scylla cluster. Not sure if that is an option for you, but it might be worth trying. |
That is very strange. Can you reproduce this at will? Does it always fail on 3.2.1, @cmball1? |
@cmball1 Thanks, I now get a different error, which leads me to believe that the issue I'm facing might be the way Scylla interprets the Ceph filesystem:
|
Scylla 3.2 Docker images may hit a known issue, scylladb/scylladb#5638, that is only fixed in the Scylla 3.3 Docker images (where we work around an issue of not having enough resources set). |
Update on my progress.
I set my rook cluster CRD to fstype: xfs according to this link: https://kubernetes.io/docs/concepts/storage/storage-classes/#ceph-rbd. I consistently manage to get one or two nodes running when using the latest 4.0.rc1; any more nodes are a struggle. I did find that when redeploying the cluster, it is best to redeploy the operator as well.
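For illustration, a StorageClass along the lines of the linked ceph-rbd docs; the names and parameter values below are placeholders rather than values from this deployment:

```yaml
# Hypothetical in-tree Ceph RBD StorageClass requesting XFS, per the
# linked docs; monitor address and pool are placeholders, and the
# auth/secret parameters are omitted for brevity.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-xfs
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.16.153.105:6789
  pool: rbd
  fsType: xfs   # the setting under discussion
```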
This might be a 4.0.rc1 thing, but the nodes will not deploy unless they are in developer mode. Logs from the final node of a 3-node cluster:
The operator repeats these logs:
It is also interesting to note that I can't scale up the Scylla cluster after deployment as indicated by the docs:
Also, even on the running nodes, I can't seem to access cqlsh:
Finally, before the final node reboots itself, it throws a bunch of errors regarding malformed sstables:
|
@MicDeDuiwel it looks like Ceph has a bad interaction with asynchronous I/O. @tchaikov, do you happen to know of any problems with aio/dio and Ceph? We see EINTR and maybe data corruption. |
Not sure if relevant, but Ceph seems to have issues with XFS support on top of RBDs: |
It does look relevant. |
@yanniszark @avikivity it looks like an issue on the nbd (rbd) side, so it's a bit outside my expertise; I am more focused on the underlying RADOS cluster, and rbd is a client of the RADOS cluster. The symptom described in rook/rook#3132 was high CPU usage caused by a deadlock in the kernel where the rbd client is colocated with an OSD daemon, but I cannot find evidence of either of those in the description above. Is Scylla colocated with the OSD? Is high CPU usage observed? Also, probably a little bit off-topic: should Scylla (or Seastar) retry the syscall when it returns EINTR? |
Good idea re EINTR. Most filesystems don't return it, but NFS soft mounts can, and we see that so does Ceph. But how much should we retry? At some point we should give up. Is there documentation somewhere on when we can expect EINTR? I don't want to retry EINTR if I get it as a response to a signal. |
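For readers following along, the classic userspace idiom is the bare retry loop sketched below (illustrative C++, not Seastar's actual I/O path); note that it exhibits exactly the tension raised above, since it also swallows interruptions from deliberate signals:

```cpp
#include <cerrno>
#include <unistd.h>

// Illustrative only: retry a read(2) that failed with EINTR. Most local
// filesystems never return EINTR here, but network-backed storage (NFS
// soft mounts, and apparently Ceph RBD in this thread) can. A loop like
// this cannot tell a harmless interruption apart from a signal the
// application actually wanted to react to.
ssize_t read_retrying_eintr(int fd, void* buf, size_t count) {
    ssize_t r;
    do {
        r = ::read(fd, buf, count);
    } while (r == -1 && errno == EINTR);
    return r;
}
```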
@dahankzter Yes, I tried deploying a cluster several times, and each time I hit the same issue with the storage_io_error. |
An update from our side: after a lot of digging and reaching out, we came to the realisation that our rook deployment must be at fault. We had many issues with rook and CephFS and decided to switch to Ceph block storage. This solved our issue; Scylla deployments are working just fine now.
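For anyone making the same switch, a replicated block pool along these lines is the usual starting point (names and values below are illustrative, not taken from this deployment); note that the replicated size here is exactly the layer that multiplies Scylla's own replication, as pointed out in the next comment:

```yaml
# Hypothetical rook-ceph block pool backing an RBD StorageClass; with
# size: 3 here and RF=3 in Scylla, each piece of data is stored nine times.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
```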
Not sure. I need to consult rbd developers on this. |
@dahankzter can we prevent, or at least warn about, this kind of deployment? At least until we can figure out how we can support it. @MicDeDuiwel note that you'll be experiencing double replication and poor performance running on top of Ceph, since each of Scylla's replicas will itself be replicated by Ceph. So if each layer has a replication factor of 3, you end up with an overall replication factor of 9. The recommendation is to work with local volumes. |
Afaik it's very hard to detect this, @avikivity. We could perhaps try to detect the presence of essential directories, but that doesn't really help since it's pretty late in the cycle: deployment is well under way by the time this is possible, and it's already done by Scylla, no? |
@dahankzter I opened this issue yesterday and it looks somewhat similar. I have a reproducer: |
@MicDeDuiwel did you have to enable developer mode for Scylla to work on top of Ceph block storage? I'm trying to do the same thing, but io_setup always fails. |
@akhilles yes, we deployed the scylla-operator with developer mode enabled. |
Describe the bug
Hi, I'm having issues getting my Scylla cluster to run on my Kubernetes 1.18.0 cluster using rook-ceph as the storage backend.
I keep getting this error when the first node starts up:
Or sometimes the first node starts up fine and the second node throws the same error.
I set up some nodeSelectors to deploy the Scylla cluster on nodes that I'm certain don't have disk issues.
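A sketch of such a placement, assuming the rack-level placement/nodeAffinity fields from the operator's example manifests; the node label is a placeholder for whatever marks the healthy nodes:

```yaml
# Hypothetical rack excerpt pinning Scylla pods to known-good nodes.
racks:
  - name: rack1
    members: 3
    placement:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: storage-health   # placeholder label
                  operator: In
                  values: ["good"]
```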
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect the scylla pods to deploy one by one.
Logs
Environment:
Additional context
I had updated the cluster from 1.15 -> 1.16 -> 1.17 -> 1.18 when this first started happening, but on a second test cluster, on which I performed the same Kubernetes upgrade procedure, Scylla was working fine.