Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashing with "panic: failed to migrate ds" #4588

Open
shashank42 opened this issue Apr 18, 2024 · 8 comments
Open

Crashing with "panic: failed to migrate ds" #4588

shashank42 opened this issue Apr 18, 2024 · 8 comments

Comments

@shashank42
Copy link

shashank42 commented Apr 18, 2024

Describe the bug

Deployed using https://github.com/rudderlabs/rudderstack-helm
rudderstack backend keeps crashing and restarting with the error

Screenshot 2024-04-18 at 12 19 59 AM
Defaulted container "rudderstack-backend" out of: rudderstack-backend, rudderstack-telegraf-sidecar
Executing docker entrypoint script
Computed db host to namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless
2024/04/18 06:57:30 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
INFO: No .env file found.
2024-04-18T06:57:30.671Z	INFO	rudder-server/main.go:27	Setting memory limit to	{"limit": 1073741824}
2024-04-18T06:57:30.673Z	INFO	stats	stats/statsd.go:44	StatsD client setup succeeded.
2024-04-18T06:57:30.673Z	INFO	stats	stats/statsd.go:89	Stats started successfully in mode "StatsD" with address "localhost:8125"
2024-04-18T06:57:30.675Z	INFO	app	app/app.go:55	Open source version of rudder-server
2024-04-18T06:57:30.681Z	INFO	backend-config	backend-config/single_workspace.go:60	Setup backend config complete	{"workspaceId": ""}
2024-04-18T06:57:33.131Z	INFO	runner	runner/runner.go:344	warehousemode embedded
2024-04-18T06:57:33.133Z	INFO	db.recovery	db/embeddedRecovery.go:42	Starting in normal mode
2024-04-18T06:57:33.479Z	INFO	backend-config	backend-config/backend-config.go:215	Workspace Config changed{"workspaces": 1, "sources": 6}
2024-04-18T06:57:36.430Z	INFO	runner.warehouse.notifier	notifier/notifier.go:165	Initializing Notifier...
2024-04-18T06:57:37.141Z	INFO	runner	runner/runner.go:344	warehousemode embedded
2024-04-18T06:57:37.142Z	INFO	admin	admin/admin.go:167	Serving on admin interface @ /data/rudderstack/rudder-server.sock
2024-04-18T06:57:37.142Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:87	Embedded mode: Starting Rudder Core
2024-04-18T06:57:37.143Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:95	Configured deployment type: "DEDICATED"
2024-04-18T06:57:37.143Z	INFO	apphandlers.EMBEDDED	apphandlers/embeddedAppHandler.go:105	Clearing DB false
2024-04-18T06:57:37.143Z	INFO	debugger.transformation	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.144Z	INFO	debugger.destination	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.145Z	INFO	debugger.source	cache/cache.go:29	Using in-memory cache
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:310	Starting Warehouse service...
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:375	Starting warehouse slave...
2024-04-18T06:57:37.146Z	INFO	runner.warehouse	warehouse/app.go:391	[WH]: Starting warehouse master...
2024-04-18T06:57:37.147Z	INFO	runner.warehouse.api	api/http.go:157	waiting for BackendConfig before starting on 8082
2024-04-18T06:57:37.147Z	INFO	runner.warehouse.api	api/http.go:142	Starting warehouse master service on8082
2024-04-18T06:57:37.148Z	INFO	runner.warehouse.notifier	notifier/notifier.go:312	Deleting all jobs for workspace: default::f597437212cb10a56115df53cd0f6881
2024-04-18T06:57:37.150Z	INFO	runner.warehouse	warehouse/app.go:475	Starting a new Warehouse Destination Router: RS
2024-04-18T06:57:37.150Z	INFO	runner.warehouse.router.RS	router/router.go:147	WH: Warehouse Router started: RS
2024-04-18T06:57:38.035Z	INFO	runner.warehouse.source	source/source.go:143	starting source jobs processing
2024-04-18T06:57:38.240Z	INFO	runner.warehouse.router.RS	router/router.go:254	Received updated workspace config
2024-04-18T06:57:38.635Z	INFO	transformer-features	transformer/features_impl.go:62	Fetching transformer features from http://namaste-rudderstack-transformer:9090
2024-04-18T06:57:39.030Z	INFO	jobsdb.gw	jobsdb/jobsdb.go:840	Connected to gw DB
2024-04-18T06:57:50.231Z	INFO	jobsdb.gw	jobsdb/jobsdb.go:840	Connected to gw DB
2024-04-18T06:57:56.737Z	INFO	jobsdb.rt	jobsdb/jobsdb.go:840	Connected to rt DB
2024-04-18T06:57:59.233Z	INFO	jobsdb.batch_rt	jobsdb/jobsdb.go:840	Connected to batch_rt DB
2024-04-18T06:58:54.035Z	INFO	jobsdb.proc_error	jobsdb/jobsdb.go:840	Connected to proc_error DB
2024-04-18T06:58:55.035Z	INFO	jobsdb.proc_error	jobsdb/jobsdb.go:840	Connected to proc_error DB
2024-04-18T06:58:57.340Z	INFO	jobsdb.esch	jobsdb/jobsdb.go:840	Connected to esch DB
2024-04-18T06:58:58.930Z	INFO	jobsdb.arc	jobsdb/jobsdb.go:840	Connected to arc DB
2024-04-18T06:59:00.036Z	INFO	apphandlers.EMBEDDED	apphandlers/setup.go:114	using Static Cluster Manager
2024-04-18T06:59:00.237Z	INFO	enterprise.suppress-user	suppress-user/factory.go:48	Suppress User feature is enterprise only
2024-04-18T06:59:00.238Z	WARN	rsources	rsources/handler.go:770	shared database is not configured, skipping logical replication monitoring
2024-04-18T06:59:00.239Z	INFO	gateway	gateway/handle_lifecycle.go:363	WebHandler waiting for BackendConfig before starting on 8080
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:100	Got trigger to change the mode, new mode: NORMAL, old mode: DEGRADED
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:218	Transiting the server from DegradedMode to NormalMode
2024-04-18T06:59:00.239Z	INFO	cluster	cluster/dynamic.go:123	Starting the server
2024-04-18T06:59:00.242Z	INFO	gateway	gateway/handle_lifecycle.go:195	BackendConfig initialised	{"sources": 6}
2024-04-18T06:59:00.242Z	INFO	gateway	gateway/handle_lifecycle.go:365	WebHandler Starting on 8080
2024-04-18T06:59:00.530Z	INFO	jobsdb.proc_error	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:01.737Z	INFO	jobsdb.gw	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:02.237Z	INFO	jobsdb.esch	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:03.536Z	INFO	jobsdb.arc	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:04.738Z	INFO	jobsdb.rt	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T06:59:13.138Z	INFO	jobsdb.batch_rt	jobsdb/backup.go:36	backupDSLoop is running
2024-04-18T07:08:49.633Z	INFO	archiver	archiver/archiver.go:99	Starting archiver
2024-04-18T07:08:49.635Z	INFO	processor	processor/processor.go:628	Starting processor in isolation mode: source
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:659	Starting pinger loop
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:661	Backend config received
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:664	Waiting for async init group
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:671	Async init group done
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:674	Waiting for transformer features
2024-04-18T07:08:49.637Z	INFO	processor	processor/processor.go:681	Transformer features received
2024-04-18T07:08:49.830Z	INFO	processor.stash	stash/stash.go:296	Processor errors stash loop started
2024-04-18T07:09:43.545Z	ERROR	utils.misc	misc/misc.go:926	Panic detected. Application will crash.{"stack": "goroutine 5537 [running]:\nruntime/debug.Stack()\n\t/usr/local/go/src/runtime/debug/stack.go:24 +0x5e\ngithub.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()\n\t/rudder-server/utils/misc/misc.go:927 +0x37f\nsync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)\n\t/usr/local/go/src/sync/once.go:74 +0xc2\nsync.(*Once).Do(...)\n\t/usr/local/go/src/sync/once.go:65\ngithub.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()\n\t/rudder-server/utils/misc/misc.go:916 +0xbb\npanic({0x34155c0?, 0xc006a39540?})\n\t/usr/local/go/src/runtime/panic.go:770 +0x132\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})\n\t/rudder-server/jobsdb/migration.go:54 +0x1ff\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()\n\t/rudder-server/jobsdb/migration.go:28 +0x1f\ngithub.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()\n\t/rudder-server/utils/misc/misc.go:945 +0x5c\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\t/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:78 +0x56\ncreated by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162\n\t/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:75 +0x96\n", "panic": "failed to migrate ds: pq: canceling statement due to user request; driver: bad connection", "team": "Core", "goRoutines": 685, "version": "1.23.0", "releaseStage": "development"}
2024-04-18T07:09:43.545Z	ERROR	utils.misc	misc/misc.go:926	goroutine 5537 [running]:
github.com/rudderlabs/rudder-go-kit/logger.(*logger).Fatalw(0xc00128f9e0, {0x3a85163?, 0x4c6?}, {0xc0011ae600?, 0x1?, 0x34155c0?})
	/go/pkg/mod/github.com/rudderlabs/rudder-go-kit@v0.23.3/logger/logger.go:332 +0xa5
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()
	/rudder-server/utils/misc/misc.go:926 +0x5b3
sync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)
	/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:65
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()
	/rudder-server/utils/misc/misc.go:916 +0xbb
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})
	/rudder-server/jobsdb/migration.go:54 +0x1ff
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()
	/rudder-server/jobsdb/migration.go:28 +0x1f
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()
	/rudder-server/utils/misc/misc.go:945 +0x5c
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:75 +0x96

2024/04/18 07:09:43 notifying bugsnag: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection
panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection [recovered]
	panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection [recovered]
	panic: failed to migrate ds: pq: canceling statement due to user request; driver: bad connection

goroutine 5537 [running]:
github.com/bugsnag/bugsnag-go/v2.AutoNotify({0xc00740dac0, 0x3, 0xc001969500?})
	/go/pkg/mod/github.com/bugsnag/bugsnag-go/v2@v2.3.1/bugsnag.go:114 +0x3ef
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1.1()
	/rudder-server/utils/misc/misc.go:935 +0x5d4
sync.(*Once).doSlow(0x4778d2?, 0xc002607d78?)
	/usr/local/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
	/usr/local/go/src/sync/once.go:65
github.com/rudderlabs/rudder-server/utils/misc.BugsnagNotify.func1()
	/rudder-server/utils/misc/misc.go:916 +0xbb
panic({0x34155c0?, 0xc006a39540?})
	/usr/local/go/src/runtime/panic.go:770 +0x132
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).migrateDSLoop(0xc000cdd188, {0x4377970, 0xc0025f6ff0})
	/rudder-server/jobsdb/migration.go:54 +0x1ff
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.func1()
	/rudder-server/jobsdb/migration.go:28 +0x1f
github.com/rudderlabs/rudder-server/jobsdb.(*Handle).startMigrateDSLoop.WithBugsnag.func2()
	/rudder-server/utils/misc/misc.go:945 +0x5c
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 2162
	/go/pkg/mod/golang.org/x/sync@v0.6.0/errgroup/errgroup.go:75 +0x96

While at the same time postgress machine logs the following error

2024-04-18 07:57:09.839 UTC [16135] STATEMENT:  DELETE FROM "batch_rt_job_status_435_2"
							WHERE NOT id = ANY(
								SELECT DISTINCT ON (job_id) id from "batch_rt_job_status_435_2" ORDER BY job_id ASC, id DESC
							)
2024-04-18 07:57:10.730 UTC [16111] LOG:  unexpected EOF on client connection with an open transaction
2024-04-18 07:57:11.328 UTC [18493] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.347 UTC [18494] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.429 UTC [18490] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.441 UTC [18488] LOG:  could not send data to client: Broken pipe
2024-04-18 07:57:11.453 UTC [18488] FATAL:  connection to client lost
2024-04-18 07:57:11.538 UTC [18491] LOG:  could not receive data from client: Connection reset by peer
2024-04-18 07:57:11.543 UTC [18435] LOG:  could not send data to client: Connection reset by peer
2024-04-18 07:57:11.543 UTC [18435] STATEMENT:  SELECT * FROM (SELECT
										jobs.job_id, jobs.uuid, jobs.user_id, jobs.parameters, jobs.custom_val, jobs.event_payload, jobs.event_count,
										jobs.created_at, jobs.expire_at, jobs.workspace_id,
										pg_column_size(jobs.event_payload) as payload_size,
										sum(jobs.event_count) over (order by jobs.job_id asc) as running_event_counts,
										sum(pg_column_size(jobs.event_payload)) over (order by jobs.job_id) as running_payload_size,
										job_latest_state.job_state, job_latest_state.attempt,
										job_latest_state.exec_time, job_latest_state.retry_time,
										job_latest_state.error_code, job_latest_state.error_response, job_latest_state.parameters
									FROM
										"gw_jobs_545" AS jobs
										LEFT JOIN "gw_job_status_545" job_latest_state ON jobs.job_id=job_latest_state.job_id
									    WHERE ((job_latest_state.job_id IS NULL)) AND ((jobs.custom_val='GW')) AND (jobs.parameters->>'source_id'='2dPZSbNyzCt11TNn88HyTi3hF7N')
										ORDER BY jobs.job_id  LIMIT 772 ) t WHERE running_event_counts - t.event_count <= $1 AND running_payload_size - t.payload_size <= $2
2024-04-18 07:57:11.543 UTC [18435] FATAL:  connection to client lost
2024-04-18 07:57:11.543 UTC [18435] STATEMENT:  SELECT * FROM (SELECT
										jobs.job_id, jobs.uuid, jobs.user_id, jobs.parameters, jobs.custom_val, jobs.event_payload, jobs.event_count,
										jobs.created_at, jobs.expire_at, jobs.workspace_id,
										pg_column_size(jobs.event_payload) as payload_size,
										sum(jobs.event_count) over (order by jobs.job_id asc) as running_event_counts,
										sum(pg_column_size(jobs.event_payload)) over (order by jobs.job_id) as running_payload_size,
										job_latest_state.job_state, job_latest_state.attempt,
										job_latest_state.exec_time, job_latest_state.retry_time,
										job_latest_state.error_code, job_latest_state.error_response, job_latest_state.parameters
									FROM
										"gw_jobs_545" AS jobs
										LEFT JOIN "gw_job_status_545" job_latest_state ON jobs.job_id=job_latest_state.job_id
									    WHERE ((job_latest_state.job_id IS NULL)) AND ((jobs.custom_val='GW')) AND (jobs.parameters->>'source_id'='2dPZSbNyzCt11TNn88HyTi3hF7N')
										ORDER BY jobs.job_id  LIMIT 772 ) t WHERE running_event_counts - t.event_count <= $1 AND running_payload_size - t.payload_size <= $2
2024-04-18 07:57:11.932 UTC [18470] LOG:  unexpected EOF on client connection with an open transaction
2024-04-18 07:57:12.031 UTC [18499] LOG:  incomplete startup packet
2024-04-18 07:57:17.140 UTC [18514] ERROR:  relation "rsources_failed_keys_v2" already exists
2024-04-18 07:57:17.140 UTC [18514] STATEMENT:  create table "rsources_failed_keys_v2" (
			id VARCHAR(27) COLLATE "C",
			db_name text not null default 'namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless',
			job_run_id text not null,
			task_run_id text not null,
			source_id text not null,
			destination_id text not null,
			primary key (id),
			unique (job_run_id, task_run_id, source_id, destination_id, db_name)
		)
2024-04-18 07:57:17.143 UTC [18514] ERROR:  relation "rsources_failed_keys_v2_records" already exists
2024-04-18 07:57:17.143 UTC [18514] STATEMENT:  create table "rsources_failed_keys_v2_records" (
			id VARCHAR(27) COLLATE "C",
			record_id text not null,
	    	ts timestamp not null default NOW(),
			primary key (id, record_id)
		)
2024-04-18 07:57:17.338 UTC [18514] ERROR:  relation "rsources_stats" already exists
2024-04-18 07:57:17.338 UTC [18514] STATEMENT:  create table "rsources_stats" (
			db_name text not null default 'namaste-rudderstack-rudderstack-postgresql-0.namaste-rudderstack-rudderstack-postgresql-headless',
			job_run_id text not null,
			task_run_id text not null,
			source_id text not null,
			destination_id text not null,
			in_count integer not null default 0,
			out_count integer not null default 0,
			failed_count integer not null default 0,
			ts timestamp not null default NOW(),
			primary key (db_name, job_run_id, task_run_id, source_id, destination_id)
		)
2024-04-18 07:57:20.533 UTC [18425] ERROR:  canceling autovacuum task
2024-04-18 07:57:20.533 UTC [18425] CONTEXT:  automatic vacuum of table "jobsdb.public.gw_job_status_511_1"
2024-04-18 07:57:21.128 UTC [18148] LOG:  unexpected EOF on client connection with an open transaction

While there are no errors on transformer

➜  rudderstack-helm git:(master) ✗ kubectl logs --tail=10000 -f namaste-rudderstack-transformer-f45745c8f-8psss

> rudder-transformer@1.62.1 start
> cd dist;node ./src/index.js;cd ..

setting up statsd client
Stack Depth set: 64
Interval Bytes set: 524288
Transformer: Swagger route loading
Transformer: Swagger route loaded
{"extraData":{},"level":"info","message":"Using new routes","timestamp":"2024-04-18 06:48:30"}
Master (pid: 19) has started
Worker (pid: 26) is online
setting up statsd client
Stack Depth set: 64
Interval Bytes set: 524288
Transformer: Swagger route loading
Transformer: Swagger route loaded
{"extraData":{},"level":"info","message":"Using new routes","timestamp":"2024-04-18 06:48:33"}
Worker (pid: 26) has started
App started. Listening on port: 9090
@gitcommitshow
Copy link
Collaborator

Hi @shashank42 , thank you for reporting the issue. I will get back to you soon.

@gitcommitshow
Copy link
Collaborator

@shashank42 can you please share the version number of rudder-server you're using?

@shashank42
Copy link
Author

For the current deployment

    Image:         rudderlabs/rudder-server:latest
    Image ID:      docker.io/rudderlabs/rudder-server@sha256:6ed8c0155ed821dfded3ba8efd6909b313989684cacfe210322f4131ecdbbe77

https://github.com/glip-gg/rudderstack-helm

Here is the repo I am using

@shashank42
Copy link
Author

I think the issue is similar to this. I have taken the steps here, that is deleting all the tables inside jobsdb, however the errors have not stopped
https://rudderstack.slack.com/archives/C01E4PLB135/p1692320392539509?thread_ts=1689977868.588139&cid=C01E4PLB135

@gitcommitshow
Copy link
Collaborator

@shashank42 does the issue persist when you restart the postgres db pod?
The shared logs are not enough to make any conclusion - it seems migration loop timeouts, and from the db logs it looks like the database is overloaded

@shashank42
Copy link
Author

I just did a new deployment on a new cluster to check fresh setup case. There are still restarts, but less frequent.

Screenshot 2024-05-05 at 6 25 55 PM

Postgress doesn't seem to run out memory at any point.

@gitcommitshow
Copy link
Collaborator

@shashank42 can you check the resource (CPU, RAM) usage before and after you start the server? please also mention what resources(CPU, memory) have you provisioned for postgres?

Context: with the limited info, we are guessing that enough resources might not have been allocated (assuming the server is not overloaded with unusual number of requests) which causes a query timeout

@gitcommitshow
Copy link
Collaborator

@shashank42 can you please help with more information as mentioned above?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants