
[Bug]: Crash on insert or deletion into compressed chunk #6753

Open · hostops opened this issue Mar 8, 2024 · 20 comments

@hostops (Author) commented Mar 8, 2024

What type of bug is this?

Crash

What subsystems and features are affected?

Compression, Data ingestion

What happened?

When we try to insert data into compressed chunk, database crashes.

TimescaleDB version affected

2.14.2

PostgreSQL version used

15.6

What operating system did you use?

Ubuntu 20.04.2 x64

What installation method did you use?

Docker, Other

What platform did you run on?

Amazon Web Services (AWS)

Relevant log output and stack trace

2024-03-08 11:41:13	
2024-03-08 10:41:13,080 WARNING: Request failed to calms-timescaledb-ha-0: GET http://100.124.21.154:8008/patroni (HTTPConnectionPool(host='100.124.21.154', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
	
2024-03-08 11:41:12	
app= FATAL:  could not connect to the primary server: connection to server at "100.124.21.154", port 5432 failed: server closed the connection unexpectedly
	
2024-03-08 11:40:48	
2024-03-08 10:40:48,962 WARNING: Request failed to calms-timescaledb-ha-1: GET http://100.106.177.97:8008/patroni (HTTPConnectionPool(host='100.106.177.97', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
	
2024-03-08 11:40:48	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: server closed the connection unexpectedly
	
2024-03-08 11:38:03	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:38:03	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:37:43	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:37:43	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:37:23	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:37:22	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:37:02	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:37:02	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:36:41	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:36:41	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:36:32	
app=[unknown] ERROR:  cannot execute COPY FROM in a read-only transaction
	
2024-03-08 11:36:21	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:36:21	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:36:00	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:36:00	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:35:40	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:35:40	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:35:19	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:35:19	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:34:59	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:34:59	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:34:39	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:34:39	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:34:18	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.97", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:34:18	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:33:58	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:33:54	
2024-03-08 10:33:54,785 WARNING: Request failed to calms-timescaledb-ha-0: GET http://100.124.21.168:8008/patroni (HTTPConnectionPool(host='100.124.21.168', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
	
2024-03-08 11:33:53	
app= FATAL:  could not connect to the primary server: connection to server at "100.124.21.168", port 5432 failed: server closed the connection unexpectedly
	
2024-03-08 11:33:47	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:33:28	
psycopg2.OperationalError: connection to server at "100.124.21.168", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:33:28	
2024-03-08 10:33:28,266 ERROR: Exception when working with leader
	
2024-03-08 11:33:28	
2024-03-08 10:33:28,210 WARNING: Failed to rewind because primary calms-timescaledb-ha-0 become unreachable
	
2024-03-08 11:33:28	
psycopg2.OperationalError: connection to server at "100.124.21.168", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:33:28	
2024-03-08 10:33:28,210 ERROR: Exception when working with leader
	
2024-03-08 11:33:27	
pg_rewind: error: unexpected result while fetching remote files: server closed the connection unexpectedly
	
2024-03-08 11:33:27	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:33:07	
psycopg2.OperationalError: connection to server at "100.124.21.168", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:33:07	
2024-03-08 10:33:07,956 ERROR: Exception when working with leader
	
2024-03-08 11:33:07	
2024-03-08 10:33:07,898 WARNING: Failed to rewind because primary calms-timescaledb-ha-0 become unreachable
	
2024-03-08 11:33:07	
psycopg2.OperationalError: connection to server at "100.124.21.168", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:33:07	
2024-03-08 10:33:07,892 ERROR: Exception when working with leader
	
2024-03-08 11:33:07	
pg_rewind: error: unexpected result while fetching remote files: server closed the connection unexpectedly
	
2024-03-08 11:33:07	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:32:46	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:32:40	
2024-03-08 10:32:40,489 WARNING: Request failed to calms-timescaledb-ha-1: GET http://100.106.177.101:8008/patroni (HTTPConnectionPool(host='100.106.177.101', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
	
2024-03-08 11:32:36	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:32:36	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:32:16	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:32:15	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:31:55	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:31:55	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:31:35	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:31:35	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:31:15	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:31:14	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:30:54	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:30:54	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:30:34	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:30:33	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:30:14	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:30:13	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:29:53	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:29:53	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:29:33	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:29:32	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:29:12	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:29:12	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
	
2024-03-08 11:28:52	
app= FATAL:  could not connect to the primary server: connection to server at "100.106.177.101", port 5432 failed: FATAL:  the database system is in recovery mode
	
2024-03-08 11:28:52	
app= DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;

How can we reproduce the bug?

We use `COPY FROM STDIN` for insertion.
We also run a DELETE before each such COPY, so the problem could also be in the DELETE statement.
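
Roughly, each upload runs the following sequence inside a single transaction (a sketch; the $-placeholders stand for parameters bound by our application):

BEGIN ISOLATION LEVEL SERIALIZABLE;
DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
COPY signal (id, time, value) FROM STDIN BINARY;
INSERT INTO signal VALUES ($id, $time, NULL) ON CONFLICT DO NOTHING;
COMMIT;
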
@hostops added the bug label Mar 8, 2024
@hostops (Author) commented Mar 8, 2024

I also believe this is linked to this issue: #6031

@antekresic (Contributor) commented:

Hi there,

Could you send us the schema definitions of the hypertable and also your compression settings? It does look like the DELETE is the problem here.

Thanks,

@hostops (Author) commented Mar 8, 2024

                         Table "public.signal"
 Column |            Type             | Collation | Nullable | Default
--------+-----------------------------+-----------+----------+---------
 id     | bigint                      |           | not null |
 time   | timestamp without time zone |           | not null |
 value  | double precision            |           |          |
Indexes:
    "signal_id_time_idx" btree (id, "time" DESC)
    "signal_time_idx" btree ("time" DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON signal FOR EACH ROW EXECUTE FUNCTION _timescaledb_functions.insert_blocker()
Number of child tables: 362 (Use \d+ to list them.)

                          Table "public.aggregated"
    Column    |            Type             | Collation | Nullable | Default
--------------+-----------------------------+-----------+----------+---------
 id           | bigint                      |           | not null |
 time         | timestamp without time zone |           | not null |
 duration     | double precision            |           | not null |
 integral     | double precision            |           | not null |
 rising_edges | integer                     |           | not null |
 min          | double precision            |           |          |
 max          | double precision            |           |          |
 level        | aggregation_level           |           | not null |
Indexes:
    "aggregated_id_level_time_idx" btree (id, level, "time" DESC)
    "aggregated_time_idx" btree ("time" DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON aggregated FOR EACH ROW EXECUTE FUNCTION _timescaledb_functions.insert_blocker()
Number of child tables: 361 (Use \d+ to list them.)

CREATE TABLE signal (
    id              BIGINT            NOT NULL,
    time            TIMESTAMP         NOT NULL,
    value           DOUBLE PRECISION  NULL
);
SELECT create_hypertable('signal', 'time', chunk_time_interval => interval '10 days');
CREATE TABLE aggregated (
    id              BIGINT            NOT NULL,
    time            TIMESTAMP         NOT NULL,
    duration        DOUBLE PRECISION  NOT NULL,
    integral        DOUBLE PRECISION  NOT NULL,
    rising_edges    INT               NOT NULL,
    min             DOUBLE PRECISION  NULL,
    max             DOUBLE PRECISION  NULL,
    UNIQUE (time, id)
);
SELECT create_hypertable('aggregated', 'time', chunk_time_interval => interval '10 days');

To compress data we use a Kubernetes cronjob that runs the queries below once a week:

SELECT compress_chunk(CONCAT(chunk_schema, '.', chunk_name)::regclass, if_not_compressed => true) 
FROM timescaledb_information.chunks 
WHERE 
  hypertable_name IN ('signal','aggregated') 
  AND range_end < NOW()-INTERVAL '91 days' ;
VACUUM ANALYZE;
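
For reference, roughly the same schedule could presumably be expressed with TimescaleDB's built-in policy instead of a cronjob (a sketch, assuming compression is already enabled on both hypertables; this is not what we actually run):

SELECT add_compression_policy('signal', compress_after => INTERVAL '91 days');
SELECT add_compression_policy('aggregated', compress_after => INTERVAL '91 days');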

@mkindahl (Contributor) commented:

This seems like it could be related to #6717, which would make COPY the problem.

A workaround for this would be to use INSERT instead of COPY or decompress the chunks before running COPY.
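
For the second workaround, something along these lines should decompress the chunks overlapping the window being loaded before running the COPY (a sketch; the dates are only an example range):

SELECT decompress_chunk(format('%I.%I', chunk_schema, chunk_name)::regclass, if_compressed => true)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'signal'
  AND is_compressed
  AND range_start < '2023-11-02' AND range_end > '2023-10-01';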

The fix above is not released yet, but it is very likely to be in the next release.

Could you show me the part of the code that involves the DELETE and the COPY? We have a reproduction using COPY already, but it would be good if we could rule out that the DELETE is the problem.

@hostops (Author) commented Mar 13, 2024

There is too much code to share all of it, so I extracted only the relevant SQL.

For each data upload, we run these queries:

DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
COPY signal (id, time, value) FROM STDIN BINARY;
INSERT INTO signal VALUES ($id, $time, NULL) ON CONFLICT DO NOTHING;

@mkindahl self-assigned this Mar 14, 2024
@mkindahl (Contributor) commented:

@hostops I tried to reproduce it in different ways but have not yet succeeded. Can you get hold of a stack trace for the crash, or show more of the log around what caused it?

The log extract you showed above just shows that the delete statement cannot be executed because the server is in recovery mode, not what statement caused the crash.

Here is an example of something I did:

postgres=# \dx
List of installed extensions
-[ RECORD 1 ]--------------------------------------------------------------------------------------
Name        | plpgsql
Version     | 1.0
Schema      | pg_catalog
Description | PL/pgSQL procedural language
-[ RECORD 2 ]--------------------------------------------------------------------------------------
Name        | timescaledb
Version     | 2.14.2
Schema      | public
Description | Enables scalable inserts and complex queries for time-series data (Community Edition)

postgres=# \d signal
                         Table "public.signal"
 Column |            Type             | Collation | Nullable | Default 
--------+-----------------------------+-----------+----------+---------
 id     | bigint                      |           | not null | 
 time   | timestamp without time zone |           | not null | 
 value  | double precision            |           |          | 
Indexes:
    "signal_time_idx" btree ("time" DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON signal FOR EACH ROW EXECUTE FUNCTION _timescaledb_functions.insert_blocker()
Number of child tables: 1 (Use \d+ to list them.)
postgres=# \copy _timescaledb_internal._hyper_1_1_chunk (id, time, value) to 'magic.dat' binary
COPY 10
postgres=# select * from signal;
 id |           time            | value 
----+---------------------------+-------
 12 | 2024-01-20 15:02:03.87313 |  3.14
 13 | 2024-01-20 15:02:03.87313 |  2.14
 12 | 2024-01-20 15:02:03.87313 |  3.14
 13 | 2024-01-20 15:02:03.87313 |  2.14
 12 | 2024-01-20 15:02:03.87313 |  3.14
 13 | 2024-01-20 15:02:03.87313 |  2.14
 12 | 2024-01-20 15:02:03.87313 |  3.14
 13 | 2024-01-20 15:02:03.87313 |  2.14
 12 | 2024-01-20 15:02:03.87313 |  3.14
 13 | 2024-01-20 15:02:03.87313 |  2.14
(10 rows)

postgres=# DELETE FROM signal WHERE id = 12 AND time >= '2024-01-20 00:00:00Z' AND time < '2024-01-21 00:00:00Z';
DELETE 5
postgres=# \copy signal(id,time,value) from 'magic.dat' binary
COPY 10

@mkindahl removed their assignment Mar 14, 2024
@hostops (Author) commented Mar 14, 2024

Hello! I did not realize I had missed the important part of the logs:

2024-03-08 11:28:52	LOG:  server process (PID 395739) was terminated by signal 11: Segmentation fault
2024-03-08 11:28:52	DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
2024-03-08 11:28:52	LOG:  terminating any other active server processes
2024-03-08 11:28:52	FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-03-08 11:28:52	This probably means the server terminated abnormally
2024-03-08 11:28:52	before or while processing the request.
2024-03-08 11:28:52	LOG:  record with incorrect prev-link 0/DC at 8FDD/416319B0
2024-03-08 11:28:59 2024-03-08 10:28:59,394 INFO: closed patroni heartbeat connection to postgres
2024-03-08 10:29:00 INFO: establishing a new patroni heartbeat connection to postgres
2024-03-08 11:29:12 LOG:  server process (PID 397119) was terminated by signal 11: Segmentation fault
2024-03-08 11:29:12 DETAIL:  Failed process was running: DELETE FROM signal WHERE id = $1 AND time >= $2 AND time < $3;
2024-03-08 11:29:12 LOG:  terminating any other active server processes
2024-03-08 11:29:12 FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
2024-03-08 11:29:12 This probably means the server terminated abnormally
2024-03-08 11:29:12 before or while processing the request.
2024-03-08 11:29:12 LOG:  unexpected pageaddr 8FDD/2A64C000 in log segment 0000004200008FDD00000041, offset 6602752
2024-03-08 11:29:12 [unknown] FATAL:  the database system is in recovery mode

@hostops (Author) commented Mar 14, 2024

I also tried the same test in our development environment as you @mkindahl, but I can only get it to fail in production.

Here is the relevant code:

async fn do_save(db: &PostgresService, signal: &Signal) -> Result<(), DatabaseError> {
    let mut client = log_slow(
        db.get_conn(&crate::prometheus::CONN_SIGNAL_SAVE),
        "signal_repo, pool.get()",
        5,
    )
    .await?;
    let transaction = log_slow(
        client
            .build_transaction()
            .isolation_level(Serializable)
            .start(),
        "signal_repo, transaction.start()",
        5,
    )
    .await?;

    log_slow(
        clear_period(&transaction, signal),
        "signal_repo, clear_period",
        5,
    )
    .await?;

    insert(&transaction, signal.id.into(), &signal.points)
        .timed(&crate::prometheus::TIME_SIG_INSERT)
        .await?;

    log_slow(transaction.commit(), "signal_repo, transaction.commit()", 5).await?;
    Ok(())
}

async fn insert(
    transaction: &Transaction<'_>,
    id: i64,
    points: &[DataPoint],
) -> Result<(), DatabaseError> {
    // open COPY FROM STDIN sink
    let sink = transaction
        .copy_in("COPY signal (id, time, value) FROM STDIN BINARY")
        .await?;

    let writer = BinaryCopyInWriter::new(sink, &[Type::INT8, Type::TIMESTAMP, Type::FLOAT8]);
    pin_mut!(writer);

    // write all points except last
    for p in &points[0..points.len() - 1] {
        if let Some(value) = p.value {
            writer
                .as_mut()
                .write(&[&id, &wrap_time(p.timestamp), &value])
                .await?;
        } else {
            writer
                .as_mut()
                .write(&[&id, &wrap_time(p.timestamp), &None::<&f64>])
                .await?;
        }
    }

    writer.finish().await?;

    // insert last point with INSERT to utilise ON CONFLICT
    // because we clear before inserting, conflict may arise only for last data point
    // since last data point is always null, on conflict we want to preserve current value
    let last_timestamp = (points.last().unwrap().timestamp as f64) / 1000.0;
    let last_insert = format!(
        "INSERT INTO signal VALUES ({:}, to_timestamp({:0.3}), NULL) ON CONFLICT DO NOTHING;",
        id, last_timestamp
    );
    transaction.simple_query(&last_insert[..]).await?;

    Ok(())
}

@mkindahl (Contributor) commented:

@hostops Definitely the DELETE failing then, so #6717 will not solve this issue. If you have a stack trace, that would help a lot, since it is hard to figure out where in the code it is failing.

@KamalAman commented Mar 15, 2024

This might be related to #6028 and/or #6031. I just updated to 2.14 and it cleared up a similar issue.

@hostops (Author) commented Mar 15, 2024

I just tested a simple DELETE query on a compressed chunk and the bug was reproduced.

DELETE FROM signal WHERE id=738744145666 AND "time" BETWEEN '2023-10-01' AND '2023-11-02';
LOG:  server process (PID 45855) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: DELETE FROM signal WHERE id=738744145666 AND "time" BETWEEN '2023-10-01' AND '2023-11-02';

@mkindahl (Contributor) commented:

I just tested a simple DELETE query on a compressed chunk and the bug was reproduced.

DELETE FROM signal WHERE id=738744145666 AND "time" BETWEEN '2023-10-01' AND '2023-11-02';
LOG:  server process (PID 45855) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: DELETE FROM signal WHERE id=738744145666 AND "time" BETWEEN '2023-10-01' AND '2023-11-02';

Do you have a stack trace for this?

@hostops (Author) commented Mar 18, 2024

If you mean a stack trace on the client: I just executed this command using psql.

If log_min_messages=DEBUG5 and log_min_error_statement=DEBUG5 would be enough on the server, I can get that. But if you mean I should get a stack trace using gdb, it would take me more time.

@mkindahl (Contributor) commented:

If you mean a stack trace on the client: I just executed this command using psql.

If log_min_messages=DEBUG5 and log_min_error_statement=DEBUG5 would be enough on the server, I can get that. But if you mean I should get a stack trace using gdb, it would take me more time.

I meant a stack trace using gdb. It would help us to verify if this is a bug we've seen previously and possibly fix it if it is not. I'll try again to see if I can reproduce it.

@hostops (Author) commented Mar 18, 2024

@mkindahl I believe this is the stack trace you are looking for.

Already logging to /var/lib/postgresql/data/debuglog.txt.
Attaching to process 172
Reading symbols from /usr/lib/postgresql/15/bin/postgres...
Reading symbols from /usr/lib/debug/.build-id/a1/7fec33e3a8cf89bd57230032466c94be408f6f.debug...
Reading symbols from /lib/x86_64-linux-gnu/libzstd.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libzstd.so.1)
Reading symbols from /lib/x86_64-linux-gnu/liblz4.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/liblz4.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libxml2.so.2...
(No debugging symbols found in /lib/x86_64-linux-gnu/libxml2.so.2)
Reading symbols from /lib/x86_64-linux-gnu/libpam.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libpam.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libssl.so.3...
(No debugging symbols found in /lib/x86_64-linux-gnu/libssl.so.3)
Reading symbols from /lib/x86_64-linux-gnu/libcrypto.so.3...
(No debugging symbols found in /lib/x86_64-linux-gnu/libcrypto.so.3)
Reading symbols from /lib/x86_64-linux-gnu/libgssapi_krb5.so.2...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgssapi_krb5.so.2)
Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libz.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libm.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libldap-2.5.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libldap-2.5.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libicui18n.so.70...
(No debugging symbols found in /lib/x86_64-linux-gnu/libicui18n.so.70)
Reading symbols from /lib/x86_64-linux-gnu/libicuuc.so.70...
(No debugging symbols found in /lib/x86_64-linux-gnu/libicuuc.so.70)
Reading symbols from /lib/x86_64-linux-gnu/libsystemd.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libsystemd.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libc.so.6)
Reading symbols from /lib/x86_64-linux-gnu/liblzma.so.5...
(No debugging symbols found in /lib/x86_64-linux-gnu/liblzma.so.5)
Reading symbols from /lib/x86_64-linux-gnu/libaudit.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libaudit.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libkrb5.so.3...
(No debugging symbols found in /lib/x86_64-linux-gnu/libkrb5.so.3)
Reading symbols from /lib/x86_64-linux-gnu/libk5crypto.so.3...
(No debugging symbols found in /lib/x86_64-linux-gnu/libk5crypto.so.3)
Reading symbols from /lib/x86_64-linux-gnu/libcom_err.so.2...
(No debugging symbols found in /lib/x86_64-linux-gnu/libcom_err.so.2)
Reading symbols from /lib/x86_64-linux-gnu/libkrb5support.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libkrb5support.so.0)
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
Reading symbols from /lib/x86_64-linux-gnu/liblber-2.5.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/liblber-2.5.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libsasl2.so.2...
(No debugging symbols found in /lib/x86_64-linux-gnu/libsasl2.so.2)
Reading symbols from /lib/x86_64-linux-gnu/libgnutls.so.30...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgnutls.so.30)
Reading symbols from /lib/x86_64-linux-gnu/libstdc++.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libstdc++.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgcc_s.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libicudata.so.70...
(No debugging symbols found in /lib/x86_64-linux-gnu/libicudata.so.70)
Reading symbols from /lib/x86_64-linux-gnu/libcap.so.2...
(No debugging symbols found in /lib/x86_64-linux-gnu/libcap.so.2)
Reading symbols from /lib/x86_64-linux-gnu/libgcrypt.so.20...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgcrypt.so.20)
Reading symbols from /lib/x86_64-linux-gnu/libcap-ng.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libcap-ng.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libkeyutils.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libkeyutils.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libresolv.so.2...
(No debugging symbols found in /lib/x86_64-linux-gnu/libresolv.so.2)
Reading symbols from /lib/x86_64-linux-gnu/libp11-kit.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libp11-kit.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libidn2.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libidn2.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libunistring.so.2...
(No debugging symbols found in /lib/x86_64-linux-gnu/libunistring.so.2)
Reading symbols from /lib/x86_64-linux-gnu/libtasn1.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libtasn1.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libnettle.so.8...
(No debugging symbols found in /lib/x86_64-linux-gnu/libnettle.so.8)
Reading symbols from /lib/x86_64-linux-gnu/libhogweed.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libhogweed.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libgmp.so.10...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgmp.so.10)
Reading symbols from /lib/x86_64-linux-gnu/libgpg-error.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgpg-error.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libffi.so.8...
(No debugging symbols found in /lib/x86_64-linux-gnu/libffi.so.8)
Reading symbols from /usr/lib/postgresql/15/lib/timescaledb.so...
Reading symbols from /usr/lib/postgresql/15/lib/pg_stat_statements.so...
Reading symbols from /usr/lib/debug/.build-id/6c/2a36e82a982dde945f9b2fd12f24bbcd30790d.debug...
Reading symbols from /usr/lib/postgresql/15/lib/timescaledb-2.14.0.so...
Reading symbols from /usr/lib/postgresql/15/lib/timescaledb-tsl-2.14.0.so...
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ff247c5edea in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
Ambiguous command "con": condition, continue.
Continuing.

Program received signal SIGINT, Interrupt.
0x00007ff247c5edea in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
Breakpoint 1 at 0x56283cd948b0: file utils/error/./build/../src/backend/utils/error/elog.c, line 509.
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00007ff1edc691b4 in fill_predicate_context (is_null=<synthetic pointer>, index_filters=<synthetic pointer>, heap_filters=<synthetic pointer>, predicates=0x0, settings=<optimized out>, ch=<optimized out>) at /build/timescaledb/tsl/src/compression/compression.c:2553
2553	/build/timescaledb/tsl/src/compression/compression.c: No such file or directory.
#0  0x00007ff1edc691b4 in fill_predicate_context (is_null=<synthetic pointer>, index_filters=<synthetic pointer>, heap_filters=<synthetic pointer>, predicates=0x0, settings=<optimized out>, ch=<optimized out>) at /build/timescaledb/tsl/src/compression/compression.c:2553
#1  decompress_batches_for_update_delete.constprop.0 (ht_state=<optimized out>, chunk=chunk@entry=0x56283eb99608, predicates=predicates@entry=0x56283eb990d8, estate=<optimized out>) at /build/timescaledb/tsl/src/compression/compression.c:3065
#2  0x00007ff1edc4133c in decompress_chunk_walker (ps=0x56283eb8a198, ctx=0x7fff3579ff90) at /build/timescaledb/tsl/src/compression/compression.c:3227
#3  0x000056283cb011cc in planstate_walk_members (context=0x7fff3579ff90, walker=0x7ff1edc411f0 <decompress_chunk_walker>, nplans=<optimized out>, planstates=<optimized out>) at nodes/./build/../src/backend/nodes/nodeFuncs.c:4202
#4  planstate_tree_walker (planstate=0x56283eb8d6c8, walker=0x7ff1edc411f0 <decompress_chunk_walker>, context=0x7fff3579ff90) at nodes/./build/../src/backend/nodes/nodeFuncs.c:4124
#5  0x000056283cb01159 in planstate_tree_walker (planstate=0x56283eb89b78, walker=0x7ff1edc411f0 <decompress_chunk_walker>, context=0x7fff3579ff90) at nodes/./build/../src/backend/nodes/nodeFuncs.c:4109
#6  0x00007ff1edc413c6 in decompress_target_segments (ht_state=<optimized out>) at /build/timescaledb/tsl/src/compression/compression.c:3161
#7  0x00007ff1edcecd4d in ExecModifyTable (pstate=0x56283eb89b78, cs_node=0x56283eb896b8) at /build/timescaledb/src/nodes/hypertable_modify.c:752
#8  hypertable_modify_exec (node=0x56283eb896b8) at /build/timescaledb/src/nodes/hypertable_modify.c:181
#9  0x000056283ca8946d in ExecProcNode (node=0x56283eb896b8) at executor/./build/../src/include/executor/executor.h:262
#10 ExecutePlan (execute_once=<optimized out>, dest=0x7ff1ed097880, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_DELETE, use_parallel_mode=<optimized out>, planstate=0x56283eb896b8, estate=0x56283eb88468) at executor/./build/../src/backend/executor/execMain.c:1636
#11 standard_ExecutorRun (queryDesc=0x56283eb872d8, direction=<optimized out>, count=0, execute_once=<optimized out>) at executor/./build/../src/backend/executor/execMain.c:363
#12 0x00007ff245013f75 in pgss_ExecutorRun (queryDesc=0x56283eb872d8, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1010
#13 0x000056283cc32523 in ExecutorRun (execute_once=true, count=0, direction=ForwardScanDirection, queryDesc=0x56283eb872d8) at executor/./build/../src/backend/executor/execMain.c:305
#14 ProcessQuery (plan=plan@entry=0x7ff1ed097910, sourceText=<optimized out>, params=<optimized out>, queryEnv=<optimized out>, dest=dest@entry=0x7ff1ed097880, qc=qc@entry=0x7fff357a04d0) at tcop/./build/../src/backend/tcop/pquery.c:160
#15 0x000056283cc34142 in PortalRunMulti (portal=portal@entry=0x56283e8910b8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x7ff1ed097880, altdest=altdest@entry=0x7ff1ed097880, qc=qc@entry=0x7fff357a04d0) at tcop/./build/../src/backend/tcop/pquery.c:1277
#16 0x000056283cc346ac in PortalRun (portal=0x56283e8910b8, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x7ff1ed097880, altdest=0x7ff1ed097880, qc=0x7fff357a04d0) at tcop/./build/../src/backend/tcop/pquery.c:791
#17 0x000056283cc35403 in exec_simple_query (query_string=0x56283e8184a8 "DELETE FROM signal WHERE id=738744145666 AND \"time\" BETWEEN '2023-10-01' AND '2023-11-02';") at tcop/./build/../src/backend/tcop/postgres.c:1250
#18 0x000056283cc36fb5 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at tcop/./build/../src/backend/tcop/postgres.c:4598
#19 0x000056283cbaa185 in BackendRun (port=0x56283e841b30, port=0x56283e841b30) at postmaster/./build/../src/backend/postmaster/postmaster.c:4514
#20 BackendStartup (port=0x56283e841b30) at postmaster/./build/../src/backend/postmaster/postmaster.c:4242
#21 ServerLoop () at postmaster/./build/../src/backend/postmaster/postmaster.c:1809
#22 0x000056283cb9f196 in PostmasterMain (argc=17, argv=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:1481
#23 0x000056283c89ff3b in main (argc=17, argv=0x56283e7832d0) at main/./build/../src/backend/main/main.c:202
Continuing.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
No stack.

@hostops (Author) commented Mar 18, 2024

To save you some time, the crashing line (compression.c:2553 in fill_predicate_context, per the trace above) is:

leftop = linitial(opexpr->args);

@mkindahl (Contributor) commented:

@hostops Yes, that is what I was looking for. Thanks!

@mkindahl added and removed the segfault label Apr 3, 2024
@svenklemm (Member) commented:

Could you check what opexpr->args is in the core dump, and maybe all the other fields of opexpr too, to see what might be wrong?
This should be fixed on main, as we now abort if there are not 2 args.

@hostops (Author) commented May 8, 2024

Hello!
I am on paternity leave until the end of the month.

I may also have had a version mismatch with the Docker image used: I may have forgotten to run the extension update command when going from 2.14.0 to 2.14.2.

I am not sure, and I have had no time to test and confirm whether the issue still persists. Since it seems likely I was not on the latest version, I would like to ask whether this issue can wait until I can perform additional tests. If the issue turns out not to exist, I would really rather you not waste any more time on it.

So please wait until I have time to test whether this issue is still there on the latest version.
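
For reference, the gdb session above did load timescaledb-2.14.0.so, so a mismatch is plausible. The installed catalog version can be checked and updated with something like this (a sketch; ALTER EXTENSION should be the first command in a fresh session):

SELECT default_version, installed_version FROM pg_available_extensions WHERE name = 'timescaledb';
ALTER EXTENSION timescaledb UPDATE;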

@svenklemm (Member) commented:

Hello, thanks for the update. I'll label the issue as waiting for your input.
