Skip to content

Latest commit

 

History

History
218 lines (160 loc) · 9.65 KB

timescaledb.md

File metadata and controls

218 lines (160 loc) · 9.65 KB

TSBS Supplemental Guide: TimescaleDB

TimescaleDB is a database built on top of PostgreSQL, utilizing a SQL interface and RDBMS characteristics. This supplemental guide explains how the data generated for TSBS is stored, additional flags available when using the data importer (tsbs_load_timescaledb), and additional flags available for the query runner (tsbs_run_queries_timescaledb). This should be read after the main README.

Data format

Data generated by tsbs_generate_data for TimescaleDB is serialized in a "pseudo-CSV" format, along with a custom header at the beginning. The header is several lines long:

  • one line composed of a comma-separated list of tag labels, with the literal string tags as the first value in the list
  • one or more lines composed of a comma-separated list of field labels, with the hypertable name as the first value in the list
  • a blank line

An example for the cpu-only use case:

tags,hostname,region,datacenter,rack,os,arch,team,service,service_version,service_environment
cpu,usage_user,usage_system,usage_idle,usage_nice,usage_iowait,usage_irq,usage_softirq,usage_steal,usage_guest,usage_guest_nice

Following this, each reading is composed of two rows:

  1. a comma-separated list of tag values for the reading, with the literal string tags as the first value in the list
  2. a comma-separated list of field values for the reading, with the hypertable the reading belongs to being the first value and the timestamp as the second value

An example for the cpu-only use case:

tags,host_0,eu-central-1,eu-central-1b,21,Ubuntu15.10,x86,SF,6,0,test
cpu,1451606400000000000,58.1317132304976170,2.6224297271376256,24.9969495069947882,61.5854484633778867,22.9481393231639395,63.6499207106198313,6.4098777048301052,44.8799140503027445,80.5028770761136201,38.2431182911542820

tsbs_load_timescaledb Additional Flags

Examples of TimescaleDB YAML configuration for loading data can be found in the sample-configs directory. The main difference in benchmarking TimescaleDB regards whether you are attempting to benchmark a regular, single-node hypertable or a distributed multi-node hypertable. There are sample configuration files for both.

The settings that determines whether TSBS will attempt to create a distributed hypertable or a regular hypertable are below. All other settings impact the overall test, regardless of which type of hypertable setup you are testing.

Setting Hypertable Distributed Hypertable
in-table-partition-tag false true
replication-factor omit/0 1

PostgreSQL related

-host (type: string, default: localhost)

Hostname of the PostgreSQL server.

-postgres (type: string, default: sslmode=disable)

Specifies any connection parameters to pass along as the client connects to the PostgreSQL server. Values for dbname, host, and user in the connection string will be overridden with the values from the flags -db-name, -host, and -user, respectively. See the PostgreSQL documentation for more details.

-use-hypertable (type: boolean, default: true)

Whether to actually use TimescaleDB's hypertable for storing data. Set to false to measure the insert/write performance of plain PostgreSQL.

-user (type: string, default: postgres)

User to use to connect to the PostgreSQL server.

--admin-db-name (type: string, default: value of -user)

Database to connect to in order to create additional benchmark databases. By default this is the same as the user (i.e., postgres if neither is set), but sometimes a user does not have its own database.

Tags related

-in-table-partition-tag (type: boolean, default: false)

Whether to store the primary key tag as a column in the hypertable. By default the hypertable just stores the primary key (an int) of the tag set associated with the reading in the table. However this option can be more costly for some queries because of the need to do a JOIN. This option will store the primary tag (first tag in the list of tags in the data format) in the hypertable so that if most queries are using that as a filter, query performance can be improved.

Please note, -in-table-partition-tag is required for multi-node TimescaleDB benchmark testing. Currently, the multi-node query engine does not support push-down of JOINs and so the most efficient queries will filter on the primary tag name.

-use-jsonb-tags (type: boolean, default: false)

Whether to store the tags as a JSONB element in the tags table. By default tags are stored in separate columns in a metadata table named tags, where each unique set of tags is stored in a separate row. Using this option will store the tags a JSONB element in tags instead. Write performance does not seem to be dramatically affected by this option, but query performance is typically better with non-JSONB tags so this defaults to false.

Hypertable related

-chunk-time (type: duration, default 12h)

Size of each time partition in terms of time. It is expressed as a Golang time.Duration string, meaning a number followed by a unit abbreviation (s = seconds, m = minutes, h = hours), e.g., the default 12h is 12 hours. This should be adjusted based on the dataset size.

-partitions (type: int, default: 0)

Number of space partitions for the primary tag. Setting this to 0 will disable space partitioning.

Multi-node Benchmarking: When testing a multi-node installation, this value should be set to zero or omitted (which will default to zero.) In doing so, the distributed hypertable will be created across all connected data nodes, which is the preferred setup.

-replication-factor (type: int, default: 0)

This setting is specifically required to create a distributed hypertable. This is only supported on multi-node installations. Setting it to a value of 1 is recommended for most multi-node testing.

If this value is set >=1 on a single-node TimescaleDB instance, tsbs_load will error.

Index related

-field-index (type: string, default: VALUE-TIME)

The format for (any) field indexes, which are additional secondary indexes on fields in a hypertable. These are used for more performant threshold queries when the threshold is on a field rather than time, e.g., cpu.usage_user > 90. The two valid options are:

  • VALUE-TIME which creates a compound index on (<field>, time DESC)
  • TIME-VALUE which creates a compound index on (time DESC, <field>)

(<field> is replaced with the actual field name)

-field-index-count (type: int, default: 0)

Number of secondary indexes to create on measurement fields, with -1 signifying to create indexes on all fields. While secondary indexes can increase query performance, they will also increase disk usage and reduce write performance.

-partition-index (type: boolean, default: true)

Whether to create a compound index on the primary tag and time dimension (i.e., an index on (tags_id, time DESC)). Removing this index is likely to significantly reduce query performance.

-time-index (type: boolean, default: true)

Whether to create an index on the time dimension. For datasets with smaller number of devices (i.e., <100k), this is usually recommended. For a larger number of devices, -time-partition-index is recommended instead.

-time-partition-index (type: boolean, default: false)

Whether to create a compound index on the time dimension and the primary tag (i.e., an index on (time DESC, tags_id)). For datasets with a larger number of devices (i.e., >100k), this is usually recommended because it increases write performance by reducing lock contention on nodes in the B-tree since they are additionally partitioned by tags_id.

Miscellaneous

-hash-workers (type: boolean, default: false)

Whether to consistently hash data across the multiple insert workers by the value of the primary (first) tag. For datasets with larger numbers of devices, this option helps improve data locality on disk which can lead to better query performance. For datasets with smaller numbers of devices, it is typically not necessary.

-write-profile (type: string, default: none)

File to output periodic CPU and memory statistics. Useful for understanding system performance while writing data to the database.

-write-replication-stats (type: string, default: none)

File to output replication statistics. Useful for understanding how long it takes for data to be written in a replicated setup.


tsbs_run_queries_timescaledb Additional Flags

PostgreSQL related

-hosts (type: string, default: localhost)

Comma separated list of hostnames for the PostgreSQL servers. Each server should contain a full copy/replica of the dataset. Workers are connected to a server in a round-robin fashion.

-postgres (type: string, default: sslmode=disable)

Specifies any connection parameters to pass along as the client connects to the PostgreSQL server. Values for dbname, host, and user in the connection string will be overridden with the values from the flags -db-name, -hosts, and -user, respectively. See the PostgreSQL documentation for more details.

-show-explain (type: boolean, default: false)

Whether to print out a sample EXPLAIN ANALYZE output for the first query in the set of queries. This will be the only query run and is useful for understanding the query plan that is being generated for a particular query type.

-user (type: string, default: postgres)

User to use to connect to the PostgreSQL server(s).