add Cloud storage metrics reference #349

Open
wants to merge 2 commits into
base: main

Conversation

Deflaimun
Collaborator

@Deflaimun Deflaimun commented Mar 13, 2024

@Deflaimun Deflaimun requested a review from a team as a code owner March 13, 2024 20:39

netlify bot commented Mar 13, 2024

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 635fec1
🔍 Latest deploy log https://app.netlify.com/sites/redpanda-docs-preview/deploys/65f212ad7b22a600080b3347
😎 Deploy Preview https://deploy-preview-349--redpanda-docs-preview.netlify.app

@Deflaimun
Collaborator Author

Deflaimun commented Mar 13, 2024

Where are the cloud metrics available to the customer? @vuldin

I ran some tests myself, and when running Redpanda through Docker, I couldn't see the metrics that we listed at http://localhost:9644/public_metrics.
However, I did find them at http://localhost:9644/metrics, but on that endpoint they all have the 'vectorized_' prefix.

So, do I need to be on a cloud environment to be able to see them with /public_metrics? Why do we have them listed under /metrics for self-hosted?

@Feediver1
Collaborator

@Deflaimun @micheleRP This will be moved out of this context after we split RP Cloud out into its own repo, right?

@Feediver1
Collaborator

@Deflaimun Who is the engineering SME? Should also be added as a reviewer.

@vuldin
Member

vuldin commented Mar 21, 2024

> Where are the cloud metrics available to the customer? @vuldin
>
> I ran some tests myself, and when running Redpanda through Docker, I couldn't see the metrics that we listed at http://localhost:9644/public_metrics. However, I did find them at http://localhost:9644/metrics, but on that endpoint they all have the 'vectorized_' prefix.
>
> So, do I need to be on a cloud environment to be able to see them with /public_metrics? Why do we have them listed under /metrics for self-hosted?

You should definitely have access to the /public_metrics endpoint in all recent Redpanda deployments (self-hosted, Docker, cloud, etc.). I just deployed via Helm and ran the following command to print these metrics:

kubectl exec -it -n redpanda redpanda-0 -c redpanda -- curl http://localhost:9644/public_metrics

This is also how I generated the table of metrics and their descriptions here: just filter the output by "# HELP".
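
The "# HELP" filtering described above can be sketched as a small shell helper. This is only an illustration: the function name `metric_help` is hypothetical, and the usage line assumes the Admin API is reachable on localhost:9644 as in the commands in this thread.

```shell
# Read Prometheus-format text on stdin and print one line per metric:
# "<metric_name> <description>", taken from the "# HELP" lines only.
# The name metric_help is illustrative, not part of any Redpanda tooling.
metric_help() {
  grep '^# HELP ' | sed 's/^# HELP //'
}

# Usage (assumes the endpoint from the thread is reachable):
#   curl -s http://localhost:9644/public_metrics | metric_help
```

Anchoring the pattern at the start of the line avoids matching sample lines whose label values happen to contain the text "# HELP".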

@Deflaimun
Collaborator Author

When running curl http://localhost:9644/public_metrics and filtering by HELP, I only get these results.

# HELP redpanda_application_build Redpanda build information
# HELP redpanda_application_uptime_seconds_total Redpanda uptime in seconds
# HELP redpanda_cluster_brokers Number of configured brokers in the cluster
# HELP redpanda_cluster_controller_log_limit_requests_available_rps Controller log rate limiting. Available rps for group
# HELP redpanda_cluster_controller_log_limit_requests_dropped Controller log rate limiting. Amount of requests that are dropped due to exceeding limit in group
# HELP redpanda_cluster_members_backend_queued_node_operations Number of queued node operations
# HELP redpanda_cluster_partition_moving_from_node Amount of partitions that are moving from node
# HELP redpanda_cluster_partition_moving_to_node Amount of partitions that are moving to node
# HELP redpanda_cluster_partition_node_cancelling_movements Amount of cancelling partition movements for node
# HELP redpanda_cluster_partition_num_with_broken_rack_constraint Number of partitions that don't satisfy the rack awareness constraint
# HELP redpanda_cluster_partitions Number of partitions in the cluster (replicas not included)
# HELP redpanda_cluster_topics Number of topics in the cluster
# HELP redpanda_cluster_unavailable_partitions Number of partitions that lack quorum among replicants
# HELP redpanda_cpu_busy_seconds_total Total CPU busy time in seconds
# HELP redpanda_io_queue_total_read_ops Total read operations passed in the queue
# HELP redpanda_io_queue_total_write_ops Total write operations passed in the queue
# HELP redpanda_kafka_handler_latency_seconds Latency histogram of kafka requests
# HELP redpanda_kafka_max_offset Latest readable offset of the partition (i.e. high watermark)
# HELP redpanda_kafka_partitions Configured number of partitions for the topic
# HELP redpanda_kafka_records_fetched_total Total number of records fetched
# HELP redpanda_kafka_records_produced_total Total number of records produced
# HELP redpanda_kafka_replicas Configured number of replicas for the topic
# HELP redpanda_kafka_request_bytes_total Total number of bytes produced per topic
# HELP redpanda_kafka_request_latency_seconds Internal latency of kafka produce requests
# HELP redpanda_kafka_rpc_sasl_session_expiration_total Total number of SASL session expirations
# HELP redpanda_kafka_rpc_sasl_session_reauth_attempts_total Total number of SASL reauthentication attempts
# HELP redpanda_kafka_rpc_sasl_session_revoked_total Total number of SASL sessions revoked
# HELP redpanda_kafka_under_replicated_replicas Number of under replicated replicas (i.e. replicas that are live, but not at the latest offest)
# HELP redpanda_memory_allocated_memory Allocated memory size in bytes
# HELP redpanda_memory_available_memory Total shard memory potentially available in bytes (free_memory plus reclaimable)
# HELP redpanda_memory_available_memory_low_water_mark The low-water mark for available_memory from process start
# HELP redpanda_memory_free_memory Free memory size in bytes
# HELP redpanda_node_status_rpcs_received Number of node status RPCs received by this node
# HELP redpanda_node_status_rpcs_sent Number of node status RPCs sent by this node
# HELP redpanda_node_status_rpcs_timed_out Number of timed out node status RPCs from this node
# HELP redpanda_raft_leadership_changes Number of leadership changes across all partitions of a given topic
# HELP redpanda_raft_recovery_offsets_pending Sum of offsets that partitions on this node need to recover.
# HELP redpanda_raft_recovery_partition_movement_available_bandwidth Bandwidth available for partition movement. bytes/sec
# HELP redpanda_raft_recovery_partition_movement_consumed_bandwidth Bandwidth consumed for partition movement. bytes/sec
# HELP redpanda_raft_recovery_partitions_active Number of partition replicas are currently recovering on this node.
# HELP redpanda_raft_recovery_partitions_to_recover Number of partition replicas that have to recover for this node.
# HELP redpanda_rest_proxy_request_errors_total Total number of rest_proxy server errors
# HELP redpanda_rest_proxy_request_latency_seconds Internal latency of request for rest_proxy
# HELP redpanda_rpc_active_connections Count of currently active connections
# HELP redpanda_rpc_received_bytes internal: Number of bytes received from the clients in valid requests
# HELP redpanda_rpc_request_errors_total Number of rpc errors
# HELP redpanda_rpc_request_latency_seconds RPC latency
# HELP redpanda_rpc_sent_bytes internal: Number of bytes sent to clients
# HELP redpanda_scheduler_runtime_seconds_total Accumulated runtime of task queue associated with this scheduling group
# HELP redpanda_schema_registry_request_errors_total Total number of schema_registry server errors
# HELP redpanda_schema_registry_request_latency_seconds Internal latency of request for schema_registry
# HELP redpanda_security_audit_errors_total Running count of errors in creating/publishing audit event log entries
# HELP redpanda_security_audit_last_event_timestamp_seconds Timestamp of last successful publish on the audit log (seconds since epoch)
# HELP redpanda_storage_disk_free_bytes Disk storage bytes free.
# HELP redpanda_storage_disk_free_space_alert Status of low storage space alert. 0-OK, 1-Low Space 2-Degraded
# HELP redpanda_storage_disk_total_bytes Total size of attached storage, in bytes.

@Deflaimun
Collaborator Author

Results for running curl http://localhost:9644/metrics and filtering by HELP:

# HELP vectorized_alien_receive_batch_queue_length Current receive batch queue length
# HELP vectorized_alien_total_received_messages Total number of received messages
# HELP vectorized_alien_total_sent_messages Total number of sent messages
# HELP vectorized_application_build Redpanda build information
# HELP vectorized_application_uptime Redpanda uptime in milliseconds
# HELP vectorized_cloud_storage_partition_chunk_size Size of chunk downloaded from cloud storage
# HELP vectorized_cloud_storage_partition_read_bytes Total bytes read from remote partition
# HELP vectorized_cloud_storage_partition_read_records Total number of records read from remote partition
# HELP vectorized_cluster_controller_pending_partition_operations Number of partitions with ongoing/requested operations
# HELP vectorized_cluster_partition_bytes_fetched_from_follower_total Total number of bytes fetched from follower (not all might be returned to the client)
# HELP vectorized_cluster_partition_bytes_fetched_total Total number of bytes fetched (not all might be returned to the client)
# HELP vectorized_cluster_partition_bytes_produced_total Total number of bytes produced
# HELP vectorized_cluster_partition_cloud_storage_segments_metadata_bytes Current number of bytes consumed by remote segments managed for this partition
# HELP vectorized_cluster_partition_committed_offset Partition commited offset. i.e. safely persisted on majority of replicas
# HELP vectorized_cluster_partition_end_offset Last offset stored by current partition on this node
# HELP vectorized_cluster_partition_high_watermark Partion high watermark i.e. highest consumable offset
# HELP vectorized_cluster_partition_last_stable_offset Last stable offset
# HELP vectorized_cluster_partition_leader Flag indicating if this partition instance is a leader
# HELP vectorized_cluster_partition_leader_id Id of current partition leader
# HELP vectorized_cluster_partition_moving_from_node Amount of partitions that are moving from node
# HELP vectorized_cluster_partition_moving_to_node Amount of partitions that are moving to node
# HELP vectorized_cluster_partition_node_cancelling_movements Amount of cancelling partition movements for node
# HELP vectorized_cluster_partition_num_with_broken_rack_constraint Number of partitions that don't satisfy the rack awareness constraint
# HELP vectorized_cluster_partition_records_fetched Total number of records fetched
# HELP vectorized_cluster_partition_records_produced Total number of records produced
# HELP vectorized_cluster_partition_start_offset start offset
# HELP vectorized_cluster_partition_under_replicated_replicas Number of under replicated replicas
# HELP vectorized_cluster_producer_state_manager_evicted_producers Number of evicted producers so far.
# HELP vectorized_cluster_producer_state_manager_producer_manager_total_active_producers Total number of active idempotent and transactional producers.
# HELP vectorized_fetch_stats_plan_and_execute_latency_us Latency of fetch planning and excution
# HELP vectorized_httpd_connections_current The current number of open  connections
# HELP vectorized_httpd_connections_total The total number of connections opened
# HELP vectorized_httpd_read_errors The total number of errors while reading http requests
# HELP vectorized_httpd_reply_errors The total number of errors while replying to http
# HELP vectorized_httpd_requests_served The total number of http requests served
# HELP vectorized_internal_rpc_active_connections internal_rpc: Currently active connections
# HELP vectorized_internal_rpc_connection_close_errors internal_rpc: Number of errors when shutting down the connection
# HELP vectorized_internal_rpc_connections_rejected internal_rpc: Number of connections rejected for hitting connection limits
# HELP vectorized_internal_rpc_connections_wait_rate internal_rpc: Number of connections are blocked by connection rate
# HELP vectorized_internal_rpc_connects internal_rpc: Number of accepted connections
# HELP vectorized_internal_rpc_consumed_mem_bytes internal_rpc: Memory consumed by request processing
# HELP vectorized_internal_rpc_corrupted_headers internal_rpc: Number of requests with corrupted headers
# HELP vectorized_internal_rpc_dispatch_handler_latency internal_rpc: Latency 
# HELP vectorized_internal_rpc_latency Internal RPC service latency
# HELP vectorized_internal_rpc_max_service_mem_bytes internal_rpc: Maximum memory allowed for RPC
# HELP vectorized_internal_rpc_method_not_found_errors internal_rpc: Number of requests with not available RPC method
# HELP vectorized_internal_rpc_produce_bad_create_time number of produce requests with timestamps too far in the future or in the past
# HELP vectorized_internal_rpc_received_bytes internal_rpc: Number of bytes received from the clients in valid requests
# HELP vectorized_internal_rpc_requests_blocked_memory internal_rpc: Number of requests blocked in memory backpressure
# HELP vectorized_internal_rpc_requests_completed internal_rpc: Number of successful requests
# HELP vectorized_internal_rpc_requests_pending internal_rpc: Number of requests pending in the queue
# HELP vectorized_internal_rpc_sent_bytes internal_rpc: Number of bytes sent to clients
# HELP vectorized_internal_rpc_service_errors internal_rpc: Number of service errors
# HELP vectorized_io_queue_adjusted_consumption Consumed disk capacity units adjusted for class shares and idling preemption
# HELP vectorized_io_queue_consumption Accumulated disk capacity units consumed by this class; an increment per-second rate indicates full utilization
# HELP vectorized_io_queue_delay random delay time in the queue
# HELP vectorized_io_queue_disk_queue_length Number of requests in the disk
# HELP vectorized_io_queue_queue_length Number of requests in the queue
# HELP vectorized_io_queue_shares current amount of shares
# HELP vectorized_io_queue_starvation_time_sec Total time spent starving for disk
# HELP vectorized_io_queue_total_bytes Total bytes passed in the queue
# HELP vectorized_io_queue_total_delay_sec Total time spent in the queue
# HELP vectorized_io_queue_total_exec_sec Total time spent in disk
# HELP vectorized_io_queue_total_operations Total operations passed in the queue
# HELP vectorized_io_queue_total_read_bytes Total read bytes passed in the queue
# HELP vectorized_io_queue_total_read_ops Total read operations passed in the queue
# HELP vectorized_io_queue_total_split_bytes Total number of bytes split
# HELP vectorized_io_queue_total_split_ops Total number of requests split
# HELP vectorized_io_queue_total_write_bytes Total write bytes passed in the queue
# HELP vectorized_io_queue_total_write_ops Total write operations passed in the queue
# HELP vectorized_kafka_fetch_sessions_cache_mem_usage_bytes Fetch sessions cache memory usage in bytes
# HELP vectorized_kafka_fetch_sessions_cache_sessions_count Total number of fetch sessions
# HELP vectorized_kafka_handler_latency_microseconds Latency histogram of kafka requests
# HELP vectorized_kafka_handler_received_bytes_total Number of bytes received from kafka requests
# HELP vectorized_kafka_handler_requests_completed_total Number of kafka requests completed
# HELP vectorized_kafka_handler_requests_errored_total Number of kafka requests errored
# HELP vectorized_kafka_handler_requests_in_progress_total A running total of kafka requests in progress
# HELP vectorized_kafka_handler_sent_bytes_total Number of bytes sent in kafka replies
# HELP vectorized_kafka_latency_fetch_latency_us Fetch Latency
# HELP vectorized_kafka_latency_produce_latency_us Produce Latency
# HELP vectorized_kafka_quotas_balancer_runs Number of times throughput quota balancer has been run
# HELP vectorized_kafka_quotas_quota_effective Currently effective quota, in bytes/s
# HELP vectorized_kafka_quotas_throttle_time Throttle time histogram (in seconds)
# HELP vectorized_kafka_quotas_traffic_egress Amount of Kafka traffic published to the clients that was taken into processing, in bytes
# HELP vectorized_kafka_quotas_traffic_intake Amount of Kafka traffic received from the clients that is taken into processing, in bytes
# HELP vectorized_kafka_rpc_active_connections kafka_rpc: Currently active connections
# HELP vectorized_kafka_rpc_connection_close_errors kafka_rpc: Number of errors when shutting down the connection
# HELP vectorized_kafka_rpc_connections_rejected kafka_rpc: Number of connections rejected for hitting connection limits
# HELP vectorized_kafka_rpc_connections_wait_rate kafka_rpc: Number of connections are blocked by connection rate
# HELP vectorized_kafka_rpc_connects kafka_rpc: Number of accepted connections
# HELP vectorized_kafka_rpc_consumed_mem_bytes kafka_rpc: Memory consumed by request processing
# HELP vectorized_kafka_rpc_corrupted_headers kafka_rpc: Number of requests with corrupted headers
# HELP vectorized_kafka_rpc_dispatch_handler_latency kafka_rpc: Latency 
# HELP vectorized_kafka_rpc_fetch_avail_mem_bytes kafka_rpc: Memory available for fetch request processing
# HELP vectorized_kafka_rpc_max_service_mem_bytes kafka_rpc: Maximum memory allowed for RPC
# HELP vectorized_kafka_rpc_method_not_found_errors kafka_rpc: Number of requests with not available RPC method
# HELP vectorized_kafka_rpc_produce_bad_create_time number of produce requests with timestamps too far in the future or in the past
# HELP vectorized_kafka_rpc_received_bytes kafka_rpc: Number of bytes received from the clients in valid requests
# HELP vectorized_kafka_rpc_requests_blocked_memory kafka_rpc: Number of requests blocked in memory backpressure
# HELP vectorized_kafka_rpc_requests_completed kafka_rpc: Number of successful requests
# HELP vectorized_kafka_rpc_requests_pending kafka_rpc: Number of requests pending in the queue
# HELP vectorized_kafka_rpc_sasl_session_expiration_total Total number of SASL session expirations
# HELP vectorized_kafka_rpc_sasl_session_reauth_attempts_total Total number of SASL reauthentication attempts
# HELP vectorized_kafka_rpc_sasl_session_revoked_total Total number of SASL sessions revoked
# HELP vectorized_kafka_rpc_sent_bytes kafka_rpc: Number of bytes sent to clients
# HELP vectorized_kafka_rpc_service_errors kafka_rpc: Number of service errors
# HELP vectorized_kafka_schema_id_cache_batches_decompressed Total number of batches decompressed for server-side schema ID validation
# HELP vectorized_kafka_schema_id_cache_hits Total number of hits for the server-side schema ID validation cache (see cluster config: kafka_schema_id_validation_cache_capacity)
# HELP vectorized_kafka_schema_id_cache_misses Total number of misses for the server-side schema ID validation cache (see cluster config: kafka_schema_id_validation_cache_capacity)
# HELP vectorized_leader_balancer_leader_transfer_error Number of errors attempting to transfer leader
# HELP vectorized_leader_balancer_leader_transfer_no_improvement Number of times no balance improvement was found
# HELP vectorized_leader_balancer_leader_transfer_succeeded Number of successful leader transfers
# HELP vectorized_leader_balancer_leader_transfer_timeout Number of timeouts attempting to transfer leader
# HELP vectorized_memory_allocated_memory Allocated memory size in bytes
# HELP vectorized_memory_available_memory Total shard memory potentially available in bytes (free_memory plus reclaimable)
# HELP vectorized_memory_available_memory_low_water_mark The low-water mark for available_memory from process start
# HELP vectorized_memory_cross_cpu_free_operations Total number of cross cpu free
# HELP vectorized_memory_free_memory Free memory size in bytes
# HELP vectorized_memory_free_operations Total number of free operations
# HELP vectorized_memory_malloc_failed Total count of failed memory allocations
# HELP vectorized_memory_malloc_live_objects Number of live objects
# HELP vectorized_memory_malloc_operations Total number of malloc operations
# HELP vectorized_memory_reclaims_operations Total reclaims operations
# HELP vectorized_memory_total_memory Total memory size in bytes
# HELP vectorized_node_status_rpcs_received Number of node status RPCs received by this node
# HELP vectorized_node_status_rpcs_sent Number of node status RPCs sent by this node
# HELP vectorized_node_status_rpcs_timed_out Number of timed out node status RPCs from this node
# HELP vectorized_pandaproxy_request_latency Request latency
# HELP vectorized_raft_configuration_change_in_progress Indicates if current raft group configuration is in joint state i.e. configuration is being changed
# HELP vectorized_raft_done_replicate_requests Number of finished replicate requests
# HELP vectorized_raft_full_heartbeat_requests Number of full heartbeats sent by the leader
# HELP vectorized_raft_group_configuration_updates Number of raft group configuration updates
# HELP vectorized_raft_group_count Number of raft groups
# HELP vectorized_raft_heartbeat_requests_errors Number of failed heartbeat requests
# HELP vectorized_raft_leader_for Number of groups for which node is a leader
# HELP vectorized_raft_leadership_changes Number of leadership changes
# HELP vectorized_raft_lightweight_heartbeat_requests Number of lightweight heartbeats sent by the leader
# HELP vectorized_raft_log_flushes Number of log flushes
# HELP vectorized_raft_log_truncations Number of log truncations
# HELP vectorized_raft_received_append_requests Number of append requests received
# HELP vectorized_raft_received_vote_requests Number of vote requests received
# HELP vectorized_raft_recovery_offsets_pending Sum of offsets that partitions on this node need to recover.
# HELP vectorized_raft_recovery_partition_movement_assigned_bandwidth Bandwidth assigned for partition movement in last tick. bytes/sec
# HELP vectorized_raft_recovery_partition_movement_available_bandwidth Bandwidth available for partition movement. bytes/sec
# HELP vectorized_raft_recovery_partitions_active Number of partition replicas are currently recovering on this node.
# HELP vectorized_raft_recovery_partitions_to_recover Number of partition replicas that have to recover for this node.
# HELP vectorized_raft_recovery_requests Number of recovery requests
# HELP vectorized_raft_recovery_requests_errors Number of failed recovery requests
# HELP vectorized_raft_replicate_ack_all_requests Number of replicate requests with quorum ack consistency
# HELP vectorized_raft_replicate_ack_leader_requests Number of replicate requests with leader ack consistency
# HELP vectorized_raft_replicate_ack_none_requests Number of replicate requests with no ack consistency
# HELP vectorized_raft_replicate_batch_flush_requests Number of replicate batch flushes
# HELP vectorized_raft_replicate_request_errors Number of failed replicate requests
# HELP vectorized_raft_sent_vote_requests Number of vote requests sent
# HELP vectorized_reactor_abandoned_failed_futures Total number of abandoned failed futures, futures destroyed while still containing an exception
# HELP vectorized_reactor_aio_bytes_read Total aio-reads bytes
# HELP vectorized_reactor_aio_bytes_write Total aio-writes bytes
# HELP vectorized_reactor_aio_errors Total aio errors
# HELP vectorized_reactor_aio_outsizes Total number of aio operations that exceed IO limit
# HELP vectorized_reactor_aio_reads Total aio-reads operations
# HELP vectorized_reactor_aio_writes Total aio-writes operations
# HELP vectorized_reactor_cpp_exceptions Total number of C++ exceptions
# HELP vectorized_reactor_cpu_busy_ms Total cpu busy time in milliseconds
# HELP vectorized_reactor_cpu_steal_time_ms Total steal time, the time in which some other process was running while Seastar was not trying to run (not sleeping).Because this is in userspace, some time that could be legitimally thought as steal time is not accounted as such. For example, if we are sleeping and can wake up but the kernel hasn't woken us up yet.
# HELP vectorized_reactor_fstream_read_bytes Counts bytes read from disk file streams.  A high rate indicates high disk activity. Divide by fstream_reads to determine average read size.
# HELP vectorized_reactor_fstream_read_bytes_blocked Counts the number of bytes read from disk that could not be satisfied from read-ahead buffers, and had to block. Indicates short streams, or incorrect read ahead configuration.
# HELP vectorized_reactor_fstream_reads Counts reads from disk file streams.  A high rate indicates high disk activity. Contrast with other fstream_read* counters to locate bottlenecks.
# HELP vectorized_reactor_fstream_reads_ahead_bytes_discarded Counts the number of buffered bytes that were read ahead of time and were discarded because they were not needed, wasting disk bandwidth. Indicates over-eager read ahead configuration.
# HELP vectorized_reactor_fstream_reads_aheads_discarded Counts the number of times a buffer that was read ahead of time and was discarded because it was not needed, wasting disk bandwidth. Indicates over-eager read ahead configuration.
# HELP vectorized_reactor_fstream_reads_blocked Counts the number of times a disk read could not be satisfied from read-ahead buffers, and had to block. Indicates short streams, or incorrect read ahead configuration.
# HELP vectorized_reactor_fsyncs Total number of fsync operations
# HELP vectorized_reactor_io_threaded_fallbacks Total number of io-threaded-fallbacks operations
# HELP vectorized_reactor_logging_failures Total number of logging failures
# HELP vectorized_reactor_polls Number of times pollers were executed
# HELP vectorized_reactor_stalls A histogram of reactor stall durations
# HELP vectorized_reactor_tasks_pending Number of pending tasks in the queue
# HELP vectorized_reactor_tasks_processed Total tasks processed
# HELP vectorized_reactor_timers_pending Number of tasks in the timer-pending queue
# HELP vectorized_reactor_utilization CPU utilization
# HELP vectorized_rpc_client_active_connections Currently active connections
# HELP vectorized_rpc_client_client_correlation_errors Number of errors in client correlation id
# HELP vectorized_rpc_client_connection_errors Number of connection errors
# HELP vectorized_rpc_client_connects Connection attempts
# HELP vectorized_rpc_client_corrupted_headers Number of responses with corrupted headers
# HELP vectorized_rpc_client_in_bytes Total number of bytes sent (including headers)
# HELP vectorized_rpc_client_out_bytes Total number of bytes received
# HELP vectorized_rpc_client_read_dispatch_errors Number of errors while dispatching responses
# HELP vectorized_rpc_client_request_errors Number or requests errors
# HELP vectorized_rpc_client_request_timeouts Number or requests timeouts
# HELP vectorized_rpc_client_requests Number of requests
# HELP vectorized_rpc_client_requests_blocked_memory Number of requests that are blocked because of insufficient memory
# HELP vectorized_rpc_client_requests_pending Number of requests pending
# HELP vectorized_rpc_client_server_correlation_errors Number of responses with wrong correlation id
# HELP vectorized_scheduler_queue_length Size of backlog on this queue, in tasks; indicates whether the queue is busy and/or contended
# HELP vectorized_scheduler_runtime_ms Accumulated runtime of this task queue; an increment rate of 1000ms per second indicates full utilization
# HELP vectorized_scheduler_shares Shares allocated to this queue
# HELP vectorized_scheduler_starvetime_ms Accumulated starvation time of this task queue; an increment rate of 1000ms per second indicates the scheduler feels really bad
# HELP vectorized_scheduler_tasks_processed Count of tasks executing on this queue; indicates together with runtime_ms indicates length of tasks
# HELP vectorized_scheduler_time_spent_on_task_quota_violations_ms Total amount in milliseconds we were in violation of the task quota
# HELP vectorized_scheduler_waittime_ms Accumulated waittime of this task queue; an increment rate of 1000ms per second indicates queue is waiting for something (e.g. IO)
# HELP vectorized_security_audit_buffer_usage_ratio Audit event buffer usage ratio.
# HELP vectorized_security_audit_errors_total Running count of errors in creating/publishing audit event log entries
# HELP vectorized_security_audit_last_event_timestamp_seconds Timestamp of last successful publish on the audit log (seconds since epoch)
# HELP vectorized_stall_detector_reported Total number of reported stalls, look in the traces for the exact reason
# HELP vectorized_storage_compaction_backlog_controller_backlog_size controller backlog
# HELP vectorized_storage_compaction_backlog_controller_error current controller error, i.e difference between set point and backlog size
# HELP vectorized_storage_compaction_backlog_controller_shares controller output, i.e. number of shares
# HELP vectorized_storage_kvstore_cached_bytes Size of the database in memory
# HELP vectorized_storage_kvstore_entries_fetched Number of entries fetched
# HELP vectorized_storage_kvstore_entries_removed Number of entries removaled
# HELP vectorized_storage_kvstore_entries_written Number of entries written
# HELP vectorized_storage_kvstore_key_count Number of keys in the database
# HELP vectorized_storage_kvstore_segments_rolled Number of segments rolled
# HELP vectorized_storage_log_batch_parse_errors Number of batch parsing (reading) errors
# HELP vectorized_storage_log_batch_write_errors Number of batch write errors
# HELP vectorized_storage_log_batches_read Total number of batches read
# HELP vectorized_storage_log_batches_written Total number of batches written
# HELP vectorized_storage_log_cache_hits Reader cache hits
# HELP vectorized_storage_log_cache_misses Reader cache misses
# HELP vectorized_storage_log_cached_batches_read Total number of cached batches read
# HELP vectorized_storage_log_cached_read_bytes Total number of cached bytes read
# HELP vectorized_storage_log_compacted_segment Number of compacted segments
# HELP vectorized_storage_log_compaction_ratio Average segment compaction ratio
# HELP vectorized_storage_log_corrupted_compaction_indices Number of times we had to re-construct the .compaction index on a segment
# HELP vectorized_storage_log_log_segments_active Current number of local log segments
# HELP vectorized_storage_log_log_segments_created Total number of local log segments created since node startup
# HELP vectorized_storage_log_log_segments_removed Total number of local log segments removed since node startup
# HELP vectorized_storage_log_partition_size Current size of partition in bytes
# HELP vectorized_storage_log_read_bytes Total number of bytes read
# HELP vectorized_storage_log_readers_added Number of readers added to cache
# HELP vectorized_storage_log_readers_evicted Number of readers evicted from cache
# HELP vectorized_storage_log_written_bytes Total number of bytes written
# HELP vectorized_tx_partition_idempotency_pid_cache_size Number of active producers (known producer_id seq number pairs).
# HELP vectorized_tx_partition_tx_mem_tracker_consumption_bytes Total memory bytes in use by tx subsystem.
# HELP vectorized_tx_partition_tx_num_inflight_requests Number of ongoing transactional requests.
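
One rough way to compare the two listings is to strip the endpoint prefix from each metric name and then search for the cloud storage entries. The sketch below assumes the two prefixes seen in this thread (`redpanda_` and `vectorized_`); the function name `metric_names` is hypothetical.

```shell
# Read Prometheus-format text on stdin and print the unique metric names
# found on "# HELP" lines, with the endpoint prefix removed.
# Assumption: the only prefixes are "redpanda_" and "vectorized_".
metric_names() {
  awk '$1 == "#" && $2 == "HELP" {
         sub(/^(redpanda|vectorized)_/, "", $3)
         print $3
       }' | sort -u
}

# Usage (assumes the endpoints from the thread are reachable):
#   curl -s http://localhost:9644/public_metrics | metric_names | grep cloud_storage
#   curl -s http://localhost:9644/metrics        | metric_names | grep cloud_storage
```

Stripping the prefix gives only an approximate match, because some names also differ in their suffix between endpoints (for example, `vectorized_application_uptime` and `redpanda_application_uptime_seconds_total` in the listings above).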

@Deflaimun
Collaborator Author

Am I missing a configuration setting that is needed to show these metrics?
I'm using this example:
https://docs.redpanda.com/current/get-started/quick-start/?tab=tabs-1-three-brokers

@JakeSCahill
Collaborator

JakeSCahill commented Apr 8, 2024

I tried this on K8s:

kubectl exec -it -n redpanda redpanda-0 -c redpanda -- curl https://localhost:9644/public_metrics -k | grep '# HELP'
# HELP redpanda_application_build Redpanda build information
# HELP redpanda_application_uptime_seconds_total Redpanda uptime in seconds
# HELP redpanda_cluster_brokers Number of configured brokers in the cluster
# HELP redpanda_cluster_controller_log_limit_requests_available_rps Controller log rate limiting. Available rps for group
# HELP redpanda_cluster_controller_log_limit_requests_dropped Controller log rate limiting. Amount of requests that are dropped due to exceeding limit in group
# HELP redpanda_cluster_members_backend_queued_node_operations Number of queued node operations
# HELP redpanda_cluster_partition_moving_from_node Amount of partitions that are moving from node
# HELP redpanda_cluster_partition_moving_to_node Amount of partitions that are moving to node
# HELP redpanda_cluster_partition_node_cancelling_movements Amount of cancelling partition movements for node
# HELP redpanda_cluster_partition_num_with_broken_rack_constraint Number of partitions that don't satisfy the rack awareness constraint
# HELP redpanda_cluster_partitions Number of partitions in the cluster (replicas not included)
# HELP redpanda_cluster_topics Number of topics in the cluster
# HELP redpanda_cluster_unavailable_partitions Number of partitions that lack quorum among replicants
# HELP redpanda_cpu_busy_seconds_total Total CPU busy time in seconds
# HELP redpanda_io_queue_total_read_ops Total read operations passed in the queue
# HELP redpanda_io_queue_total_write_ops Total write operations passed in the queue
# HELP redpanda_kafka_handler_latency_seconds Latency histogram of kafka requests
# HELP redpanda_kafka_max_offset Latest readable offset of the partition (i.e. high watermark)
# HELP redpanda_kafka_partitions Configured number of partitions for the topic
# HELP redpanda_kafka_records_fetched_total Total number of records fetched
# HELP redpanda_kafka_records_produced_total Total number of records produced
# HELP redpanda_kafka_replicas Configured number of replicas for the topic
# HELP redpanda_kafka_request_bytes_total Total number of bytes produced per topic
# HELP redpanda_kafka_request_latency_seconds Internal latency of kafka produce requests
# HELP redpanda_kafka_rpc_sasl_session_expiration_total Total number of SASL session expirations
# HELP redpanda_kafka_rpc_sasl_session_reauth_attempts_total Total number of SASL reauthentication attempts
# HELP redpanda_kafka_rpc_sasl_session_revoked_total Total number of SASL sessions revoked
# HELP redpanda_kafka_under_replicated_replicas Number of under replicated replicas (i.e. replicas that are live, but not at the latest offest)
# HELP redpanda_memory_allocated_memory Allocated memory size in bytes
# HELP redpanda_memory_available_memory Total shard memory potentially available in bytes (free_memory plus reclaimable)
# HELP redpanda_memory_available_memory_low_water_mark The low-water mark for available_memory from process start
# HELP redpanda_memory_free_memory Free memory size in bytes
# HELP redpanda_node_status_rpcs_received Number of node status RPCs received by this node
# HELP redpanda_node_status_rpcs_sent Number of node status RPCs sent by this node
# HELP redpanda_node_status_rpcs_timed_out Number of timed out node status RPCs from this node
# HELP redpanda_raft_leadership_changes Number of leadership changes across all partitions of a given topic
# HELP redpanda_raft_recovery_offsets_pending Sum of offsets that partitions on this node need to recover.
# HELP redpanda_raft_recovery_partition_movement_available_bandwidth Bandwidth available for partition movement. bytes/sec
# HELP redpanda_raft_recovery_partition_movement_consumed_bandwidth Bandwidth consumed for partition movement. bytes/sec
# HELP redpanda_raft_recovery_partitions_active Number of partition replicas are currently recovering on this node.
# HELP redpanda_raft_recovery_partitions_to_recover Number of partition replicas that have to recover for this node.
# HELP redpanda_rest_proxy_request_errors_total Total number of rest_proxy server errors
# HELP redpanda_rest_proxy_request_latency_seconds Internal latency of request for rest_proxy
# HELP redpanda_rpc_active_connections Count of currently active connections
# HELP redpanda_rpc_received_bytes internal: Number of bytes received from the clients in valid requests
# HELP redpanda_rpc_request_errors_total Number of rpc errors
# HELP redpanda_rpc_request_latency_seconds RPC latency
# HELP redpanda_rpc_sent_bytes internal: Number of bytes sent to clients
# HELP redpanda_scheduler_runtime_seconds_total Accumulated runtime of task queue associated with this scheduling group
# HELP redpanda_schema_registry_request_errors_total Total number of schema_registry server errors
# HELP redpanda_schema_registry_request_latency_seconds Internal latency of request for schema_registry
# HELP redpanda_security_audit_errors_total Running count of errors in creating/publishing audit event log entries
# HELP redpanda_security_audit_last_event_timestamp_seconds Timestamp of last successful publish on the audit log (seconds since epoch)
# HELP redpanda_storage_disk_free_bytes Disk storage bytes free.
# HELP redpanda_storage_disk_free_space_alert Status of low storage space alert. 0-OK, 1-Low Space 2-Degraded
# HELP redpanda_storage_disk_total_bytes Total size of attached storage, in bytes.
# HELP redpanda_tls_certificate_expires_at_timestamp_seconds Expiry time of the server certificate (seconds since epoch)
# HELP redpanda_tls_certificate_serial Least significant four bytes of the server certificate serial number
# HELP redpanda_tls_certificate_valid The value is one if the certificate is valid with the given truststore, otherwise zero.
# HELP redpanda_tls_loaded_at_timestamp_seconds Load time of the server certificate (seconds since epoch).
# HELP redpanda_tls_truststore_expires_at_timestamp_seconds Expiry time of the shortest-lived CA in the truststore(seconds since epoch)

I also don't see any cloud storage metrics in this output.

Version:     v23.3.10
Git ref:     255020e51d
Build date:  2024-03-28T07:53:07Z
OS/Arch:     linux/arm64
Go version:  go1.21.3

Redpanda Cluster
  node-0  v23.3.10 - 255020e51d9ed9b7f90a67872bfc81d62923e3e9
  node-1  v23.3.10 - 255020e51d9ed9b7f90a67872bfc81d62923e3e9
  node-2  v23.3.10 - 255020e51d9ed9b7f90a67872bfc81d62923e3e9
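
For anyone repeating this check, here is a minimal sketch that does the same thing as the `grep '# HELP'` step above and then filters for cloud storage metrics. It is not part of Redpanda or rpk. The function names are hypothetical, and the `cloud_storage` substring is an assumption about how the metrics in question are named.

```python
import ssl
import urllib.request


def help_metric_names(exposition_text):
    """Return metric names from '# HELP <name> <description>' lines, in order."""
    names = []
    for line in exposition_text.splitlines():
        parts = line.strip().split(None, 3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP":
            names.append(parts[2])
    return names


def names_containing(names, needle):
    """Keep only the metric names that contain the given substring."""
    return [name for name in names if needle in name]


def fetch_metrics_text(url, verify_tls=True):
    """Download the raw metrics text from an endpoint.

    verify_tls=False skips certificate checks, like `curl -k` above.
    """
    context = ssl.create_default_context()
    if not verify_tls:
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Assumption: the admin API is reachable on localhost:9644, as in this thread.
    text = fetch_metrics_text("https://localhost:9644/public_metrics", verify_tls=False)
    found = names_containing(help_metric_names(text), "cloud_storage")
    print("\n".join(found) if found else "no cloud storage metrics exposed")
```

Pointing the same script at `/metrics` instead of `/public_metrics` gives a quick side-by-side of what each endpoint exposes.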

@gavinheavyside

Per the Slack discussion, I think you'll only see these metrics in public_metrics when you have an enterprise license and Tiered Storage configured, but the core team will be able to confirm.

@andijcr andijcr self-requested a review April 23, 2024 16:24
Contributor

@andijcr andijcr left a comment

lgtm

= Cloud Metrics
:description: Use Redpanda Cloud metrics to create your system dashboard.

This section provides reference descriptions about the public metrics exported from Redpanda Cloud.
Collaborator

Suggested change
This section provides reference descriptions about the public metrics exported from Redpanda Cloud.
This section provides descriptions of the public metrics exported from Redpanda Cloud.

@Deflaimun Deflaimun changed the title add Cloud metrics reference add Cloud storage metrics reference Apr 26, 2024