Skip to content

Understanding FoundationDB Log Files

Evan Tschannen edited this page Dec 12, 2019 · 25 revisions

Mandatory Fields

Type - The name of the trace event.

LogGroup - The value of the --loggroup parameter passed to fdbserver. This is useful when trace logs from multiple databases are ingested into the same system.

Severity:

  • 10 - SevInfo - General information events
  • 20 - SevWarn - Indicates a generally rare event, but not indicate a actual problem
  • 30 - SevWarnAlways - Indicates an event that could be causing performance problems in a cluster
  • 40 - SevError - Indicates one of the assumptions of the database has been violated. Generally caused by hardware failures.

ID - The ID of the role which produced the event. For instance, messages logged by the master will contain the master's ID.

Machine - The IP address and port of the process which produced the event. For client trace logs the pid of the process is logged instead of the port.

Time - The time at which the event happened.

Roles - A comma separated list of two letter codes which correspond to all of the roles running on the process at the time the message was logged.

  • StorageServer - SS
  • TLog - TL
  • MasterProxyServer - MP
  • MasterServer - MS
  • Resolver - RV
  • ClusterController - CC
  • Tester - TS
  • LogRouter - LR
  • DataDistributor - DD
  • Ratekeeper - RK
  • Coordinator - CD

Common Fields

Error - Events which are logged because of an error include the associated error code.

Backtrace - Some high severity events include a backtrace which provides the call stack of the system at the time the event was generated.

TrackLatestType - The trace files generated by FoundationDB will automatically roll after a period of time. Some events are re-logged every time the trace file is rolled, so we will always have a history of the last time the message was called. If the TrackLatestType is Original that means the event just happened. If the TrackLatestType is Rolled it means the event happened sometime in the past. For Rolled trace events, the OriginalTime field is the time at which the original event happened.

SuppressedEventCount - Many events have a maximum cap on the number of times they can be logged in a given time interval. Events that have this time of suppression include the number of events that were not logged in the next message which does get written.

Counters - Fields which have values that are three space separated numeric values are called counters. The first number is the the amount the value has increases per second since the field was last logged. The second number is a roughness, where larger numbers represent that the increases to this field are very bursty, and smaller values represent that the value has increase at a very regular interval. The third number is the total amount this value has increased since the start of the role logging the event.

Periodic Messages

TLogMetrics

The difference between BytesInput and BytesDurable is the amount of memory the TLog is using. Multiple TLogs can be recruited on the same process, but only one of them can be active. SharedBytesInput and SharedBytesDurable represent the combined memory used by all TLogs on the same process. Once this difference reaches 1.1GB batch priority transactions will be throttled. Once it reaches 1.5GB the TLog will start take writing mutations from memory to disk. Future reads for these mutations will need to read them from disk. Once this difference reaches 2.0GB Ratekeeper will start slowing down client traffic, with the goal of ensuring TLog memory does not exceed 2.4GB. SharedOverheadBytesInput and SharedOverheadBytesDurable show the amount of memory being used on overhead, and should generally be much smaller than the total amount of memory in use.

  • Version
  • QueueCommittedVersion
  • PersistentDataVersion
  • PersistentDataDurableVersion
  • KnownCommittedVersion
  • MinPoppedTagVersion
  • MinPoppedTagLocality
  • MinPoppedTagId
  • KvstoreBytesUsed
  • KvstoreBytesFree
  • KvstoreBytesAvailable
  • KvstoreBytesTotal
  • QueueDiskBytesUsed
  • QueueDiskBytesFree
  • QueueDiskBytesAvailable
  • QueueDiskBytesTotal
  • PeekMemoryReserved
  • PeekMemoryRequestsStalled

ProxyMetrics

  • TxnStartIn
  • TxnStartOut
  • TxnStartBatch
  • TxnSystemPriorityStartIn
  • TxnSystemPriorityStartOut
  • TxnBatchPriorityStartIn
  • TxnBatchPriorityStartOut
  • TxnDefaultPriorityStartIn
  • TxnDefaultPriorityStartOut
  • TxnCommitIn
  • TxnCommitVersionAssigned
  • TxnCommitResolving
  • TxnCommitResolved
  • TxnCommitOut
  • TxnCommitOutSuccess
  • TxnConflicts
  • CommitBatchIn
  • CommitBatchOut
  • MutationBytes
  • Mutations
  • ConflictRanges
  • KeyServerLocationRequests
  • LastAssignedCommitVersion
  • Version
  • CommittedVersion
  • CommitBatchesMemBytesCount

StorageMetrics

  • GetKeyQueries
  • GetValueQueries
  • GetRangeQueries
  • QueryQueue
  • FinishedQueries
  • RowsQueried
  • BytesQueried
  • WatchQueries
  • EmptyQueries
  • BytesInput
  • BytesDurable
  • BytesFetched
  • MutationBytes
  • SampledBytesCleared
  • Mutations
  • SetMutations
  • ClearRangeMutations
  • AtomicMutations
  • UpdateBatches
  • UpdateVersions
  • Loops
  • FetchWaitingMS
  • FetchWaitingCount
  • FetchExecutingMS
  • FetchExecutingCount
  • ReadsRejected
  • LastTLogVersion
  • Version
  • StorageVersion
  • DurableVersion
  • DesiredOldestVersion
  • VersionLag
  • LocalRate
  • FetchKeysFetchActive
  • FetchKeysWaiting
  • QueryQueueMax
  • BytesStored
  • ActiveWatches
  • WatchBytes
  • KvstoreBytesUsed
  • KvstoreBytesFree
  • KvstoreBytesAvailable
  • KvstoreBytesTotal

ProcessMetrics

MachineMetrics

NetworkMetrics

MemoryMetrics

LogRouterMetrics

ClusterControllerMetrics

MovingData

TotalDataInFlight

DDTrackerStats

RkUpdate

RkUpdateBatch

DatacenterVersionDifference

SpringCleaningMetrics

Edge Triggered Messages

Generic Failure Info

SlowTask

Net2SlowTaskTrace

FailureDetectionStatus

ConnectingTo

ConnectionClosed

ConnectionTimedOut

ConnectionTimeout

ProgramStart

Role

TLogDegraded

LargeTransaction

ProxyCommitBatchMemoryThresholdExceeded

TooManyNotifications

TooManyStatusRequests

LargePacketSent

LargePacketReceived

HugeArenaSample

GetMagazineSample

SlowKAIOTruncate

SlowKAIOLaunch

KVCommit10sSample

StorageServerUpdateLag

TLogUpdateLag

TraceEventThrottle_*

TLS*

ServerTag

ServerTagRemove

Master Recovery

MasterRecoveryState

MasterRecovering

MasterRecoveredConfig

MasterRegistrationReceived

MasterRecoveryDuration

MasterTerminated

BetterMasterExists

CCWDB

TLog Recovery

TLogRestorePersistentState

TLogRestorePersistentStateDone

TLogRecover

TLogReady

TLogPoppedTag

TLogLockStarted

TLogLocked

TLogStop

GetDurableResult

GetDurableResultWaiting

Storage Server Recovery

Rollback

ReadingDurableState

RestoringDurableState

SSTimeRestoreDurableState

StorageServerStartingCore

Data Distribution

StatusMapChange

TeamHealthChanged

BuildTeamsBegin

ServerTeamRemover

MachineTeamRemover

DDInitTakingMoveKeysLock

DDInitTookMoveKeysLock

Coordination

NominatingLeader

ReplacedAsLeader

LeaderTrueHeartbeat

LeaderFalseHeartbeat

LeaderNoHeartbeat

ReleasingLeadership

Client Messages

ClientStart

TransactionMetrics

Clone this wiki locally