
Aghast Specification

Introduction

Aghast encodes aggregated, histogram-like statistics in a way that can be shared among many libraries that fill, fit, or plot such things. It represents reduced data, such as the output of a histogram-filling (map) and combining (reduce) process, or the output of an SQL-like group-by procedure.

Aghast is not intended to be used directly by data analysts. It is horrifically verbose. Instead, it is a general ontology of aggregated data types with conversion routines to translate among libraries. With the appropriate translations behind the scenes, data analysts can reduce large datasets with one library, model/fit the aggregation with another, and plot it with a third.

This role is inspired by two projects: Apache Arrow, which acts as an efficient memory representation for nested, columnar data, and PMML, which is an ontology of data mining models in XML. Like Arrow, aghast is an in-memory format to pass data in a random-access and zero-copy way (though an in-memory format can also be saved to disk). Like PMML, aghast expresses domain-specific statistical concepts, such as histograms, moments, quantiles, covariances, fit functions, and ntuples, rather than generic data types.

Aghast is specified using Google Flatbuffers. Flatbuffers allows for lazy, random-access interpretation: any nested property can be accessed without reading or deserializing the others. Thus, a “ghast” (a sharable object) could contain thousands of histograms and accessing the bin contents of a particular one would not only leave the other histograms untouched, it would leave the histogram’s own metadata untouched. Flatbuffers minimizes its serialized size (because it is intended as a network protocol) and its serialization/deserialization time (because it is intended for computer games).

Aghast is designed to scale. Modern particle physics analyses deal in thousands of histograms generated by viewing the same quantities with many different “cuts” (selection filters) and variations (to test sensitivity to systematic effects). This usually results in duplicated histogram axis descriptions: every histogram in a group of a hundred must be binned the same way. Duplication inflates serialized size, but it also introduces the possibility that some binnings might disagree (due to a copy-paste error) and the need for extra validity checks. Aghast is designed around the idea of “superhistograms,” representing groups of histograms that have identical binnings as a binning in a larger histogram. Thus, there are axis types for PredicateBinning (defined by if-then predicates) and VariationBinning (defined by systematic variations), and the data filling these bins are contiguous across all histograms in the group.

What this specification does not define are any methods for filling, fitting, or plotting. From a data analyst’s perspective, this is everything one does with histograms or other statistical objects. This is because aghast is firmly behind-the-scenes, a helper to other fine analysis tools, such as ROOT, Boost.Histogram, Physt, and Pandas.

Data Types

Aghast is specified as a Flatbuffers format in flatbuffers/aghast.fbs. Flatbuffers provides a standard suite of types that can be translated into many languages. However, the code the Flatbuffers code generator produces is too low-level even for applications to use as a backend, so we describe the interfaces and types of wrapper classes here.

These class descriptions are sufficiently constrained to fit into any static type system, and while they include heterogeneous lists (lists of an enumerated union type), they can be decomposed into homogeneous lists with an additional level of nesting. In fact, Flatbuffers doesn’t allow heterogeneous lists, so the aghast schema itself provides an example of this decomposition. (For example, the heterogeneous objects list in a Collection, which can contain Histogram, ParameterizedFunction, BinnedEvaluatedFunction, and Ntuple, is encoded in Flatbuffers as a homogeneous list of Object, which contains a single union of ObjectData.)

Basic types, like booleans, integers, floating point numbers, and strings, are passed through without modification (though strings are explicitly encoded as utf-8; Flatbuffers strings are not encoding-aware). Integers and floating point numbers may have a constrained range, such as [0, ∞) for non-negative numbers excluding ∞ or (0, 2π] for positive values less than or equal to 2π. (A square bracket includes the endpoint; a round bracket excludes it.) Empty strings and missing strings (null) are distinct.

Lists may contain basic types or class instances, and there is no distinction between empty lists and missing lists (an artifact of Flatbuffers).

In some cases, we want a mapping type, such as str → X, so that objects are retrievable by name, rather than index. Flatbuffers does not have such a type, so we build it by decomposing the high-level mapping into a low-level pair of lists with equal length. (For example, the objects mapping in a Collection is encoded in Flatbuffers as a list objects and a list lookup.)
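For instance, a minimal sketch of the decomposition in plain Python (illustrative placeholder values, not the generated Flatbuffers API):

lookup  = ["h1", "h2"]                            # keys, in storage order
objects = ["<Histogram h1>", "<Histogram h2>"]    # values, same length and order

def get(name):
    return objects[lookup.index(name)]            # retrieve by name, not by index

assert get("h2") == "<Histogram h2>"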

Class objects may be missing (null) if they are not required. Required properties are a Flatbuffers feature: it doesn’t generate code that would allow the serialized object to be missing. The class schemas can evolve to include more properties (with full forward and backward compatibility), but properties cannot be removed and required properties cannot become non-required.

Any properties that are not required have a default value (usually null).

In addition to the type constraints, which are tighter in the wrapper classes than they are in the Flatbuffers serialization, we list invariants (conditions that must be true) that depend on multiple properties or multiple class objects. To avoid unnecessary Flatbuffers deserialization, these are not automatically checked, but a check can be invoked and any behavior when those invariants are not satisfied is undefined.

A sharable ghast may have one of the following types: Collection, Histogram, ParameterizedFunction, BinnedEvaluatedFunction, or Ntuple.

Collection

Collection of named objects, possibly with one or more common axes.

•  objects: str → Histogram or Ntuple or ParameterizedFunction or BinnedEvaluatedFunction or Collection
(default: null/empty)
•  axis: list of Axis (default: null/empty)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)
•  script: str (default: null)

Details:

A simple reason for using a collection would be to gather many objects into a convenient package that can be transmitted as a group. For this purpose, axis should be empty. Note that objects (such as histograms, functions, and ntuples) do not have names on their own; names are just keys in the objects property, used solely for lookup.

Assigning an axis to a collection, rather than individually to every object it contains, avoids duplication when defining similarly binned data. As an example, consider three histograms h1, h2, h3 with two sets of cuts applied, "signal" and "control" (six histograms total).

Collection({"h1": h1, "h2": h2, "h3": h3},
           axis=[Axis(PredicateBinning("signal"), PredicateBinning("control"))])

This predicate axis (defined by if-then rules when the histograms were filled) is prepended onto the axes defined in each histogram separately. For instance, if h1 had one regular axis and h2 had two irregular axes, the "h1" in this collection has two axes: predicate, then regular, and the "h2" in this collection has three axes: predicate, then irregular, then irregular. This way, hundreds or thousands of histograms with similar binning can be defined in a contiguous block without repetition of axis definition (good for efficiency and avoiding copy-paste errors).

To subdivide one set of objects and not another, or to subdivide two sets of objects differently, put collections inside of collections. In the following example, h1 and h2 are subdivided but h3 is not.

Collection({"by region":
                Collection({"h1": h1, "h2": h2},
                axis=[Axis(PredicateBinning("signal"), PredicateBinning("control"))]),
            "h3": h3})

Similarly, regions can be subdivided into subregions, and other binning types may be used.

The buffers for each object must be the appropriate size to represent all of its axes, including any inherited from collections. (For example, a counts buffer appropriate for a standalone h1 would not fit an "h1" with prepended axes due to being in a collection.)
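To make the size requirement concrete, here is the arithmetic for the example above (plain Python; the numbers are illustrative):

collection_bins = 2    # PredicateBinning(["signal", "control"]) has 2 bins
h1_bins = 10           # suppose h1's own axis is RegularBinning(10, ...)

# A standalone h1 needs 10 counts, but "h1" inside the collection needs
# 2 * 10 = 20, stored contiguously: the 10 "signal" bins, then the 10
# "control" bins.
assert collection_bins * h1_bins == 20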

The title, metadata, decoration, and script properties have no semantic constraints.

Histogram

Histogram of a distribution, defined by a (possibly weighted) count of observations in each bin of an n-dimensional space.

•  axis: list of Axis with length in [1, ∞) (required)
•  counts: UnweightedCounts or WeightedCounts (required)
•  profile: list of Profile (default: null/empty)
•  axis_covariances: list of Covariance (default: null/empty)
•  profile_covariances: list of Covariance (default: null/empty)
•  functions: str → ParameterizedFunction or EvaluatedFunction (default: null/empty)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)
•  script: str (default: null)
The xindex and yindex of each Covariance in axis_covariances must be in [0, number of axis) and be unique pairs (unordered).
The xindex and yindex of each Covariance in profile_covariances must be in [0, number of profile) and be unique pairs (unordered).

Details:

The space is subdivided by an n-dimensional axis. As described in Collection, nesting a histogram within a collection prepends the collection’s axis. The number of Axis objects is not necessarily the dimensionality of the space; some binnings, such as HexagonalBinning, define more than one dimension (though most do not).

The counts are separate from the axis, though the buffers providing counts must be exactly the right size to fit the n-dimensional binning (including axes inherited from a Collection).

Histograms with only axis and counts are pure distributions, histograms in the conventional sense. All other properties provide additional information about the dataset.

Any profiles summarize dependent variables (where the axis defines independent variables). For instance, a profile can represent mean and standard deviation y values for an axis binned in x.

The Axis and Profile classes internally define summary statistics, such as the mean or median of that axis. However, those Statistics objects cannot describe correlations among axes. If this information is available, it can be expressed in axis_covariances or profile_covariances.

Any functions associated with the histogram, such as fit results, may be attached directly to the histogram object with names. If an EvaluatedFunction is included, its binning is derived from the histogram’s full axis (including any axis inherited from a Collection).

The title, metadata, decoration, and script properties have no semantic constraints.


Axis

Axis of a histogram or binned function representing one or more binned dimensions.

•  binning: IntegerBinning or RegularBinning or HexagonalBinning or EdgesBinning or IrregularBinning or CategoryBinning or SparseRegularBinning or FractionBinning or PredicateBinning or VariationBinning
(default: null)
•  expression: str (default: null)
•  statistics: list of Statistics (default: null/empty)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)
The statistics must be empty or have a length equal to the number of dimensions in the binning (a missing binning counts as one-dimensional).

Details:

The dimension or dimensions are subdivided by the binning property; all other properties provide additional information.

If the axis represents a computed expression (derived feature), it may be encoded here as a string. The title is a human-readable description.

A Statistics object (one per dimension) summarizes the data separately from the histogram counts. For instance, it may contain the mean and standard deviation of all data along a dimension, which is more accurate than a mean and standard deviation derived from the counts.

The expression, title, metadata, and decoration properties have no semantic constraints.

IntegerBinning

Splits a one-dimensional axis into a contiguous set of integer-valued bins.

•  min: int in (‒∞, ∞) (required)
•  max: int in (‒∞, ∞) (required)
•  loc_underflow: one of {BinLocation.below3, BinLocation.below2, BinLocation.below1, BinLocation.nonexistent, BinLocation.above1, BinLocation.above2, BinLocation.above3}
(default: BinLocation.nonexistent)
•  loc_overflow: one of {BinLocation.below3, BinLocation.below2, BinLocation.below1, BinLocation.nonexistent, BinLocation.above1, BinLocation.above2, BinLocation.above3}
(default: BinLocation.nonexistent)
The min must be strictly less than the max.
The loc_underflow and loc_overflow must not be equal unless they are nonexistent.

Details:

This binning is intended for one-dimensional, integer-valued data in a compact range. The min and max values are both inclusive, so the number of bins is 1 + max - min.

If loc_underflow and loc_overflow are nonexistent, then there are no slots in the Histogram counts or BinnedEvaluatedFunction values for underflow or overflow. If they are below, then their slots precede the normal bins, if above, then their slots follow the normal bins, and their order is in sequence: below3, below2, below1, (normal bins), above1, above2, above3.
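For example (a sketch; the keyword names follow the property names above, and the constructor signature is an assumption):

IntegerBinning(-5, 5,
               loc_underflow=BinLocation.below1,
               loc_overflow=BinLocation.above1)

# 1 + 5 - (-5) = 11 normal bins, plus one underflow and one overflow slot,
# so a counts buffer for this axis has 13 items, ordered:
#   [underflow, -5, -4, ..., 4, 5, overflow]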

RegularBinning

Splits a one-dimensional axis into an ordered, abutting set of equal-sized real intervals.

•  num: int in [1, ∞) (required)
•  interval: RealInterval (required)
•  overflow: RealOverflow (default: null)
•  circular: bool (default: false)
The interval.low and interval.high limits must both be finite.
The interval.low_inclusive and interval.high_inclusive cannot both be true. (They can both be false, which allows for infinitesimal gaps between bins.)

Details:

This binning is intended for one-dimensional, real-valued data in a compact range. The limits of this range are specified in a single RealInterval, and the number of subdivisions is num.

The existence and positions of any underflow, overflow, and nanflow bins, as well as how non-finite values were handled during filling, are contained in the RealOverflow.

If the binning is circular, then it represents a finite segment in which interval.low is topologically identified with interval.high. This could be used to convert [‒π, π) intervals into [0, 2π) intervals, for instance.
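For example, a circular azimuthal-angle axis might be constructed like this (a sketch following the constructor pattern of the examples in this document; the circular keyword name follows the property name):

Axis(RegularBinning(64, RealInterval(-3.14159, 3.14159), circular=True), "phi")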

See also:

  • RegularBinning: for ordered, equal-sized, abutting real intervals.

  • EdgesBinning: for ordered, any-sized, abutting real intervals.

  • IrregularBinning: for unordered, any-sized real intervals (that may even overlap).

  • SparseRegularBinning: for unordered, equal-sized real intervals aligned to a regular grid, which only need to be defined if the bin content is not empty.

RealInterval

Represents a real interval with inclusive (closed) or exclusive (open) endpoints.

•  low: float in [‒∞, ∞] (required)
•  high: float in [‒∞, ∞] (required)
•  low_inclusive: bool (default: true)
•  high_inclusive: bool (default: false)
The low limit must be less than or equal to the high limit.
The low limit may only be equal to the high limit if at least one endpoint is inclusive (low_inclusive or high_inclusive is true). Such an interval would represent a single real value.

Details:

The position and size of the real interval is defined by low and high, and each endpoint is inclusive (closed) if low_inclusive or high_inclusive, respectively, is true. Otherwise, the endpoint is exclusive (open).

A single interval defines a RegularBinning and a set of intervals defines an IrregularBinning.

RealOverflow

Underflow, overflow, and nanflow configuration for one-dimensional, real-valued data.

•  loc_underflow: one of {BinLocation.below3, BinLocation.below2, BinLocation.below1, BinLocation.nonexistent, BinLocation.above1, BinLocation.above2, BinLocation.above3}
(default: BinLocation.nonexistent)
•  loc_overflow: one of {BinLocation.below3, BinLocation.below2, BinLocation.below1, BinLocation.nonexistent, BinLocation.above1, BinLocation.above2, BinLocation.above3}
(default: BinLocation.nonexistent)
•  loc_nanflow: one of {BinLocation.below3, BinLocation.below2, BinLocation.below1, BinLocation.nonexistent, BinLocation.above1, BinLocation.above2, BinLocation.above3}
(default: BinLocation.nonexistent)
•  minf_mapping: one of {RealOverflow.missing, RealOverflow.in_underflow, RealOverflow.in_overflow, RealOverflow.in_nanflow}
(default: RealOverflow.in_underflow)
•  pinf_mapping: one of {RealOverflow.missing, RealOverflow.in_underflow, RealOverflow.in_overflow, RealOverflow.in_nanflow}
(default: RealOverflow.in_overflow)
•  nan_mapping: one of {RealOverflow.missing, RealOverflow.in_underflow, RealOverflow.in_overflow, RealOverflow.in_nanflow}
(default: RealOverflow.in_nanflow)
The loc_underflow, loc_overflow, and loc_nanflow must not be equal unless they are nonexistent.
The minf_mapping (‒∞ mapping) can only be missing, in_underflow, or in_nanflow, not in_overflow.
The pinf_mapping (+∞ mapping) can only be missing, in_overflow, or in_nanflow, not in_underflow.

Details:

If loc_underflow, loc_overflow, and loc_nanflow are nonexistent, then there are no slots in the Histogram counts or BinnedEvaluatedFunction values for underflow, overflow, or nanflow. Underflow represents values smaller than the lower limit of the binning, overflow represents values larger than the upper limit of the binning, and nanflow represents floating point values that are nan (not a number). With the normal bins, underflow, overflow, and nanflow, every possible input value corresponds to some bin.

If any of the loc_underflow, loc_overflow, and loc_nanflow are below, then their slots precede the normal bins, if above, then their slots follow the normal bins, and their order is in sequence: below3, below2, below1, (normal bins), above1, above2, above3. It is possible to represent a histogram counts buffer with the three special bins in any position relative to the normal bins.

The minf_mapping specifies whether ‒∞ values were ignored when the histogram was filled (missing), are in the underflow bin (in_underflow) or are in the nanflow bin (in_nanflow). The pinf_mapping specifies whether +∞ values were ignored when the histogram was filled (missing), are in the overflow bin (in_overflow) or are in the nanflow bin (in_nanflow). Thus, it would be possible to represent a histogram that was filled with finite underflow/overflow bins and a generic bin for all three non-finite floating point states.
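As a concrete sketch of the slot ordering, consider a 10-bin binning with loc_underflow=below1, loc_overflow=above1, and loc_nanflow=above2:

slot  0      : underflow  (values below the low edge)
slots 1 - 10 : the 10 normal bins
slot  11     : overflow   (values above the high edge)
slot  12     : nanflow    (values that are nan)

# 13 slots in total: a counts buffer for this axis must have exactly 13 items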

HexagonalBinning

Splits a two-dimensional axis into a tiling of equal-sized hexagons.

•  qmin: int in (‒∞, ∞) (required)
•  qmax: int in (‒∞, ∞) (required)
•  rmin: int in (‒∞, ∞) (required)
•  rmax: int in (‒∞, ∞) (required)
•  coordinates: one of {HexagonalBinning.offset, HexagonalBinning.doubled_offset, HexagonalBinning.cube_xy, HexagonalBinning.cube_yz, HexagonalBinning.cube_xz}
(default: HexagonalBinning.offset)
•  xorigin: float in (‒∞, ∞) (default: 0.0)
•  yorigin: float in (‒∞, ∞) (default: 0.0)
•  qangle: float in [‒π/2, π/2] (default: 0.0)
•  bin_width: float in (0.0, ∞) (default: 1.0)
•  qoverflow: RealOverflow (default: null)
•  roverflow: RealOverflow (default: null)
The qmin must be strictly less than the qmax.
The rmin must be strictly less than the rmax.

Details:

This binning is intended for two-dimensional, real-valued data in a compact region. Hexagons tile a two-dimensional plane, just as rectangles do, but whereas a rectangular tiling can be represented by two RegularBinning axes, hexagonal binning requires a special binning. Some advantages of hexagonal binning over rectangular binning are described in the literature.

As with any other binning, integer-valued indexes in the Histogram counts or BinnedEvaluatedFunction values are mapped to values in the data space. However, rather than mapping a single integer slot position to an integer, real interval, or categorical data value, two integers from a rectangular integer grid are mapped to hexagonal tiles. The integers are labeled q and r, with q values between qmin and qmax (inclusive) and r values between rmin and rmax (inclusive). The total number of bins is (1 + qmax - qmin)*(1 + rmax - rmin). Data coordinates are labeled x and y.

There are several different schemes for mapping integer rectangles to hexagonal tiles; we use the standard offset, doubled_offset, cube_xy, cube_yz, and cube_xz schemes, specified by the coordinates property. The center of the q = 0, r = 0 tile is at xorigin, yorigin.

In “pointy topped” coordinates, qangle is zero if increasing q is collinear with increasing x, and this angle ranges from ‒π/2, if increasing q is collinear with decreasing y, to π/2, if increasing q is collinear with increasing y. The bin_width is the shortest distance between adjacent tile centers: the line between tile centers crosses the border between tiles at a right angle.

A roughly but not exactly rectangular region of x and y falls within a slot in q and r. Overflows, underflows, and nanflows, converted to floating point q and r, are represented by overflow, underflow, and nanflow bins in qoverflow and roverflow. Note that the total number of bins is strictly multiplicative (as it would be for a rectangular binning with two RegularBinning axes): the total number of bins is (the number of normal q bins plus any q overflow bins) times (the number of normal r bins plus any r overflow bins). That is, all r bins are represented for each q bin, even overflow q bins.
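For example (plain Python; the numbers are illustrative):

q_bins = 1 + 9 - 0             # qmin=0, qmax=9: 10 normal q bins
r_bins = 1 + 4 - 0             # rmin=0, rmax=4: 5 normal r bins
assert q_bins * r_bins == 50   # with no overflow bins, 50 slots in total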

EdgesBinning

Splits a one-dimensional axis into an ordered, abutting set of any-sized real intervals.

•  edges: list of float with length in [1, ∞) (required)
•  overflow: RealOverflow (default: null)
•  low_inclusive: bool (default: true)
•  high_inclusive: bool (default: false)
•  circular: bool (default: false)
All edges must be finite and strictly increasing.
An edges list of length 1 is only allowed if overflow is non-null with at least one underflow, overflow, or nanflow bin.
The low_inclusive and high_inclusive cannot both be true. (They can both be false, which allows for infinitesimal gaps between bins.)

Details:

This binning is intended for one-dimensional, real-valued data in a compact range. The limits of this range and the size of each bin are defined by edges, which are the edges between the bins. Since they are edges between bins, the number of non-overflow bins is len(edges) - 1. The degenerate case of exactly one edge is only allowed if there are any underflow, overflow, or nanflow bins.

The existence and positions of any underflow, overflow, and nanflow bins, as well as how non-finite values were handled during filling, are contained in the RealOverflow.

If low_inclusive is true, then all intervals between pairs of edges include the low edge. If high_inclusive is true, then all intervals between pairs of edges include the high edge.

If the binning is circular, then it represents a finite segment in which the first edge is topologically identified with the last edge. This could be used to convert [‒π, π) intervals into [0, 2π) intervals, for instance.
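For example (plain Python; the edge values are illustrative):

edges = [1.0, 2.0, 5.0, 10.0]
# three abutting, unequal bins: [1, 2), [2, 5), [5, 10)
# (with the default low_inclusive=True, high_inclusive=False)
assert len(edges) - 1 == 3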

See also:

  • RegularBinning: for ordered, equal-sized, abutting real intervals.

  • EdgesBinning: for ordered, any-sized, abutting real intervals.

  • IrregularBinning: for unordered, any-sized real intervals (that may even overlap).

  • SparseRegularBinning: for unordered, equal-sized real intervals aligned to a regular grid, which only need to be defined if the bin content is not empty.

IrregularBinning

Splits a one-dimensional axis into unordered, any-sized real intervals (that may even overlap).

•  intervals: list of RealInterval with length in [1, ∞) (required)
•  overflow: RealOverflow (default: null)
•  overlapping_fill: one of {IrregularBinning.unspecified, IrregularBinning.all, IrregularBinning.first, IrregularBinning.last}
(default: IrregularBinning.unspecified)
The intervals, as defined by their low, high, low_inclusive, high_inclusive fields, must be unique.

Details:

This binning is intended for one-dimensional, real-valued data. Unlike EdgesBinning, the any-sized intervals do not need to be abutting, so this binning can describe a distribution with large gaps.

The existence and positions of any underflow, overflow, and nanflow bins, as well as how non-finite values were handled during filling, are contained in the RealOverflow.

In fact, the intervals are not even required to be non-overlapping. A data value may correspond to zero, one, or more than one bin. The latter case raises the question of which bin was filled by a value that corresponds to multiple bins: the overlapping_fill strategy may be unspecified if we don’t know, all if every corresponding bin was filled, first if only the first match was filled, and last if only the last match was filled.

Irregular bins are usually not directly created by histogramming libraries, but they may come about as a result of merging histograms with different binnings.
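For example (a sketch; the overlapping_fill keyword name follows the property name above):

IrregularBinning([RealInterval(0, 5),
                  RealInterval(10, 100),
                  RealInterval(50, 100)],    # overlaps the previous interval
                 overlapping_fill=IrregularBinning.all)

A value of 75 corresponds to both of the last two bins; overlapping_fill=all records that both were filled.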

See also:

  • RegularBinning: for ordered, equal-sized, abutting real intervals.

  • EdgesBinning: for ordered, any-sized, abutting real intervals.

  • IrregularBinning: for unordered, any-sized real intervals (that may even overlap).

  • SparseRegularBinning: for unordered, equal-sized real intervals aligned to a regular grid, which only need to be defined if the bin content is not empty.

CategoryBinning

Associates disjoint categories from a categorical dataset with bins.

•  categories: list of str (required)
•  loc_overflow: one of {BinLocation.below3, BinLocation.below2, BinLocation.below1, BinLocation.nonexistent, BinLocation.above1, BinLocation.above2, BinLocation.above3}
(default: BinLocation.nonexistent)
The categories must be unique.

Details:

This binning is intended for string-valued categorical data (or values that can be converted to strings without losing uniqueness). Each named category in categories corresponds to one bin.

If loc_overflow is nonexistent, unspecified strings were ignored in the filling procedure. Otherwise, the overflow bin corresponds to unspecified strings, and it can be below or above the normal bins. Unlike RealOverflow, which has up to three overflow bins (underflow, overflow, and nanflow), no distinction is made among below3, below2, below1 or above1, above2, above3.
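For example (a sketch; the loc_overflow keyword name follows the property name above):

CategoryBinning(["electron", "muon", "tau"], loc_overflow=BinLocation.above1)

# four slots: the three named categories, then one catch-all bin for any
# other string encountered during filling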

See also:

  • CategoryBinning: for disjoint categories with a possible overflow bin.

  • PredicateBinning: for possibly overlapping regions defined by predicate functions.

  • VariationBinning: for completely overlapping input data, with derived features computed different ways.

SparseRegularBinning

Splits a one-dimensional axis into unordered, equal-sized real intervals aligned to a regular grid, which only need to be defined if the bin content is not empty.

•  bins: list of int (required)
•  bin_width: float in (0, ∞] (required)
•  origin: float in [‒∞, ∞] (default: 0.0)
•  overflow: RealOverflow (default: null)
•  low_inclusive: bool (default: true)
•  high_inclusive: bool (default: false)
•  minbin: int in [‒2⁶³, 2⁶³ ‒ 1] (default: ‒2⁶³)
•  maxbin: int in [‒2⁶³, 2⁶³ ‒ 1] (default: 2⁶³ ‒ 1)

Details:

This binning is intended for one-dimensional, real-valued data. Unlike RegularBinning and EdgesBinning, the intervals do not need to be abutting. Unlike IrregularBinning, they must be equal-sized, non-overlapping, and aligned to a grid.

Integer-valued bin indexes i are mapped to real intervals using bin_width and origin: each interval starts at bin_width*i + origin and stops at bin_width*(i + 1) + origin. The bins property is an unordered list of bin indexes, with the same length and order as the Histogram counts or BinnedEvaluatedFunction values. Unspecified bins are empty: for counts or sums of weights, this means zero; for minima, this means +∞; for maxima, this means ‒∞; for all other values, nan (not a number).

There is a degeneracy between bins and origin: adding an integer multiple of bin_width to origin and subtracting that integer from all bins yields an equivalent binning.

If low_inclusive is true, then all intervals between pairs of edges include the low edge. If high_inclusive is true, then all intervals between pairs of edges include the high edge.

Although this binning can reach a very wide range of values without using much memory, there is a limit. The bins array values are 64-bit signed integers, so they are in principle limited to [‒2⁶³, 2⁶³ ‒ 1]. Changing the origin moves this window, and changing the bin_width widens its coverage of real values at the expense of detail. In some cases, the meaningful range is narrower than this. For instance, if a binning is shifted to a higher origin (e.g. to align two histograms to add them), some values below 2⁶³ ‒ 1 in the shifted histogram were out of range in the unshifted histogram, so we cannot say that they are in range in the new histogram. For this, the maxbin would be less than 2⁶³ ‒ 1. By a similar argument, the minbin can be greater than ‒2⁶³.

Therefore, even though this binning is sparse, it can have underflow and overflow bins for values below minbin or above maxbin. Since nan (not a number) values don’t map to any integer, this binning may also need a nanflow. The existence and positions of any underflow, overflow, and nanflow bins, as well as how non-finite values were handled during filling, are contained in the RealOverflow.
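As a worked example of the index-to-interval mapping (plain Python):

bin_width, origin = 0.1, 0.0
for i in [-5, 0, 9]:                 # an unordered list of occupied bins
    low  = bin_width * i + origin
    high = bin_width * (i + 1) + origin
    print(f"bin {i}: [{low:g}, {high:g})")
# bin -5: [-0.5, -0.4)
# bin 0: [0, 0.1)
# bin 9: [0.9, 1)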

See also:

  • RegularBinning: for ordered, equal-sized, abutting real intervals.

  • EdgesBinning: for ordered, any-sized, abutting real intervals.

  • IrregularBinning: for unordered, any-sized real intervals (that may even overlap).

  • SparseRegularBinning: for unordered, equal-sized real intervals aligned to a regular grid, which only need to be defined if the bin content is not empty.

FractionBinning

Splits a boolean (true/false) axis into two bins.

•  layout: one of {FractionBinning.passall, FractionBinning.failall, FractionBinning.passfail}
(default: FractionBinning.passall)
•  layout_reversed: bool (default: false)
•  error_method: one of {FractionBinning.unspecified, FractionBinning.normal, FractionBinning.clopper_pearson, FractionBinning.wilson, FractionBinning.agresti_coull, FractionBinning.feldman_cousins, FractionBinning.jeffrey, FractionBinning.bayesian_uniform}
(default: FractionBinning.unspecified)

Details:

This binning is intended for predicate data, values that can only be true or false. It can be combined with other axis types to compute fractions as a function of some other binned variable, such as efficiency (probability of some condition) versus a real value or categories. For example,

Histogram([Axis(FractionBinning(), "pass cuts"),
           Axis(RegularBinning(10, RealInterval(-5, 5)), "x")],
          UnweightedCounts(InterpretedInlineInt64Buffer(
              [[  9,  25,  29,  35,  54,  67,  60,  84,  80,  94],
               [ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122]])))

could represent a rising probability of passing cuts versus "x". The first axis has two bins, number passing and total, and the second axis has 10 bins, values of x. Fraction binnings are also a good choice for a Collection axis, because only one set of histograms needs to be defined to construct all numerators and denominators.

The layout and layout_reversed specify what the two bins mean. With a false layout_reversed, if layout is passall, the first bin is the number of inputs that pass a condition (the predicate evaluates to true) and the second is the total number of inputs. If layout is failall, the first bin is the number of inputs that fail the condition (the predicate evaluates to false). If layout is passfail, the first bin is the number that pass and the second bin is the number that fail. These three types of layout can easily be converted to one another, but doing so requires a change to the Histogram bins or BinnedEvaluatedFunction values. If layout_reversed is true, the order of the two bins is reversed. (Thus, six layouts are possible.)

The error_method does not specify how the histograms or functions were filled, but how the fraction should be interpreted statistically. It may be unspecified, leaving that interpretation unspecified. The normal method (sometimes called “Wald”) is a naive binomial interpretation, in which zero passing or zero failing values are taken to have zero uncertainty. The clopper_pearson method (sometimes called “exact”) is a common choice, though it fails some statistical criteria. The computation and meaning of these methods are described in the standard references on binomial proportion confidence intervals.


PredicateBinning

Associates predicates (derived boolean features), which may represent different data “regions,” with bins.

•  predicates: list of str with length in [1, ∞) (required)
•  overlapping_fill: one of {IrregularBinning.unspecified, IrregularBinning.all, IrregularBinning.first, IrregularBinning.last}
(default: IrregularBinning.unspecified)

Details:

This binning is intended to represent data “regions,” such as signal and control regions, defined by boolean functions of some input variables. The details of the predicate function are not captured by this class; they are expressed as strings in the predicates property. It is up to the user or application to associate string-valued predicates with data regions or predicate functions, as executable code, as keys in a lookup function, or as human-readable titles.

Unlike CategoryBinning, this binning has no possibility of an overflow bin and a single input datum could pass multiple predicates. As with IrregularBinning, there is an overlapping_fill property to specify whether such a value is in all matching predicates, the first, the last, or if this is unknown (unspecified).

Use a CategoryBinning if the data regions are strictly disjoint, have string-valued labels computed in the filling procedure, or could produce strings that are not known before filling. Use a PredicateBinning if the data regions overlap or are identified by a fixed set of predicate functions. There are some cases in which a CategoryBinning and a PredicateBinning are both appropriate.

See also:

  • CategoryBinning: for disjoint categories with a possible overflow bin.

  • PredicateBinning: for possibly overlapping regions defined by predicate functions.

  • VariationBinning: for completely overlapping input data, with derived features computed different ways.

VariationBinning

Associates alternative derived features of the same input data, which may represent systematic variations of the data, with bins.

•  variations: list of Variation with length in [1, ∞) (required)
•  systematic_units: one of {VariationBinning.unspecified, VariationBinning.confidence, VariationBinning.sigmas}
(default: VariationBinning.unspecified)
•  systematic_names: list of str (default: null/empty)
•  category_systematic_names: list of str (default: null/empty)
All variations must define the same set of identifiers in their assignments.
All variations must have a systematic vector of the same length as this binning's systematic_names and a category_systematic vector of the same length as this binning's category_systematic_names.

Details:

This binning is intended to represent systematic variations of the same data. A filling procedure fills every bin with the same input data, but with derived features computed in different ways. In this way, the sensitivity to a systematic error can be estimated.

Each of the variations are Variation objects, which are defined below.

Variations may be labeled as representing systematic errors. For instance, one bin may be “one sigma high” and another “one sigma low.” In general, several types of systematic error may be varied at once, and they may be varied by any amount in any direction. Each Variation therefore describes a point in a vector space: the number of dimensions in this space is the number of types of systematic errors and the basis vectors are variations of each type of systematic error separately.

Some systematic errors are quantitative (e.g. misalignment) and others are categorical (e.g. choice of simulation algorithm). There are therefore two vectors in each Variation, one real-valued, the other string-valued. The systematic_units defines the units of the real-valued systematics vector.

The systematic_names labels the dimensions of the Variation systematic vectors; they must all have the same number of dimensions. The category_systematic_names labels the dimensions of the Variation category_systematic vectors; they, too, must all have the same number of dimensions.
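For example, a one-dimensional jet-energy-scale variation might look like this (a sketch; the keyword names follow the property names, and the expression strings are purely illustrative):

VariationBinning([
    Variation([Assignment("jet_energy", "nominal")]),
    Variation([Assignment("jet_energy", "nominal * 1.02")], systematic=[1.0]),
    Variation([Assignment("jet_energy", "nominal * 0.98")], systematic=[-1.0]),
    ],
    systematic_units=VariationBinning.sigmas,
    systematic_names=["jet energy scale"])

All three variations assign the same identifier, and each systematic vector has length one, matching the single entry in systematic_names.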

See also:

  • CategoryBinning: for disjoint categories with a possible overflow bin.

  • PredicateBinning: for possibly overlapping regions defined by predicate functions.

  • VariationBinning: for completely overlapping input data, with derived features computed different ways.

Variation

Represents one systematic variation, which is one bin of a VariationBinning.

•  assignments: list of Assignment (required)
•  systematic: list of float (default: null/empty)
•  category_systematic: list of str (default: null/empty)
The identifier in each of the assignments must be unique.

Details:

The assignments specify how the derived features were computed when filling this bin. The Assignment class is defined below.

Variations may be labeled as representing systematic errors. For instance, one bin may be “one sigma high” and another “one sigma low.” In general, several types of systematic error may be varied at once, and they may be varied by any amount in any direction. Therefore, this object describes a point in a vector space: the number of dimensions in this space is the number of types of systematic errors and the basis vectors are variations of each type of systematic error separately.

Some systematic errors are quantitative (e.g. misalignment) and others are categorical (e.g. choice of simulation algorithm). There are therefore two vectors: systematic is real-valued and category_systematic is string-valued.

Assignment

Represents one derived feature in a Variation.

•  identifier: unique str (required)
•  expression: str (required)

Details:

The identifier is the name of the derived feature that gets recomputed in this Variation, and expression is the value assigned to it. No constraints are placed on the expression syntax; it may even be a key to a lookup function or a human-readable description.

UnweightedCounts

Represents counts in a Histogram that were filled without weighting. (All inputs increase bin values by one unit.)

•  counts: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)

Details:

The counts buffer contains the actual values. Since these counts are unweighted, they could have unsigned integer type, but no such constraint is applied.

A Histogram bin count is typically interpreted as an estimate of the probability of a data value falling into that bin times the total number of input values. It is therefore estimating a probability distribution, and that estimate has uncertainty. The uncertainty for unweighted counts follows a Poisson distribution. In the limit of large counts, the uncertainty approaches the square root of the number of counts, with deviations from this for small counts. A separate statistic to quantify this uncertainty is unnecessary because it can be fully determined from the number of counts.

To be valid, the length of the counts buffer (in number of items, not number of bytes) must be equal to the number of bins in this Histogram, including any axes inherited by nesting the Histogram in a Collection. The number of bins in the Histogram is the product of the number of bins in each Axis, including any underflow, overflow, or nanflow bins. That is, it must be possible to reshape the buffer into a multidimensional array, in which every dimension corresponds to one Axis.
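For example (plain Python): a histogram with a 10-bin regular axis that has underflow and overflow bins, crossed with a 5-bin regular axis that has none, requires

assert (10 + 2) * 5 == 60   # the counts buffer must have exactly 60 items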

WeightedCounts

Represents counts in a Histogram that were filled with weights. (Some inputs may increase bin values more than others, or even by a negative amount.)

•  sumw: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  sumw2: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(default: null)
•  unweighted: UnweightedCounts (default: null)

Details:

The sumw (sum of weights) buffer contains the actual values. Since these values are weighted, they might need a floating point or even signed type.

A Histogram bin count is typically interpreted as an estimate of the probability of a data value falling into that bin times the total number of input values. It is therefore estimating a probability distribution, and that estimate has uncertainty. The uncertainty for weighted counts is approximately the square root of the sum of squared weights, so this object can optionally store sumw2, the sum of squared weights, to compute this uncertainty.

It may also be necessary to know the unweighted counts, as well as the weighted counts, so there is an unweighted property for that.

To be valid, the length of all of these buffers (in number of items, not number of bytes) must be equal to the number of bins in this Histogram, including any axes inherited by nesting the Histogram in a Collection. The number of bins in the Histogram is the product of the number of bins in each Axis, including any underflow, overflow, or nanflow bins. That is, it must be possible to reshape these buffers into multidimensional arrays of the same shape, in which every dimension corresponds to one Axis.
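For example (a sketch; the keyword names follow the property names above, and the values are illustrative):

WeightedCounts(
    sumw=InterpretedInlineFloat64Buffer([1.1, 2.5, 0.0, 3.2]),
    sumw2=InterpretedInlineFloat64Buffer([1.3, 3.1, 0.0, 4.0]),
    unweighted=UnweightedCounts(InterpretedInlineInt64Buffer([1, 2, 0, 3])))

The per-bin uncertainty would then be estimated as the square root of each sumw2 value.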

InterpretedInlineBuffer

A generic array in the Flatbuffers hierarchy; used for any quantity that can have different values in different Histogram or BinnedEvaluatedFunction bins.

•  buffer: buffer (required)
•  filters: list of {Buffer.none, Buffer.gzip, Buffer.lzma, Buffer.lz4}
(default: null/empty)
•  postfilter_slice: slice (start:stop:step) (default: null)
•  dtype: one of {Interpretation.none, Interpretation.bool, Interpretation.int8, Interpretation.uint8, Interpretation.int16, Interpretation.uint16, Interpretation.int32, Interpretation.uint32, Interpretation.int64, Interpretation.uint64, Interpretation.float32, Interpretation.float64}
(default: Interpretation.none)
•  endianness: one of {Interpretation.little_endian, Interpretation.big_endian}
(default: Interpretation.little_endian)
•  dimension_order: one of {InterpretedBuffer.c_order, InterpretedBuffer.fortran}
(default: InterpretedBuffer.c_order)
The postfilter_slice's step cannot be zero.
The number of items in the buffer must be equal to the number of bins at this level of the hierarchy.

Details:

This array class provides its own interpretation in terms of data type and dimension order. It does not specify its own shape, the number of bins in each dimension, because that is given by its position in the hierarchy. If it is the UnweightedCounts of a Histogram, for instance, it must be reshapable to fit the number of bins implied by the Histogram axis.

The buffer is the actual data, encoded in Flatbuffers as an array of bytes with known length.

The list of filters are applied to convert bytes in the buffer into an array. Typically, filters are compression algorithms such as gzip, lzma, and lz4, but they may be any predefined transformation (e.g. zigzag decoding of integers or affine mappings from integers to floating point numbers may be added in the future). If there is more than one filter, the output of each step is provided as input to the next.

The postfilter_slice, if provided, selects a subset of the bytes returned by the last filter (or directly in the buffer if there are no filters). A slice has the following structure:

struct Slice {
  start: long;      // first index, used only if has_start
  stop: long;       // one past the last index, used only if has_stop
  step: int;        // stride, used only if has_step; must not be zero
  has_start: bool;
  has_stop: bool;
  has_step: bool;
}

though in Python, a builtin slice object should be provided to this class’s constructor. The postfilter_slice is interpreted according to Python’s rules (negative indexes, start-inclusive and stop-exclusive, clipping-not-errors if beyond the range, etc.).

The dtype is the numeric type of the array, which includes bool, all signed and unsigned integers from 8 bits to 64 bits, and IEEE 754 floating point types with 32 or 64 bits. The none interpretation is presumed, if necessary, to be unsigned, 8 bit integers.

The endianness may be little_endian or big_endian; the former is used by most recent architectures.

The dimension_order may be c_order to follow the C programming language’s convention or fortran to follow the FORTRAN programming language’s convention. The dimension_order only has an effect when shaping an array with more than one dimension.
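For example (a sketch; the constructor keywords follow the property names above): ten float64 values for a 10-bin histogram, stored as raw little-endian bytes.

import numpy as np

raw = np.arange(10, dtype="<f8").tobytes()   # 80 bytes: ten little-endian float64 values
InterpretedInlineBuffer(raw,
                        dtype=Interpretation.float64,
                        endianness=Interpretation.little_endian)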

InterpretedInlineInt64Buffer

An integer array in the Flatbuffers hierarchy; used for integer-valued quantities that can have different values in different Histogram or BinnedEvaluatedFunction bins.

•  buffer: buffer (required)
The number of items in the buffer must be equal to the number of bins at this level of the hierarchy.

Details:

This class is equivalent to an InterpretedInlineBuffer with no filters, no postfilter_slice, a dtype of int64, an endianness of little_endian, and a dimension_order of c_order. It is provided as an optimization, so that the many small arrays typical in practice avoid unnecessary Flatbuffers lookup overhead.

InterpretedInlineFloat64Buffer

A floating point array in the Flatbuffers hierarchy; used for real-valued quantities that can have different values in different Histogram or BinnedEvaluatedFunction bins.

•  buffer: buffer (required)
The number of items in the buffer must be equal to the number of bins at this level of the hierarchy.

Details:

This class is equivalent to an InterpretedInlineBuffer with no filters, no postfilter_slice, a dtype of float64, an endianness of little_endian, and a dimension_order of c_order. It is provided as an optimization, so that the many small arrays typical in practice avoid unnecessary Flatbuffers lookup overhead.

InterpretedExternalBuffer

A generic array stored outside the Flatbuffers hierarchy; used for any quantity that can have different values in different Histogram or BinnedEvaluatedFunction bins.

•  pointer: int in [0, ∞) (required)
•  numbytes: int in [0, ∞) (required)
•  external_source: one of {ExternalBuffer.memory, ExternalBuffer.samefile, ExternalBuffer.file, ExternalBuffer.url}
(default: ExternalBuffer.memory)
•  filters: list of {Buffer.none, Buffer.gzip, Buffer.lzma, Buffer.lz4}
(default: null/empty)
•  postfilter_slice: slice (start:stop:step) (default: null)
•  dtype: one of {Interpretation.none, Interpretation.bool, Interpretation.int8, Interpretation.uint8, Interpretation.int16, Interpretation.uint16, Interpretation.int32, Interpretation.uint32, Interpretation.int64, Interpretation.uint64, Interpretation.float32, Interpretation.float64}
(default: Interpretation.none)
•  endianness: one of {Interpretation.little_endian, Interpretation.big_endian}
(default: Interpretation.little_endian)
•  dimension_order: one of {InterpretedBuffer.c_order, InterpretedBuffer.fortran}
(default: InterpretedBuffer.c_order)
•  location: str (default: null)
The postfilter_slice's step cannot be zero.
The number of items in the buffer must be equal to the number of bins at this level of the hierarchy.

Details:

This array class is like InterpretedInlineBuffer, but its contents are outside of the Flatbuffers hierarchy. Instead of a buffer property, it has a pointer and a numbytes to specify the source of bytes.

If the external_source is memory, then the pointer and numbytes are interpreted as a raw array in memory. If the external_source is samefile, then the pointer is taken to be a seek position in the same file that stores the Flatbuffer (assuming the Flatbuffer resides in a file). If external_source is file, then the location property is taken to be a file path, and the pointer is taken to be a seek position in that file. If external_source is url, then the location property is taken to be a URL and the bytes are requested by HTTP.
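For example (a sketch; the keyword names follow the property list, and the file name is hypothetical):

InterpretedExternalBuffer(pointer=0, numbytes=80,
                          external_source=ExternalBuffer.file,
                          location="counts.raw",
                          dtype=Interpretation.float64)

# ten float64 values (80 bytes) starting at seek position 0 of "counts.raw"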

Like InterpretedInlineBuffer, this array class provides its own interpretation in terms of data type and dimension order. It does not specify its own shape, the number of bins in each dimension, because that is given by its position in the hierarchy. If it is the UnweightedCounts of a Histogram, for instance, it must be reshapable to fit the number of bins implied by the Histogram axis.

The list of filters are applied to convert bytes in the buffer into an array. Typically, filters are compression algorithms such as gzip, lzma, and lz4, but they may be any predefined transformation (e.g. zigzag decoding of integers or affine mappings from integers to floating point numbers may be added in the future). If there is more than one filter, the output of each step is provided as input to the next.

The postfilter_slice, if provided, selects a subset of the bytes returned by the last filter (or directly in the buffer if there are no filters). A slice has the following structure:

struct Slice {
  start: long;      // first index, used only if has_start
  stop: long;       // one past the last index, used only if has_stop
  step: int;        // stride, used only if has_step; must not be zero
  has_start: bool;
  has_stop: bool;
  has_step: bool;
}

though in Python, a builtin slice object should be provided to this class’s constructor. The postfilter_slice is interpreted according to Python’s rules (negative indexes, start-inclusive and stop-exclusive, clipping-not-errors if beyond the range, etc.).

The dtype is the numeric type of the array, which includes bool, all signed and unsigned integers from 8 bits to 64 bits, and IEEE 754 floating point types with 32 or 64 bits. The none interpretation is presumed, if necessary, to be unsigned, 8 bit integers.

The endianness may be little_endian or big_endian; the former is used by most recent architectures.

The dimension_order may be c_order to follow the C programming language’s convention or fortran to follow the FORTRAN programming language’s convention. The dimension_order only has an effect when shaping an array with more than one dimension.

Profile

Summarizes a dependent variable in a Histogram, binned by the Histogram axis (independent variables).

•  expression: str (required)
•  statistics: Statistics (required)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)

Details:

Although a statistician’s histogram strictly represents a distribution, it is often useful to store a few more values per bin to estimate average values for an empirical function from a dataset. This practice is common in particle physics, from HPROF in CERNLIB to TProfile in ROOT.

To estimate an unweighted mean and standard deviation of x, one needs the counts from UnweightedCounts as well as a sum of x and a sum of squares of x. For a weighted mean and standard deviation of x, one needs the sumw (sum of weights) and sumw2 (sum of squared weights) from WeightedCounts as well as a sum of weights times x and a sum of weights times squares of x.
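As a sketch of this arithmetic (plain Python, not the aghast API), an unweighted mean and standard deviation for one profile bin would be reconstructed from the stored sums like this:

import math

n, sumy, sumy2 = 100, 250.0, 700.0     # count, sum of y, sum of y**2 in one bin
mean = sumy / n                        # 2.5
std  = math.sqrt(sumy2 / n - mean**2)  # sqrt(7.0 - 6.25) ~= 0.866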

Rather than making a profile a separate class from a histogram, as is commonly done in particle physics, we add profiled quantities to a Histogram object. If we have many profiles with the same binning, this avoids duplication of the counts or sumw and sumw2. We can also generalize from storing only moments (to compute mean and standard deviation) to also storing quantiles (to compute a box-and-whiskers plot, for instance).

If the profile represents a computed expression (derived feature), it may be encoded here as a string. The title is a human-readable description.

All of the moments, quantiles, and any mode, min, or max are in the required statistics object. See below for a definition of the Statistics class.

The title, metadata, and decoration properties have no semantic constraints.

Statistics

Represents summary statistics for a Histogram axis or for each bin in a Profile or for an NtupleInstance.

•  moments: list of Moments (default: null/empty)
•  quantiles: list of Quantiles (default: null/empty)
•  mode: Modes (default: null)
•  min: Extremes (default: null)
•  max: Extremes (default: null)
All moments must have unique n and weightpower properties.
All quantiles must have unique n and weightpower properties.

Details:

This object provides a statistical summary of a distribution without binning it as a histogram does. Examples include mean, standard deviation, median, and mode.

Anything that can be computed from moments, such as the mean and standard deviation, is stored as raw moments in the moments property. Concepts like “mean” and “standard deviation” are not explicitly called out by the structure; they must be constructed.

Medians, quartiles, and quintiles are all stored in the quantiles property.

If the mode of the distribution was computed, it is stored in the mode property.

The minimum and maximum of a distribution are special cases of quantiles, but quantiles can’t in general be combined from preaggregated subsets of the data. The min and max can be combined (they are monoidal calculations, like the sums that are moments), so they are stored separately as Extremes.

Moments

Represents one type of moment; a single value for an Axis or one per bin for a Profile or a single value for an NtupleInstance.

•  sumwxn: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  n: int in [‒128, 127] (required)
•  weightpower: int in [‒128, 127] (default: 0)
•  filter: StatisticFilter (default: null)

Details:

Moments are primarily used for mean and standard deviation, but they can also be used to compute skew, kurtosis, etc. In general, a moment is a sum of weights (to some power) times the quantity of interest (to some power). Moments from preaggregated subsets of the data can simply be added, whereas precomputed means cannot.

The sumwxn is a buffer containing a single value if this Moments is attached under an Axis (summarizing the quantity that axis represents for all input data) or a buffer containing as many values as there are bins in a Histogram if this Moments is attached under a Profile. Thus, it serves two purposes: auxiliary data about an Axis and the bin-by-bin data that make up a profile plot.

The quantity of interest is raised to the power n. Thus, the total number of entries would be computed from n = 0, the mean from n = 1, and the standard deviation from the n = 2 and n = 1 moments.

The weights are raised to the power weightpower. Typically, the weightpower would be zero in a Histogram with UnweightedCounts and one in a Histogram with WeightedCounts, but weightpower = 2 is necessary for some calculations.
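For example, the three moments needed for the unweighted per-bin mean and standard deviation might be stored like this (a sketch; the keyword name follows the property name, and the single-bin values match the arithmetic sketch under Profile):

Moments(InterpretedInlineFloat64Buffer([100.0]), n=0)   # sum of x**0: the count
Moments(InterpretedInlineFloat64Buffer([250.0]), n=1)   # sum of x
Moments(InterpretedInlineFloat64Buffer([700.0]), n=2)   # sum of x**2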

If not all of the data were included in the sum, a filter describes which values were excluded. This StatisticFilter is described below.

Quantiles

Represents one type of quantile; a single value for an Axis or one per bin for a Profile or a single value for an NtupleInstance.

•  values: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  p: float in [0.0, 1.0] (required)
•  weightpower: int in [‒128, 127] (default: 0)
•  filter: StatisticFilter (default: null)

Details:

Quantiles are a generalization of median, quartiles, and quintiles. A median is the point in a distribution where 50% of the probability is below that value, quartiles are 25%, 50%, 75%, and quintiles are 20%, 40%, 60%, 80%.

The values is a buffer containing a single value if this Quantiles is attached under an Axis (summarizing the quantity that axis represents for all input data) or a buffer containing as many values as there are bins in a Histogram if this Quantiles is attached under a Profile. Thus, it serves two purposes: auxiliary data about an Axis and the bin-by-bin data that make up a box-and-whiskers plot.

The dividing point is p, a value between 0 and 1 (inclusive on both endpoints). For a median, p = 0.5, etc.

If weightpower is not zero, the contribution of input values to p were weighted. weightpower = 1 would be typical of a Histogram with WeightedCounts, so that the weighted quantile agrees with an approximate calculation performed on the histogram’s distribution.

If not all of the data were included in the quantile calculation, a filter describes which values were excluded. This StatisticFilter is described below.

Modes

Represents the mode of a distribution; a single value for an Axis or one per bin for a Profile or a single value for an NtupleInstance.

•  values: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  filter: StatisticFilter (default: null)

Details:

The values is a buffer containing a single value if this Modes is attached under an Axis (summarizing the quantity that axis represents for all input data) or a buffer containing as many values as there are bins in a Histogram if this Modes is attached under a Profile.

If not all of the data were included in the mode calculation, a filter describes which values were excluded. This StatisticFilter is described below.

Extremes

Represents the minimum or maximum of a distribution; a single value for an Axis or one per bin for a Profile or a single value for an NtupleInstance; also used in a ColumnChunk to summarize the data in each Page of an Ntuple.

•  values: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  filter: StatisticFilter (default: null)

Details:

The values is a buffer containing a single value if this Extremes is attached under an Axis (summarizing the quantity that axis represents for all input data) or a buffer containing as many values as there are bins in a Histogram if this Extremes is attached under a Profile. If attached under a ColumnChunk in an Ntuple, it represents the minimum or maximum values in each Page of the ColumnChunk, to quickly determine if the Page needs to be read/decompressed, for instance.

If not all of the data were included in the min/max calculation, a filter describes which values were excluded. This StatisticFilter is described below.

StatisticFilter

Specifies which values were excluded from a statistic, such as Moments, Quantiles, Modes, or Extremes.

•  min: float in [‒∞, ∞] (default: ‒∞)
•  max: float in [‒∞, ∞] (default: ∞)
•  excludes_minf: bool (default: false)
•  excludes_pinf: bool (default: false)
•  excludes_nan: bool (default: false)
The min must be less than or equal to the max.

Details:

The statistic to which this filter belongs was calculated from finite values between min and max (inclusive on both endpoints), as well as ‒∞ if excludes_minf is false, +∞ if excludes_pinf is false, and nan (not a number) if excludes_nan is false.
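
For example, a filter with min = 0, max = 10, and excludes_nan = true corresponds to the following NumPy mask (a sketch; the variable names mirror the properties above):

import numpy as np

x = np.array([-1.0, 0.0, 5.0, 10.0, 12.0, np.inf, -np.inf, np.nan])

min_, max_ = 0.0, 10.0                  # the filter's min and max
excludes_minf, excludes_pinf, excludes_nan = False, False, True

included = (x >= min_) & (x <= max_)    # finite values in [min, max]
if not excludes_minf:
    included |= np.isneginf(x)          # -inf is kept unless excluded
if not excludes_pinf:
    included |= np.isposinf(x)          # +inf is kept unless excluded
if not excludes_nan:
    included |= np.isnan(x)             # nan is kept unless excluded
# x[included] holds exactly the values that entered the statistic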

Covariance

Represents one element of a covariance matrix: for a pair of Axis objects, for all bins of a pair of Profile objects in a Histogram, or for a pair of columns in an NtupleInstance.

•  xindex: int in [0, ∞) (required)
•  yindex: int in [0, ∞) (required)
•  sumwxy: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  weightpower: int in [‒128, 127] (default: 0)
•  filter: StatisticFilter (default: null)
The xindex must not be equal to the yindex (see Moments for variances).

Details:

N axes in a Histogram potentially have N*(N - 1)/2 covariance matrix elements; an object of this class represents one of them. However, if it is one of the profile_covariances in a Histogram, it represents that element of the covariance matrix for all bins in the Histogram.

The sumwxy buffer holds the raw covariance, the sum of x times y from the input data. This may be a single sum or an array for all bins in a profile covariance matrix element.

If weightpower is not zero, the sum of x times y was weighted. weightpower = 1 would be typical of a Histogram with WeightedCounts, so that the weighted covariance agrees with an approximate calculation performed on the histogram’s distribution.

If not all of the data were included in the covariance calculation, a filter describes which values were excluded. This StatisticFilter is described below.
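
As a non-normative sketch, a covariance matrix element can be turned into an actual covariance using the corresponding n = 0 and n = 1 Moments (all array names are illustrative):

import numpy as np

# Per-bin sums for one profile-covariance element (illustrative values).
sumw   = np.array([10.0, 20.0])   # sum of weights
sumwx  = np.array([ 5.0, 30.0])   # sum of w * x
sumwy  = np.array([ 7.0, 10.0])   # sum of w * y
sumwxy = np.array([ 4.0, 25.0])   # sum of w * x * y (this class's buffer)

# cov(x, y) = <x*y> - <x>*<y> with weighted averages
cov = sumwxy / sumw - (sumwx / sumw) * (sumwy / sumw)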

ParameterizedFunction

A function defined by a mathematical expression and a set of parameters, to attach to a Histogram or Ntuple or to include in a Collection.

•  expression: str (required)
•  parameters: list of Parameter (default: null/empty)
•  paramaxis: list of int (default: null/empty)
•  parameter_covariances: list of Covariance (default: null/empty)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)
•  script: str (default: null)
The identifiers of all parameters must be unique.
After converting from negative indexes, paramaxis values must be unique.
All paramaxis values must be in [0, number of axes), where the number of axes includes any inherited from a Collection.
The xindex and yindex of each Covariance in parameter_covariances must be in [0, number of parameters) and be unique pairs (unordered).

Details:

A common application for functions is to attach a fit result to a Histogram. This class defines a function as a mathematical expression with parameters. No particular syntax is specified for the expression.

The parameters may all be fixed for some Histogram axes and all be variable for some other Histogram axes. The paramaxis set specifies the indexes of axes that are variable in the parameters. If paramaxis is an empty set, each Parameter has a buffer of only one value; otherwise, each Parameter has a buffer of as many values as the product of the number of bins in the selected axes (including overflow bins). Negative indexes are interpreted as in Python: -1 is the last axis, -2 for the next-to-last, etc.

Even if the parameterized function is not attached to a Histogram but is standalone in a Collection, the paramaxis is still relevant because a Collection has an axis, too.

The Parameter class, described below, can internally describe errors on each parameter. Covariances between parameters are described by parameter_covariances. The size of each Covariance buffer is equal to the size of each Parameter buffer, controlled by paramaxis and the number of axes.
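
The expected buffer length follows from paramaxis and the bin counts of the selected axes; a minimal sketch, assuming the bin counts (including overflow bins) are already known and using illustrative names:

# Illustrative: bin counts, including overflow, of the axes at this level.
num_bins = [12, 7, 102]
paramaxis = [0, -1]                  # first and last axes are variable

resolved = [i % len(num_bins) for i in paramaxis]   # negative indexes as in Python
length = 1
for i in resolved:
    length *= num_bins[i]            # 12 * 102 = 1224 values per Parameter buffer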

The title, metadata, decoration, and script properties have no semantic constraints.

Parameter

Sets values in a ParameterizedFunction.

•  identifier: unique str (required)
•  values: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  errors: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(default: null)

Details:

A parameter is named by an identifier and stores one or two buffers for values and errors. The number of values in each buffer is controlled by the ParameterizedFunction paramaxis and the number of axes at this level of hierarchy.

EvaluatedFunction

A function defined by explicit values in each bin of the Histogram to which it is attached.

•  values: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  derivatives: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(default: null)
•  errors: list of Quantiles (default: null/empty)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)
•  script: str (default: null)

Details:

Some functions are difficult, impossible, or undesirable to express in terms of a mathematical expression and parameters, but they can be expressed in terms of their values at a set of points. An EvaluatedFunction can only be attached to a Histogram and each item in its values buffer corresponds to one item in a Histogram's counts. (For a standalone function, see BinnedEvaluatedFunction below.)

If the derivatives or the errors of the function at each bin are also known, they can be stored as well.

BinnedEvaluatedFunction

A standalone function defined by explicit values in each bin of its axis.

•  axis: list of Axis with length in [1, ∞) (required)
•  values: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(required)
•  derivatives: InterpretedInlineBuffer or InterpretedInlineInt64Buffer or InterpretedInlineFloat64Buffer or InterpretedExternalBuffer
(default: null)
•  errors: list of Quantiles (default: null/empty)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)
•  script: str (default: null)

Details:

Some functions are difficult, impossible, or undesirable to express in terms of a mathematical expression and parameters, but they can be expressed in terms of their values at a set of points. A BinnedEvaluatedFunction defines an axis and a values buffer for each bin described by the axis. A BinnedEvaluatedFunction can only be standalone in a Collection or attached to an Ntuple.

If the derivatives or the errors of the function at each bin are also known, they can be stored as well.

The title, metadata, decoration, and script properties have no semantic constraints.

Ntuple

A non-aggregated collection of data; points in an n-dimensional vector space.

•  columns: list of Column with length in [1, ∞) (required)
•  instances: list of NtupleInstance with length in [1, ∞) (required)
•  column_statistics: list of Statistics (default: null/empty)
•  column_covariances: list of Covariance (default: null/empty)
•  functions: str → ParameterizedFunction or BinnedEvaluatedFunction (default: null/empty)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)
•  script: str (default: null)
The identifier of each of the columns must be unique.
The number of instances must equal the number of bins implied by the Collection axes at this level of hierarchy (one if there are no axes).
The xindex and yindex of each Covariance in column_covariances must be in [0, number of Ntuple columns) and be unique pairs (unordered).

Details:

Unlike Histogram, which represents aggregated data, an Ntuple represents points in an n-dimensional vector space. It may be the result of some filtering or it may be a table returned by a group-by operation, and it could be useful for generating scatter plots, for unbinned fits, or for machine learning.

Ntuples are standalone objects in a Collection, like histograms, and as such, they are subject to a Collection's axis. If the Collection has an axis with N bins (representing, for example, different data regions or systematic variations), the Ntuple object represents N different ntuples (one for each of the regions or variations). Thus, it must have N objects of type NtupleInstance in its instances parameter.

All of these instances share columns, which define the name, meaning, and data type of elements in each tuple.

The column_statistics and column_covariances provide additional information about the data in the columns: moments, quantiles, modes, and correlations. Their buffers have one item for each of the instances (e.g. a column mean can be recorded for each NtupleInstance separately).

Like a Histogram, an Ntuple can have attached functions, but since the Ntuple doesn’t define a binning, these functions can only be ParameterizedFunction or BinnedEvaluatedFunction.

The title, metadata, decoration, and script properties have no semantic constraints.

Column

Provides a name, meaning, and a data type for one column of Ntuple data.

•  identifier: unique str (required)
•  dtype: one of {Interpretation.none, Interpretation.bool, Interpretation.int8, Interpretation.uint8, Interpretation.int16, Interpretation.uint16, Interpretation.int32, Interpretation.uint32, Interpretation.int64, Interpretation.uint64, Interpretation.float32, Interpretation.float64}
(required)
•  endianness: one of {Interpretation.little_endian, Interpretation.big_endian}
(default: Interpretation.little_endian)
•  filters: list of {Buffer.none, Buffer.gzip, Buffer.lzma, Buffer.lz4}
(default: null/empty)
•  postfilter_slice: slice (start:stop:step) (default: null)
•  title: str (default: null)
•  metadata: Metadata (default: null)
•  decoration: Decoration (default: null)
The postfilter_slice's step cannot be zero.

Details:

Whereas the bin contents for instances of a Histogram (i.e. a Histogram within a Collection with an Axis) are expressed in a single buffer, instances of an Ntuple have separate buffers, as they may need to grow. Also, even a single NtupleInstance may have more than one Chunk or Page, which means separate buffers. Rather than duplicating the column names and data types (possibly allowing those duplicates to disagree with each other), we define the column type once with a Column object. Rather than containing interpreted buffers, ntuples are filled with uninterpreted RawInlineBuffer and RawExternalBuffer instances.

Column properties are similar to interpreted buffer properties (see InterpretedInlineBuffer), except that a Column carries no buffer of its own.

The list of filters are applied to convert the bytes in each raw buffer into an array. Typically, filters are compression algorithms such as gzip, lzma, and lz4, but they may be any predefined transformation (e.g. zigzag decoding of integers or affine mappings from integers to floating point numbers may be added in the future). If there is more than one filter, the output of each step is provided as input to the next.

The postfilter_slice, if provided, selects a subset of the bytes returned by the last filter (or directly in each raw buffer if there are no filters). A slice has the following structure:

struct Slice {
  start: long;
  stop: long;
  step: int;
  has_start: bool;
  has_stop: bool;
  has_step: bool;
}

though in Python, a builtin slice object should be provided to this class’s constructor. The postfilter_slice is interpreted according to Python’s rules (negative indexes, start-inclusive and stop-exclusive, clipping-not-errors if beyond the range, etc.).
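
Putting the pieces together, decoding one raw page according to its Column might look like the following sketch (assuming a single gzip filter; the function is illustrative, not the aghast API):

import gzip
import numpy as np

def decode_page(raw_bytes, dtype, postfilter_slice=slice(None)):
    decompressed = gzip.decompress(raw_bytes)   # apply each filter in order
    selected = decompressed[postfilter_slice]   # postfilter_slice acts on bytes
    return np.frombuffer(selected, dtype=dtype) # interpret with dtype/endianness

page = gzip.compress(np.arange(10, dtype="<i4").tobytes())
decode_page(page, "<i4", slice(0, 20))          # first five little-endian int32s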

The dtype is the numeric type of the array, which includes bool, all signed and unsigned integers from 8 bits to 64 bits, and IEEE 754 floating point types with 32 or 64 bits. The none interpretation is presumed, if necessary, to be unsigned 8-bit integers.

The endianness may be little_endian or big_endian; the former is used by most recent architectures.

Unlike interpreted buffers, a Column has no dimension_order: each column is a one-dimensional sequence of values, so the distinction between C and FORTRAN dimension ordering does not arise.

The title, metadata, and decoration properties have no semantic constraints.

NtupleInstance

A single instance of an Ntuple; allows for an Ntuple to be instantiated in a Collection with Axis.

•  chunks: list of Chunk (required)
•  chunk_offsets: list of int (default: null/empty)
The chunk_offsets, if present, must start with 0, be monotonically increasing, and its length must be one more than the length of chunks.

Details:

Whereas the Ntuple might be thought of as a collection of ntuples with the same type (split by a Collection's Axis), an NtupleInstance would appear to a data analyst as a single ntuple table of data. For scalability, however, it is internally divided into chunks. A Chunk contains a whole number of ntuple entries (table rows) across all columns. A parallel processing system could divide work such that each processor operates on one Chunk.

Optionally, the entry ranges for each chunk can be expressed in a chunk_offsets list. The starting entry (inclusive) for chunk i is chunk_offsets[i] and the stopping entry (exclusive) for chunk i is chunk_offsets[i + 1].
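
For example, the chunk containing a given entry can be found with a binary search (a sketch using Python's bisect module):

import bisect

chunk_offsets = [0, 1000, 2500, 4000]   # three chunks with 1000, 1500, 1500 entries

def find_chunk(entry):
    # i such that chunk_offsets[i] <= entry < chunk_offsets[i + 1]
    return bisect.bisect_right(chunk_offsets, entry) - 1

find_chunk(1700)   # -> 1: the second chunk holds entries 1000 through 2499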

Chunk

An internal division of an NtupleInstance containing a whole number of entries.

•  column_chunks: list of ColumnChunk (required)
•  metadata: Metadata (default: null)
The number of column_chunks must be equal to the number of columns in the Ntuple in which this Chunk is embedded.

Details:

A Chunk is a division that cuts across all columns (of the Ntuple in which it is embedded); the individual columns are split into column_chunks. Consequently, there must be as many column_chunks as there are columns and they are identified by index position.

The metadata property has no semantic constraints, but it is included here to provide hints for parallel processing systems.

ColumnChunk

An internal division of an Ntuple column for parallel processing.

•  pages: list of Page (required)
•  page_offsets: list of int with length in [1, ∞) (required)
•  page_min: list of Extremes (default: null/empty)
•  page_max: list of Extremes (default: null/empty)
The page_offsets must start with 0, be monotonically increasing, and its length must be one more than the length of pages.
If page_min or page_max is included, its length must be equal to the length of pages.

Details:

Column chunks are further divided into pages, which are separate buffers, may be located on different disk pages, and may be separately compressed. Like an NtupleInstance’s chunk_offsets, the page_offsets provide an index for finding particular entries; unlike chunk_offsets, the page_offsets are required (to avoid reading unnecessary pages). The starting entry (inclusive) for page i is page_offsets[i] and the stopping entry (exclusive) for page i is page_offsets[i + 1].

Additionally, pages may have a “zone map” of minimum and maximum values, so that a page can be skipped if no value in the desired range would be found in it. The page_min and page_max are Extremes.
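
A reader could consult the zone map before touching any buffers; a minimal sketch, assuming page_min and page_max have been extracted as plain lists of numbers:

# Illustrative zone map: one minimum and maximum per page.
page_min = [ 0.0, 10.0, 20.0]
page_max = [ 9.0, 19.0, 29.0]

lo, hi = 12.0, 14.0   # range of values being searched for

pages_to_read = [
    i for i in range(len(page_min))
    if not (page_max[i] < lo or page_min[i] > hi)   # interval-overlap test
]
# -> [1]: only the second page can contain values in [12, 14]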

Page

The atomic unit of reading/decompression for an Ntuple column.

•  buffer: RawInlineBuffer or RawExternalBuffer (required)

Details:

A Page contains one raw buffer, which may be inline or external.

RawInlineBuffer

A generic, uninterpreted array in the Flatbuffers hierarchy; used for small buffers, like Ntuple pages, that are interpreted centrally, as in an Ntuple column.

•  buffer: buffer (required)

Details:

This array class does not provide its own interpretation in terms of data type and dimension order. The interpretation must be provided elsewhere, such as in an ntuple’s Column. This is to avoid repeating (and possibly introducing conflicting) interpretation metadata for many buffers whose type is identical but which are stored as separate pages for performance reasons.

The buffer is the actual data, encoded in Flatbuffers as an array of bytes with known length.

RawExternalBuffer

A generic, uninterpreted array stored outside the Flatbuffers hierarchy; used for small buffers, like Ntuple pages, that are interpreted centrally, as in an Ntuple column.

•  pointer: int in [0, ∞) (required)
•  numbytes: int in [0, ∞) (required)
•  external_source: one of {ExternalBuffer.memory, ExternalBuffer.samefile, ExternalBuffer.file, ExternalBuffer.url}
(default: ExternalBuffer.memory)
•  location: str (default: null)

Details:

This array class is like RawInlineBuffer, but its contents are outside of the Flatbuffers hierarchy. Instead of a buffer property, it has a pointer and a numbytes to specify the source of bytes.

If the external_source is memory, then the pointer and numbytes are interpreted as a raw array in memory. If the external_source is samefile, then the pointer is taken to be a seek position in the same file that stores the Flatbuffer (assuming the Flatbuffer resides in a file). If external_source is file, then the location property is taken to be a file path, and the pointer is taken to be a seek position in that file. If external_source is url, then the location property is taken to be a URL and the bytes are requested by HTTP.
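
The four sources might be resolved as in this sketch (illustrative; the string-valued enum and the I/O handling are not part of the aghast API):

import ctypes
import urllib.request

def fetch_bytes(pointer, numbytes, external_source, location=None, flatbuffer_path=None):
    if external_source == "memory":
        return ctypes.string_at(pointer, numbytes)    # raw array in memory
    elif external_source == "samefile":
        with open(flatbuffer_path, "rb") as f:        # file containing the Flatbuffer
            f.seek(pointer)
            return f.read(numbytes)
    elif external_source == "file":
        with open(location, "rb") as f:               # location is a file path
            f.seek(pointer)
            return f.read(numbytes)
    elif external_source == "url":
        with urllib.request.urlopen(location) as response:  # location is a URL
            return response.read()[:numbytes]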

Metadata

Optional container for applications to attach metadata to histograms, functions, ntuples, and collections.

•  data: str (required)
•  language: one of {Metadata.unspecified, Metadata.json} (required)

Details:

Anything that an application needs to track that is not or won’t be encoded in aghast structures may be attached as metadata. The data are expressed as a string in some language, such as JSON.

Graphical properties of plots are not encoded in a ghast, but they may use Decoration for graphics-specific metadata.

Decoration

Optional container for applications to attach graphical properties to histograms, functions, ntuples, and collections.

•  data: str (required)
•  language: one of {Decoration.unspecified, Decoration.css, Decoration.vega, Decoration.json}
(required)

Details:

The aghast specification does not encode any graphical properties, such as colors or arrangements of a plot. However, an application may want to save or communicate these properties. The Decoration class is intended to hold this information.

The data are expressed as a string in some language, such as CSS, Vega, or JSON format.