Releases: apache/pinot

Apache Pinot 0.8.0

11 Aug 22:57

Summary

This release introduced several awesome new features, including compatibility tests, enhanced complex type and JSON support, partial upsert support, and new stream ingestion plugins (AWS Kinesis, Apache Pulsar). It contains many query enhancements, such as new timestamp and boolean type support and flexible numerical column comparison. It also includes many key bug fixes. See details below.

The release was cut from the following commit: fe83e95

and the following cherry-picks:

Notable New Features

  • Extract time handling for SegmentProcessorFramework (#7158)
  • Add Apache Pulsar low level and high level connector (#7026)
  • Enable parallel builds for compat checker (#7149)
  • Add controller/server API to fetch aggregated segment metadata (#7102)
  • Support Dictionary Based Plan For DISTINCT (#7141)
  • Provide HTTP client to kinesis builder (#7148)
  • Add datetime function with 2 arguments (#7116)
  • Adding ability to check ingestion status for Offline Pinot table (#7070)
  • Add timestamp datatype support in JDBC (#7117)
  • Allow updating controller and broker helix hostname (#7064)
  • Cancel running Kinesis consumer tasks when timeout occurs (#7109)
  • Implement Append merger for partial upsert (#7087)
  • SegmentProcessorFramework Enhancement (#7092)
  • Added TaskMetricsEmitted periodic controller job (#7091)
  • Support json path expressions in query. (#6998)
  • Support data preprocessing for AVRO and ORC formats (#7062)
  • Add partial upsert config and mergers (#6899)
  • Add support for range index rule recommendation (#7034) (#7063)
  • Allow reloading consuming segment by default (#7078)
  • Add LZ4 Compression Codec (#6804) (#7035)
  • Make Pinot JDK 11 Compilable (#6424)
  • Introduce in-Segment Trim for GroupBy OrderBy Query (#6991)
  • Produce GenericRow file in segment processing mapper (#7013)
  • Add ago() scalar transform function (#6820)
  • Add Bloom Filter support for IN predicate (#7005) (#7007)
  • Add genericRow file reader and writer (#6997)
  • Normalize LHS and RHS numerical types for >, >=, <, and <= operators. (#6927)
  • Add Kinesis Stream Ingestion Plugin (#6661)
  • feature/#6766 JSON and Startree index information in API (#6873)
  • Support null value fields in generic row ser/de (#6968)
  • Implement PassThroughTransformOperator to optimize select queries (#6972) (#6973)
  • Optimize TIME_CONVERT/DATE_TIME_CONVERT predicates (#6957)
  • Prefetch call to fetch buffers of columns seen in the query (#6967)
  • Enabling compatibility tests in the script (#6959)
  • Add collectionToJsonMode to schema inference (#6946)
  • Add the complex-type support to decoder/reader (#6945)
  • Adding a new Controller API to retrieve ingestion status for realtime… (#6890)
  • Add support for Long in Modulo partition function. (#6929)
  • Enhance PinotSegmentRecordReader to preserve null values (#6922)
  • add complex-type support to avro-to-pinot schema inference (#6928)
  • Add correct yaml files for real time data (#6787) (#6916)
  • Add complex-type transformation to offline segment creation (#6914)
  • Add config file support (#6787) (#6901)
  • Enhance JSON index to support nested array (#6877)
  • Add debug endpoint for tables. (#6897)
  • JSON column datatype support. (#6878)
  • Allow empty string in MV column (#6879)
  • Add Zstandard compression support with JMH benchmarking (#6804) (#6876)
  • Normalize LHS and RHS numerical types for = and != operator. (#6811)
  • Change ConcatCollector implementation to use off-heap (#6847)
  • [PQL Deprecation] Clean up the old BrokerRequestOptimizer (#6859)
  • [PQL Deprecation] Do not compile PQL broker request for SQL query (#6855)
  • Add TIMESTAMP and BOOLEAN data type support (#6719)
  • Add admin endpoint for Pinot Minion. (#6822)
  • Remove the usage of PQL compiler (#6808)
  • Add endpoints in Pinot Controller, Broker and Server to get system and application configs. (#6817)
  • Support IN predicate in ColumnValue SegmentPruner (#6756) (#6776)
  • Enable adding new segments to an upsert-enabled realtime table (#6567)
  • Interface changes for Kinesis connector (#6667)
  • Pinot Minion SegmentGenerationAndPush task: PinotFS configs inside taskSpec are always temporary and have higher priority than the default PinotFS created by the minion server configs (#6744)
  • DataTable V3 implementation and measure data table serialization cost on server (#6710)
  • add uploadLLCSegment endpoint in TableResource (#6653)
  • File-based SegmentWriter implementation (#6718)
  • Basic Auth for pinot-controller (#6613)
  • UI integration with Authentication API and added login page (#6686)
  • Support data ingestion for offline segment in one pass (#6479)
  • SumPrecision: support all data types and star-tree (#6668)
  • complete compatibility regression testing (#6650)
  • Kinesis implementation Part 1: Rename partitionId to partitionGroupId (#6655)
  • Make Pinot metrics pluggable (#6640)
  • Recover the segment from controller when LLC table cannot load it (#6647)
  • Adding a new API for validating specified TableConfig and Schema (#6620)
  • Introduce a metric for query/response size on broker. (#6590)
  • Adding a controller periodic task to clean up dead minion instances (#6543)
  • Adding new validation for Json, TEXT indexing (#6541)
  • Always return a response from query execution. (#6596)

Special notes

  • After the 0.8.0 release, we will officially support JDK 11 and can safely start to use JDK 11 features. The code still compiles with JDK 8 (#6424)
  • RealtimeToOfflineSegmentsTask config has some backward incompatible changes (#7158)
    — timeColumnTransformFunction is removed (backward-incompatible, but rollup is not supported anyway)
    — Deprecate collectorType and replace it with mergeType
    — Add roundBucketTimePeriod and partitionBucketTimePeriod to config the time bucket for round and partition
  • Regex path for pluggable MinionEventObserverFactory is changed from org.apache.pinot.*.event.* to org.apache.pinot.*.plugin.minion.tasks.* (#6980)
  • Moved all pinot built-in minion tasks to the pinot-minion-builtin-tasks module and package them into a shaded jar (#6618)
  • Reloading consuming segment flag pinot.server.instance.reload.consumingSegment will be true by default (#7078)
  • Move JSON decoder from pinot-kafka to pinot-json package. (#7021)
  • Backward incompatible schema change through controller rest API PUT /schemas/{schemaName} will be blocked. (#6737)
  • Deprecated /tables/validateTableAndSchema in favor of the new configs/validate API and introduced new APIs for /tableConfigs to operate on the realtime table config, offline table config and schema in one shot. (#6840)
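
The RealtimeToOfflineSegmentsTask config changes listed above (#7158) can be illustrated with a small migration helper. This is only a sketch, not part of Pinot: the function name `migrate_task_config` and the example values are hypothetical, and it assumes the old `collectorType` value can be carried over unchanged as the `mergeType` value, which the notes above do not confirm.

```python
def migrate_task_config(old: dict) -> dict:
    """Return a copy of a task config dict adjusted for the 0.8.0 key changes.

    Hypothetical helper: only the key names come from the release notes.
    """
    new = dict(old)  # leave the caller's dict untouched
    # timeColumnTransformFunction is removed in 0.8.0, so drop it if present.
    new.pop("timeColumnTransformFunction", None)
    # collectorType is deprecated in favor of mergeType.
    # Assumption: the old value is reused as-is for the new key.
    if "collectorType" in new:
        collector = new.pop("collectorType")
        new.setdefault("mergeType", collector)
    return new


if __name__ == "__main__":
    # Example values are illustrative only.
    legacy = {
        "bucketTimePeriod": "1d",
        "collectorType": "concat",
        "timeColumnTransformFunction": "round(ts, 1000)",
    }
    print(migrate_task_config(legacy))
```

The new roundBucketTimePeriod and partitionBucketTimePeriod keys are not set by this helper, since suitable values depend on the table.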

Major Bug fixes

  • Fix race condition in MinionInstancesCleanupTask (#7122)
  • Fix custom instance id for controller/broker/minion (#7127)
  • Fix UpsertConfig JSON deserialization. (#7125)
  • Fix the memory issue for selection query with large limit (#7112)
  • Fix the deleted segments directory not exist warning (#7097)
  • Fixing docker build scripts by providing JDK_VERSION as parameter (#7095)
  • Misc fixes for json data type (#7057)
  • Fix handling of date time columns in query recommender (#7018) (#7031)
  • fixing pinot-hadoop and pinot-spark test (#7030)
  • Fixing HadoopPinotFS listFiles method to always contain scheme (#7027)
  • fixed GenericRow compare for different _fieldToValueMap size (#6964)
  • Fix NPE in NumericalFilterOptimizer due to IS NULL and IS NOT NULL operator. (#7001)
  • Fix the race condition in realtime text index refresh thread (#6858) (#6990)
  • Fix deep store directory structure (#6976)
  • Fix NPE issue when consumed kafka message is null or the record value is null. (#6950)
  • Mitigate calcite NPE bug. (#6908)
  • Fix the exception thrown in the case that a specified table name does not exist (#6328) (#6765)
  • Fix CAST transform function for chained transforms (#6941)
  • Fixed failing pinot-controller npm build (#6795)

Apache Pinot (incubating) 0.7.1

07 Apr 17:58

Summary

This release introduced several awesome new features, including JSON index, lookup-based join support, geospatial support, TLS support for Pinot connections, and various performance optimizations and improvements. It also adds several new APIs to better manage segments and upload data to an offline table, and it contains many key bug fixes. See details below.

The release was cut from the following commit: 78152cd

and the following cherry-picks:

Notable New Features

  • Add a server metric: queriesDisabled to check if queries disabled or not. (#6586)
  • Optimization on GroupKey to save the overhead of ser/de the group keys (#6593) (#6559)
  • Support validation for jsonExtractKey and jsonExtractScalar functions (#6246) (#6594)
  • Real-Time Provisioning Helper tool improvement to take data characteristics as input instead of an actual segment (#6546)
  • Add the isolation level config isolation.level to Kafka consumer (2.0) to ingest transactionally committed messages only (#6580)
  • Enhance StarTreeIndexViewer to support multiple trees (#6569)
  • Improves ADLSGen2PinotFS with service principal-based auth, auto-create container on the initial run. It's backward compatible with key-based auth. (#6531)
  • Add metrics for minion tasks status (#6549)
  • Use minion data directory as tmp directory for SegmentGenerationAndPushTask to ensure directory is always cleaned up (#6560)
  • Add optional HTTP basic auth to pinot broker, which enables user- and table-level authentication of incoming queries. (#6552)
  • Add Access Control for REST endpoints of Controller (#6507)
  • Add date_trunc to scalar functions to support date_trunc during ingestion (#6538)
  • Allow tar gz with > 8gb size (#6533)
  • Add Lookup UDF Join support (#6530), (#6465), (#6383) (#6286)
  • Add cron scheduler metrics reporting (#6502)
  • Support generating derived column during segment load, so that derived columns can be added on-the-fly (#6494)
  • Support chained transform functions (#6495)
  • Add scalar function JsonPathArray to extract arrays from json (#6490)
  • Add a guard against multiple consuming segments for same partition (#6483)
  • Remove the usage of deprecated range delimiter (#6475)
  • Handle scheduler calls with a proper response when it's disabled. (#6474)
  • Simplify SegmentGenerationAndPushTask handling getting schema and table config (#6469)
  • Add a cluster config to config number of concurrent tasks per instance for minion task: SegmentGenerationAndPushTaskGenerator (#6468)
  • Replace BrokerRequestOptimizer with QueryOptimizer to also optimize the PinotQuery (#6423)
  • Add additional string scalar functions (#6458)
  • Add additional scalar functions for array type (#6446)
  • Add CRON scheduler for Pinot tasks (#6451)
  • Set default Data Type while setting type in Add Schema UI dialog (#6452)
  • Add ImportData sub command in pinot admin (#6396)
  • H3-based geospatial index (#6409) (#6306)
  • Add JSON index support (#6408) (#6216) (#6346)
  • Make minion tasks pluggable via reflection (#6395)
  • Add compatibility test for segment operations upload and delete (#6382)
  • Add segment reset API that disables and then enables the segment (#6336)
  • Add Pinot minion segment generation and push task. (#6340)
  • Add a version option to pinot admin to show all the component versions (#6380)
  • Add FST index using Lucene lib to speedup REGEXP_LIKE operator on text (#6120)
  • Add APIs for uploading data to an offline table. (#6354)
  • Allow the use of environment variables in stream configs (#6373)
  • Enhance task schedule api for single type/table support (#6352)
  • Add broker time range based pruner for routing. Query operators supported: RANGE, =, <, <=, >, >=, AND, OR (#6259)
  • Add json path functions to extract values from json object (#6347)
  • Create a pluggable interface for Table config tuner (#6255)
  • Add a Controller endpoint to return table creation time (#6331)
  • Add tooltips, ability to enable-disable table state to the UI (#6327)
  • Add Pinot Minion client (#6339)
  • Add more efficient use of RoaringBitmap in OnHeapBitmapInvertedIndexCreator and OffHeapBitmapInvertedIndexCreator (#6320)
  • Add decimal percentile support. (#6323)
  • Add API to get status of consumption of a table (#6322)
  • Add support to add offline and realtime tables, individually able to add schema and schema listing in UI (#6296)
  • Improve performance for distinct queries (#6285)
  • Allow adding custom configs during the segment creation phase (#6299)
  • Use sorted index based filtering only for dictionary encoded column (#6288)
  • Enhance forward index reader for better performance (#6262)
  • Support for text index without raw (#6284)
  • Add api for cluster manager to get table state (#6211)
  • Perf optimization for SQL GROUP BY ORDER BY (#6225)
  • Add support using environment variables in the format of ${VAR_NAME:DEFAULT_VALUE} in Pinot table configs. (#6271)
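
The `${VAR_NAME:DEFAULT_VALUE}` placeholder format mentioned in the last item (#6271) can be sketched as follows. This is an illustrative re-implementation, not Pinot's actual code: the function name `substitute_env` and the sample variable `BROKER_HOST` are hypothetical, and leaving a placeholder untouched when the variable is unset and no default is given is an assumption of this sketch.

```python
import os
import re

# Matches ${NAME} or ${NAME:default}; the default may itself contain colons.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute_env(text, env=None):
    """Replace ${VAR_NAME:DEFAULT_VALUE} placeholders in a config string."""
    env = os.environ if env is None else env

    def repl(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: an unresolved placeholder is left as-is.
        return match.group(0)

    return _PLACEHOLDER.sub(repl, text)


if __name__ == "__main__":
    config = '{"host": "${BROKER_HOST:localhost}"}'
    print(substitute_env(config, {}))
    print(substitute_env(config, {"BROKER_HOST": "broker-1"}))
```

Passing an explicit `env` mapping keeps the helper easy to test without touching the real process environment.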

Special notes

  • Pinot controller metrics prefix is fixed to add a missing dot (#6499). This is a backward-incompatible change: JMX queries on controller metrics must be updated.
  • Legacy group key delimiter (\t) was removed to be backward-compatible with release 0.5.0 (#6589)
  • Upgrade zookeeper version to 3.5.8 to fix ZOOKEEPER-2184: Zookeeper Client should re-resolve hosts when connection attempts fail. (#6558)
  • Add TLS-support for client-pinot and pinot-internode connections (#6418)
    Upgra...

Apache Pinot (incubating) 0.6.0

05 Nov 22:27

Summary

This release introduced some excellent new features, including upsert, tiered storage, pinot-spark-connector, support of having clause, more validations on table config and schema, support of ordinals in GROUP BY and ORDER BY clause, array transform functions, adding push job type of segment metadata only mode, and some new APIs like updating instance tags, new health check endpoint. It also contains many key bug fixes. See details below.

The release was cut from the following commit:
e5c9bec
and the following cherry-picks:

Notable New Features

  • Tiered storage (#5793)
  • Upsert feature (#6096, #6113, #6141, #6149, #6167)
  • Pre-generate aggregation functions in QueryContext (#5805)
  • Adding controller healthcheck endpoint: /health (#5846)
  • Add pinot-spark-connector (#5787)
  • Support multi-value non-dictionary group by (#5851)
  • Support type conversion for all scalar functions (#5849)
  • Add additional datetime functionality (#5438)
  • Support post-aggregation in ORDER-BY (#5856)
  • Support post-aggregation in SELECT (#5867)
  • Add RANGE FilterKind to support merging ranges for SQL (#5898)
  • Add HAVING support (#5889)
  • Support for exact distinct count for non int data types (#5872)
  • Add max qps bucket count (#5922)
  • Add Range Indexing support for raw values (#5853)
  • Add IdSet and IdSetAggregationFunction (#5926)
  • [Deepstore by-pass]Add a Deepstore bypass integration test with minor bug fixes. (#5857)
  • Add Hadoop counters for detecting schema mismatch (#5873)
  • Add RawThetaSketchAggregationFunction (#5970)
  • Instance API to directly updateTags (#5902)
  • Add streaming query handler (#5717)
  • Add InIdSetTransformFunction (#5973)
  • Add ingestion descriptor in the header (#5995)
  • Zookeeper put api (#5949)
  • Feature/#5390 segment indexing reload status api (#5718)
  • Segment processing framework (#5934)
  • Support streaming query in QueryExecutor (#6027)
  • Add list of allowed tables for emitting table level metrics (#6037)
  • Add FilterOptimizer which supports optimizing both PQL and SQL query filter (#6056)
  • Adding push job type of segment metadata only mode (#5967)
  • Minion taskExecutor for RealtimeToOfflineSegments task (#6050, #6124)
  • Adding array transform functions: array_average, array_max, array_min, array_sum (#6084)
  • Allow modifying/removing existing star-trees during segment reload (#6100)
  • Implement off-heap bloom filter reader (#6118)
  • Support for multi-threaded Group By reducer for SQL. (#6044)
  • Add OnHeapGuavaBloomFilterReader (#6147)
  • Support using ordinals in GROUP BY and ORDER BY clause (#6152)
  • Merge common APIs for Dictionary (#6176)
  • Add table level lock for segment upload (#6165)
  • Added recursive functions validation check for group by (#6186)
  • Add StrictReplicaGroupInstanceSelector (#6208)
  • Add IN_SUBQUERY support (#6022)
  • Add IN_PARTITIONED_SUBQUERY support (#6043)
  • Some UI features (#5810, #5981, #6117, #6215)

Special notes

  • Brokers should be upgraded before servers to maintain backward compatibility:
    — Change group key delimiter from '\t' to '\0' (#5858)
    — Support for exact distinct count for non int data types (#5872)
  • Pinot Components have to be deployed in the following order:
    (PinotServiceManager -> Bootstrap services in role ServiceRole.CONTROLLER -> All remaining bootstrap services in parallel)
    — Starts Broker and Server in parallel when using ServiceManager (#5917)
  • New settings introduced and old ones deprecated:
    — Make realtime threshold property names less ambiguous (#5953)
    — Change Signature of Broker API in Controller (#6119)
  • This aggregation function is still in beta version. This PR involves change on the format of data sent from server to broker, so it works only when both broker and server are upgraded to the new version:
    — Enhance DistinctCountThetaSketchAggregationFunction (#6004)

Major Bug fixes

  • Improve performance of DistinctCountThetaSketch by eliminating empty sketches and unions. (#5798)
  • Enhance VarByteChunkSVForwardIndexReader to directly read from data buffer for uncompressed data (#5816)
  • Fixing backward-compatible issue of schema fetch call (#5885)
  • Fix race condition in MetricsHelper (#5887)
  • Fixing the race condition that segment finished before ControllerLeaderLocator created. (#5864)
  • Fix CSV and JSON converter on BYTES column (#5931)
  • Fixing the issue that transform UDFs are parsed as function name 'OTHER', not the real function names (#5940)
  • Incorporating embedded exception while trying to fetch stream offset (#5956)
  • Use query timeout for planning phase (#5990)
  • Add null check while fetching the schema (#5994)
  • Validate timeColumnName when adding/updating schema/tableConfig (#5966)
  • Handle the partitioning mismatch between table config and stream (#6031)
  • Fix built-in virtual columns for immutable segment (#6042)
  • Refresh the routing when realtime segment is committed (#6078)
  • Add support for Decimal with Precision Sum aggregation (#6053)
  • Fixing the calls to Helix to throw exception if zk connection is broken (#6069)
  • Allow modifying/removing existing star-trees during segment reload (#6100)
  • Add max length support in schema builder (#6112)
  • Enhance star-tree to skip matching-all predicate on non-star-tree dimension (#6109)

Backward Incompatible Changes

  • Make realtime threshold property names less ambiguous (#5953)
  • Enhance DistinctCountThetaSketchAggregationFunction (#6004)
  • Deep Extraction Support for ORC, Thrift, and ProtoBuf Records (#6046)

Apache Pinot (incubating) 0.5.0

03 Sep 04:26

Summary

This release includes many new features on Pinot ingestion and connectors (e.g., support for filtering during ingestion which is configurable in table config; support for json during ingestion; proto buf input format support and a new Pinot JDBC client), query capability (e.g., a new GROOVY transform function UDF) and admin functions (a revamped Cluster Manager UI & Query Console UI). It also contains many key bug fixes. See details below.

The release was cut from the following commit:
d1b4586
and the following cherry-picks:

Notable New Features

  • Allowing update on an existing instance config: PUT /instances/{instanceName} with Instance object as the pay-load (#PR4952)
  • Add PinotServiceManager to start Pinot components (#PR5266)
  • Support for protocol buffers input format. (#PR5293)
  • Add GenericTransformFunction wrapper for simple ScalarFunctions (PR#5440)
    — Adding support to invoke any scalar function via GenericTransformFunction
  • Add Support for SQL CASE Statement (PR#5461)
  • Support distinctCountRawThetaSketch aggregation that returns serialized sketch. (PR#5465)
  • Add multi-value support to SegmentDumpTool (PR#5487)
    — add segment dump tool as part of the pinot-tool.sh script
  • Add json_format function to convert json object to string during ingestion. (PR#5492)
    — Can be used to store complex objects as a json string (which can later be queried using jsonExtractScalar)
  • Support escaping single quote for SQL literal (PR#5501)
    — This is especially useful for DistinctCountThetaSketch because it stores expression as literal
    E.g. DistinctCountThetaSketch(..., 'foo=''bar''', ...)
  • Support expression as the left-hand side for BETWEEN and IN clause (PR#5502)
  • Add a new field IngestionConfig in TableConfig
    — FilterConfig: ingestion level filtering of records, based on filter function. (PR#5597)
    — TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release (PR#5681).
  • Allow star-tree creation during segment load (#PR5641)
    — Introduced a new boolean config enableDynamicStarTreeCreation in IndexingConfig to enable/disable star-tree creation during segment load.
  • Support for Pinot clients using JDBC connection (#PR5602)
  • Support customized accuracy for distinctCountHLL, distinctCountHLLMV functions by adding log2m value as the second parameter in the function. (#PR5564)
    — Adding cluster config: default.hyperloglog.log2m to allow users to set the default log2m value.
  • Add segment encryption on Controller based on table config (PR#5617)
  • Add a constraint to the message queue for all instances in Helix, with a large default value of 100000. (PR#5631)
  • Support order-by aggregations not present in SELECT (PR#5637)
    — Example: "select subject from transcript group by subject order by count() desc"
    This is equivalent to the following query, except that the response does not contain count():
    "select subject, count() from transcript group by subject order by count() desc"
  • Add geo support for Pinot queries (PR#5654)
    — Added geo-spatial data model and geospatial functions
  • Cluster Manager UI & Query Console UI revamp (PR#5684 and PR#5732)
    — Updated Cluster Manager UI and added table details page and segment details page
  • Add Controller API to explore Zookeeper (PR#5687)
  • Support BYTES type for distinctCount and group-by (PR#5701 and PR#5708)
    — Add BYTES type support to DistinctCountAggregationFunction
    — Correctly handle BYTES type in DictionaryBasedAggregationOperator for DistinctCount
  • Support for ingestion job spec in JSON format (#PR5729)
  • Improvements to RealtimeProvisioningHelper command (#PR5737)
    — Improved docs related to ingestion and plugins
  • Added GROOVY transform function UDF (#PR5748)
    — Ability to run a groovy script in the query as a UDF. e.g. string concatenation:
    SELECT GROOVY('{"returnType": "INT", "isSingleValue": true}', 'arg0 + " " + arg1', columnA, columnB) FROM myTable
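
The single-quote escaping described above for SQL literals (PR#5501) doubles each embedded quote. A minimal client-side sketch of that rule is shown below; the helper name `sql_string_literal` is hypothetical and not part of any Pinot client.

```python
def sql_string_literal(value: str) -> str:
    """Wrap a string in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


if __name__ == "__main__":
    # The predicate from the DistinctCountThetaSketch example above.
    predicate = "foo='bar'"
    print(sql_string_literal(predicate))  # 'foo=''bar'''
```

This kind of helper is useful when a predicate string has to be passed as a literal argument, as in the DistinctCountThetaSketch example.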

Special notes

  • Changed the stream and metadata interface (PR#5542)
    — This PR concludes the work for the issue #5359 to extend offset support for other streams
  • TransformConfig: ingestion level column transformations. This was previously introduced in Schema (FieldSpec#transformFunction), and has now been moved to TableConfig. It continues to remain under schema, but we recommend users to set it in the TableConfig starting this release (PR#5681).
  • Config key enable.case.insensitive.pql in Helix cluster config is deprecated, and replaced with enable.case.insensitive. (#PR5546)
  • Change default segment load mode to MMAP. (PR#5539)
    — Previously, the load mode for segments defaulted to heap.

Major Bug fixes

  • Fix bug in distinctCountRawHLL on SQL path (#5494)
  • Fix backward incompatibility for existing stream implementations (#5549)
  • Fix backward incompatibility in StreamFactoryConsumerProvider (#5557)
  • Fix logic in isLiteralOnlyExpression. (#5611)
  • Fix double memory allocation during operator setup (#5619)
  • Allow segment download url in Zookeeper to be deep store uri instead of hardcoded controller uri (#5639)
  • Fix a backward compatible issue of converting BrokerRequest to QueryContext when querying from Presto segment splits (#5676)
  • Fix the issue that PinotSegmentToAvroConverter does not handle BYTES data type. (#5789)

Backward Incompatible Changes

  • PQL queries with HAVING clause will no longer be accepted for the following reasons: (#PR5570)
    — HAVING clause does not apply to PQL GROUP-BY semantic where each aggregation column is ordered individually
    — The current behavior can produce inaccurate results without any notice
    — HAVING support will be added for SQL queries in the next release
  • Because of the standardization of the DistinctCountThetaSketch predicate strings, please upgrade Broker before Server. The new Broker can handle both standard and non-standard predicate strings for backward-compatibility. (#PR5613)

Apache Pinot (incubating) 0.4.0

02 Jun 04:40

Summary

This release introduced various new features, including the theta-sketch based distinct count aggregation function, an S3 filesystem plugin, a unified star-tree index implementation, deprecation of TimeFieldSpec in favor of DateTimeFieldSpec, etc. Miscellaneous refactoring, performance improvement and bug fixes were also included in this release. See details below.

The release was cut from this commit:
008be2d
with cherry-picking the following patches:

Notable New Features

  • Made DateTimeFieldSpecs mainstream and deprecated TimeFieldSpec (#2756)
    • Used time column from table config instead of schema (#5320)
    • Included dateTimeFieldSpec in schema columns of Pinot Query Console #5392
    • Used DATE_TIME as the primary time column for Pinot tables (#5399)
  • Supported range queries using indexes (#5240)
  • Supported complex aggregation functions
    • Supported Aggregation functions with multiple arguments (#5261)
    • Added api in AggregationFunction to get compiled input expressions (#5339)
  • Added a simple PinotFS benchmark driver (#5160)
  • Supported default star-tree (#5147)
  • Added an initial implementation for theta-sketch based distinct count aggregation function (#5316)
    • One minor side effect: DataSchemaPruner won't work for DistinctCountThetaSketchAggregationFunction (#5382)
  • Added access control for Pinot server segment download api (#5260)
  • Added Pinot S3 Filesystem Plugin (#5249)
  • Text search improvement
    • Pruned stop words for text index (#5297)
    • Used 8byte offsets in chunk based raw index creator (#5285)
    • Derived num docs per chunk from max column value length for varbyte raw index creator (#5256)
    • Added inter segment tests for text search and fixed a bug for Lucene query parser creation (#5226)
    • Made text index query cache a configurable option (#5176)
    • Added Lucene DocId to PinotDocId cache to improve performance (#5177)
    • Removed the construction of second bitmap in text index reader to improve performance (#5199)
  • Tooling/usability improvement
    • Added template support for Pinot Ingestion Job Spec (#5341)
    • Allowed users to specify the zk data dir and skip cleanup during zk shutdown (#5295)
    • Allowed configuring minion task timeout in the PinotTaskGenerator (#5317)
    • Update JVM settings for scripts (#5127)
    • Added Stream github events demo (#5189)
    • Moved docs link from gitbook to docs.pinot.apache.org (#5193)
  • Re-implemented ORCRecordReader (#5267)
  • Evaluated schema transform expressions during ingestion (#5238)
  • Handled count distinct query in selection list (#5223)
  • Enabled async processing in pinot broker query api (#5229)
  • Supported bootstrap mode for table rebalance (#5224)
  • Supported order-by on BYTES column (#5213)
  • Added Nightly publish to binary (#5190)
  • Shuffled the segments when rebalancing the table to avoid creating hotspot servers (#5197)
  • Supported inbuilt transform functions (#5312)
    • Added date time transform functions (#5326)
  • Deepstore by-pass in LLC: introduced segment uploader (#5277, #5314)
  • APIs Additions/Changes
    • Added a new server api for download of segments
      • GET /segments/{tableNameWithType}/{segmentName}
  • Upgraded helix to 0.9.7 (#5411)
  • Added support to execute functions during query compilation (#5406)
  • Other notable refactoring
    • Moved table config into pinot-spi (#5194)
    • Cleaned up integration tests. Standardized the creation of schema, table config and segments (#5385)
    • Added jsonExtractScalar function to extract field from json object (#4597)
    • Added template support for Pinot Ingestion Job Spec #5372
    • Cleaned up AggregationFunctionContext (#5364)
    • Optimized real-time range predicate when cardinality is high (#5331)
    • Made PinotOutputFormat use table config and schema to create segments (#5350)
    • Tracked unavailable segments in InstanceSelector (#5337)
    • Added a new best effort segment uploader with bounded upload time (#5314)
    • In SegmentPurger, used table config to generate the segment (#5325)
    • Decoupled schema from RecordReader and StreamMessageDecoder (#5309)
    • Implemented ARRAYLENGTH UDF for multi-valued columns (#5301)
    • Improved GroupBy query performance (#5291)
    • Optimized ExpressionFilterOperator (#5132)

Major Bug Fixes

  • Do not release the PinotDataBuffer when closing the index (#5400)
  • Handled a no-arg function in query parsing and expression tree (#5375)
  • Fixed compatibility issues during rolling upgrade due to unknown json fields (#5376)
  • Fixed missing error message from pinot-admin command (#5305)
  • Fixed HDFS copy logic (#5218)
  • Fixed spark ingestion issue (#5216)
  • Fixed the capacity of the DistinctTable (#5204)
  • Fixed various links in the Pinot website

Work in Progress

  • Upsert: support overriding data in the real-time table (#4261).
    • Add pinot upsert features to pinot common (#5175)
  • Enhancements for theta-sketch, e.g. multiValue aggregation support, complex predicates, performance tuning, etc

Backward Incompatible Changes

  • TableConfig no longer supports de-serialization from json string of nested json string (i.e. no \" inside the json) (#5194)
  • The following APIs are changed in AggregationFunction (use TransformExpressionTree instead of String as the key of blockValSetMap) (#5371):
  void aggregate(int length, AggregationResultHolder aggregationResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
  void aggregateGroupBySV(int length, int[] groupKeyArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
  void aggregateGroupByMV(int length, int[][] groupKeysArray, GroupByResultHolder groupByResultHolder, Map<TransformExpressionTree, BlockValSet> blockValSetMap);
  • A different segment writing logic was introduced in #5256. Although this is backward compatible in the sense that old segments can be read by the new code, rollback would be tricky: segments created after the upgrade are written in the new format, and the old code cannot read them.

Apache Pinot (incubating) 0.3.0

25 Mar 21:11

What's the big change?

The architectural change between the previous release (0.2.0) and this release (0.3.0) was motivated by the need to make Apache Pinot extensible. The 0.2.0 release was not flexible enough to support new storage types or new stream types; adding new functionality required changing too much code. The Pinot team therefore carried out an extensive refactoring and improvement of the source code.

For instance, the picture below shows the module dependencies of releases 0.2.x and earlier. To support a new storage type, we would have had to change several modules. Pretty bad, huh?

[Image: module dependencies in releases 0.2.x and earlier]

To address this, the following major changes were made:

  • Refactored common interfaces to pinot-spi module
  • Defined four types of plugin modules:
    • Pinot input format: How to read records from various data/file formats, e.g. Avro/CSV/JSON/ORC/Parquet/Thrift
    • Pinot filesystem: How to operate on files in various filesystems, e.g. Azure Data Lake/Google Cloud Storage/S3/HDFS
    • Pinot stream ingestion: How to ingest data streams from various upstream systems, e.g. Kafka/Kinesis/Eventhub
    • Pinot batch ingestion: How to run Pinot batch ingestion jobs in various frameworks, e.g. Standalone/Hadoop/Spark
  • Built shaded jars for each individual plugin
  • Added support to dynamically load pinot plugins at server startup time

The architecture is now plug-and-play: new tools can be supported with small, simple extensions that do not touch large parts of the code base. Integrations with new streaming services and data formats can be developed much more simply and conveniently.

The currently supported Pinot plugin module structure is:

  • pinot-input-format
    • pinot-avro
    • pinot-csv
    • pinot-json
    • pinot-orc
    • pinot-parquet
    • pinot-thrift
  • pinot-file-system
    • pinot-adls
    • pinot-gcs
    • pinot-hdfs
  • pinot-stream-ingestion
    • pinot-kafka-0.9
    • pinot-kafka-2.0
  • pinot-batch-ingestion
    • pinot-batch-ingestion-hadoop
    • pinot-batch-ingestion-spark
    • pinot-batch-ingestion-standalone

Notable New Features

  • Added support for DISTINCT (#4535)
  • Added support for default values for BYTES columns (#4583)
  • JDK 11 Support
  • Added support to tune size vs accuracy for approximation aggregation functions: DistinctCountHLL, PercentileEst, PercentileTDigest (#4666)
  • Added Data Anonymizer Tool (#4747)
  • Deprecated pinot-hadoop and pinot-spark modules, replaced by pinot-batch-ingestion-hadoop and pinot-batch-ingestion-spark
  • Support STRING and BYTES for no dictionary columns in realtime consuming segments (#4791)
  • Made pinot-distribution build a pinot-all jar and assemble it (#4977)
  • Added case-insensitive support for PQL (#4983)
  • Enhanced TableRebalancer logic
    • Moved to new rebalance strategy (#4695)
    • Supported rebalancing tables under any condition (#4990)
    • Supported reassigning completed segments along with Consuming segments for LLC realtime table (#5015)
  • Added experimental support for Text Search (#4993)
  • Upgraded Helix to version 0.9.4; task management now works as expected (#5020)
  • Added date_trunc transformation function. (#4740)
  • Support schema evolution for consuming segment. (#4954)
  • SQL Support
    • Added Calcite SQL compiler
    • Added SQL response format (#4694, #4877)
    • Added support for GROUP BY with ORDER BY (#4602)
    • Query console defaults to use SQL syntax (#4994)
    • Support column alias (#5016, #5033)
    • Added SQL query endpoint: /query/sql (#4964)
    • Support arithmetic operators (#5018)
    • Support non-literal expressions for right-side operand in predicate comparison (#5070)
  • APIs Additions/Changes
    • Pinot Admin Command
      • Added -queryType option in PinotAdmin PostQuery subcommand (#4726)
      • Added -schemaFile as option in AddTable command (#4959)
      • Added OperateClusterConfig sub command in PinotAdmin (#5073)
    • Pinot Controller Rest APIs
      • Get Table leader controller resource (#4545)
      • Support HTTP POST/PUT to upload JSON encoded schema (#4639)
      • Table rebalance API now requires both table name and type as parameters. (#4824)
      • Refactored Segments APIs (#4806)
      • Added segment batch deletion REST API (#4828)
      • Update schema API to reload table on schema change when applicable (#4838)
      • Enhance the task related REST APIs (#5054)
      • Added PinotClusterConfig REST APIs (#5073)
        • GET /cluster/configs
        • POST /cluster/configs
        • DELETE /cluster/configs/{configName}
  • Configurations Additions/Changes
    • Config: controller.host is now optional in Pinot Controller
    • Added instance config: queriesDisabled to disable query sending to a running server (#4767)
    • Added broker config: pinot.broker.enable.query.limit.override, for a configurable max query response size (#5040)
    • Removed deprecated server configs (#4903)
      • pinot.server.starter.enableSegmentsLoadingCheck
      • pinot.server.starter.timeoutInSeconds
      • pinot.server.instance.enable.shutdown.delay
      • pinot.server.instance.starter.maxShutdownWaitTime
      • pinot.server.instance.starter.checkIntervalTime
    • Decoupled server instance id from hostname/port config. (#4995)
    • Added FieldConfig to encapsulate encoding and indexing info for a field. (#5006)
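
The PinotClusterConfig REST endpoints listed above (#5073) could be exercised along the following lines. This is a minimal sketch, not the documented client: the controller address, the config key used in the usage note, and the assumption that the POST body is a flat JSON object of config name to value are all illustrative and not stated in these notes. The functions only build the requests; sending them is left to the caller.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical controller address; replace with your own deployment's host and port.
CONTROLLER = "http://localhost:9000"


def list_configs_request() -> urllib.request.Request:
    """Build the request for GET /cluster/configs."""
    return urllib.request.Request(f"{CONTROLLER}/cluster/configs", method="GET")


def update_configs_request(configs: dict) -> urllib.request.Request:
    """Build the request for POST /cluster/configs.

    Assumption: the body is a flat JSON object mapping config name to value.
    """
    body = json.dumps(configs).encode("utf-8")
    return urllib.request.Request(
        f"{CONTROLLER}/cluster/configs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def delete_config_request(config_name: str) -> urllib.request.Request:
    """Build the request for DELETE /cluster/configs/{configName}."""
    quoted = urllib.parse.quote(config_name, safe="")
    return urllib.request.Request(
        f"{CONTROLLER}/cluster/configs/{quoted}", method="DELETE"
    )
```

To send one of these against a running controller, pass it to urllib.request.urlopen, for example urllib.request.urlopen(update_configs_request({"example.config.key": "value"})), where the key name is a placeholder.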

Major Bug Fixes

  • Fixed the bug of releasing the segment when there are still threads working on it. (#4764)
  • Fixed the bug of uneven task distribution for threads (#4793)
  • Fixed encryption for .tar.gz segment file upload (#4855)
  • Fixed controller rest API to download segment from non local FS. (#4808)
  • Fixed the bug of not releasing segment lock if segment recovery throws exception (#4882)
  • Fixed the issue of server not registering state model factory before connecting the Helix manager (#4929)
  • Fixed the exception in server instance when Helix starts a new ZK session (#4976)
  • Fixed ThreadLocal DocIdSet issue in ExpressionFilterOperator (#5114)
  • Fixed the bug in default value provider classes (#5137)
  • Fixed the bug when no segment exists in RealtimeSegmentSelector (#5138)

Work in Progress

  • We are in the process of supporting text search query functionalities.
  • We are in the process of supporting null values (#4230); currently only limited query features are supported
    • Added Presence Vector to represent null value (#4585)
    • Added null predicate support for leaf predicates (#4943)

Backward Incompatible Changes

  • Upgrading from version 0.1.0 to this release is disruptive because of the protocol changes between Pinot Broker and Pinot Server. Please upgrade to release 0.2.0 first, then upgrade to this version.
  • If you build your own startable or war without using the scripts generated in the pinot-distribution module: for Java 8, an environment variable “plugins.dir” is required so that Pinot can find where to load the Pinot plugin jars from. For Java 11, the plugins directory must be explicitly added to the classpath. Please see pinot-admin.sh as an example.
  • As always, we recommend that you upgrade controllers first, and then brokers and lastly the servers in order to have zero downtime in production clusters.
  • Kafka 0.9 is no longer included in the release distribution.
  • Pull request #4806 introduces a backward incompatible API change for segments management.
    • Removed segment toggle APIs
    • Removed list all segments in cluster APIs
    • Deprecated the following APIs:
      • GET /tables/{tableName}/segments
      • GET /tables/{tableName}/segments/metadata
      • GET /tables/{tableName}/segments/crc
      • GET /tables/{tableName}/segments/{segmentName}
      • GET /tables/{tableName}/segments/{segmentName}/metadata
      • GET /tables/{tableName}/segments/{segmentName}/reload
      • POST /tables/{tableName}/segments/{segmentName}/reload
      • GET /tables/{tableName}/segments/reload
      • POST /tables/{tableName}/segments/reload
  • Pull request #5054 deprecated the following task related APIs:
    • GET:
      • /tasks/taskqueues: List all task queues
      • /tasks/taskqueuestate/{taskType} -> /tasks/{taskType}/state
      • /tasks/tasks/{taskType} -> /tasks/{taskType}/tasks
      • /tasks/taskstates/{taskType} -> /tasks/{taskType}/taskstates
      • /tasks/taskstate/{taskName} -> /tasks/task/{taskName}/taskstate
      • /tasks/taskconfig/{taskName} -> /tasks/task/{taskName}/taskconfig
    • PUT:
      • /tasks/scheduletasks -> POST /tasks/schedule
      • /tasks/cleanuptasks/{taskType} -> /tasks/{taskType}/cleanup
      • /tasks/taskqueue/{taskType}: Toggle a task queue
    • DELETE:
      • /tasks/taskqueue/{taskType} -> /tasks/{taskType}
  • Deprecated modules pinot-hadoop and pinot-spark and replaced with pinot-batch-ingestion-hadoop and pinot-batch-ingestion-spark.
  • Introduced new Pinot batch ingestion jobs and yaml based job specs to define segment generation jobs and segment push jobs.
  • You may see exceptions like below in pinot-brokers during cluster upgrade, but it's safe to ignore them.
2020/03/09 23:37:19.879 ERROR [HelixTaskExecutor] [CallbackProcessor@b808af5-pinot] [pinot-broker] [] Message cannot be processed: 78816abe-5288-4f08-88c0-f8aa596114fe, {CREATE_TIMESTAMP=1583797034542, MSG_ID=78816abe-5288-4f08-88c0-f8aa596114fe, MSG_STATE=unprocessable, MSG_SUBTYPE=REFRESH_SEGMENT, MSG_TYPE=USER_DEFINE_MSG, PARTITION_NAME=fooBar_OFFLINE, RESOURCE_NAME=brokerResource, RETRY_COUNT=0, SRC_CLUSTER=pinot, SRC_INSTANCE_TYPE=PARTICIPANT, SRC_NAME=Controller_hostname.domain,com_9000, TGT_NAME=Broker_hostname,domain.com_6998, TGT_SESSION_ID=f6e19a457b80db5, TIMEOUT=-1, segmentName=fooBar_559, tableName=fooBar_OFFLINE}{}{}
java.lang...
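
The task endpoint renames from #5054 listed above can be captured as a lookup table, which may help when updating client scripts. The following is a minimal sketch transcribed from that list. Where the notes give no HTTP method for the new endpoint, the method is assumed to be unchanged; the two entries with no listed replacement (/tasks/taskqueues and the PUT toggle on /tasks/taskqueue/{taskType}) are left out. The function name migrate is hypothetical.

```python
import re

# (old method, old path template) -> (new method, new path template).
# Assumption: the HTTP method is unchanged unless the notes say otherwise.
TASK_API_RENAMES = {
    ("GET", "/tasks/taskqueuestate/{taskType}"): ("GET", "/tasks/{taskType}/state"),
    ("GET", "/tasks/tasks/{taskType}"): ("GET", "/tasks/{taskType}/tasks"),
    ("GET", "/tasks/taskstates/{taskType}"): ("GET", "/tasks/{taskType}/taskstates"),
    ("GET", "/tasks/taskstate/{taskName}"): ("GET", "/tasks/task/{taskName}/taskstate"),
    ("GET", "/tasks/taskconfig/{taskName}"): ("GET", "/tasks/task/{taskName}/taskconfig"),
    ("PUT", "/tasks/scheduletasks"): ("POST", "/tasks/schedule"),
    ("PUT", "/tasks/cleanuptasks/{taskType}"): ("PUT", "/tasks/{taskType}/cleanup"),
    ("DELETE", "/tasks/taskqueue/{taskType}"): ("DELETE", "/tasks/{taskType}"),
}


def _template_to_regex(template):
    # Turn each {name} placeholder into a named group matching one path segment.
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template)
    return re.compile(f"^{pattern}$")


def migrate(method, path):
    """Return (new_method, new_path) for a deprecated call, or None if not listed."""
    for (old_method, old_tpl), (new_method, new_tpl) in TASK_API_RENAMES.items():
        if method.upper() != old_method:
            continue
        match = _template_to_regex(old_tpl).match(path)
        if match:
            # Re-use the captured path parameters in the new template.
            return new_method, new_tpl.format(**match.groupdict())
    return None
```

A call such as migrate("PUT", "/tasks/scheduletasks") yields the POST /tasks/schedule replacement, while a path that is not in the table returns None.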

Apache Pinot (incubating) 0.2.0

11 Nov 22:16

Major changes/feature additions since 0.1.0:

  • Added support for Kafka 2.0.

  • Table rebalancer now supports a minimum number of serving replicas during rebalance.

  • Added support for UDF in filter predicates and selection.

  • Added support to use hex string as the representation of byte array for queries (#4041)

  • Added support for parquet reader (#3852)

  • Introduced interface stability and audience annotations (#4063)

  • Refactor HelixBrokerStarter to separate constructor and start() (#4100) - backwards incompatible

  • Admin tool for listing segments with invalid intervals for offline tables

  • Migrated to log4j2 (#4139)

  • Added simple avro msg decoder

  • Added support for passing headers in pinot client

  • Support transform functions with AVG aggregation function (#4557)

  • Configurations additions/changes

    • Allow customized metrics prefix (#4392)
    • Set controller.enable.batch.message.mode to false by default (#3928)
    • RetentionManager and OfflineSegmentIntervalChecker initial delays configurable (#3946)
    • Config to control kafka fetcher size and increase default (#3869)
    • Added a percent threshold to consider startup of services (#4011)
    • Make SingleConnectionBrokerRequestHandler as default (#4048)
    • Always enable default column feature, remove the configuration (#4074)
    • Remove redundant default broker configurations (#4106)
    • Removed some config keys in server (#4222)
    • Add config to disable HLC realtime segment (#4235)
    • The following config variables are deprecated and will be removed in the next release:
      • pinot.broker.requestHandlerType will be removed, in favor of using the "singleConnection"
        broker request handler. If you have set this configuration, please remove it and use the default
        type ("singleConnection") for broker request handler.
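
The hex-string representation of byte arrays mentioned in the list above (#4041) can be illustrated as follows. This is a minimal sketch: the table name, the column name, and the single-quoted literal form in the generated query are assumptions for illustration, not syntax confirmed by these notes.

```python
def bytes_to_hex_literal(value: bytes) -> str:
    """Encode a byte array as a lowercase hex string."""
    return value.hex()


def hex_literal_to_bytes(literal: str) -> bytes:
    """Decode a hex string back into the original byte array."""
    return bytes.fromhex(literal)


def build_bytes_filter_query(table: str, column: str, value: bytes) -> str:
    # Hypothetical table and column names are supplied by the caller.
    # Assumption: the hex string is passed as a single-quoted literal.
    return (
        f"SELECT COUNT(*) FROM {table} "
        f"WHERE {column} = '{bytes_to_hex_literal(value)}'"
    )
```

For instance, build_bytes_filter_query("myTable", "bytesColumn", b"\xde\xad\xbe\xef") embeds the hex string deadbeef in the filter.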

Work in progress

  • We are in the process of separating Helix and Pinot controllers, so that
    administrators can have the option of running independent Helix
    controllers and Pinot controllers.

  • We are in the process of moving towards supporting SQL query format and results.

  • We are in the process of separating instance and segment assignment using instance
    pools to optimize the number of Helix state transitions in Pinot clusters with
    thousands of tables.

Other important notes

  • Task management does not work correctly in this release, due to bugs in Helix.
    We will upgrade to Helix version 0.9.2 (or later) to get this fixed.

  • You must upgrade to this release before moving onto newer versions of Pinot
    release. The protocol between pinot-broker and pinot-server has been changed
    and this release has the code to retain compatibility moving forward.
    Skipping this release may (depending on your environment) cause query errors
    if brokers are upgraded and servers are in the process of being upgraded.

  • As always, we recommend that you upgrade controllers first, and then brokers
    and lastly the servers in order to have zero downtime in production clusters.

  • PR #4100 introduces a backwards incompatible change to pinot broker. If you
    use the Java constructor on HelixBrokerStarter class, then you will face a
    compilation error with this version. You will need to construct the object
    and call start() method in order to start the broker.

  • PR #4139 introduces a backwards incompatible change for log4j configuration.
    If you used a custom log4j configuration (log4j.xml), you need to write a new
    log4j2 configuration (log4j2.xml). In addition, you may need to change the
    arguments on the command line to start pinot components.

    If you used pinot-admin command to start pinot components, you don't need any
    change. If you used your own commands to start pinot components, you will
    need to pass the new log4j2 config as a jvm parameter (i.e. substitute
    -Dlog4j.configuration or -Dlog4j.configurationFile argument with
    -Dlog4j2.configurationFile=log4j2.xml).
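
For those who start Pinot components with their own commands, the flag substitution described above can be sketched as a small helper. This is an illustrative sketch only; the function name and the example arguments in the usage note are hypothetical. It replaces the old flags and does not add the new flag if neither old flag is present.

```python
def migrate_log4j_args(jvm_args, log4j2_config="log4j2.xml"):
    """Swap log4j 1.x config flags for the log4j2 flag named in the notes."""
    old_prefixes = ("-Dlog4j.configuration=", "-Dlog4j.configurationFile=")
    new_flag = f"-Dlog4j2.configurationFile={log4j2_config}"
    migrated = []
    replaced = False
    for arg in jvm_args:
        if arg.startswith(old_prefixes):
            # Emit the new flag once, in the position of the first old flag.
            if not replaced:
                migrated.append(new_flag)
                replaced = True
        else:
            migrated.append(arg)
    return migrated
```

Given ["-Xmx4G", "-Dlog4j.configuration=file:conf/log4j.xml"], the helper keeps -Xmx4G and substitutes the second argument with -Dlog4j2.configurationFile=log4j2.xml.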

Apache Pinot (incubating) 0.1.0

07 Mar 23:19

Pinot 0.1.0

This is the first official release of Apache Pinot.

Requirements

  • Java 8 (It has been reported that Pinot cannot be compiled with Java 11 due to a missing package for sun.misc, #3625)

Highlights

Pluggable Storage

We have added a Pinot filesystem abstraction that gives users the option to plug in their own storage backend. Currently, we support HDFS, NFS, and Azure Data Lake.

Pluggable Realtime Streams

We have decoupled Pinot from Kafka for realtime ingestion and added an abstraction for streams. Users can now add their own plugins to read from any pub-sub system.

Native Byte Array and TDigest Support

Pinot now supports the byte[] type natively. Pinot can also accept a byte-serialized TDigest object (com.tdunning.math.stats.TDigest), which can be queried to compute approximate percentiles as follows.

select percentileTDigest95(tDigestColumn) from myTable where ... group by... top N

Compatibility

Since this is the first official Apache release, these notes apply to people who have used Pinot in production by building from the source code.

Backward Incompatible Changes

  • All the packages have been renamed from com.linkedin to org.apache.
  • Method signatures have been changed in the interfaces ControllerRestApi and SegmentNameGenerator.

Rollback Restrictions

  • If you have used the PartitionAwareRouting feature, note that we have changed the format of partition values in the segment metadata and zk segment metadata (e.g. partitionRanges: [0 0], [1 1] -> partitions: 0, 1). The new code is backward compatible with the old format; however, old code will not be able to understand the new format in case of rollback.
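
The format change in the example above can be illustrated with a small converter. This is a hedged sketch, not Pinot's own conversion code: it assumes each "[a b]" entry is an inclusive range of partition ids, which the single example in these notes does not confirm, and the function name is hypothetical.

```python
import re


def ranges_to_partitions(partition_ranges: str) -> str:
    """Convert an old-style value such as "[0 0], [1 1]" to the new "0, 1" form.

    Assumption: "[a b]" denotes the inclusive range of partition ids a..b.
    """
    ids = []
    for start, end in re.findall(r"\[(\d+)\s+(\d+)\]", partition_ranges):
        for partition_id in range(int(start), int(end) + 1):
            # Keep first-seen order and skip duplicates.
            if partition_id not in ids:
                ids.append(partition_id)
    return ", ".join(str(p) for p in ids)
```

Applied to the value from the notes, ranges_to_partitions("[0 0], [1 1]") gives "0, 1".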

Deprecation

  • N/A

Credits

Thanks to everyone who has contributed to Pinot.

We would also like to express our gratitude to our Apache mentors @kishoreg, @felixcheung, @jimjag, @olamy and @rvs.