JensRantil/flatten-sstabledump

 
 


flatten-sstabledump

flatten-sstabledump is a utility that unwraps Apache Cassandra's sstabledump JSON array output into many small JSON objects, one per row. Each row carries the metadata of its partition inline (under the key partition). This makes it easy to process sstables with map/reduce-style tools such as Apache Hive, AWS Athena, or Google BigQuery.

The utility parses the JSON in a streaming fashion using very little memory.

Installation

Install by issuing:

$ go get github.com/tink-ab/flatten-sstabledump

Testing

$ go test

Usage

$ cat testdata/testdata1.json
[
  {
    "partition" : {
      "key" : [ "d7f50415-3c9e-4a84-bdf2-54cbcbb0df0b", "201806" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 408,
        "clustering" : [ "35136e4c-ffa2-4205-82c1-7ce63d2519b9" ],
        "liveness_info" : { "tstamp" : "2018-06-19T11:20:49.363Z" },
        "cells" : [
          { "name" : "accountid", "value" : "457f21b5-69c0-48bc-bffa-037d88c8ecf8" }
        ]
      },
      {
        "type" : "row",
        "position" : 408,
        "clustering" : [ "203b5189-d9e1-4db7-b00c-c1b759790b8f" ],
        "liveness_info" : { "tstamp" : "2018-06-18T11:20:49.363Z" },
        "cells" : [
          { "name" : "accountid", "value" : "ee312163-75bf-4df5-94de-f34146efa502" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "70c2ca4a-84f5-4cc2-b44a-e2f92b4888fb", "201806" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 408,
        "clustering" : [ "e1e6fbe4-4ec4-498d-b892-c00f7667bbc8" ],
        "liveness_info" : { "tstamp" : "2018-08-18T11:20:49.363Z" },
        "cells" : [
          { "name" : "accountid", "value" : "501cde77-cf60-468e-954f-e987e7490d4c" }
        ]
      }
    ]
  }
]
$ cat testdata/testdata1.json | fss
{"cells":[{"name":"accountid","value":"ee312163-75bf-4df5-94de-f34146efa502"}],"clustering":["203b5189-d9e1-4db7-b00c-c1b759790b8f"],"liveness_info":{"tstamp":"2018-06-18T11:20:49.363Z"},"partition":{"partition":{"key":["d7f50415-3c9e-4a84-bdf2-54cbcbb0df0b","201806"],"position":0}},"position":408,"type":"row"}
{"cells":[{"name":"accountid","value":"457f21b5-69c0-48bc-bffa-037d88c8ecf8"}],"clustering":["35136e4c-ffa2-4205-82c1-7ce63d2519b9"],"liveness_info":{"tstamp":"2018-06-19T11:20:49.363Z"},"partition":{"partition":{"key":["d7f50415-3c9e-4a84-bdf2-54cbcbb0df0b","201806"],"position":0}},"position":408,"type":"row"}
{"cells":[{"name":"accountid","value":"501cde77-cf60-468e-954f-e987e7490d4c"}],"clustering":["e1e6fbe4-4ec4-498d-b892-c00f7667bbc8"],"liveness_info":{"tstamp":"2018-08-18T11:20:49.363Z"},"partition":{"partition":{"key":["70c2ca4a-84f5-4cc2-b44a-e2f92b4888fb","201806"],"position":0}},"position":408,"type":"row"}

FAQ

Can't you use AWS Athena to simply process raw sstabledump JSON output?

No. AWS Athena requires line-delimited JSON, with each JSON entity on a single line, whereas sstabledump spreads its JSON output across multiple lines. Compacting the output with something like jq -c . doesn't help, because AWS Athena requires the root of each JSON entity to be an object, not an array.

Also, if a whole partition ends up as a single row, large partitions make each row huge and require a lot of memory in the mapper. This might also make Athena run out of memory.

Won't the output from fss be huge?

Yes. :-) That's why you want to compress its output. Or, even better, convert it to compressed Avro, ORC or Parquet.

What is the format of these JSON objects?

Have a look at these two links (which also describe tombstones):
