INSTATE (Multidimensional indexing for reliable and scalable IoT data management) aims to provide a reliable and scalable data management architecture for IoT applications, using multidimensional indexing to support efficient querying, search, and analytics over the data.
The INSTATE code adds the automation needed to deploy Qbeast Spark on AWS (or another cloud provider's architecture), so that IoT data can be streamed into object storage (in this case, S3) and organized efficiently with the Qbeast Layout.
The central pieces of the architecture are:
- Streaming Source. The source can be any type of IoT device that is continuously generating data, such as images, device activity, or geolocation.
- Spark Streaming App. Set up and configure a Spark Streaming application that reads the generated data and writes it using an optimized Qbeast layout.
- Qbeast Layout. Organization of S3 files for faster and more resource-efficient retrieval. (Find all the specifications for the format at https://github.com/Qbeast-io/qbeast-spark)
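The three pieces above can be sketched as a small Structured Streaming job. This is a minimal sketch under stated assumptions, not the INSTATE implementation: Spark's built-in `rate` source stands in for an IoT device, the `device_id` and `temperature` columns and both `/tmp` paths are hypothetical, and each micro-batch is written with the ordinary batch writer through `foreachBatch` in append mode. It is meant to be pasted (with `:paste`) into a Spark shell that already has the Qbeast packages loaded.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, rand}
import org.apache.spark.sql.streaming.Trigger

// Stand-in for an IoT source: the built-in "rate" source emits
// (timestamp, value) rows at a fixed rate.
val readings = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "100")
  .load()
  .withColumn("device_id", (col("value") % 10).cast("int")) // hypothetical column
  .withColumn("temperature", rand() * 40)                   // hypothetical column

// Hypothetical output location; an s3a:// URI could be used instead
// if the S3 connector and credentials are configured.
val tablePath = "/tmp/qbeast_stream_test"

// Write each non-empty micro-batch in Qbeast format, appending to the table.
// The explicit function type avoids the foreachBatch overload ambiguity in Scala 2.12.
val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
  if (!batch.isEmpty) {
    batch.write
      .format("qbeast")
      .mode("append")
      .option("columnsToIndex", "device_id,temperature")
      .save(tablePath)
  }

val query = readings.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "/tmp/qbeast_stream_checkpoint")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
```

Calling `query.stop()` ends the stream. The written table can then be read back with `spark.read.format("qbeast").load(tablePath)`.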
The core of INSTATE is Qbeast Format: a layout format that organizes the information in files using indexing and sampling techniques.
To get started with Qbeast Format, you can use its first open-source reference implementation, which targets Apache Spark. Download Spark and launch a Spark shell with the Qbeast packages:
wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz
tar -xzvf spark-3.4.2-bin-hadoop3.tgz
export SPARK_HOME=$PWD/spark-3.4.2-bin-hadoop3
$SPARK_HOME/bin/spark-shell \
--packages io.qbeast:qbeast-spark_2.12:0.5.0,io.delta:delta-core_2.12:2.4.0 \
--conf spark.sql.extensions=io.qbeast.spark.internal.QbeastSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=io.qbeast.spark.internal.sources.catalog.QbeastCatalog
Inside the Spark shell, write a small DataFrame in Qbeast format, then read it back and filter on the indexed columns:
val data = Seq((1, "a", 10), (2, "b", 20), (3, "c", 30)).toDF("id", "name", "age")
data.write.format("qbeast").option("columnsToIndex", "id,age").save("/tmp/qbeast_test")
val indexed_data = spark.read.format("qbeast").load("/tmp/qbeast_test")
indexed_data.filter("id > 2 and age > 20").show()
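Since the format is built around sampling as well as indexing, a sampled read is a natural next step. The following is a minimal sketch that assumes the table written to `/tmp/qbeast_test` above still exists; it uses only the standard `Dataset.sample` and `explain` calls. On a Qbeast table, the sample is expected to be resolved while reading files rather than by a separate sampling operator, which the printed plan should let you check; treat that as something to verify on your setup rather than a guarantee.

```scala
// Re-load the table written in the example above.
val qbeastDf = spark.read.format("qbeast").load("/tmp/qbeast_test")

// Request an approximate 10% sample of the rows.
val sampled = qbeastDf.sample(0.1)

// Print the logical and physical plans to inspect how the sample is executed.
sampled.explain(true)

// With only three rows in the table, the sample may well be empty.
sampled.show()
```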
In the notebooks folder, you will find usage examples based on public IoT datasets.