GoogleCloudPlatform/datashare-toolkit

Datashare Toolkit

DIY commercial datasets on Google Cloud Platform

This is not an officially supported Google product.

The Datashare Toolkit is a solution for data publishers to easily manage datasets residing within BigQuery. The toolkit includes functionality to ingest and entitle data, relieving consumers from much of the toil involved in onboarding datasets from a variety of providers. Publishers upload data files to a storage bucket and allocate permissioned datasets for their consumers to use with BigQuery authorized views.

While these tools are used for data management and entitlement, they follow a bring-your-own-license (BYOL) model for entitling publisher data. Publishers should therefore already have licensing arrangements with the consumers wishing to access their data within GCP, and those consumers must furnish the GCP account IDs corresponding to their entitled user principals. These account IDs are required for the creation of the authorized views.
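
To make the role of the account IDs concrete, here is a minimal sketch of the access entries behind a BigQuery authorized view, written as plain dictionaries in the shape of the `access` list on a BigQuery REST API dataset resource. The function names and the project, dataset, and view IDs are hypothetical, and this is an illustration of the BigQuery concept rather than a description of how Datashare applies entitlements internally.

```python
from typing import Dict


def consumer_access_entry(email: str, is_group: bool = False) -> Dict[str, str]:
    """Read access for one consumer on the dataset that holds the view.

    `email` is the consumer's Google Account or Google Group address.
    """
    key = "groupByEmail" if is_group else "userByEmail"
    return {"role": "READER", key: email}


def authorized_view_entry(project_id: str, dataset_id: str, view_id: str) -> Dict[str, Dict[str, str]]:
    """Entry added to the SOURCE dataset so the view may read its tables.

    Consumers never receive direct access to the source dataset itself.
    """
    return {
        "view": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": view_id,
        }
    }


# Hypothetical identifiers, for illustration only.
shared_dataset_access = [
    consumer_access_entry("analyst@example.com"),
    consumer_access_entry("quant-team@example.com", is_group=True),
]
source_dataset_access = [
    authorized_view_entry("examplecorp-public", "shared_views", "prices_v1"),
]
```

In this arrangement the consumer principals are listed only on the dataset containing the view, while the source dataset authorizes the view rather than any individual user.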

The toolkit is open source. As a prerequisite, publishers must maintain a limited amount of supporting infrastructure within GCP, such as storage buckets, serverless functions, and BigQuery datasets. As a consumer, once your GCP accounts are added to the publisher's entitlements, the published data can be queried directly within BigQuery, ready to integrate into your analytics workflow, machine learning model, or runtime application. Consumers are billed for BigQuery compute and networking, while publishers incur costs only for the storage of their data in BigQuery and Cloud Storage.

Key Features

Getting started with Datashare

If you plan to use GCP Marketplace integration, the production project that you install and manage Datashare from must follow the required naming convention (punctuation and spaces not allowed): [yourcompanyname]-public.
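
The naming convention above can be checked before creating the project. The following is a minimal sketch, not part of the toolkit: the function name is hypothetical, and it assumes the company-name portion consists only of lowercase letters and digits. It also applies the general GCP rule that project IDs are 6 to 30 characters long.

```python
import re

# Assumption: "[yourcompanyname]" is lowercase letters and digits only,
# starting with a letter, followed by the literal suffix "-public".
_MARKETPLACE_PROJECT_PATTERN = re.compile(r"^[a-z][a-z0-9]*-public$")


def is_valid_marketplace_project_id(project_id: str) -> bool:
    """Return True if project_id looks like [yourcompanyname]-public."""
    # GCP project IDs must be between 6 and 30 characters long.
    if not 6 <= len(project_id) <= 30:
        return False
    return _MARKETPLACE_PROJECT_PATTERN.fullmatch(project_id) is not None


if __name__ == "__main__":
    for candidate in ("examplecorp-public", "example corp-public", "example.corp-public"):
        print(candidate, is_valid_marketplace_project_id(candidate))
```

Only the first candidate passes; the other two are rejected because of the space and the period in the company-name portion.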

  1. Install Datashare
  2. Initialize Schema

Once both steps are complete, see the User Guide for usage information.

Requirements

Publishers

  • A GCP account with billing enabled
  • A Google Cloud Storage bucket to store staged data

Consumers

  • A valid Google Account or Google Group email address (this includes G Suite and Gmail email addresses).
    Note: Consumers can create a Google Account with an existing email address here
  • Entitlements granted by the publisher to your specific licensed datasets

Architecture

[Figure: Architecture diagram]

Disclaimers

This is not an officially supported Google product.

Datashare is under active development. Interfaces and functionality may change at any time.

License

This repository is licensed under the Apache 2 license (see LICENSE).

Contributions are welcome. See CONTRIBUTING for more information.