Data Analysis

Azure Analysis Services

  • PaaS
  • Integrated with Azure data platform services.
  • You can mash up and combine data from multiple sources, define metrics, and secure your data in a single, trusted semantic data model.
  • Handles
    • Security
    • In-memory cache
    • Data modeling
    • Lifecycle management
    • Business logic & metrics
  • Compatible with many features already in SQL Server Analysis Services Enterprise Edition
    • Supports tabular models at the 1200 and 1400 compatibility levels
    • Partitions, row-level security, bi-directional relationships, and translations are all supported.
    • In-memory and DirectQuery modes are also available for fast queries over massive and complex datasets.

Integrations

  • Data Sources
    • Cloud: E.g. SQL Database, Azure Synapse Analytics, Data Lake, HDInsight/Spark…
    • On-premises: E.g. SQL Server / Oracle…
  • Client tools
    • Cloud: Power BI
    • On-premises: Power BI Desktop, Excel, third-party tools

Tabular Object Model (TOM)

  • Client library that lets developers describe and manage the objects of SQL Server Analysis Services tabular models.
  • Exposed in JSON through the Tabular Model Scripting Language (TMSL) and the AMO data definition language.
    • TOM is built on AMO.
      • Analysis Management Objects (AMO) is a library of programmatically accessed objects that enables an application to manage an Analysis Services instance.
      • E.g. AMO has data mining classes
  • Has classes for models, relationships, roles, annotations, cultures, etc. to manage SQL Server Analysis Services objects.
  • The models it manages are structured in tabular form.
  • A tabular model arranges data elements in vertical columns and horizontal rows. Each cell is formed by the intersection of a column and a row.
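
Since TOM metadata is exposed as JSON through TMSL, a command can be assembled as an ordinary JSON document. The sketch below builds a TMSL `refresh` command in Python. The database name `SalesModel`, the table name `Date`, and the helper `build_refresh_command` are made up for illustration, and the script only produces the JSON text; sending it to a server is out of scope here.

```python
import json


def build_refresh_command(database: str, table: str, refresh_type: str = "full") -> str:
    """Return a TMSL refresh command as a JSON string.

    `database` and `table` are placeholders supplied by the caller;
    nothing here connects to an Analysis Services instance.
    """
    command = {
        "refresh": {
            "type": refresh_type,
            # Each object identifies what to refresh, here a single table.
            "objects": [{"database": database, "table": table}],
        }
    }
    return json.dumps(command, indent=2)


# Hypothetical names, for illustration only.
tmsl = build_refresh_command("SalesModel", "Date")
print(tmsl)
```

The resulting text could then be run through whichever tool is used to execute TMSL scripts against the model.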

HDInsight

  • Common use:
    1. Create an HDInsight cluster
    2. Schedule jobs
    3. Delete the HDInsight cluster
  • Azure distribution of Apache Hadoop components
    • Framework for processing and analysis of big data sets on clusters.
    • Including Apache Hive, HBase, Spark, Kafka, Storm, R and many others.
      • Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications.
  • Built on top of Azure Storage
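
The create, run, delete pattern listed under "Common use" can be sketched as follows. This is a minimal illustration and not the Azure SDK: `FakeClusterClient` and its methods are hypothetical stand-ins that only record calls, and the cluster and job names are invented. The point is the `try`/`finally` shape, which deletes the cluster even when a job fails.

```python
from typing import List


class FakeClusterClient:
    """Hypothetical stand-in for a real provisioning client; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def create_cluster(self, name: str) -> None:
        self.calls.append(f"create:{name}")

    def submit_job(self, name: str, job: str) -> None:
        self.calls.append(f"job:{name}:{job}")

    def delete_cluster(self, name: str) -> None:
        self.calls.append(f"delete:{name}")


def run_on_transient_cluster(client: FakeClusterClient, name: str, jobs: List[str]) -> None:
    """Create a cluster, run the jobs, and always delete the cluster afterwards."""
    client.create_cluster(name)
    try:
        for job in jobs:
            client.submit_job(name, job)
    finally:
        # Runs even if a job raises, so the cluster is never left behind.
        client.delete_cluster(name)


client = FakeClusterClient()
run_on_transient_cluster(client, "demo-cluster", ["hive-etl", "spark-report"])
print(client.calls)
```

Because the cluster is built on top of Azure Storage, data written there can outlive the cluster, which is what makes a short-lived cluster practical.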

Azure Data Catalog

  • A single, central place for all of an organization's users to contribute their knowledge and build a community and culture of data.
    • It includes a crowdsourcing model of metadata and annotations.
      • Descriptive metadata supplements the structural metadata (such as column names and data types) that's registered from the data source.
    • The data remains in its existing location, but a copy of its metadata is added to Data Catalog, along with a reference to the data-source location.
    • The metadata is also indexed to make each data source easily discoverable via search and understandable to the users who discover it.
  • Any user (analyst, data scientist, or developer) can discover, understand, and consume data sources.
    • Users can contribute to the catalog by tagging, documenting, and annotating data sources that have already been registered.
    • They can also register new data sources, which can then be discovered, understood, and consumed by the community of catalog users.
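
The registration model described above (data stays in place, the catalog keeps structural metadata plus a location reference, and users add descriptive annotations) can be illustrated with a toy in-memory catalog. This is not the Data Catalog API: `CatalogEntry`, `ToyCatalog`, the example location string, and the tags are all assumptions made for the sketch, and the search is a plain linear scan rather than a real index.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class CatalogEntry:
    name: str
    location: str                     # reference to where the data actually lives
    columns: Dict[str, str]           # structural metadata: column name -> data type
    tags: Set[str] = field(default_factory=set)  # crowdsourced descriptive metadata
    description: str = ""


class ToyCatalog:
    """Stores metadata only; the underlying data is never copied."""

    def __init__(self) -> None:
        self.entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self.entries[entry.name] = entry

    def annotate(self, name: str, tags: Iterable[str] = (), description: str = "") -> None:
        entry = self.entries[name]
        entry.tags.update(tag.lower() for tag in tags)
        if description:
            entry.description = description

    def search(self, term: str) -> List[str]:
        """Return names of entries whose metadata mentions the term."""
        term = term.lower()
        hits = []
        for entry in self.entries.values():
            text = " ".join([entry.name, entry.description, *entry.columns, *entry.tags])
            if term in text.lower():
                hits.append(entry.name)
        return sorted(hits)


catalog = ToyCatalog()
catalog.register(
    CatalogEntry(
        name="orders",
        location="sqlserver://example-host/sales/orders",  # made-up location
        columns={"order_id": "int", "amount": "decimal"},
    )
)
catalog.annotate("orders", tags=["Finance"], description="One row per customer order")
print(catalog.search("finance"))
```

A search on a tag or a column name finds the entry, and the stored `location` tells the user where to go for the data itself.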