Papers4DataAchitect

Background

There are so many kinds of distributed data store systems , distributed compute systems, distributed machine learning system in DT times.

As a application engineer, you may use RDBMS,NoSQL even NewSQL to store and manage data.
As a data enginner, you may
collect data first
- extract data from app's log file
- capture data chang in RDBMS or NoSQL database,
- crawl data from various web sites
- pull data from third party data vendor through web service api
- massive time sequence data from IOT frontend or some sensor such as car-net or monitor-camera with AI enhancement.
clean and transform data next
- use some ETL utility or run-time stream process system such as flink, kafka.
analyze and training data at last
- analyzed in spark/SQL on hadoop/OLAP datawarehouse, generate data report and visual the result use some tools such as tableau.
- training machine learning models in a distributed machine learning system such as spark ML-Lib, Angel. These model will make some adervtise CTR inference or user recommendation.

Purpose

It is a key ability to work efficently with the different utility for big data/ML pipeline. However, various tools for big data and machine learning are more complicated and more complex. These tools is neither mature as traditional RDBMS nor simple as local algorithm library sucha as sk-learn . Sometimes digging deep into the implemention details of a distributed data store/process/training system may be hard and unnecessary. Nevertheless, understanding some common sense of the software stack will be a great help.

Different system do have some common sense and design patern. It it a good idea to read the original paper which describles the background ,key algorithm,author's consideration. For a data architect or algorithm engineer, reading these paper may be a greate help.

This repository wll collect and classify these papers as a user guide for data/algorithm enginner/architectur.

How to read

A short comment w available as user guide for every paper. The comment consist of background description, abstraction, and contrast between similar systems.

Contents

Online Analysis Process
New SQL DB
Run-time Streaming Compute
Graph Compute
Distributed Machine Learning Paradigm
LinkedIn Big Data Stack

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
GraphCompute		GraphCompute
OLAP		OLAP
distributedML		distributedML
linkedin		linkedin
newSQL		newSQL
runtimeStreamCompute		runtimeStreamCompute
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GraphCompute

GraphCompute

OLAP

OLAP

distributedML

distributedML

linkedin

linkedin

newSQL

newSQL

runtimeStreamCompute

runtimeStreamCompute

.gitignore

.gitignore

README.md

README.md

Repository files navigation

Papers4DataAchitect

Background

Purpose

How to read

About

Releases

Packages

loadwiki/Papers4DataAchitect

Folders and files

Latest commit

History

Repository files navigation

Papers4DataAchitect

Background

Purpose

How to read

About

Topics

Resources

Stars

Watchers

Forks