DataHub-workshop

This repository is a companion guide to the DataHub Workshop held at the Knowledge Media Institute (KMi, Open University, Milton Keynes, UK) on the 12th of June 2018. The workshop is organised within the framework and planned activities of the EU project CityLabs (http://www.citylabs.org.uk).

Agenda

08:30 - 09:00 Registration and Intro
09:00 - 10:30 Stream data and IoT: from capturing sensor data to their visualisation. A session to showcase a set of technologies for acquiring, modeling and managing big data streamed by sensors and IoT devices
10:30 - 12:00 Revealing insights with Data Science techniques. Showing how to use basic Machine Learning techniques (regression, classification, clustering) to reveal patterns hidden in data
12:00 - 13:30 Lunch
13:30 - 14:00 Big Data processing and analytics. Introducing basic principles of designing and working with large tables. The hands-on session will show how to load, process, and query large data tables using the Hadoop stack for tabular data (Hue, Pig and Hive)
14:00 - 15:30 “Datawareness” - big data and privacy challenges. Exploring privacy concerns emerging from big data processing, and introducing the principles of the newly implemented European regulation on privacy (EU GDPR)
15:30 - 16:00 Wrap-up

Sessions

1. Stream data and IoT: from capturing sensor data to their visualisation

Chairs: Andrea Mannocci, Niaz Chowdhuri
Abstract: In this session we will showcase a set of technologies for acquiring big data, modeling and managing them in a scalable way, and producing cogent analytics. In particular we will: i) demonstrate how to use sensors and IoT devices to create data streams from remote locations and collect those data at the datahub end; ii) showcase the capabilities of the ELK stack (Elasticsearch + Kibana, in this case) for quickly prototyping lightweight interfaces and making sense of your data.
1st part: Sensors and IoT devices with hands-on (30')
2nd part: Lightweight analytics and visualisation with hands-on (60')
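
As a rough illustration of the step between collecting sensor readings and exploring them in Kibana, the sketch below turns a few readings into the newline-delimited JSON body accepted by the Elasticsearch bulk API. The index name `workshop-sensors`, the field names and the sample values are assumptions made up for this example, not the ones used in the session. Depending on the Elasticsearch version, the action line may also need a `_type` field.

```python
import json
from datetime import datetime, timezone


def to_bulk_body(readings, index_name):
    """Build a newline-delimited JSON body for the Elasticsearch bulk API.

    Each document is preceded by an action line naming the target index.
    """
    lines = []
    for reading in readings:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(reading))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical readings; a real device would stream these continuously.
readings = [
    {
        "sensor_id": "demo-1",
        "timestamp": datetime(2018, 6, 12, 9, 0, tzinfo=timezone.utc).isoformat(),
        "temperature_c": 21.5,
    },
    {
        "sensor_id": "demo-2",
        "timestamp": datetime(2018, 6, 12, 9, 1, tzinfo=timezone.utc).isoformat(),
        "temperature_c": 22.1,
    },
]

bulk_body = to_bulk_body(readings, "workshop-sensors")
print(bulk_body)
```

The resulting text could then be sent to the cluster's `_bulk` endpoint with the `Content-Type: application/x-ndjson` header; the sending step is left out here because the cluster's address and credentials are not part of this guide.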

2. Revealing insights with Data Science techniques

Chairs: Emanuele Bastianelli, Alessio Antonini
Abstract: The focus of this session will be on basic Machine Learning (ML) techniques for Data Analysis. We will show their application to a common dataset of interest, and introduce fundamental concepts for assessing the quality of the analysis. The hands-on part will show how to use a common off-the-shelf library to perform some basic data analysis in Python.
Presentation (30’): Basic principles of Machine Learning (regression, classification, clustering).
Hands-on (1h): 2-3 ML solutions to extract useful patterns from data, using scikit-learn (http://scikit-learn.org/stable/)
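
To give a flavour of what such a hands-on might look like, here is a minimal sketch using scikit-learn on the Iris dataset bundled with the library. The dataset and the choice of models (logistic regression and k-means) are assumptions for illustration, not necessarily what the session uses. It shows a classification model evaluated on held-out data, followed by clustering of the same samples without labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris ships with scikit-learn, so no download is needed.
X, y = load_iris(return_X_y=True)

# Classification: hold out a test split to estimate quality on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Clustering: group the same samples without looking at the labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

print(f"held-out accuracy: {accuracy:.2f}")
print("cluster sizes:", sorted(int((cluster_labels == k).sum()) for k in range(3)))
```

Scoring on a held-out split rather than on the training data is one simple way to judge the quality of an analysis, since it measures how well the model generalises to samples it has not seen.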

3. Working with large tables: Big Data processing and analytics

Chairs: Enrico Daga
Abstract: Introducing basic principles of designing and working with large tables. The hands-on session will show how to load, process, and query large data tables using the Hadoop stack for tabular data (Hue, Pig and Hive).
Presentation (30’): Hadoop, Pig, Hive + Operations on large datasets
Hands-on (1h): Using the Big Data cluster on three datasets:

  • from session 1
  • Text analytics (e.g. Gutenberg)
  • A chunk of text from Secklow Sounds
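
The session itself uses Pig and Hive on the cluster. As a small-scale stand-in, the Python sketch below shows the kind of group-and-count aggregation that text analytics on such datasets typically starts from. The function name `word_counts` and the sample lines are hypothetical and exist only for this example.

```python
import re
from collections import Counter


def word_counts(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Lowercase and keep alphabetic runs (with inner apostrophes) as words.
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)*", line.lower()))
    return counts


# Stand-in for lines read from a large text file.
sample_lines = [
    "It was the best of times,",
    "it was the worst of times.",
]

top_words = word_counts(sample_lines).most_common(3)
print(top_words)
```

On a cluster the same idea is expressed declaratively, for instance as a grouped count, and the engine distributes the work across nodes instead of looping in a single process.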

4. “Datawareness” - big data and privacy challenges

Chairs: Pinelopi Troullinou, Angelo Salatino
Abstract: In this session, we will explore privacy concerns emerging from the processing of big data, using the recent Cambridge Analytica Ltd. scandal as a case study. Furthermore, we will introduce the basic principles of the newly implemented European regulation on privacy (EU GDPR) and its limitations.
Presentation (30’): data ethics, data licensing and policies
No hands-on.

Links

SSH → workshop.bigdata.kmi.org
JupyterHub → http://workshop.bigdata.kmi.org:8000
Kibana → http://workshop.bigdata.kmi.org:5601