
S3 file reader support #32 (Draft)

jankatins wants to merge 6 commits into master.

Conversation
jankatins (Member) commented:

Refactors the file reader and adds an S3 file reader as an alternative to local file reads.

New commands:

  • data_integration.parallel_tasks.files.ParallelReadS3File: reads in a whole bucket
  • data_integration.commands.files.ReadS3File: reads a single file from S3
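For the single-file case, the command needs to split an `s3://bucket/key` URI into bucket and key before it can fetch the object. A minimal sketch of such a helper, using only the standard library (the name `parse_s3_uri` is hypothetical, not part of this PR):

```python
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> tuple:
    """Split an 's3://bucket/key' URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        raise ValueError(f'not an s3 URI: {uri}')
    # urlparse puts the bucket in netloc and the key in path
    # (with a leading '/' that is not part of the object key)
    return parsed.netloc, parsed.path.lstrip('/')
```

The actual read could then be done with any S3 client (e.g. boto3's `get_object`) against the returned bucket and key.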

From initial testing, this is a lot slower than syncing the bucket and reading from the local file system (both iterating over the bucket to get the file list and the individual reads are slow), but syncing the bucket to a local filesystem also takes time. From my perspective, this is only worth it if you would otherwise have to do a "sync to local" on every run (which we have to do, since there are no volumes in our ETL container :-(); from the second run onwards, reading directly from S3 then saves time compared to a sync plus an incremental read via the file system. That's the theory, at least; so far I have only tested locally.

The single-file read will also come in handy as a replacement for Google Sheets imports.

WIP...

```diff
-    initial_node: Node = None
-    final_node: Node = None
+    initial_node: Task = None
+    final_node: Task = None

     def __init__(self, id: str,
```
Member:

One can also add a pipeline or a ParallelTask as the initial / final node.

jankatins (Member, Author):

Not really: there are places which expect a Task; at least, I had places where IntelliJ complained that a method wasn't available.

jankatins (Member, Author):

Fixed it in a different way.

```diff
         self.initial_node = node
         for downstream in self.nodes.values():
             if not downstream.upstreams and downstream != self.final_node:
                 self.add_dependency(node, downstream)
         self.add(node)

-    def add_final(self, node: Node) -> 'Pipeline':
+    def add_final(self, node: Task) -> 'Pipeline':
         self.final_node = node
```
Member:

Same here.

jankatins (Member, Author):

Fixed it in a different way.
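The `add_initial` wiring above can be illustrated with a minimal stand-in model (these classes are deliberately simplified, not the actual mara_pipelines implementation): the new initial node becomes an upstream of every node that previously had no upstreams.

```python
class Node:
    """Simplified stand-in for a pipeline node."""
    def __init__(self, id: str):
        self.id = id
        self.upstreams = set()


class Pipeline:
    """Simplified stand-in showing only the add_initial wiring."""
    def __init__(self):
        self.nodes = {}
        self.initial_node = None
        self.final_node = None

    def add(self, node: Node):
        self.nodes[node.id] = node
        return self

    def add_dependency(self, upstream: Node, downstream: Node):
        downstream.upstreams.add(upstream)

    def add_initial(self, node: Node):
        # make `node` an upstream of every current root node
        self.initial_node = node
        for downstream in self.nodes.values():
            if not downstream.upstreams and downstream != self.final_node:
                self.add_dependency(node, downstream)
        self.add(node)
        return self


# usage: `a` is a root, `b` depends on `a`; the initial node is wired before `a` only
pipeline = Pipeline()
a, b = Node('a'), Node('b')
pipeline.add(a).add(b)
pipeline.add_dependency(a, b)
pipeline.add_initial(Node('init'))
```

After `add_initial`, only the former root `a` gains the initial node as an upstream; `b` already had `a` upstream and is left alone.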

martin-loetzsch (Member) left a comment:

Looks very good otherwise. Please squash.

Let's wait with a release for the other PR.

```python
class ReadS3File(_ReadFile):
    """Reads data from a S3 file"""

    def __init__(self, s3_url: str, compression: Compression, target_table: str,
```
I'd think the parameter s3_url should be called s3_uri, following the `aws s3 cp` command. A URL is always a URI, but not all URIs are URLs. See also the Wikipedia article on URLs.
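Whatever the parameter ends up being called, the standard library's `urlparse` accepts the `s3` scheme and exposes the URI components, so the naming is purely a documentation question (illustrative only, not code from this PR):

```python
from urllib.parse import urlparse

# an s3 URI parses like any other scheme://authority/path URI:
# the bucket lands in netloc, the object key in path
parts = urlparse('s3://my-bucket/path/to/file.csv')
bucket = parts.netloc
key = parts.path.lstrip('/')
```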

# Conflicts:
#	mara_pipelines/commands/files.py
@ghost added the enhancement label on Sep 15, 2020
@martin-loetzsch (Member):

@jankatins is this running in production?

@jankatins (Member, Author):

@martin-loetzsch Nope. It should also be integrated into https://github.com/mara/mara-storage, where this looks much easier to do.

Labels: enhancement (New feature or request)

3 participants