Introduce FileIOManager and FileIO implementations for HDFS and Local Storage #96

Merged
merged 6 commits into from May 6, 2024

@HotSushi HotSushi commented May 2, 2024

Summary

Laying foundations for storage part 4: FileIOManager and FileIO implementations for HDFS and Local

FileIOManager interface looks like:

interface FileIOManager {
   // Returns the FileIO configured for the given storage type.
   FileIO getFileIO(Type storageType);
}

This interface is accompanied by ConfigureFileIO which sets up FileIOs for all "configured" storages.
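To make the shape of the interface concrete, here is a minimal sketch of how a manager could hand out one FileIO per configured storage type. It is not the code in this PR: the `Type` enum values, the reduced `FileIO` stand-in, `MapBackedFileIOManager`, and `register` are all hypothetical names chosen for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

public class FileIOManagerSketch {
  /** Hypothetical storage types; the PR only names HDFS and local storage. */
  enum Type { HDFS, LOCAL }

  /** Stand-in for the real FileIO type, reduced to a name for illustration. */
  interface FileIO {
    String name();
  }

  /** The interface from the summary above, with the parameter named. */
  interface FileIOManager {
    FileIO getFileIO(Type storageType);
  }

  /** Assumed implementation: a lookup over FileIOs registered for configured storages. */
  static final class MapBackedFileIOManager implements FileIOManager {
    private final Map<Type, FileIO> fileIOs = new EnumMap<>(Type.class);

    /** Registers the FileIO for one configured storage type. */
    void register(Type storageType, FileIO fileIO) {
      fileIOs.put(storageType, fileIO);
    }

    @Override
    public FileIO getFileIO(Type storageType) {
      FileIO fileIO = fileIOs.get(storageType);
      if (fileIO == null) {
        // Storage types that were never configured have no FileIO to return.
        throw new IllegalArgumentException(
            "No FileIO configured for storage type: " + storageType);
      }
      return fileIO;
    }
  }

  public static void main(String[] args) {
    MapBackedFileIOManager manager = new MapBackedFileIOManager();
    manager.register(Type.LOCAL, () -> "local");
    System.out.println(manager.getFileIO(Type.LOCAL).name());
  }
}
```

Failing fast on an unconfigured type is one possible design choice; returning an empty `Optional` would be another.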

To avoid breaking production systems, this PR leaves the existing FileIO instances in place instead of replacing them.

To learn the motivation behind these changes, please see this doc.

Next steps

  1. Deploy new services with both the new and the old cluster yaml (so new and old fileIOs run side by side):
- storages
    - newconfs
- storage
    - oldconfs
  2. Refactor safely: remove the old usages (remove old fileIOs and use new fileIOs).
  3. Switch to the new cluster yaml completely:
- storages
    - newconfs
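The cutover above can be pictured as a check on which top-level keys the cluster yaml carries. The sketch below is only an illustration under that assumption: the keys `storages` and `storage` come from the plan, while `ClusterYamlCutoverSketch`, `FileIOSet`, and `fileIOSetsToInitialize` are hypothetical names, and the yaml is modeled as an already-parsed map.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class ClusterYamlCutoverSketch {
  /** Hypothetical labels for the new and old groups of fileIOs. */
  enum FileIOSet { NEW, OLD }

  /**
   * Decides which fileIO groups to initialize from the top-level keys
   * of an already-parsed cluster yaml (assumed layout, see the plan above).
   */
  static Set<FileIOSet> fileIOSetsToInitialize(Map<String, ?> clusterYaml) {
    Set<FileIOSet> sets = EnumSet.noneOf(FileIOSet.class);
    if (clusterYaml.containsKey("storages")) {
      sets.add(FileIOSet.NEW); // new-style config present
    }
    if (clusterYaml.containsKey("storage")) {
      sets.add(FileIOSet.OLD); // old-style config still present
    }
    return sets;
  }

  public static void main(String[] args) {
    // Step 1: both sections are deployed, so both groups are initialized.
    Map<String, Object> duringMigration =
        Map.of("storages", "newconfs", "storage", "oldconfs");
    // Final step: only the new section remains.
    Map<String, Object> afterCutover = Map.of("storages", "newconfs");

    System.out.println(fileIOSetsToInitialize(duringMigration));
    System.out.println(fileIOSetsToInitialize(afterCutover));
  }
}
```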

Changes

Testing Done

  • Added new tests for the changes made.
  • Docker tests: we changed cluster.yaml in the docker setup and verified that the server boots with both the old and the new config.
/infra/recipes/docker-compose/oh-hadoop-spark> docker compose up -d
/infra/recipes/docker-compose/oh-hadoop-spark> docker exec -it local.spark-master /bin/bash

scala> spark.sql("CREATE TABLE openhouse.db.tb (ts timestamp, col1 string, col2 string) PARTITIONED BY (days(ts))").show()
++
||
++
++


scala> spark.sql("INSERT INTO TABLE openhouse.db.tb VALUES (date_sub(CAST(current_timestamp() as DATE), 30), 'val1', 'val2')")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("SELECT * FROM openhouse.db.tb").show()
+-------------------+----+----+
|                 ts|col1|col2|
+-------------------+----+----+
|2024-04-02 00:00:00|val1|val2|
+-------------------+----+----+

We can observe logs like:

INFO 9 --- [           main] c.l.o.c.s.h.HdfsStorageClient            : Initializing storage client for type:..

jainlavina previously approved these changes May 2, 2024

@sumedhsakdeo sumedhsakdeo left a comment

Thanks @HotSushi for the PR. I have a few questions

@autumnust autumnust left a comment

Nothing major, mainly naming.

@sumedhsakdeo

LGTM. Looking forward to more information about cutover from old config to new config in PR description. It's not blocking though.

@sumedhsakdeo

Also please check why Build is failing

@HotSushi HotSushi merged commit 9339553 into linkedin:main May 6, 2024
1 check passed
@HotSushi HotSushi deleted the FileIOs branch May 6, 2024 19:11
4 participants