Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new functions to support downloading from AWS s3 #218

Open
wants to merge 12 commits into
base: new_inference
Choose a base branch
from

Conversation

fang19911030
Copy link
Collaborator

Add new python functions which can be used to download files from AWS s3 to the utils.py

@fang19911030 fang19911030 changed the title New inference pengcheng Add new functions to support downloading from AWS s3 May 13, 2024
@kjsato
Copy link
Contributor

kjsato commented May 13, 2024

@fang19911030 I got errors when invoking pytest test_utils.py such as:

============================================== test session starts ===============================================
platform darwin -- Python 3.9.13, pytest-7.4.0, pluggy-1.0.0
rootdir: /Users/koji/Codes/flepimop/flepiMoP/flepimop/gempyor_pkg/tests/utils
plugins: anyio-3.5.0, cov-4.1.0
collected 17 items

test_utils.py ......F....FFFFFF [100%]

==================================================== FAILURES ====================================================

============================================ short test summary info =============================================
FAILED test_utils.py::test_print_disk_diagnosis_success - AttributeError: module 'gempyor.utils' has no attribute 'print_disk_diagnosis'
FAILED test_utils.py::test_create_resume_out_filename - AttributeError: module 'gempyor.utils' has no attribute 'create_resume_out_filename'
FAILED test_utils.py::test_create_resume_input_filename - AttributeError: module 'gempyor.utils' has no attribute 'create_resume_input_filename'
FAILED test_utils.py::test_get_parquet_types_resume_discard_seeding_true_flepi_block_index_1 - AttributeError: module 'gempyor.utils' has no attribute 'get_parquet_types'
FAILED test_utils.py::test_get_parquet_types_resume_discard_seeding_false_flepi_block_index_1 - AttributeError: module 'gempyor.utils' has no attribute 'get_parquet_types'
FAILED test_utils.py::test_get_parquet_types_flepi_block_index_2 - AttributeError: module 'gempyor.utils' has no attribute 'get_parquet_types'
FAILED test_utils.py::test_create_resume_file_names_map - AttributeError: module 'gempyor.utils' has no attribute 'create_resume_file_names_map'
========================================== 7 failed, 10 passed in 3.31s ==========================================

Are any prerequisites required?

@kjsato
Copy link
Contributor

kjsato commented May 14, 2024

after updated conda env and gempyor:

============================================== test session starts ===============================================
platform darwin -- Python 3.11.4, pytest-7.4.0, pluggy-1.0.0
rootdir: /Users/koji/Codes/flepimop/flepiMoP/flepimop/gempyor_pkg/tests/utils
collected 17 items

test_utils.py .............FFF. [100%]

==================================================== FAILURES ====================================================
============================================ short test summary info =============================================
FAILED test_utils.py::test_get_parquet_types_resume_discard_seeding_true_flepi_block_index_1 - AttributeError: module 'gempyor.utils' has no attribute 'get_parquet_types'
FAILED test_utils.py::test_get_parquet_types_resume_discard_seeding_false_flepi_block_index_1 - AttributeError: module 'gempyor.utils' has no attribute 'get_parquet_types'
FAILED test_utils.py::test_get_parquet_types_flepi_block_index_2 - AttributeError: module 'gempyor.utils' has no attribute 'get_parquet_types'
==================================== 3 failed, 14 passed, 9 warnings in 3.83s ====================================

@fang19911030
Copy link
Collaborator Author

Hi @kjsato , I corrected the wrong function name used in unit tests. These errors are solved.

@kjsato
Copy link
Contributor

kjsato commented May 14, 2024

Hi @kjsato , I corrected the wrong function name used in unit tests. These errors are solved.

I confirmed, thx

Copy link
Contributor

@kjsato kjsato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about changing it like L364, for readability in utils.py?

@kjsato kjsato self-requested a review May 14, 2024 23:17
Copy link
Collaborator

@jcblemai jcblemai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is very good, I had some comments along the way we can discuss.

The end goal is to have a command line to resume like there is currently.gempyor-simulate and these functions should be adapted to be used through this entry point, that will read environment variable or command line argument through click.

def create_resume_out_filename(filetype: str, liketype: str) -> str:
run_id = os.environ.get("FLEPI_RUN_INDEX")
prefix = f"{os.environ.get('FLEPI_PREFIX')}/{os.environ.get('FLEPI_RUN_INDEX')}"
inference_filepath_suffix = f"{liketype}/intermidate"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

intermediate

prefix = f"{os.environ.get('FLEPI_PREFIX')}/{os.environ.get('FLEPI_RUN_INDEX')}"
inference_filepath_suffix = f"{liketype}/intermidate"
inference_filename_prefix = "{:09d}.".format(int(os.environ.get("FLEPI_SLOT_INDEX")))
index = "{:09d}.{:09d}".format(1, int(os.environ.get("FLEPI_BLOCK_INDEX")) - 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the create_file_name function handles the formatting of the index.

)


def get_parquet_types_for_resume() -> List[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very good, but these functions should take arguments and not read environment variables, because we want to use them from gempyor (and have less and fewer environment variable), and because if we need an environment variable version we can add a wrapper around it.

(think of a script like simulate.py, that is e.g called resume_from.py : this script parse the environment variable from the safety of the click module and call these functions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note: seeding is a csv, so I think this function should be renamed "get_filetype_for_resume" (this 4 letter id is called filetype throughout the code)-

return resume_file_name_mapping


def download_file_from_s3(name_map: Dict[str, str]) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing this is very good. Ideally we also have a "move on filesystem" function that can be used as drop-in when resuming from another folder. In this case, it should just move the file in a safe way (creating folders along the way)

@jcblemai
Copy link
Collaborator

(again, sorry for the delay reviewing, as you know round 18 has been really hard)

@fang19911030
Copy link
Collaborator Author

No worry, I will do change based the feedbacks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants