
Run via Papermill w/store-to fails #430

Open
jklap opened this issue Jan 19, 2023 · 4 comments

Comments

jklap commented Jan 19, 2023

Tried to run a notebook with Papermill that uses `%%gremlin --store-data results`, but the following cells fail with a `name 'results' is not defined` error. The notebook works just fine when executed manually within JupyterLab.
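For reference, the failing pattern sketched from the description above (the correct flag is `--store-to`, per the issue title; the query and variable name are illustrative):

```
# Cell 1: store the query result in a variable via the graph-notebook magic
%%gremlin --store-to results
g.V().limit(10)
```

```
# Cell 2: a later cell referencing the stored variable
# Under Papermill this fails with: NameError: name 'results' is not defined
df = results
```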

Not sure whether the issue lies within Papermill or graph-notebook, so I'm starting here.

Papermill isn't a hard requirement if there is a better way to run a notebook and save the results.

michaelnchin (Member) commented Jan 19, 2023

Could you clarify the intended end goal of using Papermill here? Are you looking to save the results of a single %%gremlin query to a local file?

jklap (Author) commented Jan 23, 2023

Hi @michaelnchin, sorry for the delay. Yes, using `--store-to`; I was going from memory, so I got it right in the title but then wrong in the body of this issue :(

We have several notebooks that run several queries each, save the results to variables, and then run the results through Pandas DataFrames for some "ETL". At that point we write out the resulting notebook for later viewing and also push some of the results to Prometheus' Pushgateway so we can embed the data in Grafana. These notebooks work great when executed manually in JupyterLab; they only failed when we tried to run them via Papermill.

We picked Papermill because we need to schedule these to run on a regular basis (e.g. daily), and Papermill's parameter functionality, along with its save-cell-on-execute behavior, has been very useful.

If there is a better tool for this, we are certainly open to input. One additional caveat: we also use Airflow as our standard execution engine, i.e. via https://airflow.apache.org/docs/apache-airflow-providers-papermill/stable/operators.html (though the error I ran into here occurred when running Papermill manually via the CLI, not through Airflow).

Papermill also supports writing to S3, which is a use case we are still refining: nightly jobs that execute against Neptune and then write the results to S3 for another team to pick up. This isn't a hard requirement, though, since other tools such as awswrangler, boto3, or even the AWS CLI can solve it with low overhead.
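For context, a scheduled run of the kind described above amounts to a CLI invocation along these lines (the notebook names and the parameter are hypothetical, and this assumes Papermill is installed):

```shell
# Execute a parameterized notebook and save the executed copy for later viewing
papermill neptune_report.ipynb output/neptune_report_2023-01-24.ipynb \
    -p lookback_days 1
```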

bechbd (Contributor) commented Jan 24, 2023

@jklap One option you could also explore is the AWS SDK for pandas (https://aws-sdk-pandas.readthedocs.io/en/stable/api.html), which supports both Neptune and S3. You can use it to execute queries against Neptune (https://aws-sdk-pandas.readthedocs.io/en/stable/tutorials/033%20-%20Amazon%20Neptune.html), which returns a Pandas DataFrame, and then save that data to S3 (https://aws-sdk-pandas.readthedocs.io/en/stable/tutorials/003%20-%20Amazon%20S3.html).
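A rough sketch of that flow (the endpoint, bucket, and query below are placeholders, not tested against a live cluster):

```python
# Requires: pip install awswrangler
import awswrangler as wr

# Connect to the Neptune cluster endpoint (placeholder host)
client = wr.neptune.connect(
    "my-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com",
    8182,
    iam_enabled=False,
)

# Execute a Gremlin query; results come back as a Pandas DataFrame
df = wr.neptune.execute_gremlin(client, "g.V().limit(10)")

# Write the DataFrame out to S3 for the downstream team (placeholder bucket/path)
wr.s3.to_csv(df=df, path="s3://my-bucket/neptune-exports/results.csv")
```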

jklap (Author) commented Jan 24, 2023

@bechbd Yes, that is the awswrangler library I mentioned. It doesn't really solve the core problem of scheduling notebooks, though: our users are used to creating notebooks and executing queries with graph-notebook. I mentioned S3 etc. simply to better describe the scope of functionality we've been looking at using with Papermill, to help clarify our needs for any other suggestions.
