
Run via Papermill w/store-to fails #430

Open
jklap opened this issue Jan 19, 2023 · 4 comments

Comments

jklap commented Jan 19, 2023

Tried to run a notebook with Papermill that uses `%%gremlin --store-data results`, but the following cells fail with a `name 'results' is not defined` error. The notebook works just fine when executed manually within JupyterLab.
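For reference, the failing pattern sketched from the description above (the correct flag is `--store-to`, per the issue title; the query and variable name are illustrative):

```
# Cell 1: store the query result in a variable via the graph-notebook magic
%%gremlin --store-to results
g.V().limit(10)
```

```
# Cell 2: a later cell referencing the stored variable
# Under Papermill this fails with: NameError: name 'results' is not defined
df = results
```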

Not sure whether the issue lies within Papermill or graph-notebook, so I'm starting here.

Papermill isn't a hard requirement if there is a better way to run a notebook and save the results.

michaelnchin (Member) commented Jan 19, 2023

Could you clarify the intended end goal of using Papermill here? Are you looking to save the results of a single %%gremlin query to a local file?

jklap (Author) commented Jan 23, 2023

Hi @michaelnchin, sorry for the delay. Yes, using `--store-to`; I was going from memory, so I got it right in the title but then wrong in the body of this issue :(

We have several notebooks that run several queries each, save the results to variables, and then run the results through Pandas DataFrames for some "ETL". At that point we write out the resulting notebook for later viewing and also push some of the results to Prometheus' Pushgateway so we can embed the data in Grafana. These notebooks work great when executed manually in JupyterLab; they only failed when we tried to run them via Papermill.

We picked Papermill because we need to schedule these to run on a regular basis (e.g. daily), and Papermill's parameter functionality, along with its save-cell-on-execute behavior, has been very useful.

If there is a better tool for this, we are certainly open to input. One additional caveat: we also use Airflow as our standard execution engine, i.e. via https://airflow.apache.org/docs/apache-airflow-providers-papermill/stable/operators.html (though the error I ran into here occurred when running Papermill manually via the CLI, not through Airflow).

Papermill also supports writing to S3, which is a use case we are still refining: nightly jobs that execute against Neptune and then write the results to S3 for another team to pick up. This isn't a hard requirement, though, since other tools such as awswrangler, boto3, or even the AWS CLI can solve it with low overhead.
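For context, a scheduled run of the kind described above amounts to a CLI invocation along these lines (the notebook names and the parameter are hypothetical, and this assumes Papermill is installed):

```shell
# Execute a parameterized notebook and save the executed copy for later viewing
papermill neptune_report.ipynb output/neptune_report_2023-01-24.ipynb \
    -p lookback_days 1
```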

bechbd (Contributor) commented Jan 24, 2023

@jklap One option you could also explore is the AWS SDK for pandas (https://aws-sdk-pandas.readthedocs.io/en/stable/api.html), which supports both Neptune and S3. You can use it to execute queries against Neptune (https://aws-sdk-pandas.readthedocs.io/en/stable/tutorials/033%20-%20Amazon%20Neptune.html), which returns a Pandas DataFrame, and then save that data to S3 (https://aws-sdk-pandas.readthedocs.io/en/stable/tutorials/003%20-%20Amazon%20S3.html).
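A rough sketch of that flow (the endpoint, bucket, and query below are placeholders, not tested against a live cluster):

```python
# Requires: pip install awswrangler
import awswrangler as wr

# Connect to the Neptune cluster endpoint (placeholder host)
client = wr.neptune.connect(
    "my-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com",
    8182,
    iam_enabled=False,
)

# Execute a Gremlin query; results come back as a Pandas DataFrame
df = wr.neptune.execute_gremlin(client, "g.V().limit(10)")

# Write the DataFrame out to S3 for the downstream team (placeholder bucket/path)
wr.s3.to_csv(df=df, path="s3://my-bucket/neptune-exports/results.csv")
```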

jklap (Author) commented Jan 24, 2023

@bechbd Yes, that is the awswrangler library I mentioned. It doesn't really solve the core problem of scheduling notebooks, though: our users are used to creating notebooks and executing queries with graph-notebook. I mentioned S3 etc. simply to better describe the scope of functionality we've been looking at using with Papermill, to help clarify our needs for any other suggestions.
