Pre-fetching jars in docker environment fails to populate classpath #1265

Open
jpolchlo opened this issue Sep 21, 2023 · 4 comments

@jpolchlo

I want to build a docker environment where I can pre-load the classpath with spark-sql and some other stuff to avoid boilerplate in my notebooks. So I built the following Dockerfile:

FROM almondsh/almond:0.14.0-RC12-scala-2.12.18

RUN coursier fetch org.apache.logging.log4j:log4j-core:2.17.0
RUN coursier fetch org.apache.logging.log4j:log4j-1.2-api:2.17.0
RUN coursier fetch org.apache.spark::spark-sql:3.1.2

However, when I run this container, import org.apache.spark.sql._ fails with an error:

cell1.sc:1: object apache is not a member of package org
import org.apache.spark.sql._
           ^
Compilation Failed

What step am I missing to get Almond to recognize the coursier-installed jars?

@kiendang
Collaborator

Almond uses a separate directory for its cache. By default, coursier fetch downloads artifacts to .cache/coursier (on Linux). You can try to find where Almond stores its cache; if I remember correctly it's .cache/almond/coursier. Then you can run coursier fetch --cache <almond-coursier-cache-dir> ....

@jpolchlo
Author

jpolchlo commented Sep 22, 2023

That doesn't appear to be the case. Both methods (import from notebook and coursier fetch) place the jar files in the ~/.cache/coursier tree. However, there is a file ~/.cache/almond/ammonite/history that appears to track the notebook imports. The contents after executing

import $ivy.`org.apache.logging.log4j:log4j-core:2.17.0`

are

[
    "import $ivy.`org.apache.logging.log4j:log4j-core:2.17.0`"
]

I'm thinking that the way to pre-load is to provide a notebook with the desired inputs and run it through jupyter during the docker build. There appears to be some amount of state that is created in in-notebook imports that coursier fetch is not replicating.

Edit:
I've been able to preload the container with jars using jupyter execute ... on a notebook containing import $ivy... directives. It appears that the import statements in the notebook are still required to register the imported modules in the current context. However, the jar files are now present, and it's not necessary to wait for the maven downloads.
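
The build-time preload described in this edit could look roughly like the Dockerfile fragment below. It is a sketch under assumptions, not the exact setup used here: the notebook name preload.ipynb and the /tmp path are hypothetical, the notebook is assumed to contain only the import $ivy cells to warm, and jupyter execute is assumed to select the Almond kernel from the notebook's own kernelspec metadata.

```dockerfile
FROM almondsh/almond:0.14.0-RC12-scala-2.12.18

# Hypothetical notebook whose cells are just the imports to warm, e.g.
#   import $ivy.`org.apache.logging.log4j:log4j-core:2.17.0`
COPY preload.ipynb /tmp/preload.ipynb

# Execute the notebook once at build time so the jars are downloaded
# into the image instead of at first use.
RUN jupyter execute /tmp/preload.ipynb
```

Notebooks run in the resulting container still need their own import $ivy lines, as noted above; those imports should then resolve from the jars already in the image rather than from Maven.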

@coreyoconnor
Contributor

Hmm, I did not observe this with the Docker image I'm using. However, I'm using

ENV COURSIER_CACHE=/usr/share/coursier/cache

in the Dockerfile. Does that affect the coursier cache even for the notebook session?

https://github.com/coreyoconnor/nix_configs/blob/dev/modules/ufo-k8s/almond-2/Dockerfile

@coreyoconnor
Contributor

After further testing: yes, setting ENV COURSIER_CACHE will pre-populate the cache as expected.
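
Combined with the Dockerfile from the original question, that suggestion might be sketched as follows. The base image, the fetch commands, and the cache path are taken from the comments in this thread; that the directory is writable by the user running the build steps and the kernel is an assumption, and the linked Dockerfile may handle this differently.

```dockerfile
FROM almondsh/almond:0.14.0-RC12-scala-2.12.18

# Cache location taken from the comment above. Assumption: this directory is
# writable by the user that runs the RUN steps below and the notebook kernel.
ENV COURSIER_CACHE=/usr/share/coursier/cache

# With COURSIER_CACHE set, these fetches and the notebook session
# are expected to share the same cache directory.
RUN coursier fetch org.apache.logging.log4j:log4j-core:2.17.0
RUN coursier fetch org.apache.logging.log4j:log4j-1.2-api:2.17.0
RUN coursier fetch org.apache.spark::spark-sql:3.1.2
```

Pre-populating the cache presumably only avoids the downloads; the thread does not confirm that the import $ivy lines in the notebook can be dropped.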
