This repository has been archived by the owner on Apr 28, 2023. It is now read-only.

Suggestion to improve simplicity of the docker installation #575

Open
KonradHoeffner opened this issue Apr 8, 2022 · 11 comments
KonradHoeffner commented Apr 8, 2022

Usually when I check out a Docker Compose setup, docker-compose up --build integrates all the necessary steps, except perhaps setting some documented environment variables.
In this repository, however, the setup seems to be a multi-step process involving manual intervention.
While the steps aren't that complicated, and could probably be automated by wrapping them in another Dockerfile or in another docker-compose file that executes them, I don't understand why that is necessary in the first place.
Isn't the entire purpose of Docker Compose to achieve a reliable, reproducible setup without manual intervention, except for configuration through environment variables or maybe a configuration file?

For example:

docker volume create --name=ols-neo4j-data
docker volume create --name=ols-mongo-data
docker volume create --name=ols-solr-data
docker volume create --name=ols-downloads

Aren't the volumes created automatically? Sorry if I have some wrong assumptions here; I have not been using Docker for that long, but in the cases I have seen so far there was never a need to run docker volume create, because Docker automatically created all volumes listed in docker-compose.yml.
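
For reference, a minimal sketch of how named volumes are usually declared so that Compose creates them on docker-compose up (service and volume names here are illustrative, not from this repository):

services:
  app:
    image: alpine              # illustrative image
    volumes:
      - example-data:/data     # mounts the named volume into the container

volumes:
  example-data:                # Compose creates this volume automatically if it does not exist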

Then, start solr and mongodb only:

 docker-compose up -d solr mongo

Then, adjust the configuration YAML files in the config directory as required, and load the configuration into the Mongo database using the config loader:

Why is it necessary to start the services before changing the configuration file? Can't this be done beforehand?

docker run --net=host -v $(pwd)/config:/config ebispot/ols-config-importer:stable

Can't this be mounted inside docker-compose.yml?

docker run --net=host -v ols-neo4j-data:/mnt/neo4j -v ols-downloads:/mnt/downloads ebispot/ols-indexer:stable

Why not put that in docker-compose instead?
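
For illustration, those two docker run commands might translate into Compose services roughly like this (a sketch using the same images, mounts, and host networking as above; untested against this repository):

services:
  ols-config-importer:
    image: ebispot/ols-config-importer:stable
    network_mode: "host"
    volumes:
      - ./config:/config            # same bind mount as the docker run command

  ols-indexer:
    image: ebispot/ols-indexer:stable
    network_mode: "host"
    volumes:
      - ols-neo4j-data:/mnt/neo4j
      - ols-downloads:/mnt/downloads

volumes:
  ols-neo4j-data:
  ols-downloads: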

jamesamcl (Member)
Hi, as the readme describes, the steps have to run in a very specific order. The config importer has to run first. The web app cannot be running while the indexer is running, because they share the same embedded database, and the indexer has to run before the web app. AFAIK this cannot be accomplished with Docker Compose.

If you have any suggestions to improve this, please open a PR.

KonradHoeffner (Author) commented Apr 8, 2022

This is possible using the docker-compose depends_on field. For example, in one of our projects a similar situation is handled by first generating data and only starting a Virtuoso SPARQL endpoint (with an import script) once the data-generating container has finished. The data is also converted into SQL statements, and only when that is finished is the database container started, so that it can import them.

Simplified excerpt from a similar setup

  rdf:
    build: ../ontology
    volumes:
      - rdf:/ontology/dist

  virtuoso:
    build: ./virtuoso
    environment: [...]
    volumes:
      - virtuoso-data:/data
      - rdf:/data/toLoad:ro
    ports: [...]
    depends_on:
      rdf: 
        condition: service_completed_successfully
[...]

  sparqltosql:
    build: ../database
    volumes:
      - rdf:/rdf
      - sql:/sql
    depends_on:
      rdf: 
        condition: service_completed_successfully
    command: ["python", "download.py"]

  postgresql:
    image: postgres:13
    ports: [...]
    environment: [...]
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - sql:/docker-entrypoint-initdb.d
    depends_on:
      sparqltosql:
        condition: service_completed_successfully

  phppgadmin:
    image: bitnami/phppgadmin:latest
    ports: [...]
    depends_on:
      - postgresql
  
  database-frontend:
    build: ../database-frontend
    depends_on:
      - postgresql
    ports: [...]
    environment: [...]

The full docker-compose.yml can be seen at https://github.com/hitontology/docker/blob/master/docker-compose.yml.
However, there was one problem that needed additional code changes: while this worked fine for one-time services whose successful completion you can wait for, it created an annoying timing issue between the relational database and the frontend, which creates a default user on first start. Because depends_on without a condition only coordinates start order, the relational database was sometimes not fully up yet, which prevented user creation. We had to solve this by adding a retry mechanism to the database frontend.

If I figure it out for OLS, I will create a pull request.

P.S.: Docker Compose also has health checks, but they didn't work out for us; unfortunately I don't remember why.
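
For completeness, a health-check-based variant of the same dependency might look roughly like this (a sketch only; the pg_isready probe and the timings are illustrative, not from our setup):

  postgresql:
    image: postgres:13
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]  # succeeds once the server accepts connections
      interval: 5s
      retries: 10

  database-frontend:
    build: ../database-frontend
    depends_on:
      postgresql:
        condition: service_healthy   # waits for the health check, not just for container start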

jamesamcl (Member)
Interesting, thanks. Will the data generation run every time with depends_on? The indexer is very slow - at EBI it regularly takes ~48 hours.

jamesamcl reopened this Apr 8, 2022
jamesamcl (Member)
Also, for the same reason, in larger setups (including ours) it is necessary to run the indexer elsewhere and copy the data in. This is another reason we keep it as a separate, distinct step.

KonradHoeffner (Author)
We perform the data generation in the build step, so the data is baked into the resulting image and there is no custom entrypoint at all; it only runs once, at build time:

FROM alpine
RUN apk add raptor2
WORKDIR /ontology
COPY . .
RUN ./build && rm dist/all.ttl
VOLUME /ontology/dist

[image: docker-rdf-dive]

That could probably be improved further with a two-stage build whose second stage is FROM scratch, saving another 5 MB or so.
However, I am not sure that solution is possible in your case, because in our example all the data-generation sources are included in the repository.
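
Such a two-stage variant might look like this (a sketch based on the Dockerfile above; ./build is the same build script, and the scratch stage carries only the generated files):

FROM alpine AS builder
RUN apk add raptor2
WORKDIR /ontology
COPY . .
RUN ./build && rm dist/all.ttl

# Second stage: an image containing nothing but the generated data
FROM scratch
COPY --from=builder /ontology/dist /ontology/dist
VOLUME /ontology/dist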

KonradHoeffner (Author) commented Apr 8, 2022

After reading the OLS readme, I think that example is not applicable, because the source data for the indexer is not known at build time.
In our "sparqltosql" service the SQL files are generated at run time, so the logic not to download them again if the files already exist is built into the Python script.

  sparqltosql:
    build: ../database
    volumes:
      - rdf:/rdf
      - sql:/sql
    depends_on:
      rdf: 
        condition: service_completed_successfully
    command: ["python", "download.py"]

The SQL files are stored in the "sql" volume and are later picked up by the database container.
However, I don't know how this works with Solr and Neo4j indexes: are they generated as files which can then be loaded later, or are the indexes loaded directly through some functions of a live Solr or Neo4j instance?

P.S.: I'm not a Docker expert by any means, so I can't guarantee that this is the best way to implement that kind of dependency. However, in my experience it makes development, testing, and deployment much easier when the setup is built so that everything can be thrown away and rebuilt at any time using docker compose down -v and docker compose up --build. That takes less than a minute for the small projects I'm working on, so I'm not sure it gives you the same advantage when indexing takes 48 hours.

jamesamcl (Member)
Thanks Konrad,

However, I don't know how this works with Solr and Neo4j indexes: are they generated as files which can then be loaded later, or are the indexes loaded directly through some functions of a live Solr or Neo4j instance?

Kind of both! Solr is loaded into a live instance, but Neo4j is embedded (a bit like SQLite) by both the indexer and ols-web, so the indexer has to shut down before ols-web can operate on the same files afterwards.

I do think there will be a way we can simplify all of this, but we're working with code that has been designed for use with very large amounts of data at EBI, and the requirements are not always the same for smaller-scale deployments. (Until recently we had very little support for deployments of OLS outside of EBI.)

We are planning a new version of OLS in which we aim to reduce much of this complexity. I will tag this issue with the OLS 4 milestone and we can have a rethink then.

jamesamcl added this to the 4.x milestone Apr 20, 2022
KonradHoeffner (Author)
I got it to work locally! The following shows the page on http://localhost:8080, where you can browse and search the ontologies; for example, entering "diabetes" shows "milk thistle supplement" as the first hit.
With the docker-compose.yml below, all the manual instructions in the readme become unnecessary; docker-compose up alone is enough!
However, this was only possible by using the host network for now, because "localhost" seems to be hardcoded in the ols-config-importer.
Host mode is fine for local testing, but it may not be suitable for production because of possible port clashes.
So I will not create a pull request out of this yet; I also did not move any processing from the run step into the build step, so I don't know whether it still fits your large-scale deployment.
It is well suited to our use case of local deployment, however.
The "Documentation" and "About" tabs are empty, though; I don't know if that is working as intended.

version: '2'
services:
  solr:
    image: ebispot/ols-solr:latest
    environment:
      - SOLR_HOME=/mnt/solr-config
    ports:
      - 8983:8983
    volumes:
      - ols-solr-data:/var/solr
      - ./ols-solr/src/main/solr-5-config:/mnt/solr-config
    network_mode: "host"
    command: ["-Dsolr.solr.home=/mnt/solr-config", "-Dsolr.data.dir=/var/solr", "-f"]
  mongo:
    image: mongo:3.2.9
    ports:
      - 27017:27017
    volumes:
      - ols-mongo-data:/data/db
    network_mode: "host"
    command:
      - mongod
  ols-config-importer:
    #image: ebispot/ols-config-importer:stable
    build:
      context: .
      dockerfile: ./ols-apps/ols-config-importer/Dockerfile
    volumes:
      - ./config:/config
    network_mode: "host"
    depends_on: ["mongo"]
    restart: on-failure:2
  ols-indexer:
    build:
      context: .
      dockerfile: ./ols-apps/ols-indexer/Dockerfile
    volumes:
      - ols-neo4j-data:/mnt/neo4j
      - ols-downloads:/mnt/downloads
    network_mode: "host"
    depends_on:
      ols-config-importer:
        condition: service_completed_successfully
  ols-web:
    build:
      context: .
      dockerfile: ols-web/Dockerfile
    network_mode: "host"
    depends_on:
      ols-indexer:
        condition: service_completed_successfully
    links:
      - solr
      - mongo
    environment:
      # - spring.data.solr.host=http://solr:8983/solr
      - spring.data.solr.host=http://localhost:8983/solr
      - spring.data.mongodb.host=localhost
      - ols.customisation.logo=${LOGO}
      - ols.customisation.title=${TITLE}
      - ols.customisation.short-title=${SHORT_TITLE}
      - ols.customisation.web=${WEB}
      - ols.customisation.twitter=${TWITTER}
      - ols.customisation.org=${ORG}
      - ols.customisation.backgroundImage=${BACKGROUND_IMAGE}
      - ols.customisation.backgroundColor=${BACKGROUND_COLOR}
      - ols.customisation.issuesPage=${ISSUES_PAGE}
      - ols.customisation.supportMail=${SUPPORT_MAIL}
      - OLS_HOME=/mnt/
    volumes:
      - ols-neo4j-data:/mnt/neo4j
      - ols-downloads:/mnt/downloads
    ports:
      - 8080:8080
volumes:
  ols-solr-data:
  ols-mongo-data:
  ols-neo4j-data:
  ols-downloads:
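
The ${LOGO}, ${TITLE}, etc. placeholders above would come from the shell environment or from an .env file next to docker-compose.yml; a hypothetical example (all values illustrative):

LOGO=logo.png
TITLE=My Ontology Lookup Service
SHORT_TITLE=OLS
WEB=https://example.org
TWITTER=example
ORG=Example Org
BACKGROUND_IMAGE=background.png
BACKGROUND_COLOR=#ffffff
ISSUES_PAGE=https://example.org/issues
SUPPORT_MAIL=support@example.org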

KonradHoeffner (Author)
The next step would be to get it to work without host mode, probably by changing the source code of the ols-config-importer.

jamesamcl (Member)
That's exciting!

AFAIK ols-config-importer is not hardcoded to localhost. You should be able to set spring.data.mongodb.host in the environment.

KonradHoeffner (Author) commented Apr 20, 2022

That worked! The version below runs without network_mode: "host":

version: '2'
services:
  solr:
    image: ebispot/ols-solr:latest
    environment:
      - SOLR_HOME=/mnt/solr-config
    ports:
      - 8983:8983
    volumes:
      - ols-solr-data:/var/solr
      - ./ols-solr/src/main/solr-5-config:/mnt/solr-config
    command: ["-Dsolr.solr.home=/mnt/solr-config", "-Dsolr.data.dir=/var/solr", "-f"]
  mongo:
    image: mongo:3.2.9
    ports:
      - 27017:27017
    volumes:
      - ols-mongo-data:/data/db
    command:
      - mongod
  ols-config-importer:
    #image: ebispot/ols-config-importer:stable
    build:
      context: .
      dockerfile: ./ols-apps/ols-config-importer/Dockerfile
    environment:
      - spring.data.mongodb.host=mongo
    volumes:
      - ./config:/config
    depends_on: ["mongo"]
    restart: on-failure:2
  ols-indexer:
    build:
      context: .
      dockerfile: ./ols-apps/ols-indexer/Dockerfile
    environment:
      - spring.data.solr.host=http://solr:8983/solr
      - spring.data.mongodb.host=mongo
    volumes:
      - ols-neo4j-data:/mnt/neo4j
      - ols-downloads:/mnt/downloads
    depends_on:
      ols-config-importer:
        condition: service_completed_successfully
  ols-web:
    build:
      context: .
      dockerfile: ols-web/Dockerfile
    depends_on:
      ols-indexer:
        condition: service_completed_successfully
    links:
      - solr
      - mongo
    environment:
      - spring.data.solr.host=http://solr:8983/solr
      - spring.data.mongodb.host=mongo
      - ols.customisation.logo=${LOGO}
      - ols.customisation.title=${TITLE}
      - ols.customisation.short-title=${SHORT_TITLE}
      - ols.customisation.web=${WEB}
      - ols.customisation.twitter=${TWITTER}
      - ols.customisation.org=${ORG}
      - ols.customisation.backgroundImage=${BACKGROUND_IMAGE}
      - ols.customisation.backgroundColor=${BACKGROUND_COLOR}
      - ols.customisation.issuesPage=${ISSUES_PAGE}
      - ols.customisation.supportMail=${SUPPORT_MAIL}
      - OLS_HOME=/mnt/
    volumes:
      - ols-neo4j-data:/mnt/neo4j
      - ols-downloads:/mnt/downloads
    ports:
      - 8080:8080
volumes:
  ols-solr-data:
  ols-mongo-data:
  ols-neo4j-data:
  ols-downloads:
