
Documentation: detail docker-in-docker requirements #10

Open
alexmaras opened this issue Feb 15, 2024 · 5 comments

Comments

@alexmaras

I'd recommend documenting the requirements for running this wrapper when Meltano itself is already running in Docker. This is possible, but it requires the /tmp directory to be bind-mounted from the host into the Meltano container.

This ensures that /tmp/config.json can be accessed: the Python script uses mktemp to create a directory in /tmp, such as /tmp/tmp.BNlf296WXX, which is then mounted into the Airbyte container. Because of the way docker-in-docker works, volume mounts are resolved against paths on the host, not on the Meltano container, so /tmp needs to be bind-mounted for the files to show up in the Airbyte container.
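The mechanism described above can be sketched in shell. The config contents and the docker arguments are illustrative, but the mktemp-in-/tmp behavior matches the comment:

```shell
# Reproduce the wrapper's behavior: create a scratch dir under /tmp
# (yielding a path like /tmp/tmp.BNlf296WXX) and write the connector
# config into it.
WORKDIR="$(mktemp -d /tmp/tmp.XXXXXXXXXX)"
printf '{"host": "db.example.com"}\n' > "$WORKDIR/config.json"  # placeholder config

# The wrapper then bind-mounts that directory into the Airbyte container.
# With docker-in-docker the -v source path is resolved on the *host*, so
# the directory must also exist on the host filesystem -- hence the need
# to run the Meltano container itself with -v /tmp:/tmp.
# (Command is echoed rather than executed in this sketch.)
echo "docker run --rm -v $WORKDIR:$WORKDIR airbyte/source-postgres:3.3.10 read --config $WORKDIR/config.json"
```

If the Meltano container was started without `-v /tmp:/tmp`, the `$WORKDIR` path exists only inside that container, and the nested mount silently maps an empty (or missing) host directory instead.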

@sabino
Contributor

sabino commented Mar 2, 2024

Hello!

Just want to add here that I had no issues with /tmp, because it looks like it's mounted automatically (it works with docker-in-docker as well, at least in my implementation).

But I had issues with the user that runs everything, due to UID/ownership and permissions: apparently the Airbyte images (at least the one for Postgres) run as the root user, so the container creates some files (the state file, for instance) and the wrapper throws a permission error (it complains about the file being read-only).

My current workaround is to run Meltano as root (which is far from ideal, IMO).
I'm not sure if this could also be your issue.
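One way to confirm the ownership theory is to check which user the connector image is configured to run as; an empty `User` field means it runs as root, which would explain root-owned state files on the shared mount. The image tag is the one used later in this thread, and the command is echoed rather than executed here:

```shell
# Inspect the image's configured user. An empty result (or "root")
# means processes in the connector container run as root, so files it
# writes to a bind mount will be root-owned on the host side.
# Requires the image to be present locally (e.g. after a docker pull).
CHECK_CMD='docker inspect --format "{{.Config.User}}" airbyte/source-postgres:3.3.10'
echo "$CHECK_CMD"
```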

@alexmaras
Author

@sabino When I run Meltano in Docker, which then runs the tap-airbyte-wrapper, I specify the top-level docker run command myself. The tap mounts /tmp, but Meltano needs to have /tmp mounted first, because it puts the config files into /tmp and then issues the docker run for the tap. I've verified this in a couple of environments, and the tap has no control over it.

Are you running meltano itself as a docker container, or running it directly?

I haven't hit permission issues yet, but that seems in line with what I'd expect in that situation, and it'll help me when I come to using the Airbyte taps for Postgres!

@sabino
Contributor

sabino commented Mar 2, 2024

I'm running Meltano outside Docker in our production environment, but we also have a devcontainer setup in VS Code that installs everything, and since it's a container it runs Meltano inside Docker. So we have both setups, and both of them work.

Here is our current implementation using tap-postgres (Airbyte variant); it's partial, because I've omitted some things.


meltano.yml

version: 1
send_anonymous_usage_stats: false
env:
  MELTANO_SNOWPLOW_COLLECTOR_ENDPOINTS: []
include_paths:
- ./environments/*.meltano.yml
- ./extract/*.meltano.yml
- ./mappers/*.meltano.yml
- ./load/*.meltano.yml
- ./transform/*.meltano.yml
- ./orchestrate/*.meltano.yml
- ./utilities/*.meltano.yml

default_environment: dev

extract/extractors.meltano.yml

plugins:
  extractors:

  - name: tap-postgres
    variant: airbyte
  # Pointing to my fork temporarily (until PR gets merged)
    pip_url: git+https://github.com/sabino/tap-airbyte-wrapper.git

  # This is the LOG BASED Extractor
  # It requires Docker to work because it is
  # a wrapper around the airbyte CDC implementation
  - name: tap-postgres-log
    inherit_from: tap-postgres
    variant: airbyte
    pip_url: git+https://github.com/sabino/tap-airbyte-wrapper.git
    config:
      airbyte_spec:
        image: airbyte/source-postgres
        tag: 3.3.10
      airbyte_config:
        ssl_mode.mode: require
        replication_method.method: CDC
        replication_method.plugin: pgoutput
        replication_method.publication: ext__pub
        replication_method.replication_slot: ext__slot
      flattening_max_depth: 0
      docker_mounts:
      - type: bind
        source: /var/run/docker.sock
        target: /var/run/docker.sock
    select:
    - '*.*'
    metadata:
      '*':
        replication-method: LOG_BASED
        replication-key: _ab_cdc_lsn
        
  - name: tap__rds_db
    inherit_from: tap-postgres-log

extract/databases.meltano.yml

plugins:
  extractors:

  # Database X
  - name: raw__database_x
    inherit_from: tap__rds_db
    config:
      airbyte_config:
        database: database_x

load/loaders.meltano.yml

plugins:
  loaders:
  # Used to output things locally
  - name: target-local
    inherit_from: target-duckdb
    variant: jwills
    pip_url: target-duckdb~=0.4
    config:
      filepath: ${OUTPUT_DB_PATH}
      add_metadata_columns: true

  - name: target-bigquery
    variant: z3z1ma
    pip_url: git+https://github.com/z3z1ma/target-bigquery.git
    config:
      column_name_transforms:
        add_underscore_when_invalid: true
        lower: true
        quote: false
        snake_case: true
      cluster_on_key_properties: false
      credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}
      dataset: ${MELTANO_EXTRACT__LOAD_SCHEMA}
      project: ${BIGQUERY_PROJECT_ID}
      flattening_enabled: false
      flattening_max_depth: 0
      upsert: false
      overwrite: false
      denormalized: true
      method: storage_write_api
      schema_resolver_version: 2
      options:
        storage_write_batch_mode: true

With that, I just set PORT, USER, and PASSWORD using env vars, usually at the top level of the hierarchy (which I called tap-postgres-log in my case) when they are the same downstream.

I set HOST on tap__rds_db (and on all the other connections we have; I've omitted them, but we have a lot of instances we read from).

And just run:
meltano install
Test with:
meltano invoke raw__database_x --test
Extract with:
meltano run raw__database_x target-bigquery
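For setting those values, Meltano derives env var names from the plugin name plus the setting path, uppercased, with non-alphanumeric characters replaced by underscores. The exact variable names below are my assumption based on that convention; verify them with `meltano config tap-postgres-log list`:

```shell
# Assumed names following Meltano's TAP_<PLUGIN>_<SETTING> convention
# for the airbyte_config settings of the tap-postgres-log plugin.
export TAP_POSTGRES_LOG_AIRBYTE_CONFIG_PORT=5432
export TAP_POSTGRES_LOG_AIRBYTE_CONFIG_USERNAME=replicator   # placeholder
export TAP_POSTGRES_LOG_AIRBYTE_CONFIG_PASSWORD='s3cret'     # placeholder

# HOST is set per-connection, e.g. on the tap__rds_db plugin:
export TAP__RDS_DB_AIRBYTE_CONFIG_HOST=db.internal.example   # placeholder

echo "$TAP_POSTGRES_LOG_AIRBYTE_CONFIG_PORT"
```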

@alexmaras
Author

alexmaras commented Mar 8, 2024

I imagine we might be seeing a difference that devcontainer is hiding - perhaps it mounts /tmp by default?

I'm running Meltano in AWS Batch (a layer over AWS ECS), which spins up the meltano/meltano:v3.2.0 Docker image. In that setup, I need to mount /tmp, otherwise it doesn't work. To do that, I also need to run on EC2 instances, but that's fine; it's just that docker-in-docker isn't possible on Fargate.

The same fix was needed locally (-v /tmp:/tmp); otherwise I get:
https://gist.github.com/alexmaras/58300ecaa3cecd83e53070ea84c3bb47

If I just add -v /tmp:/tmp, everything works. This is because the wrapper uses /tmp to store the config and passes it to the docker container by mounting it with -v /tmp:/tmp.

Because of the way docker-in-docker works, when that docker run is executed inside a container running Meltano, the Airbyte container is actually started on the host machine. If you didn't mount /tmp into the Meltano container in its docker run command, the file won't be present on the host filesystem; it'll only exist in the first container's /tmp. As a result, the Airbyte container's /tmp will contain everything the host's /tmp does (it mounts /tmp, after all) but nothing from the Meltano container, which is where config.json lives.
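Putting those requirements together, an invocation of the Meltano container would look roughly like this. The image tag and pipeline names are taken from this thread; mounting the Docker socket is one common way to give the container access to the host daemon (how you wire this up in ECS/Batch differs), and the command is echoed rather than executed here:

```shell
# Bind-mount /tmp so the wrapper's temp directories also exist on the
# host, and mount the Docker socket so the nested `docker run` reaches
# the host daemon (where the Airbyte container actually starts).
RUN_CMD="docker run --rm \
  -v /tmp:/tmp \
  -v /var/run/docker.sock:/var/run/docker.sock \
  meltano/meltano:v3.2.0 run raw__database_x target-bigquery"
echo "$RUN_CMD"
```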

At the end of the day, my system is working well. This issue may act as enough documentation for others to follow if they hit this issue too.

@z3z1ma
Collaborator

z3z1ma commented Mar 30, 2024

I added an env var on main, AIRBYTE_MOUNT_DIR, so you can choose the directory inside the container that gets mounted. By default it is /tmp, so behavior is unchanged unless you set this var.
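Based on that description, usage would look something like the following; the specific directory is just an example, and per the earlier discussion the same path would still need to be bind-mounted into the Meltano container:

```shell
# Point the wrapper at a directory other than /tmp for its scratch
# files (example path; pair it with -v /opt/airbyte-scratch:/opt/airbyte-scratch
# on the Meltano container's own docker run).
export AIRBYTE_MOUNT_DIR=/opt/airbyte-scratch
echo "$AIRBYTE_MOUNT_DIR"
```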
