[BUG] duckdb connection to s3 over HTTPFS errors after indeterminate amount of time/connections #5039

Open
nyc-de opened this issue May 7, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@nyc-de
Contributor

nyc-de commented May 7, 2024

Mage version

mage v0.9.67

Describe the bug

After roughly 1,000 queries to S3 via DuckDB and httpfs, the connection begins to fail with the error:

duckdb.duckdb.IOException: IO Error: Connection error for HTTP HEAD to 'https://my-bucket.s3.amazonaws.com/prod/bronze/salesforce/main.parquet'

To reproduce

Difficult to reproduce on demand because it seems tied to the number of connections made to S3. The failure itself is very consistent: my pipelines start running at 12:00 UTC, and after roughly 8-10 hours the connections start returning the error.
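
One way to test the suspicion that the failure is tied to the number of connections is to log how many file descriptors the process holds before and after each query. The sketch below is only a diagnostic aid and assumes a Linux host (it reads `/proc/self/fd`, which fits the Docker-on-EC2 setup described here). The helper names `count_open_fds` and `log_fd_count` are hypothetical and not part of Mage or DuckDB.

```python
import os
from typing import Optional


def count_open_fds() -> Optional[int]:
    """Return the number of open file descriptors for this process.

    Returns None when /proc is not available (for example on macOS).
    """
    fd_dir = "/proc/self/fd"  # Linux-only assumption
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        return None


def log_fd_count(label: str) -> str:
    """Print and return a one-line report of the current descriptor count."""
    count = count_open_fds()
    message = f"{label}: open file descriptors = {count}"
    print(message)
    return message
```

Calling `log_fd_count("before query")` and `log_fd_count("after query")` around the DuckDB call would show whether the count climbs over the day. A steadily growing number would point toward leaked sockets; a flat number would suggest looking elsewhere. If each block runs in its own worker process, the count only reflects that process.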

Expected behavior

No response

Screenshots

No response

Operating system

  • duckdb v0.10.1
  • Mage in docker on EC2.

Additional context

How am I hacking around this?

I am managing the issue by restarting the container every day.
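
A lighter-weight workaround that might be worth trying before a full container restart is to retry the query on a brand-new connection with a short backoff. This is a minimal sketch, not a confirmed fix: `run_with_fresh_connection` is a hypothetical helper, and if the underlying state lives at the process level (which the restart workaround hints at), a fresh connection in the same process may not help.

```python
import time
from typing import Any, Callable


def run_with_fresh_connection(
    connect: Callable[[], Any],
    run_query: Callable[[Any], Any],
    attempts: int = 3,
    base_delay_s: float = 1.0,
) -> Any:
    """Run `run_query` on a new connection per attempt, with exponential backoff.

    `connect` must return an object with a `close()` method.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        con = connect()
        try:
            return run_query(con)
        except Exception as exc:  # in real use, narrow this to duckdb.IOException
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(base_delay_s * (2 ** attempt))
        finally:
            con.close()  # always release the connection, even on success
    raise last_error
```

With DuckDB this could be called as `run_with_fresh_connection(duckdb.connect, lambda con: con.sql(query).df())`, where `query` is the SQL string from the block in this issue.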

What have I done to help myself?

I have found related DuckDB issues and the following guidance, but they do not resolve my problem:

If you get an IO Error (Connection error for HTTP HEAD), configure the endpoint explicitly via ENDPOINT 's3.⟨your-region⟩.amazonaws.com'.
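
To follow the quoted advice literally, the endpoint can be placed in the secret itself rather than in a separate `SET` statement. The sketch below only builds the SQL text; `build_s3_secret_sql` is a hypothetical helper, the secret name `secret1` and the region `us-east-1` are taken from the code in this issue, and it is an assumption that `REGION` and `ENDPOINT` combine with `PROVIDER CREDENTIAL_CHAIN` in the DuckDB version in use.

```python
import re


def build_s3_secret_sql(region: str, name: str = "secret1") -> str:
    """Build a CREATE SECRET statement with an explicit regional S3 endpoint."""
    # Reject anything that is not a plain region or identifier, since the
    # values are interpolated directly into SQL text.
    if not re.fullmatch(r"[a-z0-9-]+", region):
        raise ValueError(f"unexpected region: {region!r}")
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unexpected secret name: {name!r}")
    return (
        f"CREATE OR REPLACE SECRET {name} (\n"
        "    TYPE S3,\n"
        "    PROVIDER CREDENTIAL_CHAIN,\n"
        f"    REGION '{region}',\n"
        f"    ENDPOINT 's3.{region}.amazonaws.com'\n"
        ");"
    )


print(build_s3_secret_sql("us-east-1"))
```

`CREATE OR REPLACE` is used so that running the statement more than once in the same process does not fail because the secret already exists.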

Code producing error

import duckdb
from loguru import logger
import os
import DataLakeURI

if "data_loader" not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if "test" not in globals():
    from mage_ai.data_preparation.decorators import test

@data_loader
def load_data(*args, **kwargs):

    env = os.getenv("ENV")
    user = os.getenv("USER")
    object = kwargs['SALESFORCE_OBJECT_NAME']
    source = kwargs['DATA_LAKE_CONTEXT']
    medallion = kwargs['DATA_LAKE_MEDALLION']

    uri = DataLakeURI(user=user, environment=env, source=source, object=object, medallion=medallion)

    logger.info(f"Fetching data from {uri.get_uri()}")

    query = f"""
    CREATE SECRET secret1 (
        TYPE S3,
        PROVIDER CREDENTIAL_CHAIN
        );

    SET s3_endpoint='s3.us-east-1.amazonaws.com';
    SET http_keep_alive=false;
    SET s3_region='us-east-1';
    load HTTPFS;
    select max(LastModifiedDate::TIMESTAMP)::TIMESTAMP as max_date from read_parquet('{uri.get_uri()}');
    """

    df = duckdb.sql(query).df()

    logger.info(f"max LastModifiedDate = {df.iat[0,0]}")

    return df
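
For narrowing down which step fails, the multi-statement query above could be split into individual statements and run one at a time on a dedicated connection. This is a sketch under assumptions, not a known fix: `build_statements` and `run_statements` are hypothetical helpers, the `s3://` path in the usage note is a placeholder, and the statement order (loading httpfs first) is a choice made here rather than something the issue prescribes.

```python
from typing import Any, List


def build_statements(parquet_uri: str, region: str = "us-east-1") -> List[str]:
    """Return the setup statements followed by the max-date query."""
    escaped = parquet_uri.replace("'", "''")  # escape quotes for the SQL literal
    return [
        "LOAD httpfs",
        "CREATE OR REPLACE SECRET secret1 (TYPE S3, PROVIDER CREDENTIAL_CHAIN)",
        f"SET s3_region='{region}'",
        f"SET s3_endpoint='s3.{region}.amazonaws.com'",
        "SET http_keep_alive=false",
        "SELECT max(LastModifiedDate::TIMESTAMP) AS max_date "
        f"FROM read_parquet('{escaped}')",
    ]


def run_statements(con: Any, statements: List[str]) -> Any:
    """Execute statements in order and return the result of the last one.

    `con` only needs an `execute(sql)` method, as a DuckDB connection has.
    """
    result = None
    for index, sql in enumerate(statements):
        try:
            result = con.execute(sql)
        except Exception as exc:
            # Attach the failing statement so the log shows where it broke.
            raise RuntimeError(f"statement {index} failed: {sql}") from exc
    return result
```

A possible call would be `run_statements(con, build_statements("s3://my-bucket/prod/bronze/salesforce/main.parquet")).df()` with `con = duckdb.connect()`, followed by `con.close()` once the DataFrame has been fetched.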

traceback

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/executors/block_executor.py", line 613, in execute
    result = __execute_with_retry()
  File "/usr/local/lib/python3.10/site-packages/mage_ai/shared/retry.py", line 54, in retry_func
    raise e
  File "/usr/local/lib/python3.10/site-packages/mage_ai/shared/retry.py", line 38, in retry_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/executors/block_executor.py", line 588, in __execute_with_retry
    return self._execute(
  File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/executors/block_executor.py", line 1077, in _execute
    result = self.block.execute_sync(
  File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/models/block/__init__.py", line 1314, in execute_sync
    raise err
  File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/models/block/__init__.py", line 1223, in execute_sync
    output = self.execute_block(
  File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/models/block/__init__.py", line 1529, in execute_block
    outputs = self._execute_block(
  File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/models/block/__init__.py", line 1685, in _execute_block
    outputs = self.execute_block_function(
  File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/models/block/__init__.py", line 1724, in execute_block_function
    output = block_function_updated(*input_vars, **global_vars)
  File "<string>", line 42, in load_data
duckdb.duckdb.IOException: IO Error: Connection error for HTTP HEAD to 'https://my-bucket.s3.amazonaws.com/prod/bronze/salesforce/main.parquet'

stacktrace

File "/usr/local/bin/mage", line 8, in <module>
 sys.exit(app())

File "/usr/local/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
 return get_command(self)(*args, **kwargs)

File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
 return self.main(*args, **kwargs)

File "/usr/local/lib/python3.10/site-packages/typer/core.py", line 778, in main
 return _main(

File "/usr/local/lib/python3.10/site-packages/typer/core.py", line 216, in _main
 rv = self.invoke(ctx)

File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
 return _process_result(sub_ctx.command.invoke(sub_ctx))

File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
 return ctx.invoke(self.callback, **ctx.params)

File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
 return __callback(*args, **kwargs)

File "/usr/local/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
 return callback(**use_params)  # type: ignore

File "/usr/local/lib/python3.10/site-packages/mage_ai/cli/main.py", line 163, in start
 start_server(

File "/usr/local/lib/python3.10/site-packages/mage_ai/server/server.py", line 743, in start_server
 scheduler_manager.start_scheduler()

File "/usr/local/lib/python3.10/site-packages/mage_ai/server/scheduler_manager.py", line 87, in start_scheduler
 proc.start()

File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
 self._popen = self._Popen(self)

File "/usr/local/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
 return _default_context.get_context().Process._Popen(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
 return Popen(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
 self._launch(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 71, in _launch
 code = process_obj._bootstrap(parent_sentinel=child_r)

File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
 self.run()

File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
 self._target(*self._args, **self._kwargs)

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/db/process.py", line 15, in start_session_and_run
 results = target(*args)

File "/usr/local/lib/python3.10/site-packages/mage_ai/server/scheduler_manager.py", line 50, in run_scheduler
 LoopTimeTrigger().start()

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/triggers/loop_time_trigger.py", line 14, in start
 self.run()

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/triggers/time_trigger.py", line 11, in run
 schedule_all()

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/pipeline_scheduler_original.py", line 1615, in schedule_all
 PipelineScheduler(r).start()

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/db/__init__.py", line 157, in func_with_rollback
 return func(*args, **kwargs)

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/pipeline_scheduler_original.py", line 190, in start
 self.schedule()

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/db/__init__.py", line 157, in func_with_rollback
 return func(*args, **kwargs)

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/pipeline_scheduler_original.py", line 318, in schedule
 self.__schedule_blocks(block_runs)

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/pipeline_scheduler_original.py", line 602, in __schedule_blocks
 job_manager.add_job(

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/job_manager.py", line 28, in add_job
 self.queue.enqueue(job_id, target, *args, **kwargs)

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/queue/process_queue.py", line 108, in enqueue
 self.start_worker_pool()

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/queue/process_queue.py", line 183, in start_worker_pool
 self.worker_pool_proc.start()

File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
 self._popen = self._Popen(self)

File "/usr/local/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
 return _default_context.get_context().Process._Popen(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
 return Popen(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
 self._launch(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 71, in _launch
 code = process_obj._bootstrap(parent_sentinel=child_r)

File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
 self.run()

File "/usr/local/lib/python3.10/multiprocessing/process.py", line 108, in run
 self._target(*self._args, **self._kwargs)

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/queue/process_queue.py", line 288, in poll_job_and_execute
 worker.start()

File "/usr/local/lib/python3.10/multiprocessing/process.py", line 121, in start
 self._popen = self._Popen(self)

File "/usr/local/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
 return _default_context.get_context().Process._Popen(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
 return Popen(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
 self._launch(process_obj)

File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 71, in _launch
 code = process_obj._bootstrap(parent_sentinel=child_r)

File "/usr/local/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
 self.run()

File "/usr/local/lib/python3.10/site-packages/newrelic/api/background_task.py", line 117, in wrapper
 return wrapped(*args, **kwargs)

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/queue/process_queue.py", line 253, in run
 start_session_and_run(args[1], *args[2], **args[3])

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/db/process.py", line 15, in start_session_and_run
 results = target(*args)

File "/usr/local/lib/python3.10/site-packages/mage_ai/orchestration/pipeline_scheduler_original.py", line 1152, in run_block
 return ExecutorFactory.get_block_executor(

File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/executors/block_executor.py", line 615, in execute
 self.logger.exception(

File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/logging/logger.py", line 30, in exception
 self.__send_message('exception', message, **kwargs)

File "/usr/local/lib/python3.10/site-packages/mage_ai/data_preparation/logging/logger.py", line 65, in __send_message
 data['error_stack'] = traceback.format_stack(),

Additional info

Slack thread

@nyc-de nyc-de added the bug Something isn't working label May 7, 2024
@chlimaferreira

For those using MinIO in Docker, use port 9000:9000 instead of 9000:9001.

docker run -p 9000:9000 -p 9090:9090 --name minio_name -e "MINIO_ROOT_USER=minio" -e "MINIO_ROOT_PASSWORD=minio123" -v ${HOME}/minio/data:/data quay.io/minio/minio server /data --console-address ":9090"

In my case it worked.
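
For a MinIO container started as above, DuckDB also has to be pointed at the local API port rather than at AWS. The following is a minimal sketch of the session settings that would typically be needed, assuming MinIO is reachable on `localhost:9000` without TLS; the credentials come from the `docker run` command above, and `build_minio_settings_sql` is a hypothetical helper that only builds the SQL text.

```python
from typing import List


def build_minio_settings_sql(
    host: str = "localhost",
    port: int = 9000,
    user: str = "minio",
    password: str = "minio123",
) -> List[str]:
    """Return DuckDB SET statements for a local, non-TLS MinIO endpoint."""
    return [
        f"SET s3_endpoint='{host}:{port}'",  # the MinIO API port, not the console port
        "SET s3_use_ssl=false",              # plain HTTP for a local container
        "SET s3_url_style='path'",           # MinIO is usually addressed path-style
        f"SET s3_access_key_id='{user}'",
        f"SET s3_secret_access_key='{password}'",
    ]


for statement in build_minio_settings_sql():
    print(statement)
```

Each statement can then be passed to `con.execute(...)` on a DuckDB connection before reading from an `s3://` path.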
