Large stdout/stderr crashes the acknowledgement manager, results in stuck jobs #283

Open
natefoo opened this issue Aug 24, 2021 · 4 comments
Member

natefoo commented Aug 24, 2021

2021-08-24 13:52:32,217 DEBUG [pulsar.client.amqp_exchange][acknowledgement-manager] UUID b694576c-02bf-11ec-a39e-566f6d94001a has not been acknowledged, republishing original message on queue status_update
2021-08-24 13:52:32,217 DEBUG [pulsar.client.amqp_exchange][acknowledgement-manager] [publish:0e74b094-0504-11ec-9a41-566f6d94001a] Begin publishing to key pulsar_bridges__status_update
2021-08-24 13:52:32,218 DEBUG [pulsar.client.amqp_exchange][acknowledgement-manager] [publish:0e74b094-0504-11ec-9a41-566f6d94001a] Have producer for publishing to key pulsar_bridges__status_update
2021-08-24 13:52:32,300 ERROR [pulsar.client.amqp_exchange][acknowledgement-manager] Problem with acknowledgement manager, leaving ack_manager method in problematic state!
Traceback (most recent call last):
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/pulsar/client/amqp_exchange.py", line 232, in ack_manager
    self.publish(resubmit_queue, payload)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/pulsar/client/amqp_exchange.py", line 205, in publish
    producer.publish(
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/kombu/messaging.py", line 175, in publish
    return _publish(
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/kombu/connection.py", line 525, in _ensured
    return fun(*args, **kwargs)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/kombu/messaging.py", line 197, in _publish
    return channel.basic_publish(
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/channel.py", line 1775, in _basic_publish
    self.connection.drain_events(timeout=0)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/connection.py", line 522, in drain_events
    while not self.blocking_read(timeout):
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/connection.py", line 528, in blocking_read
    return self.on_inbound_frame(frame)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/method_framing.py", line 53, in on_frame
    callback(channel, method_sig, buf, None)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/connection.py", line 534, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/abstract_channel.py", line 143, in dispatch_method
    listener(*args)
  File "/jet/home/xcgalaxy/main/pulsar/venv/lib/python3.8/site-packages/amqp/channel.py", line 277, in _on_close
    raise error_for_code(
amqp.exceptions.PreconditionFailed: Basic.publish: (406) PRECONDITION_FAILED - message size 155238738 is larger than configured max size 134217728

I'm not sure of the best solution here, but maybe Pulsar should post the stdio streams back as job files rather than sending them through the MQ?
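To make the failure mode concrete, here is a minimal sketch of a size guard that could run before publishing. It is not Pulsar's code: the helper names and the `stdout`/`stderr` payload keys are hypothetical, and the JSON byte count only approximates what the broker measures. The limit is the one reported in the error above (134217728 bytes, i.e. 128 MiB).

```python
import json

# Broker limit taken from the PRECONDITION_FAILED error above (128 MiB).
BROKER_MAX_MESSAGE_BYTES = 134217728


def payload_size(payload: dict) -> int:
    """Approximate the message body size as UTF-8 encoded JSON (an assumption)."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_in_message(payload: dict, limit: int = BROKER_MAX_MESSAGE_BYTES) -> bool:
    """Return True if the serialized payload is within the limit."""
    return payload_size(payload) <= limit


def drop_oversized_streams(payload: dict, limit: int = BROKER_MAX_MESSAGE_BYTES):
    """Return (message, removed_streams).

    If the payload is too large, the hypothetical 'stdout' and 'stderr' keys
    are removed from a copy of the payload and returned separately, so the
    caller can deliver them some other way (for example as job files).
    """
    if fits_in_message(payload, limit):
        return payload, {}
    slim = dict(payload)
    removed = {}
    for key in ("stdout", "stderr"):
        if key in slim:
            removed[key] = slim.pop(key)
    return slim, removed
```

With a guard like this, an oversized status update would be published without the inline streams instead of triggering a channel close in the acknowledgement manager.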

@natefoo natefoo added this to 2021 - T3 in Admin Working Group Aug 31, 2021
@natefoo natefoo self-assigned this Aug 31, 2021
Member

mvdbeek commented Apr 10, 2023

We do (also) send it as a file, and Pulsar has the maximum_stream_size option, which defaults to -1, i.e. read everything. I think we can add a more sensible default here.
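As an illustration of what a size cap with a "-1 means read everything" default implies, here is a minimal sketch. The `read_stream` function is hypothetical and not Pulsar's implementation; in particular, keeping the start of the file rather than the end is an assumption that the thread does not confirm.

```python
def read_stream(path: str, maximum_stream_size: int = -1) -> bytes:
    """Read a stdout/stderr file, returning at most maximum_stream_size bytes.

    A negative value (the default) means no limit: the whole file is read.
    This sketch keeps the beginning of the file when truncating.
    """
    with open(path, "rb") as handle:
        if maximum_stream_size < 0:
            return handle.read()
        return handle.read(maximum_stream_size)
```

Called with a cap such as `8 * 1024 * 1024`, the function returns no more than 8 MiB regardless of how large the file on disk is, which keeps the inline copy of the stream well under a 128 MiB broker limit.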

Member Author

natefoo commented Apr 13, 2023

I have that set to 8 MB, and it seems to work fine.

Contributor

cat-bro commented Apr 5, 2024

Wouldn't truncating the stdout/stderr files affect Galaxy's ability to judge the success or failure of the job?

Member

mvdbeek commented Apr 5, 2024

> We do (also) send it as a file,

covers guessing the job state if the exit code is not the authoritative source. That happens in the metadata script, which reads in the file contents.
