
Parametrize page splitting logic with concurrency level, introduce constants for min and max pages per split #86

Merged · 43 commits into main from mike/page-split-adjustments · May 21, 2024

Conversation

@micmarty-deepsense (Contributor) commented on May 14, 2024

  • the default for split_pdf_page is False, according to this comment
  • the number of processes used for handling each batch of PDF pages can be configured via form data (the env variable is no longer used)
  • introduced constants for controlling split size: MIN_PAGES_PER_SPLIT=2 and MAX_PAGES_PER_SPLIT=20
  • the page-splitting mechanism evenly divides pages among workers (processes); see the sketch after this list
  • basic tests for the logic mentioned above
  • regenerated speakeasy client
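
For illustration, here is a minimal sketch of how such an even split could be computed. Only the constant names and values and the "evenly divide among workers" behavior come from the description above; the function name get_split_size and the exact ceil-then-clamp arithmetic are assumptions, not the PR's actual implementation.

import math

# Constants named in the PR description (values from the bullet list above).
MIN_PAGES_PER_SPLIT = 2
MAX_PAGES_PER_SPLIT = 20

def get_split_size(num_pages: int, concurrency_level: int) -> int:
    # Divide pages evenly among workers, then clamp to the allowed range.
    # Illustrative guess only; not the SDK's actual code.
    pages_per_worker = math.ceil(num_pages / concurrency_level)
    return max(MIN_PAGES_PER_SPLIT, min(pages_per_worker, MAX_PAGES_PER_SPLIT))

# Example: a 16-page PDF with concurrency_level=5 -> ceil(16 / 5) = 4 pages per split.
print(get_split_size(16, 5))   # 4
# Example: a 300-page PDF with concurrency_level=2 is capped at MAX_PAGES_PER_SPLIT.
print(get_split_size(300, 2))  # 20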

How to verify that this PR works

Unit & Integration Tests

make install && make test

Manually

make install
pip install --editable .
python -m timeit --repeat 10 --verbose "$(cat test-client.py)"

Where test-client.py has the following contents:

import os
import sys

import unstructured_client
from unstructured_client import UnstructuredClient

# Print the module path to confirm the editable install is the one being imported.
print(unstructured_client.__file__)
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Point the client at a locally running API instance.
s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

filename = "_sample_docs/layout-parser-paper.pdf"

with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

# Enable client-side page splitting with a single worker.
req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=True,
    split_pdf_concurrency_level=1,
)
resp = s.general.partition(req)
ids = [e.element_id for e in resp.elements]
print(ids)

@cragwolfe commented:

There is a caveat that merits a revisit: when the server returns a non-200 response, it sometimes causes the client to hang (now, with the process pool, it will log errors first so the customer has some cue; the thread pool would just get silently stuck in GIL hell).

This is a big one. 1% of requests could easily be non-200.

@micmarty-deepsense (Contributor, Author) commented:

> There is a caveat that merits a revisit: when the server returns a non-200 response, it sometimes causes the client to hang (now, with the process pool, it will log errors first so the customer has some cue; the thread pool would just get silently stuck in GIL hell).

ohhh 😮

I'll try to reproduce this behavior somehow.

@micmarty-deepsense (Contributor, Author) commented:

> LGTM! Per standup today, we've decided to hold off a bit longer on setting the default split to True. We'll flip this when pay-per-page goes out.

Alright, I'm changing the default value to False.

@badGarnet (Collaborator) commented:

> > There is a caveat that merits a revisit: when the server returns a non-200 response, it sometimes causes the client to hang (now, with the process pool, it will log errors first so the customer has some cue; the thread pool would just get silently stuck in GIL hell).
>
> ohhh 😮
>
> I'll try to reproduce this behavior somehow.

Here is my guess: the retry logic might be causing the issue. We relaunch a process before it closes (or, with threads, we try to relaunch a thread before the current one releases the GIL, leading to a thread lock).

@micmarty-deepsense changed the title from "Parametrize page splitting logic with num of threads, min and max pages per thread" to "Parametrize page splitting logic with concurrency level, introduce constants for min and max pages per split" on May 17, 2024
@badGarnet (Collaborator) commented:

> Here is my guess: the retry logic might be causing the issue. We relaunch a process before it closes (or, with threads, we try to relaunch a thread before the current one releases the GIL, leading to a thread lock).

Tested again; with the start method forced to be fork, this problem goes away on macOS. Tested by mocking HTTP responses from the API.
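
For context, here is a minimal sketch of forcing the fork start method with Python's standard multiprocessing module, the mechanism referred to above; the call_api function is a hypothetical placeholder, and this is not the SDK's actual code.

import multiprocessing

def call_api(page_batch):
    # Hypothetical placeholder for the per-batch request to the partition endpoint.
    return len(page_batch)

if __name__ == "__main__":
    # "spawn" has been the default start method on macOS since Python 3.8;
    # forcing "fork" changes how the worker processes are created.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(call_api, [[1, 2], [3, 4]]))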

@micmarty-deepsense merged commit 0bd48a4 into main on May 21, 2024
7 checks passed
@micmarty-deepsense deleted the mike/page-split-adjustments branch on May 21, 2024 10:23