
Parametrize page splitting logic with concurrency level, introduce constants for min and max pages per split #86

Merged · 43 commits into main from mike/page-split-adjustments · May 21, 2024

Conversation

@micmarty-deepsense (Contributor) commented on May 14, 2024

  • the default for split_pdf_page is False, according to this comment
  • the number of processes used for handling each batch of PDF pages can be configured via form data (the env variable is no longer used)
  • introduced constants for controlling split size: MIN_PAGES_PER_SPLIT=2 and MAX_PAGES_PER_SPLIT=20
  • the page-splitting mechanism evenly divides pages among workers (processes); see the sketch after this list
  • basic tests for the logic mentioned above
  • regenerated speakeasy client
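
For illustration, here is a minimal sketch of how such an even split could be computed. Only the constant names and values and the "evenly divide among workers" behavior come from the description above; the function name get_split_size and the exact ceil-then-clamp arithmetic are assumptions, not the PR's actual implementation.

import math

# Constants named in the PR description (values from the bullet list above).
MIN_PAGES_PER_SPLIT = 2
MAX_PAGES_PER_SPLIT = 20

def get_split_size(num_pages: int, concurrency_level: int) -> int:
    # Divide pages evenly among workers, then clamp to the allowed range.
    # Illustrative guess only; not the SDK's actual code.
    pages_per_worker = math.ceil(num_pages / concurrency_level)
    return max(MIN_PAGES_PER_SPLIT, min(pages_per_worker, MAX_PAGES_PER_SPLIT))

# Example: a 16-page PDF with concurrency_level=5 -> ceil(16 / 5) = 4 pages per split.
print(get_split_size(16, 5))   # 4
# Example: a 300-page PDF with concurrency_level=2 is capped at MAX_PAGES_PER_SPLIT.
print(get_split_size(300, 2))  # 20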

How to verify that this PR works

Unit & Integration Tests

make install && make test

Manually

make install
pip install --editable .
python -m timeit --repeat 10 --verbose "$(cat test-client.py)"

Where test-client.py has the following contents:

import os
import sys

import unstructured_client
from unstructured_client import UnstructuredClient

# Print the module path to confirm the editable install is the one being imported.
print(unstructured_client.__file__)
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Point the client at a locally running API instance.
s = UnstructuredClient(api_key_auth=os.environ["UNS_API_KEY"], server_url="http://localhost:8000")

filename = "_sample_docs/layout-parser-paper.pdf"

with open(filename, "rb") as f:
    files = shared.Files(
        content=f.read(),
        file_name=filename,
    )

# Enable client-side page splitting with a single worker.
req = shared.PartitionParameters(
    files=files,
    strategy="fast",
    languages=["eng"],
    split_pdf_page=True,
    split_pdf_concurrency_level=1,
)
resp = s.general.partition(req)
ids = [e.element_id for e in resp.elements]
print(ids)

@cragwolfe commented:

There is a caveat that merits a revisit: when the server returns a non-200 response, it sometimes causes the client to hang (now, with the process pool, it will log errors first so the customer has some cue; the thread pool would just get silently stuck in GIL hell).

This is a big one. 1% of requests could easily be non-200.

@micmarty-deepsense (Contributor, Author) commented:

> There is a caveat that merits a revisit: when the server returns a non-200 response, it sometimes causes the client to hang (now, with the process pool, it will log errors first so the customer has some cue; the thread pool would just get silently stuck in GIL hell).

ohhh 😮

I'll try to reproduce this behavior somehow.

@micmarty-deepsense (Contributor, Author) commented:

> LGTM! Per standup today, we've decided to hold off a bit longer on setting the default split to True. We'll flip this when pay-per-page goes out.

Alright, I'm changing the default value to False.

@badGarnet (Collaborator) commented:

> > There is a caveat that merits a revisit: when the server returns a non-200 response, it sometimes causes the client to hang (now, with the process pool, it will log errors first so the customer has some cue; the thread pool would just get silently stuck in GIL hell).
>
> ohhh 😮
>
> I'll try to reproduce this behavior somehow.

Here is my guess: the retry logic might be causing the issue. We relaunch a process before it closes (or, with threads, we try to relaunch a thread before the current one releases the GIL, leading to a thread lock).

@micmarty-deepsense changed the title from "Parametrize page splitting logic with num of threads, min and max pages per thread" to "Parametrize page splitting logic with concurrency level, introduce constants for min and max pages per split" on May 17, 2024
@badGarnet (Collaborator) commented:

> Here is my guess: the retry logic might be causing the issue. We relaunch a process before it closes (or, with threads, we try to relaunch a thread before the current one releases the GIL, leading to a thread lock).

Tested again; with the start method forced to be fork, this problem goes away on macOS. Tested by mocking HTTP responses from the API.
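
For context, here is a minimal sketch of forcing the fork start method with Python's standard multiprocessing module, the mechanism referred to above; the call_api function is a hypothetical placeholder, and this is not the SDK's actual code.

import multiprocessing

def call_api(page_batch):
    # Hypothetical placeholder for the per-batch request to the partition endpoint.
    return len(page_batch)

if __name__ == "__main__":
    # "spawn" has been the default start method on macOS since Python 3.8;
    # forcing "fork" changes how the worker processes are created.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(call_api, [[1, 2], [3, 4]]))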

@micmarty-deepsense merged commit 0bd48a4 into main on May 21, 2024
7 checks passed
@micmarty-deepsense deleted the mike/page-split-adjustments branch on May 21, 2024 10:23