Commit c1c60c0

Add *About* and *Usage* sections to README
1 parent a45a266 commit c1c60c0

1 file changed

+293
-3
lines changed

README.md

Lines changed: 293 additions & 3 deletions
@@ -7,15 +7,305 @@ WordMaze

# About

WordMaze is a standardized format for text extracted from documents.

When designing [OCR](https://www.wikiwand.com/en/Optical_character_recognition) engines, developers have to decide how to give their clients the list of extracted textboxes, including their position on the page, the text they contain and the confidence associated with that extraction.

Many patterns arise in the wild, for instance:
```py
(x1, x2, y1, y2, text, confidence) # a flat tuple
((x1, y1), (x2, y2), text, confidence) # nested tuples
{'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence} # a dict
{'x': x1, 'y': y1, 'w': width, 'h': height, 'text': text, 'conf': confidence} # another dict
... # and many others
```
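
To make the mismatch concrete, here is a minimal sketch in plain Python (no WordMaze imports) of the glue code a client would otherwise have to write for each provider. The adapter names and the sample values are hypothetical and only serve as illustration:

```py
# Hypothetical adapters: each one converts an ad hoc format into the same
# (x1, x2, y1, y2, text, confidence) layout.

def from_flat_tuple(t):
    x1, x2, y1, y2, text, confidence = t
    return (x1, x2, y1, y2, text, confidence)

def from_nested_tuples(t):
    (x1, y1), (x2, y2), text, confidence = t
    return (x1, x2, y1, y2, text, confidence)

def from_xywh_dict(d):
    # width and height are turned into the second coordinate of each axis
    return (d['x'], d['x'] + d['w'], d['y'], d['y'] + d['h'], d['text'], d['conf'])

same_box = [
    from_flat_tuple((3, 14, 15, 92, 'hello', 0.9)),
    from_nested_tuples(((3, 15), (14, 92), 'hello', 0.9)),
    from_xywh_dict({'x': 3, 'y': 15, 'w': 11, 'h': 77, 'text': 'hello', 'conf': 0.9}),
]
assert len(set(same_box)) == 1  # all three inputs describe the same textbox
```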

With WordMaze, textboxes are defined using a unified interface:
```py
from wordmaze import TextBox

textbox = TextBox(
    x1=x1,
    x2=x2,
    y1=y1,
    y2=y2,
    text=text,
    confidence=confidence
)
# or
textbox = TextBox(
    x1=x,
    width=w,
    y1=y,
    height=h,
    text=text,
    confidence=conf
)
```

# Usage

Perhaps the best example of usage is [`pdfmap.PDFMaze`](https://github.com/elint-tech/pdfmap/blob/e5b3434a63729ba5a737201d93a146f2e0e5ad7a/pdfmap/pdfmaze.py), the first application of WordMaze in a public repository.

The exact expected behaviour of every piece of code in WordMaze can be checked in the [tests folder](https://github.com/elint-tech/wordmaze/tree/main/tests).

There are three main groups of objects defined in WordMaze: textboxes, pages and `WordMaze`s.

## Textboxes

### `Box`es

The first and most fundamental [(data)class](https://pypi.org/project/dataclassy/) is the `Box`, which contains only the positional information of a textbox inside a document's page:
```py
from wordmaze import Box

box1 = Box(x1=3, x2=14, y1=15, y2=92) # using coordinates
box2 = Box(x1=3, width=11, y1=15, height=77) # using coordinates and sizes
box3 = Box(x1=3, x2=14, y2=92, height=77) # mixing everything
```

We enforce `x1<=x2` and `y1<=y2` (if `x1>x2`, for instance, their values are automatically swapped upon initialization). Whether `(y1, y2)` means `(top, bottom)` or `(bottom, top)` depends on the context.
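
The idea of accepting any two of a start coordinate, an end coordinate and a size, then enforcing the ordering, can be sketched for a single axis as follows. `resolve_interval` is a hypothetical helper written for illustration; it is not WordMaze's implementation, which may validate its arguments differently:

```py
def resolve_interval(start=None, end=None, size=None):
    """Return (start, end) given any two of start, end and size."""
    if start is None and end is not None and size is not None:
        start = end - size
    elif end is None and start is not None and size is not None:
        end = start + size
    if start is None or end is None:
        raise ValueError('need at least two of start, end and size')
    # if both coordinates were given, size is ignored in this sketch
    if start > end:  # enforce start <= end by swapping
        start, end = end, start
    return start, end

print(resolve_interval(start=3, end=14))   # (3, 14)
print(resolve_interval(start=3, size=11))  # (3, 14)
print(resolve_interval(end=92, size=77))   # (15, 92)
print(resolve_interval(start=14, end=3))   # (3, 14)
```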

`Box`es have some interesting attributes to facilitate further calculation using them:
```py
from wordmaze import Box

box = Box(x1=1, x2=3, y1=10, y2=22)
# coordinates:
print(box.x1) # 1
print(box.x2) # 3
print(box.y1) # 10
print(box.y2) # 22
# sizes:
print(box.height) # 12
print(box.width) # 2
# midpoints:
print(box.xmid) # 2
print(box.ymid) # 16
```

### `TextBox`es

To include textual information in a textbox, use a `TextBox`:
```py
from wordmaze import TextBox

textbox = TextBox(
    # Box arguments:
    x1=3,
    x2=14,
    y1=15,
    height=77,
    # textual content:
    text='Dr. White.',
    # confidence with which this text was extracted:
    confidence=0.85 # 85% confidence
)
```

Note that `TextBox`es inherit from `Box`es, so you can inspect `.x1`, `.width` and so on as shown previously. In addition, they have two more properties:
```py
# textbox from the previous example
print(textbox.text) # Dr. White.
print(textbox.confidence) # 0.85
```

### `PageTextBox`es

If you also wish to include the page number from which your textbox was extracted, you can use a `PageTextBox`:
```py
from wordmaze import PageTextBox

textbox = PageTextBox(
    # TextBox arguments:
    x1=2,
    x2=10,
    y1=5,
    height=20,
    text='Sichermann and Sichelero are the same person!',
    confidence=0.6,
    # page info:
    page=3 # this textbox was extracted from the 4th page of the document
)
print(textbox.page) # 3
```

Note that page counting starts from `0`, as is common in Python, so page #3 is the 4th page of the document.

## Pages

### The basics

`Page`s are a representation of a document's page. They contain information regarding their size, their coordinate system's origin and their textboxes. For instance:
```py
from wordmaze import Page, Shape, Origin

page = Page(
    shape=Shape(height=210, width=297), # A4 landscape page size in mm
    origin=Origin.TOP_LEFT
)
print(page.shape.height) # 210
print(page.shape.width) # 297
print(page.origin) # Origin.TOP_LEFT
```

A `Page` is a [`MutableSequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.MutableSequence) of `TextBox`es:
```py
page = Page(
    shape=Shape(height=210, width=297), # A4 landscape page size in mm
    origin=Origin.TOP_LEFT,
    entries=[ # define textboxes at initialization
        TextBox(...),
        TextBox(...),
        ...
    ]
)

page.append(TextBox(...)) # list-like append

for textbox in page: # iteration
    assert isinstance(textbox, TextBox)

print(page[3]) # 4th textbox
```

### Different origins

There are two `Origin`s your page may have:
- `Origin.TOP_LEFT`: `y==0` means top, `y==page.shape.height` means bottom;
- `Origin.BOTTOM_LEFT`: `y==0` means bottom, `y==page.shape.height` means top.

If a textbox provider returns textboxes in `Origin.BOTTOM_LEFT` coordinates, but you'd like to have them in `Origin.TOP_LEFT` coordinates, you can use `Page.rebase` as follows:
```py
from wordmaze import Origin, Page, Shape, TextBox

bad_page = Page(
    shape=Shape(width=10, height=10),
    origin=Origin.BOTTOM_LEFT,
    entries=[
        TextBox(
            x1=2,
            x2=3,
            y1=7,
            y2=8,
            text='Lofi defi',
            confidence=0.99
        )
    ]
)

nice_page = bad_page.rebase(Origin.TOP_LEFT)
assert nice_page.shape == bad_page.shape # rebasing preserves page shape
print(nice_page[0].y1, nice_page[0].y2) # 2 3
```
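
The `2 3` printed in the example above can be checked by hand: flipping the vertical axis maps `y` to `page_height - y`, and since that reverses the ordering, the old `y2` becomes the new `y1`. A minimal sketch of this arithmetic, where `flip_vertical` is a hypothetical helper and not part of WordMaze:

```py
def flip_vertical(y1, y2, page_height):
    """Convert a vertical span between bottom-left and top-left origins."""
    new_y1 = page_height - y2  # the old upper bound becomes the new lower bound
    new_y2 = page_height - y1
    return new_y1, new_y2

print(flip_vertical(7, 8, page_height=10)) # (2, 3)

# flipping twice gives back the original coordinates
assert flip_vertical(*flip_vertical(7, 8, 10), 10) == (7, 8)
```

The same function works in both directions, because the conversion is its own inverse.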

### Transforming and filtering `TextBox`es

You can easily modify and filter out `TextBox`es contained in a `Page` using `Page.map` and `Page.filter`, which behave like [`map`](https://docs.python.org/3/library/functions.html#map) and [`filter`](https://docs.python.org/3/library/functions.html#filter) where the iterable is fixed and equal to the page's textboxes:
```py
from wordmaze import Page, TextBox

page = Page(...)

def pad(textbox: TextBox, horizontal, vertical) -> TextBox:
    return TextBox(
        x1=textbox.x1 - horizontal,
        x2=textbox.x2 + horizontal,
        y1=textbox.y1 - vertical,
        y2=textbox.y2 + vertical,
        text=textbox.text,
        confidence=textbox.confidence
    )

# get a new page with textboxes padded by 3 to the left and to the right
# and by 5 to the top and to the bottom
padded_page = page.map(lambda textbox: pad(textbox, horizontal=3, vertical=5))

# filter out textboxes with low confidence
good_page = padded_page.filter(lambda textbox: textbox.confidence >= 0.25)
```

`Page.map` and `Page.filter` also accept keywords. Each keyword names a textbox property and takes a function that receives that property's value and operates on it. This is easier to show in code: the previous padding and filtering can be equivalently written as:
```py
# get a new page with textboxes padded by 3 to the left and to the right
# and by 5 to the top and to the bottom
padded_page = page.map(
    x1=lambda x1: x1-3,
    x2=lambda x2: x2+3,
    y1=lambda y1: y1-5,
    y2=lambda y2: y2+5,
)

# filter out textboxes with low confidence
good_page = padded_page.filter(confidence=lambda conf: conf >= 0.25)
```
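
One way to picture the keyword form: each keyword names a field, and its function is applied to that field's value on every textbox. The sketch below mimics this on a plain dataclass. `SimpleBox`, `map_fields` and `filter_fields` are hypothetical stand-ins, and WordMaze's actual implementation may differ:

```py
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SimpleBox:  # hypothetical stand-in for a textbox
    x1: float
    x2: float
    confidence: float

def map_fields(boxes, **funcs):
    # build new boxes whose named fields are replaced by funcs[name](old value)
    return [
        replace(box, **{name: f(getattr(box, name)) for name, f in funcs.items()})
        for box in boxes
    ]

def filter_fields(boxes, **preds):
    # keep only the boxes for which every named predicate holds
    return [
        box for box in boxes
        if all(pred(getattr(box, name)) for name, pred in preds.items())
    ]

boxes = [SimpleBox(5, 9, 0.9), SimpleBox(20, 30, 0.1)]
padded = map_fields(boxes, x1=lambda x1: x1 - 3, x2=lambda x2: x2 + 3)
good = filter_fields(padded, confidence=lambda conf: conf >= 0.25)
print(good) # [SimpleBox(x1=2, x2=12, confidence=0.9)]
```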

### `tuple`s and `dict`s

You can also convert a page's textboxes to `tuple`s or `dict`s with `Page.tuples` and `Page.dicts`:
```py
page = Page(...)
for tpl in page.tuples():
    # prints a tuple in the form
    # (x1, x2, y1, y2, text, confidence)
    print(tpl)

for dct in page.dicts():
    # prints a dict in the form
    # {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence}
    print(dct)
```

## `WordMaze`s

The top-level class from WordMaze is, of course, a `WordMaze`. `WordMaze`s are simply sequences of `Page`s:
```py
from wordmaze import WordMaze

wm = WordMaze([
    Page(...),
    Page(...),
    ...
])

for page in wm: # iterating
    print(page.shape)

first_page = wm[0] # indexing
```

`WordMaze` objects also provide `WordMaze.map` and `WordMaze.filter` methods, which work the same way as `Page.map` and `Page.filter`.

If you wish to access the shapes of a `WordMaze`'s pages, there is the property `WordMaze.shapes`, which is a `tuple` satisfying `wm.shapes[N] == wm[N].shape`.

Additionally, you can iterate over a `WordMaze`'s textboxes in two ways:
```py
wm = WordMaze(...)

# 1
for page in wm:
    for textbox in page:
        print(textbox)

# 2
for textbox in wm.textboxes():
    print(textbox)
```
The main difference between #1 and #2 is that the textboxes in #1 are instances of `TextBox`, whereas the ones in #2 are `PageTextBox`es, which include the index of the page that contains them.
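
Conceptually, #2 flattens the nested loop of #1 while remembering each page's index. A rough equivalent on plain lists, using hypothetical data and no WordMaze objects:

```py
# each inner list stands in for the textboxes of one page
pages = [['a', 'b'], [], ['c']]

def iter_with_page(pages):
    for page_number, page in enumerate(pages):  # pages are counted from 0
        for textbox in page:
            yield textbox, page_number

print(list(iter_with_page(pages))) # [('a', 0), ('b', 0), ('c', 2)]
```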

`WordMaze` objects also have `WordMaze.tuples` and `WordMaze.dicts` methods, which behave just like their `Page` counterparts except that they also include each textbox's page number:
```py
wm = WordMaze(...)
for tpl in wm.tuples():
    # prints a tuple in the form
    # (x1, x2, y1, y2, text, confidence, page_number)
    print(tpl)

for dct in wm.dicts():
    # prints a dict in the form
    # {'x1': x1, 'x2': x2, 'y1': y1, 'y2': y2, 'text': text, 'confidence': confidence, 'page': page_number}
    print(dct)
```

# Installing

Install WordMaze from [PyPI](https://pypi.org/project/wordmaze/):
```
pip install wordmaze
```
