# About
WordMaze is a standardized format for text extracted from documents.

When designing [OCR](https://www.wikiwand.com/en/Optical_character_recognition) engines, developers have to decide how to give their clients the list of extracted textboxes, including their position on the page, the text they contain and the confidence associated with that extraction.

With WordMaze, textboxes are defined using a unified interface:
```py
from wordmaze import TextBox

textbox = TextBox(
    x1=x1,
    x2=x2,
    y1=y1,
    y2=y2,
    text=text,
    confidence=confidence
)
# or
textbox = TextBox(
    x1=x,
    width=w,
    y1=y,
    height=h,
    text=text,
    confidence=conf
)
```

# Usage

Perhaps the best example of usage is [`pdfmap.PDFMaze`](https://github.com/elint-tech/pdfmap/blob/e5b3434a63729ba5a737201d93a146f2e0e5ad7a/pdfmap/pdfmaze.py), the first application of WordMaze in a public repository.

The exact expected behaviour of every piece of code in WordMaze can be checked in the [tests folder](https://github.com/elint-tech/wordmaze/tree/main/tests).

There are three main groups of objects defined in WordMaze:

## Textboxes

### `Box`es

The first and most fundamental [(data)class](https://pypi.org/project/dataclassy/) is the `Box`, which contains only positional information of a textbox inside a document's page:
```py
from wordmaze import Box

box1 = Box(x1=3, x2=14, y1=15, y2=92)  # using coordinates
box2 = Box(x1=3, width=11, y1=15, height=77)  # using coordinates and sizes
```

We enforce `x1<=x2` and `y1<=y2` (if `x1>x2`, for instance, their values are automatically swapped upon initialization). Whether `(y1, y2)` means `(top, bottom)` or `(bottom, top)` depends on the context.

`Box`es expose a few attributes that simplify further calculations:
```py
from wordmaze import Box

box = Box(x1=1, x2=3, y1=10, y2=22)
# coordinates:
print(box.x1)  # 1
print(box.x2)  # 3
print(box.y1)  # 10
print(box.y2)  # 22
# sizes:
print(box.height)  # 12
print(box.width)  # 2
# midpoints:
print(box.xmid)  # 2
print(box.ymid)  # 16
```
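
To make the rules above concrete, here is a minimal, self-contained sketch of a `Box`-like class. It is not WordMaze's actual implementation: `SketchBox` and `from_size` are hypothetical names, and the sketch uses the standard library's `dataclasses` rather than the library's own machinery. It only illustrates coordinate swapping, construction from sizes and the derived attributes.

```python
from dataclasses import dataclass


@dataclass
class SketchBox:
    """Hypothetical stand-in for `Box`; not the real WordMaze class."""

    x1: float
    x2: float
    y1: float
    y2: float

    def __post_init__(self) -> None:
        # Swap coordinates so that x1 <= x2 and y1 <= y2 always hold.
        if self.x1 > self.x2:
            self.x1, self.x2 = self.x2, self.x1
        if self.y1 > self.y2:
            self.y1, self.y2 = self.y2, self.y1

    @classmethod
    def from_size(cls, x1: float, width: float, y1: float, height: float) -> "SketchBox":
        # Sizes are converted to the second pair of coordinates.
        return cls(x1=x1, x2=x1 + width, y1=y1, y2=y1 + height)

    @property
    def width(self) -> float:
        return self.x2 - self.x1

    @property
    def height(self) -> float:
        return self.y2 - self.y1

    @property
    def xmid(self) -> float:
        return (self.x1 + self.x2) / 2

    @property
    def ymid(self) -> float:
        return (self.y1 + self.y2) / 2


swapped = SketchBox(x1=14, x2=3, y1=15, y2=92)  # x1 > x2 gets swapped
sized = SketchBox.from_size(x1=3, width=11, y1=15, height=77)
print(swapped == sized)  # True
print(sized.xmid, sized.ymid)  # 8.5 53.5
```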

### `TextBox`es

To include textual information in a textbox, use a `TextBox`:
```py
from wordmaze import TextBox

textbox = TextBox(
    # Box arguments:
    x1=3,
    x2=14,
    y1=15,
    height=77,
    # textual content:
    text='Dr. White.',
    # confidence with which this text was extracted:
    confidence=0.85  # 85% confidence
)
```

Note that `TextBox`es inherit from `Box`es, so you can inspect `.x1`, `.width` and so on as shown previously. Moreover, you have two more properties:
```py
# textbox from the previous example
print(textbox.text)  # Dr. White.
print(textbox.confidence)  # 0.85
```

### `PageTextBox`es

If you also wish to include the page number from which your textbox was extracted, you can use a `PageTextBox`:
```py
from wordmaze import PageTextBox

textbox = PageTextBox(
    # TextBox arguments:
    x1=2,
    x2=10,
    y1=5,
    height=20,
    text='Sichermann and Sichelero are the same person!',
    confidence=0.6,
    # page info:
    page=3  # this textbox was extracted from the 4th page of the document
)
print(textbox.page)  # 3
```

Note that page counting starts from `0`, as is common in Python, so page #3 is the 4th page of the document.
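
The relationship between the three textbox classes, and the zero-based page index, can be sketched with plain standard-library dataclasses. This is only an illustration under stated assumptions: the `Sketch*` classes and `human_page_number` are hypothetical names, not part of WordMaze.

```python
from dataclasses import dataclass


@dataclass
class SketchBox:
    """Positional information only (hypothetical stand-in for `Box`)."""

    x1: float
    x2: float
    y1: float
    y2: float


@dataclass
class SketchTextBox(SketchBox):
    """Adds text and confidence (hypothetical stand-in for `TextBox`)."""

    text: str
    confidence: float


@dataclass
class SketchPageTextBox(SketchTextBox):
    """Adds a zero-based page index (hypothetical stand-in for `PageTextBox`)."""

    page: int


def human_page_number(textbox: SketchPageTextBox) -> int:
    # Convert the zero-based index into the 1-based number a reader would use.
    return textbox.page + 1


ptb = SketchPageTextBox(x1=2, x2=10, y1=5, y2=25, text='hello', confidence=0.6, page=3)
print(isinstance(ptb, SketchBox))  # True
print(human_page_number(ptb))  # 4
```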

## Pages

### The basics

`Page`s are a representation of a document's page. They contain information regarding their size, their coordinate system's origin and their textboxes. For instance:
```py
from wordmaze import Page, Shape, Origin

page = Page(
    shape=Shape(height=210, width=297),  # A4 page size in mm (landscape)
    origin=Origin.TOP_LEFT
)
print(page.shape.height)  # 210
print(page.shape.width)  # 297
print(page.origin)  # Origin.TOP_LEFT
```

A `Page` is a [`MutableSequence`](https://docs.python.org/3/library/collections.abc.html#collections.abc.MutableSequence) of `TextBox`es:
```py
from wordmaze import Origin, Page, Shape, TextBox

page = Page(
    shape=Shape(height=210, width=297),  # A4 page size in mm (landscape)
    origin=Origin.TOP_LEFT,
    entries=[  # define textboxes at initialization
        TextBox(...),
        TextBox(...),
        ...
    ]
)

page.append(TextBox(...))  # list-like append

for textbox in page:  # iteration
    assert isinstance(textbox, TextBox)

print(page[3])  # 4th textbox
```
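
For readers unfamiliar with `MutableSequence`, the sketch below shows how a class gains list-like behaviour from `collections.abc` by implementing five methods. `SketchPage` is a hypothetical stand-in that stores plain strings instead of `TextBox`es; it is not how WordMaze's `Page` is necessarily implemented.

```python
from collections.abc import MutableSequence


class SketchPage(MutableSequence):
    """Hypothetical list-backed container; not the real WordMaze `Page`."""

    def __init__(self, entries=()):
        self._entries = list(entries)

    # The five abstract methods MutableSequence requires:
    def __getitem__(self, index):
        return self._entries[index]

    def __setitem__(self, index, value):
        self._entries[index] = value

    def __delitem__(self, index):
        del self._entries[index]

    def __len__(self):
        return len(self._entries)

    def insert(self, index, value):
        self._entries.insert(index, value)


page = SketchPage(['first', 'second'])
page.append('third')      # mixin method built on insert()
page.extend(['fourth'])   # mixin method built on append()
print(len(page))  # 4
print(page[3])  # fourth
print(list(reversed(page)))  # ['fourth', 'third', 'second', 'first']
```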

### Different origins

There are two `Origin`s your page may have:
- `Origin.TOP_LEFT`: `y==0` means top, `y==page.shape.height` means bottom;
- `Origin.BOTTOM_LEFT`: `y==0` means bottom, `y==page.shape.height` means top.

If a textbox provider returns textboxes in `Origin.BOTTOM_LEFT` coordinates but you would like to have them in `Origin.TOP_LEFT` coordinates, you can use `Page.rebase`.
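
The arithmetic behind switching between the two origins is a vertical flip around the page height. The sketch below shows only that calculation. `flip_vertical` is a hypothetical helper, not part of WordMaze, and the actual signature of `Page.rebase` is not shown in this excerpt.

```python
from typing import Tuple


def flip_vertical(y1: float, y2: float, page_height: float) -> Tuple[float, float]:
    """Convert a (y1, y2) pair between TOP_LEFT and BOTTOM_LEFT coordinates.

    Hypothetical helper: each y becomes page_height - y, and the pair is
    reordered so that the first value is still the smaller one.
    """
    return page_height - y2, page_height - y1


flipped = flip_vertical(y1=15, y2=92, page_height=297)
print(flipped)  # (205, 282)

# Flipping twice gives back the original coordinates.
print(flip_vertical(*flipped, page_height=297))  # (15, 92)
```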

You can easily modify and filter the `TextBox`es contained in a `Page` using `Page.map` and `Page.filter`, which behave like [`map`](https://docs.python.org/3/library/functions.html#map) and [`filter`](https://docs.python.org/3/library/functions.html#filter) with the iterable fixed to the page's textboxes.

`Page.map` and `Page.filter` also accept keywords. Each keyword takes a function that receives the respective property and operates on it. This is easier to show in code. The previous padding and filtering can be equivalently written as:
```py
# get a new page with textboxes padded by 3 to the left and to the right
```
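
As a rough illustration of the keyword style described above, the following self-contained sketch applies per-property functions to a list of simplified textboxes. Everything in it is an assumption made for illustration: `SketchTextBox`, `map_boxes` and `filter_boxes` are hypothetical names, the `0.5` confidence threshold is arbitrary, and WordMaze's real `Page.map` and `Page.filter` may differ in signature and behaviour.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchTextBox:
    """Simplified, hypothetical textbox with only the fields used below."""

    x1: float
    x2: float
    text: str
    confidence: float


def map_boxes(boxes, **transforms):
    """Return new boxes with each named property passed through its function."""
    return [
        replace(box, **{name: fn(getattr(box, name)) for name, fn in transforms.items()})
        for box in boxes
    ]


def filter_boxes(boxes, **predicates):
    """Keep only the boxes whose named properties satisfy every predicate."""
    return [
        box
        for box in boxes
        if all(pred(getattr(box, name)) for name, pred in predicates.items())
    ]


boxes = [
    SketchTextBox(x1=10, x2=20, text='keep', confidence=0.9),
    SketchTextBox(x1=30, x2=45, text='drop', confidence=0.2),
]

# pad by 3 to the left and to the right
padded = map_boxes(boxes, x1=lambda x1: x1 - 3, x2=lambda x2: x2 + 3)
# keep only textboxes at or above an (arbitrary) confidence threshold
confident = filter_boxes(padded, confidence=lambda conf: conf >= 0.5)
print([(box.x1, box.x2, box.text) for box in confident])  # [(7, 23, 'keep')]
```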

## WordMazes

The top-level class of WordMaze is, of course, `WordMaze`. A `WordMaze` is simply a sequence of `Page`s:
```py
from wordmaze import Page, WordMaze

wm = WordMaze([
    Page(...),
    Page(...),
    ...
])

for page in wm:  # iterating
    print(page.shape)

first_page = wm[0]  # indexing
```

`WordMaze` objects also provide `WordMaze.map` and `WordMaze.filter`, which work the same way as `Page.map` and `Page.filter`.

To access the shapes of a `WordMaze`'s pages, use the `WordMaze.shapes` property, which is a `tuple` satisfying `wm.shapes[N] == wm[N].shape`.

Additionally, you can iterate over a `WordMaze`'s textboxes in two ways:
```py
wm = WordMaze(...)

# 1
for page in wm:
    for textbox in page:
        print(textbox)

# 2
for textbox in wm.textboxes():
    print(textbox)
```
The main difference between #1 and #2 is that the textboxes in #1 are instances of `TextBox`, whereas those in #2 are `PageTextBox`es that also carry the index of their containing page.
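
The second iteration style amounts to flattening the pages while remembering which page each textbox came from. A minimal sketch of that idea, using nested lists of strings in place of real pages and textboxes (`iter_with_page` is a hypothetical helper, not a WordMaze function):

```python
from typing import Iterator, List, Tuple


def iter_with_page(pages: List[List[str]]) -> Iterator[Tuple[int, str]]:
    """Yield (page_index, textbox) pairs across all pages, in order."""
    for page_index, page in enumerate(pages):
        for textbox in page:
            yield page_index, textbox


pages = [['title', 'author'], [], ['footnote']]
flattened = list(iter_with_page(pages))
print(flattened)  # [(0, 'title'), (0, 'author'), (2, 'footnote')]
```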

`WordMaze` objects also have `WordMaze.tuples` and `WordMaze.dicts`, which behave just like their `Page` counterparts except that they also return the page number: