Request for info: support for multi-page tiffs #136

evu · 2018-09-14T19:44:14Z

Summary

Please confirm whether multi-page TIFF files are supported, perhaps through an option I have not found, or whether this would require an enhancement.
When I extract text from a multi-page TIFF using Text(), only the text from the first page is returned.
When I run the tesseract command-line client with default settings on the same file, it extracts the text from all pages.
I looked at the tesseract source code around pixReadMem() and noticed this:
- https://tesseract-ocr.github.io/3.x/a00680_source.html#l01173
It looks like tesseract might do some additional preprocessing on the image prior to calling pixReadMem().
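
A possible stopgap while this is open, given that the command-line client does read every page, is to shell out to it from Go. This is only a sketch: the helper name tesseractCommand is hypothetical, it assumes a tesseract binary is on PATH, and it relies on the CLI convention that an output base of "stdout" prints the recognized text instead of writing a file.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// tesseractCommand (hypothetical helper) builds, but does not run, a
// tesseract CLI invocation for imagePath. Passing "stdout" as the output
// base asks the CLI to print the recognized text rather than write a file.
func tesseractCommand(imagePath string) *exec.Cmd {
	return exec.Command("tesseract", imagePath, "stdout")
}

func main() {
	cmd := tesseractCommand("multipage.tif")
	cmd.Stderr = os.Stderr // let tesseract's own diagnostics through

	// Output runs the command and returns everything written to stdout.
	out, err := cmd.Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "tesseract failed:", err)
		os.Exit(1)
	}
	fmt.Print(string(out))
}
```

This bypasses gosseract entirely, so it does not answer the question of whether Text() can be made to handle multiple pages; it only reuses the behavior already observed on the command line.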

Reproducibility

Reproducibility Frequency

100%

How to reproduce

Get a multipage tiff file.
Run tesseract 3+ on it from the command line like so:

tesseract multipage.tif multipage.tif

Examine the output (multipage.tif.txt) and notice that text has been extracted from all pages of the TIFF.
Next, set up a gosseract client and set the image using either SetImage() or SetImageFromBytes() on a multi-page .tif file. Extract text using Text().

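client := gosseract.NewClient()
defer client.Close()

// client.SetImageFromBytes(*imgBytes) shows the same behavior.
client.SetImage("multipage.tif")
text, err := client.Text()
if err != nil {
	log.Fatal(err) // surface OCR errors instead of discarding them
}
fmt.Println(text)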

Examine the value of text and notice that only the first page's text is returned.
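
To rule out the input file as the cause, it can help to confirm that the TIFF really contains more than one page. The sketch below is stdlib-only and independent of gosseract: it walks the image file directory (IFD) chain of a classic TIFF and counts the entries. The function name countTIFFPages is hypothetical, BigTIFF is not handled, and the IFD count is only an approximation of the page count, since some files also store thumbnails as separate IFDs.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
)

// countTIFFPages (hypothetical helper) walks the IFD chain of a classic,
// non-BigTIFF file and returns the number of IFDs it finds.
func countTIFFPages(data []byte) (int, error) {
	if len(data) < 8 {
		return 0, errors.New("too short to be a TIFF")
	}
	// The first two bytes select the byte order for the rest of the file.
	var order binary.ByteOrder
	switch string(data[:2]) {
	case "II":
		order = binary.LittleEndian
	case "MM":
		order = binary.BigEndian
	default:
		return 0, errors.New("missing TIFF byte-order mark")
	}
	if order.Uint16(data[2:4]) != 42 {
		return 0, errors.New("not a classic TIFF (magic number is not 42)")
	}

	pages := 0
	seen := map[uint32]bool{} // guards against a looping IFD chain
	offset := order.Uint32(data[4:8])
	for offset != 0 {
		if seen[offset] {
			return 0, errors.New("IFD chain loops")
		}
		seen[offset] = true

		// Each IFD is a 2-byte entry count, 12 bytes per entry,
		// then a 4-byte offset to the next IFD (0 ends the chain).
		off := uint64(offset)
		if off+2 > uint64(len(data)) {
			return 0, errors.New("IFD offset out of range")
		}
		entries := uint64(order.Uint16(data[off : off+2]))
		next := off + 2 + 12*entries
		if next+4 > uint64(len(data)) {
			return 0, errors.New("truncated IFD")
		}
		pages++
		offset = order.Uint32(data[next : next+4])
	}
	return pages, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: tiffpages FILE")
		os.Exit(2)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	n, err := countTIFFPages(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: %d IFD(s)\n", os.Args[1], n)
}
```

If this reports more than one IFD for the file passed to SetImage() while Text() still returns a single page of text, the discrepancy is on the library side rather than in the input.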

Environment

uname -a
Darwin <<removed>> 17.7.0 Darwin Kernel Version 17.7.0 x86_64

go env
GOARCH="amd64"
GOBIN=""
GOCACHE="<<removed>>"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="<<removed>>"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=<<removed>>"

go version
go1.10.3 darwin/amd64

tesseract --version
tesseract 3.05.02
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11


otiai10 · 2018-10-23T03:04:07Z

@evu Thanks. Could you share a multi-page TIFF file as an example for development?

otiai10 · 2018-11-03T17:06:31Z

ping @evu

evu · 2018-11-05T13:08:30Z

http://www.nightprogrammer.org/wp-uploads/2013/02/multipage_tiff_example.tif

otiai10 · 2018-11-05T13:43:05Z

thx

otiai10 · 2018-11-05T16:08:21Z

filip-dahlberg · 2021-11-29T06:15:31Z

I'm having the same problem: when I try to extract text from a multi-page .tiff file, only the first page is extracted. The same problem also occurs with a .png file. I would appreciate any help :)

otiai10 self-assigned this Oct 23, 2018

otiai10 added the enhancement label Oct 23, 2018

otiai10 added a commit that referenced this issue Nov 5, 2018

Add example tiff file for #136

9ff73e7
