Request for info: support for multi-page tiffs #136

Open
evu opened this issue Sep 14, 2018 · 6 comments

@evu

evu commented Sep 14, 2018

Summary

  • Please confirm if support for multi-page TIFF files is present, perhaps using an option I cannot identify, or if this would require an enhancement.

  • When I extract text from a multi-page TIFF using Text() it only extracts the text from the first page of the TIFF.

  • When I extract text from a multi-page TIFF using the tesseract command line client with defaults it extracts all pages of text.

  • I looked at some of the tesseract source code for pixReadMem() and I noticed this here:

  • It looks like tesseract might do some additional preprocessing on the image prior to calling pixReadMem().
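
Since the command line client already handles every page, one possible stopgap is to shell out to it from Go. This is only a sketch under assumptions, not part of gosseract: it assumes a `tesseract` binary is on `PATH` and that it accepts `stdout` as the output base. The helper names `tesseractArgs` and `textFromAllPages` are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// tesseractArgs builds the CLI arguments for one image. Passing "stdout"
// as the output base is assumed to make tesseract print the recognized
// text instead of writing a .txt file.
func tesseractArgs(imagePath string) []string {
	return []string{imagePath, "stdout"}
}

// textFromAllPages (hypothetical helper) runs the tesseract binary and
// returns whatever text it prints for the given image.
func textFromAllPages(imagePath string) (string, error) {
	out, err := exec.Command("tesseract", tesseractArgs(imagePath)...).Output()
	if err != nil {
		return "", fmt.Errorf("running tesseract: %w", err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: ocrall <image>")
		return
	}
	text, err := textFromAllPages(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

This trades the in-process API for a subprocess per file, so it is a workaround rather than an answer to the question above.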

Reproducibility

Reproducibility Frequency

  • 100%

How to reproduce

  1. Get a multi-page TIFF file.
  2. Run tesseract 3+ on it from the command line like so:
tesseract multipage.tif multipage.tif
  3. Examine the output (multipage.tif.txt) and notice that text has been extracted from all pages of the TIFF.
  4. Next, set up a gosseract client and set the image using either SetImage() or SetImageFromBytes() on a multi-page .tif file. Extract text using Text().
client := gosseract.NewClient()
defer client.Close()

// client.SetImageFromBytes(*imgBytes)
client.SetImage("multipage.tif")
text, _ := client.Text()
fmt.Println(text)
  5. Examine the output in text. Notice that only the first page's text is returned.

Environment

uname -a
Darwin <<removed>> 17.7.0 Darwin Kernel Version 17.7.0 x86_64
go env
GOARCH="amd64"
GOBIN=""
GOCACHE="<<removed>>"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="<<removed>>"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=<<removed>>"
go version
go1.10.3 darwin/amd64
tesseract --version
tesseract 3.05.02
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
@otiai10 otiai10 self-assigned this Oct 23, 2018
@otiai10
Owner

otiai10 commented Oct 23, 2018

@evu Thanks. Could you share a multi-page TIFF file that I can use as an example for development?

@otiai10
Owner

otiai10 commented Nov 3, 2018

ping @evu

@evu
Author

evu commented Nov 5, 2018

http://www.nightprogrammer.org/wp-uploads/2013/02/multipage_tiff_example.tif

@otiai10
Owner

otiai10 commented Nov 5, 2018

thx

otiai10 added a commit that referenced this issue Nov 5, 2018
@filip-dahlberg

I'm having the same problem: when I try to extract text from a multi-page .tiff file, only the first page is extracted. The same problem also occurs with a .png file. Would appreciate any help :)
