Add tags for MS Document Text, MS Property Storage #579

Open
tballison opened this issue Jun 2, 2022 · 0 comments
Labels
format-tiff, help wanted, image-queue (Actionable issue with sample image)

Comments

@tballison
Contributor

I recently ran exiftool on a set of TIFFs from our regression corpus on Apache Tika. I was interested to see that text for the underlying document (OCR'd or original) can be stored in what exiftool calls "MS Document Text", which is currently an unknown tag with ID 0x932f. There's also "MS Property Set Storage" (0x9330).
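To make the tag IDs above concrete, here is a minimal Python sketch that walks the main IFD chain of a classic (non-BigTIFF) TIFF and reports any entries with ID 0x932f or 0x9330. It is only an illustration, not how exiftool or this library does it: the function name `find_ms_tags` and the returned tuple layout are made up for this example, it does not follow SubIFDs, and it does not decode the payloads.

```python
import struct

# Tag IDs as reported in this issue; the names are the ones exiftool prints.
MS_TAGS = {
    0x932F: "MS Document Text",
    0x9330: "MS Property Set Storage",
}


def find_ms_tags(data: bytes) -> dict:
    """Return {tag name: (field type, count, raw value/offset field)}.

    Hypothetical helper: scans only the main IFD chain of a classic TIFF.
    The third element is the raw 4-byte value field, which is an offset
    into the file whenever the payload does not fit in 4 bytes.
    """
    # Bytes 0-1 select the byte order: "II" little-endian, "MM" big-endian.
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF: bad byte-order mark")

    # Bytes 2-3 are the magic number 42, bytes 4-7 the first IFD offset.
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF: magic number is not 42")

    found = {}
    seen = set()  # guards against looping IFD chains in corrupt files
    while ifd_offset and ifd_offset not in seen:
        seen.add(ifd_offset)
        (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
        for i in range(entry_count):
            # Each IFD entry is 12 bytes: tag, type, count, value/offset.
            tag, field_type, count, value = struct.unpack_from(
                order + "HHII", data, ifd_offset + 2 + 12 * i
            )
            if tag in MS_TAGS:
                found[MS_TAGS[tag]] = (field_type, count, value)
        # The 4 bytes after the last entry point to the next IFD (0 = end).
        (ifd_offset,) = struct.unpack_from(
            order + "I", data, ifd_offset + 2 + 12 * entry_count
        )
    return found
```

For a file on disk, something like `find_ms_tags(open(path, "rb").read())` would list which of the two tags are present and how large their payloads are, which is enough to confirm whether a sample image carries them.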

An example file is here: https://corpora.tika.apache.org/base/docs/commoncrawl3/RD/RDAFESH5CBBJWWQZMZR4MGJIPYYEL7DN

This is what exiftool extracts from the file:

ExifTool Version Number         : 12.42
File Name                       : RDAFESH5CBBJWWQZMZR4MGJIPYYEL7DN
Directory                       : /data1/docs/commoncrawl3/RD
File Size                       : 38 kB
File Modification Date/Time     : 2018:11:05 02:38:44+01:00
File Access Date/Time           : 2022:06:01 15:25:59+02:00
File Inode Change Date/Time     : 2020:06:10 23:11:36+02:00
File Permissions                : -rwxr-xr-x
File Type                       : TIFF
File Type Extension             : tif
MIME Type                       : image/tiff
Exif Byte Order                 : Little-endian (Intel, II)
Image Width                     : 1760
Image Height                    : 2800
Bits Per Sample                 : 1
Compression                     : T6/Group 4 Fax
Photometric Interpretation      : WhiteIsZero
Strip Offsets                   : 8
Samples Per Pixel               : 1
Rows Per Strip                  : 2800
Strip Byte Counts               : 23737
X Resolution                    : 200
Y Resolution                    : 200
Resolution Unit                 : inches
Software                        :  HATFILT Version 1.8
Subfile Type                    : Reduced-resolution image
Preview Image Start             : 27248
Preview Image Length            : 5225
JPEG Proc                       : Baseline
Jpg From Raw Start              : 27248
Jpg From Raw Length             : 5225
MS Document Text                : .d.CÂMARA. ..MUNICIPAL DE VARGEM ALTA. ..ESTADO DO ESPíRITO SANTO. .DECRETO LEGISLATIVO N° 032197. ..APROVA AS CONTAS. .MUNICIPAL DE VARGEM. .ESPíRITO SANTO,. .ExERCIdo DE 1996.. ..DA PREFEITURA. .ALTA, EST>
MS Property Set Storage         : (Binary data 5632 bytes, use -b option to extract)
MS Document Text Position       : (Binary data 2110 bytes, use -b option to extract)
Image Size                      : 1760x2800
Jpg From Raw                    : (Binary data 5225 bytes, use -b option to extract)
Megapixels                      : 4.9
Preview Image                   : (Binary data 5225 bytes, use -b option to extract)

The exiftool dumps of the tiffs are available as tiffs-*.gz here: https://corpora.tika.apache.org/base/share/

@drewnoakes added the help wanted, format-tiff, and image-queue (Actionable issue with sample image) labels on Jun 5, 2022