I recently ran exiftool on a bunch of TIFFs that we have in our Apache Tika regression corpus. I was interested to see that text (OCR'd or original) for the underlying document can be stored in what exiftool calls "MS Document Text", which is currently an unknown tag with value 0x932f. There's also "MS Property Set Storage" (0x9330).
An example file is here: https://corpora.tika.apache.org/base/docs/commoncrawl3/RD/RDAFESH5CBBJWWQZMZR4MGJIPYYEL7DN
This is what exiftool extracts from the file:
The exiftool dumps of the TIFFs are available as tiffs-*.gz here: https://corpora.tika.apache.org/base/share/
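In case it helps, here is a minimal sketch of how the two tags could be located by walking the TIFF IFD chain directly. It is not how exiftool or any particular library does it. The constant names simply follow exiftool's labels, and `find_ms_tags` is a hypothetical helper. It assumes a classic (non-BigTIFF) file, follows only the main IFD chain (no sub-IFDs), and returns the raw bytes, since I don't know the encoding of the payload.

```python
import struct

# Tag IDs as reported by exiftool; the names are exiftool's labels.
MS_DOCUMENT_TEXT = 0x932F
MS_PROPERTY_SET_STORAGE = 0x9330

# Size in bytes of each TIFF 6.0 field type.
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def find_ms_tags(data: bytes) -> dict:
    """Return {tag_id: raw_bytes} for the two MS tags, if present."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None or struct.unpack(order + "H", data[2:4])[0] != 42:
        raise ValueError("not a classic TIFF file")

    found = {}
    seen = set()  # guards against looping IFD chains in malformed files
    (ifd_offset,) = struct.unpack(order + "I", data[4:8])
    while ifd_offset and ifd_offset not in seen:
        seen.add(ifd_offset)
        (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
        for i in range(entry_count):
            entry = ifd_offset + 2 + 12 * i
            tag, field_type, count = struct.unpack_from(order + "HHI", data, entry)
            if tag not in (MS_DOCUMENT_TEXT, MS_PROPERTY_SET_STORAGE):
                continue
            # Unknown field types are treated as 1 byte per value (assumption).
            size = TYPE_SIZES.get(field_type, 1) * count
            if size <= 4:
                start = entry + 8  # value is stored inline in the entry
            else:
                (start,) = struct.unpack_from(order + "I", data, entry + 8)
            found.setdefault(tag, data[start:start + size])
        # The offset of the next IFD follows the last 12-byte entry.
        (ifd_offset,) = struct.unpack_from(
            order + "I", data, ifd_offset + 2 + 12 * entry_count)
    return found
```

Calling `find_ms_tags(open(path, "rb").read())` on a file from the corpus should give the raw payloads for whichever of the two tags are present, which can then be compared against the exiftool dumps.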