Use mimetype detection as backup when extension not present #121

Open
wants to merge 4 commits into
base: master
from

Conversation

akoumjian
Contributor

Some background: my use case for textract is web scraping. I often download attachments that are not saved with the proper filename extension. When this happens, textract currently defaults to the txt parser, which usually fails because the files are actually PDFs, Word docs, etc.

By adding support for python-magic, which wraps libmagic (the library behind the Unix file command), I am able to correctly guess the type of roughly half of these files. That's still not where I want it, but it works. You'll see the tests are not yet passing because of this. However, I wanted to open the PR to discuss whether this is a good addition or not.
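
For readers unfamiliar with content-based detection, here is a minimal sketch of the idea behind tools like libmagic: compare the first bytes of a file against known signatures. The `SIGNATURES` table and the `sniff_type` helper are illustrative assumptions, not part of textract or python-magic, and real detection rules are far more extensive.

```python
from typing import Optional

# Illustrative signature table (assumption): well-known leading bytes
# mapped to a coarse label. This is NOT how python-magic is implemented.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # docx/xlsx/pptx/odt are zip archives
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),  # legacy doc/xls/ppt
]


def sniff_type(header: bytes) -> Optional[str]:
    """Return a coarse type label for the leading bytes of a file, or None."""
    for magic_bytes, label in SIGNATURES:
        if header.startswith(magic_bytes):
            return label
    return None


def sniff_file(path: str, n_bytes: int = 16) -> Optional[str]:
    """Read only the first few bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_type(handle.read(n_bytes))
```

Note that every zip-based format starts with the same bytes, so a check like this cannot tell a docx from an ordinary zip archive on its own. That is one reason content sniffing can be ambiguous.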

I've looked into other file type detection possibilities, and was surprised to discover that there aren't really any good heuristic-based file detection libraries out there.

If you think this is a good addition, I'd love suggestions on how to properly test / handle the fact that only about half of the filetypes are detected correctly. I could write very explicit tests per file that simply accept what is currently working and what isn't.

@akoumjian
Contributor Author

Also, the reason I took all the test files and put them in a data folder was to make it easy to crawl for all test files. I think it also makes things a bit neater having the tests on one level and the test data in a separate folder.

@deanmalmgren
Owner

I took a crack at this in #89 and @frbapolkosnik attempted something similar in #99. I certainly like the idea of doing something like this a lot. When I was experimenting with #89, I was really disappointed at how poorly the mimetype guessing worked in practice (not unlike what you discovered).

In your case, you mentioned "attachments that are not saved with the proper filename extension". That is particularly troubling because it would mean that textract essentially cannot trust the file extension that it has.

I wonder if it would make sense to open up a separate API endpoint like textract.process_unknown that could handle these situations that have unknown filetypes with a mime-detection approach. Would that make things cleaner and more transparent to end users? It would certainly avoid the danger of ignoring the file extension altogether...

@akoumjian
Contributor Author

Somehow I completely missed both of those PRs, oops!

Obviously it's your call, but the thing I like about textract's design is that it solves the 80% case and deals with triaging for the user. What do you think about the method I settled on, in which we use the extension if it exists and otherwise fall back to mimetype detection? Using this method in my personal project, I was able to detect 100% of the word / pdf documents I was downloading with Scrapy that had no extensions attached. So I think overall users who have extensionless files will still see a qualitative improvement in the results that they get.
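
The "extension first, mimetype as fallback" rule described above can be sketched roughly as follows. The function name `choose_extension`, the `detect_mime` callable and the `mime_to_ext` mapping are assumptions made for illustration; they are not textract's actual API.

```python
import os
from typing import Callable, Dict, Optional


def choose_extension(
    filename: str,
    detect_mime: Callable[[str], Optional[str]],
    mime_to_ext: Dict[str, str],
) -> Optional[str]:
    """Prefer the filename extension; use MIME detection only if it is missing."""
    _, ext = os.path.splitext(filename)
    if ext:
        # An extension is present, so it is trusted and detection is skipped.
        return ext.lower()
    # No extension: ask the detector and translate its answer, if any.
    mime = detect_mime(filename)
    return mime_to_ext.get(mime)


# A stub detector stands in for a real one so the example runs anywhere.
def stub_detector(path: str) -> Optional[str]:
    return "application/pdf"


mapping = {"application/pdf": ".pdf"}
print(choose_extension("report.docx", stub_detector, mapping))    # extension wins
print(choose_extension("download_1234", stub_detector, mapping))  # falls back to MIME
```

With python-magic installed, something like `lambda path: magic.from_file(path, mime=True)` could be passed as `detect_mime`. Injecting the detector keeps the fallback logic testable without real files.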

@deanmalmgren modified the milestone: v1.6.0 Nov 15, 2016
@deanmalmgren
Owner

@akoumjian I am spectacularly embarrassed by how stale I let this get. Eek...

With #138 in progress, it would be great to get this resolved as well if we can. These two PRs certainly complement one another. #138 gives people the ability to specify a parsing method whereas this PR gives us the ability to provide educated guesses based on the mimetype as a fallback method when the extension isn't available (I like this approach a lot!).

I have a couple of big picture comments that would be great to resolve to get this merged in:

  • remove logging. i know this is super convenient for debugging purposes, but adding logging to textract is an entirely different beast (PR would be welcome on that though)

  • MIME_MAPPING makes me a bit nervous for a few reasons, which makes me think that I might prefer adding a BaseParser.mimetype class attribute that is overridden by each subclass.

    1. I'm not sure where you came up with the list (kudos on your comprehensiveness!), but this seems rather brittle to updates to python-magic. Is this dictionary already embedded in some other package that we could use?
    2. It has non-unique keys in the dictionary, which means that later entries silently overwrite earlier ones. For example, 'application/x-bzip2' is in there twice with different values.
  • We need to add some tests so that we can monitor anything that might break along the way. The approach in #89 ("attempt at extensionless filename support"), specifically in commit 6b0690f, was way over the top. How might we create a few reasonable tests to make sure things are working properly in cases where the extension is not known?

  • play nicely with the changes in #138 ("let the user pass file's extension (optional parameter)"). I'm not exactly sure what to advise on this yet... More in the next few days hopefully.
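
To make the concern about duplicate keys and the class-attribute alternative concrete, here is a minimal sketch. The parser classes, the plural `mimetypes` attribute and `build_mime_registry` are hypothetical names chosen for illustration (the comment above only proposes a `BaseParser.mimetype` attribute), and the bzip2 values are made up.

```python
# A dict literal with a repeated key raises no error: the last value wins.
duplicated = {
    "application/x-bzip2": "bz2",   # made-up value, silently discarded
    "application/x-bzip2": "tbz2",  # made-up value, this one survives
}
print(duplicated)


class BaseParser:
    # Hypothetical attribute: the MIME types a parser can handle.
    mimetypes = ()


class PdfParser(BaseParser):
    mimetypes = ("application/pdf",)


class TxtParser(BaseParser):
    mimetypes = ("text/plain",)


def build_mime_registry(base=BaseParser):
    """Map each MIME type to its parser, failing loudly on collisions."""
    registry = {}
    # Note: __subclasses__() only returns direct subclasses of `base`.
    for parser in base.__subclasses__():
        for mime in parser.mimetypes:
            if mime in registry:
                raise ValueError(
                    f"{mime!r} is claimed by both "
                    f"{registry[mime].__name__} and {parser.__name__}"
                )
            registry[mime] = parser
    return registry


registry = build_mime_registry()
print(registry["application/pdf"].__name__)
```

Declaring the MIME types next to each parser keeps the mapping from drifting away from the parsers, and the explicit collision check turns a silent overwrite into an immediate error.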

@deanmalmgren
Owner

One more thing... based on the conflicting files, I think the merge will go a lot more smoothly if you rebase this off of master.

@akoumjian
Contributor Author

akoumjian commented Mar 24, 2017 via email

@deanmalmgren
Owner

hey @akoumjian, i just thought i'd let you know that #138 was merged into master in v1.6.0. excited to see this come together for the next version of textract :)

@jpweytjens
Contributor

The following code works nicely on my end. I'll try to include this in the textract code.

from . import exceptions

# source https://github.com/samuelneff/MimeTypeMap
mime_map = {
    # audio
    "audio/aac": "aac",
    "audio/mid": "midi",
    "audio/flac": "flac",
    "audio/wav": "wav",
    # csv
    "text/csv": "csv",
    # microsoft office
    "application/msword": "doc",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/vnd.ms-powerpoint": "ppt",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "pptx",
    "application/vnd.ms-excel": "xls",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    # html
    "text/html": "html",
    # images
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/bmp": "bmp",
    "image/x-png": "png",
    "image/tiff": "tiff",
    # latex
    "application/x-latex": "latex",
    "application/x-tex": "latex",
    # outlook
    "application/vnd.ms-outlook": "msg",
    # open document
    "application/vnd.oasis.opendocument.text-master": "odm",
    "application/vnd.oasis.opendocument.presentation": "odp",
    "application/vnd.oasis.opendocument.spreadsheet": "ods",
    "application/vnd.oasis.opendocument.text": "odt",
    # pdf
    "application/pdf": "pdf",
    # text
    "text/plain": "txt",
}


fileformat_synonyms = {
    # html
    "htm": "html",
    # images
    "jpeg": "jpg",
    "tff": "tiff",
    "tif": "tiff",
    # text
    "text": "txt",
}


def convert_mime_to_fileformat(
    mime, mime_map=mime_map, fileformat_synonyms=fileformat_synonyms
):
    """Return the normalized fileformat associated with the provided MIME type."""

    if mime is None:
        return None

    try:
        fileformat = mime_map[mime]
        return fileformat
    except KeyError:
        # suppress the KeyError context so users only see the domain-specific error
        raise exceptions.MIMEtypeError(
            f"The fileformat corresponding to the MIME type '{mime}' cannot be determined."
        ) from None


def normalize_fileformat(fileformat, fileformat_synonyms=fileformat_synonyms):
    """Normalize a given fileformat, i.e. replace any known synonym with its normalized version."""

    if fileformat is None:
        return None

    # remove the leading . if fileformat is an extension
    fileformat = fileformat.lstrip(".")

    try:
        normal_fileformat = fileformat_synonyms[fileformat]
    except KeyError:
        normal_fileformat = fileformat

    return normal_fileformat


def determine_fileformat(filename, extension, fileformat, mime_fileformat):
    """Determine the fileformat based on the extension, the manually provided fileformat and the detected MIME type."""

    extension = normalize_fileformat(extension)
    fileformat = normalize_fileformat(fileformat)
    mime_fileformat = convert_mime_to_fileformat(mime_fileformat)

    # case 1: only a filename with extension is provided
    if fileformat is None and mime_fileformat is None and extension is not None:
        return extension

    # case 2: only a filename without extension is provided
    if fileformat is None and mime_fileformat is None and extension is None:
        raise exceptions.UnknownFileformatError(
            f"The fileformat of '{filename.name}' cannot be determined. You should explicitly pass a fileformat or enable MIME type detection."
        )

    # case 3: a fileformat is provided, overriding any extensions of the file
    if fileformat is not None and mime_fileformat is None:
        return fileformat

    # case 4: no fileformat is provided, so the fileformat associated with the detected MIME type is used (it takes precedence over any extension)
    if fileformat is None and mime_fileformat is not None:
        return mime_fileformat

    # case 5: both fileformat and MIME type detection is used and we require these to give the same result
    if fileformat is not None and mime_fileformat is not None:
        if fileformat == mime_fileformat:
            return fileformat
        else:
            raise exceptions.FileformatError(
                f"The specified fileformat '{fileformat}' does not correspond with the detected fileformat '{mime_fileformat}'. Automatic detection of fileformats can be wrong, however. If you think this is the case, you can disable it with mime_detection=False."
            )
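
The snippet above raises three exception classes from the `exceptions` module. Assuming they are not defined there yet, they could be sketched as below. The shared base class `FormatDetectionError` is a hypothetical addition for illustration, not something named in the snippet or confirmed to exist in textract.

```python
class FormatDetectionError(Exception):
    """Hypothetical common base so callers can catch every detection failure."""


class MIMEtypeError(FormatDetectionError):
    """The detected MIME type has no known fileformat."""


class UnknownFileformatError(FormatDetectionError):
    """No extension, explicit fileformat, or MIME type is available."""


class FileformatError(FormatDetectionError):
    """The explicit fileformat and the detected fileformat disagree."""


# Catching the base class handles all three cases in one place.
try:
    raise MIMEtypeError("no fileformat known for 'application/x-example'")
except FormatDetectionError as error:
    message = str(error)
    print(message)
```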
