Add callback to optionally "repair" fields #24

Open
mjl opened this issue Aug 20, 2020 · 5 comments
Labels
CHECKER MRZ.CHECKER Issues enhancement New feature or request

Comments

@mjl

mjl commented Aug 20, 2020

I'm not really sure this functionality belongs here, but as the knowledge of the MRZ internal structure is only present in this module, why not... let me know what you think!

I work with scanned MRZ, and as comes with the process, the OCR sometimes mis-reads similar characters. For example, I have seen countries read as "R0U" or a name "SZ0BO5ZLAI". And the MRZ checker correctly warns that the nationality or the identifier is not valid. However, if you could add a method repair() to the checkers

def __init__(self, mrz_code: str, check_expiry=False, compute_warnings=False, precheck=True):
    precheck and check.precheck("TD1", mrz_code, 92)
    lines = mrz_code.splitlines()
    # Every field passes through repair() before it is stored
    self._document_type = self.repair('document type', lines[0][0: 2])
    self._country = self.repair('country', lines[0][2: 5])
    [...]

def repair(self, field_name: str, field_content: str):
    # Default hook: return the field unchanged; subclasses can override it
    return field_content

that would allow me to do things like:

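class MyChecker(TD1CodeChecker):
    def repair(self, name, content):
        if name in ('country', 'identifier', ...):
            # I know those can only contain alphas
            return self.replace_often_mistaken_numbers_by_alphas(content)

        if name in ('expiry date', 'birth date'):
            # Dates can only contain digits
            return self.replace_often_mistaken_alphas_by_numbers(content)

        # Leave every other field untouched
        return content

    def replace_often_mistaken_numbers_by_alphas(self, s):
        return s.replace('5', 'S').replace('1', 'I').replace('0', 'O')

    def replace_often_mistaken_alphas_by_numbers(self, s):
        # Inverse of the mapping above
        return s.replace('S', '5').replace('I', '1').replace('O', '0')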

This would make the checker more useful when presented with badly scanned data.

The alternative would be that I somehow preprocess the MRZ, but then I would have to re-implement the MRZ structure definition in my code too. As said above, I'm not a big fan of shoehorning that functionality into this module, but I don't see any other place that has enough knowledge of the MRZ structure.
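To make that alternative concrete, here is a rough sketch of what such external preprocessing could look like. It does not use the mrz package at all: the `TD1_FIELDS` table, the `preprocess_td1` name and the three-character confusion mapping are my own assumptions, and the field positions are copied from the TD1 layout in ICAO Doc 9303, which is exactly the duplication I would like to avoid.

```python
# Hypothetical standalone preprocessing, independent of the mrz package.
# Only the most commonly confused characters are mapped (an assumption).
DIGIT_TO_ALPHA = str.maketrans("015", "OIS")
ALPHA_TO_DIGIT = str.maketrans("OIS", "015")

# field name -> (line index, start, end, translation table)
# Positions follow the TD1 layout (3 lines of 30 characters).
TD1_FIELDS = {
    "country": (0, 2, 5, DIGIT_TO_ALPHA),
    "birth date": (1, 0, 6, ALPHA_TO_DIGIT),
    "expiry date": (1, 8, 14, ALPHA_TO_DIGIT),
    "nationality": (1, 15, 18, DIGIT_TO_ALPHA),
    "identifier": (2, 0, 30, DIGIT_TO_ALPHA),
}


def preprocess_td1(mrz_code: str) -> str:
    """Repair the alpha-only and digit-only fields of a TD1 MRZ in place."""
    lines = mrz_code.splitlines()
    for line_no, start, end, table in TD1_FIELDS.values():
        line = lines[line_no]
        repaired = line[start:end].translate(table)
        lines[line_no] = line[:start] + repaired + line[end:]
    return "\n".join(lines)
```

The result could then be passed to TD1CodeChecker as usual, but every document type would need its own position table, which the checker classes already encode.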

@Arg0s1080
Owner

Arg0s1080 commented Aug 21, 2020

Hi!

Yeah... that functionality should be out of the scope of the project, but heck! Why not? In fact, almost everything in mrz.checker is already off target xDD

Because almost the whole project (especially checker) was built from other people's requests plus some ideas of mine (some very bad), I now realize that I should have planned many things differently. I'm actually trying to fix some of those bad ideas now, specifically the horrible _Report class.

Please give me a few days to finish what I'm doing with checker and we'll see what we can do.

I don't know what you will think, but one option could be to let the user transliterate chosen characters with a dict, in the same way mrz.generator does for surnames and given names.

Something like this:

def __init__(self, mrz_code: str, check_expiry=False, compute_warnings=False, ocr_transliteration=None):
    """
    Params:
        mrz_code             (str):  MRZ string of TD1's. Must be 90 uppercase characters long
        check_expiry        (bool):  If set to True, it is verified and reported as a warning that the
                                     document is not expired and that expiry_date is not more than
                                     10 years in the future
        compute_warnings    (bool):  If set to True, warnings compute as False
        ocr_transliteration (dict):  Transliteration dictionary for OCR purposes. None by default
    """
    [...]

I have some doubts:

  • Should only specific fields be repaired, or could the repair be applied to the whole MRZ code?

    • Oops, sorry, I didn't think this through. Obviously the repairs must depend on the field type: purely numeric fields such as dates must convert letters into numbers, and fields such as identifier must convert detected numbers into letters.
  • Are the corrections usually the same for everyone, or does each person have their own? I ask because I could add a dictionary to the project (or several, if there are not many). There would always be the possibility of using your own external dictionary.

EDIT:
Oops, sorry! I didn't think this through. Obviously repairs must depend on the field type: purely numeric fields such as dates must convert letters into numbers, and fields such as identifier, document_type or country must convert detected numbers into letters. What I don't know is what kind of solution you use to repair alphanumeric fields.
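A minimal sketch of how such a dictionary could be applied per field type, just to fix ideas. Nothing here exists in the project: the helper name `apply_ocr_transliteration`, the `field_kind` argument and the two-level dict layout are all assumptions, and alphanumeric fields are deliberately left untouched because that question is still open.

```python
from typing import Dict, Optional

# Hypothetical default dictionary, keyed by field kind (not part of mrz).
DEFAULT_OCR_TRANSLITERATION: Dict[str, Dict[str, str]] = {
    "alpha": {"0": "O", "1": "I", "5": "S"},
    "numeric": {"O": "0", "I": "1", "S": "5"},
}


def apply_ocr_transliteration(
    content: str,
    field_kind: str,
    ocr_transliteration: Optional[Dict[str, Dict[str, str]]] = None,
) -> str:
    """Return content with the mapping for field_kind applied, if any."""
    if ocr_transliteration is None:
        # Mirrors the proposed default: no dictionary, no repair
        return content
    # Kinds without an entry (e.g. "alphanumeric") are returned unchanged
    mapping = ocr_transliteration.get(field_kind, {})
    return content.translate(str.maketrans(mapping))
```

A user with different OCR error patterns could pass their own dict with the same shape instead of the default one.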

It's too late here. Please let me think about it a little more calmly. If I can't think of anything better, yours might be a good solution.

By the way, one of the rules for classes that inherit from TD1CodeChecker, TD2CodeChecker, TD1CodeGenerator and all the others is that the class name must start with the document type, for example TD1MyCodeChecker, TD2OCRChecker, or something like that. Only the following strings are allowed: "TD1", "TD2", "TD3", "Passport", "MRVA", "MRVB". Otherwise document_type will be False (another thing that I don't like and must change).

@Arg0s1080 Arg0s1080 added CHECKER MRZ.CHECKER Issues enhancement New feature or request labels Aug 21, 2020
@mjl
Author

mjl commented Aug 21, 2020 via email

@TanjaBayer

This sounds really great. Right now my workaround for this issue is:

  • use TD1CodeChecker to get the fields
  • apply some field-specific functions (a bit more sophisticated than just replacing values, because there is often more than one possible replacement character)
  • use the updated fields dict as kwargs for TD1CodeGenerator to generate the MRZ again (the problem here is that the output fields have different names than the input fields, e.g. given_names vs names, country vs country_code, which is not that nice; would you accept a merge request for that?)
  • run TD1CodeChecker again on the updated MRZ

For sure this also applies to TD2 and TD3 and the others.
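When a character has more than one plausible replacement, the check digit can help pick between them. Below is a small sketch of that idea, not the code I actually use: `CONFUSIONS` and `candidates` are made-up names and the confusion sets are an assumption, while `check_digit` follows the ICAO Doc 9303 rule (weights 7, 3, 1 repeating, A to Z counted as 10 to 35, `<` as 0, sum modulo 10).

```python
from itertools import product
from typing import List

# Hypothetical confusion sets: each character maps to everything it may be.
CONFUSIONS = {
    "0": "0O", "O": "O0",
    "1": "1I", "I": "I1",
    "5": "5S", "S": "S5",
}


def check_digit(field: str) -> int:
    """ICAO 9303 check digit of an MRZ field."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch == "<":
            value = 0
        else:
            value = ord(ch) - ord("A") + 10  # A=10 ... Z=35
        total += value * weights[i % 3]
    return total % 10


def candidates(field: str, expected_digit: int) -> List[str]:
    """All variants of field whose check digit equals expected_digit."""
    options = [CONFUSIONS.get(ch, ch) for ch in field]
    variants = ("".join(combo) for combo in product(*options))
    return [v for v in variants if check_digit(v) == expected_digit]
```

This only works for fields that carry their own check digit (document number, dates, optional data), and it can still return several candidates or none, so names and country codes need a different strategy.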

Still, I am wondering whether there are any plans to work on this?

@Arg0s1080
Owner

I made a commitment to add this feature a long time ago and have not kept my word. I'm not normally like that, but my current circumstances have robbed me of most of my time.

When @mjl created this issue I thought about giving "a twist to his idea", but the truth is that I have neither the time nor the experience in CV to do it.

YES, OF COURSE, YOUR PR WILL BE WELCOME and you will have my eternal gratitude 🥇. Ideally, it would work for all documents. If you propose a PR we can look at it (if possible, and if @mjl is not too angry, he could also get involved or at least give his opinion).

Thank you very much in advance

@mjl
Author

mjl commented Feb 15, 2021 via email
