Add callback to optionally "repair" fields #24

mjl · 2020-08-20T21:12:46Z

I'm not really sure this functionality belongs here, but as the knowledge of the MRZ internal structure is only present in this module, why not... let me know what you think!

I work with scanned MRZs and, as is normal for that process, the OCR sometimes misreads similar-looking characters. For example, I have seen a country read as "R0U" or a name read as "SZ0BO5ZLAI". The MRZ checker correctly warns that the nationality or the identifier is not valid. However, if you could add a repair() method to the checkers:

def __init__(self, mrz_code: str, check_expiry=False, compute_warnings=False, precheck=True):
    precheck and check.precheck("TD1", mrz_code, 92)
    lines = mrz_code.splitlines()
    self._document_type = self.repair('document type', lines[0][0:2])
    self._country = self.repair('country', lines[0][2:5])
    [...]

def repair(self, field_name: str, field_content: str):
    return field_content

that would allow me to do things like:

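class MyChecker(TD1CodeChecker):
    def repair(self, name, content):
        if name in ('country', 'identifier', ...):
            # I know those can only contain alphas
            return self.replace_often_mistaken_numbers_by_alphas(content)

        if name in ('expiry date', 'birth date'):
            # Dates can only contain digits
            return self.replace_often_mistaken_alphas_by_numbers(content)

        # Leave every other field untouched instead of returning None
        return content

    def replace_often_mistaken_numbers_by_alphas(self, s):
        return s.replace('5', 'S').replace('1', 'I').replace('0', 'O')

    def replace_often_mistaken_alphas_by_numbers(self, s):
        # Assumed inverse of the mapping above
        return s.replace('S', '5').replace('I', '1').replace('O', '0')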

This would make the checker more useful when presented with badly scanned data.
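To make the idea concrete without depending on the module's internals, here is a minimal, self-contained sketch of the hook pattern. `SketchTD1Checker` and `OcrTolerantChecker` are hypothetical stand-ins, not classes from the mrz package, and only three TD1 fields are sliced out for brevity.

```python
class SketchTD1Checker:
    """Toy stand-in for a TD1 checker; NOT the mrz package's class."""

    def __init__(self, mrz_code: str):
        lines = mrz_code.splitlines()
        # Every field passes through repair() before it is stored or validated.
        self.document_type = self.repair("document type", lines[0][0:2])
        self.country = self.repair("country", lines[0][2:5])
        self.birth_date = self.repair("birth date", lines[1][0:6])

    def repair(self, field_name: str, field_content: str) -> str:
        # Default hook: change nothing.
        return field_content


class OcrTolerantChecker(SketchTD1Checker):
    # Assumed confusion pairs; real OCR output may need a different table.
    DIGIT_TO_ALPHA = str.maketrans("510", "SIO")
    ALPHA_TO_DIGIT = str.maketrans("SIO", "510")

    def repair(self, field_name: str, field_content: str) -> str:
        if field_name in ("document type", "country"):
            return field_content.translate(self.DIGIT_TO_ALPHA)
        if field_name == "birth date":
            return field_content.translate(self.ALPHA_TO_DIGIT)
        return field_content


# A badly scanned TD1: "R0U" instead of "ROU", "74O812" instead of "740812".
sample = "\n".join([
    "I<R0UD231458907" + "<" * 15,
    "74O8122F1204159R0U" + "<" * 11 + "6",
    "ERIKSSON<<ANNA<MARIA" + "<" * 10,
])
checker = OcrTolerantChecker(sample)
print(checker.country, checker.birth_date)
```

The base class never needs to know which substitutions are sensible; it only guarantees that each field is offered to the hook by name.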

The alternative would be that I somehow preprocess the MRZ, but then I would have to re-implement the MRZ structure definition in my code too. As said above, I'm not a big fan of shoehorning that functionality into this module, but I don't see any other place that has enough knowledge of the MRZ structure.


Arg0s1080 · 2020-08-21T00:12:40Z

Hi!

Yeah... that functionality should be out of the scope of the project, but heck! Why not? In fact, almost everything in mrz.checker is already off target xDD

Because almost all of the project (especially checker) has been built from other people's requests and some ideas of mine (some very bad), I now realize that I should have planned many things differently. Actually, I'm trying to fix some of those bad ideas a bit now, specifically the horrible _Report class.

Please give me a few days to finish what I'm doing with checker and we'll see what we can do.

I don't know what you will think, but one option could be to add a way to transliterate the desired chars with a dict, in the same way mrz.generator does with surnames and given names.

Something like this:

def __init__(self, mrz_code: str, check_expiry=False, compute_warnings=False, ocr_transliteration=None):
    """
    Params:
        mrz_code             (str):  MRZ string of TD1's. Must be 90 uppercase characters long
        check_expiry        (bool):  If it's set to True, it is verified and reported as warning that the
                                     document is not expired and that expiry_date is not greater than 10 years
        compute_warnings    (bool):  If it's set to True, warnings compute as False
        ocr_transliteration (dict):  Transliteration dictionary for OCR purposes. None by default
    """
    [...]

I have some doubts:

~~Should some specific fields be repaired or could it be applied to all mrz code?~~
- Oops, sorry, I didn't think about it too much. Obviously the repairs must depend on the type of field: purely numeric fields such as dates must convert letters into numbers, and fields such as identifier must convert detected numbers into letters.

Are the corrections usually the same for everyone, or does each person have their own specific corrections? I ask because I could add a dictionary to the project (or several, if there are not many). There would always be the possibility of using your own external dictionary.

EDIT:
Oops! SORRY! I didn't think about it too much. Obviously repairs must be done depending on the field type. Purely numeric fields such as dates must convert letters into numbers, and fields such as identifier, document_type or country must convert detected numbers into letters. What I don't know is what kind of solution you use to repair alphanumeric fields.
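A rough sketch of how a field-type-aware `ocr_transliteration` dict could behave. The `FIELD_KINDS` table, the default mappings and the `transliterate_field` helper are all assumptions made up for illustration; nothing here exists in the project today.

```python
# Hypothetical classification of fields by what they may contain.
FIELD_KINDS = {
    "document_type": "alpha",
    "country": "alpha",
    "nationality": "alpha",
    "birth_date": "digit",
    "expiry_date": "digit",
}

# Hypothetical default dictionary, one static mapping per field kind.
DEFAULT_OCR_TRANSLITERATION = {
    "alpha": {"0": "O", "1": "I", "5": "S"},
    "digit": {"O": "0", "I": "1", "S": "5"},
}


def transliterate_field(name, value, ocr_transliteration=None):
    """Apply the mapping for the field's kind; leave unknown fields alone."""
    if ocr_transliteration is None:
        return value
    kind = FIELD_KINDS.get(name)
    mapping = ocr_transliteration.get(kind, {})
    return value.translate(str.maketrans(mapping))


print(transliterate_field("country", "R0U", DEFAULT_OCR_TRANSLITERATION))
print(transliterate_field("birth_date", "74O812", DEFAULT_OCR_TRANSLITERATION))
```

Alphanumeric fields (such as optional data) are simply absent from `FIELD_KINDS` in this sketch, so they pass through unchanged; how to repair those remains the open question.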

It's too late here. Please let me think about it a little more calmly. If I can't think of anything better, yours might be a good solution.

By the way, one of the rules for using classes that inherit from TD1CodeChecker, TD2CodeChecker, TD1CodeGenerator and all the others is that the class name must start with the document type. For example: TD1MyCodeChecker, TD2OCRChecker, or something like that. Only the following strings are allowed: "TD1", "TD2", "TD3", "Passport", "MRVA", "MRVB". Otherwise document_type will be False (another thing that I don't like and must change).
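For readers unfamiliar with that naming rule, this is roughly what a prefix-based lookup looks like. It is only an illustration of the behaviour described above, not the project's actual code; `document_type_from_class` is a hypothetical helper.

```python
ALLOWED_PREFIXES = ("TD1", "TD2", "TD3", "Passport", "MRVA", "MRVB")


def document_type_from_class(cls):
    """Return the allowed prefix the class name starts with, else False."""
    for prefix in ALLOWED_PREFIXES:
        if cls.__name__.startswith(prefix):
            return prefix
    return False


class TD1MyCodeChecker:  # name starts with an allowed prefix
    pass


class MyChecker:  # name does not, so the lookup fails
    pass


print(document_type_from_class(TD1MyCodeChecker))
print(document_type_from_class(MyChecker))
```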

mjl · 2020-08-21T11:52:41Z

- Should some specific fields be repaired or could it be applied to all mrz code?

I guess it makes sense to apply it to all the fields that have constraints on them as to what data they can contain. It probably is not useful to have a callback for "this field may contain anything", but if one knows it is characters only, or digits only, or a date...

- Usually corrections are always the same for everyone or each person have their specific corrections? I ask this to add a dictionary to the project (or several if there are not many)

It would probably be the same for everybody, if the MRZ source is the same (i.e. if I scan 1000 ID cards, they will probably all have the same classes of errors). Your ocr_transliteration dict could be something along these lines:

```
{
    'alpha': callback_for_replacing_numbers,
    'digit': callback_for_replacing_chars,
}
```

Perhaps having specialisations for 'date' might make sense, falling back to 'digit' if not present? I'm partial to having callbacks instead of just a static mapping dictionary (1->L, 5->S), but I can live with the static mapping too. The transliteration should run before the hash checks and the other sanity checks.
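A small sketch of the callback variant with the 'date' to 'digit' fallback, and of why the repair has to happen before the check digits are verified. `check_digit` follows the ICAO 9303 weighting (7, 3, 1); `repair`, `to_digits` and `to_alphas` are hypothetical names used only for this example.

```python
def check_digit(data: str) -> str:
    """ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10."""
    weights = (7, 3, 1)
    total = 0
    for position, char in enumerate(data):
        if char in "0123456789":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        else:
            value = 0  # the '<' filler counts as zero
        total += value * weights[position % 3]
    return str(total % 10)


def to_digits(value: str) -> str:
    return value.translate(str.maketrans("OIS", "015"))


def to_alphas(value: str) -> str:
    return value.translate(str.maketrans("015", "OIS"))


def repair(kind: str, value: str, callbacks: dict) -> str:
    """Look up the callback for a field kind; 'date' falls back to 'digit'."""
    fallbacks = {"date": "digit"}
    callback = callbacks.get(kind) or callbacks.get(fallbacks.get(kind))
    return callback(value) if callback else value


callbacks = {"alpha": to_alphas, "digit": to_digits}

raw_birth_date = "74O812"  # OCR read the zero as the letter O
repaired = repair("date", raw_birth_date, callbacks)

print(check_digit(raw_birth_date))  # 6, which does not match the printed check digit 2
print(check_digit(repaired))        # 2
```

Running the check on the raw field fails only because of the misread character, which is the argument for doing the transliteration first.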

TanjaBayer · 2021-01-15T12:48:53Z

This sounds really great. Right now, my workaround for this issue is:

1. Use TD1CodeChecker to get the fields.
2. Apply some specific functions (a bit more sophisticated than just replacing values, because there is often more than one possible replacement character).
3. Use the updated fields dict as kwargs for TD1CodeGenerator to generate the MRZ again. (The problem here is that the output fields have different names than the input fields, e.g. given_names vs names, country vs country_code, which is not that nice. Would you accept a merge request for that?)
4. Use TD1CodeChecker again, now on the updated MRZ.
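One way to handle "more than one replacement character" in step 2 is to enumerate every candidate and keep the one that agrees with the field's check digit. The sketch below assumes a numeric field and a made-up confusion table (`DIGIT_CANDIDATES`); it does not use the mrz package, and the helper names are hypothetical.

```python
from itertools import product


def check_digit(data: str) -> str:
    """ICAO 9303 check digit: weights 7, 3, 1 repeating, sum modulo 10."""
    weights = (7, 3, 1)
    total = 0
    for position, char in enumerate(data):
        if char in "0123456789":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        else:
            value = 0  # the '<' filler counts as zero
        total += value * weights[position % 3]
    return str(total % 10)


# Assumed confusion table: each misread letter may stand for several digits.
DIGIT_CANDIDATES = {"O": "0", "I": "1", "S": "5", "Z": "27", "B": "83"}


def candidate_repairs(value: str):
    """Yield every string obtainable by replacing ambiguous characters."""
    options = [DIGIT_CANDIDATES.get(char, char) for char in value]
    for combination in product(*options):
        yield "".join(combination)


def repair_with_check_digit(value: str, expected: str):
    """Return the single candidate matching the check digit, else None."""
    matches = [c for c in candidate_repairs(value) if check_digit(c) == expected]
    return matches[0] if len(matches) == 1 else None


# "1Z0415" could be "120415" or "170415"; only the first has check digit 9.
print(repair_with_check_digit("1Z0415", "9"))
```

Returning None when zero or several candidates match keeps the repair conservative: an ambiguous field is reported rather than silently guessed.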

For sure this also applies to TD2 and TD3 and the others.

Still, I am wondering whether there are any plans to work on this.

Arg0s1080 · 2021-02-05T21:52:55Z

I made a commitment to add this feature a long time ago and have not kept my word. I'm not normally like that, but my current circumstances have robbed me of most of my time.

When @mjl created this issue I thought about giving "a twist to his idea", but the truth is that I have neither the time nor the experience in CV to do it.

YES OF COURSE, YOUR PR WILL BE WELCOME and you will have my eternal gratitude 🥇 . Ideally, it could work for all documents. If you propose a PR we could look at it (if possible and @mjl is not very angry, he could also get involved or at least give his opinion)

Thank you very much in advance

mjl · 2021-02-15T13:27:08Z

@arg0s Don't worry, we all fall off the train sometimes when life happens. The feature is on my back burner too at this moment in time, but if anyone has ideas/comments/code, feel free to discuss here!

Arg0s1080 added the CHECKER (MRZ.CHECKER issues) and enhancement (new feature or request) labels on Aug 21, 2020.
