Correlation-of-Names-and-Numbers

This project extracts names, PAN IDs and Aadhaar IDs from scanned documents such as PDFs and images. It uses OCR (Optical Character Recognition) to extract text from images, a pre-trained Hugging Face NER pipeline to extract names, and the regex module to match the number patterns of PAN and Aadhaar cards. The output is a list of dictionaries, each pairing a name with the IDs associated with it.

Project File Structure:

    code_files
    | names_ner.py
    | ocr.py
    | output_phrasing.py
    | re_pan_aadhaar_match.py
    | result.py
    Dataset
    | (Images for testing)
    README.md

ocr.py

It contains two functions:

1. text_image:

Reads an image and extracts its text using Pytesseract OCR and OpenCV (cv2).

import cv2
import pytesseract
from pytesseract import Output

def text_image(path: str) -> str:

    # Read the image with cv2 and convert it to grayscale
    # for more uniform, reliable text extraction.
    # cv2.imread returns BGR channel order, so COLOR_BGR2GRAY is used.
    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {path}")
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Convert the image to a string with Pytesseract
    ocr_text = pytesseract.image_to_string(image, output_type=Output.STRING)

    # Clean up the extracted text
    result = text_process(ocr_text)
    return result

2. text_process:

Cleans the extracted text by collapsing extra spaces and newlines and converting it to upper case.

import re

def text_process(text: str) -> str:

    # Strip leading and trailing whitespace and convert to upper case
    ocr_text = text.strip().upper()

    # Collapse every run of whitespace (newlines included) into one space
    result = re.sub(r"\s+", " ", ocr_text)

    return result
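
As a quick illustration of the cleanup step described above, the sketch below applies the same idea to a made-up string. The function name `normalize_ocr_text` and the sample text are assumptions for illustration only and are not part of the project files.

```python
import re

def normalize_ocr_text(text: str) -> str:
    """Collapse whitespace runs into single spaces and upper-case the text."""
    # \s+ matches spaces, tabs and newlines, so one substitution covers them all
    return re.sub(r"\s+", " ", text.strip()).upper()

# Made-up OCR output with stray newlines, tabs and padding
sample = "  Name: Rahul Sharma\n\nPAN:\tABCPS1234D  \n"
print(normalize_ocr_text(sample))  # NAME: RAHUL SHARMA PAN: ABCPS1234D
```

Flattening the text to a single upper-case line keeps the later regex and NER steps from having to deal with line breaks inside a name or an ID.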

re_pan_aadhaar_match.py

It contains two functions:

1. extract_PAN:

Extracts PAN card numbers from the text returned by the text_image function by matching the PAN pattern with the regex module.

import re

def extract_PAN(text: str) -> list:

    # PAN pattern: three letters, one holder-type letter, one letter,
    # four digits and a final letter
    pattern = r"[A-Z]{3}[ABCFGHLJPT][A-Z][0-9]{4}[A-Z]"

    # Convert the text to upper case
    text = text.upper()

    # Find the PAN IDs using the pattern above.
    # re.findall returns an empty list when nothing matches.
    return re.findall(pattern, text)
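
The pattern above will also match when the ten characters sit inside a longer alphanumeric token. A possible hardening, shown here only as a sketch, is to wrap the pattern in `\b` word boundaries. The name `find_pan_ids` and the sample strings are hypothetical and not part of the project.

```python
import re

# Same PAN shape as above, wrapped in word boundaries (an assumed refinement)
PAN_PATTERN = re.compile(r"\b[A-Z]{3}[ABCFGHLJPT][A-Z][0-9]{4}[A-Z]\b")

def find_pan_ids(text: str) -> list:
    """Return every standalone PAN-shaped token found in the text."""
    return PAN_PATTERN.findall(text.upper())

# Made-up examples: a standalone ID, and one buried inside a longer token
print(find_pan_ids("name: rahul sharma pan: abcps1234d"))  # ['ABCPS1234D']
print(find_pan_ids("ref xxabcps1234dyy"))                  # []
```

Whether the stricter match is desirable depends on how cleanly the OCR step separates tokens, so this is a trade-off rather than a clear improvement.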

2. extract_Aadhaar:

Extracts Aadhaar card numbers from the text returned by the text_image function by matching the Aadhaar pattern with the regex module.

import re

def extract_Aadhaar(text: str) -> list:

    # Aadhaar pattern: three groups of four digits,
    # separated either by hyphens or by spaces
    pattern = r"\d{4}-\d{4}-\d{4}|\d{4} \d{4} \d{4}"

    # Find the Aadhaar numbers using the pattern above.
    # re.findall returns an empty list when nothing matches.
    return re.findall(pattern, text)
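
Because the pattern accepts both hyphen-separated and space-separated numbers, the same ID can come back in two different spellings. The sketch below, which is an illustration and not part of the project, strips the separators so every match becomes a plain 12-digit string. The name `find_aadhaar_ids` and the sample numbers are made up.

```python
import re

# Same two alternatives as above: hyphen-separated or space-separated groups
AADHAAR_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}|\d{4} \d{4} \d{4}")

def find_aadhaar_ids(text: str) -> list:
    """Return Aadhaar-shaped matches with hyphens and spaces removed."""
    return [re.sub(r"[- ]", "", match) for match in AADHAAR_PATTERN.findall(text)]

# Made-up numbers, one in each separator style
sample = "AADHAAR: 1234 5678 9012 ALT: 1111-2222-3333"
print(find_aadhaar_ids(sample))  # ['123456789012', '111122223333']
```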

names_ner.py

It contains a single function, extract_names, which takes a string as input and returns the list of names extracted by a pre-trained Hugging Face NER model. The model used is "dslim/bert-base-NER-uncased", a BERT-base model fine-tuned for named-entity recognition.

Only person names, which the model labels as 'PER', are returned, and only when the matched text is longer than five characters.

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

def extract_names(text: str) -> list:

    # Use a pre-trained Hugging Face model to extract the names
    # through the NER pipeline

    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER-uncased")

    # Load the model
    model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER-uncased")

    # Define the NER pipeline with grouped entities.
    # aggregation_strategy="simple" replaces the deprecated grouped_entities=True.
    nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

    # Run the model on the text
    ner_results = nlp(text)

    names_list = []

    # Keep only person names, labelled 'PER' by the model,
    # and skip very short matches
    for obs in ner_results:
        if obs['entity_group'] == 'PER' and len(obs['word']) > 5:
            names_list.append(obs['word'])

    return names_list
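
To show the filtering step in isolation, without downloading the model, the sketch below runs the same 'PER' and length check over a hand-written list shaped like the grouped pipeline output. The function name `filter_person_names`, the `min_length` parameter, and every entry in `mock_results` (including the scores) are illustrative assumptions.

```python
def filter_person_names(ner_results: list, min_length: int = 5) -> list:
    """Keep the text of 'PER' entities longer than min_length characters."""
    return [
        entity["word"]
        for entity in ner_results
        if entity["entity_group"] == "PER" and len(entity["word"]) > min_length
    ]

# Hand-written stand-in for the pipeline output, not real model results
mock_results = [
    {"entity_group": "PER", "word": "rahul sharma", "score": 0.99},
    {"entity_group": "LOC", "word": "new delhi", "score": 0.98},
    {"entity_group": "PER", "word": "raj", "score": 0.90},
]
print(filter_person_names(mock_results))  # ['rahul sharma']
```

The length check drops short fragments, but it would also drop genuinely short names, so the threshold is worth revisiting for real data.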

output_phrasing.py

This file contains the function final_output, which pairs each name with its IDs and returns the result as a list of dictionaries for result.py to print.

def final_output(names: list, pan_id: list, aadhar_id: list) -> list:
    sol = []

    # If both the Aadhaar IDs and the PAN IDs are empty,
    # return a list reporting that no ID was found
    if len(pan_id) == 0 and len(aadhar_id) == 0:
        sol.append("No ID found")

    # If the Aadhaar IDs are empty, return the list
    # with names and PAN IDs
    elif len(aadhar_id) == 0:
        for name, pan in zip(names, pan_id):
            zipped = {}
            zipped["Name"] = name
            zipped["PAN"] = pan
            zipped["Aadhar"] = ""

            sol.append(zipped)

    # If the PAN IDs are empty, return the list
    # with names and Aadhaar IDs
    elif len(pan_id) == 0:
        for name, adh in zip(names, aadhar_id):
            zipped = {}
            zipped["Name"] = name
            zipped["PAN"] = ""
            zipped["Aadhar"] = adh

            sol.append(zipped)

    # If neither the Aadhaar IDs nor the PAN IDs are empty,
    # return the list with names, PAN IDs and Aadhaar IDs
    else:
        for name, pan, adh in zip(names, pan_id, aadhar_id):
            zipped = {}
            zipped["Name"] = name
            zipped["PAN"] = pan
            zipped["Aadhar"] = adh

            sol.append(zipped)

    return sol
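
Because `zip` stops at the shortest list, a name or ID without a counterpart is silently dropped. One possible alternative, sketched here under the assumption that names and IDs should still be paired purely by position, is `itertools.zip_longest`, which fills the gaps with empty strings. The function name `pair_names_with_ids` and the sample values are hypothetical.

```python
from itertools import zip_longest

def pair_names_with_ids(names: list, pan_ids: list, aadhaar_ids: list) -> list:
    """Pair names and IDs by position, padding missing values with ''."""
    if not pan_ids and not aadhaar_ids:
        return ["No ID found"]
    return [
        {"Name": name, "PAN": pan, "Aadhar": aadhaar}
        for name, pan, aadhaar in zip_longest(names, pan_ids, aadhaar_ids, fillvalue="")
    ]

# Made-up input: two names, one PAN ID, two Aadhaar IDs
rows = pair_names_with_ids(
    ["RAHUL SHARMA", "PRIYA VERMA"],
    ["ABCPS1234D"],
    ["1234 5678 9012", "1111 2222 3333"],
)
for row in rows:
    print(row)
```

This keeps the same dictionary keys as final_output, so it could be swapped in without changing how the result is printed. Positional pairing remains a heuristic: it only holds when the OCR text lists names and IDs in the same order.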

result.py

This file combines the functions from all the files above and prints the output as a list of dictionaries.

from ocr import text_image
from names_ner import extract_names
from re_pan_aadhaar_match import extract_PAN, extract_Aadhaar
from output_phrasing import final_output

def extract(path: str):

    # Call the text_image function in the ocr file for text extraction
    text = text_image(path)

    # Call the extract_names, extract_PAN and extract_Aadhaar functions
    # in the names_ner and re_pan_aadhaar_match files respectively for
    # data extraction
    names = extract_names(text)
    pan_id = extract_PAN(text)
    aadhaar_id = extract_Aadhaar(text)

    # Call the final_output function in the output_phrasing file to
    # format the data extracted above
    sol = final_output(names, pan_id, aadhaar_id)

    # Print the output
    print(sol)

if __name__ == '__main__':
    path = input("Enter the path of the image to extract data from: ")
    extract(path)

Tech Stack

Server: Python, Visual Studio Code, Pytesseract, Hugging Face Transformers, Regex

🚀 About Me

I'm an ambitious and hardworking pre-final year student, with skills in Machine Learning, NLP, Data Science, Analytics, and Development.

🛠 Skills

Python, C++, Deep Learning, Image Processing, Flask, Heroku, NLP (Natural Language Processing), Machine Learning, Hugging Face, OCR (Optical Character Recognition), Regex

🔗 Links

Authors

@Rachit R Jindal

Feedback

If you have any feedback, please reach out at rachitjindal56@gmail.com
