Arabic text reversed with connected letters not reshaped correctly #69

Open
AnasAG opened this issue Jun 16, 2021 · 2 comments

AnasAG commented Jun 16, 2021

I have a script that extracts Arabic text from PDFs, using the pdfminer library for parsing. In the extracted text, the sentences come out reversed, but the letters in each word are still in their connected forms.

Original text in PDF: "وضح المقصود بكل من المصطلحات التالية"
Extracted text from PDF: "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"

When using arabic_reshaper I noticed a situation where the Arabic text is not formatted correctly.

Sample Code:

import arabic_reshaper
from bidi.algorithm import get_display

text = "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"

reshaped_text = arabic_reshaper.reshape(text)    # correct its shape
print(reshaped_text)
# result: ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ

bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: ﻮﻀﺣ ﻼﻤﻘﺻﻭﺩ ﺐﻜﻟ ﻢﻧ ﻼﻤﺼﻄﻠﺣﺎﺗ ﻼﺗﻼﻳﺓ

However, with Arabic text that is reversed in the same way but whose letters are isolated (not connected), arabic_reshaper worked properly.

Original text in PDF: "على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة"
Extracted text from PDF: "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"

Sample code:

import arabic_reshaper
from bidi.algorithm import get_display

text = "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"

reshaped_text = arabic_reshaper.reshape(text)    # correct its shape
print(reshaped_text)
# result:  ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ

bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة

I couldn't find out why it behaves this way. I also tried the ArabicReshaper class with a custom configuration, changing options such as use_unshaped_instead_of_isolated and support_ligatures, but the behavior was the same.
The PDF font affects the extracted text, which may also be why the text is sometimes extracted with connected letters and sometimes with isolated ones. In general, though, I'm not sure whether this is a bug, something related to ligatures, or something else.
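
One possible explanation, stated here as an assumption and not confirmed by the maintainers, is that the extracted string already consists of Arabic Presentation Forms code points (U+FB50 to U+FDFF and U+FE70 to U+FEFC) stored in visual order, while the reshaper expects base letters in logical order. The sketch below uses only the standard library to check for such code points and to map a single visual-order word back to base letters with NFKC normalization. The helper names `has_presentation_forms` and `visual_word_to_logical` are hypothetical, and the approach assumes a purely Arabic word with no Latin text or digits mixed in.

```python
import unicodedata


def has_presentation_forms(text):
    """Return True if text contains Arabic Presentation Forms-A/B code points."""
    return any(
        "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC"
        for ch in text
    )


def visual_word_to_logical(word):
    """Convert one visual-order, pre-shaped Arabic word to logical base letters.

    The characters are reversed first and normalized second, so that a
    lam-alef ligature expands to lam followed by alef in reading order.
    Assumption: the word contains only Arabic letters.
    """
    return unicodedata.normalize("NFKC", word[::-1])


# Meem (final), kaf (medial), lam (initial), alef (isolated) in visual order.
visual = "\uFEE2\uFEDC\uFEDF\uFE8D"
print(has_presentation_forms(visual))
print([hex(ord(ch)) for ch in visual_word_to_logical(visual)])
```

Whether the result should then be passed through arabic_reshaper and get_display depends on how the text will be displayed. Note also that NFKC rewrites other compatibility characters, so it may be safer to apply it only to strings where `has_presentation_forms` returns True.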

@AnasAG AnasAG changed the title Arabic revered but connected letters not reshaped correctly Arabic text reversed with connected letters not reshaped correctly Jun 16, 2021
@naourass

I'm running into the same issue. All of my target text is in joined form. Is it possible to isolate the letters when they're joined?

@abdelmalek13

I have the same problem when extracting data from a PDF.
