new feature: apply OCR to uploaded pdfs #64

Merged
merged 1 commit into from
Jun 7, 2024

Conversation

pricua
Contributor

@pricua pricua commented Dec 7, 2023

Hi,

We have developed a feature that applies OCR (image-to-text conversion) to PDFs uploaded to Dedalo.
When uploading a PDF, the user is shown an option to apply OCR and a drop-down menu to select the OCR language.

To do this, we use the ocrmypdf tool, which must be installed on the same server as Dedalo, since it is invoked from Dedalo via the shell_exec function.
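As a rough illustration of the call described above, the sketch below builds an ocrmypdf command line. The helper names `shell_quote` and `build_ocr_command` and the file paths are hypothetical; only the `-l` language option and the input/output positional arguments come from ocrmypdf's command-line interface. Dedalo itself invokes the tool from PHP via shell_exec; JavaScript is used here only because it is the language of the snippets in this thread.

```javascript
// Hypothetical helper: quote a value for a POSIX shell using single quotes.
function shell_quote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'"
}

// Hypothetical helper: build the ocrmypdf command for one PDF.
// ocr_lang is a Tesseract language code such as 'eng' or 'spa'.
function build_ocr_command(input_path, output_path, ocr_lang) {
  return [
    'ocrmypdf',
    '-l', shell_quote(ocr_lang),
    shell_quote(input_path),
    shell_quote(output_path)
  ].join(' ')
}

console.log(build_ocr_command('/tmp/in.pdf', '/tmp/out.pdf', 'spa'))
```

Quoting every argument matters because the file path and the language both originate from the upload form and end up in a shell command.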

We would like you to review these changes so that they can be incorporated into the official Dedalo code and used in any Dedalo installation, since we think this feature could be useful in many areas.

Above all, we are interested in a review of the tool_upload file class.tool_upload.php. We think two things in particular can be greatly improved: the way we obtain the path of the file to be uploaded, and the conversion between the Dedalo languages and the OCR languages, which we do manually. Anything else that you think can be improved or optimized is welcome too.
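One way to make the manual language conversion easier to review is a single lookup table with a fallback. The sketch below is only an illustration: the table contents, the `lg-xxx` code format, and the name `to_ocr_lang` are assumptions, not taken from the commit, and the actual conversion lives in PHP.

```javascript
// Assumed mapping from Dedalo-style language codes to Tesseract language codes.
// The entries are illustrative; the real list would come from the installation.
const OCR_LANG_BY_DEDALO_LANG = new Map([
  ['lg-eng', 'eng'],
  ['lg-spa', 'spa'],
  ['lg-cat', 'cat'],
  ['lg-fra', 'fra']
])

// Hypothetical helper: resolve the OCR language, falling back to English
// when the Dedalo language has no entry in the table.
function to_ocr_lang(dedalo_lang, fallback = 'eng') {
  return OCR_LANG_BY_DEDALO_LANG.get(dedalo_lang) ?? fallback
}

console.log(to_ocr_lang('lg-spa'))
console.log(to_ocr_lang('lg-unknown'))
```

Keeping the table in one place means that adding a language is a one-line change, and unknown codes degrade to the fallback instead of failing.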

We are happy to discuss anything you need about it.

We hope this is of interest to the community and can be incorporated into the official Dedalo distribution.

Best

@AranzazuBM

AranzazuBM commented Dec 7, 2023 via email

@renderpci
Owner

renderpci commented Dec 12, 2023

Hi @pricua

Thanks a lot for this new feature. It will be useful to all projects, and I think it is possible to integrate it into the main code.

We are reviewing your code and have some comments.

  • The labels have to be translatable. Dédalo is used in different countries and languages, so every label has to be translatable in this way:

From:

const combobox_label = ui.create_dom_element({
 element_type : 'label',
 class_name   : 'label',
 inner_html   : '<label>Lenguaje</label>',
 parent       : form
})

to:

const combobox_label = ui.create_dom_element({
 element_type : 'label',
 class_name   : 'label',
 inner_html   : get_label.language || 'Language',
 parent       : form
})

The main fallback for labels is English. If you need to add labels that are not in the ontology, please tell me and I will open it.

Take into account that if the label is inside the tool, you will need to call the tool method instead:

self.get_tool_label('language') || 'Language'

Also, please don't add more HTML tags than necessary:

<label>Lenguaje</label>

is not necessary: ui.create_dom_element() already creates the label node, so adding this tag produces:

<label><label>Lenguaje</label></label>

Try to keep it simple.
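Both points can be seen in a small runnable sketch. It assumes a simplified stand-in for ui.create_dom_element() that returns an HTML string instead of a DOM node, and a hypothetical wrapper `render_language_label`; neither is Dédalo's real implementation.

```javascript
// Simplified stand-in for ui.create_dom_element(): it returns the HTML string
// that the real function would produce as a DOM node (assumption for illustration).
function create_dom_element({ element_type, inner_html }) {
  return `<${element_type}>${inner_html}</${element_type}>`
}

// Hypothetical wrapper: build the language label from a map of translated labels.
function render_language_label(labels) {
  return create_dom_element({
    element_type : 'label',
    inner_html   : labels.language || 'Language' // English fallback
  })
}

// Wrapping the text in its own <label> duplicates the node.
const nested = create_dom_element({
  element_type : 'label',
  inner_html   : '<label>Lenguaje</label>'
})

console.log(nested)                                        // nested labels
console.log(render_language_label({ language: 'Idioma' })) // translated label
console.log(render_language_label({}))                     // fallback label
```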

Finally, we need time to test it. Thanks again; I will be back with more.

@renderpci
Owner

Hi @pricua

Well, full integration has been done!

I just want to point out a few things about the final integration:

  1. Never use var to define global variables in Dédalo. Instead, create an object in the instance, which you can change or read at any time; this is easier to maintain and to move between instances.
  2. Don't include specific processes in general classes. The OCR process applies only to PDFs, so use the component_pdf class to keep it specific. If you put the exec() in tool_upload.php, every uploaded file will be checked for the property, and anyone looking for the process later will not expect to find it in a general class. Using the specific component is clearer and easier to find, and it keeps the scope obvious: everything about PDFs lives in component_pdf.
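The two points above can be sketched together as follows. This is an illustration only: apart from the name component_pdf, everything here (`component_base`, `after_upload`, `ocr_options`) is hypothetical and does not mirror Dédalo's actual classes.

```javascript
// Hypothetical generic component: knows nothing about OCR.
class component_base {
  after_upload(file_name) {
    return { file_name, steps: ['store'] }
  }
}

// Hypothetical PDF component: the OCR step lives only here,
// and its options are kept on the instance instead of in a global var.
class component_pdf extends component_base {
  constructor() {
    super()
    this.ocr_options = { enabled: false, lang: 'eng' }
  }
  after_upload(file_name) {
    const result = super.after_upload(file_name)
    if (this.ocr_options.enabled) {
      result.steps.push(`ocr:${this.ocr_options.lang}`)
    }
    return result
  }
}

const pdf = new component_pdf()
pdf.ocr_options = { enabled: true, lang: 'spa' }
console.log(pdf.after_upload('report.pdf').steps)
console.log(new component_base().after_upload('photo.jpg').steps)
```

Because each instance carries its own options, two PDF components can use different OCR settings without interfering with each other, and non-PDF uploads never touch the OCR branch.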

Please review the current code and compare it with your commit.

The code was integrated into the pricua-v6_developer branch.

Feel free to comment or suggest anything else. We will merge into the master branch at the end of this week (Friday 7 June 2024).

And thanks for improving Dédalo's features... :)

Best

@renderpci renderpci merged commit ec17842 into renderpci:v6_developer Jun 7, 2024