new feature: apply OCR to uploaded pdfs #64

pricua · 2023-12-07T11:53:44Z

Hi,

We have developed a feature that applies OCR (image-to-text conversion) to PDFs uploaded to Dedalo.
When uploading a PDF, the user is shown an option to apply OCR and a drop-down menu to select the OCR language.

To do this, we use the ocrmypdf tool, which must be installed on the same server as Dedalo, since it is invoked from Dedalo via the shell_exec function.
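As a rough illustration of the invocation described above, the sketch below builds an ocrmypdf command line. The `-l` language option and the input/output positional arguments belong to the ocrmypdf command line; the helper names `build_ocr_args` and `shell_quote` and the file paths are assumptions made for this example. Dedalo makes the real call from PHP, so this only shows the shape of the command and why the language value should be validated before it reaches a shell.

```javascript
// Hypothetical helper: builds the argument list for an ocrmypdf call.
// `-l <lang>` selects the OCR language; the two positional arguments
// are the input PDF and the output PDF.
function build_ocr_args(input_path, output_path, ocr_lang) {
	// Accept only Tesseract-style codes such as 'eng', 'chi_sim' or 'eng+spa'
	// (several languages can be joined with '+').
	if (!/^[a-z_]{3,}(\+[a-z_]{3,})*$/i.test(ocr_lang)) {
		throw new Error('Invalid OCR language: ' + ocr_lang)
	}
	return ['-l', ocr_lang, input_path, output_path]
}

// Quote one argument for a POSIX shell, for cases where the command
// has to be passed as a single string.
function shell_quote(arg) {
	return "'" + String(arg).replace(/'/g, "'\\''") + "'"
}

// Example paths are placeholders, not the paths Dedalo uses.
const args = build_ocr_args('/tmp/upload.pdf', '/tmp/upload_ocr.pdf', 'spa')
const command = ['ocrmypdf', ...args.map(shell_quote)].join(' ')
console.log(command)
```

Rejecting unexpected language values up front matters here because the value comes from a drop-down in the browser and ends up in a shell command.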

We would like you to review these changes so that they can be incorporated into the official Dedalo code and used in any Dedalo installation, since we think this feature could be useful in many areas.

Above all, we are interested in a review of the tool_upload file class.tool_upload.php. We think that both the way of obtaining the path of the uploaded file and the conversion between Dedalo languages and OCR languages, which we do manually, can be greatly improved, along with anything else that you consider can be improved or optimized.
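One possible shape for the language conversion mentioned above is a small lookup with a rule-based default. This is only a sketch: it assumes Dedalo language codes look like `lg-spa` and that most of them map to the three-letter OCR code obtained by dropping the `lg-` prefix. The function name `dedalo_lang_to_ocr` and the single exception entry are hypothetical and are not taken from the pull request.

```javascript
// Hypothetical exceptions: Dedalo codes whose OCR code is not simply
// the part after 'lg-'. The entry below is illustrative only.
const ocr_lang_exceptions = new Map([
	['lg-vlca', 'cat']
])

// Convert a Dedalo language code to an OCR language code.
// Falls back to `fallback` when the code is not recognized.
function dedalo_lang_to_ocr(dedalo_lang, fallback = 'eng') {
	if (ocr_lang_exceptions.has(dedalo_lang)) {
		return ocr_lang_exceptions.get(dedalo_lang)
	}
	// Default rule (assumption): 'lg-xxx' maps to 'xxx'
	const match = /^lg-([a-z]{3})$/.exec(dedalo_lang)
	return match ? match[1] : fallback
}

console.log(dedalo_lang_to_ocr('lg-spa'))  // spa
console.log(dedalo_lang_to_ocr('lg-vlca')) // cat
console.log(dedalo_lang_to_ocr('unknown')) // eng
```

Keeping the exceptions in one table means a new language only needs an entry when the default rule does not apply.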

We are happy to discuss whatever you need.

We hope this is of interest to the community and can be incorporated into the official Dedalo distribution.

Best

AranzazuBM · 2023-12-07T12:05:25Z

Fantastic tool! I’ll try it. Many thanks.

On Dec 7, 2023, at 6:53 AM, pricua wrote:

Commit Summary
3a88328 new feature: apply OCR to uploaded pdfs

File Changes (5 files, https://github.com/renderpci/dedalo/pull/64/files)
M core/services/service_upload/js/render_edit_service_upload.js (72)
M core/services/service_upload/js/service_upload.js (951)
M tools/tool_upload/class.tool_upload.php (35)
M tools/tool_upload/js/render_tool_upload.js (776)
M tools/tool_upload/js/tool_upload.js (186)

Patch Links:
https://github.com/renderpci/dedalo/pull/64.patch
https://github.com/renderpci/dedalo/pull/64.diff

renderpci · 2023-12-12T11:26:55Z

Hi @pricua

Thanks a lot for this new feature. It will be useful to all projects, and I think it can be integrated into the main code.

We are reviewing your code and have some comments.

Labels have to be translatable. Dédalo is used in different countries and languages, so every label has to be translatable, like this:

From:

const combobox_label = ui.create_dom_element({
 element_type : 'label',
 class_name   : 'label',
 inner_html   : '<label>Lenguaje</label>',
 parent       : form
})

to:

const combobox_label = ui.create_dom_element({
 element_type : 'label',
 class_name   : 'label',
 inner_html   : get_label.language || 'Language',
 parent       : form
})

The main fallback for labels is English. If you need to add labels that are not in the ontology, please tell me and I will open it.

Take into account that if the label is inside the tool, you will need to call the tool method:

self.get_tool_label('language') || 'Language'

And please, don't add more HTML tags than necessary:

<label>Lenguaje</label>

is not necessary. ui.create_dom_element() will create the label node, so with this tag the result will be:

<label><label>Lenguaje</label></label>

Try to keep it simple.
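The two points above (translated text with an English fallback, and no extra wrapper tag) can be seen in a small stand-alone sketch. The function `create_dom_element_sketch` is a stand-in that returns markup as a string so it can run without a browser; it is not Dédalo's `ui.create_dom_element()`. The `labels` object and its contents are likewise hypothetical.

```javascript
// Stand-in for ui.create_dom_element(): returns the markup as a string.
// Not Dédalo's implementation, just enough to show the resulting nesting.
function create_dom_element_sketch({ element_type, class_name, inner_html }) {
	return `<${element_type} class="${class_name}">${inner_html}</${element_type}>`
}

// Hypothetical translated labels; 'ocr' is deliberately missing
const labels = { language : 'Idioma' }

// Translated label: plain text only, the element supplies the tag
const translated = create_dom_element_sketch({
	element_type : 'label',
	class_name   : 'label',
	inner_html   : labels.language || 'Language'
})

// Missing label: the English fallback is used
const fallback = create_dom_element_sketch({
	element_type : 'label',
	class_name   : 'label',
	inner_html   : labels.ocr || 'OCR'
})

// Hard-coded tag inside inner_html: produces a label inside a label
const nested = create_dom_element_sketch({
	element_type : 'label',
	class_name   : 'label',
	inner_html   : '<label>Lenguaje</label>'
})

console.log(translated) // <label class="label">Idioma</label>
console.log(fallback)   // <label class="label">OCR</label>
console.log(nested)     // <label class="label"><label>Lenguaje</label></label>
```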

Finally, we need time to test it. Thanks again; I will be back with more.

renderpci · 2024-06-04T20:52:02Z

Hi @pricua

Well, the full integration is done!

I just want to point out a few things about the final integration:

Never use var to define global variables in Dédalo. You can create an object in the instance and change or read it at any time, which is easier to maintain and to move between instances.
Don't include specific processes in general classes. The OCR process applies only to PDFs, so use the component_pdf class to keep this process specific. If you include the exec() in tool_upload.php, every uploaded file will be checked for the property, and when we later look for the process it will not be obvious that a specific process was defined in a general class. Using the specific component is clearer and easier to find, and the scope is clear: everything about PDFs lives in component_pdf.
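The first point can be sketched as follows. The factory `create_upload_instance`, the `ocr_options` property, and the `set_ocr` method are hypothetical names chosen for this example, not Dédalo's actual API; the only idea taken from the comment above is that state lives in an object on the instance rather than in a global `var`.

```javascript
// Avoid: a global shared by every upload tool on the page
// var ocr_lang = 'eng'

// Hypothetical factory: each instance carries its own OCR options
function create_upload_instance() {
	const self = {
		ocr_options : { enabled : false, lang : null }
	}
	// Update the options on this instance only
	self.set_ocr = function(enabled, lang) {
		self.ocr_options = { enabled : enabled, lang : enabled ? lang : null }
		return self.ocr_options
	}
	return self
}

const first_tool  = create_upload_instance()
const second_tool = create_upload_instance()

first_tool.set_ocr(true, 'spa')

console.log(first_tool.ocr_options)
console.log(second_tool.ocr_options)
```

Because each instance owns its options, changing one upload tool does not affect another one open on the same page, which a shared global cannot guarantee.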

Please review the current code and compare it with your commit.

The code was integrated into the pricua-v6_developer branch.

Feel free to comment or suggest something else. We will merge into the master branch at the end of this week (Friday 7 June 2024).

And thanks for improving Dédalo's features... :)

Best

new feature: apply OCR to uploaded pdfs

3a88328

renderpci merged commit ec17842 into renderpci:v6_developer Jun 7, 2024
