Can't upload a file with line break inside text #330

Closed
ayrtondenner opened this issue Aug 12, 2019 · 3 comments

Labels
question Further information is requested

Comments

@ayrtondenner

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Windows 10
  • Python version:
    3.6.4

Describe the problem

I'm trying to upload a file in which each text can span several lines, i.e. contain line breaks. Even when I use the JSON file format, so that every text is a property rather than a line, Doccano still seems to split the text at each line break on upload.

Source code / logs

For instance, this is a JSON file I'm trying to upload with a single text inside of it:

[{"text": "Processo 0000637-15.2012.8.12.0003 (003.12.000637-8) - Procedimento Comum - Inadimplemento Reqte: Fabiano Neves Gon\u00e7alves ADV: PAULO DE TARSO AZEVEDO PEGOLO (OAB 10789/MS) ADV: HENRIQUE LIMA (OAB 9979/MS) ADV: GUILHERME FERREIRA DE BRITO (OAB 9982/MS) ADV: RODRIGO LOUREIRO (OAB 13583/MS) ADV: FRANCIELLI SANCHEZ SALAZAR (OAB 15140/MS) ADV: JAC\u00d3 CARLOS SILVA COELHO (OAB 15155A/MS) ADV: IVONE CONCEI\u00c7\u00c3O SILVA (OAB 13609/MS) 1.\nCom o tr\u00e2nsito em julgado da senten\u00e7a de fl. 393 e satisfa\u00e7\u00e3o integral do cr\u00e9dito, o of\u00edcio jurisdicional acha-se cumprido e acabado, raz\u00e3o por que indefiro o pedido de digitaliza\u00e7\u00e3o do feito (fl. 416).\nAdemais, tramitam nessa unidade judici\u00e1ria milhares de processos e se for admitida a digitaliza\u00e7\u00e3o de todos os feitos finalizados, haver\u00e1 atraso injustificado nas atividades do cart\u00f3rio, pois \u00e9 necess\u00e1rio grande lapso temporal do servidor para este fim.\n2.\nDever\u00e1 o cart\u00f3rio promover a retifica\u00e7\u00e3o do advogado da Mafre Vida S/A no sistema SAJ, para futuras publica\u00e7\u00f5es e intima\u00e7\u00f5es, conforme declinado \u00e0 fl. 416.\nIntimem-se.\nAp\u00f3s, arquive-se."}]

The idea was to see the text with its line breaks during annotation, but instead Doccano turned every sentence into a text of its own. For comparison, this is how the same text was uploaded:

image

As the image shows, the text was broken at every line break, and each substring was treated as a separate document.

icoxfog417 added the question (Further information is requested) label Aug 13, 2019
@icoxfog417
Contributor

We don't support text that includes line breaks. Please refer to the discussion at #34.

@ayrtondenner
Author

I managed to solve it with a .jsonl file. The data I showed previously was saved as follows:

{"text": "Processo 0000637-15.2012.8.12.0003 (003.12.000637-8) - Procedimento Comum - Inadimplemento Reqte: Fabiano Neves Gonçalves ADV: PAULO DE TARSO AZEVEDO PEGOLO (OAB 10789/MS) ADV: HENRIQUE LIMA (OAB 9979/MS) ADV: GUILHERME FERREIRA DE BRITO (OAB 9982/MS) ADV: RODRIGO LOUREIRO (OAB 13583/MS) ADV: FRANCIELLI SANCHEZ SALAZAR (OAB 15140/MS) ADV: JACÓ CARLOS SILVA COELHO (OAB 15155A/MS) ADV: IVONE CONCEIÇÃO SILVA (OAB 13609/MS) 1.\n\nCom o trânsito em julgado da sentença de fl. 393 e satisfação integral do crédito, o ofício jurisdicional acha-se cumprido e acabado, razão por que indefiro o pedido de digitalização do feito (fl. 416).\n\nAdemais, tramitam nessa unidade judiciária milhares de processos e se for admitida a digitalização de todos os feitos finalizados, haverá atraso injustificado nas atividades do cartório, pois é necessário grande lapso temporal do servidor para este fim.\n\n2.\n\nDeverá o cartório promover a retificação do advogado da Mafre Vida S/A no sistema SAJ, para futuras publicações e intimações, conforme declinado à fl. 416.\n\nIntimem-se.\n\nApós, arquive-se.", "labels": []}

I saved every document on a single line, with a "\n" for every line break. The line breaks don't appear in the "Dataset" section:

image

When annotating the examples, the line breaks are rendered successfully:

image

The same approach works with the csv format, but unfortunately csv requires at least one label and does not accept an empty array. Because I don't want to send any label value, I had to use the jsonl format, as it seems to be the only one that allows an empty label array. The txt (plain text) format expects one example per line, so it cannot support line breaks at all.
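For anyone who wants to generate such a file, here is a minimal Python sketch of the approach described above. It assumes the record layout from my example (a "text" key and an empty "labels" array); the function names `to_jsonl` and `write_jsonl` and the file name are made up for illustration and are not part of Doccano.

```python
import json


def to_jsonl(texts):
    """Return a JSONL string: one JSON object per physical line."""
    rows = []
    for text in texts:
        # Record layout copied from the example above (assumption: the
        # importer accepts an empty "labels" array for jsonl uploads).
        record = {"text": text, "labels": []}
        # json.dumps escapes real line breaks as the two characters \n,
        # so each document stays on a single line of the file.
        rows.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(rows) + "\n"


def write_jsonl(path, texts):
    """Write the documents to `path` (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(to_jsonl(texts))


# Usage: write_jsonl("documents.jsonl", ["1.\nIntimem-se.\nApós, arquive-se."])
```

Letting the JSON serializer do the escaping avoids hand-editing "\n" into the file, and `ensure_ascii=False` keeps accented characters readable, as in the file I uploaded.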

@GeorgeXiaojie

GeorgeXiaojie commented Sep 11, 2019

I copied your example and saved it as a.txt, but it does not work; the line breaks are still not rendered.

My project type is sequence labeling.
