Unable to import corpus in relANNIS format #55
Comments
Thank you for your very detailed report. Looking at the exception stack trace, my guess is that it could be an encoding problem. The following line looks a bit like that:
Caused by: org.xml.sax.SAXParseExceptionpublicId: file:/home/ldd/src/corpus-tools/annis/annis-kickstarter-3.3.6/GUM_relANNIS/corpus.tab; systemId: file:/home/ldd/src/corpus-tools/annis/annis-kickstarter-3.3.6/GUM_relANNIS/corpus.tab; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
Nevertheless, you got us: the RelANNISImporter has been marked as deprecated and probably does not work with the relANNIS version of the GUM corpus. We are sorry about that. It will be removed entirely in the next versions. I would propose using the PAULAImporter, but you found that solution on your own ;-).
saltnpepper@lists.hu-berlin.de
Thank you and best regards
Let me add that the Atomic 1.0 release candidate will include only those Pepper modules that are also available in standalone Pepper, which means that deprecated modules won't be included. It will also include Pepper 3.x and Salt 3.x (currently Atomic uses 1.8) and provide bugfixes for showstoppers like #54.
On Wed, 2015-12-09 at 11:28 -0800, Florian Zipser wrote:
If you mean encoding like character encoding, I'm not seeing it:
Those files that contain UTF-8 do not seem to start with anything
The sequence 30 09 is the character 0 followed by a tab. If the files
Good to know.
Indeed!
I'll pass on CoNLL because I don't actually plan to use this format, so
If my plans change, I'll sure send you the data.
Now, the format I am really interested in, beyond all those I've
Thanks,
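For anyone who wants to repeat the byte-level check discussed above, here is a minimal Python sketch. It is not part of Atomic or Pepper; the function names `detect_bom` and `leading_bytes_report` and the sample file content are hypothetical, chosen only for illustration. It reads the first bytes of a file, reports whether they match a known Unicode byte order mark, and renders them in hex, so that a start like 30 09 (the character 0 followed by a tab) is easy to recognise.

```python
import codecs
import os
import tempfile

# Known byte order marks, longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(head):
    """Return the name of the byte order mark that `head` starts with, or None."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


def leading_bytes_report(path, count=4):
    """Return (hex string of the first `count` bytes, BOM name or None)."""
    with open(path, "rb") as handle:
        head = handle.read(count)
    return head.hex(" "), detect_bom(head)


if __name__ == "__main__":
    # Hypothetical sample content: a "0", a tab, then some text.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".tab") as tmp:
        tmp.write(b"0\tsample\n")
    print(leading_bytes_report(tmp.name, count=2))
    os.unlink(tmp.name)
```

A file that begins with plain text and no byte order mark yields `None` for the second value, which would suggest that a stray BOM is not what the parser is complaining about.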
Versions
Atomic 0.2.1 (this is the latest stable release at time of writing)
This is running on Debian testing.
Steps to reproduce
1. Unzip the GUM corpus in relANNIS format somewhere.
2. Start Atomic with ./atomic
3. Select File > Import Corpus.
4. Select Corpus Import, which appears under Atomic. Click Next.
5. Select RelANNISImporter among the choices of Pepper modules.
6. Select relANNIS version 3.1 among the formats, which is the only choice.
7. Set "Target path is a" to Directory, and select the directory for the GUM corpus you've unzipped somewhere in the first step. (That's the directory that contains the .tab files.) Click Next.
8. Click Next again without setting a property.
9. Enter test1. Click Finish.
Expected results
Ideally, I'd expect the corpus to be imported.
If somehow I'm doing something I should not, then I'd expect Atomic to guide me towards using it properly, and I'd expect some informative error message to be provided by the GUI.
Actual results
There is a "Progress Information" dialog that comes up but never reaches completion. The corpus is not imported. The GUI does not help towards a resolution.
The console shows:
Observations
I have no experience with ANNIS or Atomic, or with the relANNIS format. I did read the documentation here. Importing the GUM corpus into ANNIS worked correctly on the first try. I pretty much used the method I described above, but with the modifications appropriate for ANNIS: I unzipped GUM somewhere, then I pointed ANNIS to the directory that contains all the .tab files, clicked the button to import and it worked! Then I tried the equivalent in Atomic and got nowhere fast. I've tried a bunch of variations, just in case:
Using corpus.tab instead of the directory. (As I said, I'm not familiar with the format, so I don't know if this makes sense at all. I've worked with other corpus formats that split a corpus among many files but have a file that serves as the "main" file of the corpus. A file named corpus.tab seemed a good candidate for this, so I tried.)
Nothing above worked and the error messages left on the console were generally not helpful.
Surely there are things I tried that would appear obviously wrong from the point of view of someone who already knows how to load a relANNIS corpus into Atomic. I'd expect though that anything obviously wrong would be prevented by the GUI. For instance, if it is obviously wrong to try to load a relANNIS corpus as a single file, the option to load as a single file should just not be present on the screen. Or if only .tab files can be loaded, then I should not be able to select a .zip file.
Finally, I tried loading GUM in the PAULA format, and this worked, but it only adds to my puzzlement because I unzipped the zip file and pointed Atomic to the top directory that was created from unzipping the zip, which is essentially what I did with relANNIS. So I don't know why this worked.
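To make the request for up-front checks concrete, here is a rough Python sketch of the kind of validation a GUI could run before starting an import. It is purely illustrative and not how Atomic or Pepper work: the function name `check_import_source` is hypothetical, and the rule it encodes (a relANNIS source is a directory that directly contains at least one .tab file) is an assumption based only on what worked in ANNIS as described above.

```python
import tempfile
from pathlib import Path


def check_import_source(path):
    """Return a list of human-readable problems; an empty list means the path looks usable.

    Assumed rule (not taken from Atomic or Pepper): the source must be a
    directory that directly contains at least one .tab file.
    """
    source = Path(path)
    if not source.exists():
        return [f"{source} does not exist."]
    if source.is_file():
        return [
            f"{source.name} is a file; select the directory "
            "that contains the .tab files instead."
        ]
    has_tab = any(p.is_file() and p.suffix == ".tab" for p in source.iterdir())
    if not has_tab:
        return [f"{source} contains no .tab files."]
    return []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # An empty directory is reported as having no .tab files.
        print(check_import_source(folder))
        # After adding a (hypothetical, empty) corpus.tab, the directory passes.
        (Path(folder) / "corpus.tab").write_text("")
        print(check_import_source(folder))
        # Pointing at the single file is rejected with a hint.
        print(check_import_source(Path(folder) / "corpus.tab"))
```

With a check like this, selecting corpus.tab or a .zip file would produce a message in the dialog instead of a stalled progress bar and a stack trace on the console.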