New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rdfile reader #942
base: main
Are you sure you want to change the base?
Rdfile reader #942
Conversation
…h test cases; extended tests to also test the resultant IChemObject; simplified RdfRecord
…MoleculeHashGenerator
Interesting. This was topic of my project at EMBL-EBI/CamUni in 2003, where I looked at MaCIE database content validation with the CDK/JChemPaint. |
Here is my first draft of an RDfile reader. As a piece on its own, it works fine and I tested it quite extensively. The processing happens in two stages: (1) reading the lines and doing some validation and (2) handing over the molfile/rxnfile part of the record to the parser already present in CDK. The validation in (1) might be a bit extensive given that the parser is doing some of this as well, but it allows the RdfileReader to be in control, provide good error messages, and meaningful and early skipping of individual broken records. The tricky part here seems to be the integration into the CDK IO object hierarchy. I also have a few other questions/suggestions and would very much appreciate any input.
|
Hi Uli, I'll find time to look closer but some easy fixes:
|
Thanks, sorry about the failed build. Was out with Covid last week, still recovering. Should have just stopped and continued tomorrow instead of pushing this through.... Will have a look tomorrow. |
added test cases to BasicMoleculeHashGeneratorTest; added javadoc to …
…h test cases; extended tests to also test the resultant IChemObject; simplified RdfRecord
That was me trying to re-base on main which ended up being a real mess commit-wise. I am not a complete git novice but sometimes it still surprises me 🤷🏼 I don't know what the best way to proceed is, so I'd very much appreciate any git recovery strategies. Well, at least I found the reason the tests crashed: classical typo, one of the test input filenames started with an upper case letter, but the file started with a lower case letter, windows doesn't mind, but everyone else does! |
TODO
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left suggestions.
BTW, I do wonder what happened with my RD reader...
private final BufferedReader bufferedReader; | ||
private final IChemObjectBuilder chemObjectBuilder; | ||
|
||
public RdfileReader(InputStream in, IChemObjectBuilder chemObjectBuilder) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add JavaDoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do (see also above).
} | ||
|
||
public RdfileReader(InputStream in, IChemObjectBuilder chemObjectBuilder, boolean skipRecordsOnError) { | ||
this(new InputStreamReader(in, StandardCharsets.UTF_8), chemObjectBuilder, skipRecordsOnError); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could check here first is in
is not already a InputStreamReader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite sure if I understand correctly.
I am wrapping the InputStream
in
in an InputStreamReader
.
InputStreamReader
is not an InputStream
(i.e., there is no relationship in their object inheritance), so in cannot be an InputStreamReader
.
Please let me know if I am missing something here.
storage/ctab/src/main/java/org/openscience/cdk/io/RdfileReader.java
Outdated
Show resolved
Hide resolved
try { | ||
return doReadNext(); | ||
} catch (IOException exception) { | ||
LOGGER.error("I/O error when reading RDfile: " + exception.getMessage()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you can do here, is have a STRICT
mode. When set true, you can throw an exception. This gives the user to ability to take action.
Second, you can use the IO listener functionality to report the error without failing, so that the user can record where in the file there are errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) I do like the idea of the STRICT mode and might look into that.
(2) Can you elaborate on the IO listener functionality you are referring to? Is this an interface/base class in CDK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IO listener stuff is how the options are handled. It's useful for UIs (e.g. Bioclipse) since you can list the options of a defaults automatically. In makes programmer/API usage very clunky though:
For example, this is how you are "meant" to set the option "ForceReadAs3Dcoords" in V2000 reader. Note it's a String not a bool!:
MDLV2000Reader reader = new MDLV2000Reader(ins);
Properties prop = new Properties();
prop.setProperty("ForceReadAs3DCoordinates", "true");
propertiesListener listener = new PropertiesListener(prop);
reader.addChemObjectIOListener(listener);
reader.customizeJob();
Personally I would have a gone a different way, but it is what we have. In some places I have added the directory access methods and utilities to make life a little smoother:
SDFWriter sdfWriter = new SDFWriter(writer);
sdfWriter.getSetting(SDFWriter.OptAlwaysV3000).setSetting("true");
// or in this case I deem the option so important it has it's own method :-)
sdfWriter.setAlwaysV3000(true);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I miss-read Egon's input on the IO listener... personally I would prefer a return status on the record. So then the you force the user to decide what to do with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the very good explanation re the IO listener stuff. I saw that in the IO classes before but haven't paid too much attention.
However, if I understand correctly Egon was actually referring to something else? Are we talking about the IChemObject Interface and its handleError methods...? 🤷🏼
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, it’s the error handler stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand the hierarchy of the IO objects in CDK correctly that could then lead to the following class declaration
public class RdfileReader extends DefaultIteratingChemObjectReader<IChemObject>
if we want to stick with IChemObject
as a return object for each record.
- I would pick up the error handling as implemented by
DefaultIteratingChemObjectReader
. - The functionality of the boolean
skipRecordsOnError
would be substituted byIChemObjectReader.Mode.RELAXED
andIChemObjectReader.Mode.STRICT
RdfileRecord
would just be an inner class ofRdfileReader
CharIter
might end up us an inner class ofRdfileReader
, too; I only use a very limited set of methods there and would remove the unused methods- and the external/internal registry number would be returned as a property of the molecule/reaction
Main drawback that I see: given the discussion with Daniel recently we cannot make this implementation thread-safe due to the setReader(Reader)
and setReader(InputStream)
methods that are part of the interface IChemObjectReader
.
Does that make sense? Are we all happy with this?
Removes the Xerces and Xalan dependencies
Good timing then! I thoroughly enjoy the process of sharing, so it makes me happy to hear that this might be put to good use. btw CDK doesn't have a RXN V3000 writer at the moment, that would be something I'd like to add, too.
Unfortunately, the Biovia 'specification' doesn't read like a specification at all and leaves open a lot of questions.
I have seen these crudpaths in RDfiles, but have never seen a specification or description. I'd certainly be interested in that myself! |
Could you rebase on master @uli-f, I think have configure SonarCloud so we can run it now. |
…h test cases; extended tests to also test the resultant IChemObject; simplified RdfRecord
in the constructor of RdfileReader the provided Reader is a BufferedReader prior to wrapping it up in a BufferedReader
# Conflicts: # storage/ctab/src/main/java/org/openscience/cdk/io/RdfileReader.java
Done. |
Oh grrr... not finished yet but can see it still didn't pickup the SonarCloud token. This is infuriating, we could just expose the token publicly but I thought how I had set this up was a good option. i.e. me/egon need to click run on the PR |
I guess you are aware of github secrets...? It sounds to me like you should be able to work a token into the CI github actions script by using github secrets. Let me know if I can be of any help. I set up the CI pipeline based on GitHub actions here at Pending AI. |
Yes it uses secrets, however since the branch is in your fork it will not see the secret. So o thought the work around was you define a secret in an environment then it can be seen if the workflow gets approved to run - but that didn’t work |
I did something similar for test results that are uploaded as a comment for PRs opened by dependabot. I am not sure how easy it would be to adopt whatever is described there to your needs though. |
No description provided.