Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rdfile reader #942

Open
wants to merge 48 commits into
base: main
Choose a base branch
from
Open

Rdfile reader #942

wants to merge 48 commits into from

Conversation

uli-f
Copy link
Member

@uli-f uli-f commented Nov 30, 2022

No description provided.

@egonw
Copy link
Member

egonw commented Nov 30, 2022

Interesting. This was topic of my project at EMBL-EBI/CamUni in 2003, where I looked at MaCIE database content validation with the CDK/JChemPaint.

@uli-f
Copy link
Member Author

uli-f commented Nov 30, 2022

Here is my first draft of an RDfile reader.

As a piece on its own, it works fine and I tested it quite extensively.

The processing happens in two stages: (1) reading the lines and doing some validation and (2) handing over the molfile/rxnfile part of the record to the parser already present in CDK. The validation in (1) might be a bit extensive given that the parser is doing some of this as well, but it allows the RdfileReader to be in control, provide good error messages, and meaningful and early skipping of individual broken records.

The tricky part here seems to be the integration into the CDK IO object hierarchy. I also have a few other questions/suggestions and would very much appreciate any input.

  • As an RDfile holds one or more records implementing this is an Iterator seemed like an obvious choise to me.
  • A record in an RDfile can hold either a molfile which ends up as an IAtomContainer in CDK or a RXNfile which ends up as an IReaction. This makes the type of Iterator less obvious. I settled on an Iterator<RdfileRecord> at the moment.
  • While the record is read the data class RdfileRecord is storing intermediate data in a bunch of private fields and is immutable outside the package. It also serves as a return object. It has a RdfileRecord::getAtomContainer and a RdfileRecord::getReaction that returns either null or an IAtomContainer/IReaction depending on the actual nature of the IChemObject.
  • An alternative to the current Iterator<RdfileRecord> implementation might be Iterator<IChemObject> which covers both IAtomContainer and IReaction. This then puts the task of figuring out what the record actually contains for each call to RdfileReader::next on the user of the RdfileReader.
  • The fields internalRegistryNumber and externalRegistryNumber of RdfileRecord might be best returned in the form of a property of the IAtomContainer/IReaction object of a record. Would it make sense to define a particular property key/name for the two in CDKConstants (e.g., RDFILE_INTERNAL_RN = "cdk:RdfileInternalRN" and RDFILE_EXTERNAL_RN = "cdk:RdfileExternalRN")?
  • If the fields internalRegistryNumber and externalRegistryNumber are indirectly returned as properties of the IChemObject, the class RdfileRecord doesn't need to be exposed at all anymore, but might just live as an inner class of RdfileReader.

@johnmay
Copy link
Member

johnmay commented Nov 30, 2022

Hi Uli,

I'll find time to look closer but some easy fixes:

  • the build if failing due a missing file it looks likey just forgot to check something in?
  • Missing copyright header/license on the java files

@uli-f
Copy link
Member Author

uli-f commented Nov 30, 2022

Thanks, sorry about the failed build.

Was out with Covid last week, still recovering. Should have just stopped and continued tomorrow instead of pushing this through.... Will have a look tomorrow.

@uli-f
Copy link
Member Author

uli-f commented Dec 1, 2022

That was me trying to re-base on main which ended up being a real mess commit-wise. I am not a complete git novice but sometimes it still surprises me 🤷🏼

I don't know what the best way to proceed is, so I'd very much appreciate any git recovery strategies.

Well, at least I found the reason the tests crashed: classical typo, one of the test input filenames started with an upper case letter, but the file started with a lower case letter, windows doesn't mind, but everyone else does!

@uli-f
Copy link
Member Author

uli-f commented Dec 1, 2022

TODO

  • add copyright/license headers
  • add javadocs
  • cite documentation for RDfile and elaborate a bit on the spec

Copy link
Member

@egonw egonw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left suggestions.

BTW, I do wonder what happened with my RD reader...

private final BufferedReader bufferedReader;
private final IChemObjectBuilder chemObjectBuilder;

public RdfileReader(InputStream in, IChemObjectBuilder chemObjectBuilder) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add JavaDoc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do (see also above).

}

public RdfileReader(InputStream in, IChemObjectBuilder chemObjectBuilder, boolean skipRecordsOnError) {
this(new InputStreamReader(in, StandardCharsets.UTF_8), chemObjectBuilder, skipRecordsOnError);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could check here first is in is not already a InputStreamReader.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure if I understand correctly.
I am wrapping the InputStream in in an InputStreamReader.
InputStreamReader is not an InputStream (i.e., there is no relationship in their object inheritance), so in cannot be an InputStreamReader.
Please let me know if I am missing something here.

try {
return doReadNext();
} catch (IOException exception) {
LOGGER.error("I/O error when reading RDfile: " + exception.getMessage());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you can do here, is have a STRICT mode. When set true, you can throw an exception. This gives the user to ability to take action.

Second, you can use the IO listener functionality to report the error without failing, so that the user can record where in the file there are errors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) I do like the idea of the STRICT mode and might look into that.

(2) Can you elaborate on the IO listener functionality you are referring to? Is this an interface/base class in CDK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IO listener stuff is how the options are handled. It's useful for UIs (e.g. Bioclipse) since you can list the options of a defaults automatically. In makes programmer/API usage very clunky though:

For example, this is how you are "meant" to set the option "ForceReadAs3Dcoords" in V2000 reader. Note it's a String not a bool!:

MDLV2000Reader reader = new MDLV2000Reader(ins);
Properties prop = new Properties();
prop.setProperty("ForceReadAs3DCoordinates", "true");
propertiesListener listener = new PropertiesListener(prop);
reader.addChemObjectIOListener(listener);
reader.customizeJob();

Personally I would have a gone a different way, but it is what we have. In some places I have added the directory access methods and utilities to make life a little smoother:

SDFWriter sdfWriter = new SDFWriter(writer);
sdfWriter.getSetting(SDFWriter.OptAlwaysV3000).setSetting("true");
// or in this case I deem the option so important it has it's own method :-)
sdfWriter.setAlwaysV3000(true);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I miss-read Egon's input on the IO listener... personally I would prefer a return status on the record. So then the you force the user to decide what to do with them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the very good explanation re the IO listener stuff. I saw that in the IO classes before but haven't paid too much attention.

However, if I understand correctly Egon was actually referring to something else? Are we talking about the IChemObject Interface and its handleError methods...? 🤷🏼

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it’s the error handler stuff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the hierarchy of the IO objects in CDK correctly that could then lead to the following class declaration

public class RdfileReader extends DefaultIteratingChemObjectReader<IChemObject>

if we want to stick with IChemObject as a return object for each record.

  • I would pick up the error handling as implemented by DefaultIteratingChemObjectReader.
  • The functionality of the boolean skipRecordsOnError would be substituted by IChemObjectReader.Mode.RELAXED and IChemObjectReader.Mode.STRICT
  • RdfileRecord would just be an inner class of RdfileReader
  • CharIter might end up us an inner class of RdfileReader, too; I only use a very limited set of methods there and would remove the unused methods
  • and the external/internal registry number would be returned as a property of the molecule/reaction

Main drawback that I see: given the discussion with Daniel recently we cannot make this implementation thread-safe due to the setReader(Reader) and setReader(InputStream) methods that are part of the interface IChemObjectReader.

Does that make sense? Are we all happy with this?

Removes the Xerces and Xalan dependencies
@uli-f
Copy link
Member Author

uli-f commented Dec 5, 2022

This is interesting because I was considering support for RDfile myself.

Good timing then! I thoroughly enjoy the process of sharing, so it makes me happy to hear that this might be put to good use.

btw CDK doesn't have a RXN V3000 writer at the moment, that would be something I'd like to add, too.

One thing I didn't see was a controlled vocabulary for DTYPE. The spec (p 79) says only that the format is:

$DTYPE field name

But the example looks more structured:

$DTYPE run:VARIATION(1):rxnTEXT(1)

The idea appears to be that a reaction is a complex object with left/right sides, 0 or many structures on each side, and stuff above/below the arrow. So that colon-separated thing is a kind of coordinate into the reaction.

I did find CambridgeSoft's ancient document RDFiles in ChemFinder that sheds light on these "crudpaths," but its far from clear whether this is a vendor-specific notation or not. It's in the BIOVIA spec as an example but not explained AFAICT.

Unfortunately, the Biovia 'specification' doesn't read like a specification at all and leaves open a lot of questions.

Are there any other sources about these crudpaths?

I have seen these crudpaths in RDfiles, but have never seen a specification or description. I'd certainly be interested in that myself!

@johnmay
Copy link
Member

johnmay commented Dec 5, 2022

Could you rebase on master @uli-f, I think have configure SonarCloud so we can run it now.

…h test cases; extended tests to also test the resultant IChemObject; simplified RdfRecord
in the constructor of RdfileReader the provided Reader is a BufferedReader prior to wrapping it up in a BufferedReader
# Conflicts:
#	storage/ctab/src/main/java/org/openscience/cdk/io/RdfileReader.java
@uli-f
Copy link
Member Author

uli-f commented Dec 5, 2022

Could you rebase on master @uli-f, I think have configure SonarCloud so we can run it now.

Done.

@johnmay
Copy link
Member

johnmay commented Dec 5, 2022

Oh grrr... not finished yet but can see it still didn't pickup the SonarCloud token. This is infuriating, we could just expose the token publicly but I thought how I had set this up was a good option. i.e. me/egon need to click run on the PR

@uli-f
Copy link
Member Author

uli-f commented Dec 6, 2022

Oh grrr... not finished yet but can see it still didn't pickup the SonarCloud token. This is infuriating, we could just expose the token publicly but I thought how I had set this up was a good option. i.e. me/egon need to click run on the PR

I guess you are aware of github secrets...? It sounds to me like you should be able to work a token into the CI github actions script by using github secrets. Let me know if I can be of any help. I set up the CI pipeline based on GitHub actions here at Pending AI.

@johnmay
Copy link
Member

johnmay commented Dec 6, 2022

Yes it uses secrets, however since the branch is in your fork it will not see the secret. So o thought the work around was you define a secret in an environment then it can be seen if the workflow gets approved to run - but that didn’t work

@uli-f
Copy link
Member Author

uli-f commented Dec 6, 2022

Yes it uses secrets, however since the branch is in your fork it will not see the secret.

I did something similar for test results that are uploaded as a comment for PRs opened by dependabot.
Basically, it is a two step process. You may find a bit of explanation here:
https://github.com/EnricoMi/publish-unit-test-result-action#support-fork-repositories-and-dependabot-branches

I am not sure how easy it would be to adopt whatever is described there to your needs though.

@uli-f
Copy link
Member Author

uli-f commented Mar 10, 2024

@johnmay @egonw Looks like a PR merging spree is on! I'd be very happy to contribute in any way so that we can pick this one up and merge it in as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants