Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement] use a class loader to bring in dependancy classes for a connector #6043

Open
1 of 2 tasks
davidradl opened this issue Dec 17, 2021 · 19 comments
Open
1 of 2 tasks
Assignees
Labels
enhancement New feature or request pinned Keep open (do not time out) triage New bug/issue which needs checking & assigning

Comments

@davidradl
Copy link
Member

davidradl commented Dec 17, 2021

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

We have talked of creating a fat jar for connectors. We are now not favouring this approach and want the connectors to supply minimal jar files and find the dependant jars in the environment

Expected Behavior

Have a custom class loader that loads the classes for a connector, those classes should be only used by that connector and those classes would not effect any other connector. This seems to be a pattern that has been used successfully in the past.
There still might be a need for shading if the connector classes or its dependancies clash with the Egeria classes, but there would not be a need to resolve dependancies between connectors.

Alternatives

leave asis, shading or fat jars

Any Further Information?

No response

Would you be prepared to be assigned this issue to work on?

  • I can work on this
@davidradl davidradl added enhancement New feature or request triage New bug/issue which needs checking & assigning labels Dec 17, 2021
@davidradl
Copy link
Member Author

@planetf1 @mandy-chessell fyi

@planetf1
Copy link
Member

planetf1 commented Dec 17, 2021

A 'fat jar' - (a term I've used!) can be a bit of a misnomer in that not necessarily all dependencies are included.

For example we can have the core egeria platform as 'provided' and other dependencies as compile/run. That way anything in the core platform isn't replicated, and we only get the delta - the additional classes unique to the connector in question.

To gain /ISOLATION/ - I think the key point here - we can shade these additional jars so that they don't clash - at the expense of duplication.

This is an enforced BUILD TIME choice

A custom class loader can make choices at runtime over where to source classes, what to load - so it is a lot more flexible. We would of course still need to decide on the algorithm used. We can emulate shading as we wish as part of the classloader. For example we may have a connector-local lib path that we pick in preference & always use those classes. We'd also still want to package up dependencies, which can be done with the existing maven/gradle tools.

So classloader = more flexible, but many more choices, alogorithm to be decided (what is the search path). medium-sized shaded jar = build time only, more restrictive, though simple

I do think it's an excellent approach to evaluate (and note spring uses it's own custom classloader so when we get to dynamically loadable OMASs (something I think we should ultimately do, vs connectors) there may be a difference in terms of component scan for the spring rest binding?

@cmgrote
Copy link
Member

cmgrote commented Dec 17, 2021

Maybe we first need to be clear on what jars / dependencies we are including under this proposed umbrella?

TL;DR
If we're talking about all of a connector's dependencies, then I'm very much against this approach (if I'm understanding it correctly). My rationale being: keep the "default" case simple, and if people want something beyond that default case for some specific reasons, allow them to do so, but not at the expense of adding significant complexity to the "default" case. (For me the "default" case here is someone just wants to use a connector to connect to something.)

Details...
If this proposed classloader-based approach means that each dependent jar must separately be placed somewhere on the filesystem, I think it will require those who want to use a connector a significant headache: to go out and determine all the dependent libraries (and versions) themselves, locate where to download them, etc. For the XT connector (a relatively simple example?) this would mean downloading:

  • The connector itself
  • Several XTDB pluggable-interface jar files (that previously were only available on clojars.org, not MC)
  • The dependent libraries that provide a pluggable back-end (e.g. S3, Kafka, LMDB, RocksDB, etc)
  • The dependent libraries that are used to interface to a given pluggable back-end (JNI, etc)
  • The dependent libraries that are used for things like serialization (?) (Cheshire, etc)

It would frankly be easier for someone that wants to use a connector to just build their own "fat jar", particularly as any new release of any of these components would mean that they need to go through the entire investigation loop again to figure out what libraries have changed (if any), which versions are tested to be interoperable, etc, etc.

Having the connector designer make these choices up-front (and "lock-them-in" at build time) means that a connector user will typically just need to go and download a single file -- much, much easier.

(This doesn't prevent someone who wants to go to the trouble of managing all these jar files from doing the other approach, but it makes life easy for simple adopters rather than forcing every adopter to manage the nightmare of all of this complexity themselves.)

Then again, if we're talking about this classloader approach for only the Egeria dependencies that a connector has, I don't really understand why these are compiled in to the connector in the first place (not simply marked provided)? The connectors themselves don't really "do" anything without being run through an OMAG Server Platform, which would presumably bring along any of the Egeria dependencies (?), so why build these Egeria dependencies into the jar file of the connector itself?

@davidradl
Copy link
Member Author

@cmgrote Thanks for these considerations. I will ask for more details from the people who suggested this.

@carstenmichel
Copy link

Hello @cmgrote and @davidradl
We ran into this issue when we were trying to import the client library that is provided by Apache Atlas to be used in a custom Connector that we were trying out with @davidradl.
This Client, provided by the Apache team has about 50 or so dependent jar files.
We tried to take these Atlas related jar files and add them to the connector jar file, making it a medium to fat sized jar.

The trouble starts since Apache is using dependencies that are already present in Egeria base, but in different versions.

So my idea was to have a custom classloader for each Connector in Atlas that would allow to load in external dependencies without interfering with the main Egeria product.
Loading of the non-egeria related dependencies can still happen with a fat jar to make life easy.
Does this make sense ❓

@cmgrote
Copy link
Member

cmgrote commented Dec 17, 2021

Definitely we should do something to avoid the potential conflicts -- I'm not an expert in this domain, but I would hope that in an ideal world any jar file (whether Apache Atlas or Egeria) that includes dependencies that it doesn't own itself should have some way of making those dependencies within its jar file specific to that jar file. My impression was that shading is a way to do this (@planetf1 ?)

What seems to be the problem here is that neither Apache Atlas nor Egeria has shaded its non-owned dependencies. As a result, those that overlap but at different versions are therefore causing a conflict. While we could therefore split things up so that some jars are included and others not, this doesn't sound very scalable to me: today we know what that is for Apache Atlas, but the overlap could be greater or less for any other technology / connector, and that list of overlap could change for each independently-managed lifecycle of each technology -- trying to keep a handle on all of that over time sounds like it's going to be messy...

I think if either (or in the ideal world, both) were to shade its dependencies then there would no longer be a conflict?

I'd therefore be more in favour of being a "good citizen" ourselves and shading such dependencies -- but I'm not sure what the broader implications of such an approach are...

(I'm mostly pushing in this direction thinking to the world of containers, where each container is intended to be a self-contained atomic "thing" and if you have some overall solution that needs multiple components that say each run on Python you just have Python installed inside each of those containers -- you don't have some "shared" underlying Python installation that each container makes use of, they're entirely independent. For me these jar files should be managed the same way?)

@davidradl
Copy link
Member Author

Thanks @carstenmichel @cmgrote . I am seeing 2 approaches.

  1. Each connector has a custom classloader, so its dependancies are managed in its own isolated 'sandpit'
  • in theory no dependancy clashes between connectors. For integration connectors this makes sense if we bring in simple 3rd party dependancies like Apache Atlas. I think this approach could be used effectively for 'simple' dependancies.
  • duplicate code between connectors
  • may not work for more complex dependancies.
  1. Each connector plays as a good citizen, as Chris talks of
  • allows us to handle the more complex dependancies
  • more effort & complexity to document and keep track of the supported 'working set' and having shaded jars for clashes.

I wonder for a connector if we can do approach 1) then we should, if there are more complex considerations, then approach 2) should be used. By using approach 1) as much as we can, we reduce the connectors we need to handle using approach 2.

thoughts @cmgrote @carstenmichel @planetf1 ?

@cmgrote
Copy link
Member

cmgrote commented Dec 20, 2021

I don't know enough about shading, but my hopeful assumptions would be that it is a) something decided / designed to be used entirely at build time, and b) uses some kind of "namespace"-like concept to keep the shaded dependencies "unique".

If (a) is correct it has the major advantage of not placing any complexity or additional knowledge / need to download / configure paths / etc on the runtime (operator / user).

If (b) is correct then as long as we use a namespace that matches the connector name itself (or its ConnectorProvider GUID even) then it is pretty well guaranteed to be unique (if not there would be other clashes anyway).

So while the approach places some extra effort on the developer, my sense is that's worth it by removing any extra effort on the operator / user -- again, if my assumptions are correct...

@planetf1
Copy link
Member

My understanding of shading is the same @cmgrote

The maven docs for the shade plugin at https://maven.apache.org/plugins/maven-shade-plugin/ gives a little info, referring to renaming of some, or all (so it can be partial)

https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html specifically refers to renaming, which 'moves' the packages over to a new, configurable, packagename

We can use shading in part or in full, and alongside other techniques like use of 'provided' for the common egeria dependencies (or anything else critical) to reduce the size of shaded jars. We would need to use shading in our own jars also (such as chassis)

The key issue with shading is it is build-time, not runtime. This IS the biggie. We save on runtime and code complexity, but we will have less ultimate flexibility. Simplicity is appealing though

@planetf1
Copy link
Member

planetf1 commented Dec 20, 2021

With a custom class loader we can do something similar to shading, by allowing a dependency to pickup favoured classes from a configurable location - ie further up the search path, or via a new name.

This means:

  • more complexity in the run-time code
  • more management/deployment of dependencies (where are they put, what are they called)
  • search algorithm/classpath search list to be configured (or statically defined)
  • more code to test/support (classloader)
  • probably more confusing to debug (mitigated by good logging)
  • more predictible runtime environment (for the code developer)

My current feeling is that the additional complexity of the dynamic classloader approach is not yet argued enough to improve on the simplicity of the shading approach. The effort to enable shading+provided is pretty low.

Note also we didn't initially do shading as we were considering shading all classes. By use of 'provided' we can reduce this to a much smaller scope, meaning duplications are less a concern

@planetf1
Copy link
Member

Some interesting articles

  • dynamic class loading using reflection can be broken
  • debugging can still be tricky with shading
  • contains a gradle example :-)

@davidradl
Copy link
Member Author

I assume we are talking about the connector classloader mentioned here https://docs.oracle.com/cd/E19830-01/819-4721/beade/index.html

@davidradl
Copy link
Member Author

Hi @carstenmichel ,
I think your idea is interesting especially for integration connectors. I wonder if you go into more detail on how this would work in gradle / code, if we had a lib folder under an Egeria integration connector containing the 3rd party dependancies (or a jar with them all in) so these dependancies would be loaded only for this connector.

@github-actions
Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Mar 27, 2022
@planetf1
Copy link
Member

planetf1 commented Mar 7, 2023

I'm going to reopen this.

In recent months I've hit, and seen others hit, multiple conflicts with connectors. In particular the HMS connector with it's complex and large dependency chain (50k classes) is a major offender.

  • currently a bad connector can prevent the server from starting, or cause it to malfunction
  • using compileOnly where possible removes unnecessary classes/duplicates from the runtime classpath
  • keeping egeria & related libraries (implementation platform) at consistent versions helps - at least duplicates are then the same code
  • Deployment choices, like placing each connector technology in it's own container helps prevent conflicts between two very different environments
  • assembly of a chassis from the base jars (not uber) as part of a pre-deployment step can reduce/address some conflicts early on (re: cloud native discussions)

but none of these are complete solutions.

Spring classloading is more complex than base jvm, but does offer customization
(Interestingly an area where spring->microprofile may make a difference)

We need to have improved documentation - for developers, operators, but also I think we need to look again at classpath isolation.

opening and assigning to myself for now

@planetf1 planetf1 reopened this Mar 7, 2023
@github-actions github-actions bot removed the no-issue-activity Issues automatically marked as stale because they have not had recent activity. label Mar 8, 2023
@planetf1
Copy link
Member

planetf1 commented Mar 8, 2023

The issue above seems common in application servers, or where plugin technologies are used. This is similar to egeria's use of connectors.

Firstly we need to place plugins out of the normal spring/java class loading path - this way egeria/spring code using the normal classloading approach won't find them.

We need a 'parent-last' classloader. This will load the class required by looking (first) in a custom location (configured by property, ie lib or extralib). If it does not find it, it will delegate up to the parent. Note parent-last is at odds with the java spec, where delegation starts from the root ie ultimate parent first, but it is how containers like tomcat work

When we load a connector we use this new classloader. We also configure the thread classloader to use it. Attempts to resolve a class within the connector resolve 'locally' - ie within the connector's jar, and only for classes that can not be found we defer to the parent & hence default behaviour

Of course if we then have many connectors & place them all in the same location, we'll still could get conflicts between the connectors, but we control that search in our code - so we'll look in an uber jar first. then - a connector specific libpath (by name) before finally delegating to parent.

It doesn't appear that spring will stop us doing this - but the approach needs validation and implementation. It could be local within the connector framework

@planetf1
Copy link
Member

This may be an opportunity to also add some additional audit logging at the time, depending on what we can see
ref: odpi/egeria-connector-xtdb#472 (comment)

@planetf1 planetf1 self-assigned this May 15, 2023
@planetf1
Copy link
Member

I likely will not be able to make this change now, but please do get in touch if you wish to discuss any ideas. I do think it would be valuable as more varied connectors are developed.
Unassigning for now.

@mandy-chessell mandy-chessell self-assigned this Jun 13, 2023
@mandy-chessell
Copy link
Contributor

I agree on its value and have added it to my list ...

@mandy-chessell mandy-chessell added the pinned Keep open (do not time out) label Jun 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pinned Keep open (do not time out) triage New bug/issue which needs checking & assigning
Projects
None yet
Development

No branches or pull requests

5 participants