New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Enforce module encapsulation via a linter #41862

Closed

crisptrutski wants to merge 9 commits into master from lint-modules

Contributor

crisptrutski commented Apr 25, 2024 •

edited

Description

This introduces a linter against importing a module's internals from outside that module.

Uses metabase.upload and metabase.db as test cases, which motivate the need to supporting nested and internal modules.

metabase-bot bot assigned crisptrutski

metabase-bot bot added the .Team/BackendComponents label

crisptrutski commented

View reviewed changes

Contributor Author

crisptrutski left a comment

I wish that this were simpler, and perhaps it may be so. Ideas welcome!

.clj-kondo/hooks/clojure/core.clj Outdated

Comment on lines 184 to 185

		;; TODO -- this hard coding is not ideal, it would be better to support metadata on the namespaces themselves.
		(def ^:private global-modules

Contributor Author

crisptrutski Apr 25, 2024

Besides the convenience of leaving this detail for another PR, it's quite nice seeing all the modules in one place for now, particularly if we want to plug some of the leaks in metabase.db.

.clj-kondo/hooks/clojure/core.clj

@@ @@ -179,6 +179,134 @@ @@
                   (non-thread-safe-form-should-end-with-exclamation* node))
                 {:node node})
+              ;; ----------------- Modules -----------------

Contributor Author

crisptrutski Apr 25, 2024

It would be nice to hide all these helper function in a (testable) namespace, but I didn't figure out how to do that.

Contributor Author

crisptrutski Apr 26, 2024

Speaking of tests, I didn't find an easy way to jack into this kondo namespace with Cursive, so I didn't bother with "rich comment" tests either. I tested manually by inserting actual requires in the app and running the linter but don't have a great way to share that here. I guess I could show it via screenshots or a loom, but this seems a bit OTT.

.clj-kondo/hooks/clojure/core.clj Outdated

+                 'metabase.upload
+                 ;; We anticipate using this CLJC ns for type inference on the frontend, so it's exposed outside the CLJ module.
+                 'metabase.upload.types))

Contributor Author

crisptrutski Apr 25, 2024

Leaving aside plugging the leaks in metabase.db, which is included as a stress-test we could defer, one might as if it's worth exposing supporting nested modules if we're not yet using this namespace from the frontend. One might also argue that it's better to just move this code out of the module, despite the awkward naming challenge that presents.

For all the complexity of this linter, only a handful comes from this feature, so I think it's worth it.

.clj-kondo/hooks/clojure/core.clj Outdated

+                 'metabase.upload.types))
+              ;; TODO -- this hard coding is not ideal, it would be better to support metadata on the namespaces themselves.
+              (def ^:private internal-modules

Contributor Author

crisptrutski Apr 25, 2024

We don't need these for uploads, but I think they're pretty natural, and only a little bit more complicated than global nested modules to support. A drop in the bucket compared to the vanilla machinery.

.clj-kondo/hooks/clojure/core.clj

+                    (when (str/starts-with? (name ns-sym) prefix)
+                      module-ns-sym))))
+              (def ^:private module-detectors

Contributor Author

crisptrutski Apr 25, 2024

While this is tempting to inline, its more efficient to only calculate it once.

crisptrutski added the no-backport label

crisptrutski commented

View reviewed changes

.clj-kondo/hooks/clojure/core.clj

+                (into #{} (remove nil?)
+                      [;; Modules can depend on their own internals.
+                       (modules ns-sym)
+                       ;; We can depend on our siblings.

Contributor Author

crisptrutski Apr 25, 2024

It feels a bit weird for modules to be able to depend on sibling modules. We may want to tighten this up a bit.

replay-io bot commented Apr 25, 2024 •

edited

Status	Complete ↗︎
Commit	`3bfc433`
Results	⚠️ 9 Flaky ✅ 2427 Passed

crisptrutski requested a review from a team

April 26, 2024 07:59

calherries reviewed

View reviewed changes

.clj-kondo/hooks/clojure/core.clj Outdated

+                (sorted-set
+                 'metabase.db
+                 ;; We may aspire to encapsulating more of these - either through refactoring or ^:clj-kondo/ignore.
+                 'metabase.db.connection

Contributor

calherries Apr 26, 2024

It surprises me that this allows metabase.db.connection to be a top-level module, while metabase.db is also a top-level module. I thought we wanted a rule that if metabase.db was a module then everything matching metabase.db.* except metabase.db.core would be internal. But the way this works is if metabase.db is a module, then everything matching metabase.db.* is internal, except for namespaces that are also listed as global. Is there a motivation behind this?

Contributor Author

crisptrutski Apr 26, 2024 •

edited

I think the notion of nested module is what I call internal-modules in the next var, with metabase.db.liquibase being the one example.

The main motivation I have for nested modules that are also globally visible is metabase.upload.types, where a big reason for how and why it was extracted was making it available to the frontend. I think this example here with metabase.db also shows how it's a useful mechanism for "latching" a large module that isn't quite fully encapsulated yet, but is getting there.

An (potentially dangerous to this audience) analogy can also be made to thinking of module namespaces like service objects, and these global submodules as their public properties. I think this will give a good and natural way to have some interface segregation for large modules.

We could YAGNI it for now, but I felt it falls out quite naturally, and it's good to know we're not designing ourselves into a corner that wouldn't support it.

Contributor

calherries Apr 26, 2024 •

edited

I asked this question because I think this code can be simpler if we enforce the rules that we discussed at the offsite, rather than extending it to allow for nested-by-name-but-still-top-level modules like you're doing here.

I think we should keep the logic for determining what's internals or not rule-based because it has benefits for us humans. For example, if I know metabase.db is a module, I know that metabase.db.liquibase is guaranteed to be internal to it. I don't have to go to this namespace and check whether metabase.db.liquibase is top-level or not, unlike this implementation would make me do.

Here's an outline of the solution I had in mind, and it only works if we don't allow for "nested modules that are globally visible" like we do here.

A namespace named N is internal to module named M if N matches M.+ but not M.core.
A namespace named N is an interface to a module named M if N is M.core or M (in practice M and M.core shouldn't coexist)

The rule we enforce is this: a namespace N can't use an internal namespace of any module (nested or top-level) unless N is itself internal to that module, or N is an interface namespace for that module.

This works for nested modules too. Consider the case that A.B and A.B.C are nested modules inside module A. Some test cases:

A.D can use A.B because even though A.B is internal to A, A.D is too.
A.D can not use A.B.C because A.B.C is internal to A.B but A.D is not.
A.B can use A.B.C because it is an interface namespace for A.B.

Contributor

calherries Apr 26, 2024

As for metabase.upload.types being used on the FE, I do think we can YAGNI on that for now, and maybe it would be better to expose module system for clojurescript code by a separate mechanism.

Contributor Author

crisptrutski Apr 26, 2024 •

edited

Thanks for the detailed explanation, I think we're on the same page with the semantics. The only difference I can see is this code allows nested modules to be marked as "public" by declaring them global-modules¹. If we ignore that detail, I'm curious whether there is a semantic difference I missed, or you just feel the implementation could be simpler. The majority of the complexity is in the token-walking and node tagging I think, not in support of this feature.

To streamline this discussion and perhaps the PR flow too, I've removed this "public" motion, and made all nested modules implicitly private. The practical difference besides the deleted config and sanity checks is just one line of logic being removed². With only this notion of nested we can no longer modularize metabase.db, so the scope is reduced to just metabase.upload, which is all we really want for now anyway.

Let's discuss this implementation for now, and once that's settled we can shift back to the "default private" versus "cannot be made public" debate.

I've motivated for global modules a bit already in the code and discussion, and feel they have precedent in how other languages handle visibility, but let's put them aside for now. ↩
(remove (comp global-modules :ns-sym)) ↩

Contributor Author

crisptrutski Apr 26, 2024

As for metabase.upload.types being used on the FE, I do think we can YAGNI on that for now, and maybe it would be better to expose module system for clojurescript code by a separate mechanism.

Agreed, the main reason I included it was to check the viability of my preferred DX and start the conversion. It's gone and left for another day.

I'd prefer fewer moving parts and special cases, or diverging structure in Clojurescript, so I'm a bit skeptical of this "separate mechanism", but who knows elegant solution could occur to us when we need to use it in anger.

Contributor

calherries Apr 26, 2024 •

edited

Cool, thanks for simplifying this discussion about interface vs implementation. You're right that the debate about "cannot be made public" can be parked so let's simplify the implementation first. My intuition that this can be made a lot simpler is based on the fact that the code doesn't map well to the English description of how it should work. Here's a demonstration of what I mean, putting the English desciption next to the the clojure code:

;; 1. A namespace named N is an interface to a module named M if N is M.core or M
(defn interface-for? [n m] (contains? #{m (str m ".core")} n))

;; 2. A namespace named N is internal to module named M if N matches M.+ and is not an interface for M
(defn internal-to? [n m] (and (str/starts-with? n m) (not (interface-for? n m))))

;; A namespace N can't use an internal namespace of any module (nested or top-level)
;; unless N is itself internal to that module, or N is an interface namespace for that module.
(defn illegal-module [modules using used]
  (first (filter (fn [module]
                   (and (internal-to? used module)
                        (not (or (internal-to? using module)
                                 (interface-for? using module)))))
                 modules)))

illegal-module could be used by module-internals-should-be-encapsulated, replacing the use of allowed-parents and parent-module.

On the topic of testing, I think something like this would suffice:

(deftest illegal-module-test
  (let [modules #{"A" "A.B" "A.B.C"}]
    (is (nil?      (illegal-module modules "A.D" "A.B")))         ; A.B is internal to A
    (is (= "A.B"   (illegal-module modules "A.D" "A.B.C")))       ; A.B.C is internal to A.B but A.D is not and A.D is not the interface for A.B
    (is (nil?      (illegal-module modules "A.B" "A.B.C")))       ; A.B.C is an interface namespace for A.B.C
    (is (= "A.B.C" (illegal-module modules "A.B" "A.B.C.D")))     ; A.B.C.D is internal to A.B.C but A.B is not and A.B is not the interface for A.B.C
    (is (nil?      (illegal-module modules "A.B" "A.B.C.core")))) ; A.B.C.core is an interface namespace for A.B.C
)

Contributor Author

crisptrutski Apr 26, 2024 •

edited

Hmm, it's interesting how subjective simplicity and intuitiveness can be. Late post incoming with a lot of recency bias towards the first algorithm I thought of.

If we peel away all the logic about walking the tokens, and reporting the findings on the appropriate nodes, dealing with symbols versus strings, handlings tests, etc the PR's code boils down to the following:

;; 1. The most directly enclosing module of the given namespace, if there is one.
(defn- parent-module [modules ns] (first (filter #(str/starts-with? ns (str % ".")) (reverse modules))))

;; 2. The modules containing the direct children and siblings of the given namespace.
(defn- allowed-modules [modules this-ns]
  (into #{} (filter modules) [(str/replace this-ns #"\.core$" "") (parent-module modules this-ns)]))

;; Check that the [used] can be imported by [using]
(defn illegal-module
  [modules using used]
  (when-let [used-parent-module (parent-module modules used)]
    ;; Is the give dependency inside a module that is opaque to us?
    (when (not (contains? (allowed-modules modules using) used-parent-module))
      used-parent-module)))

It's a little bit chunkier than your code, but I'd argue its considerably more declarative and maps more naturally to human language. Seeing it this way I'm actually feeling a lot happier with it than when it hid in the kondo murk. Maybe there's a way to factor it out a bit better.

Perhaps it was end-of-week fatigue, but convincing myself your code was right felt like math exercise needing to work through complex boolean logic, compared to single branch in the reference code. I even got a bit mixed up and thought I saw a bug (that A.B wouldn't load A.E), but I was being silly.

Getting to less subjective stuff - let's talk efficiency and debug-ability. I think it's useful to look at your implementation being expanded:

(defn illegal-module [modules using used]
  (first (filter (fn [module]
                   (and (and (str/starts-with? used (str module "."))
                             (not (contains? #{module (str module ".core")} used)))
                        (not (or (and (str/starts-with? using (str module "."))
                                      (not (contains? #{module (str module ".core")} using)))
                                 (contains? #{module (str module ".core")} using)))))
                 modules)))

While it's very compact, I see a few problems with the computational structure:

Some allocating forms get executed multiple times. These can't be de-duplicated without changing the call structure.
Redundant logic -(not (contains? .. using)) could be skipped since its negative is checked in the surrounding or.
All logic occurs within a loop, and is quite low level. Stepping through or tracing this would be really tough.
This loop is less likely to short-circuit than the separate "allowed" versus "found" loops.
Since we will loop over multiple multiple used variables for the same values of using and modules, these problems get exacerbated.

Given that we run this linter across every namespace on every push, efficiency matters, and even though this is an unfair shoot out with non-optimized code it does seem like there is less low-hanging fruit. The ability to pre-build the detectors and then run all the calculations for the given using only once across all imports are both huge boosts that fall out quite naturally.

Contributor

calherries Apr 29, 2024 •

edited

Nice! This is more like what I was looking for. Let's shift focus to correctness because now I see a couple of issues with this

Your illegal-module doesn't pass my tests above, there's a small bug in parent-module.

(deftest illegal-module-test
  (let [modules (sorted-set "A.B.C" "A.B" "A")]
    (is (nil?      (illegal-module modules "A.D" "A.B")))         ; A.B is internal to A
    (is (= "A.B"   (illegal-module modules "A.D" "A.B.C")))       ; A.B.C is internal to A.B but A.D is not and A.D is not the interface for A.B
    (is (nil?      (illegal-module modules "A.B" "A.B.C")))       ; A.B.C is an interface namespace for A.B.C
    (is (= "A.B.C" (illegal-module modules "A.B" "A.B.C.D")))     ; A.B.C.D is internal to A.B.C but A.B is not and A.B is not the interface for A.B.C
    (is (nil?      (illegal-module modules "A.B" "A.B.C.core")))  ; A.B.C.core is an interface namespace for A.B.C
  ))

(illegal-module-test)
; FAIL in (illegal-module-test)
; expected: (nil? (illegal-module modules "A.B" "A.B.C.core"))
;   actual: (not (nil? "A.B.C"))
nil

The problem is (parent-module modules "A.B.C.core") returns "A.B.C", but it should return "A.B". The problem is fixed by using my internal-to? function above:

(defn interface-for? [n m] (contains? #{m (str m ".core")} n))

(defn internal-to? [n m] (and (str/starts-with? n m) (not (interface-for? n m))))

(defn- parent-module [modules n] (first (filter #(internal-to? n %) (reverse modules))))

(illegal-module-test)
nil

I initially thought your use of parent-module was just an optimisation trick, but now I see that it actually produces different semantics to how we want modules to work and your PR description. There's an important test case I missed, that your code fails:

(deftest illegal-module-test
  (let [modules (sorted-set "A.B" "A")]
    (is (nil? (illegal-module modules "A.B.C" "A.util")))))

(illegal-module-test)
; FAIL in (illegal-module-test)
; expected: (nil? (illegal-module modules "A.B.C" "A.util"))
;   actual: (not (nil? "A"))

Your code goes one step further than "a module prevents any code outside that module from using its internals", adding an additional restriction: "a module prevents its internals from using any code outside, except top-level modules or namespaces". That effectively rules out modules sharing any code except through top-level modules, and will force us to move a lot more code to the top-level.

For next steps, I think we need a set of passing tests everyone can agree on.

Contributor Author

crisptrutski May 9, 2024 •

edited

Think I ended up replying to this only in Slack, so putting a shortened reply here for posterity.

Ah thanks for spotting. I wasn't thinking of .core namespaces being first class, so only supported the situations I actually saw in the code already. Easy to fix with this PR's approach too.
I don't see why that should pass - why should you be able to read the internals of your parent module? I don't agree with that working unless A.util is also made a public module, which then works with this code.

I'm surprised these submodule semantics were so controversial. Since we don't have a way to write tests for linters as documentation, that I know of, I guess it's just as well we're deferring them.

Contributor

darksciencebase commented Apr 26, 2024

@crisptrutski no there is no issue for this and i think we can live without one

crisptrutski added 8 commits

April 26, 2024 15:42


          Enforce module encapsulation via a linter

af6a6a2


          Consolidate terminology

21e488d


          Tweaks

49466f3


          Dont need to import tests from src

bc50db8

46315e6


          More clarity

e5b8660


          Stray period

cef7bfb


          Forget about nesting for now

5f42769

crisptrutski force-pushed the lint-modules branch from d6e7c46 to 5f42769 Compare

April 26, 2024 13:47


          Keep support for private submodules

3bfc433

calherries mentioned this pull request

Cam's module linter 5000 #42250

Merged

Contributor Author

crisptrutski commented May 9, 2024

Superseded by the CML5000

crisptrutski closed this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment