lanark

com.io7m.lanark

Lanark

A specification for restricted dotted names.

Motivation

Applications such as compilers and package systems often use so-called reverse DNS notation to identify packages and code artifacts. Unfortunately, because reverse DNS notation is underspecified, each implementation has its own idea of which names should be permitted and which should be rejected.

This specification attempts to define a restricted form of the notation with the following properties:

  • Names can be validated with a simple regular expression that is defined in such a way as to avoid resource exhaustion attacks.
  • Names are defined using a strict subset of ASCII in order to avoid Unicode-based spoofing and phishing attacks.
  • Names are defined such that the maximum length of a name is bounded in order to provide for predictable storage use in database applications.

Definitions

Regular Expression

A dotted name is a string matching the following regular expression:

([a-z][a-z0-9_-]{0,63})(\.[a-z][a-z0-9_-]{0,62}){0,15}
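As a minimal sketch of how the expression might be used at runtime in Java, the class below compiles it once and checks whole strings against it. The class name `DottedNames` and the method `isValid` are hypothetical names chosen for illustration; they are not taken from the `com.io7m.lanark` API.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the com.io7m.lanark API.
final class DottedNames {
  // The canonical expression, copied verbatim from the specification.
  // Only the backslash is doubled, as Java string literals require.
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  private DottedNames() { }

  // matches() requires the entire input to match, so the expression
  // needs no explicit anchors.
  static boolean isValid(final String text) {
    return DOTTED_NAME.matcher(text).matches();
  }

  public static void main(final String[] args) {
    final String[] samples = {
      "com.io7m.lanark", "a", "Com.io7m", "com..io7m", "com.7zip", ""
    };
    for (final String sample : samples) {
      System.out.println("'" + sample + "' valid: " + isValid(sample));
    }
  }
}
```

Using `matches()` rather than `find()` matters here: `find()` would accept any string that merely contains a valid name somewhere inside it.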

Theorem SPLIT

A dotted name can always be split into between 1 and 16 segments by splitting the name at each dot, such that each segment is itself a valid dotted name.

Proof

Let x be a dotted name. By the definition of the regular expression that defines a dotted name, x must be one of:

  • A single character in the range [a-z].
  • A primary segment consisting of a single character in the range [a-z] followed by up to 63 characters from the set [a-z0-9_-].
  • A primary segment followed by between 1 and 15 secondary segments that each consist of a dot ., followed by a single character in the range [a-z], followed by up to 62 characters from the set [a-z0-9_-].

If x is a single character in the range [a-z], then there is no splitting to be performed and x is already trivially matched by the regular expression and is therefore a valid dotted name.

If x consists of a single primary segment, then there is no splitting to be performed and x is already trivially matched by the regular expression and is therefore a valid dotted name.

If x consists of a primary segment followed by between 1 and 15 secondary segments, then for each segment s it is necessary to show that s matches the regular expression when the preceding dot (if s is a secondary segment) is removed.

  • If s is a primary segment, then it already matches the regular expression.

  • If s is a secondary segment, then it will effectively become the primary segment of the new dotted name. Because s is a secondary segment, it must match the subexpression [a-z][a-z0-9_-]{0,62}. By the semantics of length ranges in regular expressions, any string matched by an expression e{0,n} will also be matched by an expression e{0,n+1}. As the subexpression for primary segments is [a-z][a-z0-9_-]{0,63}, s will match and is therefore a valid primary segment.
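The property stated by SPLIT can also be checked empirically. The following is a rough sketch in Java, not a substitute for the proof: it splits a valid name at each dot and confirms that there are between 1 and 16 segments and that each one matches the full expression on its own. The names `SplitCheck`, `segments`, and `splitHolds` are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating Theorem SPLIT on concrete inputs.
final class SplitCheck {
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  private SplitCheck() { }

  // Split a valid dotted name into its segments, dropping the dots.
  static List<String> segments(final String name) {
    if (!DOTTED_NAME.matcher(name).matches()) {
      throw new IllegalArgumentException("Not a dotted name: " + name);
    }
    // A limit of -1 keeps empty strings; a valid name never yields any.
    return List.of(name.split("\\.", -1));
  }

  // True if the name splits into 1..16 parts that are each valid names.
  static boolean splitHolds(final String name) {
    final List<String> parts = segments(name);
    return parts.size() >= 1
      && parts.size() <= 16
      && parts.stream().allMatch(p -> DOTTED_NAME.matcher(p).matches());
  }

  public static void main(final String[] args) {
    System.out.println(segments("com.io7m.lanark"));
    System.out.println(splitHolds("com.io7m.lanark"));
  }
}
```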

Theorem SIZE

The number of characters in any dotted name is <= 1024.

Proof

By SPLIT, we know that a dotted name x consists of a primary segment followed by up to 15 secondary segments.

The regular (sub)expression that matches a primary segment is defined as [a-z][a-z0-9_-]{0,63}. The maximum length of a primary segment is therefore 1 + 63 = 64.

The regular (sub)expression that matches a secondary segment is defined as \.[a-z][a-z0-9_-]{0,62}. The maximum length of a secondary segment, including its leading dot, is therefore 1 + 1 + 62 = 64.

The regular (sub)expression that defines how many secondary segments may appear in a dotted name is defined as e{0,15}, so the maximum number of secondary segments is 15 and therefore the maximum number of characters that can be used for secondary segments is 15 * 64 = 960.

We can therefore conclude that a string consisting of a maximum length primary segment and the maximum number of maximum length secondary segments has length 64 + (15 * 64) = 1024, and that no dotted name can be longer.
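The arithmetic above can be illustrated by constructing a longest possible name and checking it against the expression. This is a sketch only: the class `SizeCheck` and the method `longestName` are hypothetical, and `String.repeat` assumes Java 11 or newer.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating Theorem SIZE on a concrete input.
final class SizeCheck {
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  private SizeCheck() { }

  // Build a name with a 64-character primary segment followed by
  // 15 secondary segments of 64 characters each (dot included).
  static String longestName() {
    final String primary = "a" + "b".repeat(63);
    final String secondary = ".a" + "b".repeat(62);
    return primary + secondary.repeat(15);
  }

  public static void main(final String[] args) {
    final String name = longestName();
    System.out.println("length: " + name.length());
    System.out.println("valid: " + DOTTED_NAME.matcher(name).matches());
    // One extra character overflows the final segment.
    System.out.println("longer valid: "
      + DOTTED_NAME.matcher(name + "b").matches());
  }
}
```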

Coq

Machine-checked proofs of the above propositions are provided in the Lanark.v file.

Rationale

Why are names defined in terms of a regular expression rather than as a BNF grammar?

This specification is being written to support the development of various io7m software packages, and validation of dotted names is expected to occur in a wide range of different contexts such as XML schemas, at runtime in Java code, in definitions of SQL tables, and so on. These environments all feature regular expression validation, and not all of them support writing parsers for more advanced grammars. With the specification itself containing the canonical regular expression, this expression can literally be pasted into various locations without needing any changes.

Why are names restricted to a subset of ASCII?

One of the uses for dotted names is in the naming of software packages published onto the web. In systems that allow for the full use of Unicode to name packages, it's possible for malicious parties to spoof the appearance of packages by using carefully crafted names. For example:

com.io7m.example
com.iọ7m.example

The second package is a malicious package. It would be fairly trivial for someone to sneak in a reference to this package as a dependency in an open source project and have it go unnoticed. For those unable to tell the difference: The o in io7m in the second package is actually U+1ECD ("Latin Small Letter O with Dot Below"). This is almost indistinguishable from the first package, but could easily be used to fool people into thinking they're installing a package written by someone controlling the com.io7m namespace.
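A short Java sketch of how the ASCII restriction catches this kind of spoofing follows. The look-alike name is written here with the escape `\u1ECD` (Latin small letter o with dot below) so that the difference is visible in the source; that choice of character, and the names `SpoofCheck`, `isValid`, and `firstNonAscii`, are assumptions made for illustration.

```java
import java.util.OptionalInt;
import java.util.regex.Pattern;

// Hypothetical helper showing that a look-alike name is rejected.
final class SpoofCheck {
  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  private SpoofCheck() { }

  static boolean isValid(final String text) {
    return DOTTED_NAME.matcher(text).matches();
  }

  // Find the first code point outside the ASCII range, if there is one.
  static OptionalInt firstNonAscii(final String text) {
    return text.codePoints().filter(cp -> cp > 0x7F).findFirst();
  }

  public static void main(final String[] args) {
    final String genuine = "com.io7m.example";
    // Assumed look-alike: the escape below stands in for the spoofed 'o'.
    final String spoofed = "com.i\u1ECD7m.example";

    System.out.println("genuine valid: " + isValid(genuine));
    System.out.println("spoofed valid: " + isValid(spoofed));
    firstNonAscii(spoofed).ifPresent(cp ->
      System.out.println(
        String.format("offending character: U+%04X %s",
          cp, Character.getName(cp))));
  }
}
```

Because the character classes in the expression only list ASCII characters, no separate confusable-detection step is needed: any look-alike outside that set fails validation outright.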

Won't ASCII cause problems for non-English developers?

Currently, Maven Central is the largest collection of open-source software on the planet. Artifacts published to Maven Central have a group name and an artifact name. It is conventional for group names to be in reverse DNS notation, and it is not uncommon for artifact names to also be in this same notation. By analyzing the largest collection of open-source software on the planet, we can probably get some idea as to how developers all over the world are naming their artifacts.

An index is published weekly consisting of a list of every single artifact published into the repository. By analyzing the names of artifacts and groups and checking to see if those names could be expressed using the restricted dotted name specification here, we observed the following:

  • There are 69604 unique group names on Maven Central. Of these, 68690 have names that are expressible using the syntax defined here. This leaves 914 inexpressible groups, for a coverage of 98.69%.

  • There are 431423 unique artifact names on Maven Central. Of these, 343678 have names that are expressible using the syntax defined here. This leaves 87745 inexpressible names, for a coverage of 79.66%.

However, we also analyzed the reasons that names failed to match the syntax defined here and determined:

  • 618 and 24810 group and artifact names, respectively, failed to match because they contained uppercase characters. If all names are converted to lowercase, this removes a significant chunk of "bad" names.

  • 62696 artifact names failed to match because they contained characters other than [a-z] after a dot. These were frequently artifacts that, for whatever reason, decided to encode version numbers within the name itself. A random sample of failing names is as follows:

      com.github.javawithmarcus.wicket-cdi-1.1
      org.floggy.3rd.org.eclipse.core
      com.github.1137095129
      io.github.2gis
      com.github.9215095360
      com.9isuper.eve
      org.99soft
      com.moz.kiji.delegation.kiji-delegation.3.0.0.com.moz.kiji.schema
      org.floggy.3rd.org.eclipse.ui
      io.7mind.izumi.sbt
      opentelemetry-armeria-1.0
      common-util_2.13
      mongoauth_3.1_2.12
      content-api-client_2.12
      utils-test_2.12
      kafkakit_2.13
      ciris-refined_2.11
      dynamo-test_2.13
      case-service_2.12
      jimcy-java-api_2.11
    
  • Only a single artifact name on Maven Central used a non-ASCII character: com.github.marcioos:bgg-clienẗ

  • Only 69 artifacts contained name segments that were too long to be supported by the syntax defined here. A random sample of failing names is as follows:

    rapidpm-proxybuilder-modules-dynamicobjectadapter-generator-processors
    rapidpm-proxybuilder-modules-objectadapter-generator-usages-usinggenerated
    spring-cloud-starter-stream-processor-tasklaunchrequest-transform
    stormpath-sdk-examples-spring-security-spring-boot-webmvc-bare-bones
    camel-quarkus-integration-tests-support-custom-type-converter-deployment
    wildfly-microprofile-reactive-streams-operators-cdi-provider-legacy-namespace
    camel-quarkus-integration-tests-support-custom-type-converter-parent
    stormpath-sdk-tutorials-spring-boot-default-spring-security-refined
    nav-virksomhet-tiltakOgAktiviteterForBrukere-v1-meldingsdefinisjon
    camel-quarkus-integration-test-support-core-main-collector-ext-deployment
    

    In the author's opinion, these names are somewhat excessive and could be supported with dotted notation instead of relentless hyphenation.

  • Fewer than 20 artifacts had any other characters that do not appear in the regular expression defined here. A random sample is as follows:

    # These contain ':'
    libaums:storageprovider
    com.foilen:database-tools
    libaums:http
    reactivex:rxjs
    app.ubie:brave-kt
    
    # These contain '+'
    bctsp-jdk15+
    bcpg-jdk15+
    sugar-tms_2.12at13+
    mvp+android
    bcprov-jdk15+
    amiitool+android
    bcmail-jdk15+
    
    # These contain whitespace
    com.inkapplications.spondee.math-macosx64.0.0.3.com 2.inkapplications.spondee
    com.inkapplications.spondee.math-macosx64.0.0.3.com 3.inkapplications.spondee
    com.inkapplications.spondee.math-macosx64.0.0.3.com 4.inkapplications.spondee
    utilex
    
    # These contain quote characters
    "palsolayouts"
    "android-sdk"
    "rxbluetooth"
    

    Many of these look like publication mistakes.
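The failure categories listed above could be approximated with a small classifier such as the Java sketch below. This is not the tooling used for the analysis, which is not described here; the class `FailureReasons`, the `Reason` enum, and the order in which the checks are applied are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical classifier; not the tool used to produce the figures above.
final class FailureReasons {
  enum Reason {
    VALID, NON_ASCII, UPPERCASE, BAD_SEGMENT_START, SEGMENT_TOO_LONG, OTHER
  }

  static final Pattern DOTTED_NAME = Pattern.compile(
    "([a-z][a-z0-9_-]{0,63})(\\.[a-z][a-z0-9_-]{0,62}){0,15}");

  private FailureReasons() { }

  // Report the first reason found, checked in an assumed fixed order.
  static Reason classify(final String name) {
    if (DOTTED_NAME.matcher(name).matches()) {
      return Reason.VALID;
    }
    if (name.chars().anyMatch(c -> c > 0x7F)) {
      return Reason.NON_ASCII;
    }
    // Names that become valid once lowercased failed only due to case.
    if (DOTTED_NAME.matcher(name.toLowerCase(Locale.ROOT)).matches()) {
      return Reason.UPPERCASE;
    }
    final String[] segments = name.split("\\.", -1);
    for (final String segment : segments) {
      // Every segment must begin with a character in [a-z].
      if (segment.isEmpty()
          || segment.charAt(0) < 'a' || segment.charAt(0) > 'z') {
        return Reason.BAD_SEGMENT_START;
      }
    }
    for (int i = 0; i < segments.length; i++) {
      // Dots excluded: 64 characters for the first segment, 63 after.
      final int limit = (i == 0) ? 64 : 63;
      if (segments[i].length() > limit) {
        return Reason.SEGMENT_TOO_LONG;
      }
    }
    // Remaining cases include characters such as ':' or '+', and names
    // with more than 16 segments.
    return Reason.OTHER;
  }

  public static void main(final String[] args) {
    for (final String name : new String[] {
      "com.io7m.lanark", "common-util_2.13", "org.99soft", "mvp+android"
    }) {
      System.out.println(name + " -> " + classify(name));
    }
  }
}
```

A name can fail for several reasons at once, so the counts produced by a classifier like this depend on the order of the checks.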

It is therefore the position of the author that if people are publishing packages with non-English names, they appear to be doing it using the ASCII character set.

Why is the length of names bounded?

For two reasons:

  • Regular expressions can be subject to denial of service attacks, particularly when they contain unbounded quantifiers such as * and +.

  • Adding an upper bound on length means more predictable storage use when names are used in relational databases.

The regular expression as it is defined here is expected to be somewhat less vulnerable to denial of service attacks in naive regex engine implementations than an unbounded version would be.