Skip to content

DataONEorg/mnlite

Repository files navigation

mnlite

Light weight read-only DataONE member node in Python Flask

Development Notes

Creating a MN with node identifier urn:node:mn_1:

workon mnlite
export FLASK_APP=mnlite
mkdir -p instance/nodes/mn_1
flask m_node new_node mn_1

Add a subject to the MN:

opersist -f instance/nodes/mn_1 sub -o create -n "Dave" -s 'https://orcid.org/0000-0002-6513-4996'

Adjust the node configuration to specify default_submitter, default_owner, base_url, and contact_subject:

{
  "node": {
    "node_id": "urn:node:mn_1",
    "state": "up",
    "name": "Unnamed member node: mn_1",
    "description": "No description available for this node.",
    "base_url": "http://localhost:5000/mn_1/",
    "schedule": {
      "hour": "*",
      "day": "*",
      "min": "0,10,20,30,40,50",
      "mon": "*",
      "sec": "5",
      "wday": "*",
      "year": "*"
    },
    "subject": "http://localhost:5000/mn_1",
    "contact_subject": "https://orcid.org/0000-0002-6513-4996"
  },
  "content_database": "sqlite:///content.db",
  "log_database": "sqlite:///eventlog.db",
  "data_folder": "data",
  "created": "2021-02-19T15:17:09+0000",
  "default_submitter": "https://orcid.org/0000-0002-6513-4996",
  "default_owner": "https://orcid.org/0000-0002-6513-4996"
  "spider":{
    "sitemap_urls":[
      "https://datadryad.org/sitemap.xml"
    ]
  }
}

The mnlite service:

workon mnlite
export FLASK_APP=mnlite
export FLASK_ENV=development
flask run

The soscan service:

workon mnlite

Collecting content from a source.

Implemented as a scrapy based crawler. Given a sitemap, crawls and adds discovered SO:Dataset entries to the persistence store.

Settings are in settings.py

workon mnlite
scrapy crawl JsonldSpider  -s STORE_PATH=instance/nodes/mn_3

To count sitemap loc entries only:

scrapy crawl JsonldSpider -s STORE_PATH=instance/nodes/mnTestDRYAD -L INFO -a count_only=1

Model

Thing

A digital entity. The persistent identifier is the Sha256 hash of the bytes. A Thing may have more than one identifier. An instance of Thing may be any digital object such as metadata, data, and so forth.

A Thing may have multiple Identifiers.

A Thing has only one unique Sha256 hash. It has single Sha1 and MD5 hashes that are unique within the constraints of those hashing algorithms.

Identifier

Captures metadata associated with a minted identifier. Note that this is about the identifier, it’s creation, and other management aspects. The content referenced by an identifier is described by the Thing.

An Identifier may be associated with more than one Thing. There are situations where this can happen:

  1. Different representations of the same thing. For example a digital entity may be conceptually the same though serialized differently.

  2. Different aspects of the same thing. For example, a DOI may resolve to a landing page that describes a Thing.

  3. Erroneous duplication, the Identifier is used to reference two or more distinct, unrelated entities.

  4. The identifier refers to a conceptualization of a thing. For example a series identifer in the DataONE system refers to the most recent version of some thing.

In practice, while an identifier may be intended to be a globally unique reference to a specific digital thing, the most reliable mechanism to achieve this is using an identifier derived from the content of the thing.

Identifiers generated from a hash of the content of a Thing will be unique, subject to the contraints of the hashing algorithm. It is assumed here for all practical purposes that a Sha256 identifier will always refer to exactly one digital entity.

Identifiers may refer to a physical thing. Physical things do not exist digitally. Hence, any digital entity can only be associated with a physical thing, it can not be that thing. As such, where an identifier is used to refer to a physical thing and a digital thing, the digital thing must result from some observation of the physical thing. The outcomes of such observation may be manifest in many different forms such as metadata, data records, images, and other digital entities.

Relation

Documents a relationship between two identifiers.

Where identifiers refer to specific digital entities, the relation is unambiguous.

Ambiguous relationships arise where identifiers may refer to more than one Thing. The degree of ambiguity varies with the precision of the identifiers.

A relationship between physical things implies a physical association. E.g. a sub-sample, a sibling sample from a batch, geographcally co-located, and so forth.

Relations exist within a Context. As of this writing, Contexts are defined by a label.

AccessRule

Defines how Subjects may interact with a Thing.

A Thing may have multiple AccessRules.

An AccessRule may have multiple Subjects.

Subject

Identifies an actor that may interact with a Thing.

Request

Holds metadata associated with a request such as a HTTP request resolving an Identifier or retrieving a Thing.