Distributed Atom Space (DAS)

Description:

This repo aims to develop a new design to store all the MeTTa expressions in a database to be accessed through an API. Our first approach is using MongoDB (expressions) + Couchbase (indexes).

Examples:

As a simple example, we have the following expression:

(: Evaluation Type)
(: Predicate Type)
(: Reactome Type)
(: Concept Type)
(: "Predicate:has_name" Predicate)
(: "Reactome:R-HSA-164843" Reactome)
(: "Concept:2-LTR circle formation" Concept)
(
	Evaluation 
	"Predicate:has_name" 
	(
	    Evaluation 
	    "Predicate:has_name" 
	    {"Reactome:R-HSA-164843" "Concept:2-LTR circle formation"}
	)
)

MongoDB:

The _id must be built by hashing (sha256) the documents' fields to avoid duplication. For simplicity, we'll be using integers on this example.

NodeTypes: [
    { _id: 1, type: null, name: "Unknown" },
    { _id: 2, type: null, name: "Type" },
    { _id: 3, type: 2, name: "Evaluation" },
    { _id: 4, type: 2, name: "Predicate" },
    { _id: 5, type: 2, name: "Reactome" },
    { _id: 6, type: 2, name: "Concept" },
]

Nodes: [
    { _id: 7, type: 4, name: "Predicate:has_name" },
    { _id: 8, type: 5, name: "Reactome:R-HSA-164843" },
    { _id: 9, type: 6, name: "Concept:2-LTR circle formation" },
]

Links_1: [{}]

Links_2: [
    {
	    _id: 10,
	    set_from: 1,
	    is_root: false,
	    type: [Reactome, Concept],
	    key1: 8,
	    key2: 9,
    },
]

Links_3: [
    {
	    _id: 11,
	    set_from: null,
	    is_root: false,
	    type: [Type, Predicate, [Reactome, Concept]],
	    key1: 3,
	    key2: 7,
	    key3: 10,
    },
    {
	    _id: 12,
	    set_from: null,
	    is_root: true,
	    type: [Type, Predicate, [Type, Predicate, {Reactome, Concept}]],
	    key1: 3,
	    key2: 7,
	    key3: 11,
    },
]

As an example of how sha256 will be used here:

    _id: XX ->  sha256(sha256(type), sha256(key1), sha256(key2), ...)
    _id: 10 ->  sha256(sha256(set_salt, 5, 6), 8, 9)
    _id: 11 ->  sha256(sha256(2, 4, sha256(set_salt, 5, 6)), 3, 7, 10)
    _id: 12 ->  sha256(sha256(2, 4, sha256(2, 4, sha256(set_salt, 5, 6))), 3, 7, 11)

Notes:

The field named is_root is NOT used on hashing.
Each document that represents an expression has the field named set_from. This field represents:
- when equal to null that the keys in document wasn't ordered in anyway;
- when equal to 1 that the keys in document was ordered alphabetically since their first key;
- when equal to 2 that the keys in document was ordered alphabetically since their second key;
The set_from field will be different of null when:
- their expression represents a set ({ ... }). So set_from receives 1.
- the first key in expression points to a Similarity node type. So set_from receives 2.
The set_from field is used on hashing.

Couchbase:

IncomingSet:
{
    8: [10],
    9: [10],
    3: 2,
    3_0: [11],
    3_1: [12],
    7: 2,
    7_0: [11],
    7_1: [12],
    10: [11],
    11: [12]
}

RecursiveIncomingSet:
{
     8: [10, 11, 12],
     9: [10, 11, 12],
     3: [11, 12],
     7: [11, 12],
    10: [11, 12],
    11: [12]
}

OutgoingSet:
{
    10: [8, 9],
    11: [3, 7, 10],
    12: [3, 7, 11]
}

RecursiveOutgoingSet:
{
    10: [8, 9],
    11: [3, 7, 10, 8, 9],
    12: [3, 7, 11, 10, 8, 9]
}

At this point, we found a size limitation for values in Couchbase collections. Not rarely some keys in IncomingSet collection will have more than the limit of 20 MB defined by Couchbase under their values. On intend to bypass this limitation was implemented a rule to split the values into sub-keys. For simplicity, the example uses a limit of one value for each key (the real implementation has 500,000 as max number of values). The rule defines that once time a main key have more values than the max limit defined that key will be splitted into two other sub-keys and at the time the last one created sub-key achieve the max limit for their values a new sub-key will be created. The integer number storaged at main key represents the amount of the sub-keys existents under key itself. By their turn the sub-keys has the indentifier composed by the main key plus a counter starts at zero and ends at the integer storaged under main key minus one and both are separeted by underscore (_).

IncomingSet:
{
    8: [10],
    9: [10],
    3: 2,
    3_0: [11],
    3_1: [12],
    7: 2,
    7_0: [11],
    7_1: [12],
    10: [11],
    11: [12]
}

Here is another simple example to show how we create a graph from an expression:

(
    Evaluation
        "Predicate:P1"
        (
            (Evaluation "Predicate:P2" {"Gene:G1" "Gene:G2"})
            ("Concept:CN1" "Concept:CN2")
        )
)

Datasets:

You can find all the Atomese (.scm) files from gene-level-dataset_2020-10-20 already translated to MeTTa (.metta) in the data/bio_atomsapace directory.

The translation script used is in scripts/atomese2metta.

Get it started:

Go to das/ directory to get info about how to set up the necessary environment.

Name		Name	Last commit message	Last commit date
Latest commit History 525 Commits
assets		assets
das		das
data		data
flybase2metta		flybase2metta
notebooks		notebooks
scripts		scripts
service		service
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
Dockerfile.service		Dockerfile.service
README.md		README.md
canonical_load		canonical_load
docker-compose-service.yml		docker-compose-service.yml
environment		environment
mongodump		mongodump
redis.conf		redis.conf
requirements.txt		requirements.txt

singnet/das-poc

Folders and files

Latest commit

History

Repository files navigation

Distributed Atom Space (DAS)

Description:

Examples:

MongoDB:

Couchbase:

Datasets:

Get it started:

About

Resources

Stars

Watchers

Forks

Languages