Merge pull request #55 from octue/release/0.1.7

Release/0.1.7
octue · Jan 5, 2021 · 5bcf2b7
2 parents 0dd42ec + c2f6ff7
commit 5bcf2b7

Showing 33 changed files with 1,483 additions and 347 deletions.
diff --git a/.github/workflows/check-version-consistency.yml b/.github/workflows/check-version-consistency.yml
diff --git a/.github/workflows/python-ci.yml b/.github/workflows/python-ci.yml
@@ -9,6 +9,14 @@ name: python-ci
 on: [push]
 
 jobs:
+
+  check-version-consistency:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions/setup-python@v2
+      - run: python .github/workflows/scripts/check-version-consistency.py
+
   tests:
     runs-on: ubuntu-latest
     env:

diff --git a/docs/source/analysis_objects.rst b/docs/source/analysis_objects.rst
@@ -0,0 +1,44 @@
+.. _analysis_objects:
+
+================
+Analysis objects
+================
+
+An ``Analysis`` object is the sole argument to the ``app`` function in your ``app.py`` module. Its attributes include
+every strand that can possibly be added to a ``Twine``, although only the strands specified in your ``twine.py`` file
+will be set; the rest are ``None``. The attributes are:
+
+-   ``input_values``
+-   ``input_manifest``
+-   ``configuration_values``
+-   ``configuration_manifest``
+-   ``output_values``
+-   ``output_manifest``
+-   ``credentials``
+-   ``children``
+-   ``monitors``
+
+Additionally, all input and configuration attributes are hashed using a
+`BLAKE3 hash <https://github.com/BLAKE3-team/BLAKE3>`_ so the inputs and configuration that produced a given output in
+your app can always be verified. These hashes exist on the following attributes:
+
+-   ``input_values_hash``
+-   ``input_manifest_hash``
+-   ``configuration_values_hash``
+-   ``configuration_manifest_hash``
+
+If an input or configuration attribute is ``None``, its hash attribute is also ``None``. For ``Manifests``, some metadata
+about the ``Datafiles`` and ``Datasets`` within them, and about the ``Manifest`` itself, is included when calculating
+the hash:
+
+- For a ``Datafile``, the content of its on-disk file is hashed, along with the following metadata:
+
+    - ``name``
+    - ``cluster``
+    - ``sequence``
+    - ``posix_timestamp``
+    - ``tags``
+
+- For a ``Dataset``, the hashes of its ``Datafiles`` are included, along with its ``tags``.
+
+- For a ``Manifest``, the hashes of its ``Datasets`` are included, along with its ``keys``.
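
The layered hashing described in ``analysis_objects.rst`` above can be sketched as follows. This is a minimal illustration, not code from the octue codebase: the function names are hypothetical, ``keys`` is assumed to be a mapping of dataset names to indices, and BLAKE2b from the standard library stands in for BLAKE3, which needs a third-party package.

```python
import hashlib
import tempfile
from pathlib import Path

SEPARATOR = b"\x00"  # keeps adjacent fields from running into each other


def _digest(*parts):
    """Hash byte strings with BLAKE2b (a stand-in for BLAKE3 in this sketch)."""
    hasher = hashlib.blake2b()
    for part in parts:
        hasher.update(part)
        hasher.update(SEPARATOR)
    return hasher.hexdigest()


def hash_datafile(path, name, cluster, sequence, posix_timestamp, tags):
    """Hash the file's on-disk content along with the metadata fields listed in the docs."""
    metadata = repr((name, cluster, sequence, posix_timestamp, sorted(tags))).encode()
    return _digest(Path(path).read_bytes(), metadata)


def hash_dataset(datafile_hashes, tags):
    """Combine the hashes of a dataset's files with the dataset's tags."""
    return _digest(*(h.encode() for h in sorted(datafile_hashes)), repr(sorted(tags)).encode())


def hash_manifest(dataset_hashes, keys):
    """Combine the hashes of a manifest's datasets with its keys (assumed to be a dict)."""
    return _digest(*(h.encode() for h in sorted(dataset_hashes)), repr(sorted(keys.items())).encode())


# Hash the same file twice, changing only one tag between the two calls.
with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "my_file.csv"
    path.write_text("a,b\n1,2\n")
    original = hash_datafile(path, "my_file.csv", 0, None, 0.0, {"one", "all"})
    retagged = hash_datafile(path, "my_file.csv", 0, None, 0.0, {"two", "all"})

manifest_hash = hash_manifest([hash_dataset([original], {"all"})], {"my_dataset": 0})
print(original != retagged)  # True: changing a tag changes the hash
```

Because each level only consumes the hashes of the level below, a change to a single file's content or metadata propagates up to the manifest hash without the manifest re-reading any file.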
diff --git a/docs/source/datafile.rst b/docs/source/datafile.rst
@@ -0,0 +1,14 @@
+.. _datafile:
+
+========
+Datafile
+========
+
+A ``Datafile`` is an Octue type that corresponds to a file, which may exist on your computer or in a cloud store. It has
+the following main attributes:
+
+- ``path`` - the path of this file within the dataset, which may include folders or subfolders
+- ``cluster`` - the integer index of the cluster of files, within a dataset, to which this file belongs (default 0)
+- ``sequence`` - the sequence number of this file within its cluster (if sequences are appropriate)
+- ``tags`` - a space-separated string or iterable of tags relevant to this file
+- ``posix_timestamp`` - a POSIX timestamp associated with the file, in seconds since the epoch; typically its creation time, but it can be any time point relevant to the data
diff --git a/docs/source/dataset.rst b/docs/source/dataset.rst
@@ -0,0 +1,44 @@
+.. _dataset:
+
+=======
+Dataset
+=======
+
+A ``Dataset`` contains any number of ``Datafiles`` along with the following metadata:
+
+- ``name``
+- ``tags``
+
+The files are stored in a ``FilterSet``, meaning they can be easily filtered according to any attribute of the
+`Datafile <datafile.rst>`_ instances it contains.
+
+
+--------------------------------
+Filtering files in a ``Dataset``
+--------------------------------
+
+You can filter a ``Dataset``'s files as follows:
+
+.. code-block:: python
+
+    dataset = Dataset(
+        files=[
+            Datafile(path="path-within-dataset/my_file.csv", tags="one a:2 b:3 all"),
+            Datafile(path="path-within-dataset/your_file.txt", tags="two a:2 b:3 all"),
+            Datafile(path="path-within-dataset/another_file.csv", tags="three all"),
+        ]
+    )
+
+    dataset.files.filter(filter_name="name__ends_with", filter_value=".csv")
+    # <FilterSet({<Datafile('my_file.csv')>, <Datafile('another_file.csv')>})>
+
+    dataset.files.filter(filter_name="tags__contains", filter_value="a:2")
+    # <FilterSet({<Datafile('my_file.csv')>, <Datafile('your_file.txt')>})>
+
+You can also chain filters indefinitely:
+
+.. code-block:: python
+
+    dataset.files.filter(filter_name="name__ends_with", filter_value=".csv").filter("tags__contains", filter_value="a:2")
+    # <FilterSet({<Datafile('my_file.csv')>})>
+
+Find out more about ``FilterSets`` `here <filter_containers.rst>`_, including the filters available for each type of
+attribute of a ``FilterSet`` member, and how to convert them to primitive types such as ``set`` or ``list``.
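
Chaining works when each ``filter`` call returns another filterable container. The sketch below illustrates that idea only; ``File`` and ``SketchFilterSet`` are hypothetical stand-ins rather than the SDK's ``Datafile`` and ``FilterSet``, and only two filter actions are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class File:
    """Hypothetical stand-in for a Datafile: just a name and a tuple of tags."""

    name: str
    tags: tuple = ()


class SketchFilterSet(set):
    """A set whose filter method returns another SketchFilterSet, so calls can be chained."""

    ACTIONS = {
        "ends_with": lambda attribute, value: attribute.endswith(value),
        "contains": lambda attribute, value: value in attribute,
    }

    def filter(self, filter_name, filter_value):
        # Split "<attribute>__<action>" on the last double underscore.
        attribute_name, action_name = filter_name.rsplit("__", 1)
        action = self.ACTIONS[action_name]
        return SketchFilterSet(
            item for item in self if action(getattr(item, attribute_name), filter_value)
        )


files = SketchFilterSet(
    {
        File("my_file.csv", ("one", "a:2", "b:3", "all")),
        File("your_file.txt", ("two", "a:2", "b:3", "all")),
        File("another_file.csv", ("three", "all")),
    }
)

csv_with_a2 = files.filter("name__ends_with", ".csv").filter("tags__contains", "a:2")
print(sorted(f.name for f in csv_with_a2))  # ['my_file.csv']
```

Returning a new container instead of mutating in place also leaves the original set of files untouched after filtering.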
diff --git a/docs/source/filter_containers.rst b/docs/source/filter_containers.rst
@@ -0,0 +1,127 @@
+.. _filter_containers:
+
+=================
+Filter containers
+=================
+
+A filter container is a regular Python container that has some extra methods for filtering or ordering its
+elements. It has the same interface (i.e. attributes and methods) as the primitive Python type it inherits from, with
+these extra methods:
+
+- ``filter``
+- ``order_by``
+
+There are two types of filter containers currently implemented:
+
+- ``FilterSet``
+- ``FilterList``
+
+``FilterSets`` are currently used in:
+
+- ``Dataset.files`` to store ``Datafiles``
+- ``TagSet.tags`` to store ``Tags``
+
+You can see filtering in action on the files of a ``Dataset`` `here <dataset.rst>`_.
+
+
+---------
+Filtering
+---------
+
+Filters are named as ``"<name_of_attribute_to_check>__<filter_action>"``, and a ``FilterSet`` can be filtered on any
+attribute of its members whose type or interface is supported:
+
+.. code-block:: python
+
+    filter_set = FilterSet(
+        {Datafile(path="my_file.csv"), Datafile(path="your_file.txt"), Datafile(path="another_file.csv")}
+    )
+
+    filter_set.filter(filter_name="name__ends_with", filter_value=".csv")
+    # <FilterSet({<Datafile('my_file.csv')>, <Datafile('another_file.csv')>})>
+
+The following filters are implemented for the following types:
+
+- ``bool``:
+
+    * ``is``
+    * ``is_not``
+
+- ``str``:
+
+    * ``is``
+    * ``is_not``
+    * ``equals``
+    * ``not_equals``
+    * ``iequals``
+    * ``not_iequals``
+    * ``lt`` (less than)
+    * ``lte`` (less than or equal)
+    * ``gt`` (greater than)
+    * ``gte`` (greater than or equal)
+    * ``contains``
+    * ``not_contains``
+    * ``icontains`` (case-insensitive contains)
+    * ``not_icontains``
+    * ``starts_with``
+    * ``not_starts_with``
+    * ``ends_with``
+    * ``not_ends_with``
+
+- ``NoneType``:
+
+    * ``is``
+    * ``is_not``
+
+- ``TagSet``:
+
+    * ``is``
+    * ``is_not``
+    * ``equals``
+    * ``not_equals``
+    * ``any_tag_contains``
+    * ``not_any_tag_contains``
+    * ``any_tag_starts_with``
+    * ``not_any_tag_starts_with``
+    * ``any_tag_ends_with``
+    * ``not_any_tag_ends_with``
+
+
+
+Additionally, these filters are defined for the following *interfaces* (duck-types):
+
+- Numbers:
+
+    * ``is``
+    * ``is_not``
+    * ``equals``
+    * ``not_equals``
+    * ``lt``
+    * ``lte``
+    * ``gt``
+    * ``gte``
+
+- Iterables:
+
+    * ``is``
+    * ``is_not``
+    * ``equals``
+    * ``not_equals``
+    * ``contains``
+    * ``not_contains``
+    * ``icontains``
+    * ``not_icontains``
+
+The interface filters are only used if the type of the attribute being filtered is not one of the types in the first
+list.
+
+--------
+Ordering
+--------
+As sets are inherently unordered, ordering a ``FilterSet`` results in a new ``FilterList``, which has the same extra
+methods and behaviour as a ``FilterSet`` but is based on the ``list`` type instead, meaning it can be ordered and
+indexed. A ``FilterSet`` or ``FilterList`` can be ordered by any of the attributes of its members:
+
+.. code-block:: python
+
+    filter_set.order_by("name")
+    # <FilterList([<Datafile('another_file.csv')>, <Datafile('my_file.csv')>, <Datafile('your_file.txt')>])>
+
+The ordering can also be carried out in reverse (i.e. descending order) by passing ``reverse=True`` as a second argument
+to the ``order_by`` method.
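
The type-first, interface-second lookup and the ``order_by`` behaviour documented in ``filter_containers.rst`` above could look roughly like this. It is a sketch under assumptions, not the SDK's implementation: every class, function and table name is hypothetical, and only a handful of filter actions are included.

```python
import numbers
from collections.abc import Iterable
from dataclasses import dataclass

# Filters keyed by exact type; these are consulted first.
TYPE_FILTERS = {
    bool: {"is": lambda a, v: a is v, "is_not": lambda a, v: a is not v},
    str: {
        "equals": lambda a, v: a == v,
        "contains": lambda a, v: v in a,
        "ends_with": lambda a, v: a.endswith(v),
    },
}

# Filters keyed by interface (duck-type); used only if the exact type is not listed above.
INTERFACE_FILTERS = [
    (numbers.Number, {"equals": lambda a, v: a == v, "lt": lambda a, v: a < v, "gte": lambda a, v: a >= v}),
    (Iterable, {"equals": lambda a, v: a == v, "contains": lambda a, v: v in a}),
]


def get_filter(attribute, action_name):
    """Return the filter for an attribute, preferring its exact type over its interfaces."""
    if type(attribute) in TYPE_FILTERS:
        return TYPE_FILTERS[type(attribute)][action_name]
    for interface, filters in INTERFACE_FILTERS:
        if isinstance(attribute, interface):
            return filters[action_name]
    raise TypeError(f"No filters are defined for attributes of type {type(attribute).__name__}")


class _FilterMixin:
    def filter(self, filter_name, filter_value):
        attribute_name, action_name = filter_name.rsplit("__", 1)
        return type(self)(
            item
            for item in self
            if get_filter(getattr(item, attribute_name), action_name)(getattr(item, attribute_name), filter_value)
        )

    def order_by(self, attribute_name, reverse=False):
        # Ordering always produces a list-based container, even when starting from a set.
        ordered = sorted(self, key=lambda item: getattr(item, attribute_name), reverse=reverse)
        return SketchFilterList(ordered)


class SketchFilterSet(_FilterMixin, set):
    pass


class SketchFilterList(_FilterMixin, list):
    pass


@dataclass(frozen=True)
class File:
    """Hypothetical stand-in for a Datafile."""

    name: str
    size: int


files = SketchFilterSet({File("my_file.csv", 3), File("your_file.txt", 1), File("another_file.csv", 2)})

# int is not in TYPE_FILTERS, so the Number interface filters are used for "size".
large = files.filter("size__gte", 2)
by_name = files.order_by("name")
print([f.name for f in by_name])  # ['another_file.csv', 'my_file.csv', 'your_file.txt']
```

Note that ``str`` is itself iterable and ``bool`` is a number, which is why the exact-type table has to be checked first: otherwise those attributes would pick up the more general interface filters.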
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -13,6 +13,10 @@ Not all of Octue's API functionality is implemented in the SDK yet, we're active
    :hidden:
 
    installation
+   datafile
+   dataset
+   filter_containers
+   analysis_objects
    license
    version_history
    bibliography

diff --git a/octue/mixins/__init__.py b/octue/mixins/__init__.py
@@ -1,9 +1,11 @@
 from .base import MixinBase
+from .filterable import Filterable
+from .hashable import Hashable
 from .identifiable import Identifiable
 from .loggable import Loggable
 from .pathable import Pathable
 from .serialisable import Serialisable
 from .taggable import Taggable
 
 
-__all__ = "Identifiable", "Loggable", "MixinBase", "Pathable", "Serialisable", "Taggable"
+__all__ = ("Filterable", "Hashable", "Identifiable", "Loggable", "MixinBase", "Pathable", "Serialisable", "Taggable")