
Releases: intel/dffml

0.4.0 Alpha Release

18 Feb 16:07

Friends, Romans, countrymen, lend me your pip install -U dffml commands.

It’s been a long 10 months since our last release. A lot has changed in the world of DFFML. Looking back it’s clear how much progress we’ve made. The work that was done to get us to 0.4.0 has really polished the project. There are no doubt still kinks to work out, but we’ve come a long way.

If you’re new to DFFML, see the Installation and Quickstart documents to get started, then come back here to try out the new features.
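
If you want a taste of what the Quickstart covers before diving in, the high level API lets you train and use a model in a few lines. Below is a minimal sketch using the built-in SLRModel; the exact import paths and the directory config parameter name are assumptions here, so double-check them against the Quickstart document.

```python
import asyncio

from dffml import Feature, Features, SLRModel, train, accuracy, predict


async def main():
    # Simple linear regression: predict "Salary" from "Years".
    model = SLRModel(
        features=Features(Feature("Years", int, 1)),
        predict=Feature("Salary", int, 1),
        directory="slr-model",  # assumption: where the trained model is stored
    )
    # The high level API accepts plain dicts as records.
    await train(
        model,
        {"Years": 0, "Salary": 10},
        {"Years": 1, "Salary": 20},
        {"Years": 2, "Salary": 30},
    )
    print("Accuracy:", await accuracy(model, {"Years": 3, "Salary": 40}))
    # predict() is an async generator yielding records with predictions filled in.
    async for record in predict(model, {"Years": 4}):
        print(record.export())


asyncio.run(main())
```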

Highlights

We have a ton of cool new features in this release. You can see the CHANGELOG for the full details. We’ll cover some highlights here.

Custom Neural Networks

Saksham, one of our GSoC 2020 students, implemented PyTorch based models. The generic model allows users to use JSON or YAML files to define the layers they want in a neural network.
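
The gist is that you describe the network declaratively instead of writing PyTorch code. Below is a purely illustrative sketch of that kind of layer description, written as the Python equivalent of a JSON/YAML file; the actual key names, accepted layer types, and model entrypoint are whatever dffml-model-pytorch documents, so treat these as assumptions.

```python
import json

# Hypothetical layer description; key names and layer types are illustrative
# only, the real schema is documented with the dffml-model-pytorch plugin.
network = {
    "linear1": {"layer_type": "Linear", "in_features": 4, "out_features": 16},
    "relu1": {"layer_type": "ReLU"},
    "linear2": {"layer_type": "Linear", "in_features": 16, "out_features": 3},
}

# The generic PyTorch model would read a structure like this from a JSON or
# YAML file passed to it as configuration.
with open("network.json", "w") as network_file:
    json.dump(network, network_file, indent=4)
```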

Image classification

Saksham also exposed some PyTorch pre-trained models which are very useful for working with images. The full list can be found under dffml-model-pytorch on the model plugins page.

Natural Language Processing (NLP) Models and Operations

Himanshu, one of our GSoC 2020 students, implemented Natural Language Processing (NLP) models and operations. The spaCy-based models can be found under dffml-model-spacy on the model plugins page. The operations can be found under dffml-operations-nlp on the operations plugins page. We had a slight hangup with the release of the Transformers-based models; we hope to get those out to you all as soon as possible.

Continuous Deployment Examples

Aghin, one of our GSoC 2020 students, wrote operations and tutorials that allow users to receive webhooks from GitHub and redeploy their containerized models and operations whenever their code is updated.

New Models

In addition to the models mentioned above, many more were added. We now have over 115 models! Check out the model plugins page to see them all.

Documentation Testing with Sphinx consoletest extension

We developed a Sphinx extension that allows us to test code-block:: console directives (and others) in our documentation. This serves as integration testing and documentation validation. The Documenting a Model tutorial was written to explain how this can be used to write documentation for your models in the models’ docstrings.
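
In practice this means the console snippets already sitting in your docstrings become tests. Here is a minimal sketch of a docstring the extension can check, assuming a hypothetical model registered as myslr; the flags mirror the usual dffml train invocation but should be treated as illustrative.

```python
"""
Usage
-----

Train the model from the command line:

.. code-block:: console

    $ dffml train -model myslr -model-features Years:int:1 -model-predict Salary:int:1 -sources f=csv -source-filename training.csv

With the consoletest extension enabled, building the docs runs this command,
so the example fails the build instead of silently going stale.
"""
```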

Road to Beta

Things are looking up with regards to the path to our beta release.

We’re going to start deciding what APIs are stable come the 0.5.0 beta release.

Up until then, including now, everything is subject to change at any time as we learn more about our problem space and how best to approach it architecturally.

We have several major things on deck that we want to get done before we declare beta.

AutoML

We now have a lot of models to choose from and are at the stage where we need models to help us choose our models! We’re going to have AutoML in the beta release. It will pick the best model with the best hyperparameters for the job.

Accuracy Scorers

Sudhanshu has been working on this since June 2020. He’s done a major refactor of the codebase to take the accuracy methods out of all the models and move them into .score() methods on a new AccuracyScorer class. This will let users more easily compare the accuracy of models against each other.

Machine Learning support for videos

We still need to decide how we’re going to support videos. DFFML’s asynchronous approach will hopefully make it convenient to use with live video streams.

Model directories auto stored into archives or remotely

We’re going to implement automatic packing and unpacking of the directories that models are saved to and loaded from, into and out of archives such as Zip and Tar. We’ll also implement plugins to push and pull them from remote storage. This will make it convenient to train models in one location and deploy them in another.

Remote execution

The HTTP service already allows users to access all the DFFML command line and Python APIs over HTTP. We are going to integrate the high level API with the HTTP service. A remote execution plugin type will let users install only the base package and whatever remote execution plugin they want on a machine. Users will then be able to run the HTTP service on a machine with all the needed ML packages installed, and their Python API calls will run on the HTTP service. This will be helpful in cases where you have multiple architectures, one of which doesn’t have ML packages compiled for it (Edge).

Config files in place of command line parameters

To stop users from having to copy and paste so many command line parameters across command invocations, we’ll be implementing support for config files. YAML, JSON, etc. will all be usable for storing what would otherwise be command line arguments.

Command line to config file to Python API to HTTP API auto translation

Since DFFML offers consistent APIs across its various interfaces, we will be able to implement an auto translator to convert one API to another. This means that if you have a DFFML command line invocation that you want to turn into a Python API call, the translator will take your CLI command and output the equivalent DFFML Python API calls.
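
To give a sense of the mapping the translator would automate, here is the same training run written by hand both ways: first as a CLI invocation, then as the high level Python calls it corresponds to. The slr model name, the flags, and the directory parameter name are illustrative assumptions, not the translator's output.

```python
# CLI form (illustrative flags):
#
#   dffml train \
#       -model slr \
#       -model-features Years:int:1 \
#       -model-predict Salary:int:1 \
#       -model-directory slr-model \
#       -sources f=csv \
#       -source-filename training.csv
#
# Equivalent Python API form the translator would emit:
import asyncio

from dffml import CSVSource, Feature, Features, SLRModel, train

model = SLRModel(
    features=Features(Feature("Years", int, 1)),
    predict=Feature("Salary", int, 1),
    directory="slr-model",  # assumption: 0.4.0 parameter name
)
asyncio.run(train(model, CSVSource(filename="training.csv")))
```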

DataFlows with operation implementations in multiple languages

Our first target is to integrate wasmer to help us run WebAssembly binaries. We’ll later expand this to having multiple Operation Implementation networks, which will allow users to create DataFlows that run code written in multiple languages, for example Python, Rust, and Golang. This will allow users to leverage their favorite libraries to get the job done without worrying about them being in different languages.

Premade data cleanup DataFlows

We’ll have a set of out of the box data cleanup DataFlows that users can use before passing data to models. These will do common data cleanup tasks such as removing horrendous outliers.

Continuous deployment tutorials

We will expand the tutorials released with 0.4.0 to include deployment behind reverse proxies for multiple projects, including how to set up encryption and authentication in a painless and maintainable way.

Pandas DataFrame source

This is a small convenience that will probably improve usability. This change will allow us to pass DataFrame objects to the train/accuracy/predict functions.
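
Here is a sketch of the convenience this is aiming for; the actual calls are left commented out because they are planned behavior, not something 0.4.0 supports.

```python
import pandas as pd

# Hypothetical: once the DataFrame source lands, a DataFrame should be usable
# anywhere a source is accepted today.
df = pd.DataFrame({"Years": [0, 1, 2, 3], "Salary": [10, 20, 30, 40]})

# Planned usage (not functional in 0.4.0):
# await train(model, df)
# predictions = [record async for record in predict(model, df)]
```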

Collaborations

We’re exploring participation with the OpenSSF Identifying Security Threats working group. Their effort is similar to shouldi and we might be able to contribute some of what we’ve done there.

We’re exploring another use of DFFML internally at Intel. This time leveraging DataFlows more so than Machine Learning.

Thanks

Since 0.3.7 we’ve seen 35,203 insertions(+) and 10,423 deletions(-) across 757 files.

You all have done amazing stuff!! Great job and keep up the good work!

Aadarsh Singh

Aghin Shah Alin

Aitik Gupta

Geetansh Saxena

Hashim

Himanshu Tripathi

Ichisada Shioko

Jan Keromnes

Justin Moore

Naeem Khoshnevis

Nitesh Yadav

Oliver O’Brien

Saksham Arora

Shaurya Puri

Shivam Singh

Sudeep Sidhu

Sudhanshu kumar

Sudharsana K J L

Yash Lamba

Yash Varshney

0.3.7 Alpha Release

14 Apr 23:04

[0.3.7] - 2020-04-14

Added

  • IO operations demo and literal_eval operation.
  • Python prompts >>> can now be enabled or disabled for easy copying of code into interactive sessions.
  • Whitespace check now checks .rst and .md files too.
  • GetMulti operation which gets all Inputs of a given definition
  • Python usage example for LogisticRegression and its related tests.
  • Support for async generator operations
  • Example CLI commands and Python code for SLRModel
  • save function in high level API to quickly save all given records to a
    source
  • Ability to configure sources and models for HTTP API from command line when
    starting server
  • Documentation page for command line usage of HTTP API
  • Usage of the HTTP API in the quickstart to use the trained model

Changed

  • Renamed "arg" to "plugin".
  • CSV source sorts feature names within headers when saving
  • Moved HTTP service testing code to HTTP service util.testing

Fixed

  • Exporting plugins
  • Issue parsing string values when using the dataflow run command and
    specifying extra inputs.

Removed

  • Unused imports

0.3.6 Alpha Release

04 Apr 21:41

[0.3.6] - 2020-04-04

Added

  • Operations for taking input from the user AcceptUserInput and for printing the output print_output
  • Hugging Face Transformers TensorFlow-based NER models.
  • PNG ConfigLoader for reading images as arrays to predict using MNIST trained models
  • Docstrings and doctestable examples to record.py.
  • Inputs can be validated using operations
    • validate parameter in Input takes Operation.instance_name
  • New db source can utilize any database that inherits from BaseDatabase
  • Logistic Regression with SAG optimizer
  • shouldi got an operation to run cargo-audit on rust code.
  • Moved all the downloads to tests/downloads to speed up the CI tests.
  • Test TensorFlow DNNEstimator documentation examples in CI
  • Add Python code for TensorFlow DNNEstimator
  • Ability to run a subflow as if it were an operation using the
    dffml.dataflow.run operation.
  • Support for operations without inputs.
  • Partial doctestable examples to features.py
  • Doctestable examples for BaseSource
  • Instructions for setting up debugging environment in VSCode

Fixed

  • New model tutorial mentions file paths that should be edited.
  • DataFlow is no longer a dataclass to prevent it from being exported
    incorrectly.
  • operations_parameter_set_pairs moved to MemoryOrchestratorContext
  • Ignore generated files in docs/plugins/
  • Treat "~" as the home directory rather than a literal
  • Windows support by selecting asyncio.ProactorEventLoop and not using
    asyncio.FastChildWatcher.
  • Moved SLR into the main dffml package and removed scratch:slr.

Changed

  • Refactor model/tensorflow

0.3.5 Alpha Release

10 Mar 23:01

[0.3.5] - 2020-03-10

Added

  • Parent flows can now forward inputs to active contexts of subflows.
    • forward parameter in DataFlow
    • subflow in OperationImplementationContext
  • Documentation on writing examples and running doctests
  • Doctestable Examples to high-level API.
  • Shouldi got an operation to run npm-audit on JavaScript code
  • Docstrings and doctestable examples for record.py (features and evaluated)
  • Simplified model API with SimpleModel
  • Documentation on how DataFlows work conceptually.
  • Style guide now contains information on class, variable, and function naming.

Changed

  • Restructured contributing documentation
  • Use randomly generated data for scikit tests
  • Change Core to Official to clarify who maintains each plugin
  • Name of output of unsupervised model from "Prediction" to "cluster"
  • Test scikit LR documentation examples in CI
  • Create a fresh archive of the git repo for release instead of cleaning
    existing repo with git clean for development service release command.
  • Simplified SLR tests for scratch model
  • Test TensorFlow DNNClassifier documentation examples in CI
  • config directories and files associated with ConfigLoaders have been renamed
    to configloader.
  • Model config directory parameters are now pathlib.Path objects
  • New model tutorial and skel/model use the simplified model API.

0.3.4 Alpha Release

29 Feb 01:07

[0.3.4] - 2020-02-28

Added

  • TensorFlow Hub NLP models.
  • Notes on development dependencies in setup.py files to codebase notes.
  • Test for cached_download
  • dffml.util.net.cached_download_unpack_archive to run a cached download and
    unpack the archive, very useful for testing. Documented on the Networking
    Helpers API docs page.
  • Directions on how to read the CI under the Git and GitHub page of the
    contributing documentation.
  • HTTP API
    • Static file serving from a directory with -static
    • api.js file serving with the -js flag
    • Docs page for JavaScript example
  • shouldi got an operation to run golangci-lint on Golang code

Fixed

  • Port assignment for the HTTP API via the -port flag

Changed

  • repo/Repo to record/Record
  • Definitions with a spec can use the subspec parameter to declare that they
    are a list or a dict where the values are of the spec type. Rather than the
    list or dict itself being of the spec type.
  • Fixed the URL mentioned in example to configure a model.
  • Sphinx doctests are now run in the CI in the DOCS task.
  • Lint JavaScript files with js-beautify and enforce with CI

Removed

  • Unused imports

0.3.3 Alpha Release

11 Feb 07:48

[0.3.3] - 2020-02-10

Added

  • Moved from TensorFlow 1 to TensorFlow 2.
  • IDX Sources to read binary data files and train models on MNIST Dataset
  • scikit models
    • Clusterers
      • KMeans
      • Birch
      • MiniBatchKMeans
      • AffinityPropagation
      • MeanShift
      • SpectralClustering
      • AgglomerativeClustering
      • OPTICS
  • allowempty added to source config parameters.
  • Quickstart document to show how to use models from Python.
  • The latest release of the documentation now includes a link to the
    documentation for the master branch (on GitHub pages).
  • Virtual environment, GitPod, and Docker development environment setup notes to
    the CONTRIBUTING.md file.
  • Changelog now included in documentation website.
  • Database abstraction dffml.db
    • SQLite connector
    • MySQL connector
  • Documented style for imports.
  • Documented use of numpy docstrings.
  • Inputs can now be sanitized using a function passed via the validate parameter
  • Helper utilities to take callables with numpy style docstrings and
    create config classes out of them using make_config.
  • File listing endpoint to HTTP service.
  • When an operation throws an exception the name of the instance and the
    parameters it was executed with will be thrown via an OperationException.
  • Network utilities to perform cached downloads with hash validation.
  • Development service got a new command, which can retrieve an argument passed
    to setuptools setup function within a setup.py file.

Changed

  • All instances of src_url changed to key.
  • readonly parameter in source config is now changed to readwrite.
  • predict parameter of all model config classes has been changed from str to Feature.
  • Defining features on the command line no longer requires that defined features
    be prefixed with def:
  • The model predict operation will now raise an exception if the model it is
    passed via its config is a class rather than an instance.
  • entry_point and friends have been renamed to entrypoint.
  • Use FastChildWatcher when run via the CLI to prevent BlockingIOErrors.
  • TensorFlow based neural network classifier had the classification parameter
    in its config changed to predict.
  • SciKit models use make_config_numpy.
  • Predictions in repos are now dictionaries.
  • All instances of label changed to tag
  • Subclasses of BaseConfigurable will now auto instantiate their respective
    config classes using kwargs if the config argument isn't given and keyword
    arguments are.
  • The quickstart documentation was improved as well as the structure of docs.

Fixed

  • CONTRIBUTING.md had -e in the wrong place in the getting setup section.
  • Since moving to auto args() and config(), BaseConfigurable no longer
    produces odd typenames in conjunction with docs.py.
  • Autoconvert Definitions with spec into their spec

Removed

  • The model predict operation erroneously had a msg parameter in its config.
  • Unused imports identified by deepsource.io
  • Evaluation code from feature.py file as well as tests for those evaluations.

0.3.2 Alpha Release

03 Jan 22:19

[0.3.2] - 2020-01-03

Added

  • scikit models
    • Classifiers
      • LogisticRegression
      • GradientBoostingClassifier
      • BernoulliNB
      • ExtraTreesClassifier
      • BaggingClassifier
      • LinearDiscriminantAnalysis
      • MultinomialNB
    • Regressors
      • ElasticNet
      • BayesianRidge
      • Lasso
      • ARDRegression
      • RANSACRegressor
      • DecisionTreeRegressor
      • GaussianProcessRegressor
      • OrthogonalMatchingPursuit
      • Lars
      • Ridge
  • AsyncExitStackTestCase which instantiates and enters async and non-async
    contextlib exit stacks. Provides temporary file creation.
  • Automatic releases to PyPI via GitHub Actions
  • Automatic documentation deployment to GitHub Pages
  • Function to create a config class dynamically, analogous to make_dataclass

Changed

  • CLI tests and integration tests derive from AsyncExitStackTestCase
  • SciKit models now use the auto args and config methods.

Fixed

  • Correctly identify when functions decorated with op use self to reference
    the OperationImplementationContext.
  • Negative values are correctly parsed when input via the command line.
  • Do not lowercase development mode install location when reporting version.

0.3.1 Alpha Release

12 Dec 08:47

[0.3.1] - 2019-12-12

Added

  • Integration tests using the command line interface.

Changed

  • Features were moved from ModelContext to ModelConfig
  • CI is now run via GitHub Actions
  • CI testing script is now verbose
  • args and config methods of all classes no longer require implementation.
    BaseConfigurable handles exporting of arguments and creation of config objects
    for each class based off of the CONFIG property of that class. The CONFIG
    property is a class which has been decorated with dffml.base.config to make it
    a dataclass.
  • Speed up development service install of all plugins in development mode
  • Speed up named plugin load times

Fixed

  • DataFlows with multiple possibilities for a source for an input, now correctly
    look through all possible sources instead of just the first one.
  • DataFlow MemoryRedundancyCheckerContext was using all inputs in an input set
    and all their ancestors to check redundancy (a hold over from pre uid days).
    It now correctly only uses the inputs in the parameter set. This fixes a major
    performance issue.
  • MySQL packaging issue.
  • Develop service running one-off operations correctly json-loads dict types.
  • Operations with configs can be run via the development service
  • JSON dumping numpy int* and float* caused crash on dump.
  • CSV source always loads src_urls as strings.

Removed

  • CLI command operations removed in favor of dataflow run
  • Duplicate dataflow diagram code from development service

0.3.0 Alpha Release

26 Oct 20:52

[0.3.0] - 2019-10-26

Added

  • Real DataFlows, see operations tutorial and usage examples
  • Async helper concurrently nocancel optional keyword argument which, if set,
    is a set of tasks not to cancel when the concurrently execution loop completes.
  • FileSourceTest has a test_label method which checks that a FileSource knows
    how to properly load and save repos under a given label.
  • Test case for Merge CLI command
  • Repo.feature method to select a single piece of feature data within a repo.
  • Dev service to help with hacking on DFFML and to create models from templates
    in the skel/ directory.
  • Classification type parameter to DNNClassifierModelConfig to specify the data
    type of given classification options.
  • util.cli CMD classes have their argparse description set to their docstring.
  • util.cli CMD classes can specify the formatter class used in
    argparse.ArgumentParser via the CLI_FORMATTER_CLASS property.
  • Skeleton for service creation was added
  • Simple Linear Regression model from scratch
  • Scikit Linear Regression model
  • Community link in CONTRIBUTING.md.
  • Explained three main parts of DFFML on docs homepage
  • Documentation on how to use ML models on docs Models plugin page.
  • Mailing list info
  • Issue template for questions
  • Multiple Scikit Models with dynamic config
  • Entrypoint listing command to development service to aid in debugging issues
    with entrypoints.
  • HTTP API service to enable interacting with DFFML over HTTP. Currently
    includes APIs for configuring and using Sources and Models.
  • MySQL protocol source to work with data from a MySQL protocol compatible db
  • shouldi example got a bandit operation which tells users not to install if
    there are more than 5 issues of high severity and confidence.
  • dev service got the ability to run a single operation in a standalone fashion.
  • About page to docs.
  • Tensorflow DNNEstimator based regression model.

Changed

  • feature/codesec became its own branch, binsec
  • BaseOrchestratorContext run_operations strict now defaults to true. With
    strict as true, errors will be raised and not just logged.
  • MemoryInputNetworkContext got an sadd method which is shorthand for creating
    a MemoryInputSet with a StringInputSetContext.
  • MemoryOrchestrator basic_config method takes list of operations and optional
    config for them.
  • shouldi example uses updated MemoryOrchestrator.basic_config method and
    includes more explanation in comments.
  • CSVSource allows for setting the Repo's src_url from a csv column
  • util Entrypoint defines a new class for each loaded class and sets the
    ENTRY_POINT_LABEL parameter within the newly defined class.
  • Tensorflow model removed usages of repo.classifications methods.
  • Entrypoint prints traceback of loaded classes to standard error if they fail
    to load.
  • Updated Tensorflow model README.md to match functionality of
    DNNClassifierModel.
  • DNNClassifierModel no longer splits data for the user.
  • Update pip in Dockerfile.
  • Restructured documentation
  • Ran black on whole codebase, including all submodules
  • CI style check now checks whole codebase
  • Merged HACKING.md into CONTRIBUTING.md
  • shouldi example runs bandit now in addition to safety
  • The way safety gets called
  • Switched documentation to Read The Docs theme
  • Models yield only a repo object instead of the value and confidence of the
    prediction as well. Models are not responsible for calling the predicted
    method on the repo. This will ease the process of making predict feature
    specific.
  • Updated Tensorflow model README.md to include usage of regression model

Fixed

  • Docs get version from dffml.version.VERSION.
  • FileSource zipfiles are wrapped with TextIOWrapper because CSVSource expects
    the underlying file object to return str instances rather than bytes.
  • FileSourceTest inherits from SourceTest and is used to test json and csv
    sources.
  • A temporary directory is used to replicate mktemp -u functionality so as to
    provide tests using a FileSource with a valid tempfile name.
  • Labels for JSON sources
  • Labels for CSV sources
  • util.cli CMDs correctly set the description of subparsers instead of their
    help; they also accept the CLI_FORMATTER_CLASS property.
  • CSV source now has entry_point decoration
  • JSON source now has entry_point decoration
  • Strict flag in df.memory is now on by default
  • Dynamically created scikit models get config args correctly
  • Renamed DNNClassifierModelContext first init arg from config to features
  • BaseSource now has base_entry_point decoration

Removed

  • Repo objects are no longer classification specific. Their classify,
    classified, and classification methods were removed.

0.2.1 Alpha Release

07 Jun 22:17

[0.2.1] - 2019-06-07

Added

  • Definition spec field to specify a class representative of key value pairs for
    definitions with primitives which are dictionaries
  • Auto generation of documentation for operation implementations, models, and
    sources. Generated docs include information on configuration options and
    inputs and outputs for operation implementations.
  • Async helpers got an aenter_stack method which creates and returns a
    contextlib.AsyncExitStack after entering all the contexts passed to it.
  • Example of how to use Data Flow Facilitator / Orchestrator / Operations by
    writing a Python meta static analysis tool,
    shouldi

Changed

  • OperationImplementation add_label and add_orig_label methods now use op.name
    instead of ENTRY_POINT_ORIG_LABEL and ENTRY_POINT_NAME.
  • Make output specs and remap arguments optional for Operations CLI commands.
  • Feature skeleton project is now operations skeleton project

Fixed

  • MemoryOperationImplementationNetwork instantiates OperationImplementations
    using their withconfig() method.
  • MemorySource now decorated with entry_point
  • MemorySource takes arguments correctly via config_set and config_get
  • skel modules have long_description_content_type set to "text/markdown"
  • Base Orchestrator __aenter__ and __aexit__ methods were moved to the
    Memory Orchestrator because they are specific to that config.
  • Async helper aenter_stack uses inspect.isfunction so it will bind lambdas