# MLOps Papers

This section lists scientific and industrial papers about ML operationalization published since 2015.

## 2023

  1. Marius Schlegel, Kai-Uwe Sattler. "MLflow2PROV: Extracting Provenance from Machine Learning Experiments", 7th Workshop on Data Management for End-to-End Machine Learning (DEEM@SIGMOD '23), 2023. (GitHub: MLflow2PROV) A minimal provenance-extraction sketch follows this list.
  2. Socio-Technical Anti-Patterns in Building ML-Enabled Software: Insights from Leaders on the Forefront
  3. Data Models for Dataset Drift Controls in Machine Learning With Images: Paper | Code
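
For readers who want the flavor of entry 1: MLflow2PROV derives W3C PROV provenance from MLflow experiment data. Below is a minimal sketch of that idea, assuming MLflow 2.x and the `prov` package; it is not MLflow2PROV's actual model or API, and the `ml` namespace URI is hypothetical.

```python
# Illustrative sketch only -- MLflow2PROV's real model and API differ.
# Walk recorded MLflow runs and emit each as a PROV activity that
# generated a model entity.
from mlflow.tracking import MlflowClient
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ml", "http://example.org/ml#")  # hypothetical namespace

client = MlflowClient()  # reads the local ./mlruns store by default
for exp in client.search_experiments():
    for run in client.search_runs(experiment_ids=[exp.experiment_id]):
        activity = doc.activity(f"ml:run-{run.info.run_id}")
        entity = doc.entity(f"ml:model-{run.info.run_id}")
        doc.wasGeneratedBy(entity, activity)  # model was generated by run

print(doc.get_provn())  # serialize the graph in PROV-N notation
```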

## 2022

  1. Marius Schlegel, Kai-Uwe Sattler. "Management of Machine Learning Lifecycle Artifacts: A Survey", ACM SIGMOD Record Volume 51, Issue 4, 2022.
  2. Tiny-MLOps: a framework for orchestrating ML applications at the far edge of IoT systems
  3. Machine Learning Operations (MLOps): Overview, Definition, and Architecture

## 2021

  1. A software engineering perspective on engineering machine learning systems: State of the art and challenges
    A systematic analysis and summary of the current state of software engineering research for engineering ML systems.
  2. Asset management in machine learning: a survey
    This paper presents a feature-based survey of 17 tools with ML asset management support identified in a systematic search. It gives an overview of these tools’ features for managing the different types of assets used for engineering ML-based systems and performing experiments.
  3. Ease.ML: a lifecycle management system for MLDev and MLOps
    This paper presents a system for managing and automating the entire lifecycle of machine learning application development.
  4. Challenges in deploying machine learning: a survey of case studies
    This survey reviews published reports of deploying machine learning solutions across a variety of use cases, industries, and applications, and extracts practical considerations corresponding to the stages of the machine learning deployment workflow.
  5. Fischer, Lukas, Lisa Ehrlinger, Verena Geist, Rudolf Ramler, Florian Sobiezky, Werner Zellinger, David Brunner, Mohit Kumar, and Bernhard Moser. "AI System Engineering—Key Challenges and Lessons Learned."
  6. A Data Quality-Driven View of MLOps
  7. Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure
  8. Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities
  9. Muralidhar, Nikhil, et al. "Using AntiPatterns to avoid MLOps Mistakes." arXiv preprint arXiv:2107.00079 (2021).
  10. ModelCI-e: Enabling Continual Learning in Deep Learning Serving Systems
    This paper implements a lightweight MLOps plugin, termed ModelCI-e (continuous integration and evolution). It embraces continual learning (CL) and ML deployment techniques, providing end-to-end support for model updating and validation without serving-engine customization.
  11. Hopkins, Aspen, and Serena Booth. "Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development." (2021).

## 2020

  1. Adoption and effects of software engineering best practices in machine learning
    This paper aims to empirically determine the state of the art in how teams develop, deploy and maintain software with ML components.
  2. A viz recommendation system: ML lifecycle at Tableau
    This paper covers Tableau's research and development effort on the ML models behind its visualization recommendations, especially in the areas of model life-cycle management, deployment, and monitoring.
  3. Karlaš, Bojan, Matteo Interlandi, Cedric Renggli, Wentao Wu, Ce Zhang, Deepak Mukunthu Iyappan Babu, Jordan Edwards, Chris Lauren, Andy Xu, and Markus Weimer. "Building continuous integration services for machine learning." In Proceedings of the 26th ACM SIGKDD, 2020.
    This paper presents a CI system for ML that integrates seamlessly with existing ML development tools.
  4. CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking
    This paper presents CodeReef, an open-source platform for sharing all the components necessary to enable cross-platform MLOps (MLSysOps), i.e., automating the deployment of ML models across diverse systems in the most efficient way.
  5. Common problems with creating machine learning pipelines from existing code
    This workshop paper shares common problems observed in industry when developing machine learning pipelines.
  6. Data engineering for data analytics: a classification of the issues and case studies
    This paper provides a description and classification of data engineering tasks (such as acquiring, understanding, cleaning, and preparing the data) into high-level groups, namely data organization, data quality, and feature engineering.
  7. DevOps for AI - challenges in development of AI-enabled applications
    This paper points out the challenges in the development of complex systems that include ML components, and discusses possible solutions driven by the combination of DevOps and ML workflow processes. Industrial cases are presented to illustrate these challenges and the possible solutions.
  8. Developments in MLflow: a system to accelerate the machine learning lifecycle
    This paper discusses user feedback collected since MLflow was launched in 2018, as well as three major features introduced in response to this feedback. A minimal sketch of MLflow's core tracking workflow follows this list.
  9. Engineering AI systems: a research agenda
    This paper presents a research agenda for AI engineering that provides an overview of the key engineering challenges surrounding ML solutions and an overview of open items that need to be addressed by the research community at large.
  10. Explainable machine learning in deployment
    This study explores how organizations view and use explainability for stakeholder consumption.
  11. From what to how: an initial review of publicly available AI ethics tools, methods and research to translate principles into practices
    This paper aims to help close the gap between principles and practice in machine learning by constructing a typology that may help practically minded developers apply ethics at each stage of the ML development pipeline, and to signal to researchers where further work is needed.
  12. Implicit provenance for machine learning artifacts
    This paper presents an approach, called implicit provenance, in which a distributed file system and its APIs are instrumented to capture changes to ML artifacts; combined with file-naming conventions, this allows full lineage to be tracked for TensorFlow/Keras/PyTorch programs without requiring code changes.
  13. Machine learning testing: survey, landscapes and horizons
    This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research.
  14. MLModelCI: an automatic cloud platform for efficient MLaaS
    This paper presents MLModelCI, a one-step platform for efficient machine learning (ML) services that leverages DevOps techniques to optimize, test, and manage models. It also containerizes and deploys these optimized and validated models as cloud services.
  15. Monitoring and explainability of models in production
    This paper discusses the challenges to successfully implementing solutions in key areas (such as model performance and data monitoring, and detecting outliers and data drift using statistical techniques), with some recent examples of production-ready solutions using open-source tools.
  16. Principles and practice of explainable machine learning
    This paper focuses on data-driven methods (machine learning and pattern recognition models in particular) to survey and distill the literature's results and observations on a central challenge: how do we understand the decisions suggested by these systems so that we can trust them?
  17. sensAI: fast ConvNets serving on live data via class parallelism
    This paper presents sensAI, a novel and generic approach to faster inference on a single data item that decomposes a single CNN into disconnected subnets and achieves decent serving accuracy with negligible communication overhead (one float value).
  18. Software engineering for artificial intelligence and machine learning software: a systematic literature review
    This study aims to investigate how software engineering (SE) has been applied in the development of AI/ML systems, to identify applicable challenges and practices, and to determine whether they meet the needs of professionals.
  19. Software engineering patterns for machine learning applications (SEP4MLA)
    From 33 ML patterns, this paper describes three major ML architecture patterns and one ML design pattern in the standard pattern format so that practitioners can (re)use them in their contexts. Go to part 1 or part 2
  20. Simulating performance of ML systems with offline profiling
    This paper advocates that simulation based on offline profiling is a promising approach to better understand and improve complex ML systems, and proposes an approach that uses operation-level profiling and dataflow-based simulation to provide a unified and automated solution for all frameworks and ML models.
  21. Towards automating the AI operations lifecycle
    This paper presents a set of enabling technologies that can be used to increase the level of automation in AI operations, thus lowering the human effort required.
  22. Towards CRISP-ML(Q): a machine learning process model with quality assurance methodology
    This paper proposes a process model for the development of machine learning applications that guides machine learning practitioners and project organizations from industry and academia with a checklist of tasks that spans the complete project life-cycle.
  23. Towards distribution transparency for supervised ML with oblivious training functions
    This paper introduces the distribution-oblivious training function as an abstraction for ML development in Python, whereby developers can reuse the same training function when running a notebook on a laptop or performing scale-out hyperparameter search and distributed training on clusters.
  24. Towards ML engineering: a brief history of TensorFlow Extended (TFX)
    This paper gives a whirlwind tour of Sibyl and TensorFlow Extended (TFX), two successive end-to-end ML platforms at Alphabet. It also shares the lessons learned from over a decade of applied ML built on these platforms, and explains both their similarities and their differences.
  25. Siebert, Julien, et al. "Towards guidelines for assessing qualities of machine learning systems." International Conference on the Quality of Information and Communications Technology. Springer, Cham, 2020.
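
As background for entry 8 ("Developments in MLflow"), here is a minimal sketch of MLflow's core tracking workflow, on top of which the features discussed in the paper were added. APIs vary slightly across MLflow versions, and the parameter, metric, and file names are illustrative.

```python
# Basic MLflow experiment tracking: the core workflow the paper builds on.
import mlflow

with open("model_summary.txt", "w") as f:  # illustrative artifact file
    f.write("example artifact\n")

with mlflow.start_run():                      # records to ./mlruns by default
    mlflow.log_param("learning_rate", 0.01)   # hyperparameter going in
    mlflow.log_metric("val_accuracy", 0.93)   # evaluation result coming out
    mlflow.log_artifact("model_summary.txt")  # attach a file to the run
```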

## 2019

  1. Assuring the machine learning lifecycle: desiderata, methods, and challenges
    This paper provides a comprehensive survey of the state-of-the-art in the assurance of ML, i.e., in the generation of evidence that ML is sufficiently safe for its intended use.
  2. Continuous integration of machine learning models with ease.ml/ci: towards a rigorous yet practical treatment
    This paper presents ease.ml/ci, a continuous integration system for machine learning to provide rigorous guarantees with a practical amount of labeling effort.
  3. Challenges in the deployment and operation of machine learning in practice
    In this work, the authors aim to systematically elicit the challenges in deployment and operation in order to enable broader practical dissemination of machine learning applications.
  4. Overton: a data system for monitoring and improving machine-learned products
    This paper describes a system called Overton, whose main design goal is to support engineers in building, monitoring, and improving production machine learning systems.
  5. Studying software engineering patterns for designing machine learning systems
    This paper collects good/bad software engineering design patterns for ML techniques to provide developers with a comprehensive classification of such patterns.
  6. Towards automated ML model monitoring: measure, improve and quantify data quality
    This paper focuses on the emerging challenge of automating the operation of deployed ML applications, especially with respect to monitoring the quality of their input data. A generic drift-check sketch follows this list.
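
Entry 6 concerns automating the monitoring of input-data quality. As a generic baseline for the kind of statistical check involved (a common technique, not the paper's method), a two-sample Kolmogorov-Smirnov test can flag distribution drift between training-time and live feature values; the data and threshold below are synthetic.

```python
# Generic drift check: compare a live feature sample against the training
# reference with a two-sample KS test. A common baseline, not the paper's method.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time values
live = rng.normal(loc=0.3, scale=1.0, size=5_000)       # shifted production values

stat, p_value = ks_2samp(reference, live)
if p_value < 0.01:  # the alert threshold is a deployment-specific choice
    print(f"possible drift detected (KS={stat:.3f}, p={p_value:.2e})")
```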

## 2018

  1. A systems perspective to reproducibility in production machine learning domain
    This paper presents a system that enables ML experts to track and reproduce ML models and pipelines in production.
  2. Building a reproducible machine learning pipeline
    This paper discusses some problems encountered while building a variety of machine learning models, and subsequently describes a framework to tackle the problem of model reproducibility.
  3. On challenges in machine learning model management
    This paper discusses a selection of ML use cases, develops an overview over conceptual, engineering, and data-processing related challenges arising in the management of the corresponding ML models, and points out future research directions.
  4. Ease.ml in action: towards multi-tenant declarative learning services
    This demo paper presents the design principles of ease.ml, highlights the implementation of its key components, and showcases how ease.ml can help ease machine learning tasks that often perplex even experienced users.

## 2017

  1. Clipper: a low-latency online prediction serving system
    This paper introduces Clipper, a general-purpose low-latency prediction serving system that aims to simplify model deployment across frameworks and applications, reduce prediction latency, and improve prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. A drastically simplified serving sketch follows this list.
  2. Ease.ml: towards multi-tenant resource sharing for machine learning workloads
    This paper presents ease.ml, a declarative machine learning service platform.
  3. Data management challenges in production machine learning
    This paper discusses data-management issues that arise in the context of machine learning pipelines deployed in production.
  4. TFX: A TensorFlow-based production-scale machine learning platform
    This paper presents TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform implemented at Google to reduce the time to production from the order of months to weeks, while providing platform stability that minimizes disruptions.
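
Clipper (entry 1) sits as a serving layer between applications and ML frameworks. The sketch below is a heavily simplified stand-in for that pattern using plain Flask; nothing in it is Clipper's actual API, and `model_predict` is a hypothetical placeholder for a real model.

```python
# Drastically simplified prediction-serving layer (illustrative; Clipper's
# real architecture adds caching, batching, and model selection).
from flask import Flask, jsonify, request

app = Flask(__name__)

def model_predict(features):
    # Hypothetical stand-in; Clipper would dispatch to framework containers.
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    return jsonify({"prediction": model_predict(features)})

if __name__ == "__main__":
    app.run(port=8080)  # POST {"features": [1.0, 2.0]} to /predict
```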

## 2016

  1. ModelDB: a system for machine learning model management
    This paper describes ModelDB, a novel end-to-end system for the management of machine learning models.
  2. Scaling Machine Learning as a Service
    This paper presents the scalable MLaaS platform built for Uber that operates globally. It focuses on several challenges, including: (i) how to scale feature computation for many machine learning use cases; (ii) how to build accurate models using global data; and (iii) how to enable scalable model deployment and real-time serving for many models across multiple data centers.
  3. What’s your ML test score? A rubric for ML production systems
    This paper presents an ML Test Score rubric based on a set of actionable tests to help quantify a host of issues not found in small toy examples or even large offline research experiments. One illustrative check of this kind follows this list.
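
The rubric in entry 3 scores teams on concrete, automatable checks. Below is one hedged example of the kind of test it calls for (the dataset, model, and margin are arbitrary illustrations, not from the paper): a candidate model must clearly beat a trivial baseline before promotion.

```python
# One "ML Test Score"-style check (illustrative, not from the paper): refuse
# to promote a candidate model that does not clearly beat a trivial baseline.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
candidate = LogisticRegression(max_iter=5_000).fit(X_tr, y_tr)

margin = 0.05  # arbitrary promotion margin
assert candidate.score(X_te, y_te) > baseline.score(X_te, y_te) + margin, \
    "candidate does not clearly beat the baseline; do not promote"
```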

## 2015

  1. Hidden technical debt in machine learning systems
    This paper explores several ML-specific risk factors to account for in system design.

## Additional Resources

  1. Adversarial machine learning reading list
  2. Workshop at ICML 2020: "Challenges in Deploying and Monitoring Machine Learning Systems" (Accepted Papers)
  3. Workshop on MLOps Systems (MLSys)
  4. A survey on concept drift adaptation
  5. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
  6. Conversational Applications and Natural Language Understanding Services at Scale. Minh Tue Vo Thanh and Vijay Ramakrishnan.
  7. Efficient Scheduling of DNN Training on Multitenant Clusters. Deepak Narayanan, Keshav Santhanam, Amar Phanishayee and Matei Zaharia.
  8. MLBox: Towards Reproducible ML. Victor Bittorf, Xinyuan Huang, Peter Mattson, Debojyoti Dutta, David Aronchick, Emad Barsoum, Sarah Bird, Sergey Serebryakov, Natalia Vassilieva, Tom St. John, Grigori Fursin, Srini Bala, Sivanagaraju Yarramaneni, Alka Roy, David Kanter and Elvira Dzhuraeva.
  9. MLPM: Machine Learning Package Manager. Xiaozhe Yao.
  10. Tools for machine learning experiment management. Vlad Velici and Adam Prügel-Bennett.
  11. Towards split learning at scale: System design. Iker Rodríguez, Eduardo Muñagorri, Alberto Roman, Abhishek Singh, Praneeth Vepakomma and Ramesh Raskar.
  12. Towards complaint-driven ML workflow debugging (2020).
  13. PerfGuard: Deploying ML-for-Systems without Performance Regressions.
  14. Addressing the Memory Bottleneck in AI Model-Training
  15. Reliance on Metrics is a Fundamental Challenge for AI
  16. Teaching Software Engineering for AI-Enabled Systems