Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core): add models and class for pycytominer core #383

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

kenibrewer
Copy link
Member

@kenibrewer kenibrewer commented Mar 25, 2024

Description

This represents initial work toward an experimental core module that is intended to the basis of refactoring pycytominer towards a more modular and object oriented design.

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

📚 Documentation preview 📚: https://pycytominer--383.org.readthedocs.build/en/383/

@kenibrewer
Copy link
Member Author

kenibrewer commented Mar 25, 2024

@d33bs @shntnu @gwaybio @bethac07 I wanted to bring this WIP PR to your attention. It represents the approach I'd like to take for the planned refactor/modularization of pycytominer.

In essence I'd like to create an experimental core module that will serve as the home for the classes and data models that will eventually be used to move the rest of pycytominer functions to being more object-oriented rather than script-oriented. I'd like to do incremental merges into pycytominer main to add additional functionality to these core classes and models. This would allow them to be tested by others in the community and allow us to collect feedback. The caveat we would include is that the core module may be subject to breaking changes while it is still "experimental".

Two python packages that will be at the "core" of core are pydantic and pandera. They provide the ability to do robust type checking of our core, validation of inputs, and will substantially reduce the amount of boilerplate code we will need to write in this refactor.

One additional submodule I'm planning to add to core before beginning to write tests is features. I'm thinking this would include a Features class and a CellprofilerFeatures subclass that would house the default feature names and parsing logic for working with Cellprofiler features.

Perhaps it would be worthwhile to get together sometime and whiteboard a UML class diagram and/or discuss this planned approach more generally.

@gwaybio
Copy link
Member

gwaybio commented Mar 26, 2024

Thanks for your initiative on this @kenibrewer! It's great to see this evolution.

Roadmap question

Can you point me to where this work fits in the CZI EOSS pycytominer roadmap? Would it be in "(4) Add data class for CytoTable input to enable flexible metadata and feature management"? Also, how would this work fit in "(2) Implement Python Fire for CLI optionality"? But maybe this is also separate pre-work that is required prior to hitting the deliverables?

Development question

I'm also a bit weary of the development/release schedule. Given the answer to the previous questions, would the expectation of regular breaking changes negatively impact development for other pycytominer deliverables?

I believe we discussed development in other branches, but that you (or maybe it was @d33bs ) were generally against developing on off-main branches?

My goal is to identify a development/release solution that minimally impacts our users (maybe this is ok since we have conda/pypi) while enabling us to hit other pycytominer milestones without extensive blocking.

Pycytominer chat

I'm in favor of diagramming and more in depth discussion. Should anyone other than the folks you listed be on the call? Maybe @ErinWeisbart and @axiomcura ?

I can facilitate this, and welcome any other ideas.

Thanks again Ken!

@d33bs
Copy link
Member

d33bs commented Mar 26, 2024

Thanks @kenibrewer for opening up the discussion here with some example work to match! Generally I feel that Pandera can be a powerful tool and might serve pycytominer well. At the same time, I wonder a bit about how data vagueness and inference are helpful to the work that takes place in pycytominer.

I'd be 👍 for a meet to discuss this a bit further with others! Diagramming could be really helpful. Maybe a shared HackMD or Excalidraw board could assist (and also be included within the repo where helpful)?

I'll try to be specific about items that I feel might be important to this discussion:

  • How could we quantify user value when it comes to this change? For example, as a user story, this might be written: "As a pycytominer developer, I need Pandera models or schema(s) and related validation steps to avoid or resolve <repeated issues> and as a result <reach higher goal>.". Generally here I also wonder about how we might define scope to make sure we're on target for the EOSS roadmap and future considerations with related topics.
  • How would data within files be considered / included for this effort? My understanding is that Pandera covers in-memory data only, so this work may in-part depend on Pandas-focused data type inferences.
  • Defining data schema for data we do not directly govern could be challenging to update. For example, CellProfiler maintainers may make choices that effect how the schema are designed in pycytominer. CellProfiler output effects pycytominer output anywho, but data schema validation with precision could accelerate differences in negative ways (such as continual need for updates / coupling).
    • As a sidenote: is there an opportunity or existing resource for single-cell profiling feature engineering data schema definitions? Perhaps this could be served through an adjacent repository or organization? I wonder about the inclusion or creation of ontologies here too to help with data definition and understanding (in addition to testing).

@kenibrewer kenibrewer linked an issue Mar 27, 2024 that may be closed by this pull request
2 tasks
@kenibrewer
Copy link
Member Author

kenibrewer commented Mar 27, 2024

I'll respond in more detail later, but here are some brief responses.

Stories

@d33bs Thanks for the suggestion on the stories. I'll add those to the three linked issues on this PR.

Roadmap

This new core module would be in support of nearly all planned milestones:

(2) Release v1.1: Add Fire, refactor institution code. Q1 2025.

Has no impact on ability to deliver CLI interfaces. Addresses a substantial portion of the technical debt planned to be addressed in cyto_utils.

(3) Release v1.2: Add decorators on transformation functions. Q2 2025.

Core represents an alternative (and likely cleaner) approach to attaching a long list of decorators to the main functions (annotate, normalize, etc)

(4) Release v1.3: Overhaul test suite. Q4 2025.

Core would greatly simplify testing

(5) Release v1.4: CytoTable input data type class. Q1 2026.

Core would enable this milestone too.

Breaking Changes

An important clarification on breaking changes. The plan for this approach would enable us to have no breaking changes anywhere in the existing codebase. Initially core would be a standalone set of functionality, but after they are ready, we could migrate existing functions annotate, normalize, etc to use core under the hood. We would keep our existing test suite for those functions to make sure that downstream users don't need to modify their code.

Meeting

Let's schedule a meeting 2-3 weeks from now so I can prepare a more detailed plan and overview of my ideas. Great idea on including @ErinWeisbart and @axiomcura.

@gwaybio
Copy link
Member

gwaybio commented Mar 27, 2024

Thanks @kenibrewer - I'll work on scheduling. I think ~1 month is good lead time.

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 0% with 31 lines in your changes are missing coverage. Please review.

Project coverage is 94.03%. Comparing base (af5a4ef) to head (6ca5421).
Report is 2 commits behind head on main.

Files Patch % Lines
pycytominer/core/profiles.py 0.00% 16 Missing ⚠️
pycytominer/core/operation.py 0.00% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #383      +/-   ##
==========================================
- Coverage   94.96%   94.03%   -0.93%     
==========================================
  Files          56       58       +2     
  Lines        3137     3168      +31     
==========================================
  Hits         2979     2979              
- Misses        158      189      +31     
Flag Coverage Δ
unittests 94.03% <0.00%> (-0.93%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants