
Proposed repository structure #6

Open
SkalskiP opened this issue Nov 29, 2023 · 3 comments

SkalskiP commented Nov 29, 2023

Proposed Code Structure

Every prompting pipeline comes with a prompt_creator and a result_processor. You can instantiate those classes manually, or call the pipeline function and provide the name argument.

from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple
import numpy as np
import supervision as sv


class BasePromptCreator(ABC):
    @abstractmethod
    def create(self, image: np.ndarray, *args, **kwargs) -> Tuple[np.ndarray, sv.Detections]:
        """
        Create a prompt from an image and additional arguments.

        Args:
            image (np.ndarray): The input image.
            *args, **kwargs: Additional arguments.

        Returns:
            Tuple[np.ndarray, sv.Detections]: A tuple containing a processed image and detections.
        """
        pass


class BaseResultProcessor(ABC):
    @abstractmethod
    def process(self, text: str, marks: sv.Detections, *args, **kwargs) -> Dict[str, str]:
        """
        Process the results with given text and detections.

        Args:
            text (str): The input text.
            marks (sv.Detections): Detections to be used in processing.
            *args, **kwargs: Additional arguments.

        Returns:
            Dict[str, str]: Processed results.
        """
        pass


    @abstractmethod
    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections, *args, **kwargs) -> np.ndarray:
        """
        Visualize the results on an image.

        Args:
            text (str): The input text.
            image (np.ndarray): The input image.
            marks (sv.Detections): Detections to be visualized.
            *args, **kwargs: Additional arguments.

        Returns:
            np.ndarray: The image with visualizations.
        """
        pass


class SamPromptCreator(BasePromptCreator):
    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, mask: Optional[np.ndarray] = None) -> Tuple[np.ndarray, sv.Detections]:
        pass


class SamResultProcessor(BaseResultProcessor):
    
    def process(self, text: str, marks: sv.Detections) -> List[str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass


class GroundingDinoPromptCreator(BasePromptCreator):
    def __init__(self, device: str):
        self.device = device

    def create(self, image: np.ndarray, categories: List[str]) -> Tuple[np.ndarray, sv.Detections]:
        pass


class GroundingDinoResultProcessor(BaseResultProcessor):
    
    def process(self, text: str, marks: sv.Detections) -> Dict[str, str]:
        pass

    def visualize(self, text: str, image: np.ndarray, marks: sv.Detections) -> np.ndarray:
        pass


PIPELINES = {
    'sam': (SamPromptCreator, SamResultProcessor),
    'grounding-dino': (GroundingDinoPromptCreator, GroundingDinoResultProcessor)
}


def pipeline(name: str, **kwargs) -> Tuple[BasePromptCreator, BaseResultProcessor]:
    """Retrieves the prompt creator and result processor for the specified pipeline.

    Args:
        name (str): The name of the pipeline.
        **kwargs: Additional keyword arguments for initializing the classes.

    Returns:
        Tuple[BasePromptCreator, BaseResultProcessor]: Instances of the prompt creator and result processor.

    Raises:
        ValueError: If the pipeline name is not in the PIPELINES dictionary.
    """
    pipeline_classes = PIPELINES.get(name)

    if pipeline_classes is None:
        raise ValueError(f"Pipeline '{name}' not found. Please choose from {list(PIPELINES.keys())}.")

    PromptCreatorClass, ResultProcessorClass = pipeline_classes

    prompt_creator = PromptCreatorClass(**kwargs)
    result_processor = ResultProcessorClass(**kwargs)

    return prompt_creator, result_processor
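
Worth noting: pipeline forwards the same **kwargs to both constructors, so both the creator and the processor classes must accept them (as sketched, SamResultProcessor would also need to take device, or ignore extra kwargs). A minimal, self-contained sketch of the registry pattern, using hypothetical Dummy* stand-ins rather than the real SAM classes:

```python
from typing import Dict, Tuple, Type


# Hypothetical stand-ins, just to exercise the registry lookup.
class DummyPromptCreator:
    def __init__(self, device: str = 'cpu'):
        self.device = device


class DummyResultProcessor:
    def __init__(self, device: str = 'cpu'):
        self.device = device


PIPELINES: Dict[str, Tuple[type, type]] = {
    'dummy': (DummyPromptCreator, DummyResultProcessor),
}


def pipeline(name: str, **kwargs):
    # Look up the registered pair; fail loudly on an unknown name.
    pipeline_classes = PIPELINES.get(name)
    if pipeline_classes is None:
        raise ValueError(
            f"Pipeline '{name}' not found. "
            f"Please choose from {list(PIPELINES.keys())}.")
    creator_cls, processor_cls = pipeline_classes
    # The same kwargs go to both constructors.
    return creator_cls(**kwargs), processor_cls(**kwargs)


creator, processor = pipeline('dummy', device='cuda')
print(creator.device)  # cuda
```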

Example Usage

LMM inference gets sandwiched between prompt_creator and result_processor calls.

import cv2
from maestro import pipeline, prompt_gpt4_vision

image = cv2.imread('dog.jpeg')  # load the input image
prompt_creator, result_processor = pipeline('sam', device='cuda')

image_prompt, marks = prompt_creator.create(image=image)
text_prompt = 'Find dog.'
api_key = '...'

response = prompt_gpt4_vision(
    text_prompt=text_prompt, 
    image_prompt=image_prompt, 
    api_key=api_key)

visualization = result_processor.visualize(
    text=response, 
    image=image, 
    marks=marks)
@SkalskiP SkalskiP added the enhancement New feature or request label Nov 29, 2023
PawelPeczek-Roboflow commented Nov 30, 2023

Looks good as a baseline; I am just wondering whether a change along these lines would be more verbose:

maestro = build_maestro('sam', device='cuda').with("gpt-4")
result = maestro.prompt("Find a dog").with_image(image).visualize()

Naming conventions are to be agreed on - I would just like to point out that using prompt_creator and result_processor with custom things (that cannot be fully custom) in between may confuse less advanced users - especially since result_processor probably assumes some structure of the response, which may not be guaranteed when a client uses their own logic instead of prompt_gpt4_vision().

For more advanced use cases, however, I would let .with("gpt-4") be replaced with .with(my_callable), where my_callable takes agreed parameters and clients can inject their own implementation.
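
The builder idea above could be sketched roughly as below. All names here (build_maestro, Maestro, with_model) are illustrative, not a committed API; note in particular that with is a reserved word in Python, so the chained method needs a different name such as with_model:

```python
from typing import Optional, Union, Callable


class Maestro:
    """Hypothetical fluent wrapper around a creator/processor pair."""

    def __init__(self, pipeline_name: str, device: str = 'cpu'):
        self.pipeline_name = pipeline_name
        self.device = device
        self.model: Optional[Union[str, Callable]] = None
        self.text: Optional[str] = None
        self.image = None

    def with_model(self, model: Union[str, Callable]) -> 'Maestro':
        # `model` may be a known name ("gpt-4") or, for advanced
        # users, any callable taking agreed parameters.
        self.model = model
        return self

    def prompt(self, text: str) -> 'Maestro':
        self.text = text
        return self

    def with_image(self, image) -> 'Maestro':
        self.image = image
        return self


def build_maestro(pipeline_name: str, device: str = 'cpu') -> Maestro:
    return Maestro(pipeline_name, device=device)


maestro = build_maestro('sam', device='cuda').with_model('gpt-4')
result = maestro.prompt('Find a dog').with_image(None)
print(result.text)  # Find a dog
```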

@yeldarby

This makes sense to me for set-of-marks style prompts where you're annotating an image.

I think we may want to keep in mind some aspirational things, which we may implement some day, as we design the API structure. Some thoughts on potential future directions of exploration:

  • Chaining - taking the output of one response, doing another transformation, and passing it back (eg "find the dog" -> it finds it -> we crop the photo to isolate the object of interest -> "describe this dog")
  • Few-shot - pulling similar images (and captions/annotations) from a vector DB & passing them along with your prompt to show by example what you want (or "spot the difference" style prompting against a reference image)
  • RAG - pulling relevant images from a vector DB to add additional context
  • Temporal / Video - to help with eg the sports broadcasting example
  • Tool use - using another model like a fine-tuned CNN to be able to add additional context
  • Integration with existing tools like LangChain (so you can eg use these prompting techniques as part of agent flows)
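
One concrete reading of the chaining bullet: the detection box from the first response becomes a crop that seeds the follow-up prompt ("describe this dog"). A minimal sketch; the helper name is hypothetical:

```python
import numpy as np


def crop_to_detection(image: np.ndarray, xyxy) -> np.ndarray:
    """Crop an HxWxC image to one (x_min, y_min, x_max, y_max) box,
    so the crop can be passed to a follow-up prompt."""
    x_min, y_min, x_max, y_max = (int(v) for v in xyxy)
    return image[y_min:y_max, x_min:x_max]


# E.g. "find the dog" returns a box; isolate it for the next step.
image = np.zeros((480, 640, 3), dtype=np.uint8)
crop = crop_to_detection(image, (100, 50, 300, 250))
print(crop.shape)  # (200, 200, 3)
```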


SkalskiP commented Dec 1, 2023

Cool! I'll keep that in mind. We had a call with @PawelPeczek-Roboflow and agreed on the PromptCreator and ResultProcessor structure. Those can encapsulate a lot of the logic you just described. We just need to make sure the top layer allows arguments to be passed freely. But because we are still not sure what we want to support, we'll add the high-level API at the very end.
