Releases: bentoml/BentoML

BentoML - v1.1.7

12 Oct 18:24
1e8902a

What's Changed

  • Updated OTEL dependencies to 0.41b0 to address a CVE affecting 0.39b0.
  • General documentation and client updates.

Full Changelog: v1.1.6...v1.1.7

BentoML - v1.1.6

08 Sep 05:23
c1504bd

Full Changelog: v1.1.5...v1.1.6

BentoML - v1.1.5

08 Sep 05:15
ca6eca5

Full Changelog: v1.1.4...v1.1.5

BentoML - v1.1.4

30 Aug 01:17
7a83d99

🍱 To better support LLM serving through response streaming, we are proud to introduce experimental server-sent events (SSE) streaming in this release of BentoML v1.1.4 and OpenLLM v0.2.27. See an example service definition for SSE streaming with Llama 2.

  • Added response streaming through SSE to the bentoml.io.Text IO Descriptor type.
  • Added async generator support to both API Server and Runner to yield incremental text responses.
  • Added support to ☁️ BentoCloud to natively stream responses over SSE (see the sketch below).
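
A minimal sketch of what an SSE-streaming service definition may look like with these additions; the service name and word-by-word chunking are illustrative, not taken from the release:

    import asyncio

    import bentoml
    from bentoml.io import Text

    svc = bentoml.Service("sse-demo")

    @svc.api(input=Text(), output=Text())
    async def stream(prompt: str):
        # Yielding from an async generator sends each chunk to the
        # client incrementally as a server-sent event.
        for word in prompt.split():
            yield word + " "
            await asyncio.sleep(0.05)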

🦾 OpenLLM added token streaming capabilities to support streaming responses from LLMs.

  • Added /v1/generate_stream endpoint for streaming responses from LLMs.

    curl -N -X 'POST' 'http://0.0.0.0:3000/v1/generate_stream' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
      "prompt": "### Instruction:\n What is the definition of time (200 words essay)?\n\n### Response:",
      "llm_config": {
        "use_llama2_prompt": false,
        "max_new_tokens": 4096,
        "early_stopping": false,
        "num_beams": 1,
        "num_beam_groups": 1,
        "use_cache": true,
        "temperature": 0.89,
        "top_k": 50,
        "top_p": 0.76,
        "typical_p": 1,
        "epsilon_cutoff": 0,
        "eta_cutoff": 0,
        "diversity_penalty": 0,
        "repetition_penalty": 1,
        "encoder_repetition_penalty": 1,
        "length_penalty": 1,
        "no_repeat_ngram_size": 0,
        "renormalize_logits": false,
        "remove_invalid_values": false,
        "num_return_sequences": 1,
        "output_attentions": false,
        "output_hidden_states": false,
        "output_scores": false,
        "encoder_no_repeat_ngram_size": 0,
        "n": 1,
        "best_of": 1,
        "presence_penalty": 0.5,
        "frequency_penalty": 0,
        "use_beam_search": false,
        "ignore_eos": false
      },
      "adapter_name": null
    }'

Full Changelog: v1.1.3...v1.1.4

BentoML - v1.1.2

22 Aug 02:46
a2ead21

Patch release

BentoML now provides a new diffusers integration, bentoml.diffusers_simple.

This introduces two integrations, for the stable_diffusion and stable_diffusion_xl models.

import bentoml

# Create a Runner for a Stable Diffusion model
runner = bentoml.diffusers_simple.stable_diffusion.create_runner("CompVis/stable-diffusion-v1-4")

# Create a Runner for a Stable Diffusion XL model
runner_xl = bentoml.diffusers_simple.stable_diffusion_xl.create_runner("stabilityai/stable-diffusion-xl-base-1.0")
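
As a follow-up sketch, such a runner can be wired into a service like any other BentoML runner. The text2img method name and IO types below are assumptions for illustration, not the integration's confirmed API:

    from bentoml.io import JSON, Image

    svc = bentoml.Service("sd-service", runners=[runner])

    @svc.api(input=JSON(), output=Image())
    async def generate(params):
        # `text2img` is a hypothetical runner method name; check the
        # diffusers_simple docs for the actual interface.
        images = await runner.text2img.async_run(**params)
        return images[0]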

General bug fixes and documentation improvements

New Contributors

  • @EgShes made their first contribution in #4102
  • @zhangwm404 made their first contribution in #4108

Full Changelog: v1.1.1...v1.1.2

BentoML - v1.1.1

01 Aug 21:11
ea4aafc

🍱 Patched release 1.1.1

  • Added more extensive cloud configuration options for the bentoml deployment CLI. Thanks @Haivilo!
    Note that bentoml deployment update now takes the deployment name as an optional positional argument instead of the previous --name option:
     bentoml deployment update DEPLOYMENT_NAME
    See #4087
  • Added documentation about the Bento release GitHub action. Thanks @frostming! See #4071

Full Changelog: v1.1.0...v1.1.1

BentoML - v1.1.0

24 Jul 20:34
2ab6de7

🍱 We're thrilled to announce the release of BentoML v1.1.0, our first minor version update since the milestone v1.0.

  • Backward Compatibility: Rest assured that this release maintains full API backward compatibility with v1.0.
  • Official gRPC Support: We've transitioned gRPC support in BentoML from experimental to official status, expanding your toolkit for high-performance, low-latency services.
  • Ray Integration: Ray is a popular open-source compute framework that makes it easy to scale Python workloads. BentoML integrates natively with Ray Serve to enable users to deploy Bento applications in a Ray cluster without modifying code or configuration.
  • Enhanced Hugging Face Transformers and Diffusers Support: All Hugging Face Diffusers models and pipelines can be seamlessly imported and integrated into BentoML applications through the Transformers and Diffusers framework libraries (see the sketch after this list).
  • Enhanced Model Version Management: Enjoy greater flexibility with the improved model version management, enabling flexible configuration and synchronization of model versions with your remote model store.
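
As an illustrative sketch of the Diffusers integration, a pipeline can be imported into the model store and turned into a runner; the model tag below is arbitrary:

    import bentoml

    # Import a Hugging Face Diffusers pipeline into the local model store.
    bentoml.diffusers.import_model(
        "sd2.1",
        "stabilityai/stable-diffusion-2-1",
    )

    # Load it back and create a runner for serving.
    runner = bentoml.diffusers.get("sd2.1:latest").to_runner()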

🦾 We are also excited to announce the launch of OpenLLM v0.2.0 featuring the support of Llama 2 models.

  • GPU and CPU Support: Running Llama 2 is supported on both GPU and CPU.

  • Model variations and parameter sizes: Supports all model weights and parameter sizes available on Hugging Face.

    meta-llama/llama-2-70b-chat-hf
    meta-llama/llama-2-13b-chat-hf
    meta-llama/llama-2-7b-chat-hf
    meta-llama/llama-2-70b-hf
    meta-llama/llama-2-13b-hf
    meta-llama/llama-2-7b-hf
    openlm-research/open_llama_7b_v2
    openlm-research/open_llama_3b_v2
    openlm-research/open_llama_13b
    huggyllama/llama-65b
    huggyllama/llama-30b
    huggyllama/llama-13b
    huggyllama/llama-7b

    Users can use any weights on Hugging Face (e.g. TheBloke/Llama-2-13B-chat-GPTQ), custom weights from a local path (e.g. /path/to/llama-1), or fine-tuned weights, as long as they adhere to LlamaModelForCausalLM (see the sketch after this list).

  • Stay tuned for fine-tuning capabilities in OpenLLM: Support for fine-tuning various Llama 2 models will be added in a future release. Try the experimental script for fine-tuning Llama 2 with QLoRA under the OpenLLM playground.

    python -m openllm.playground.llama2_qlora --help
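
As a hedged sketch following the openllm.Runner pattern shown elsewhere in these notes, a Llama 2 runner with custom weights might be created as follows; the model_id keyword is an assumption, not a confirmed signature:

    import openllm

    # "llama" selects the Llama model family; model_id points at any
    # compatible Hugging Face checkpoint or local path (assumed API).
    llm_runner = openllm.Runner(
        "llama", model_id="meta-llama/llama-2-7b-chat-hf"
    )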
    

BentoML - v1.0.22

12 Jun 20:44
89e5fda

🍱 The BentoML v1.0.22 release brings a list of highly anticipated updates.

  • Added support for Pydantic 2 for better validation performance (see the sketch after this list).

  • Added support for CUDA 12 versions in builds and containerization.

  • Introduced service lifecycle events, allowing custom logic to be added on_deployment, on_startup, and on_shutdown. State can be managed using the context variable ctx during the on_startup and on_shutdown events and during request serving in the API.

    import bentoml

    # `svc` is the Service under which the hooks are registered;
    # the name is illustrative.
    svc = bentoml.Service("lifecycle-demo")

    @svc.on_deployment
    def on_deployment():
      # Runs once per deployment, before any workers start.
      pass

    @svc.on_startup
    def on_startup(ctx: bentoml.Context):
      # Initialize shared state when a worker starts.
      ctx.state["object_key"] = create_object()

    @svc.on_shutdown
    def on_shutdown(ctx: bentoml.Context):
      # Release shared state when a worker shuts down.
      cleanup_state(ctx.state["object_key"])

    @svc.api
    def predict(input_data, ctx):
      # State set in on_startup is available during request serving.
      obj = ctx.state["object_key"]
  • Added support for traffic control for both the API Server and Runners. Timeout and maximum concurrency can now be set through configuration.

    api_server:
      traffic:
        timeout: 10 # API Server request timeout in seconds
        max_concurrency: 32 # Maximum concurrency requests in the API Server
    
    runners:
      iris:
        traffic:
          timeout: 10 # Runner request timeout in seconds
          max_concurrency: 32 # Maximum concurrency requests in the Runner
  • Improved bentoml push performance for large Bentos.
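
For the Pydantic support mentioned above, here is a minimal sketch using the JSON IO descriptor's pydantic_model option; the model fields and service name are illustrative:

    from pydantic import BaseModel

    import bentoml
    from bentoml.io import JSON

    class IrisFeatures(BaseModel):
        sepal_len: float
        sepal_width: float
        petal_len: float
        petal_width: float

    svc = bentoml.Service("iris-demo")

    @svc.api(input=JSON(pydantic_model=IrisFeatures), output=JSON())
    def classify(features: IrisFeatures) -> dict:
        # The request body is validated and parsed by Pydantic
        # before it reaches the handler.
        return {"received": features.model_dump()}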

🚀 One more thing: the team is delighted to unveil our latest endeavor, OpenLLM. This innovative project allows you to effortlessly build with state-of-the-art open-source or fine-tuned Large Language Models.

  • Supports all variants of Flan-T5, Dolly V2, StarCoder, Falcon, StableLM, and ChatGLM out of the box. Fully customizable with model-specific arguments.

    openllm start [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]
  • Exposes the familiar BentoML APIs and transforms LLMs seamlessly into Runners.

    llm_runner = openllm.Runner("dolly-v2")
  • Builds LLM application into the Bento format that can be deployed to BentoCloud or containerized into OCI images.

    openllm build [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]

Our dedicated team is working hard to pioneer more integrations of advanced models for upcoming releases of OpenLLM. Stay tuned for the unfolding developments.

BentoML - v1.0.20

10 May 01:14
7f7be71

🍱 BentoML v1.0.20 is released with improved usability and compatibility features.

  • Production Mode by Default: The bentoml serve command now runs with the --production option by default. This change simulates production behavior during development. The --reload option will continue to work as expected. To achieve the previous serving behavior, use --development instead.

  • Optional Dependency for OpenTelemetry Exporter: The opentelemetry-exporter-otlp-proto-http dependency has been moved from a required dependency to an optional one to address a protobuf dependency incompatibility issue. ⚠️ If you are currently using the Model Monitoring and Inference Data Collection feature, you must install the package with the monitor-otlp option from this release onwards to include the necessary dependency.

    pip install "bentoml[monitor-otlp]"
  • OpenTelemetry Trace ID Configuration Option: A new configuration option has been added to return the OpenTelemetry Trace ID in the response. This feature is particularly helpful when tracing has not been initialized in the upstream caller, but the caller still wishes to log the Trace ID in case of an error.

    api_server:
      http:
        response:
          trace_id: True
  • Start from a Service: Added the ability to start a server from a bentoml.Service object. This is helpful for troubleshooting a project in a development environment where no Bento has been built yet.

    import bentoml

    # import the Service defined in the `/clip_api_service/service.py` file
    from clip_api_service.service import svc

    if __name__ == "__main__":
      # start the server in the background and get a client for it
      server = bentoml.HTTPServer(svc)
      server.start(blocking=False)
      client = server.get_client()
      client.predict(...)

Full Changelog: v1.0.19...v1.0.20

BentoML - v1.0.19

26 Apr 23:52
afe9660

🍱 BentoML v1.0.19 is released with enhanced GPU utilization and expanded ML framework support.

  • Optimized GPU resource utilization: Enabled scheduling multiple instances of the same runner using the workers_per_resource scheduling strategy configuration. The following configuration schedules 2 instances of the “iris” runner per GPU. workers_per_resource defaults to 1.

    runners:
      iris:
        resources:
          nvidia.com/gpu: 1
        workers_per_resource: 2
  • New ML framework support: We've added support for EasyOCR and Detectron2 to our growing list of supported ML frameworks.

  • Enhanced runner communication: Implemented PEP 574 out-of-band pickling to improve runner communication by eliminating memory copying, resulting in better performance and efficiency (see the sketch after this list).

  • Backward compatibility for Hugging Face Transformers: Resolved compatibility issues with Hugging Face Transformers versions prior to v4.18, ensuring a seamless experience for users with older versions.
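
For illustration, the sketch below shows PEP 574 (pickle protocol 5) out-of-band buffers in isolation, using NumPy arrays, which support the protocol natively; it is standalone code, not BentoML's internal implementation:

    import pickle

    import numpy as np

    arr = np.zeros(1_000_000)

    buffers = []
    # With protocol 5, the array's data buffer is handed to
    # buffer_callback instead of being copied into the pickle stream.
    data = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

    # The receiver reconstructs the array from the out-of-band buffers,
    # avoiding an extra memory copy.
    restored = pickle.loads(data, buffers=buffers)
    assert (restored == arr).all()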

⚙️ With the release of Kubeflow 1.7, BentoML now has native integration with Kubeflow, allowing developers to leverage BentoML's cloud-native components. Previously, developers were limited to exporting and deploying a Bento as a single container. With this integration, models trained in Kubeflow can easily be packaged, containerized, and deployed to a Kubernetes cluster as microservices. This architecture enables individual models to run in their own pods, utilizing the most optimal hardware for their respective tasks and enabling independent scaling.

💡 With each release, we consistently update our blog, documentation and examples to empower the community in harnessing the full potential of BentoML.

Full Changelog: v1.0.18...v1.0.19