Releases: bentoml/BentoML
BentoML - v1.1.7
What's Changed
Update OTEL deps to 0.41b0 to address CVE for 0.39b0
General documentation and client updates.
- docs: Add the SDXL deployment quickstart by @Sherlock113 in #4175
- Update pytorch.rst by @piercus in #4176
- chore(deps): bump actions/checkout from 3 to 4 by @dependabot in #4177
- fix: parse tag from multiline output by @frostming in #4178
- docs: Update the user management docs by @Sherlock113 in #4186
- fix(config): set default runner timeout to 15min by @sauyon in #4184
- docs: Add observability to the BentoCloud overview docs by @Sherlock113 in #4187
- fix(framework): add args and kwargs to sklearn and xgboost methods by @jianshen92 in #4189
- docs: fix typo in bento.rst and model.rst by @seedspirit in #4192
- fix: Rename ASGIHTTPSender to BufferedASGISender for Ray compatibility. by @HamzaFarhan in #4191
- fix(client): make get_client raise instead of logging by @sauyon in #4181
- fix(cloud-client): delete unused field of schema by @Haivilo in #4196
- chore(deps): bump docker/setup-buildx-action from 2 to 3 by @dependabot in #4195
- chore(deps): bump docker/setup-qemu-action from 2 to 3 by @dependabot in #4194
- chore: client_request_hook type fix by @sauyon in #4199
- docs: Add docs for the new bentoml.Server API by @Sherlock113 in #4198
- docs: Add the OneDiffusion Google Colab task by @Sherlock113 in #4202
- docs: Add best practices doc for cost optimization by @Sherlock113 in #4200
- docs: Update the Manage Models and Bentos docs by @Sherlock113 in #4203
- fix: do not use UDS on WSL by @frostming in #4204
- docs: fix typos in help messages by @smidm in #4206
- fix: subprocess not using same python as main process causing `bentoml.bentos.build` to crash by @nickolasrm in #4209
- fix: allow WSL in the condition by @frostming in #4210
- docs: Update manage access token docs by @Sherlock113 in #4215
- ci: pre-commit autoupdate [skip ci] by @pre-commit-ci in #4216
- fix: EasyOCR integration docs mistake by @jianshen92 in #4214
- fix: include mounted FastAPI app's OpenAPI components by @RobbieFernandez in #4212
- UPDATE: model.py -> fix Model class Exception message. by @JminJ in #4219
- docs: Remove private access mention by @Sherlock113 in #4221
- docs: Change to sentence case by @Sherlock113 in #4222
- docs: Fix dead link by @Sherlock113 in #4225
- feat: support ipv6 addresses for serve by @sauyon in #3914
- docs: Fix all dead links in BentoML docs by @Sherlock113 in #4229
- docs: Add the BYOC doc by @Sherlock113 in #4223
- docs: Update the Services doc by @Sherlock113 in #4231
- fix(client): type fixes by @sauyon in #4182
- fix: correct the bento size to include the size of models by @frostming in #4226
- fix: use httpx for usage tracking by @sauyon in #4228
- fix(deps): bump otel for CVE by @aarnphm in #4233
- feat: separate and optimize async and sync clients by @judahrand in #4116
New Contributors
- @piercus made their first contribution in #4176
- @seedspirit made their first contribution in #4192
- @HamzaFarhan made their first contribution in #4191
- @nickolasrm made their first contribution in #4209
- @JminJ made their first contribution in #4219
Full Changelog: v1.1.6...v1.1.7
BentoML - v1.1.6
What's Changed
- fix(exception): catch exception for users' runners code by @aarnphm in #4150
- docs: Add the streaming docs by @Sherlock113 in #4164
- ci: pre-commit autoupdate [skip ci] by @pre-commit-ci in #4167
- fix(httpclient): take into account trailing slash in from_url by @sauyon in #4169
- docs: fix typo by @Sherlock113 in #4173
- fix: apply env map for distributed runner workers by @bojiang in #4174
New Contributors
- @pre-commit-ci made their first contribution in #4167
Full Changelog: v1.1.5...v1.1.6
BentoML - v1.1.5
What's Changed
- fix(type): explicit init for attrs Runner by @aarnphm in #4140
- fix: typo in ALLOWED_CUDA_VERSION_ARGS by @thomasjo in #4156
- chore(deps): open Starlette version, to allow latest by @alexeyshockov in #4100
- chore: lower bound for cloudpickle by @aarnphm in #4098
- docs: Add embedded runners docs by @Sherlock113 in #4157
- fix cloud client types by @sauyon in #4160
- fix: use closer-integrated callbackwrapper by @sauyon in #4161
- chore(annotations): cleanup compat and fix ModelSignatureDict type by @aarnphm in #4162
- fix(pull): correct use `cloud_context` for models pull by @aarnphm in #4163
New Contributors
- @thomasjo made their first contribution in #4156
- @alexeyshockov made their first contribution in #4100
Full Changelog: v1.1.4...v1.1.5
BentoML - v1.1.4
🍱 To better support LLM serving through response streaming, we are proud to introduce experimental server-sent events (SSE) streaming support in this release of BentoML v1.1.4 and OpenLLM v0.2.27. See an example service definition for SSE streaming with Llama2.
- Added response streaming through SSE to the `bentoml.io.Text` IO descriptor type.
- Added async generator support to both API Server and Runner to `yield` incremental text responses (see the sketch after this list).
- Added support to ☁️ BentoCloud to natively support SSE streaming.
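To illustrate how these pieces fit together, here is a minimal sketch of a streaming endpoint; the service name and generator are hypothetical examples rather than code from this release:

import bentoml
from bentoml.io import Text

svc = bentoml.Service("sse-demo")  # hypothetical service name

@svc.api(input=Text(), output=Text())
async def stream_text(prompt: str):
    # Returning an async generator lets the API server flush each
    # yielded chunk to the client incrementally over SSE.
    async def chunks():
        for token in prompt.split():
            yield token + " "
    return chunks()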
🦾 OpenLLM added token streaming capabilities to support streaming responses from LLMs.
- Added `/v1/generate_stream` endpoint for streaming responses from LLMs.

curl -N -X 'POST' 'http://0.0.0.0:3000/v1/generate_stream' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "### Instruction:\n What is the definition of time (200 words essay)?\n\n### Response:",
    "llm_config": {
      "use_llama2_prompt": false, "max_new_tokens": 4096, "early_stopping": false,
      "num_beams": 1, "num_beam_groups": 1, "use_cache": true,
      "temperature": 0.89, "top_k": 50, "top_p": 0.76, "typical_p": 1,
      "epsilon_cutoff": 0, "eta_cutoff": 0, "diversity_penalty": 0,
      "repetition_penalty": 1, "encoder_repetition_penalty": 1, "length_penalty": 1,
      "no_repeat_ngram_size": 0, "renormalize_logits": false, "remove_invalid_values": false,
      "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false,
      "output_scores": false, "encoder_no_repeat_ngram_size": 0,
      "n": 1, "best_of": 1, "presence_penalty": 0.5, "frequency_penalty": 0,
      "use_beam_search": false, "ignore_eos": false
    },
    "adapter_name": null
  }'
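On the client side, the stream can be consumed with a plain streaming HTTP request. The following is a sketch using the requests library; the abbreviated payload is illustrative, and a real call may need the full llm_config shown above:

import requests

payload = {"prompt": "What is the definition of time?", "adapter_name": None}

# stream=True keeps the connection open so each SSE chunk can be
# read as it arrives instead of buffering the whole response.
with requests.post(
    "http://0.0.0.0:3000/v1/generate_stream",
    json=payload,
    stream=True,
    timeout=None,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)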
What's Changed
- docs: Update the models doc by @Sherlock113 in #4145
- docs: Add more workflows to the GitHub Actions doc by @Sherlock113 in #4146
- docs: Add text embedding example to readme by @Sherlock113 in #4151
- fix: bento build cache miss by @xianml in #4153
- fix(buildx): parsing attestation on docker desktop by @aarnphm in #4155
Full Changelog: v1.1.3...v1.1.4
BentoML - v1.1.2
Patch releases
BentoML now provides a new diffusers integration, `bentoml.diffusers_simple`. This introduces two integrations, for the `stable_diffusion` and `stable_diffusion_xl` models.
import bentoml
# Create a Runner for a Stable Diffusion model
runner = bentoml.diffusers_simple.stable_diffusion.create_runner("CompVis/stable-diffusion-v1-4")
# Create a Runner for a Stable Diffusion XL model
runner_xl = bentoml.diffusers_simple.stable_diffusion_xl.create_runner("stabilityai/stable-diffusion-xl-base-1.0")
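A runner like this would typically be served behind an API. The sketch below is one possible wiring, not code from the release; the service name is made up, and it assumes the runner's default method proxies the underlying diffusers pipeline, which returns an object with an images list:

import bentoml
from bentoml.io import Image, Text

runner = bentoml.diffusers_simple.stable_diffusion.create_runner("CompVis/stable-diffusion-v1-4")
svc = bentoml.Service("sd-demo", runners=[runner])  # hypothetical service name

@svc.api(input=Text(), output=Image())
async def txt2img(prompt: str):
    # Assumes the default runner method mirrors the pipeline call;
    # consult the diffusers_simple docs for the exact method names.
    output = await runner.async_run(prompt)
    return output.images[0]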
General bug fixes and documentation improvements.
What's Changed
- docs: Add the Overview and Quickstarts sections by @Sherlock113 in #4088
- chore(type): makes ModelInfo mypy-compatible by @aarnphm in #4094
- feat(store): update annotations by @aarnphm in #4092
- docs: Fix some relative links by @Sherlock113 in #4097
- docs: Add the Iris quickstart doc by @Sherlock113 in #4096
- docs: Add the yolo quickstart by @Sherlock113 in #4099
- docs: Code format fix by @Sherlock113 in #4101
- fix: respect environment during `bentoml.bentos.build` by @aarnphm in #4081
- docs: replaced deprecated `save` to `save_model` in pytorch.rst by @EgShes in #4102
- fix: Make the install command shorter by @frostming in #4103
- docs: Update the BentoCloud Build doc by @Sherlock113 in #4104
- docs: Add quickstart repo link and move torch import in Yolo by @Sherlock113 in #4106
- docs: fix typo by @zhangwm404 in #4108
- docs: fix typo by @zhangwm404 in #4109
- fix: calculate Pandas DataFrame batch size correctly by @judahrand in #4110
- fix(cli): fix CLI output to BentoCloud by @Haivilo in #4114
- Fix sklearn example docs by @jianshen92 in #4121
- docs: Add the BentoCloud Deployment creation and update page property explanations by @Sherlock113 in #4105
- fix: disable pyright for being too strict by @frostming in #4113
- refactor(cli): change prompt of cloud cli to unify Yatai and BentoCloud by @Haivilo in #4124
- fix(cli): change model to lower case by @Haivilo in #4126
- chore(ci): remove codestyle jobs by @aarnphm in #4125
- fix: don't pass column names twice by @judahrand in #4120
- feat: SSE (Experimental) by @jianshen92 in #4083
- docs: Restructure the get started section in BentoCloud docs by @Sherlock113 in #4129
- docs: change monitoring image by @Haivilo in #4133
- feat: Rust gRPC client by @aarnphm in #3368
- feature(framework): diffusers lora and textual inversion support by @larme in #4086
- feat(buildx): support for attestation and sbom with buildx by @aarnphm in #4132
Full Changelog: v1.1.1...v1.1.2
BentoML - v1.1.1
🍱 Patch release 1.1.1
- Added more extensive cloud config options for the `bentoml deployment` CLI. Thanks @Haivilo! Note that `bentoml deployment update` now takes the deployment name as an optional positional argument instead of the previous `--name` option (see #4087):

bentoml deployment update DEPLOYMENT_NAME
- Added documentation about the Bento release GitHub action. Thanks @frostming! See #4071
Full Changelog: v1.1.0...v1.1.1
BentoML - v1.1.0
🍱 We're thrilled to announce the release of BentoML v1.1.0, our first minor version update since the milestone v1.0.
- Backward Compatibility: Rest assured that this release maintains full API backward compatibility with v1.0.
- Official gRPC Support: We've transitioned gRPC support in BentoML from experimental to official status, expanding your toolkit for high-performance, low-latency services.
- Ray Integration: Ray is a popular open-source compute framework that makes it easy to scale Python workloads. BentoML integrates natively with Ray Serve to enable users to deploy Bento applications in a Ray cluster without modifying code or configuration.
- Enhanced Hugging Face Transformers and Diffusers Support: All Hugging Face Diffuser models and pipelines can be seamlessly imported and integrated into BentoML applications through the Transformers and Diffusers framework libraries.
- Enhanced Model Version Management: Enjoy greater flexibility with the improved model version management, enabling flexible configuration and synchronization of model versions with your remote model store.
🦾 We are also excited to announce the launch of OpenLLM v0.2.0 featuring the support of Llama 2 models.
- GPU and CPU Support: Running Llama 2 is supported on both GPU and CPU.
- Model variations and parameter sizes: Supports all model weights and parameter sizes on Hugging Face:

meta-llama/llama-2-70b-chat-hf
meta-llama/llama-2-13b-chat-hf
meta-llama/llama-2-7b-chat-hf
meta-llama/llama-2-70b-hf
meta-llama/llama-2-13b-hf
meta-llama/llama-2-7b-hf
openlm-research/open_llama_7b_v2
openlm-research/open_llama_3b_v2
openlm-research/open_llama_13b
huggyllama/llama-65b
huggyllama/llama-30b
huggyllama/llama-13b
huggyllama/llama-7b

Users can use any weights on HuggingFace (e.g. `TheBloke/Llama-2-13B-chat-GPTQ`), custom weights from a local path (e.g. `/path/to/llama-1`), or fine-tuned weights, as long as the model adheres to LlamaModelForCausalLM.
- Stay tuned for fine-tuning capabilities in OpenLLM: Fine-tuning various Llama 2 models will be added in a future release. Try the experimental script for fine-tuning Llama 2 with QLoRA under the OpenLLM playground.
python -m openllm.playground.llama2_qlora --help
BentoML - v1.0.22
🍱 The BentoML v1.0.22 release brings a list of well-anticipated updates.
- Added support for Pydantic 2 for better validation performance.
- Added support for CUDA 12 versions in builds and containerization.
- Introduced service lifecycle events, allowing custom logic to be added via `on_deployment`, `on_startup`, and `on_shutdown` hooks. State can be managed using the context `ctx` variable during the `on_startup` and `on_shutdown` events and during request serving in the API.

@svc.on_deployment
def on_deployment():
    pass

@svc.on_startup
def on_startup(ctx: bentoml.Context):
    ctx.state["object_key"] = create_object()

@svc.on_shutdown
def on_shutdown(ctx: bentoml.Context):
    cleanup_state(ctx.state["object_key"])

@svc.api
def predict(input_data, ctx):
    object = ctx.state["object_key"]
    pass
- Added support for traffic control for both the API Server and Runners. Timeout and maximum concurrency can now be configured through configuration.

api_server:
  traffic:
    timeout: 10  # API Server request timeout in seconds
    max_concurrency: 32  # Maximum concurrent requests in the API Server
runners:
  iris:
    traffic:
      timeout: 10  # Runner request timeout in seconds
      max_concurrency: 32  # Maximum concurrent requests in the Runner
- Improved `bentoml push` performance for large Bentos.
🚀 One more thing: the team is delighted to unveil our latest endeavor, OpenLLM. This innovative project allows you to effortlessly build with state-of-the-art open-source or fine-tuned Large Language Models.
- Supports all variants of Flan-T5, Dolly V2, StarCoder, Falcon, StableLM, and ChatGLM out of the box. Fully customizable with model-specific arguments.

openllm start [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]
- Exposes the familiar BentoML APIs and transforms LLMs seamlessly into Runners; a usage sketch follows this list.

llm_runner = openllm.Runner("dolly-v2")
- Builds LLM applications into the Bento format that can be deployed to BentoCloud or containerized into OCI images.

openllm build [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]
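As a usage sketch, the runner can be mounted in an ordinary BentoML Service. The service name here is made up, and the `generate` method on the runner is an assumption based on early OpenLLM examples, so treat this as illustrative only:

import bentoml
import openllm
from bentoml.io import Text

llm_runner = openllm.Runner("dolly-v2")
svc = bentoml.Service("llm-service", runners=[llm_runner])  # hypothetical name

@svc.api(input=Text(), output=Text())
async def prompt(input_text: str) -> str:
    # Runner methods are invoked via .run/.async_run; the exact
    # return shape depends on the model (assumed here).
    answer = await llm_runner.generate.async_run(input_text)
    return str(answer)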
Our dedicated team is working hard to pioneer more integrations of advanced models for upcoming releases of OpenLLM. Stay tuned for the unfolding developments.
BentoML - v1.0.20
🍱 BentoML v1.0.20 is released with improved usability and compatibility features.
- Production Mode by Default: the `bentoml serve` command now runs with the `--production` option by default. This change was made to simulate production behavior during development. The `--reload` option will continue to work as expected. To achieve the previous serving behavior, use `--development` instead.
- Optional Dependency for OpenTelemetry Exporter: the `opentelemetry-exporter-otlp-proto-http` dependency has been moved from a required dependency to an optional one to address a `protobuf` dependency incompatibility issue. ⚠️ If you are currently using the Model Monitoring and Inference Data Collection feature, you must install the package with the `monitor-otlp` option from this release onwards to include the necessary dependency.

pip install "bentoml[monitor-otlp]"
- OpenTelemetry Trace ID Configuration Option: A new configuration option has been added to return the OpenTelemetry Trace ID in the response. This feature is particularly helpful when tracing has not been initialized in the upstream caller, but the caller still wishes to log the Trace ID in case of an error.

api_server:
  http:
    response:
      trace_id: True
- Start from a Service: Added the ability to start a server from a `bentoml.Service` object. This is helpful for troubleshooting a project in a development environment where no Bentos have been built yet.

import bentoml

# import the Service defined in the `/clip_api_service/service.py` file
from clip_api_service.service import svc

if __name__ == "__main__":
    # start a server:
    server = bentoml.HTTPServer(svc)
    server.start(blocking=False)
    client = server.get_client()
    client.predict(..)
What's Changed
- fix(dispatcher): handling empty o_stat in `trigger_refresh` by @larme in #3796
- fix(framework): adjust diffusers device_map default behavior by @larme in #3779
- chore(dispatcher): cancel jobs with a for loop by @sauyon in #3788
- fix: correctly reraise `CancelledError` by @sauyon in #3801
- use path as resource for non-OS paths by @sauyon in #3800
- chore(deps): bump coverage[toml] from 7.2.3 to 7.2.4 by @dependabot in #3803
- feat: embedded runner by @larme in #3735
- feat(tensorflow): support list types inputs by @enmanuelmag in #3807
- chore(deps): bump ruff from 0.0.263 to 0.0.264 by @dependabot in #3817
- feat: subprocess build by @aarnphm in #3814
- docs: update community slack links by @parano in #3824
- chore(deps): bump pyarrow from 11.0.0 to 12.0.0 by @dependabot in #3820
- chore(deps): remove imageio by @aarnphm in #3812
- chore(deps): bump tritonclient[all] from 2.32.0 to 2.33.0 by @dependabot in #3795
- ci: add Pillow to tests dependencies by @aarnphm in #3830
- feat(observability): support `service.name` by @aarnphm in #3825
- feat: optional returning trace_id in response by @aarnphm in #3827
- chore: 3.11 support by @PeterJCLaw in #3792
- fix: Eliminate the exception during shutdown by @frostming in #3826
- chore: expose scheduling_strategy in to_runner by @bojiang in #3831
- feat: allow starting server with bentoml.Service instance by @parano in #3829
- chore(deps): bump bufbuild/buf-setup-action from 1.17.0 to 1.18.0 by @dependabot in #3838
- fix: make sure to set content-type for file type by @aarnphm in #3837
- docs: update default docs to use env as key:value instead of list type by @aarnphm in #3841
- deps: move exporter-proto to optional by @aarnphm in #3840
- feat(server): improve server APIs by @aarnphm in #3834
New Contributors
- @enmanuelmag made their first contribution in #3807
- @PeterJCLaw made their first contribution in #3792
Full Changelog: v1.0.19...v1.0.20
BentoML - v1.0.19
🍱 BentoML v1.0.19 is released with enhanced GPU utilization and expanded ML framework support.
- Optimized GPU resource utilization: Enabled scheduling of multiple instances of the same runner using the `workers_per_resource` scheduling strategy configuration. The following configuration allows scheduling 2 instances of the "iris" runner per GPU instance. `workers_per_resource` is 1 by default.

runners:
  iris:
    resources:
      nvidia.com/gpu: 1
    workers_per_resource: 2
- New ML framework support: We've added support for EasyOCR and Detectron2 to our growing list of supported ML frameworks.
- Enhanced runner communication: Implemented PEP 574 out-of-band pickling to improve runner communication by eliminating memory copying, resulting in better performance and efficiency (a toy illustration of the mechanism follows this list).
- Backward compatibility for Hugging Face Transformers: Resolved compatibility issues with Hugging Face Transformers versions prior to `v4.18`, ensuring a seamless experience for users with older versions.
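To make the out-of-band pickling idea concrete, here is a toy, self-contained illustration of the PEP 574 mechanism in plain Python (not BentoML's internal code):

import pickle
import numpy as np

data = np.zeros(1_000_000)
buffers = []
# With protocol 5, large buffers are handed to buffer_callback
# instead of being copied into the pickle byte stream.
payload = pickle.dumps(data, protocol=5, buffer_callback=buffers.append)
# The receiver reconstructs the array from the payload plus the
# zero-copy buffers.
restored = pickle.loads(payload, buffers=buffers)
assert (restored == data).all()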
⚙️ With the release of Kubeflow 1.7, BentoML now has native integration with Kubeflow, allowing developers to leverage BentoML's cloud-native components. Previously, developers were limited to exporting and deploying a Bento as a single container. With this integration, models trained in Kubeflow can easily be packaged, containerized, and deployed to a Kubernetes cluster as microservices. This architecture enables the individual models to run in their own pods, utilizing the most optimal hardware for their respective tasks and enabling independent scaling.
💡 With each release, we consistently update our blog, documentation and examples to empower the community in harnessing the full potential of BentoML.
- Learn more about scheduling strategies to achieve better resource utilization.
- Learn more about model monitoring and drift detection in BentoML and its integration with various monitoring frameworks.
- Learn more about using Nvidia Triton Inference Server as a runner to improve your application's performance and throughput.
What's Changed
- fix(env): using `python -m` to run pip commands by @frostming in #3762
- chore(deps): bump pytest from 7.3.0 to 7.3.1 by @dependabot in #3766
to run pip commands by @frostming in #3762 - chore(deps): bump pytest from 7.3.0 to 7.3.1 by @dependabot in #3766
- feat: lazy load `bentoml.server` by @aarnphm in #3763
- fix(client): service route prefix by @aarnphm in #3765
- chore: add test with many requests by @sauyon in #3768
- fix: using http config for grpc server by @aarnphm in #3771
- feat: apply pep574 out-of-band pickling to DefaultContainer by @larme in #3736
- fix: passing serve_cmd and passthrough kwargs by @aarnphm in #3764
- feat: Detectron by @aarnphm in #3711
- chore(dispatcher): (re-)factor out training code by @sauyon in #3767
- feat: EasyOCR by @aarnphm in #3712
- feat(build): support 3.11 by @aarnphm in #3774
- patch: backports module availability for transformers<4.18 by @aarnphm in #3775
- fix(dispatcher): set wait to 0 while training by @sauyon in #3664
- chore(deps): bump ruff from 0.0.261 to 0.0.262 by @dependabot in #3778
- feat: add `model#load_model` method by @parano in #3780
- feat: Allow spawning more than 1 worker on each resource by @frostming in #3776
- docs: Fix TensorFlow `save_model` parameter order by @ssheng in #3781
- chore(deps): bump yamllint from 1.30.0 to 1.31.0 by @dependabot in #3782
- chore(deps): bump imageio from 2.27.0 to 2.28.0 by @dependabot in #3783
- chore(deps): bump ruff from 0.0.262 to 0.0.263 by @dependabot in #3790
- fix: allow import service defined under a Python package by @parano in #3794
New Contributors
- @frostming made their first contribution in #3762
Full Changelog: v1.0.18...v1.0.19