Releases: mlcommons/modelgauge
v0.5.1
What's Changed
- Updated docs
- SafeTest compatible with Python 3.11+
- Add new Llama Guard 2 to LlamaGuardAnnotator
  - Can configure LlamaGuardAnnotator with optional llama_guard_version parameter. Defaults to Llama Guard 2.
  - Minor changes to prompt/category formatting for Llama Guard 1. This may affect results.
- SafeTest can also be configured to use Llama Guard 1 or 2 as its annotator. Defaults to version 2.
Full Changelog: v0.5.0...v0.5.1
v0.5.0
What's Changed
- Renamed to ModelGauge and started pushing to PyPI!
- A whole bunch of cleanups and preparation for the more public release.
- Caching now supports dicts.
- Unit tests to ensure you can install from PyPI and run in a notebook.
- Expand range of supported Python versions to 3.10 and up.
- Remove benign hazard from SafeTest.
- Start setting up ReadTheDocs.
Full Changelog: v0.3.3...v0.5.0
v0.3.3
What's Changed
- Change SafeTest to data_april04 release.
- More prompts
- Removed safe-ben
Full Changelog: v0.3.2...v0.3.3
v0.3.2
What's Changed
- max_test_items returns a relatively stable set of prompts
- Loading bar for plugins
- Have the list command report prettier values for secrets
- Time out requests stuck on TogetherAI
- Updated docs
- Move simple_test_runner out of plugins and into core library
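One way to get a "relatively stable set of prompts" from max_test_items is to rank prompts by a hash of their text and keep the first N. The sketch below is only an illustration of that idea under stated assumptions: the function name stable_subset is hypothetical, and nothing here is taken from the library's actual implementation.

```python
import hashlib
from typing import List, Sequence


def stable_subset(prompts: Sequence[str], max_test_items: int) -> List[str]:
    """Pick up to max_test_items prompts independent of input order.

    Hypothetical helper for illustration; not the library's code.
    """
    if max_test_items < 0:
        raise ValueError("max_test_items must be non-negative")
    # Rank each distinct prompt by a digest of its text. The ranking is the
    # same on every run and does not depend on how the input was ordered.
    ranked = sorted(
        set(prompts),
        key=lambda p: hashlib.sha256(p.encode("utf-8")).hexdigest(),
    )
    return ranked[:max_test_items]


prompts = ["prompt a", "prompt b", "prompt c", "prompt d"]
small = stable_subset(prompts, 2)
large = stable_subset(prompts, 3)
```

With this approach, raising max_test_items only adds prompts to the earlier selection, so responses cached for the smaller run can be reused by the larger one.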
Full Changelog: v0.3.1...v0.3.2
v0.3.1
What's Changed
- Fix bad version specification for together dependency, which was causing 0.3.0 to not actually install.
- Add Deepseek model that is now available on Together.
- Stabilize the order of TestItems in SafeTest to better utilize caching.
Full Changelog: v0.3.0...v0.3.1
v0.3.0
What's Changed
- Reorganized the run_data folder and made several improvements to caching. This breaks backward compatibility. Old files should just be ignored, but if you run into issues, it is probably best to delete your run_data folder.
- Updated SafeTest to 02apr2024.
- We now have all SUTs in the requested set, minus Deepseek.
- Simplified the command line to be newhelm once installed or poetry run newhelm when using the local repo.
- Annotations are now recorded per completion instead of per TestItem.
- HuggingFace sets pad token to default, which should remove warning messages.
- Added some enforcement of SUTCapabilities to help them be accurate.
- Remove all "Base" prefixes except BaseTest.
Full Changelog: v0.2.6...v0.3.0
v0.2.6
v0.2.5
What's Changed
- Tests no longer have a get_metadata() method. Dependency helper uses a Test's class name instead.
- Introduced the concept of SUT capabilities (ProducesPerTokenLogProbabilities, AcceptsChatPrompt, AcceptsTextPrompt). SUTs and Tests must specify their capabilities/requirements in the @newhelm_sut and @newhelm_test decorators.
- SUTs can now return per-token log probabilities in a SUTCompletion. OpenAIChat is updated with this capability.
- SafeTest updates:
  - Re-structured to have one test per hazard, grouping all applicable persona types (typical, malicious, or vulnerable).
  - Results are reported as mapping from persona type to PersonaResult, which consists of num_items in addition to frac_safe.
  - Added tests for new hazards
- Added new test DiscrimEval
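The capability/requirement pairing described above can be pictured roughly as follows. This is a self-contained sketch, not the library's API: the capability class names come from these notes, but the decorators sketch_sut and sketch_test, their parameters, and missing_capabilities are hypothetical stand-ins for @newhelm_sut and @newhelm_test, whose real signatures are not shown here.

```python
from typing import FrozenSet, Iterable, Type


class SUTCapability:
    """Marker base class for things a SUT can do."""


class AcceptsTextPrompt(SUTCapability):
    pass


class AcceptsChatPrompt(SUTCapability):
    pass


class ProducesPerTokenLogProbabilities(SUTCapability):
    pass


def sketch_sut(capabilities: Iterable[Type[SUTCapability]]):
    """Hypothetical decorator: record what a SUT class declares it can do."""
    def decorator(cls):
        cls.capabilities = frozenset(capabilities)
        return cls
    return decorator


def sketch_test(requires_sut_capabilities: Iterable[Type[SUTCapability]]):
    """Hypothetical decorator: record what a Test class needs from a SUT."""
    def decorator(cls):
        cls.requires_sut_capabilities = frozenset(requires_sut_capabilities)
        return cls
    return decorator


def missing_capabilities(test_cls, sut_cls) -> FrozenSet[Type[SUTCapability]]:
    """Return the requirements of test_cls that sut_cls does not declare."""
    return test_cls.requires_sut_capabilities - sut_cls.capabilities


@sketch_sut(capabilities=[AcceptsTextPrompt])
class TextOnlySUT:
    pass


@sketch_sut(capabilities=[AcceptsTextPrompt, ProducesPerTokenLogProbabilities])
class LogProbSUT:
    pass


@sketch_test(requires_sut_capabilities=[AcceptsTextPrompt, ProducesPerTokenLogProbabilities])
class LogProbTest:
    pass
```

Declaring both sides up front lets a runner reject an incompatible Test/SUT pair before any request is sent, for example when missing_capabilities(LogProbTest, TextOnlySUT) is non-empty.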
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
- Tests and SUTs now have a member variable UID, which gets passed into their constructor.
- Introduced @newhelm_test and @newhelm_sut decorators to give us better hooks into user code.
- New command list-suts to tell you what secrets each SUT uses.
- Bug fixes for SafeTest, max_test_items, and our integration with Together
Full Changelog: v0.2.3...v0.2.4
v0.2.3
What's Changed
- The results from a test in TestRecord switched from List[Result] to a test-specific TypedData. This allows Tests to report their results in a more natural structured form, as well as provide documentation on what that form is.
- More SAFE tests, including benign tests.
Full Changelog: v0.2.2...v0.2.3