Releases: mlcommons/modelgauge

v0.5.1

26 Apr 21:10
79283fd
Pre-release

What's Changed

  • Updated docs
  • SafeTest is now compatible with Python 3.11+
  • Add new Llama Guard 2 to LlamaGuardAnnotator
    • LlamaGuardAnnotator accepts an optional llama_guard_version parameter, which defaults to Llama Guard 2.
    • Minor changes to prompt/category formatting for Llama Guard 1. This may affect results.
  • SafeTest can also be configured to use Llama Guard 1 or 2 as its annotator. Defaults to version 2.
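
As a rough illustration of the "optional parameter that defaults to version 2" pattern described above, here is a minimal sketch. GuardVersion and AnnotatorConfig are hypothetical stand-ins invented for this example; they are not the real LlamaGuardAnnotator API, whose constructor may take other arguments.

```python
from dataclasses import dataclass
from enum import Enum


class GuardVersion(Enum):
    """Hypothetical stand-in for the annotator's version choices."""

    VERSION_1 = "Llama Guard 1"
    VERSION_2 = "Llama Guard 2"


@dataclass
class AnnotatorConfig:
    """Hypothetical config holder; not the real LlamaGuardAnnotator."""

    # Mirrors the release note: the parameter is optional and defaults to version 2.
    llama_guard_version: GuardVersion = GuardVersion.VERSION_2


# Omitting the parameter selects the default version.
default_config = AnnotatorConfig()
# Passing it explicitly opts back in to the older version.
legacy_config = AnnotatorConfig(llama_guard_version=GuardVersion.VERSION_1)

print(default_config.llama_guard_version.value)
print(legacy_config.llama_guard_version.value)
```

Callers that compare results across releases may want to pin the version explicitly rather than rely on the default, since the default changed in this release.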

Full Changelog: v0.5.0...v0.5.1

v0.5.0

15 Apr 22:35
2e81a6c
Pre-release

What's Changed

  • Renamed to ModelGauge and started pushing to PyPI!
  • A whole bunch of cleanups and preparation for the more public release.
  • Caching now supports dicts.
  • Unit tests to ensure you can install from PyPI and run in a notebook.
  • Expand range of supported Python versions to 3.10 and up.
  • Remove benign hazard from SafeTest.
  • Start setting up ReadTheDocs.
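
For readers wondering what caching dict-valued data involves, one common approach is to derive a deterministic key from the dict's contents. The sketch below assumes the dict is JSON-serializable; cache_key is a hypothetical helper written for this example and is not necessarily how ModelGauge implements its cache.

```python
import hashlib
import json


def cache_key(request: dict) -> str:
    """Return a deterministic key for a JSON-serializable dict (hypothetical helper)."""
    # sort_keys makes the serialization independent of insertion order.
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


cache = {}
cache[cache_key({"prompt": "hello", "max_tokens": 16})] = {"text": "hi there"}

# Same content with a different key order still produces a cache hit.
hit = cache.get(cache_key({"max_tokens": 16, "prompt": "hello"}))
print(hit)
```

Hashing a canonical serialization avoids Python's built-in hash(), which is salted per process for strings and so cannot be used for an on-disk cache.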

Full Changelog: v0.3.3...v0.5.0

v0.3.3

09 Apr 23:00
4088c92
Pre-release

What's Changed

  • Change SafeTest to data_april04 release.
    • More prompts
    • Removed safe-ben

Full Changelog: v0.3.2...v0.3.3

v0.3.2

09 Apr 21:50
Pre-release

What's Changed

  • max_test_items returns a relatively stable set of prompts
  • Loading bar for plugins
  • Have list command report prettier values for secrets
  • Time out requests stuck on TogetherAI
  • Updated docs
  • Move simple_test_runner out of plugins and into core library
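
To illustrate what a "relatively stable" subset can mean, here is a minimal sketch that ranks prompts by a content hash and keeps the lowest-ranked ones. stable_subset is a hypothetical function for this example; the actual selection logic behind max_test_items may differ.

```python
import hashlib
from typing import List


def stable_subset(prompts: List[str], max_items: int) -> List[str]:
    """Pick up to max_items prompts by hash rank (hypothetical sketch)."""

    def rank(prompt: str) -> str:
        # A content hash gives each prompt a fixed position, whatever the input order.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    return sorted(prompts, key=rank)[:max_items]


prompts = [f"prompt {i}" for i in range(10)]
first = stable_subset(prompts, 3)
second = stable_subset(list(reversed(prompts)), 3)
print(first == second)

# Adding one prompt to the pool displaces at most one previously selected prompt.
grown = stable_subset(prompts + ["prompt 10"], 3)
print(len(set(first) - set(grown)))
```

Because the selection depends only on prompt content, repeated runs request the same prompts, which keeps previously cached responses useful.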

Full Changelog: v0.3.1...v0.3.2

v0.3.1

03 Apr 17:13
daf4e5c
Pre-release

What's Changed

  • Fix bad version specification for together dependency, which was causing 0.3.0 to not actually install.
  • Add Deepseek model that is now available on Together.
  • Stabilize the order of TestItems in SafeTest to better utilize caching.

Full Changelog: v0.3.0...v0.3.1

v0.3.0

02 Apr 22:03
089b5d4
Pre-release

What's Changed

  • Reorganized the run_data folder and made several improvements to caching. This breaks backward compatibility. Old files should simply be ignored, but if you run into issues, it is probably best to delete your run_data folder.
  • Updated SafeTest to 02apr2024.
  • We now have all SUTs in the requested set, minus Deepseek.
  • Simplified the command line to be newhelm once installed or poetry run newhelm when using the local repo.
  • Annotations are now recorded per completion instead of per TestItem.
  • HuggingFace sets pad token to default, which should remove warning messages.
  • Added some enforcement of SUTCapabilities to help them be accurate.
  • Remove all "Base" prefixes except BaseTest.

Full Changelog: v0.2.6...v0.3.0

v0.2.6

28 Mar 19:57
e77d8c0
Pre-release

What's Changed

  • Bug fix for SafeTest

Full Changelog: v0.2.5...v0.2.6

v0.2.5

27 Mar 23:31
03c8fae
Pre-release

What's Changed

  • Tests no longer have a get_metadata() method. Dependency helper uses a Test's class name instead.
  • Introduced the concept of SUT capabilities (ProducesPerTokenLogProbabilities, AcceptsChatPrompt, AcceptsTextPrompt). SUTs and Tests must specify their capabilities/requirements in the @newhelm_sut and @newhelm_test decorators.
  • SUTs can now return per-token log probabilities in a SUTCompletion. OpenAIChat is updated with this capability.
  • SafeTest updates:
    • Re-structured to have one test per hazard, grouping all applicable persona types (typical, malicious, or vulnerable).
    • Results are reported as mapping from persona type to PersonaResult, which consists of num_items in addition to frac_safe.
    • Added tests for new hazards
  • Added new test DiscrimEval
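
The result shape described for SafeTest above (a mapping from persona type to a PersonaResult with frac_safe and num_items) can be sketched as follows. The field names come from the release note; the dataclass layout and the summarize helper are assumptions made for this example, not the library's actual definitions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PersonaResult:
    """Assumed layout: field names from the release note, types guessed."""

    frac_safe: float
    num_items: int


def summarize(judgments: Dict[str, List[bool]]) -> Dict[str, PersonaResult]:
    """Aggregate per-item safe/unsafe flags by persona type (hypothetical helper)."""
    results = {}
    for persona, is_safe in judgments.items():
        if not is_safe:
            continue  # skip personas with no items to avoid dividing by zero
        results[persona] = PersonaResult(
            frac_safe=sum(is_safe) / len(is_safe),
            num_items=len(is_safe),
        )
    return results


example = summarize(
    {
        "typical": [True, True, False, True],
        "malicious": [True, False],
    }
)
print(example)
```

Reporting num_items alongside frac_safe lets a reader judge how much weight a fraction deserves, since a fraction over two items is far noisier than one over hundreds.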

Full Changelog: v0.2.4...v0.2.5

v0.2.4

21 Mar 17:07
2521208
Pre-release

What's Changed

  • Tests and SUTs now have a member variable UID, which gets passed into their constructor.
  • Introduced @newhelm_test and @newhelm_sut decorators to give us better hooks into user code.
  • New command list-suts to tell you what secrets each SUT uses.
  • Bug fixes for SafeTest, max_test_items, and our integration with Together.

Full Changelog: v0.2.3...v0.2.4

v0.2.3

13 Mar 20:47
77be2b2
Pre-release

What's Changed

  • The results from a test in TestRecord switched from List[Result] to a test-specific TypedData. This lets Tests report their results in a more natural structured form and document what that form is.
  • More SAFE tests, including benign tests.

Full Changelog: v0.2.2...v0.2.3