LangSmith Client SDK Implementations
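As a minimal sketch of what the Python client from this SDK looks like in use (assuming `pip install langsmith` and a LANGCHAIN_API_KEY environment variable; helper names can shift between SDK versions):

```python
# Minimal sketch: log a tiny dataset and trace a function with the
# LangSmith Python SDK. Assumes LANGCHAIN_API_KEY is set; names may
# differ slightly across SDK versions.
from langsmith import Client, traceable

client = Client()

# Create a small evaluation dataset with one example.
dataset = client.create_dataset("smoke-test", description="Tiny example dataset")
client.create_example(
    inputs={"question": "What is 2 + 2?"},
    outputs={"answer": "4"},
    dataset_id=dataset.id,
)

# Trace an arbitrary function so its runs show up in the LangSmith UI.
@traceable(name="toy_chain")
def toy_chain(question: str) -> str:
    return "4" if "2 + 2" in question else "unknown"

print(toy_chain("What is 2 + 2?"))
```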
UNCode is an online platform for frequent practice and automatic evaluation of computer programming, Jupyter Notebooks and hardware description language (VHDL/Verilog) assignments. It also provides a pluggable interface with your existing LMS.
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
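For a sense of how this kind of observability layer is wired into application code, here is a minimal sketch using the Langfuse Python SDK's decorator interface (the v2-style `observe` decorator is an assumption; sending traces requires LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY):

```python
# Minimal sketch: trace an LLM-style function with Langfuse's decorator
# interface (assumes `pip install langfuse` v2.x and the LANGFUSE_* keys).
from langfuse.decorators import observe

@observe()  # records this call, its inputs, and its output as a trace
def answer(question: str) -> str:
    # Placeholder "model call" purely for illustration.
    return f"You asked: {question}"

print(answer("What does this platform track?"))
```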
INGInious is a secure and automated exercise assessment platform using your own tests, also providing a pluggable interface with your existing LMS.
Chess engine
Toolkit for evaluating and monitoring AI models in clinical settings
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate experiments and evaluations using Azure Cognitive Search and the RAG pattern.
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ HF models, and 20+ benchmarks.
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
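As a hedged illustration of how such preconfigured checks are typically invoked (the `EvalLLM`/`Evals` names below follow UpTrain's documented quickstart but should be treated as assumptions that may change between releases):

```python
# Hedged sketch: run two preconfigured UpTrain checks over one sample
# (assumes `pip install uptrain` and an OpenAI API key).
from uptrain import EvalLLM, Evals

data = [{
    "question": "Which city hosted the 2024 Olympics?",
    "context": "The 2024 Summer Olympics were held in Paris, France.",
    "response": "The 2024 Olympics were hosted by Paris.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # replace with a real key
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY],
)
print(results)
```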
Latxa: An Open Language Model and Evaluation Suite for Basque
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
An open-source visual programming environment for battle-testing prompts to LLMs.
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Pip-compatible CodeBLEU metric implementation, available for Linux/macOS/Windows.
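A minimal sketch of how a pip-installable CodeBLEU implementation is typically called (assuming a `codebleu` package exposing `calc_codebleu`; argument names may differ between releases):

```python
# Hedged sketch: score a generated snippet against a reference with CodeBLEU
# (assumes `pip install codebleu`; one reference per prediction).
from codebleu import calc_codebleu

reference = "def add(a, b):\n    return a + b"
prediction = "def add(x, y):\n    return x + y"

# The result blends n-gram, weighted n-gram, AST, and data-flow matches.
result = calc_codebleu([reference], [prediction], lang="python")
print(result)
```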
FuzzBench - Fuzzer benchmarking as a service.
Data release for the ImageInWords (IIW) paper.
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs.