🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
🐢 Open-Source Evaluation & Testing framework for LLMs and ML models
Test your prompts, models, and RAG pipelines. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models, with CI/CD integration.
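To make "catch regressions ... with CI/CD integration" concrete, here is a minimal, hypothetical sketch of such a check as a pytest test; the `generate_answer` helper and the test cases are illustrative stand-ins, not the API of any specific framework listed on this page.

```python
# Hypothetical sketch of an LLM regression test that could run in CI.
# generate_answer stands in for whatever model client the project actually
# calls (OpenAI, Anthropic, Ollama, ...); the checks are illustrative only.
import pytest

REGRESSION_CASES = [
    # (question, substring the answer must contain)
    ("What is the capital of France?", "Paris"),
    ("How many days are in a leap year?", "366"),
]

def generate_answer(question: str) -> str:
    """Placeholder for a real model call."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "How many days are in a leap year?": "A leap year has 366 days.",
    }
    return canned[question]

@pytest.mark.parametrize("question,expected", REGRESSION_CASES)
def test_prompt_regression(question: str, expected: str) -> None:
    # Fails the CI run if a prompt or model change breaks a known-good answer.
    answer = generate_answer(question)
    assert expected.lower() in answer.lower()
```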
The LLM Evaluation Framework
The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment in one place.
Open-Source Evaluation for GenAI Application Pipelines
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models for LLM evaluation (e.g. ChatGPT, LLaMA, GLM, Baichuan).
Awesome papers involving LLMs in Social Science.
Python SDK for running evaluations on LLM-generated responses
Superpipe - optimized LLM pipelines for structured data
Evaluating LLMs with CommonGen-Lite
Framework for LLM evaluation, guardrails and security
A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.
A list of LLMs Tools & Projects
Initiative to evaluate and rank the most popular LLMs across common task types based on their propensity to hallucinate.
Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
A collection of hands-on notebooks for LLM practitioners
The implementation for the EMNLP 2023 paper "Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators"
A framework for building scenario-simulation projects in which human and LLM-based agents can participate, with a user-friendly web UI to visualize simulations and support for automatic evaluation at the agent-action level.
DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models