llm-evaluation

Here are 51 public repositories matching this topic...

confident-ai / deepeval

The LLM Evaluation Framework

evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated May 24, 2024
Python

Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.

testing ci evaluation ci-cd cicd prompts evaluation-framework rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated May 24, 2024
TypeScript

Agenta-AI / agenta

The all-in-one LLM developer platform: prompt management, evaluation, human feedback, and deployment all in one place.

prompt-toolkit rag human-annotation large-language-models llm prompt-engineering llms langchain llmops llama-index prompt-management llm-tools llm-framework llm-evaluation rag-evaluation

Updated May 24, 2024
Python

langfuse / langfuse

🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai gpt observability large-language-models llm prompt-engineering langchain llmops llama-index prompt-management evals llm-evaluation

Updated May 23, 2024
TypeScript

evaluation-tools / nutcracker

Large Model Evaluation Experiments

large-language-models llm llmops llm-evaluation

Updated May 23, 2024
Python

prompt-foundry / typescript-sdk

The TypeScript SDK for Prompt Foundry, a tool for prompt engineering, prompt management, and prompt testing

typescript open-ai prompt-engineering prompt-testing prompt-manager prompt-management llm-eval llm-test llm-evaluation prompt-evaluation

Updated May 23, 2024
TypeScript

parea-ai / parea-sdk-ts

TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

llm prompt-engineering llms llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated May 23, 2024
TypeScript

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated May 23, 2024
Python

intuit-ai-research / DCR-consistency

DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

consistency summarization blackbox divide-and-conquer-approach hallucinations large-language-models llm llm-evaluation

Updated May 23, 2024
Python

Giskard-AI / giskard

🐢 Open-Source Evaluation & Testing for LLMs and ML models

Updated May 23, 2024
Python

onejune2018 / Awesome-LLM-Eval

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating foundation LLMs, with the aim of exploring the technical boundaries of generative AI.

nlp benchmark machine-learning leaderboard evaluation dataset openai llama bert rag awsome-list gpt3 llm awsome-lists chatgpt large-language-model chatglm qwen llm-evaluation

Updated May 23, 2024

relari-ai / continuous-eval

Open-Source Evaluation for GenAI Application Pipelines

information-retrieval evaluation-metrics evaluation-framework rag llmops retrieval-augmented-generation llm-evaluation

Updated May 23, 2024
Python

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated May 24, 2024
Python

raga-ai-hub / raga-llm-hub

Framework for LLM evaluation, guardrails and security

guardrails llmops llm-security llm-evaluation

Updated May 21, 2024
Python

villagecomputing / superpipe

Superpipe - optimized LLM pipelines for structured data

classification data-extraction structured-data data-labeling llm llm-evaluation llm-optimization

Updated May 20, 2024
Python

Chainlit / literal-cookbook

Cookbooks and tutorials on Literal AI

rag llm prompt-engineering llm-evaluation

Updated May 23, 2024
Jupyter Notebook

deshwalmahesh / PHUDGE

Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without a custom rubric or reference answer, using absolute or relative grading, and much more. It also contains a list of available tools, methods, repos, and code for hallucination detection, LLM evaluation, grading, and more.