[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
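As a minimal sketch of the Langfuse OpenAI integration: the SDK ships a drop-in replacement client that records each completion call as a trace. This assumes the langfuse Python SDK is installed and the LANGFUSE_* and OPENAI_API_KEY environment variables are set; the model name and prompt are illustrative.

```python
# Traced drop-in replacement for the OpenAI client (Langfuse integration).
from langfuse.openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Say hello."}],
)
# The call above is logged to Langfuse as a trace with inputs, outputs, and latency.
print(resp.choices[0].message.content)
```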
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
This is the official PyTorch implementation of "LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models", and also an efficient LLM compression tool with various advanced compression methods, supporting multiple inference backends.
Moodle plugin for running evaluations: the evaluation activity plugin.
Latxa: An Open Language Model and Evaluation Suite for Basque
Evaluate your LLM's response with Prometheus and GPT4 💯
LangSmith Client SDK Implementations
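A minimal sketch of the Python client: the `@traceable` decorator records a function call as a run in LangSmith. This assumes the LANGSMITH_API_KEY environment variable is set; the `summarize` function below is illustrative, not part of the SDK.

```python
from langsmith import traceable

@traceable(run_type="chain", name="summarize")
def summarize(text: str) -> str:
    # Replace with a real model call; inputs and outputs are logged to LangSmith.
    return text.split(".")[0]

print(summarize("LangSmith records this call as a run. The rest is dropped."))
```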
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
The official evaluation suite and dynamic data release for MixEval.
The production toolkit for LLMs. Observability, prompt management and evaluations.
A version of eval for R that returns more information about what happened
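To illustrate the idea (in Python rather than R, and not the package's actual API): instead of returning only a value, a richer eval also captures printed output and any error. `rich_eval` is a hypothetical helper written for this sketch.

```python
# Hypothetical Python analogue of an eval() that reports more about what
# happened: the value, captured stdout, and any error traceback.
import io
import traceback
from contextlib import redirect_stdout

def rich_eval(expr: str, env: dict | None = None) -> dict:
    env = env or {}
    buf = io.StringIO()
    result = {"value": None, "stdout": "", "error": None}
    try:
        with redirect_stdout(buf):
            result["value"] = eval(expr, env)  # demo only; eval untrusted input is unsafe
    except Exception:
        result["error"] = traceback.format_exc()
    result["stdout"] = buf.getvalue()
    return result

print(rich_eval("print('hi') or 1 + 1"))
# {'value': 2, 'stdout': 'hi\n', 'error': None}
```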
Toolkit for evaluating and monitoring AI models in clinical settings
🤖 Build AI applications with confidence ✅ DSPy Visualizer ✅ Understand how your users are using your LLM-app ✅ Get a full picture of the quality performance of your LLM-app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM-app.
Documentation for LangSmith
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
A collection of Machine Learning & Deep Learning topics, algorithms, and projects I have worked on.
Evaluation tools for time series machine learning algorithms.
Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.