LangSmith Client SDK Implementations
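As a minimal sketch of what the Python client from this SDK looks like in use (assuming `pip install langsmith` and a LANGCHAIN_API_KEY environment variable; helper names can shift between SDK versions):

```python
# Minimal sketch: log a tiny dataset and trace a function with the
# LangSmith Python SDK. Assumes LANGCHAIN_API_KEY is set; names may
# differ slightly across SDK versions.
from langsmith import Client, traceable

client = Client()

# Create a small evaluation dataset with one example.
dataset = client.create_dataset("smoke-test", description="Tiny example dataset")
client.create_example(
    inputs={"question": "What is 2 + 2?"},
    outputs={"answer": "4"},
    dataset_id=dataset.id,
)

# Trace an arbitrary function so its runs show up in the LangSmith UI.
@traceable(name="toy_chain")
def toy_chain(question: str) -> str:
    return "4" if "2 + 2" in question else "unknown"

print(toy_chain("What is 2 + 2?"))
```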
UNCode is an online platform for frequent practice and automatic evaluation of computer programming, Jupyter Notebooks and hardware description language (VHDL/Verilog) assignments. It also provides a pluggable interface with your existing LMS.
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
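For a sense of how this kind of observability layer is wired into application code, here is a minimal sketch using the Langfuse Python SDK's decorator interface (the v2-style `observe` decorator is an assumption; sending traces requires LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY):

```python
# Minimal sketch: trace an LLM-style function with Langfuse's decorator
# interface (assumes `pip install langfuse` v2.x and the LANGFUSE_* keys).
from langfuse.decorators import observe

@observe()  # records this call, its inputs, and its output as a trace
def answer(question: str) -> str:
    # Placeholder "model call" purely for illustration.
    return f"You asked: {question}"

print(answer("What does this platform track?"))
```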
INGInious is a secure and automated exercise assessment platform using your own tests, also providing a pluggable interface with your existing LMS.
Chess engine
Toolkit for evaluating and monitoring AI models in clinical settings
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate experiments and evaluations using Azure Cognitive Search and the RAG pattern.
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Open-source evaluation toolkit for large vision-language models (LVLMs), supporting GPT-4V, Gemini, QwenVLPlus, 40+ HF models, and 20+ benchmarks.
UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
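As a hedged illustration of how such preconfigured checks are typically invoked (the `EvalLLM`/`Evals` names below follow UpTrain's documented quickstart but should be treated as assumptions that may change between releases):

```python
# Hedged sketch: run two preconfigured UpTrain checks over one sample
# (assumes `pip install uptrain` and an OpenAI API key).
from uptrain import EvalLLM, Evals

data = [{
    "question": "Which city hosted the 2024 Olympics?",
    "context": "The 2024 Summer Olympics were held in Paris, France.",
    "response": "The 2024 Olympics were hosted by Paris.",
}]

eval_llm = EvalLLM(openai_api_key="sk-...")  # replace with a real key
results = eval_llm.evaluate(
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY],
)
print(results)
```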
Latxa: An Open Language Model and Evaluation Suite for Basque
Test your prompts, models, and RAGs. Catch regressions and improve prompt quality. LLM evals for OpenAI, Azure, Anthropic, Gemini, Mistral, Llama, Bedrock, Ollama, and other local & private models with CI/CD integration.
An open-source visual programming environment for battle-testing prompts to LLMs.
A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Pip-compatible CodeBLEU metric implementation, available for Linux/macOS/Windows.
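A minimal sketch of how a pip-installable CodeBLEU implementation is typically called (assuming a `codebleu` package exposing `calc_codebleu`; argument names may differ between releases):

```python
# Hedged sketch: score a generated snippet against a reference with CodeBLEU
# (assumes `pip install codebleu`; one reference per prediction).
from codebleu import calc_codebleu

reference = "def add(a, b):\n    return a + b"
prediction = "def add(x, y):\n    return x + y"

# The result blends n-gram, weighted n-gram, AST, and data-flow matches.
result = calc_codebleu([reference], [prediction], lang="python")
print(result)
```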
FuzzBench - Fuzzer benchmarking as a service.
Data release for the ImageInWords (IIW) paper.
Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs.