[ACL 2024 Findings] "TempCompass: Do Video LLMs Really Understand Videos?", Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
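As a minimal sketch of the Langfuse OpenAI integration: the SDK ships a drop-in replacement client that records each completion call as a trace. This assumes the langfuse Python SDK is installed and the LANGFUSE_* and OPENAI_API_KEY environment variables are set; the model name and prompt are illustrative.

```python
# Traced drop-in replacement for the OpenAI client (Langfuse integration).
from langfuse.openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Say hello."}],
)
# The call above is logged to Langfuse as a trace with inputs, outputs, and latency.
print(resp.choices[0].message.content)
```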
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
This is the official PyTorch implementation of "LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models", and also an efficient LLM compression tool with various advanced compression methods, supporting multiple inference backends.
Moodle plugin for running evaluations: the evaluation activity plugin.
Latxa: An Open Language Model and Evaluation Suite for Basque
Evaluate your LLM's response with Prometheus and GPT4 💯
LangSmith Client SDK Implementations
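A minimal sketch of the Python client: the `@traceable` decorator records a function call as a run in LangSmith. This assumes the LANGSMITH_API_KEY environment variable is set; the `summarize` function below is illustrative, not part of the SDK.

```python
from langsmith import traceable

@traceable(run_type="chain", name="summarize")
def summarize(text: str) -> str:
    # Replace with a real model call; inputs and outputs are logged to LangSmith.
    return text.split(".")[0]

print(summarize("LangSmith records this call as a run. The rest is dropped."))
```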
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
The official evaluation suite and dynamic data release for MixEval.
The production toolkit for LLMs. Observability, prompt management and evaluations.
A version of eval for R that returns more information about what happened
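To illustrate the idea (in Python rather than R, and not the package's actual API): instead of returning only a value, a richer eval also captures printed output and any error. `rich_eval` is a hypothetical helper written for this sketch.

```python
# Hypothetical Python analogue of an eval() that reports more about what
# happened: the value, captured stdout, and any error traceback.
import io
import traceback
from contextlib import redirect_stdout

def rich_eval(expr: str, env: dict | None = None) -> dict:
    env = env or {}
    buf = io.StringIO()
    result = {"value": None, "stdout": "", "error": None}
    try:
        with redirect_stdout(buf):
            result["value"] = eval(expr, env)  # demo only; eval untrusted input is unsafe
    except Exception:
        result["error"] = traceback.format_exc()
    result["stdout"] = buf.getvalue()
    return result

print(rich_eval("print('hi') or 1 + 1"))
# {'value': 2, 'stdout': 'hi\n', 'error': None}
```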
Toolkit for evaluating and monitoring AI models in clinical settings
🤖 Build AI applications with confidence ✅ DSPy Visualizer ✅ Understand how your users are using your LLM-app ✅ Get a full picture of the quality performance of your LLM-app ✅ Collaborate with your stakeholders in ONE platform ✅ Iterate towards the most valuable & reliable LLM-app.
Documentation for LangSmith
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
A collection of Machine Learning & Deep Learning topics, algorithms, and projects I have worked on.
Evaluation tools for time series machine learning algorithms.
Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.