LOOM-Scope: LOng-cOntext Model Evaluation Framework
Comprehensive and Efficient Long-context Model Evaluation Framework
LOOM-Scope Demonstration
Low-code interaction with LOOM-Scope
Comprehensive Evaluation Framework
Comprehensive Benchmarks
22 benchmarks and 149 tasks, covering general, faithfulness, retrieval, reasoning, generation, and specialization capabilities.
Inference Augmentation
11 acceleration methods, covering token eviction, quantization, token merging, and sparse attention (a minimal eviction sketch follows this feature list).
Low-Code Interaction
Diverse deployment servers, a local WebUI, and a low-code command line.
Custom Evaluation
LOOMBench, user-defined chat templates, and new benchmarks (a chat-template sketch follows this feature list).
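To make the acceleration categories above concrete, here is a minimal sketch of attention-based token eviction in the style of heavy-hitter methods such as H2O. The function name, shapes, and fixed-budget policy are illustrative assumptions for exposition, not LOOM-Scope's implementation:

```python
import numpy as np

def evict_tokens(keys, values, attn_mass, budget):
    """Keep only the `budget` cached tokens with the highest accumulated
    attention mass (heavy-hitter-style KV-cache eviction).

    keys, values: (seq_len, head_dim) cached key/value tensors
    attn_mass:    (seq_len,) attention mass each cached token has
                  received from recent queries
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Indices of the `budget` heaviest tokens, restored to original
    # order so positional structure is preserved.
    keep = np.sort(np.argsort(attn_mass)[-budget:])
    return keys[keep], values[keep]

# Toy usage: an 8-token cache squeezed down to a budget of 4.
rng = np.random.default_rng(0)
k, v = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
k_small, v_small = evict_tokens(k, v, rng.random(8), budget=4)
print(k_small.shape)  # (4, 64)
```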
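Similarly, a user-defined chat template can be pictured as a small hook that wraps each raw benchmark prompt in the target model's chat format. The function name and dispatch logic below are illustrative assumptions, not LOOM-Scope's actual extension interface; `apply_chat_template` is the standard Hugging Face `transformers` tokenizer API:

```python
# Hypothetical chat-template hook; LOOM-Scope's real interface may differ.
def build_chat(tokenizer, prompt: str, model_name: str) -> str:
    """Wrap a raw benchmark prompt in the target model's chat format."""
    if "instruct" in model_name.lower():
        # Chat models: defer to the tokenizer's built-in chat template.
        messages = [{"role": "user", "content": prompt}]
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    # Base models: pass the prompt through unchanged.
    return prompt
```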
Benchmarks & Deployment Methods
A comprehensive benchmark suite and deployment methods for long-context model evaluation
LOOMBench Leaderboard
Model Performance Overview
Overall model performance across all capability dimensions and benchmarks.
Qwen Series
Llama Series
Phi Series
GLM Series
Command Series
Mistral Series
Interactive Model Comparison
Dynamic radar chart analysis for side-by-side model performance comparison across all capability dimensions
Validation Against Official Results
| Capability | Benchmark | Model | Our Evaluation | Official Score | Δ (Ours - Official) |
|---|---|---|---|---|---|
| Reasoning | babilong | Llama-3.1-8B-Instruct | 57.0 | 59.0 | -2.0 |
| | Counting_Stars | Llama-3.1-8B-Instruct | 38.2 | 43.5 | -5.3 |
| | LongBench_v2 | Llama-3.1-8B-Instruct | 30.4 | 30.0 | +0.4 |
| | LongIns | ChatGLM2-6B | 11.5 | 12.0 | -0.5 |
| | LVEval | Llama2-7B-32k-Instruct | 7.3 | 7.4 | -0.1 |
| | Ada-LEval | ChatGLM3-6B-32k | 6.9 | 6.8 | +0.1 |
| General | LEval | Llama-3.1-8B-Instruct | 60.5 | 58.7 | +1.8 |
| | LongBench | ChatGLM2-6B | 24.9 | 25.7 | -0.8 |
| | LooGLE | ChatGLM2-6B-32k | 19.6 | 15.1 | +4.5 |
| | RULER | Llama-3.1-8B-Instruct | 90.7 | 88.3 | +2.4 |
| | BAMBOO | ChatGLM2-6B-32k | 21.6 | 19.3 | +2.3 |
| Retrieval | InfiniteBench | ChatGLM3-6B-128k | 24.5 | 19.5 | +5.0 |
| | NIAH | Llama-3.1-8B-Instruct | 97.6 | - | N/A |
| | NThread | Llama-3.1-8B-Instruct | 34.3 | 41.4 | -7.1 |
| | NoLiMa | Llama-3-8B | 36.1 | 38.8 | -2.7 |
| Generation | LongWriter | Llama-3.1-8B-Instruct | 58.5 | 60.3 | -1.8 |
| Specialization | LIBRA | Llama-3-8B-Instruct | 56.8 | 57.4 | -0.6 |
| | LongHealth | longchat-7b-v1.5-32k | 11.8 | 9.6 | +2.2 |
| | CLongEval | ChatGLM3-6B | 25.2 | 25.0 | +0.2 |
| | LongSafety | Llama-3.1-8B-Instruct | 13.4 | 11.1 | +2.3 |
| Faithfulness | L_CiteEval | Llama-3.1-8B-Instruct | 27.7 | 25.5 | +2.2 |
| | LongCite | Llama-3.1-8B-Instruct | 52.10 | 45.79 | +6.31 |
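The Δ column is simply our evaluation minus the official score. A few spot checks, with values transcribed from the table above:

```python
# Recompute the delta column as (our score - official score); NIAH is
# omitted because no official score is reported for it.
scores = {
    "babilong":   (57.0, 59.0),
    "RULER":      (90.7, 88.3),
    "LongHealth": (11.8, 9.6),
    "LongCite":   (52.10, 45.79),
}
for benchmark, (ours, official) in scores.items():
    print(f"{benchmark}: {ours - official:+.2f}")
# babilong: -2.00, RULER: +2.40, LongHealth: +2.20, LongCite: +6.31
```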
Citation
Reference this work in your research and publications
```bibtex
@misc{loom_scope_2025,
  title       = {LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework},
  author      = {Zecheng Tang and Haitian Wang and Quantong Qiu and Baibei Ji and Ruoxi Sun and Keyan Zhou and Juntao Li and Min Zhang},
  year        = {2025},
  institution = {Soochow University, China},
  note        = {Key Laboratory of Data Intelligence and Advanced Computing, Soochow University},
  url         = {https://github.com/LCM-Lab/LOOM-Scope}
}
```