LOOM-Scope: LOng-cOntext Model Evaluation Framework

Comprehensive and Efficient Long-context Model Evaluation Framework

LOOM-Scope Demonstration

Low-code interaction with LOOM-Scope

Comprehensive Evaluation Framework

Comprehensive Benchmarks

22 benchmarks and 149 tasks, covering general, faithfulness, retrieval, reasoning, generation, and specialized capabilities.

Inference Augmentation

11 acceleration methods, covering token eviction, quantization, token merging, and sparse attention (see the sketch below).
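As a concrete illustration of what a token-eviction method does, here is a minimal, self-contained sketch in the spirit of heavy-hitter selection with a recency window. This is not LOOM-Scope's implementation; the function name, signature, and scoring heuristic are assumptions for exposition only.

```python
# Illustrative sketch of KV-cache token eviction (heavy-hitter selection
# plus a recency window). NOT LOOM-Scope's implementation; all names and
# the scoring heuristic are assumptions for exposition.
import numpy as np

def evict_tokens(attn_scores: np.ndarray, budget: int, recent: int) -> np.ndarray:
    """Return indices of cache positions to KEEP.

    attn_scores: accumulated attention mass per cached token, shape (seq_len,)
    budget:      total number of tokens the pruned cache may hold
    recent:      number of most-recent tokens that are always kept
    """
    seq_len = attn_scores.shape[0]
    if seq_len <= budget:
        return np.arange(seq_len)
    # Always keep the most recent `recent` tokens.
    recent_idx = np.arange(seq_len - recent, seq_len)
    # Fill the remaining budget with the highest-scoring older tokens.
    older = attn_scores[: seq_len - recent]
    heavy_idx = np.argsort(older)[-(budget - recent):]
    return np.sort(np.concatenate([heavy_idx, recent_idx]))

# Example: a 10-token cache pruned to 6 positions (4 heavy hitters + 2 recent).
scores = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.02, 0.6, 0.01, 0.3, 0.2])
print(evict_tokens(scores, budget=6, recent=2))  # -> [0 2 4 6 8 9]
```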

Low-Code Interaction

Diverse deployment options: server, local WebUI, and a low-code command line.

Custom Evaluation

LOOMBench, user-defined chat templates, and new benchmarks (a hypothetical template sketch follows).
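LOOM-Scope's actual extension interface is documented in the repository; the snippet below is only a hypothetical sketch of what a user-defined chat template amounts to, wrapping a (system, user) pair in a model-specific prompt format. Every name here (`my_chat_template`, the role tags) is an assumption for illustration.

```python
# Hypothetical sketch of a user-defined chat template. The function name,
# signature, and role tags are assumptions; consult the LOOM-Scope
# repository for the actual extension interface.
def my_chat_template(system: str, user: str) -> str:
    """Wrap a (system, user) pair in a model-specific prompt format."""
    return f"<|system|>\n{system}\n<|user|>\n{user}\n<|assistant|>\n"

print(my_chat_template("You are a helpful assistant.", "Summarize the document."))
```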

Benchmark & Deployment Method

A comprehensive benchmark suite and deployment methods for long-context model evaluation.

LOOMBench Leaderboard

Model Performance Overview

Overall model performance across all capabilities and benchmark tests.

Model series covered: Qwen, Llama, Phi, GLM, Command, and Mistral.

Interactive Model Comparison

Dynamic radar chart analysis for side-by-side model performance comparison across all capability dimensions
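The leaderboard widget itself is interactive; the sketch below shows the same kind of radar comparison rendered offline with matplotlib. The two score vectors are placeholders, not actual LOOMBench results.

```python
# Sketch of the radar-chart comparison the leaderboard renders, using
# matplotlib. The score vectors below are placeholders, not real results.
import numpy as np
import matplotlib.pyplot as plt

axes_labels = ["Reasoning", "General", "Retrieval",
               "Generation", "Specialization", "Faithfulness"]
model_a = [60, 55, 70, 58, 50, 45]  # placeholder scores
model_b = [45, 62, 50, 40, 55, 38]  # placeholder scores

angles = np.linspace(0, 2 * np.pi, len(axes_labels), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for name, scores in [("Model A", model_a), ("Model B", model_b)]:
    vals = scores + scores[:1]
    ax.plot(angles, vals, label=name)
    ax.fill(angles, vals, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(axes_labels)
ax.legend(loc="upper right")
plt.show()
```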

Validation With Official Results

| Capability | Benchmark | Model | Our Evaluation | Official Score | Δ (Ours − Official) |
|---|---|---|---|---|---|
| Reasoning | babilong | Llama-3.1-8B-Instruct | 57.0 | 59.0 | -2.0 |
| | Counting_Stars | Llama-3.1-8B-Instruct | 38.2 | 43.5 | -5.3 |
| | LongBench_v2 | Llama-3.1-8B-Instruct | 30.4 | 30.0 | +0.4 |
| | LongIns | ChatGLM2-6B | 11.5 | 12.0 | -0.5 |
| | LVEval | Llama2-7B-32k-Instruct | 7.3 | 7.4 | -0.1 |
| | Ada-LEval | ChatGLM3-6B-32k | 6.9 | 6.8 | +0.1 |
| General | LEval | Llama-3.1-8B-Instruct | 60.5 | 58.7 | +1.8 |
| | LongBench | ChatGLM2-6B | 24.9 | 25.7 | -0.8 |
| | LooGLE | ChatGLM2-6b-32k | 19.6 | 15.1 | +4.5 |
| | RULER | Llama-3.1-8B-Instruct | 90.7 | 88.3 | +2.4 |
| | BAMBOO | ChatGLM2-32k | 21.6 | 19.3 | +2.3 |
| Retrieval | InfiniteBench | ChatGLM3-6b-128k | 24.5 | 19.5 | +5.0 |
| | NIAH | Llama-3.1-8B-Instruct | 97.6 | – | N/A |
| | NThread | Llama-3.1-8B-Instruct | 34.3 | 41.4 | -7.1 |
| | NoLiMa | Llama-3-8B | 36.1 | 38.8 | -2.7 |
| Generation | LongWriter | Llama-3.1-8B-Instruct | 58.5 | 60.3 | -1.8 |
| Specialization | LIBRA | Llama-3-8B-Instruct | 56.8 | 57.4 | -0.6 |
| | LongHealth | longchat-7b-v1.5-32k | 11.8 | 9.6 | +2.2 |
| | CLongEval | chatglm3-6b | 25.2 | 25.0 | +0.2 |
| | LongSafety | Llama-3.1-8B-Instruct | 13.4 | 11.1 | +2.3 |
| Faithfulness | L_CiteEval | Llama-3.1-8B-Instruct | 27.7 | 25.5 | +2.2 |
| | LongCite | Llama-3.1-8B-Instruct | 52.10 | 45.79 | +6.31 |
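The Δ column is Our Evaluation minus Official Score. A quick recomputation over a few rows from the table (a sanity check, not part of the framework):

```python
# Δ in the table above is (Our Evaluation − Official Score).
rows = {
    "babilong (Llama-3.1-8B-Instruct)":     (57.0, 59.0),
    "LongBench_v2 (Llama-3.1-8B-Instruct)": (30.4, 30.0),
    "LongHealth (longchat-7b-v1.5-32k)":    (11.8, 9.6),
    "LongCite (Llama-3.1-8B-Instruct)":     (52.10, 45.79),
}
for name, (ours, official) in rows.items():
    print(f"{name}: Δ = {ours - official:+.2f}")
# babilong: -2.00, LongBench_v2: +0.40, LongHealth: +2.20, LongCite: +6.31
```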

Citation

Reference this work in your research and publications

```bibtex
@misc{loom_scope_2025,
  title={LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework},
  author={Zecheng Tang and Haitian Wang and Quantong Qiu and Baibei Ji and Ruoxi Sun and Keyan Zhou and Juntao Li and Min Zhang},
  year={2025},
  institution={Soochow University, China},
  note={Key Laboratory of Data Intelligence and Advanced Computing, Soochow University},
  url={https://github.com/LCM-Lab/LOOM-Scope}
}
```