LOOM-Scope: LOng-cOntext Model Evaluation Framework
Comprehensive and Efficient Long-context Model Evaluation Framework
LOOM-Scope Demonstration
Low-code interaction with LOOM-Scope
Comprehensive Evaluation Framework
Comprehensive Benchmarks
22 benchmarks and 149 tasks, covering general, faithfulness, retrieval, reasoning, generation, and specialization capabilities.
Inference Augmentation
11 acceleration methods, covering token eviction, quantization, token merging, and sparse attention (a minimal eviction sketch follows this feature list).
Low-Code Interaction
Diverse deployment servers, a local WebUI, and a low-code command line.
Custom Evaluation
LOOMBench, user-defined chat templates, and new benchmarks (a chat-template sketch follows this feature list).
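To make the acceleration categories above concrete, here is a minimal sketch of attention-based token eviction in the style of heavy-hitter methods such as H2O. The function name, shapes, and fixed-budget policy are illustrative assumptions for exposition, not LOOM-Scope's implementation:

```python
import numpy as np

def evict_tokens(keys, values, attn_mass, budget):
    """Keep only the `budget` cached tokens with the highest accumulated
    attention mass (heavy-hitter-style KV-cache eviction).

    keys, values: (seq_len, head_dim) cached key/value tensors
    attn_mass:    (seq_len,) attention mass each cached token has
                  received from recent queries
    """
    if keys.shape[0] <= budget:
        return keys, values
    # Indices of the `budget` heaviest tokens, restored to original
    # order so positional structure is preserved.
    keep = np.sort(np.argsort(attn_mass)[-budget:])
    return keys[keep], values[keep]

# Toy usage: an 8-token cache squeezed down to a budget of 4.
rng = np.random.default_rng(0)
k, v = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
k_small, v_small = evict_tokens(k, v, rng.random(8), budget=4)
print(k_small.shape)  # (4, 64)
```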
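Similarly, a user-defined chat template can be pictured as a small hook that wraps each raw benchmark prompt in the target model's chat format. The function name and dispatch logic below are illustrative assumptions, not LOOM-Scope's actual extension interface; `apply_chat_template` is the standard Hugging Face `transformers` tokenizer API:

```python
# Hypothetical chat-template hook; LOOM-Scope's real interface may differ.
def build_chat(tokenizer, prompt: str, model_name: str) -> str:
    """Wrap a raw benchmark prompt in the target model's chat format."""
    if "instruct" in model_name.lower():
        # Chat models: defer to the tokenizer's built-in chat template.
        messages = [{"role": "user", "content": prompt}]
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    # Base models: pass the prompt through unchanged.
    return prompt
```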
Benchmarks & Deployment Methods
A comprehensive benchmark suite and deployment methods for long-context model evaluation
LOOMBench Leaderboard
Model Performance Overview
Overall model performance across all capability dimensions and benchmarks.
Qwen Series
Llama Series
Phi Series
GLM Series
Command Series
Mistral Series
Interactive Model Comparison
Dynamic radar chart analysis for side-by-side model performance comparison across all capability dimensions
Validation Against Official Results
| Capability | Benchmark | Model | Our Evaluation | Official Score | Δ (Ours - Official) |
|---|---|---|---|---|---|
| Reasoning | babilong | Llama-3.1-8B-Instruct | 57.0 | 59.0 | -2.0 |
| | Counting_Stars | Llama-3.1-8B-Instruct | 38.2 | 43.5 | -5.3 |
| | LongBench_v2 | Llama-3.1-8B-Instruct | 30.4 | 30.0 | +0.4 |
| | LongIns | ChatGLM2-6B | 11.5 | 12.0 | -0.5 |
| | LVEval | Llama2-7B-32k-Instruct | 7.3 | 7.4 | -0.1 |
| | Ada-LEval | ChatGLM3-6B-32k | 6.9 | 6.8 | +0.1 |
| General | LEval | Llama-3.1-8B-Instruct | 60.5 | 58.7 | +1.8 |
| | LongBench | ChatGLM2-6B | 24.9 | 25.7 | -0.8 |
| | LooGLE | ChatGLM2-6B-32k | 19.6 | 15.1 | +4.5 |
| | RULER | Llama-3.1-8B-Instruct | 90.7 | 88.3 | +2.4 |
| | BAMBOO | ChatGLM2-6B-32k | 21.6 | 19.3 | +2.3 |
| Retrieval | InfiniteBench | ChatGLM3-6B-128k | 24.5 | 19.5 | +5.0 |
| | NIAH | Llama-3.1-8B-Instruct | 97.6 | - | N/A |
| | NThread | Llama-3.1-8B-Instruct | 34.3 | 41.4 | -7.1 |
| | NoLiMa | Llama-3-8B | 36.1 | 38.8 | -2.7 |
| Generation | LongWriter | Llama-3.1-8B-Instruct | 58.5 | 60.3 | -1.8 |
| Specialization | LIBRA | Llama-3-8B-Instruct | 56.8 | 57.4 | -0.6 |
| | LongHealth | longchat-7b-v1.5-32k | 11.8 | 9.6 | +2.2 |
| | CLongEval | ChatGLM3-6B | 25.2 | 25.0 | +0.2 |
| | LongSafety | Llama-3.1-8B-Instruct | 13.4 | 11.1 | +2.3 |
| Faithfulness | L_CiteEval | Llama-3.1-8B-Instruct | 27.7 | 25.5 | +2.2 |
| | LongCite | Llama-3.1-8B-Instruct | 52.10 | 45.79 | +6.31 |
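The Δ column is simply our evaluation minus the official score. A few spot checks, with values transcribed from the table above:

```python
# Recompute the delta column as (our score - official score); NIAH is
# omitted because no official score is reported for it.
scores = {
    "babilong":   (57.0, 59.0),
    "RULER":      (90.7, 88.3),
    "LongHealth": (11.8, 9.6),
    "LongCite":   (52.10, 45.79),
}
for benchmark, (ours, official) in scores.items():
    print(f"{benchmark}: {ours - official:+.2f}")
# babilong: -2.00, RULER: +2.40, LongHealth: +2.20, LongCite: +6.31
```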
Citation
Reference this work in your research and publications
```bibtex
@misc{loom_scope_2025,
  title       = {LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework},
  author      = {Zecheng Tang and Haitian Wang and Quantong Qiu and Baibei Ji and Ruoxi Sun and Keyan Zhou and Juntao Li and Min Zhang},
  year        = {2025},
  institution = {Soochow University, China},
  note        = {Key Laboratory of Data Intelligence and Advanced Computing, Soochow University},
  url         = {https://github.com/LCM-Lab/LOOM-Scope}
}
```