Little Brains, Big Feats: Exploring Compact Language Models

Authors: Dari Baturova, Elena Bruches, Ivan Chernov, Roman Derunets, Arsenii Fomin, Andrey Kostin

arXiv ID: 2606.30062

Problem: Can small language models (SLMs) serve as viable, GPU-free replacements for large language models in the generation stage of Retrieval-Augmented Generation (RAG) systems?

Key Methodology:

Benchmarked 16 quantized SLMs (1.5B–8B parameters) on Russian-language RAG generation using a curated dataset spanning diverse subjects, evaluated via LLM-as-Judge on Correctness, Answer Relevance, and Faithfulness.
Compared context mode (query + retrieved documents) vs. no-context mode to isolate the impact of retrieved context on answer quality.
All models were run exclusively on CPU hardware to simulate real-world on-device deployment without GPU acceleration.
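
The setup above could be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the prompt wording, the function names (`build_prompt`, `timed_generate`, `aggregate`), and the metric keys are all assumptions made for the example, and the model call is a stand-in callable rather than a real inference backend.

```python
import time
from statistics import mean
from typing import Callable, Optional, Sequence

# Hypothetical metric keys mirroring the three judge criteria named above.
METRICS = ("correctness", "answer_relevance", "faithfulness")


def build_prompt(query: str, documents: Optional[Sequence[str]] = None) -> str:
    """Context mode when documents are given, no-context mode otherwise."""
    if not documents:
        return f"Question: {query}\nAnswer:"
    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def timed_generate(generate: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Run any text-generation callable and return (answer, wall-clock seconds)."""
    start = time.perf_counter()
    answer = generate(prompt)
    return answer, time.perf_counter() - start


def aggregate(scores: Sequence[dict[str, float]]) -> dict[str, float]:
    """Average per-example judge scores into one value per metric."""
    return {m: mean(s[m] for s in scores) for m in METRICS}


if __name__ == "__main__":
    # Stand-in model: a real run would call a CPU inference backend here.
    fake_model = lambda prompt: "stub answer"
    prompt = build_prompt("What is RAG?", ["RAG combines retrieval with generation."])
    answer, seconds = timed_generate(fake_model, prompt)
    print(answer, f"{seconds:.4f}s")
```

Passing `documents=None` yields the no-context prompt, so the same loop can produce both conditions and the judge scores can be averaged per mode with `aggregate`.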

Key Results:

The top-performing SLM (Qwen3-4B-Instruct-2507-Q5KM) scored Correctness 0.71, Answer Relevance 0.89, and Faithfulness 0.80 at 70.9s average latency on CPU, approaching the proprietary GPT-5-mini (0.73 / 0.88 / 0.89).
Qwen3-8B-Q4KM achieved the best quality among SLMs (0.72 / 0.87 / 0.83) but at 339.3s latency, while Meno-tiny-1.5B ran in 27.8s with substantially lower scores (0.41 / 0.57 / 0.60).
Context was critical: GPT-5-mini's Correctness dropped from 0.73 to 0.47 without retrieved context.
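
To put the reported figures side by side, the short script below computes the gap to GPT-5-mini, the latency ratio between the two Qwen3 models, and the size of the no-context drop. The numbers are copied from the results above; the variable and function names are illustrative only, and score triples are assumed to follow the order Correctness / Answer Relevance / Faithfulness.

```python
# Scores are (Correctness, Answer Relevance, Faithfulness), as reported above.
REPORTED = {
    "Qwen3-4B-Instruct-2507-Q5KM": {"scores": (0.71, 0.89, 0.80), "latency_s": 70.9},
    "Qwen3-8B-Q4KM": {"scores": (0.72, 0.87, 0.83), "latency_s": 339.3},
    "Meno-tiny-1.5B": {"scores": (0.41, 0.57, 0.60), "latency_s": 27.8},
}
GPT5_MINI = (0.73, 0.88, 0.89)


def gap_to_reference(scores, reference=GPT5_MINI):
    """Per-metric difference (reference minus model), rounded to 2 decimals."""
    return tuple(round(r - s, 2) for s, r in zip(scores, reference))


def relative_drop(with_context: float, without_context: float) -> float:
    """Fraction of the with-context score lost when context is removed."""
    return (with_context - without_context) / with_context


if __name__ == "__main__":
    for name, row in REPORTED.items():
        print(name, gap_to_reference(row["scores"]), f'{row["latency_s"]}s')

    ratio = (REPORTED["Qwen3-8B-Q4KM"]["latency_s"]
             / REPORTED["Qwen3-4B-Instruct-2507-Q5KM"]["latency_s"])
    print(f"8B vs 4B latency ratio: {ratio:.2f}x")

    # GPT-5-mini Correctness with and without retrieved context.
    print(f"No-context relative drop: {relative_drop(0.73, 0.47):.1%}")
```

On these numbers the 8B model is nearly five times slower than the 4B model for a 0.01 gain in Correctness, and removing context costs GPT-5-mini roughly a third of its Correctness score.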

Applied Context: Builders can deploy production-quality RAG pipelines entirely on CPU (laptops, edge devices), with no GPU required, using quantized 4B-parameter SLMs. This enables private, offline, cost-effective question answering at roughly 1–2 minutes per answer on consumer hardware.