Advanced Retrieval-Augmented Generation with Structure-Aware Chunking and Ensemble Retrieval for University Regulation Question-Answering

Authors

  • Hải Đỗ Tấn, Industrial University of Ho Chi Minh City
  • Đặng Thị Phúc

Keywords:

Structure-Aware Chunking, Ensemble Retrieval, LLM-as-a-Judge, Retrieval-Augmented Generation, Large Language Models

Abstract

This paper presents an advanced Retrieval-Augmented Generation (RAG) architecture for answering complex questions about policies and regulations in higher education. Traditional RAG systems often suffer from context fragmentation, hallucination, and a lexical gap between user queries and the rigid wording of legal documents. To address these issues, we propose a framework comprising Structure-Aware Chunking with Metadata Inheritance, indexing-time Document Expansion, and Ensemble Retrieval. We conduct an ablation study across four pipelines to analyze latency-accuracy trade-offs: Baseline (Vanilla RAG), Lite (Hybrid Retrieval + 7B LLM), SOTA (Document Expansion + 5-Ensemble + 8B LLM), and SOTA Pruned (Optimized 3-Ensemble + Augmented Text Reranker + 4B LLM). We evaluate all pipelines on a rigorously curated 1000-sample benchmark dataset from the Industrial University of Ho Chi Minh City. All advanced pipelines substantially improve Context Recall (e.g., 0.604 for SOTA Pruned vs. 0.288 for Baseline). The Lite pipeline offers the lowest latency and high semantic quality (BERTScore 0.723), whereas the SOTA pipeline maximizes factual grounding (Faithfulness 7.00/10). Notably, the SOTA Pruned pipeline achieves the highest Helpfulness (7.44/10) and Correctness (7.25/10) despite using an LLM with half the parameters of the one in the SOTA pipeline (4B vs. 8B). Using the G-EVAL (LLM-as-a-Judge) framework, our study exposes the inadequacy of traditional n-gram metrics (BLEU, ROUGE) for evaluating modern LLM output and establishes a theoretical ceiling for RAG performance on questions that depend on implicit institutional knowledge.
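
As an illustration of the idea behind Structure-Aware Chunking with Metadata Inheritance, the following is a minimal Python sketch and not the authors' implementation. It assumes regulations are organized into chapters and articles whose headings begin with "Chương"/"Chapter" and "Điều"/"Article"; these heading patterns, the `Chunk` class, and the `chunk_regulation` and `embedding_text` helpers are all hypothetical names introduced here. Each chunk covers one article and inherits the headings of its parent chapter and article as metadata, so the context is not lost when the chunk is retrieved in isolation.

```python
import re
from dataclasses import dataclass, field

# Hypothetical heading patterns (assumption): real regulations may need others.
CHAPTER_RE = re.compile(r"^(Chương|Chapter)\s+\S+", re.IGNORECASE)
ARTICLE_RE = re.compile(r"^(Điều|Article)\s+\d+", re.IGNORECASE)


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_regulation(document: str, source: str) -> list[Chunk]:
    """Split on article boundaries; each chunk inherits chapter/article headings."""
    chunks: list[Chunk] = []
    chapter = None
    article = None
    body: list[str] = []

    def flush() -> None:
        # Emit the article collected so far, tagged with its inherited headings.
        if article is not None and body:
            meta = {"source": source, "chapter": chapter, "article": article}
            chunks.append(Chunk(text="\n".join(body), metadata=meta))
        body.clear()

    for raw in document.splitlines():
        line = raw.strip()
        if not line:
            continue
        if CHAPTER_RE.match(line):
            flush()
            chapter, article = line, None
        elif ARTICLE_RE.match(line):
            flush()
            article = line
        else:
            body.append(line)
    flush()
    return chunks


def embedding_text(chunk: Chunk) -> str:
    """Prepend inherited headings so the indexed text carries its own context."""
    parts = [chunk.metadata.get("chapter"), chunk.metadata.get("article"), chunk.text]
    return "\n".join(p for p in parts if p)
```

In this sketch, a body line that happens to begin with "Article 5 ..." as a cross-reference would be mistaken for a heading; a production chunker would need stricter patterns or layout cues.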

Published

22-05-2026

Issue

Section

Computer & Data Science