Image Captioning for Visually Impaired Users Using the VizWiz Dataset

Authors

  • Gia Lâm Nguyễn, Data Science / Information Technology
  • Trương Công Đạt
  • Hoa Xuân Hoàn

Abstract

Images have long posed significant barriers for visually impaired individuals in accessing information and understanding their surroundings. In this study, we develop an image captioning system that automatically generates English textual descriptions from input images, with the goal of helping visually impaired users perceive visual content through language. The proposed approach is evaluated on the VizWiz dataset and compares several deep learning architectures for caption generation: a MobileNetV3-based encoder, a Vision Transformer (ViT)-based encoder, and the pre-trained vision-language model BLIP. Experimental results are assessed using the standard captioning metrics BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr. The results show that BLIP achieves the best overall performance, while the ViT-based encoder slightly outperforms the MobileNetV3-based encoder on most semantic metrics. Beyond its technical contribution, this research also has practical and humanitarian value: it helps visually impaired people access visual information more effectively in everyday digital contexts.
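
To make the simplest of the listed metrics concrete, the following is a minimal sketch of sentence-level BLEU-1: clipped unigram precision multiplied by a brevity penalty, scored against several reference captions. It is an illustration only and not the evaluation code used in this study. The function name `bleu1`, the lowercase whitespace tokenization, and the example captions are assumptions made for this sketch; standard evaluation toolkits apply their own tokenization and aggregate over the whole corpus.

```python
import math
from collections import Counter


def bleu1(candidate: str, references: list[str]) -> float:
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which is a simplification.
    """
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # Clip each candidate word count by its maximum count in any one reference.
    cand_counts = Counter(cand)
    max_ref_counts = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty: compare against the reference length closest to the
    # candidate length (ties go to the shorter reference).
    ref_len = min((len(r) for r in refs),
                  key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision


# Made-up captions, not taken from the dataset.
score = bleu1(
    "a bottle of water on a table",
    ["a water bottle on a table",
     "a plastic bottle sitting on a wooden table"],
)
print(round(score, 3))
```

In this example six of the seven candidate words are matched by the references, so the score is 6/7. BLEU-4 extends the same idea to n-grams up to length four, combining the four precisions by a geometric mean.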

Published

22-05-2026

Issue

Section

Computer Science and Data Science (Computer & Data Science)