Image Captioning for Visually Impaired Users Using the VizWiz Dataset

Gia Lâm Nguyễn; Công Đạt Trương; Xuân Hoàn Hoa

Authors

Gia Lâm Nguyễn, Data Science / Information Technology
Trương Công Đạt
Hoa Xuân Hoàn

Abstract

Images have long posed a significant barrier for visually impaired people, both in accessing information and in understanding their surroundings. In this study, we develop an image captioning system that automatically generates English descriptions of input images, with the goal of helping visually impaired users perceive visual content through language. We evaluate the approach on the VizWiz dataset and investigate several deep learning architectures for caption generation: a MobileNetV3-based encoder, a Vision Transformer (ViT)-based encoder, and the pre-trained vision-language model BLIP. Results are assessed with the standard captioning metrics BLEU-1, BLEU-4, METEOR, ROUGE-L, and CIDEr. BLIP achieves the best overall performance, while ViT slightly outperforms MobileNetV3 on most semantic metrics. Beyond its technical contribution, this research has practical and humanitarian value: it helps visually impaired people access visual information more effectively in everyday digital contexts.
