Advancing Multilingual Caption Generation with Multi-View Encoders and Triple-Stage Transformer Decoding
DOI: https://doi.org/10.22399/ijcesen.4864

Keywords: Multilingual Image Captioning, Multi-View Visual Encoding, Attention Mechanism, Transformer-based Decoder

Abstract
This work introduces a multilingual image captioning framework that leverages complementary visual representations through a multi-view encoder and a triple-stage transformer-based decoder. The encoder integrates hierarchical visual features by combining ConvNeXt, which provides strong semantic and contextual representations, with Swin Transformer, which captures fine-grained local details. A Gated Attention Fusion module unifies these views into comprehensive visual embeddings. The decoder operates in three stages: initial coarse caption generation, syntactic refinement, and final multilingual translation using a pre-trained mBART (Multilingual BART) model. This modular design enables effective multilingual captioning without requiring parallel datasets. Experiments on the MS-COCO dataset demonstrate that the proposed system outperforms existing baselines: it achieves BLEU-4 scores of 0.53 (Hindi) and 0.52 (English), CIDEr scores of 0.94 and 0.91, and F1 scores of 0.88 and 0.95, respectively. Furthermore, the system attains Word Error Rates (WER) of 0.10 in English and 0.25 in Hindi, indicating strong fluency and semantic coherence. These results highlight the scalability and effectiveness of the approach for real-world multilingual captioning tasks.
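The abstract does not give the fusion equations, so the following is a minimal PyTorch sketch of one plausible Gated Attention Fusion design: both feature streams are projected into a shared embedding space and blended through a learned sigmoid gate. The class name, dimensions, and convex-combination form are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    """Hypothetical sketch of a gated fusion of two visual views,
    e.g. ConvNeXt (semantic) and Swin Transformer (local) tokens."""

    def __init__(self, dim_convnext: int, dim_swin: int, dim_fused: int):
        super().__init__()
        # Project both views into a shared embedding space.
        self.proj_convnext = nn.Linear(dim_convnext, dim_fused)
        self.proj_swin = nn.Linear(dim_swin, dim_fused)
        # Gate decides, per dimension, how much each view contributes.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim_fused, dim_fused),
            nn.Sigmoid(),
        )

    def forward(self, feats_convnext: torch.Tensor,
                feats_swin: torch.Tensor) -> torch.Tensor:
        # feats_*: (batch, num_tokens, dim_*); token grids assumed aligned.
        a = self.proj_convnext(feats_convnext)
        b = self.proj_swin(feats_swin)
        g = self.gate(torch.cat([a, b], dim=-1))
        # Convex combination: g weights the semantic view, (1 - g) the local one.
        return g * a + (1 - g) * b

# Example: fuse 1024-d ConvNeXt and 768-d Swin tokens into 512-d embeddings.
fusion = GatedAttentionFusion(1024, 768, 512)
fused = fusion(torch.randn(2, 49, 1024), torch.randn(2, 49, 768))
print(fused.shape)  # torch.Size([2, 49, 512])
```

The gate lets the model emphasize global semantics for scene-level words and fine local detail for object attributes, which is one way to realize the "comprehensive visual embeddings" the abstract describes.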
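The final decoding stage translates the refined caption with a pre-trained mBART model. As a hedged illustration of that step only, the snippet below uses the publicly available mBART-50 many-to-many checkpoint from Hugging Face Transformers to translate an English caption into Hindi; the paper's actual checkpoint, fine-tuning, and decoding settings are not specified in the abstract.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Assumed checkpoint: the paper may use a different mBART variant.
name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(name)
tokenizer = MBart50TokenizerFast.from_pretrained(name)

# Translate a refined English caption into Hindi.
tokenizer.src_lang = "en_XX"
encoded = tokenizer("a man riding a horse on the beach", return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.lang_code_to_id["hi_IN"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

Because translation is decoupled from caption generation, new target languages can be added by switching the forced target-language token, which is consistent with the claim that no parallel captioning datasets are required.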