BanglaVerse

Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

Nurul Labib Sayeedi¹, Md. Faiyaz Abdullah Sayeedi¹,², Shubhashis Roy Dipta³,
Rubaya Tabassum¹, Ariful Ekraj Hridoy¹, Mehraj Mahmood¹,
Mahbub E Sobhani¹, Md. Tarek Hasan¹, Swakkhar Shatabda²
¹United International University, Bangladesh; ²BRAC University, Bangladesh;
³University of Maryland, Baltimore County, USA

Abstract

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision–language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ∼32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, which is most apparent in knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

Method

Fig: Overview of the BanglaVerse dataset and experimental setup. The figure shows the two task types, example annotations for each task, and the artifact generation and evaluation pipeline with multiple metrics.

Data Statistics

Category Img. Cap. VQA Total
Cult. 114 1,026 2,052 3,192
M&M 150 1,350 2,592 4,092
Pers. 150 1,350 2,700 4,200
Food 150 1,350 2,709 4,209
Nat. Achv. 75 675 1,332 2,082
Pol. 150 1,350 2,754 4,254
Hist. 150 1,350 2,772 4,272
Nature 150 1,350 2,691 4,191
Sports 63 576 1,125 1,764

Total Images: 1,152
Total Captions Across 9 Varieties: 10,377
Total VQA Across 9 Varieties: 20,727
Grand Total Artifacts: 32,256
Table: Overall corpus statistics of the benchmark across 4 languages and 5 Bangla dialects. Abbreviations: Cult. = Culture, Hist. = History, M&M = Media & Movies, Nat. Achv. = National Achievements, Pers. = Personalities, Pol. = Politics, Img. = Images, Cap. = Captions, and VQA = Visual Question Answering.
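The totals reported in the table can be re-derived from the per-category rows. The short Python sketch below does that arithmetic; `COUNTS` and `corpus_totals` are illustrative names introduced here (they are not part of any released BanglaVerse code), and the numbers are copied from the table above.

```python
# Per-category counts copied from the statistics table:
# (images, captions across 9 varieties, VQA items across 9 varieties).
COUNTS = {
    "Culture": (114, 1026, 2052),
    "Media & Movies": (150, 1350, 2592),
    "Personalities": (150, 1350, 2700),
    "Food": (150, 1350, 2709),
    "National Achievements": (75, 675, 1332),
    "Politics": (150, 1350, 2754),
    "History": (150, 1350, 2772),
    "Nature": (150, 1350, 2691),
    "Sports": (63, 576, 1125),
}


def corpus_totals(counts):
    """Return (images, captions, vqa, grand_total) summed over all categories."""
    images = sum(row[0] for row in counts.values())
    captions = sum(row[1] for row in counts.values())
    vqa = sum(row[2] for row in counts.values())
    return images, captions, vqa, images + captions + vqa


images, captions, vqa, grand_total = corpus_totals(COUNTS)
print(f"images={images} captions={captions} vqa={vqa} total={grand_total}")
```

Summing the rows this way reproduces the 1,152 images, 10,377 captions, 20,727 VQA items, and 32,256 artifacts listed in the table.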

Sample Data

Culture sample
Dialect/Language Caption VQA
English
Caption: The farmer in the picture is plowing the land with an ox-drawn plow, a tradition of rural agriculture.
Question: What work is being done using the cow in the picture?
Options: A) Paddy threshing B) Land cultivation C) Grain reserve D) Cattle feed preparation
Answer: Land cultivation
Bangla
Caption: ছবিতে কৃষক গরুর হাল দিয়ে জমি চাষ করছেন, যা গ্রামীণ কৃষির ঐতিহ্য।
Question: ছবিতে গরু ব্যবহার করে কী কাজ করা হচ্ছে?
Options: A) ধান মাড়াই B) জমি চাষ C) শস্য মজুদ D) গো-খাদ্য প্রস্তুত
Answer: জমি চাষ
Hindi
Caption: चित्र में किसान बैल हल से ज़मीन जोत रहे हैं, जो ग्रामीण कृषि की परंपरा है।
Question: चित्र में बैल का उपयोग करके कौन सा कार्य किया जा रहा है?
Options: A) धान की मड़ाई B) खेत जोतना C) अनाज भंडारण D) गो-आहार तैयार करना
Answer: खेत जोतना
Urdu
Caption: تصویر میں کسان بیلوں سے ہل چلا کر زمین تیار کر رہا ہے، جو دیہی زراعت کا ایک روایتی طریقہ ہے۔
Question: تصویر میں گائے کا کیا استعمال ہو رہا ہے؟
Options: A) دھان کی گہائی B) کاشتِ زمین C) غلہ کا ذخیرہ D) گھاس کی تیاری
Answer: کاشتِ زمین
Barishal
Caption: ছবিডায় এউক্কা কিষান গরুর হাল দিয়া ভুঁই চষতে আছে, যেইডা গেরামের কৃষির পুরান নিয়ম।
Question: ছবিডায় গরু দিয়া কী কাম করা অইতেছে?
Options: A) ধান মাড়াই B) ভুঁই চষা C) শস্য থোওয়া D) গরুর খাওন বানানো
Answer: B) ভুঁই চষা
Chittagong
Caption: ছবিত এক্কুয়া কিষান গরুর হাল দিয়েনে জইম চষের, যিয়ান গাঁয়র কৃষির পুরানা নিয়ম।
Question: ছবিত গরু দিয়েনে কী কাম গরের?
Options: A) ধান মাড়াই B) জইম চষা C) শস্য তহন D) গরুর হানা বানান
Answer: B) জইম চষা
Noakhali
Caption: ছবিডাত একজন কিষান গরুর হাল দি জমি চাষ করের, হেইডা গেরামের কৃষির পুরানা নিয়ম।
Question: ছবিডাত গরু দি কী কাম করা হন্নের?
Options: A) ধান মাড়াই B) জমি চাষ C) শস্য রাহন D) গরুর খানা বানান
Answer: B) জমি চাষ
Rangpur
Caption: ছবিত একনা কিষান গরুর হাল দিয়া জমিত হাল বাওবার নাগছে, যেইটা গাওয়ের কৃষির পুরানা নিয়ম।
Question: ছবিত গরু দিয়া কী কাম করা হাইবার নাগছে?
Options: A) ধান মাড়াই B) জমিত হাল বাওয়া C) শস্য থোওয়া D) গরুর খাবার বানান
Answer: B) জমিত হাল বাওয়া
Sylhet
Caption: ছবিত একজন কিষান গরুর হাল দিয়া খেত চাষ কররা, যেতা গাওর কৃষির পুরানা নিয়ম।
Question: ছবিত গরু দিয়া কিতা কাম করা অর?
Options: A) ধান মাড়াই B) খেত চাষ C) শস্য তওয়া D) গরুর খানি বানান
Answer: B) খেত চাষ
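In the sample above, some answers are given as bare option text (the English entry) while others carry a letter prefix such as "B)" (the dialect entries). A scorer therefore has to map both forms to the same option letter before comparing against a model's choice. The Python sketch below shows one way this could be done; `parse_options` and `answer_letter` are hypothetical helpers written for illustration, and they assume the "A) ... B) ... C) ... D) ..." layout shown in the sample rather than any documented BanglaVerse file format.

```python
import re

# Matches an answer that begins with an option letter, e.g. "B) Land cultivation".
_LETTER_PREFIX = re.compile(r"^\s*([A-D])\)\s*")


def parse_options(options_line):
    """Split 'A) x B) y C) z D) w' into {'A': 'x', 'B': 'y', ...}."""
    parts = re.split(r"\s*\b([A-D])\)\s*", options_line.strip())
    # parts[0] is whatever precedes 'A)'; after that come (letter, text) pairs.
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def answer_letter(answer, options):
    """Return the option letter for an answer given as 'B) text' or bare 'text'."""
    match = _LETTER_PREFIX.match(answer)
    if match:
        return match.group(1)
    text = answer.strip()
    for letter, option_text in options.items():
        if option_text == text:
            return letter
    return None  # answer text does not match any option


options = parse_options(
    "A) Paddy threshing B) Land cultivation C) Grain reserve D) Cattle feed preparation"
)
print(answer_letter("Land cultivation", options))
print(answer_letter("B) Land cultivation", options))
```

Normalizing to a letter keeps the accuracy computation independent of the language or dialect in which the option text is written.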

Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.69 0.53 0.63 0.69 0.52 0.59 0.52
Food 0.70 0.52 0.74 0.70 0.53 0.63 0.78
Hist. 0.69 0.54 0.47 0.68 0.47 0.37 0.44
M&M 0.66 0.50 0.30 0.84 0.52 0.32 0.38
Nat. Achv. 0.70 0.56 0.31 0.67 0.50 0.29 0.36
Nature 0.70 0.58 0.66 0.69 0.54 0.74 0.59
Pers. 0.65 0.45 0.61 0.69 0.49 0.62 0.49
Pol. 0.67 0.50 0.39 0.69 0.48 0.40 0.39
Sports 0.68 0.52 0.44 0.68 0.51 0.26 0.48
Gemma3-12B Cult. 0.70 0.55 0.68 0.69 0.52 0.70 0.72
Food 0.71 0.53 0.83 0.71 0.52 0.82 0.89
Hist. 0.69 0.54 0.62 0.68 0.48 0.63 0.75
M&M 0.66 0.51 0.40 0.78 0.52 0.32 0.33
Nat. Achv. 0.69 0.56 0.49 0.66 0.49 0.50 0.59
Nature 0.70 0.58 0.63 0.69 0.56 0.59 0.66
Pers. 0.64 0.49 0.67 0.68 0.48 0.70 0.74
Pol. 0.66 0.51 0.55 0.68 0.48 0.56 0.61
Sports 0.68 0.57 0.53 0.67 0.48 0.47 0.46
Gemma3-27B Cult. 0.67 0.44 0.72 0.70 0.53 0.73 0.73
Food 0.71 0.58 0.87 0.72 0.55 0.89 0.89
Hist. 0.70 0.56 0.69 0.69 0.51 0.69 0.74
M&M 0.67 0.47 0.36 0.82 0.53 0.32 0.42
Nat. Achv. 0.71 0.61 0.51 0.67 0.50 0.51 0.78
Nature 0.70 0.57 0.66 0.70 0.59 0.72 0.78
Pers. 0.65 0.51 0.78 0.70 0.52 0.83 0.76
Pol. 0.67 0.52 0.56 0.69 0.49 0.50 0.62
Sports 0.70 0.58 0.69 0.70 0.52 0.60 0.74
Qwen2.5-VL-7B Cult. 0.69 0.46 0.55 0.66 0.48 0.47 0.59
Food 0.70 0.42 0.68 0.64 0.55 0.54 0.68
Hist. 0.70 0.51 0.60 0.67 0.57 0.52 0.52
M&M 0.70 0.44 0.22 0.76 0.50 0.27 0.17
Nat. Achv. 0.69 0.46 0.69 0.69 0.51 0.71 0.54
Nature 0.70 0.49 0.42 0.55 0.38 0.64 0.39
Pers. 0.65 0.44 0.47 0.69 0.48 0.46 0.56
Pol. 0.66 0.46 0.39 0.69 0.48 0.32 0.41
Sports 0.68 0.47 0.48 0.72 0.52 0.48 0.40
Qwen3-VL-8B Cult. 0.67 0.44 0.68 0.67 0.42 0.68 0.62
Food 0.68 0.42 0.70 0.69 0.47 0.66 0.66
Hist. 0.67 0.44 0.69 0.67 0.39 0.69 0.78
M&M 0.67 0.36 0.40 0.67 0.37 0.28 0.38
Nat. Achv. 0.66 0.40 0.78 0.65 0.37 0.63 0.70
Nature 0.68 0.43 0.46 0.64 0.54 0.47 0.52
Pers. 0.63 0.41 0.51 0.65 0.45 0.52 0.57
Pol. 0.64 0.40 0.45 0.68 0.43 0.48 0.45
Sports 0.66 0.41 0.49 0.64 0.34 0.60 0.45
GPT-4.1-mini Cult. 0.71 0.57 0.65 0.71 0.57 0.56 0.77
Food 0.72 0.56 0.88 0.73 0.54 0.81 0.94
Hist. 0.71 0.58 0.66 0.72 0.57 0.60 0.72
M&M 0.70 0.54 0.49 0.89 0.61 0.42 0.46
Nat. Achv. 0.71 0.62 0.58 0.71 0.60 0.57 0.78
Nature 0.71 0.60 0.57 0.72 0.61 0.56 0.65
Pers. 0.66 0.51 0.65 0.71 0.53 0.60 0.75
Pol. 0.66 0.50 0.50 0.69 0.52 0.50 0.61
Sports 0.69 0.58 0.53 0.71 0.55 0.48 0.69

Full results by model and domain for Bangla Language.

English Language Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.87 0.65 0.72 0.86 0.56 0.58 0.50
Food 0.85 0.61 0.75 0.86 0.54 0.70 0.68
Hist. 0.86 0.61 0.43 0.84 0.44 0.40 0.44
M&M 0.85 0.55 0.26 0.92 0.55 0.34 0.42
Nat. Achv. 0.87 0.63 0.41 0.87 0.54 0.34 0.33
Nature 0.86 0.66 0.55 0.86 0.59 0.55 0.39
Pers. 0.83 0.55 0.57 0.84 0.51 0.52 0.56
Pol. 0.84 0.56 0.41 0.83 0.49 0.43 0.44
Sports 0.86 0.60 0.54 0.84 0.43 0.42 0.55
Gemma3-12B Cult. 0.87 0.62 0.69 0.86 0.55 0.72 0.71
Food 0.85 0.59 0.73 0.86 0.54 0.77 0.73
Hist. 0.85 0.60 0.62 0.85 0.51 0.60 0.74
M&M 0.85 0.53 0.44 0.90 0.59 0.30 0.40
Nat. Achv. 0.87 0.63 0.57 0.86 0.48 0.50 0.60
Nature 0.86 0.64 0.53 0.85 0.56 0.58 0.64
Pers. 0.83 0.53 0.62 0.85 0.50 0.71 0.65
Pol. 0.83 0.54 0.52 0.85 0.50 0.52 0.63
Sports 0.85 0.58 0.61 0.85 0.50 0.56 0.62
Gemma3-27B Cult. 0.87 0.66 0.77 0.86 0.57 0.75 0.80
Food 0.85 0.62 0.88 0.86 0.54 0.84 0.82
Hist. 0.85 0.65 0.62 0.87 0.57 0.62 0.73
M&M 0.86 0.53 0.41 0.92 0.54 0.36 0.45
Nat. Achv. 0.87 0.62 0.65 0.86 0.55 0.50 0.86
Nature 0.86 0.67 0.60 0.85 0.55 0.61 0.68
Pers. 0.83 0.55 0.81 0.85 0.50 0.79 0.77
Pol. 0.83 0.55 0.51 0.84 0.51 0.47 0.49
Sports 0.87 0.66 0.70 0.85 0.50 0.70 0.61
Qwen2.5-VL-7B Cult. 0.87 0.69 0.62 0.84 0.59 0.65 0.58
Food 0.85 0.60 0.77 0.81 0.62 0.72 0.75
Hist. 0.86 0.58 0.62 0.83 0.52 0.54 0.60
M&M 0.86 0.53 0.32 0.90 0.54 0.30 0.37
Nat. Achv. 0.88 0.64 0.60 0.87 0.62 0.51 0.61
Nature 0.87 0.68 0.49 0.77 0.56 0.40 0.48
Pers. 0.83 0.54 0.51 0.84 0.51 0.52 0.60
Pol. 0.83 0.55 0.49 0.84 0.54 0.42 0.56
Sports 0.85 0.56 0.59 0.86 0.60 0.49 0.59
Qwen3-VL-8B Cult. 0.87 0.63 0.71 0.86 0.50 0.66 0.64
Food 0.85 0.62 0.78 0.83 0.58 0.73 0.71
Hist. 0.83 0.62 0.64 0.86 0.49 0.61 0.63
M&M 0.87 0.56 0.37 0.91 0.48 0.31 0.45
Nat. Achv. 0.87 0.73 0.65 0.88 0.60 0.56 0.69
Nature 0.85 0.61 0.46 0.81 0.43 0.38 0.47
Pers. 0.83 0.54 0.64 0.85 0.49 0.64 0.63
Pol. 0.82 0.52 0.45 0.85 0.48 0.45 0.42
Sports 0.85 0.59 0.60 0.86 0.58 0.55 0.65
GPT-4.1-mini Cult. 0.87 0.64 0.72 0.87 0.66 0.68 0.76
Food 0.86 0.64 0.87 0.87 0.67 0.78 0.92
Hist. 0.84 0.61 0.64 0.87 0.54 0.66 0.70
M&M 0.85 0.43 0.51 0.95 0.66 0.36 0.51
Nat. Achv. 0.88 0.66 0.68 0.88 0.69 0.55 0.77
Nature 0.86 0.66 0.65 0.86 0.63 0.58 0.70
Pers. 0.83 0.53 0.83 0.86 0.58 0.76 0.82
Pol. 0.83 0.54 0.40 0.85 0.50 0.43 0.58
Sports 0.86 0.55 0.56 0.87 0.61 0.46 0.75

Full results by model and domain for English Language.

Hindi Language Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.71 0.57 0.72 0.69 0.52 0.53 0.58
Food 0.71 0.53 0.70 0.71 0.52 0.57 0.65
Hist. 0.70 0.55 0.52 0.69 0.47 0.32 0.52
M&M 0.67 0.49 0.26 0.84 0.52 0.25 0.32
Nat. Achv. 0.70 0.56 0.29 0.69 0.53 0.30 0.38
Nature 0.70 0.58 0.66 0.71 0.56 0.61 0.34
Pers. 0.65 0.48 0.57 0.68 0.47 0.45 0.45
Pol. 0.67 0.50 0.36 0.69 0.45 0.41 0.40
Sports 0.68 0.54 0.47 0.69 0.46 0.37 0.44
Gemma3-12B Cult. 0.71 0.57 0.61 0.70 0.51 0.62 0.68
Food 0.70 0.53 0.73 0.71 0.52 0.67 0.77
Hist. 0.70 0.54 0.62 0.69 0.48 0.59 0.71
M&M 0.70 0.53 0.40 0.81 0.51 0.34 0.35
Nat. Achv. 0.71 0.58 0.52 0.69 0.50 0.37 0.68
Nature 0.70 0.57 0.46 0.71 0.56 0.57 0.56
Pers. 0.65 0.49 0.54 0.69 0.50 0.57 0.62
Pol. 0.67 0.51 0.43 0.69 0.45 0.49 0.47
Sports 0.68 0.54 0.47 0.69 0.49 0.45 0.52
Gemma3-27B Cult. 0.72 0.58 0.71 0.71 0.53 0.73 0.73
Food 0.71 0.56 0.84 0.71 0.54 0.81 0.87
Hist. 0.70 0.54 0.67 0.70 0.50 0.63 0.72
M&M 0.69 0.52 0.38 0.86 0.53 0.33 0.42
Nat. Achv. 0.73 0.61 0.54 0.70 0.53 0.55 0.82
Nature 0.71 0.59 0.61 0.70 0.57 0.57 0.62
Pers. 0.66 0.50 0.66 0.70 0.50 0.68 0.68
Pol. 0.68 0.52 0.49 0.69 0.49 0.49 0.45
Sports 0.71 0.58 0.56 0.69 0.51 0.53 0.53
Qwen2.5-VL-7B Cult. 0.70 0.47 0.57 0.64 0.43 0.56 0.60
Food 0.71 0.41 0.64 0.65 0.36 0.67 0.63
Hist. 0.70 0.50 0.58 0.68 0.38 0.48 0.55
M&M 0.70 0.39 0.29 0.80 0.43 0.22 0.36
Nat. Achv. 0.70 0.48 0.75 0.69 0.47 0.58 0.47
Nature 0.71 0.49 0.41 0.65 0.48 0.50 0.45
Pers. 0.65 0.43 0.33 0.69 0.48 0.43 0.45
Pol. 0.67 0.46 0.40 0.69 0.47 0.34 0.41
Sports 0.69 0.48 0.48 0.72 0.54 0.42 0.46
Qwen3-VL-8B Cult. 0.71 0.58 0.60 0.68 0.40 0.52 0.64
Food 0.71 0.53 0.72 0.67 0.34 0.65 0.75
Hist. 0.71 0.50 0.66 0.68 0.43 0.71 0.70
M&M 0.67 0.49 0.37 0.78 0.41 0.29 0.37
Nat. Achv. 0.71 0.61 0.66 0.69 0.47 0.70 0.76
Nature 0.70 0.60 0.36 0.60 0.38 0.40 0.47
Pers. 0.65 0.49 0.48 0.69 0.50 0.48 0.49
Pol. 0.66 0.51 0.43 0.69 0.47 0.45 0.43
Sports 0.69 0.55 0.53 0.72 0.50 0.50 0.47
GPT-4.1-mini Cult. 0.72 0.59 0.64 0.72 0.61 0.61 0.71
Food 0.72 0.57 0.85 0.73 0.59 0.70 0.83
Hist. 0.71 0.55 0.70 0.72 0.55 0.59 0.68
M&M 0.69 0.52 0.38 0.87 0.58 0.37 0.42
Nat. Achv. 0.72 0.61 0.59 0.72 0.56 0.64 0.73
Nature 0.71 0.62 0.49 0.71 0.59 0.38 0.53
Pers. 0.65 0.50 0.60 0.71 0.53 0.50 0.72
Pol. 0.68 0.53 0.46 0.69 0.52 0.53 0.49
Sports 0.69 0.56 0.44 0.74 0.54 0.44 0.53

Full results by model and domain for Hindi Language.

Urdu Language Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.70 0.48 0.57 0.69 0.48 0.34 0.48
Food 0.70 0.46 0.63 0.70 0.47 0.52 0.56
Hist. 0.69 0.49 0.51 0.70 0.44 0.37 0.45
M&M 0.69 0.45 0.19 0.85 0.52 0.24 0.28
Nat. Achv. 0.69 0.49 0.31 0.69 0.46 0.26 0.38
Nature 0.70 0.49 0.67 0.69 0.51 0.57 0.41
Pers. 0.65 0.45 0.52 0.68 0.48 0.43 0.47
Pol. 0.67 0.47 0.37 0.70 0.47 0.37 0.41
Sports 0.68 0.47 0.39 0.70 0.48 0.34 0.52
Gemma3-12B Cult. 0.71 0.54 0.56 0.69 0.50 0.64 0.60
Food 0.69 0.51 0.67 0.70 0.51 0.65 0.68
Hist. 0.70 0.53 0.59 0.70 0.48 0.59 0.63
M&M 0.72 0.52 0.26 0.83 0.53 0.28 0.32
Nat. Achv. 0.70 0.55 0.55 0.70 0.50 0.43 0.64
Nature 0.70 0.56 0.36 0.70 0.56 0.47 0.46
Pers. 0.66 0.49 0.47 0.69 0.50 0.61 0.55
Pol. 0.68 0.53 0.41 0.70 0.47 0.50 0.50
Sports 0.68 0.54 0.42 0.70 0.50 0.52 0.50
Gemma3-27B Cult. 0.71 0.58 0.61 0.70 0.52 0.68 0.68
Food 0.71 0.56 0.72 0.71 0.53 0.77 0.78
Hist. 0.71 0.58 0.65 0.70 0.49 0.64 0.70
M&M 0.73 0.51 0.32 0.87 0.52 0.32 0.35
Nat. Achv. 0.71 0.61 0.58 0.70 0.54 0.57 0.83
Nature 0.70 0.59 0.41 0.70 0.56 0.57 0.51
Pers. 0.66 0.49 0.56 0.69 0.51 0.61 0.59
Pol. 0.68 0.54 0.45 0.71 0.49 0.46 0.48
Sports 0.71 0.60 0.61 0.73 0.49 0.60 0.61
Qwen2.5-VL-7B Cult. 0.69 0.43 0.58 0.65 0.42 0.54 0.55
Food 0.69 0.34 0.66 0.64 0.32 0.56 0.60
Hist. 0.70 0.46 0.58 0.69 0.41 0.54 0.54
M&M 0.66 0.35 0.24 0.80 0.41 0.29 0.35
Nat. Achv. 0.68 0.40 0.67 0.71 0.46 0.65 0.55
Nature 0.69 0.43 0.49 0.56 0.32 0.24 0.46
Pers. 0.65 0.38 0.43 0.69 0.45 0.51 0.44
Pol. 0.68 0.45 0.42 0.70 0.47 0.35 0.36
Sports 0.67 0.46 0.44 0.72 0.44 0.29 0.50
Qwen3-VL-8B Cult. 0.70 0.53 0.53 0.67 0.40 0.59 0.57
Food 0.69 0.49 0.66 0.66 0.37 0.60 0.64
Hist. 0.71 0.54 0.66 0.67 0.38 0.60 0.68
M&M 0.68 0.37 0.25 0.82 0.47 0.39 0.26
Nat. Achv. 0.71 0.57 0.69 0.69 0.50 0.50 0.70
Nature 0.69 0.52 0.37 0.59 0.35 0.46 0.38
Pers. 0.64 0.43 0.42 0.69 0.51 0.45 0.44
Pol. 0.67 0.47 0.39 0.69 0.47 0.41 0.44
Sports 0.69 0.58 0.51 0.72 0.47 0.46 0.49
GPT-4.1-mini Cult. 0.72 0.57 0.64 0.72 0.57 0.60 0.69
Food 0.71 0.54 0.69 0.72 0.56 0.61 0.78
Hist. 0.72 0.55 0.60 0.71 0.56 0.56 0.63
M&M 0.70 0.50 0.43 0.87 0.55 0.35 0.43
Nat. Achv. 0.71 0.60 0.58 0.71 0.58 0.56 0.73
Nature 0.71 0.62 0.41 0.71 0.59 0.35 0.56
Pers. 0.66 0.49 0.50 0.69 0.50 0.45 0.73
Pol. 0.68 0.52 0.39 0.70 0.51 0.46 0.56
Sports 0.70 0.57 0.47 0.74 0.55 0.45 0.56

Full results by model and domain for Urdu Language.

Barishal Dialect Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.68 0.49 0.61 0.67 0.48 0.38 0.48
Food 0.69 0.45 0.74 0.69 0.53 0.53 0.75
Hist. 0.65 0.47 0.48 0.66 0.43 0.33 0.48
M&M 0.64 0.43 0.29 0.75 0.47 0.29 0.29
Nat. Achv. 0.65 0.45 0.26 0.65 0.49 0.25 0.36
Nature 0.68 0.52 0.67 0.68 0.56 0.60 0.45
Pers. 0.63 0.44 0.56 0.64 0.45 0.46 0.52
Pol. 0.63 0.45 0.41 0.66 0.45 0.37 0.40
Sports 0.64 0.48 0.42 0.64 0.47 0.44 0.63
Gemma3-12B Cult. 0.69 0.47 0.64 0.68 0.46 0.63 0.70
Food 0.70 0.44 0.86 0.70 0.49 0.76 0.95
Hist. 0.67 0.42 0.63 0.67 0.41 0.55 0.75
M&M 0.66 0.39 0.32 0.80 0.49 0.37 0.35
Nat. Achv. 0.66 0.44 0.49 0.67 0.46 0.50 0.42
Nature 0.69 0.46 0.46 0.67 0.47 0.51 0.52
Pers. 0.64 0.42 0.67 0.66 0.41 0.65 0.67
Pol. 0.64 0.42 0.51 0.67 0.42 0.55 0.50
Sports 0.65 0.42 0.48 0.66 0.42 0.52 0.47
Gemma3-27B Cult. 0.69 0.47 0.70 0.69 0.50 0.72 0.69
Food 0.70 0.51 0.88 0.72 0.54 0.87 0.90
Hist. 0.66 0.41 0.67 0.70 0.47 0.64 0.71
M&M 0.64 0.42 0.36 0.78 0.48 0.31 0.39
Nat. Achv. 0.66 0.44 0.51 0.68 0.48 0.53 0.75
Nature 0.68 0.45 0.61 0.69 0.51 0.63 0.75
Pers. 0.63 0.42 0.82 0.67 0.45 0.80 0.75
Pol. 0.63 0.39 0.53 0.68 0.45 0.51 0.58
Sports 0.66 0.44 0.65 0.66 0.44 0.65 0.80
Qwen2.5-VL-7B Cult. 0.68 0.38 0.48 0.66 0.34 0.45 0.51
Food 0.69 0.34 0.61 0.64 0.33 0.51 0.67
Hist. 0.69 0.44 0.57 0.65 0.34 0.44 0.61
M&M 0.68 0.34 0.21 0.78 0.41 0.26 0.31
Nat. Achv. 0.67 0.30 0.66 0.69 0.44 0.68 0.55
Nature 0.68 0.36 0.40 0.55 0.31 0.66 0.35
Pers. 0.64 0.36 0.36 0.68 0.45 0.42 0.52
Pol. 0.65 0.38 0.38 0.69 0.45 0.32 0.51
Sports 0.68 0.35 0.45 0.70 0.44 0.39 0.53
Qwen3-VL-8B Cult. 0.67 0.43 0.59 0.66 0.34 0.59 0.70
Food 0.68 0.40 0.72 0.69 0.42 0.73 0.74
Hist. 0.67 0.46 0.71 0.67 0.38 0.71 0.65
M&M 0.67 0.35 0.36 0.69 0.37 0.37 0.28
Nat. Achv. 0.66 0.41 0.67 0.65 0.35 0.59 0.57
Nature 0.68 0.43 0.53 0.65 0.34 0.31 0.48
Pers. 0.63 0.42 0.52 0.65 0.43 0.54 0.52
Pol. 0.64 0.45 0.45 0.66 0.42 0.47 0.46
Sports 0.65 0.47 0.48 0.64 0.38 0.53 0.43
GPT-4.1-mini Cult. 0.70 0.46 0.64 0.70 0.50 0.59 0.75
Food 0.72 0.45 0.84 0.73 0.54 0.73 0.92
Hist. 0.69 0.44 0.56 0.70 0.50 0.54 0.68
M&M 0.68 0.37 0.45 0.90 0.56 0.41 0.45
Nat. Achv. 0.69 0.42 0.61 0.69 0.52 0.51 0.85
Nature 0.69 0.41 0.48 0.71 0.51 0.51 0.62
Pers. 0.65 0.36 0.59 0.69 0.52 0.53 0.73
Pol. 0.65 0.37 0.49 0.69 0.45 0.43 0.56
Sports 0.66 0.43 0.52 0.71 0.45 0.53 0.71

Full results by model and domain for Barishal Dialect.
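The abstract's claim that dialectal variation hurts captioning more than VQA can be checked directly against the tables. The Python sketch below compares GPT-4.1-mini's zero-shot scores in standard Bangla with its scores in the Barishal dialect, averaged over the nine domains. The values are copied from the two tables; treating the second numeric column as the zero-shot caption LLM score and the third as zero-shot VQA accuracy is an assumption based on the header order, and `mean_drop` is an illustrative helper rather than part of the benchmark's evaluation code.

```python
from statistics import mean

# GPT-4.1-mini, zero-shot: domain -> (caption LLM score, VQA accuracy).
# Values copied from the Bangla and Barishal result tables.
BANGLA = {
    "Cult.": (0.57, 0.65), "Food": (0.56, 0.88), "Hist.": (0.58, 0.66),
    "M&M": (0.54, 0.49), "Nat. Achv.": (0.62, 0.58), "Nature": (0.60, 0.57),
    "Pers.": (0.51, 0.65), "Pol.": (0.50, 0.50), "Sports": (0.58, 0.53),
}
BARISHAL = {
    "Cult.": (0.46, 0.64), "Food": (0.45, 0.84), "Hist.": (0.44, 0.56),
    "M&M": (0.37, 0.45), "Nat. Achv.": (0.42, 0.61), "Nature": (0.41, 0.48),
    "Pers.": (0.36, 0.59), "Pol.": (0.37, 0.49), "Sports": (0.43, 0.52),
}


def mean_drop(standard, dialect, index):
    """Domain-averaged score in `standard` minus the same average in `dialect`."""
    standard_mean = mean(scores[index] for scores in standard.values())
    dialect_mean = mean(scores[index] for scores in dialect.values())
    return standard_mean - dialect_mean


caption_drop = mean_drop(BANGLA, BARISHAL, 0)
vqa_drop = mean_drop(BANGLA, BARISHAL, 1)
print(f"caption LLM drop: {caption_drop:.3f}, VQA accuracy drop: {vqa_drop:.3f}")
```

For this one model and dialect, the caption score falls by about 0.15 on average while VQA accuracy falls by about 0.04, which is in line with the pattern described in the abstract. The same comparison can be repeated for other models and dialects by swapping in the corresponding rows.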

Chittagong Dialect Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.67 0.45 0.65 0.68 0.45 0.42 0.62
Food 0.69 0.48 0.75 0.69 0.51 0.62 0.77
Hist. 0.65 0.40 0.47 0.68 0.45 0.33 0.41
M&M 0.65 0.42 0.25 0.78 0.51 0.25 0.30
Nat. Achv. 0.65 0.44 0.28 0.67 0.43 0.27 0.31
Nature 0.69 0.51 0.63 0.69 0.53 0.57 0.49
Pers. 0.63 0.40 0.52 0.66 0.42 0.44 0.51
Pol. 0.63 0.41 0.38 0.67 0.41 0.35 0.53
Sports 0.63 0.45 0.45 0.65 0.45 0.27 0.61
Gemma3-12B Cult. 0.68 0.45 0.65 0.68 0.45 0.61 0.69
Food 0.70 0.48 0.85 0.71 0.51 0.71 0.83
Hist. 0.67 0.40 0.62 0.68 0.40 0.59 0.74
M&M 0.66 0.42 0.32 0.83 0.51 0.37 0.28
Nat. Achv. 0.66 0.47 0.54 0.67 0.45 0.49 0.62
Nature 0.68 0.43 0.51 0.68 0.46 0.45 0.59
Pers. 0.64 0.41 0.69 0.66 0.41 0.64 0.83
Pol. 0.64 0.41 0.47 0.68 0.40 0.49 0.54
Sports 0.65 0.45 0.52 0.66 0.45 0.42 0.50
Gemma3-27B Cult. 0.68 0.40 0.68 0.69 0.47 0.73 0.77
Food 0.70 0.44 0.90 0.71 0.50 0.88 0.85
Hist. 0.66 0.38 0.67 0.69 0.46 0.67 0.77
M&M 0.66 0.35 0.34 0.79 0.50 0.30 0.34
Nat. Achv. 0.66 0.45 0.52 0.68 0.43 0.46 0.69
Nature 0.69 0.43 0.59 0.70 0.50 0.57 0.70
Pers. 0.63 0.35 0.78 0.67 0.46 0.82 0.80
Pol. 0.64 0.36 0.55 0.68 0.46 0.48 0.48
Sports 0.66 0.38 0.66 0.67 0.41 0.66 0.71
Qwen2.5-VL-7B Cult. 0.67 0.36 0.55 0.63 0.34 0.49 0.60
Food 0.69 0.31 0.68 0.65 0.29 0.49 0.66
Hist. 0.69 0.42 0.62 0.67 0.37 0.57 0.50
M&M 0.67 0.35 0.18 0.79 0.45 0.19 0.18
Nat. Achv. 0.67 0.34 0.66 0.70 0.39 0.60 0.43
Nature 0.68 0.37 0.51 0.54 0.28 0.38 0.37
Pers. 0.63 0.34 0.36 0.67 0.41 0.44 0.54
Pol. 0.65 0.39 0.36 0.69 0.44 0.31 0.29
Sports 0.67 0.37 0.52 0.69 0.45 0.43 0.58
Qwen3-VL-8B Cult. 0.67 0.45 0.60 0.67 0.42 0.57 0.67
Food 0.68 0.40 0.69 0.69 0.43 0.67 0.74
Hist. 0.67 0.44 0.68 0.67 0.41 0.68 0.69
M&M 0.66 0.36 0.29 0.68 0.34 0.25 0.28
Nat. Achv. 0.65 0.41 0.59 0.66 0.42 0.53 0.59
Nature 0.67 0.44 0.57 0.64 0.33 0.48 0.45
Pers. 0.63 0.43 0.57 0.64 0.41 0.46 0.50
Pol. 0.64 0.44 0.48 0.66 0.41 0.41 0.38
Sports 0.64 0.42 0.54 0.66 0.39 0.44 0.56
GPT-4.1-mini Cult. 0.69 0.41 0.70 0.70 0.45 0.58 0.73
Food 0.72 0.46 0.84 0.74 0.52 0.75 0.91
Hist. 0.69 0.41 0.64 0.70 0.46 0.64 0.66
M&M 0.68 0.42 0.48 0.89 0.61 0.42 0.43
Nat. Achv. 0.69 0.46 0.59 0.70 0.54 0.57 0.80
Nature 0.70 0.39 0.58 0.71 0.46 0.44 0.57
Pers. 0.65 0.34 0.67 0.69 0.52 0.54 0.78
Pol. 0.66 0.36 0.50 0.68 0.45 0.46 0.59
Sports 0.67 0.41 0.50 0.70 0.48 0.44 0.66

Full results by model and domain for Chittagong Dialect.

Noakhali Dialect Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.69 0.48 0.65 0.68 0.49 0.49 0.51
Food 0.69 0.45 0.74 0.69 0.50 0.61 0.72
Hist. 0.66 0.40 0.46 0.68 0.42 0.37 0.51
M&M 0.65 0.41 0.23 0.79 0.48 0.28 0.31
Nat. Achv. 0.66 0.49 0.31 0.68 0.45 0.20 0.43
Nature 0.70 0.56 0.63 0.69 0.56 0.69 0.49
Pers. 0.64 0.45 0.51 0.68 0.47 0.50 0.53
Pol. 0.65 0.45 0.40 0.68 0.43 0.41 0.46
Sports 0.64 0.42 0.50 0.66 0.46 0.45 0.45
Gemma3-12B Cult. 0.69 0.42 0.67 0.68 0.44 0.65 0.70
Food 0.71 0.44 0.86 0.71 0.49 0.77 0.88
Hist. 0.67 0.37 0.63 0.68 0.42 0.60 0.62
M&M 0.67 0.37 0.35 0.75 0.54 0.34 0.36
Nat. Achv. 0.66 0.42 0.55 0.68 0.45 0.44 0.44
Nature 0.69 0.43 0.48 0.68 0.45 0.46 0.60
Pers. 0.64 0.41 0.68 0.65 0.41 0.67 0.69
Pol. 0.65 0.40 0.54 0.67 0.41 0.56 0.60
Sports 0.65 0.40 0.53 0.65 0.42 0.52 0.82
Gemma3-27B Cult. 0.70 0.47 0.73 0.69 0.48 0.69 0.79
Food 0.71 0.47 0.90 0.72 0.52 0.91 0.90
Hist. 0.68 0.44 0.66 0.69 0.45 0.69 0.70
M&M 0.68 0.41 0.35 0.80 0.51 0.32 0.46
Nat. Achv. 0.67 0.47 0.48 0.67 0.43 0.53 0.85
Nature 0.69 0.47 0.62 0.70 0.51 0.68 0.67
Pers. 0.65 0.39 0.80 0.68 0.46 0.85 0.83
Pol. 0.65 0.39 0.51 0.69 0.46 0.47 0.50
Sports 0.67 0.43 0.65 0.66 0.42 0.58 0.67
Qwen2.5-VL-7B Cult. 0.68 0.37 0.51 0.69 0.36 0.49 0.53
Food 0.69 0.33 0.66 0.71 0.41 0.54 0.68
Hist. 0.69 0.42 0.60 0.70 0.44 0.55 0.60
M&M 0.68 0.35 0.21 0.83 0.53 0.29 0.23
Nat. Achv. 0.67 0.41 0.69 0.70 0.42 0.48 0.56
Nature 0.68 0.41 0.37 0.71 0.44 0.58 0.34
Pers. 0.64 0.36 0.46 0.67 0.41 0.44 0.43
Pol. 0.65 0.39 0.39 0.68 0.44 0.36 0.30
Sports 0.67 0.40 0.44 0.72 0.49 0.42 0.41
Qwen3-VL-8B Cult. 0.67 0.45 0.65 0.67 0.35 0.62 0.64
Food 0.69 0.40 0.73 0.68 0.40 0.69 0.74
Hist. 0.68 0.43 0.69 0.66 0.35 0.63 0.72
M&M 0.67 0.36 0.38 0.69 0.37 0.30 0.35
Nat. Achv. 0.67 0.46 0.63 0.66 0.42 0.57 0.81
Nature 0.68 0.43 0.38 0.64 0.30 0.35 0.43
Pers. 0.64 0.42 0.50 0.65 0.42 0.44 0.46
Pol. 0.64 0.42 0.43 0.66 0.39 0.44 0.34
Sports 0.66 0.47 0.50 0.66 0.38 0.47 0.58
GPT-4.1-mini Cult. 0.70 0.44 0.65 0.71 0.52 0.58 0.72
Food 0.71 0.41 0.85 0.73 0.51 0.73 0.88
Hist. 0.69 0.44 0.61 0.71 0.50 0.57 0.68
M&M 0.68 0.36 0.48 0.90 0.64 0.38 0.49
Nat. Achv. 0.69 0.43 0.62 0.71 0.53 0.62 0.82
Nature 0.70 0.39 0.49 0.71 0.50 0.51 0.59
Pers. 0.65 0.37 0.57 0.70 0.51 0.57 0.74
Pol. 0.67 0.38 0.42 0.69 0.49 0.51 0.57
Sports 0.68 0.42 0.53 0.69 0.48 0.50 0.61

Full results by model and domain for Noakhali Dialect.

Rangpur Dialect Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.68 0.49 0.63 0.68 0.49 0.50 0.59
Food 0.70 0.50 0.73 0.70 0.52 0.56 0.72
Hist. 0.67 0.48 0.44 0.68 0.44 0.42 0.40
M&M 0.66 0.44 0.27 0.80 0.52 0.29 0.30
Nat. Achv. 0.67 0.52 0.35 0.66 0.44 0.27 0.45
Nature 0.69 0.55 0.67 0.69 0.57 0.69 0.41
Pers. 0.63 0.42 0.59 0.67 0.46 0.55 0.53
Pol. 0.64 0.41 0.42 0.69 0.45 0.38 0.41
Sports 0.65 0.48 0.48 0.66 0.47 0.37 0.46
Gemma3-12B Cult. 0.69 0.45 0.62 0.68 0.44 0.62 0.82
Food 0.70 0.47 0.85 0.71 0.51 0.78 0.82
Hist. 0.67 0.43 0.64 0.68 0.44 0.60 0.76
M&M 0.65 0.41 0.37 0.82 0.51 0.32 0.44
Nat. Achv. 0.65 0.45 0.52 0.68 0.46 0.52 0.71
Nature 0.68 0.43 0.44 0.69 0.46 0.54 0.58
Pers. 0.63 0.42 0.66 0.65 0.40 0.71 0.76
Pol. 0.65 0.41 0.55 0.67 0.40 0.54 0.62
Sports 0.64 0.44 0.50 0.66 0.44 0.52 0.54
Gemma3-27B Cult. 0.70 0.52 0.69 0.68 0.49 0.72 0.64
Food 0.71 0.50 0.88 0.72 0.52 0.90 0.87
Hist. 0.68 0.47 0.69 0.69 0.47 0.66 0.75
M&M 0.67 0.43 0.33 0.84 0.52 0.34 0.43
Nat. Achv. 0.68 0.48 0.52 0.69 0.43 0.55 0.73
Nature 0.69 0.49 0.59 0.70 0.54 0.60 0.67
Pers. 0.64 0.42 0.78 0.69 0.48 0.78 0.79
Pol. 0.65 0.43 0.57 0.68 0.46 0.51 0.51
Sports 0.67 0.48 0.66 0.68 0.45 0.53 0.84
Qwen2.5-VL-7B Cult. 0.69 0.49 0.55 0.69 0.47 0.40 0.59
Food 0.69 0.43 0.68 0.72 0.46 0.57 0.68
Hist. 0.70 0.51 0.60 0.71 0.52 0.51 0.50
M&M 0.69 0.44 0.22 0.81 0.57 0.26 0.20
Nat. Achv. 0.68 0.49 0.69 0.69 0.50 0.71 0.51
Nature 0.70 0.51 0.42 0.71 0.50 0.70 0.40
Pers. 0.65 0.45 0.47 0.68 0.47 0.39 0.53
Pol. 0.66 0.47 0.39 0.67 0.49 0.36 0.33
Sports 0.69 0.49 0.48 0.72 0.54 0.44 0.44
Qwen3-VL-8B Cult. 0.66 0.43 0.57 0.66 0.35 0.65 0.66
Food 0.68 0.41 0.75 0.68 0.41 0.70 0.73
Hist. 0.67 0.44 0.70 0.67 0.38 0.72 0.65
M&M 0.66 0.34 0.32 0.67 0.38 0.27 0.38
Nat. Achv. 0.65 0.41 0.66 0.64 0.42 0.55 0.65
Nature 0.67 0.42 0.57 0.65 0.34 0.50 0.49
Pers. 0.62 0.41 0.57 0.64 0.44 0.53 0.54
Pol. 0.64 0.42 0.45 0.67 0.45 0.47 0.44
Sports 0.64 0.44 0.48 0.65 0.36 0.42 0.61
GPT-4.1-mini Cult. 0.69 0.42 0.66 0.70 0.50 0.58 0.78
Food 0.72 0.44 0.86 0.73 0.54 0.68 0.89
Hist. 0.69 0.45 0.63 0.69 0.48 0.62 0.72
M&M 0.69 0.39 0.53 0.91 0.63 0.39 0.43
Nat. Achv. 0.69 0.45 0.57 0.70 0.52 0.58 0.74
Nature 0.69 0.41 0.55 0.71 0.53 0.49 0.69
Pers. 0.64 0.39 0.59 0.70 0.51 0.46 0.72
Pol. 0.66 0.35 0.46 0.69 0.46 0.46 0.51
Sports 0.67 0.43 0.53 0.71 0.53 0.50 0.74

Full results by model and domain for Rangpur Dialect.

Sylhet Dialect Results

Columns (scores on a 0-1 scale): Models, Domain, Zero-Shot Caption B-F1, Zero-Shot Caption LLM, Zero-Shot VQA Accuracy, Few-Shot Caption B-F1, Few-Shot Caption LLM, Few-Shot VQA Accuracy, CoT VQA
Gemma3-4B Cult. 0.67 0.47 0.60 0.68 0.45 0.45 0.58
Food 0.70 0.46 0.72 0.70 0.50 0.62 0.64
Hist. 0.66 0.42 0.44 0.68 0.43 0.40 0.47
M&M 0.65 0.42 0.27 0.80 0.50 0.35 0.36
Nat. Achv. 0.65 0.47 0.31 0.68 0.48 0.29 0.45
Nature 0.69 0.58 0.59 0.69 0.56 0.63 0.52
Pers. 0.63 0.45 0.58 0.67 0.45 0.54 0.51
Pol. 0.64 0.45 0.40 0.68 0.43 0.42 0.40
Sports 0.64 0.47 0.47 0.67 0.47 0.42 0.44
Gemma3-12B Cult. 0.68 0.40 0.62 0.68 0.43 0.65 0.78
Food 0.70 0.40 0.82 0.71 0.48 0.67 0.91
Hist. 0.66 0.37 0.64 0.69 0.39 0.64 0.64
M&M 0.66 0.39 0.35 0.80 0.49 0.33 0.45
Nat. Achv. 0.65 0.38 0.46 0.67 0.43 0.38 0.58
Nature 0.68 0.41 0.47 0.68 0.45 0.54 0.42
Pers. 0.63 0.31 0.68 0.65 0.34 0.65 0.81
Pol. 0.63 0.33 0.54 0.67 0.39 0.52 0.59
Sports 0.65 0.39 0.45 0.66 0.40 0.44 0.50
Gemma3-27B Cult. 0.69 0.48 0.70 0.69 0.48 0.68 0.78
Food 0.71 0.45 0.84 0.72 0.53 0.82 0.86
Hist. 0.69 0.44 0.66 0.69 0.46 0.63 0.75
M&M 0.67 0.42 0.34 0.84 0.51 0.33 0.45
Nat. Achv. 0.68 0.43 0.51 0.68 0.45 0.49 0.69
Nature 0.69 0.46 0.59 0.70 0.50 0.60 0.63
Pers. 0.65 0.36 0.79 0.69 0.48 0.82 0.78
Pol. 0.66 0.38 0.51 0.68 0.46 0.47 0.55
Sports 0.68 0.42 0.63 0.67 0.41 0.56 0.68
Qwen2.5-VL-7B Cult. 0.67 0.35 0.52 0.69 0.32 0.42 0.53
Food 0.69 0.32 0.55 0.71 0.35 0.45 0.59
Hist. 0.68 0.42 0.58 0.70 0.43 0.55 0.52
M&M 0.67 0.35 0.21 0.83 0.53 0.22 0.22
Nat. Achv. 0.67 0.36 0.66 0.70 0.40 0.71 0.52
Nature 0.69 0.39 0.43 0.70 0.44 0.60 0.38
Pers. 0.64 0.36 0.49 0.67 0.41 0.48 0.51
Pol. 0.66 0.38 0.39 0.69 0.44 0.34 0.36
Sports 0.67 0.41 0.45 0.69 0.44 0.37 0.48
Qwen3-VL-8B Cult. 0.66 0.45 0.66 0.65 0.35 0.59 0.68
Food 0.68 0.44 0.71 0.69 0.45 0.68 0.73
Hist. 0.67 0.44 0.72 0.66 0.35 0.67 0.72
M&M 0.66 0.34 0.42 0.67 0.37 0.44 0.37
Nat. Achv. 0.67 0.45 0.66 0.65 0.34 0.69 0.72
Nature 0.68 0.47 0.48 0.64 0.32 0.42 0.44
Pers. 0.63 0.41 0.49 0.64 0.45 0.56 0.56
Pol. 0.64 0.44 0.42 0.66 0.42 0.41 0.42
Sports 0.66 0.44 0.53 0.64 0.39 0.51 0.59
GPT-4.1-mini Cult. 0.69 0.43 0.63 0.70 0.51 0.59 0.71
Food 0.72 0.46 0.84 0.72 0.50 0.66 0.86
Hist. 0.69 0.46 0.58 0.70 0.49 0.58 0.68
M&M 0.69 0.40 0.47 0.90 0.59 0.43 0.50
Nat. Achv. 0.69 0.48 0.64 0.70 0.55 0.62 0.76
Nature 0.70 0.45 0.53 0.72 0.54 0.47 0.67
Pers. 0.65 0.36 0.63 0.70 0.51 0.51 0.71
Pol. 0.66 0.40 0.48 0.68 0.47 0.44 0.63
Sports 0.68 0.46 0.47 0.70 0.50 0.47 0.73

Full results by model and domain for Sylhet Dialect.

BibTeX


@article{sayeedi2026many,
  title={Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects},
  author={Sayeedi, Nurul Labib and Sayeedi, Md. Faiyaz Abdullah and Dipta, Shubhashis Roy and Tabassum, Rubaya and Hridoy, Ariful Ekraj and Mahmood, Mehraj and Sobhani, Mahbub E and Hasan, Md. Tarek and Shatabda, Swakkhar},
  journal={arXiv preprint arXiv:2603.21165},
  year={2026}
}
    