Benchmarking Annotation Framework

Overview

This page presents benchmark results for Small Language Models (SLMs) and Large Language Models (LLMs) within our annotation framework. The objective is to evaluate how these models perform on text-only and multimodal (text + image) tasks. For this benchmarking, SLMs are the smaller models, such as BERT and GPT-2, which typically have a few hundred million parameters at most. In contrast, LLMs, including models like Llama 3, Mistral, Gemma, and Phi, have billions of parameters. This difference in scale reflects the trade-off between efficiency and model capacity across tasks and datasets.

Benchmarking Results: Text-Based Models

Model	Configuration	Precision	Recall	F1	Test Accuracy
Small Language Models
BERT-base-uncased	FT	0.8887	0.8870	0.8878	0.8870
DistilBERT	FT	0.8665	0.8554	0.8609	0.8710
RoBERTa-base	FT	0.8940	0.8940	0.8940	0.8940
GPT2	FT	0.8762	0.8751	0.8756	0.8751
BART	FT	0.8762	0.8760	0.8761	0.8760
Large Language Models
Llama 3.1-8B-instruct	0-shot	0.8280	0.6890	0.7521	0.7200
	5-shot	0.8400	0.7700	0.8035	0.7905
	IFT	0.8019	0.8019	0.8019	0.8180
Llama 3.1-8B (base)	FT	0.8800	0.8600	0.8699	0.8320
Llama 3.2-3B-instruct	0-shot	0.7386	0.7550	0.7467	0.6897
	5-shot	0.7989	0.6840	0.7370	0.6133
	IFT	0.8390	0.7984	0.8182	0.8084
Llama 3.2-3B (base)	FT	0.8400	0.8500	0.8450	0.8200
Mistral-v0.3 7B-instruct	0-shot	0.8153	0.5250	0.6387	0.6990
	5-shot	0.8319	0.8134	0.8225	0.7830
	IFT	0.8890	0.9240	0.9062	0.7980
Mistral-v0.3 7B (base)	FT	0.8200	0.7400	0.7779	0.8014
Qwen2.5-7B	0-shot	0.8576	0.8576	0.8576	0.8576
	5-shot	0.8660	0.8790	0.8724	0.8900
	IFT	0.8357	0.8474	0.8415	0.8474

Table 1: Precision, recall, F1, and test accuracy for text-only models and configurations. Configuration types: 0-shot = no examples provided in the prompt at inference; 5-shot = five labeled examples provided in the prompt before the query; FT = fine-tuned on task-specific data; IFT = instruction fine-tuned on task-specific data.
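The difference between the 0-shot and 5-shot configurations can be sketched as prompt construction. This is a minimal illustration, not the framework's actual prompt: the instruction wording, the `build_prompt` helper, and the label names are all hypothetical, since this page does not give the prompt template or the label set.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text; the page does not state the real wording.
INSTRUCTION = "Classify the input with one of the labels: {labels}."


def build_prompt(
    text: str,
    labels: Sequence[str],
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a k-shot prompt, where k = len(examples).

    With no examples this is a 0-shot prompt; with five labeled
    (text, label) pairs it is a 5-shot prompt.
    """
    parts = [INSTRUCTION.format(labels=", ".join(labels))]
    # Each demonstration shows the model an input together with its label.
    for example_text, example_label in examples:
        parts.append(f"Text: {example_text}\nLabel: {example_label}")
    # The query comes last, with the label left open for the model to fill in.
    parts.append(f"Text: {text}\nLabel:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo_labels = ["A", "B"]  # placeholder label names
    demos = [(f"example text {i}", "A") for i in range(5)]
    print(build_prompt("text to annotate", demo_labels))         # 0-shot
    print(build_prompt("text to annotate", demo_labels, demos))  # 5-shot
```

FT and IFT differ from both of these in that the model weights are updated on task data before inference, so no demonstrations are needed in the prompt.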

Benchmarking Results: Multimodal (Text-Image) Models

Model	Configuration (Text-Image)	Precision	Recall	F1	Test Accuracy
Small Language Models
SpotFake (XLNET + VGG-19)	FT	0.7415	0.6790	0.7089	0.8151
BERT + ResNet-34	FT	0.8311	0.6277	0.7152	0.8248
FND-CLIP (BERT and CLIP)	FT	0.6935	0.7151	0.7041	0.8971
Distill-RoBERTa and CLIP	FT	0.7000	0.6600	0.6794	0.8600
Large Vision Language Models
Phi-3-vision-128k-instruct	0-shot	0.7400	0.6700	0.7033	0.7103
Phi-3-vision-128k-instruct	5-shot	0.7600	0.7200	0.7395	0.7024
Phi-3-vision-128k-instruct	IFT	0.7800	0.8000	0.7899	0.7200
LLaVA-1.6	0-shot	0.7531	0.6466	0.6958	0.6500
LLaVA-1.6	5-shot	0.7102	0.6893	0.6996	0.6338
Llama-3.2-11B-Vision-Instruct	0-shot	0.6668	0.7233	0.6939	0.7060
Llama-3.2-11B-Vision-Instruct	5-shot	0.7570	0.7630	0.7600	0.7299
Llama-3.2-11B-Vision-Instruct	IFT	0.7893	0.8838	0.8060	0.9040

Table 2: Precision, recall, F1, and test accuracy for small multimodal models and large vision language models on text-image inputs. Configuration types: 0-shot = no examples provided in the prompt at inference; 5-shot = five labeled examples provided in the prompt before the query; FT = fine-tuned on task-specific data; IFT = instruction fine-tuned on task-specific data.
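As a reading aid for the metric columns, the sketch below recomputes F1 as the harmonic mean of precision and recall and compares it with the reported value. It assumes the reported F1 is derived from the reported precision and recall in that way, which this page does not state; the helper names and the tolerance are illustrative choices. The three rows are copied from Table 2.

```python
from typing import Dict, List, Tuple


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from Table 2.
TABLE_2_ROWS: Dict[str, Tuple[float, float, float]] = {
    "SpotFake (XLNET + VGG-19), FT": (0.7415, 0.6790, 0.7089),
    "Phi-3-vision-128k-instruct, IFT": (0.7800, 0.8000, 0.7899),
    "Llama-3.2-11B-Vision-Instruct, IFT": (0.7893, 0.8838, 0.8060),
}


def inconsistent_rows(
    rows: Dict[str, Tuple[float, float, float]],
    tolerance: float = 0.001,  # illustrative threshold for rounding noise
) -> List[str]:
    """Return row names whose reported F1 differs from the harmonic mean."""
    flagged = []
    for name, (precision, recall, reported_f1) in rows.items():
        recomputed = f1_from_precision_recall(precision, recall)
        if abs(recomputed - reported_f1) > tolerance:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for name in inconsistent_rows(TABLE_2_ROWS):
        print("reported F1 differs from harmonic mean:", name)
```

Under this assumption the SpotFake and Phi-3 rows reproduce to four decimals, while the Llama-3.2-11B-Vision-Instruct IFT row does not. A mismatch like this may mean that row was scored with a different averaging scheme (for example, per-class averaging), so it is worth checking how each metric was computed before comparing rows.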

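Taken together, the two tables give a reference for choosing a model based on task requirements. On text, the fine-tuned SLMs reach test accuracies between 0.8710 and 0.8940, with RoBERTa-base the highest. Among the LLMs, Qwen2.5-7B in the 5-shot configuration comes closest at 0.8900, and instruction fine-tuned Mistral-v0.3 7B has the highest F1 (0.9062). Instruction fine-tuning raises accuracy over 0-shot for the Llama and Mistral instruct models, but not for Qwen2.5-7B. On text-image inputs, instruction fine-tuned Llama-3.2-11B-Vision-Instruct has the highest accuracy (0.9040), followed by the fine-tuned FND-CLIP (0.8971), while the 0-shot and 5-shot vision language configurations stay between 0.6338 and 0.7299.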
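Selecting a model from these results can be sketched as a filter plus a maximum over the metric that matters for the task. The example below is a minimal illustration using four rows copied from Table 1; the `Result` type and the `best_by` helper are hypothetical names, not part of the annotation framework.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Result(NamedTuple):
    model: str
    config: str
    precision: float
    recall: float
    f1: float
    accuracy: float


# A subset of Table 1, copied as reported.
TABLE_1_SUBSET = [
    Result("RoBERTa-base", "FT", 0.8940, 0.8940, 0.8940, 0.8940),
    Result("Llama 3.1-8B-instruct", "IFT", 0.8019, 0.8019, 0.8019, 0.8180),
    Result("Mistral-v0.3 7B-instruct", "IFT", 0.8890, 0.9240, 0.9062, 0.7980),
    Result("Qwen2.5-7B", "5-shot", 0.8660, 0.8790, 0.8724, 0.8900),
]


def best_by(
    results: Iterable[Result],
    metric: str,
    configs: Optional[Set[str]] = None,
) -> Result:
    """Return the row with the highest value of `metric`.

    `configs` optionally restricts the search, e.g. to {"0-shot", "5-shot"}
    when no training budget is available.
    """
    candidates = [r for r in results if configs is None or r.config in configs]
    if not candidates:
        raise ValueError("no results match the requested configurations")
    return max(candidates, key=lambda r: getattr(r, metric))


if __name__ == "__main__":
    print(best_by(TABLE_1_SUBSET, "accuracy"))
    print(best_by(TABLE_1_SUBSET, "f1"))
    print(best_by(TABLE_1_SUBSET, "accuracy", configs={"0-shot", "5-shot"}))
```

Which metric to maximize depends on the task: recall matters more when missed positives are costly, precision when false alarms are, and restricting `configs` captures whether fine-tuning is feasible at all.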