Skip to content

Benchmarking Annotation Framework

Overview

This page presents the performance benchmarking of Small Language Models (SLMs) and Large Language Models (LLMs) within our annotation framework. The objective is to evaluate how these models perform in tasks involving text and multimodal data (text + image). For this benchmarking, SLMs are defined as models with fewer parameters, typically below 15 million, such as BERT and GPT-2. In contrast, LLMs—including models like Llama3, Mistral, Gemma, and Phi—possess hundreds of millions to billions of parameters. This scale difference highlights the trade-off between efficiency and complexity when handling various tasks and datasets.

Benchmarking Results: Text-Based Models

Model Configuration Precision Recall F1 Test Accuracy
Small Language Models
BERT-base-uncased FT 0.8887 0.8870 0.8878 0.8870
DistilBERT FT 0.8665 0.8554 0.8609 0.8710
RoBERTa-base FT 0.8940 0.8940 0.8940 0.8940
GPT2 FT 0.8762 0.8751 0.8756 0.8751
BART FT 0.8762 0.8760 0.8761 0.8760
Large Language Models
Llama 3.1-8B-instruct 0-shot 0.8280 0.6890 0.7521 0.7200
5-shot 0.8400 0.7700 0.8035 0.7905
IFT 0.8019 0.8019 0.8019 0.8180
Llama 3.1-8B (base) FT 0.8800 0.8600 0.8699 0.8320
Llama 3.2-3B-instruct 0-shot 0.7386 0.7550 0.7467 0.6897
5-shot 0.7989 0.6840 0.7370 0.6133
IFT 0.8390 0.7984 0.8182 0.8084
Llama 3.2-3B (base) FT 0.8400 0.8500 0.8450 0.8200
Mistral-v0.3 7B-instruct 0-shot 0.8153 0.5250 0.6387 0.6990
5-shot 0.8319 0.8134 0.8225 0.7830
IFT 0.8890 0.9240 0.9062 0.7980
Mistral-v0.3 7B (base) FT 0.8200 0.7400 0.7779 0.8014
Qwen2.5-7B 0-shot 0.8576 0.8576 0.8576 0.8576
5-shot 0.8660 0.8790 0.8724 0.8900
IFT 0.8357 0.8474 0.8415 0.8474

Table 1: Performance metrics for various language models and configurations. Configuration types: 0-shot = No prior examples used for inference, 5-shot = Five examples provided for context before inference, FT = Fine-tuned on task-specific data, IFT = Instruction fine-tuned with targeted training.

Model Config. (Text-Image) Precision Recall F1 Test Accuracy
Small Language Models
SpotFake (XLNET + VGG-19) FT 0.7415 0.6790 0.7089 0.8151
BERT + ResNet-34 FT 0.8311 0.6277 0.7152 0.8248
FND-CLIP (BERT and CLIP) FT 0.6935 0.7151 0.7041 0.8971
Distill-RoBERTa and CLIP FT 0.7000 0.6600 0.6794 0.8600
Large Vision Language Models
Phi-3-vision-128k-instruct 0-shot 0.7400 0.6700 0.7033 0.7103
Phi-3-vision-128k-instruct 5-shot 0.7600 0.7200 0.7395 0.7024
Phi-3-vision-128k-instruct IFT 0.7800 0.8000 0.7899 0.7200
LLaVA-1.6 0-shot 0.7531 0.6466 0.6958 0.6500
LLaVA-1.6 5-shot 0.7102 0.6893 0.6996 0.6338
Llama-3.2-11B-Vision-Instruct 0-shot 0.6668 0.7233 0.6939 0.7060
Llama-3.2-11B-Vision-Instruct 5-shot 0.7570 0.7630 0.7600 0.7299
Llama-3.2-11B-Vision-Instruct IFT 0.7893 0.8838 0.8060 0.9040

Table 2: Performance metrics for various small and large language models in text-image configurations. Configuration types: 0-shot = No prior examples used for inference, 5-shot = Five examples provided for context before inference, FT = Fine-tuning, IFT = Instruction Fine-tuning.

This benchmarking offers an insightful overview of how various models, ranging from smaller to large-scale, perform in distinct environments and tasks. The text-based and multimodal benchmarks reflect the strength of these models in handling the complexities of both textual data and combined text-image inputs, providing a useful reference for selecting the appropriate model based on the task requirements.