Arabic Dialect Identification (ADI)

Overview

This repository contains code and scripts to train Arabic Dialect Identification (ADI) using Whisper and ECAPA-TDNN.

The project involves:

Processing and preparing the ADI-17 and ADI-20 datasets and their subsets.
Training and evaluating models for ADI.
Experiment tracking and result reporting.

Repository structure:

├── dataset             # Guides on how to acquire datasets
├── data_preparation    # Data preparation scripts for reprocucibility
│   └── manifests       # Datset manifests (CSV files)
└── recipes             # Modeling directory: Contains training recipes
    └── hparams         # Hyper-parameters for recipes (YAML files)

Download Trained Models

We share our best model overall: Whisper-large-v3 Encoder with Attention Pooling (Frozen lower layers + data augmentation) trained on ADI-20-53h on HuggingFace here.
We also make available our best perforing model trained on ADI-17-full on HuggingFace here.

Installation

Requirements

Make sure your setup meets the following requirements.

Python 3.10 environment
SpeechBrain Intalled in your environment (Preferably in editable mode).
FFMPEG or SoX (for audio processing)

To install dependencies:

pip install -r requirements.txt

Dataset Preparation

The project uses the ADI-17 and ADI-20 datasets, which consist of Arabic dialect speech segments. Ensure that you replace the placeholder in the manifest CSVs with the location of the datasets you have downloaded.

Pre-made manifests are available to downoad for your convenience at this link. All utterance segmentation is already prepared.

For licencing reasons, we do not share the audio files of the dataset. Instead, you can use the IDs of the YouTube videos to download them yourself. Only the files are needed as the segmentation and labelling of the audios are already done in the CSV manifests.
Make sure to resample all your files to mono 16khz wav format.

Running Experiments

Preprocessing Audio Data

Run the preprocessing script to segment and prepare the audio files:

python data_preparation/preprocess.py --dataset_path /path/to/data

Training a Model

To train a dialect classifier:

python recipes/train_<model>.py --config recipes/hparams/<experimpent_name>.yaml

Evaluating a Model

To evaluate on a test set:

cd model
python train_<model_category>.py --test_only \
    --test_csv=<test_csv> \
    --eval_batch_size=8 \ 
    --fewer_eval_classes=True

Note that this will also generate classification reports and confusions matrices. You will find them in the save_folder of the model you want to evaluate. These files are also generated at the end of each epoch after validation.

Experiment Tracking

The repository includes a script to track experiment results (validation scores only) and summarize metrics:

./results.sh            # Show results in terminal.
./export_results.sh     # Export results to CSV.

This generates a table with details on best epochs, F1 scores, and precision-recall metrics.

NADI 2025

We have also used this model for dialect identification task in the NADI 2025 challenge and ranked first:

For more information on how we used the model, you can refer to:

Our system paper: arXiv, ACL Anthology
NADI findings paper: arXiv, ACL Anthology

Citations

If using this work, please cite:

@inproceedings{elleuch25_interspeech,
  title     = {{ADI-20: Arabic Dialect Identification dataset and models}},
  author    = {Haroun Elleuch and Salima Mdhaffar and Yannick Estève and Fethi Bougares},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {2775--2779},
  doi       = {10.21437/Interspeech.2025-884},
  issn      = {2958-1796},
}

@inproceedings{elleuch-etal-2025-elyadata,
    title = "{ELYADATA} {\&} {LIA} at {NADI} 2025: {ASR} and {ADI} Subtasks",
    author = "Elleuch, Haroun  and
      Saidi, Youssef  and
      Mdhaffar, Salima  and
      Est{\`e}ve, Yannick  and
      Bougares, Fethi",
    booktitle = "Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.arabicnlp-sharedtasks.105/",
    doi = "10.18653/v1/2025.arabicnlp-sharedtasks.105",
    pages = "762--766",
    ISBN = "979-8-89176-356-2",
}

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
data_preparation		data_preparation
dataset		dataset
inference		inference
recipes		recipes
.gitignore		.gitignore
README.md		README.md
export_results.sh		export_results.sh
nadi_leaderboard.png		nadi_leaderboard.png
requirements.txt		requirements.txt
results.sh		results.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Arabic Dialect Identification (ADI)

Overview

Download Trained Models

Installation

Requirements

Dataset Preparation

Running Experiments

Preprocessing Audio Data

Training a Model

Evaluating a Model

Experiment Tracking

NADI 2025

Citations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Arabic Dialect Identification (ADI)

Overview

Download Trained Models

Installation

Requirements

Dataset Preparation

Running Experiments

Preprocessing Audio Data

Training a Model

Evaluating a Model

Experiment Tracking

NADI 2025

Citations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages