This repository contains the open-source code for the core evaluation components of our paper on reliability-aware automatic speech recognition (ASR): the definition of RAS (Reliability-Aware Score) and the code used to fit the trade-off parameter alpha from human preference annotations. The corresponding paper is available on arXiv now.
This release focuses on two pieces from the paper:
- RAS metric computation
  RAS extends edit-distance-based ASR evaluation by introducing a placeholder token (default: <ph>) for abstention. It discounts placeholder-related errors by a factor alpha, following the dynamic programming formulation described in the paper.
- Human-aligned alpha fitting
  alpha is not chosen heuristically. Instead, it is fit from listening-test preference data using the Bradley-Terry style objective described in the paper.
./
├── RAS.py
├── fit_alpha.py
├── requirements.txt
└── third_party/
    └── normalizers/
        ├── __init__.py
        ├── basic.py
        ├── english.py
        └── english.json
- RAS.py
  Main implementation of the RAS metric. It:
  - normalizes English and code-switching text,
  - converts Traditional Chinese to Simplified Chinese with OpenCC,
  - merges consecutive placeholder tokens,
  - computes abstention-aware alignment with dynamic programming,
  - returns detailed counts such as C, S, D, I, S_ph, I_ph, and the final RAS.
- fit_alpha.py
  Fits the trade-off parameter alpha from human listening-test annotations. It:
  - aggregates pairwise preference counts over transcript A / transcript B / tie,
  - reads precomputed transcript metrics,
  - optimizes alpha with PyTorch and Adam,
  - supports tie-aware fitting through the lambda_tie term.
- requirements.txt
  Python dependencies needed for the released code.
- third_party/normalizers/__init__.py
  Package entry for the text normalization utilities.
- third_party/normalizers/basic.py
  Basic text normalization helpers adapted from OpenAI Whisper, including punctuation/symbol cleanup and Unicode normalization.
- third_party/normalizers/english.py
  English text normalization utilities adapted from OpenAI Whisper, including number normalization and English-specific cleanup used before metric computation.
- third_party/normalizers/english.json
  Lookup/configuration data used by the English normalizer.
We recommend Python 3.10+.
cd Reliable-ASR/open-source
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
If opencc installation fails on your platform, install the corresponding system package for OpenCC first, then rerun pip install -r requirements.txt.
The main entry point is compute_metrics in RAS.py.
from RAS import compute_metrics
ref = "the patient has a history of diabetes"
hyp = "the patient has <ph> of diabetes"
metrics = compute_metrics(ref, hyp, alpha=0.5064)
print(metrics)
Example returned fields:
{
"N": ...,
"C": ...,
"S": ...,
"D": ...,
"I": ...,
"S_ph": ...,
"I_ph": ...,
"ph_errors": ...,
"non_ph_errors": ...,
"RAS": ...
}
Consistent with the paper, the implementation computes:
RAS = (C - (non_placeholder_errors + alpha * placeholder_errors)) / N
where:
- C is the number of correct matches,
- N is the reference length,
- non_placeholder_errors = S + D + I,
- placeholder_errors = S_ph + I_ph.
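As a quick check of the formula above, the following minimal sketch evaluates RAS directly from alignment counts. The helper name ras_from_counts is hypothetical and not part of RAS.py, and the counts in the example are made up for illustration rather than produced by compute_metrics.

```python
def ras_from_counts(C, S, D, I, S_ph, I_ph, N, alpha=0.5064):
    """Evaluate RAS = (C - (non_ph + alpha * ph)) / N from alignment counts."""
    if N <= 0:
        raise ValueError("reference length N must be positive")
    non_placeholder_errors = S + D + I
    placeholder_errors = S_ph + I_ph
    return (C - (non_placeholder_errors + alpha * placeholder_errors)) / N


# Illustrative counts only; they are not the output of compute_metrics.
score = ras_from_counts(C=8, S=1, D=0, I=0, S_ph=1, I_ph=0, N=10, alpha=0.5)
print(score)  # 0.65
```

With alpha = 0 placeholder errors cost nothing, and with alpha = 1 they cost as much as ordinary errors, so the same counts would score 0.7 and 0.6 respectively.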
The default placeholder token is <ph>.
Before alignment, the code:
- normalizes English text,
- tokenizes code-switching text,
- converts Traditional Chinese to Simplified Chinese,
- collapses consecutive <ph> tokens into a single placeholder.
Apply the same preprocessing if you want your results to match the paper implementation.
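The placeholder-collapsing step can be sketched as follows. This is an illustration only: the function name collapse_placeholders is hypothetical, and the example splits on whitespace, whereas RAS.py applies its own normalization and code-switching tokenization first.

```python
def collapse_placeholders(tokens, placeholder="<ph>"):
    """Replace every run of consecutive placeholder tokens with a single one."""
    collapsed = []
    for token in tokens:
        # Skip a placeholder that directly follows another placeholder.
        if token == placeholder and collapsed and collapsed[-1] == placeholder:
            continue
        collapsed.append(token)
    return collapsed


# Whitespace tokenization is an assumption made for this example.
tokens = "the patient has <ph> <ph> <ph> of diabetes".split()
print(collapse_placeholders(tokens))
# ['the', 'patient', 'has', '<ph>', 'of', 'diabetes']
```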
fit_alpha.py is a release of the fitting logic used for the human-alignment experiment. Before running it, you need to replace the placeholder paths at the top of the file:
data_dir = 'path/to/human_choices'
wer_path = 'path/to/listen_test_full.json'
You also need to implement:
def get_path(str):
    return 'path/to/audio.wav'
This function should map a listening-test item ID to the corresponding audio path, because the script filters invalid annotations partly based on audio duration and response time.
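One possible way to fill in get_path is sketched below, together with a duration lookup of the kind a response-time filter would need. Everything here is an assumption: AUDIO_ROOT, the one-file-per-item layout, wav_duration_seconds, is_too_fast and its min_ratio threshold are hypothetical, and the actual filtering rule lives in fit_alpha.py.

```python
import wave
from pathlib import Path

AUDIO_ROOT = Path("path/to/audio")  # hypothetical directory of <item_id>.wav files


def get_path(item_id):
    """Map a listening-test item ID to its audio file (hypothetical layout)."""
    return str(AUDIO_ROOT / f"{item_id}.wav")


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, read from its header with the standard library."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_too_fast(response_seconds, audio_seconds, min_ratio=1.0):
    """Hypothetical rule: flag answers given before min_ratio of the audio has played."""
    return response_seconds < min_ratio * audio_seconds
```

The wave module only reads uncompressed WAV files, so other formats would need a different duration lookup.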
Then run:
python fit_alpha.py
The script will:
- read all human annotation JSON files under data_dir,
- resolve whether each preference corresponds to transcript A or B,
- filter overly fast responses,
- aggregate counts kA, kB, kC,
- optimize alpha.
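To make the last two steps concrete, here is a dependency-free sketch of fitting alpha from aggregated counts. It is not the released objective: it swaps PyTorch and Adam for a plain grid search, models P(A preferred) as sigmoid(beta * (RAS_A - RAS_B)) in Bradley-Terry fashion, and counts a tie as half a vote for each side instead of using the lambda_tie term. The item schema, the beta sharpness parameter, and all function names are assumptions made for this example.

```python
import math


def ras(metrics, alpha):
    """RAS for one transcript from its precomputed counts."""
    errors = metrics["non_ph_errors"] + alpha * metrics["ph_errors"]
    return (metrics["C"] - errors) / metrics["N"]


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def negative_log_likelihood(alpha, items, beta=10.0):
    """Bradley-Terry loss with P(A preferred) = sigmoid(beta * (RAS_A - RAS_B))."""
    loss = 0.0
    for item in items:
        margin = beta * (ras(item["A"], alpha) - ras(item["B"], alpha))
        votes_a = item["kA"] + 0.5 * item["kC"]  # assumption: a tie is half a vote each
        votes_b = item["kB"] + 0.5 * item["kC"]
        loss -= votes_a * log_sigmoid(margin) + votes_b * log_sigmoid(-margin)
    return loss


def fit_alpha_grid(items, steps=100):
    """Return the alpha in {0, 1/steps, ..., 1} with the lowest loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda a: negative_log_likelihood(a, items))


# Made-up item: transcript A has two ordinary errors, B has two placeholder errors.
items = [{
    "A": {"C": 8, "non_ph_errors": 2, "ph_errors": 0, "N": 10},
    "B": {"C": 8, "non_ph_errors": 0, "ph_errors": 2, "N": 10},
    "kA": 30, "kB": 60, "kC": 10,
}]
print(fit_alpha_grid(items))
```

A grid search is enough here because alpha is a single scalar; a gradient-based optimizer such as Adam becomes useful once further parameters are fit jointly.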
The current script assumes:
- a directory of JSON files containing listening-test annotations,
- a JSON file (wer_path) containing per-item transcript metrics for systems A and B,
- accessible audio files for duration-based filtering.
Because this release is extracted from the paper codebase, you will likely need to adapt file paths and the exact data schema to your own annotation export format.
- RAS.py currently sets the default ALPHA = 0.5064, which is the fitted value reported in the paper.
- fit_alpha.py is the fitting script used to estimate that value from human preferences.
- The implementation follows the paper’s abstention-aware dynamic programming formulation, where one placeholder may align to multiple consecutive reference tokens.
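The idea that one placeholder may align to several consecutive reference tokens can be illustrated with a small edit-distance variant. The cost model below is an assumption made for this sketch, not the paper's definition: ordinary substitutions, deletions and insertions cost 1, and a placeholder costs alpha once whether it is inserted or covers a run of reference tokens. The function name placeholder_edit_cost is hypothetical; use compute_metrics in RAS.py for actual scores.

```python
def placeholder_edit_cost(ref, hyp, alpha=0.5064, placeholder="<ph>"):
    """Minimum alignment cost where a placeholder may cover a run of reference tokens."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum cost of aligning ref[:i] with hyp[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + 1.0  # delete ref[i-1]
    for j in range(1, m + 1):
        is_ph = hyp[j - 1] == placeholder
        dp[0][j] = dp[0][j - 1] + (alpha if is_ph else 1.0)  # insertion
        for i in range(1, n + 1):
            delete = dp[i - 1][j] + 1.0
            if is_ph:
                # k == i: inserted placeholder; k < i: placeholder covers ref[k:i]
                consume = min(dp[k][j - 1] for k in range(i + 1)) + alpha
            else:
                same = ref[i - 1] == hyp[j - 1]
                substitute = dp[i - 1][j - 1] + (0.0 if same else 1.0)
                insert = dp[i][j - 1] + 1.0
                consume = min(substitute, insert)
            dp[i][j] = min(delete, consume)
    return dp[n][m]


ref = "the patient has a history of diabetes".split()
hyp = "the patient has <ph> of diabetes".split()
print(placeholder_edit_cost(ref, hyp, alpha=0.5))  # 0.5
```

In this example the single placeholder absorbs "a history" for a total cost of alpha, whereas a plain edit distance would charge one substitution plus one deletion.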
If you use this code, please cite our paper:
@misc{huang2026rasreliabilityorientedmetric,
title={RAS: a Reliability Oriented Metric for Automatic Speech Recognition},
author={Wenbin Huang and Yuhang Qiu and Bohan Li and Yiwei Guo and Jing Peng and Hankun Wang and Xie Chen and Kai Yu},
year={2026},
eprint={2604.24278},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2604.24278},
}