👋 Hi, everyone! AReno is a fast, effortless, and self-contained toolkit that scales RL post-training up locally, initiated by the inclusionAI ASystem team and maintained by the AReno community.
AReno is a local LLM post-training toolkit for RL, SFT/DPO-style training, serving, and agentic RL. It was originally developed by engineers from the ASystem Team at Ant Group.
Built on a self-contained, full-stack design, AReno is optimized to extract maximum performance from a single node, making it well-suited for fast, local post-training with no external training or inference backend in the loop.
AReno's mission is to make LLM RL accessible for a broad community of researchers and developers — so you can go from a base checkpoint to a trained, served model on a single node, without standing up a cluster or wiring together a training framework, an inference server, and a kernel library.
Small but complete, like its name — nano in footprint, full-stack in capability. We hope AReno makes scaling up your ideas locally both fast and delightful. Enjoy!
- ✨ Plug-and-play: various post-training methods are easily accessible via the
--algoflag or the sameTrainerclass from Python, no cluster or launcher to set up. - 🪶 Lightweight: single self-contained package, no external training/inference backend, just PyTorch, FlashAttention, and a handful of other libraries.
- 🧰 Agentic RL ready: run an agent function against AReno's local OpenAI-compatible proxy, return explicit trajectories, and train from tokens, logprobs, rewards, and loss masks derived by the trainer.
- 🧩 Extensible: easily register new algorithms, model adapters, reward functions, and hardware backends without changing the core.
Requirements:
- Linux with an NVIDIA GPU (CUDA compute capability 8.0+)
- CUDA toolkit, with
CUDA_HOMEset (sonvccis on the build path) - PyTorch >= 2.6, matching your installed CUDA version
Other platforms: Apple Silicon (M-series) and AMD GPUs are not supported — the engine requires NVIDIA CUDA. On Windows, install under WSL2 and follow the Linux instructions. DGX Spark and other Grace/Blackwell systems work, but install an
aarch64PyTorch build first.
Compatibility matrix:
| Environment | Status | Notes |
|---|---|---|
| Linux x86_64 + NVIDIA GPU | Supported | Primary training/serving target. Use CUDA-enabled PyTorch >= 2.6 and build areno_accel. |
| Linux aarch64 / Grace-Blackwell | Supported | Install a matching aarch64 CUDA PyTorch build first; build from source with --no-build-isolation. |
| Windows WSL2 + NVIDIA GPU | Supported | Follow the Linux install path inside WSL2. Native Windows is not supported. |
| macOS Apple Silicon | Metadata/docs only | Use ARENO_BUILD_EXT=0 for docs or packaging checks. Training/serving is not supported. |
| CPU-only environments | Metadata/docs/tests only | CPU-only PyTorch can run lightweight docs/tests, but cannot train or serve AReno models. |
To install:
pip install psutil
pip install flash-linear-attention
pip install areno --no-build-isolation--no-build-isolation is required so that pip uses your existing CUDA-enabled PyTorch instead of installing a CPU-only torch in an isolated build environment.
Because build isolation is disabled, build-time helpers are not installed automatically; psutil must already be present because PyTorch's CUDA extension builder imports it while sizing parallel compile jobs.
Install flash-attn only when using the default high-throughput --attn-backend flash path. If you run with --attn-backend native, or AReno automatically falls back to native attention on Turing GPUs like T4, flash-attn is optional and does not need to be installed.
Post-install readiness check:
areno check
areno env --json # attach this to setup/support reportsareno check fails fast with next steps for common setup problems such as missing or CPU-only PyTorch, unsupported PyTorch versions, missing CUDA_HOME/nvcc, missing build-time dependencies, unsupported platforms, or a skipped areno_accel build. Use areno env --json when opening an issue so maintainers can see the Python, CUDA, PyTorch, GPU, and extension state without guessing from low-level build errors.
From source (recommended if you want the examples or plan to contribute):
git clone https://github.com/inclusionAI/AReno.git
cd AReno
pip install psutil
pip install flash-linear-attention
pip install -e . --no-build-isolationDocker setup escape hatch (recommended when you want to verify AReno before debugging local build state):
docker build -t areno .
docker run --gpus all --rm -it areno areno checkIf you need local project files, model files, or a Hugging Face cache inside the container:
docker run --gpus all --rm -it \
-v $PWD:/workspace \
-v $HOME/.cache/huggingface:/root/.cache/huggingface \
areno \
areno checkHost checklist before blaming AReno setup:
nvidia-smi
docker run --gpus all --rm nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
docker run --gpus all --rm areno areno checkDocker gives you a known-good Python/PyTorch/CUDA user-space install path and reuses the same areno check diagnostic flow. It does not replace host requirements: the host still needs a working NVIDIA driver, NVIDIA Container Toolkit support for --gpus all, and a driver new enough for the container CUDA runtime. Docker also does not solve model downloads, Hugging Face tokens, cache paths, network access, disk space, or multi-node networking; those remain user environment concerns.
Tips:
- Install
ninja(pip install ninja) before building so CUDA kernels compile in parallel. - If installation fails with
No module named 'psutil', install it first (pip install psutil) and retry. This is required specifically for--no-build-isolationbuilds. - Install
flash-attnbefore AReno only if you plan to use--attn-backend flash, the default high-throughput backend:If buildingpip install flash-attn
flash-attnfrom source is too slow for your environment, install a pre-built wheel from the flash-attention releases that matches your Python, PyTorch, CUDA, and platform. - If you use
--attn-backend native,flash-attnis optional. AReno also automatically falls back to native attention on flash-attn-unsupported GPUs such as Tesla T4 and prints a warning that native attention is a slower compatibility path. - By default, source builds target the visible GPU architecture. To build for a specific GPU family or when building on a host where the target GPU is not visible, set
TORCH_CUDA_ARCH_LISTexplicitly. Common values are9.0for H100/H200,8.0for A100, and8.9for L40/RTX 4090:TORCH_CUDA_ARCH_LIST="9.0" MAX_JOBS=64 pip install -e . --no-build-isolation
- If your machine has many CPU cores but limited RAM, cap the parallel build jobs with
MAX_JOBS:MAX_JOBS=4 pip install -e . --no-build-isolation - For iterative CUDA development, enable
ccachebefore rebuilding:export CC="ccache gcc" export CXX="ccache g++"
- To install the Python package without building the CUDA extension (for docs/metadata or a dry run), set
ARENO_BUILD_EXT=0. The engine will not run without the extension, but the installation will succeed.
With the SDK, RL loop is a short cycle of Trainer calls. Each step below maps a concept to the SDK call that performs it:
flowchart LR
A["Trainer<br/>init()"] -->
B["rollout_batch<br/>on-policy samples"] -->
C["reward fn<br/>score"] -->
D["train<br/>optimizer step"] -->|repeat| B
- Create the trainer — construct a
Traineron the AReno backend andinit()it to load the tokenizer and start workers. - Roll out — inside
rollout_session(...),rollout_batch(...)generates on-policy completions for each prompt. - Score — reward each completion and turn rewards into advantages (your reward function, not AReno's).
- Train — pack the rollout into
TrainSequenceobjects and calltrain(batch, loss_fn)to run one optimizer step. - Repeat — new weights produce new rollouts; loop until done, then
close().
import asyncio
from functools import partial
from datasets import load_dataset
from areno.api import (
Areno,
ArenoConfig,
SamplingParams,
Trainer,
TrainSequence,
gspo_loss_fn,
)
from examples.math.math_verify_reward import reward_fn
def to_advantages(rewards):
mean = sum(rewards) / len(rewards)
var = sum((r - mean) ** 2 for r in rewards) / max(len(rewards), 1)
std = max(var**0.5, 1e-6)
return [(r - mean) / std for r in rewards]
async def main():
# 1. Create the trainer
trainer = Trainer(
world_size=1,
model_path="Qwen/Qwen3-0.6B",
backend_type=Areno,
custom_config=ArenoConfig(tp_size=1),
)
trainer.init()
try:
# 2. Roll out on-policy completions for one GSM8K prompt
row = load_dataset("gsm8k", "main", split="train[0:1]")[0]
prompt = (
"Solve the problem and put the final answer in \\boxed{}.\n\n"
f"Problem: {row['question']}\nSolution:"
)
prompt_tokens = trainer.get_tokenizer().encode(prompt)
sampling = SamplingParams(max_new_tokens=512, temperature=1.0)
async with trainer.rollout_session(sampling_params=sampling, proxy=False):
rollout = trainer.rollout_batch(
[prompt],
n_samples=8,
sampling_params=sampling,
)[0]
# 3. Score with the same reward function the CLI uses, then form advantages
completions = [trainer.get_tokenizer().decode(seq.resp_tokens) for seq in rollout.sequences]
rewards = reward_fn(row, completions)
advantages = to_advantages(rewards)
batch = []
for seq, reward, advantage in zip(rollout.sequences, rewards, advantages, strict=True):
response_len = len(seq.resp_tokens)
batch.append(
TrainSequence(
prompt_mask=[True] * len(prompt_tokens) + [False] * response_len,
tokens=prompt_tokens + seq.resp_tokens,
logprobs=[0.0] * len(prompt_tokens) + seq.resp_logprobs,
advantages=[0.0] * len(prompt_tokens) + [advantage] * response_len,
reward=reward,
eos_token_id=trainer.get_tokenizer().eos_token_id,
)
)
# 4. Train one step
stats = trainer.train(batch, partial(gspo_loss_fn, clip_eps=3.0e-4), mini_bs=4)
# 5. Repeat the loop over more prompts
finally:
trainer.close()
asyncio.run(main())See the documentation for the full Trainer API.
You can use the AReno Command Line Interface (CLI) to quickly get started with post-training without writing any Python.
Check whether the current machine is ready to run AReno:
areno checkareno check prints OK, WARN, and FAIL statuses with concrete next steps for common setup issues such as missing CUDA, CPU-only PyTorch, missing CUDA_HOME, unavailable nvcc, missing optional runtime dependencies, or a missing areno_accel extension.
For issue reports, collect a descriptive environment report:
areno env --jsonThe report includes AReno, Python, platform, PyTorch/CUDA, GPU, nvcc, dependency import status, and relevant environment variables.
Use this command when you only want to check that a machine can run one small official training task end to end:
areno train \
--ckpt Qwen/Qwen3-0.6B \
--dataset-path gsm8k:main \
--dataset-loader-fn examples/math/dataset_loader.py \
--reward-fn-path examples/math/math_verify_reward.py \
--algo gspo \
--tp-size 1 \
--world-size 1 \
--batch-size 1This is a smoke/sanity task for the CLI, dataset loader, reward function, rollout, and training-step wiring. It is not a quality benchmark. It requires a CUDA-capable NVIDIA GPU; CPU-only machines can install the package for docs and metadata checks, but cannot run the AReno training engine. A successful run should reach rollout logs and a train_stats=... line without raising an exception.
Run GSPO on a GSM8K-style dataset with a reward function:
areno train \
--ckpt Qwen/Qwen3-0.6B \
--dataset-path gsm8k:main \
--dataset-loader-fn examples/math/dataset_loader.py \
--reward-fn-path examples/math/math_verify_reward.py \
--algo gspo \
--tp-size 4--ckpt and --dataset-path accept either local paths or Hugging Face repo IDs. Switch algorithms by changing --algo (e.g. --algo grpo, --algo sft).
For Agentic RL, add --agent-fn to supply an agent function. The agent calls the local OpenAI-compatible endpoint, including tools and tool_choice when needed, and returns explicit AgentTrajectoryTurn objects. AReno converts those turns into trainable assistant outputs and masks tool results by default:
python examples/agentic/tictactoe/dataset_generator.py \
--output /tmp/areno-tictactoe.jsonl \
--count 2048 \
--seed 2026areno train \
--ckpt Qwen/Qwen3-0.6B \
--dataset-path /tmp/areno-tictactoe.jsonl \
--dataset-loader-fn examples/agentic/tictactoe/dataset_loader.py \
--reward-fn-path examples/agentic/tictactoe/reward.py \
--agent-fn examples/agentic/tictactoe/run_agent.py \
--algo gspo \
--tp-size 1 \
--world-size 1DuelGrid is a larger agentic demo with a browser game UI and multi-action turns. Before GSPO/RLVR post-training, Gemma-E2B-it performs poorly in this game and often moves back and forth without progress. After training, it learns to collect pickups, chase the user, attack when in range, and avoid trap tiles.
| Train before | Reward | Train after |
|---|---|---|
![]() |
![]() |
![]() |
See examples/agentic/duelgrid for the rule engine, fixed-path dataset loader,
reward function, OpenAI-compatible agent, and browser UI.
For the full list of training options, run areno train --help.
Serve a trained checkpoint as an OpenAI-compatible endpoint with continuous batching:
areno serve \
--model-path /path/to/model \
--tp-size 1 \
--world-size 1 \
--port 8000Point any OpenAI client at http://localhost:8000/v1/chat/completions to start generating. For the full list of serving options, run areno serve --help.
If you want to contribute to AReno or customize it for your own needs, read the contribution guide and make a development install:
git clone https://github.com/inclusionAI/AReno.git
cd AReno
pip install psutil
pip install flash-linear-attention
# Optional: install flash-attn when developing against --attn-backend flash.
pip install flash-attn
pip install -e . --no-build-isolation
# Set up pre-commit hooks (formatting, linting, commit message checks)
pip install pre-commit
pre-commit install --install-hooksNew algorithms, model adapters, kernels, reward functions, and hardware backends all have first-class extension points, so most contributions land without forking the core.
If you find the project helpful, please cite:
@misc{areno2026,
title = {AReno: A Self-Contained, Full-Stack Toolkit for Single-Node LLM RL Post-Training},
author = {Zibo He and Le Su and Zongyu Li and Xiaowei Zhu and Cheng Wang and Zhenxuan Pan},
year = {2026},
url = {https://github.com/inclusionAI/AReno},
license = {Apache-2.0}
}AReno's API design is inspired by Tinker from ThinkingMachines. We would like to express our gratitude for their pioneering work.
This repository's source code is available under the Apache 2.0 License.



