
Draft for a DeepL Voice implementation #75

Open
DeeJayTC wants to merge 1 commit into main from tc/add-voice

Conversation

@DeeJayTC (Member)

DeepL Voice API Integration

Real-time speech transcription and translation support for the DeepL .NET SDK, implementing the DeepL Voice API.

Overview

The Voice API uses a two-step flow:

  1. Request a session — POST to v3/voice/realtime to obtain a WebSocket URL and ephemeral token
  2. Stream via WebSocket — Send audio chunks, receive real-time transcripts and translations

This implementation exposes the flow through DeepLClient.CreateVoiceSessionAsync(), returning an IVoiceSession that wraps the WebSocket lifecycle with an event-driven API.

Usage

```csharp
using var client = new DeepLClient("your-auth-key");

var session = await client.CreateVoiceSessionAsync(new VoiceSessionOptions {
    SourceMediaContentType = SourceMediaContentType.PcmS16le16000,
    SourceLanguage = "en",
    SourceLanguageMode = SourceLanguageMode.Fixed,
    TargetLanguages = new[] { "de", "fr" },
    Formality = "formal"
});

session.SourceTranscriptUpdated += (s, update) => {
    var text = string.Join("", Array.ConvertAll(update.Concluded, seg => seg.Text));
    Console.WriteLine($"[Source] {text}");
};

session.TargetTranscriptUpdated += (s, update) => {
    var text = string.Join("", Array.ConvertAll(update.Concluded, seg => seg.Text));
    Console.WriteLine($"[{update.Language}] {text}");
};

session.StreamEnded += (s, e) => Console.WriteLine("Stream complete");

// Stream audio chunks (50-250 ms recommended)
await session.SendAudioAsync(audioChunkBytes);
// ... more chunks ...

// Signal end of audio to finalize transcripts
await session.EndAudioAsync();
```

Reconnection

```csharp
// If the WebSocket disconnects unexpectedly:
await session.ReconnectAsync();
// Resume sending audio — the session continues where it left off
```

New Files

Enums & Constants

| File | Description |
| --- | --- |
| `DeepL/SourceMediaContentType.cs` | String constants for all supported audio formats (`audio/auto`, `audio/ogg;codecs=opus`, PCM variants, etc.) |
| `DeepL/VoiceMessageFormat.cs` | Enum: `Json` / `MessagePack` for WebSocket message encoding |
| `DeepL/SourceLanguageMode.cs` | Enum: `Auto` / `Fixed` for source language handling |
| `DeepL/TargetMediaVoice.cs` | Enum: `Male` / `Female` for synthesized speech voice (closed beta) |

Models (DeepL/Model/)

| File | Description |
| --- | --- |
| `VoiceSessionInfo.cs` | POST response: `StreamingUrl`, `Token`, `SessionId` |
| `TranscriptSegment.cs` | Single text segment with `Text` property |
| `TranscriptUpdate.cs` | Transcript update with `Concluded[]`, `Tentative[]`, optional `Language` |
| `TargetMediaChunk.cs` | Translated audio chunk: `ContentType`, `Headers`, `Data[]`, `Text`, `Language`, `Duration` (closed beta) |
| `VoiceStreamError.cs` | WebSocket error: `Code`, `Reason`, `Message` |

Options & Interfaces

| File | Description |
| --- | --- |
| `DeepL/VoiceSessionOptions.cs` | Session creation options: audio format, languages, formality, glossary, closed beta TTS settings |
| `DeepL/IVoiceSession.cs` | Session interface: events (`SourceTranscriptUpdated`, `TargetTranscriptUpdated`, `TargetMediaChunkReceived`, `ErrorReceived`, `StreamEnded`) + methods (`SendAudioAsync`, `EndAudioAsync`, `ReconnectAsync`) |
| `DeepL/IVoiceManager.cs` | Factory interface: `CreateVoiceSessionAsync(VoiceSessionOptions)` |

Core Implementation

| File | Description |
| --- | --- |
| `DeepL/VoiceSession.cs` | Internal `ClientWebSocket`-based session with background receive loop, JSON message dispatch, and reconnection support |

Tests

| File | Description |
| --- | --- |
| `DeepLTests/VoiceSessionTest.cs` | 11 unit tests for options defaults, enum API values, model deserialization, and client input validation |

Modified Files

| File | Change |
| --- | --- |
| `DeepL/DeepLClient.cs` | Added `IVoiceManager` to class declaration; implemented `CreateVoiceSessionAsync` (POST JSON to `v3/voice/realtime`, then WebSocket connect) |
| `DeepL/DeepL.csproj` | Added `System.Net.WebSockets.Client` v4.3.2 conditional reference for `netstandard2.0` |

Architecture Decisions

  • Event-based API — Multiple concurrent streams (source transcript, target transcript, target media) arrive on the same WebSocket, mapping naturally to C# events. IAsyncEnumerable can be layered on top in a follow-up.
  • JSON only for v1 — MessagePack support deferred (requires additional NuGet dependency).
  • DeepLClient only — Voice API is v3; not added to the legacy Translator class, consistent with other v3 features (multilingual glossaries, style rules).
  • Manual ReconnectAsync() — Automatic reconnection policy deferred to a future iteration.
  • Thread safety — Events fire on a background thread. Consumers handle synchronization context marshaling (standard .NET pattern).
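For consumers that need event callbacks on a specific thread, the standard pattern looks roughly like this (a sketch building on the Usage snippet above; `session` is the `IVoiceSession` returned by `CreateVoiceSessionAsync`):

```csharp
// Sketch: marshal background-thread session events onto a captured
// SynchronizationContext (e.g. a WinForms/WPF UI thread). Capture the
// context on the consumer's thread *before* wiring the handler.
var uiContext = SynchronizationContext.Current;

session.SourceTranscriptUpdated += (s, update) => {
    var text = string.Join("", Array.ConvertAll(update.Concluded, seg => seg.Text));
    if (uiContext != null)
        uiContext.Post(_ => Console.WriteLine($"[Source] {text}"), null); // runs on the captured context
    else
        Console.WriteLine($"[Source] {text}"); // console app: no context to marshal to
};
```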

API Constraints

  • Maximum 5 target languages per session
  • Audio chunk size ≤ 100 KB or 1 second duration
  • Recommended chunk duration: 50–250 ms
  • Max audio stream speed: 2× real-time
  • 30-second inactivity timeout
  • 1-hour max connection duration (reconnect to continue)
  • Tokens are single-use; reuse terminates the session
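The chunking constraints above can be honored with a small helper. A sketch assuming PCM S16LE at 16 kHz mono (32,000 bytes per second of audio; `AudioChunker` is an illustrative name, not part of the SDK):

```csharp
using System;
using System.Collections.Generic;

static class AudioChunker {
  // 16 kHz * 16-bit mono = 32,000 bytes per second of audio.
  private const int BytesPerSecond = 16000 * 2;

  // Split a PCM buffer into fixed-duration chunks (default 100 ms, inside
  // the recommended 50-250 ms window and well under the 100 KB cap).
  public static IEnumerable<byte[]> Chunk(byte[] audio, int chunkMs = 100) {
    var chunkBytes = BytesPerSecond * chunkMs / 1000; // 3,200 bytes at 100 ms
    for (var offset = 0; offset < audio.Length; offset += chunkBytes) {
      var len = Math.Min(chunkBytes, audio.Length - offset);
      var chunk = new byte[len];
      Array.Copy(audio, offset, chunk, 0, len);
      yield return chunk;
    }
  }
}
```

Each chunk can then be passed to `session.SendAudioAsync(chunk)`; pacing sends with a short `Task.Delay` between chunks keeps the stream under the 2× real-time limit.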


Copilot AI left a comment


Pull request overview

Adds an initial DeepL Voice API (v3) integration to the DeepL .NET SDK, providing a session-based WebSocket streaming API for real-time transcription/translation, plus the required option types and models.

Changes:

  • Introduces IVoiceManager / IVoiceSession and a ClientWebSocket-based VoiceSession implementation (send audio, receive transcript/media/error events, manual reconnect).
  • Adds Voice API option types and model DTOs (session info, transcript updates/segments, media chunks, stream errors) plus supporting enums/constants.
  • Extends DeepLClient with CreateVoiceSessionAsync and updates the project/test suite to support and validate the new API surface.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `DeepL/DeepLClient.cs` | Implements `IVoiceManager.CreateVoiceSessionAsync` (session POST + WebSocket connect). |
| `DeepL/DeepL.csproj` | Adds `System.Net.WebSockets.Client` reference for `netstandard2.0`. |
| `DeepL/IVoiceManager.cs` | New interface for creating Voice API sessions. |
| `DeepL/IVoiceSession.cs` | New streaming session interface (events + send/end/reconnect). |
| `DeepL/VoiceSession.cs` | WebSocket session implementation with background receive loop and message dispatch. |
| `DeepL/VoiceSessionOptions.cs` | Session creation options (formats, languages, glossary, formality, beta TTS knobs). |
| `DeepL/SourceMediaContentType.cs` | Constants for supported source audio content types. |
| `DeepL/VoiceMessageFormat.cs` | Enum + API-value mapping for JSON/MessagePack. |
| `DeepL/SourceLanguageMode.cs` | Enum + API-value mapping for auto vs fixed source language. |
| `DeepL/TargetMediaVoice.cs` | Enum + API-value mapping for target TTS voice selection. |
| `DeepL/Model/VoiceSessionInfo.cs` | DTO for session creation response (`streaming_url`, `token`, `session_id`). |
| `DeepL/Model/TranscriptSegment.cs` | DTO for transcript segment (`text`). |
| `DeepL/Model/TranscriptUpdate.cs` | DTO for transcript updates (`concluded`, `tentative`, optional `language`). |
| `DeepL/Model/TargetMediaChunk.cs` | DTO for target media chunks (base64 packets + metadata; closed beta). |
| `DeepL/Model/VoiceStreamError.cs` | DTO for server-sent stream errors (`code`, `reason`, `message`). |
| `DeepLTests/VoiceSessionTest.cs` | Unit/client validation tests for new Voice types and `DeepLClient` input validation. |



```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
```

Copilot AI Mar 31, 2026

`using System.Collections.Generic;` appears unused in this test file. With `TreatWarningsAsErrors` enabled in the test project, this will produce CS8019 and fail the build; please remove the unused using (or use it).

Suggested change:

```csharp
using System.Text.Json;
```
```csharp
public string SourceMediaContentType { get; set; } = DeepL.SourceMediaContentType.Auto;

/// <summary>
/// Message encoding format for WebSocket communication. Defaults to <see cref="VoiceMessageFormat.Json" />.
```

Copilot AI Mar 31, 2026

The XML docs say `MessageFormat` defaults to `VoiceMessageFormat.Json`, but the property's default value is `null` (meaning it won't be sent in the request). Either set the property default to `Json`, or adjust the docs to clarify that `null` means "use API default (currently JSON)".

Suggested change:

```diff
-/// Message encoding format for WebSocket communication. Defaults to <see cref="VoiceMessageFormat.Json" />.
+/// Message encoding format for WebSocket communication. If <c>null</c>, the API default is used
+/// (currently <see cref="VoiceMessageFormat.Json" />).
```
```csharp
/// Controls how the <see cref="SourceLanguage" /> value is used.
/// Defaults to <see cref="DeepL.SourceLanguageMode.Auto" /> if not specified.
/// </summary>
public SourceLanguageMode? SourceLanguageMode { get; set; }
```

Copilot AI Mar 31, 2026

The docs state `SourceLanguageMode` defaults to `Auto`, but the property is nullable and defaults to `null` (i.e., not sent). Please either set a default value (`Auto`) or clarify in the docs that `null` means "use API default (auto)".

Suggested change:

```diff
-public SourceLanguageMode? SourceLanguageMode { get; set; }
+public SourceLanguageMode? SourceLanguageMode { get; set; } = DeepL.SourceLanguageMode.Auto;
```
Comment on lines +45 to +46

```csharp
/// Languages specified here will automatically be added to <see cref="TargetLanguages" /> if not already present.
/// Maximum 5 target media languages per session.
```

Copilot AI Mar 31, 2026

The `TargetMediaLanguages` docs claim languages will be automatically added to `TargetLanguages` if missing, but `CreateVoiceSessionAsync` currently sends `target_languages` and `target_media_languages` independently and does not merge them or enforce the combined max of 5. Either implement the merge + validation (recommended), or update the docs to match the current behavior.

Suggested change:

```diff
-/// Languages specified here will automatically be added to <see cref="TargetLanguages" /> if not already present.
-/// Maximum 5 target media languages per session.
+/// This list is sent independently of <see cref="TargetLanguages" /> and is not merged automatically.
+/// Maximum 5 target media languages per session for this property.
```
Comment on lines +951 to +979

```csharp
if (options.TargetLanguages == null || options.TargetLanguages.Length == 0) {
  throw new ArgumentException("At least one target language must be specified");
}

if (options.TargetLanguages.Length > 5) {
  throw new ArgumentException("Maximum 5 target languages per session");
}

var requestData = new Dictionary<string, object> {
  ["source_media_content_type"] = options.SourceMediaContentType,
  ["target_languages"] = options.TargetLanguages
};

if (options.MessageFormat != null) {
  requestData["message_format"] = options.MessageFormat.Value.ToApiValue();
}

if (options.SourceLanguage != null) {
  requestData["source_language"] = options.SourceLanguage;
}

if (options.SourceLanguageMode != null) {
  requestData["source_language_mode"] = options.SourceLanguageMode.Value.ToApiValue();
}

if (options.TargetMediaLanguages != null) {
  requestData["target_media_languages"] = options.TargetMediaLanguages;
}
```

Copilot AI Mar 31, 2026

`CreateVoiceSessionAsync` validates the `TargetLanguages` count, but does not validate `TargetMediaLanguages` (max 5 per docs) nor enforce that target media languages are included in `target_languages`. This can lead to avoidable API-side errors; consider merging `TargetMediaLanguages` into `TargetLanguages` (deduping) and validating the combined count before sending the request.
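The merge-and-validate the review suggests could look roughly like this (a sketch with an assumed helper name, not code from the PR):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class VoiceOptionHelpers {
  // Fold TargetMediaLanguages into TargetLanguages (deduped, case-insensitive)
  // and enforce the combined 5-language limit before building the request.
  public static string[] MergeTargetLanguages(string[] targetLanguages, string[] targetMediaLanguages) {
    var merged = new List<string>(targetLanguages);
    foreach (var lang in targetMediaLanguages ?? Array.Empty<string>()) {
      if (!merged.Contains(lang, StringComparer.OrdinalIgnoreCase)) {
        merged.Add(lang);
      }
    }
    if (merged.Count > 5) {
      throw new ArgumentException("Maximum 5 target languages per session (including target media languages)");
    }
    return merged.ToArray();
  }
}
```

The merged array would then be sent as `target_languages`, with `target_media_languages` passed through unchanged.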

Comment on lines +108 to +146

```csharp
/// <inheritdoc />
public async Task ReconnectAsync(CancellationToken cancellationToken = default) {
  // Stop current receive loop
  _receiveCts.Cancel();
  if (_receiveTask != null) {
    try {
      await _receiveTask.ConfigureAwait(false);
    } catch (OperationCanceledException) {
      // Expected
    }
  }

  // Close existing WebSocket if still open
  if (_webSocket.State == WebSocketState.Open || _webSocket.State == WebSocketState.CloseReceived) {
    try {
      await _webSocket.CloseAsync(WebSocketCloseStatus.NormalClosure, "Reconnecting", CancellationToken.None)
        .ConfigureAwait(false);
    } catch (WebSocketException) {
      // Ignore close errors during reconnection
    }
  }

  _webSocket.Dispose();

  // Request new token via GET v3/voice/realtime?token=<lastToken>
  var queryParams = new[] { ("token", _lastToken) };
  using var responseMessage = await _httpClient.ApiGetAsync("v3/voice/realtime", cancellationToken, queryParams)
    .ConfigureAwait(false);
  await DeepLHttpClient.CheckStatusCodeAsync(responseMessage).ConfigureAwait(false);
  var sessionInfo = await JsonUtils.DeserializeAsync<VoiceSessionInfo>(responseMessage).ConfigureAwait(false);

  _lastToken = sessionInfo.Token;
  SessionId = sessionInfo.SessionId;

  // Establish new WebSocket connection
  var wsUri = new Uri($"{sessionInfo.StreamingUrl}?token={Uri.EscapeDataString(sessionInfo.Token)}");
  _webSocket = new ClientWebSocket();
  await _webSocket.ConnectAsync(wsUri, cancellationToken).ConfigureAwait(false);
```

Copilot AI Mar 31, 2026

`VoiceSession` mutates and disposes `_webSocket` inside `ReconnectAsync` while `SendAudioAsync`/`EndAudioAsync` and `ReceiveLoopAsync` use `_webSocket` without synchronization. This can race (send on a disposed/old socket, or receive loop reading from a replaced socket). Use a lock/`SemaphoreSlim` to serialize reconnect/send/dispose, or capture the current socket into a local variable under lock and only dispose/swap once no other operations are using it.
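One way to address this race (a sketch, not the PR's code; `_socketLock` is an assumed new field on `VoiceSession`) is to serialize all socket access with a `SemaphoreSlim`:

```csharp
private readonly SemaphoreSlim _socketLock = new SemaphoreSlim(1, 1);

// Serialize every operation that touches _webSocket so ReconnectAsync
// cannot swap or dispose the socket while a send is in flight.
public async Task SendAudioAsync(byte[] chunk, CancellationToken cancellationToken = default) {
  await _socketLock.WaitAsync(cancellationToken).ConfigureAwait(false);
  try {
    await _webSocket.SendAsync(new ArraySegment<byte>(chunk), WebSocketMessageType.Binary,
      endOfMessage: true, cancellationToken).ConfigureAwait(false);
  } finally {
    _socketLock.Release();
  }
}
// ReconnectAsync would take the same semaphore around the close/dispose/swap
// sequence, so no sender ever observes a half-replaced socket.
```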

```csharp
} catch (OperationCanceledException) {
  // Normal cancellation
} catch (WebSocketException) {
  // Connection lost — consumer should call ReconnectAsync
```

Copilot AI Mar 31, 2026

`ReceiveLoopAsync` only catches `OperationCanceledException` and `WebSocketException`. Any other exception (e.g., `ObjectDisposedException` during shutdown, unexpected parsing failures) will fault the background task; `ReconnectAsync` awaits `_receiveTask` and will then throw, preventing reconnection. Catch a broader `Exception` here and either surface it via an event/callback or ensure it doesn't fault the task.

Suggested change:

```diff
 // Connection lost — consumer should call ReconnectAsync
+} catch (Exception) {
+  // Unexpected exception — swallow to prevent the background task from faulting
```