
QUIC REALITY DATAGRAM Roadmap

Last updated: 2026-05-10

Purpose

This file tracks the DATAGRAM-specific performance plan for UDP/L3 style payloads over QUIC REALITY. It is separate from the main QUIC REALITY roadmap because DATAGRAM forwarding has different tradeoffs from STREAM forwarding:

  • DATAGRAM payloads are message-oriented and unreliable.
  • They are carried in ack-eliciting QUIC packets, but the payload itself must not be retransmitted by the QUIC layer.
  • Throughput depends mostly on packet rate, syscall count, packet assembly, crypto cost, task handoff cost, and connection striping.
  • For future L3 tunnel mode, IP packets should map to QUIC DATAGRAM frames, not STREAM frames.

Current Baseline

Test chain:

```text
iperf3 UDP client
  -> local RAW UDP ingress
  -> QUIC REALITY 1-RTT DATAGRAM
  -> RAW UDP upstream
  -> local iperf3 server on 127.0.0.1:5201
```

Current test command:

```bash
RUST_LOG=nexus_agent::gateway=warn,nexus_agent::gateway::quic=warn,nexus_agent::gateway::udp=warn \
NEXUS_TEST_IPERF3_UDP_BITRATE=1G \
NEXUS_TEST_IPERF3_SECONDS=3 \
NEXUS_TEST_IPERF3_UDP_LEN=1150 \
cargo test --release -p nexus-agent gateway::tests::tcp_udp_udp_over_quic_reality_iperf3_5201_smoke -- --ignored --nocapture
```

Measured local results after the current optimization pass:

| Target | UDP payload | Receiver | Loss |
| --- | --- | --- | --- |
| 1G | 1150B | about 995 Mbit/s | about 0.34% |
| 1.5G | 1150B | about 1.33 Gbit/s | about 5.6% |

Earlier confirmed baseline:

| Target | UDP payload | Receiver | Notes |
| --- | --- | --- | --- |
| 1G | 1000B | about 181 Mbit/s | before hot-path fixes |
| 1G | 1150B | about 807 Mbit/s | before ACK-history fix |
| 1G | 1150B | about 995-1000 Mbit/s | current stable target |

Direct UDP baseline on the same machine previously reached roughly line rate at 1G, so the remaining gap is in the QUIC REALITY DATAGRAM data path, not iperf3 itself.

Current Implemented Optimizations

Keep these unless a later benchmark proves a regression:

  • Cached QUIC packet AEAD and header-protection ciphers in quic-core.
  • Short-packet open avoids copying the full packet; it now builds short AAD on the stack and decrypts the original ciphertext slice.
  • DATAGRAM frame encoding writes directly into the QUIC payload buffer instead of allocating a temporary Vec per UDP payload.
  • Server/client DATAGRAM-only ACKs are decimated separately from STREAM ACKs.
  • Application ACK history is pruned with a sliding window to avoid repeatedly scanning hundreds of thousands of received packet numbers during high-rate DATAGRAM tests.
  • UDP ingress avoids one unnecessary clone in the QUIC REALITY path.
  • Server-side QUIC socket handling drains several ready packets per receive wake using safe try_recv_from batching.
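
The safe drain in the last bullet is essentially the following shape. This is a minimal sketch using tokio's UdpSocket; the batch size and buffer size are illustrative, not the agent's actual values.

```rust
use std::io;
use std::net::SocketAddr;

use tokio::net::UdpSocket;

/// Drain up to `max_batch` queued datagrams after a single readiness wake,
/// instead of paying one task wake per packet.
async fn drain_udp(
    socket: &UdpSocket,
    max_batch: usize,
) -> io::Result<Vec<(Vec<u8>, SocketAddr)>> {
    let mut out = Vec::with_capacity(max_batch);
    socket.readable().await?;
    let mut buf = [0u8; 2048];
    for _ in 0..max_batch {
        match socket.try_recv_from(&mut buf) {
            Ok((len, peer)) => out.push((buf[..len].to_vec(), peer)),
            // Socket is drained for now; stop and wait for the next wake.
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => break,
            Err(e) => return Err(e),
        }
    }
    Ok(out)
}
```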

Rejected experiments:

  • Fast direct upstream try_send from the QUIC packet handler reduced receiver throughput and increased loss. Do not reintroduce without a bounded queue and explicit backpressure model.
  • A first recvmmsg prototype did not produce stable gains in this data path. Revisit only after packet/accounting instrumentation is in place.
  • UDP ingress hot-session direct channel send from the receive loop reduced 1.5G receiver throughput in local tests. It slowed the receive loop more than it helped session lookup. Keep the batch session-table fast path instead.

Bottleneck Analysis

The current single-session ceiling is around 1.3-1.4 Gbit/s for 1150B payloads. At this rate the system is handling roughly 140k-160k QUIC DATAGRAM packets per second. The likely bottleneck stack is:

  1. Per-packet crypto and header protection.
  2. Per-packet QUIC frame parsing and session-id decode/copy.
  3. Per-packet task handoff between UDP ingress, QUIC client session, QUIC listener, and upstream UDP relay.
  4. Syscall cost for many MTU-sized UDP sends and receives.
  5. Single QUIC connection/session serialization on one packet-number space.
  6. ACK bookkeeping and packet-number set maintenance under high packet rate.
  7. Kernel socket queue drops when user-space processing cannot keep up.

The current 1G result is good enough for basic UDP-over-QUIC REALITY. Reaching 3 Gbit/s requires architectural work, not small constant tweaks.

Full frontend/backend/agent L3 point-to-point tunnel planning lives in l3ptp-reality.md. Keep this file focused on DATAGRAM packet-rate, batching, ACK, socket, and QUIC data-path work. Long-term fully kernel-resident QUIC REALITY DATAGRAM DCO planning lives in quic-reality-kernel-dco.md.

Goals

Short-term:

  • Keep 1G / 1150B stable with less than 1% loss on local release tests.
  • Raise single-session receiver throughput above 1.5G without increasing loss.
  • Add enough counters to explain drops instead of relying only on iperf output.

Mid-term:

  • Reach 2G receiver throughput for one UDP session on loopback/LAN profile.
  • Support multiple UDP sessions over one QUIC REALITY connection without each session paying a full handshake.
  • Make L3-tunnel DATAGRAM forwarding explicit in the transport model.

Long-term:

  • Reach 3G aggregate DATAGRAM receiver throughput through connection striping, kernel batching, or both.
  • Keep WAN mode MTU-safe: no dependence on large fragmented UDP datagrams.
  • Keep DATAGRAM semantics unreliable; do not accidentally convert UDP/L3 payloads into STREAM-like reliable delivery.

Non-Goals

  • Do not retransmit QUIC DATAGRAM payloads. UDP/L3 reliability belongs above the DATAGRAM layer if needed.
  • Do not optimize by increasing QUIC packet size beyond safe path MTU for WAN mode.
  • Do not keep unsafe syscall code unless it demonstrates a reproducible gain and has a narrow, documented safety boundary.

Phase 0: Measurement Harness

Status: in progress.

Deliverables:

  • Add a DATAGRAM benchmark helper that records:
    • iperf sender bitrate.
    • iperf receiver bitrate.
    • loss percentage.
    • datagram length.
    • test duration.
    • number of QUIC packets sent/received on client and server.
    • number of UDP payloads forwarded to upstream.
    • application ACK packets sent in each direction.
  • Add hot-path counters in the agent (see the counter sketch after this list):
    • UDP ingress datagrams.
    • QUIC DATAGRAM frames encoded/decoded.
    • DATAGRAM-only ACK flush count.
    • socket send failures / would-block count.
    • upstream UDP send/receive count.
    • per-session pending queue depth.
  • The current implementation already exposes an atomic in-process counter snapshot via gateway::quic::quic_datagram_metrics_snapshot(), reset by gateway::quic::reset_quic_datagram_metrics(). The ignored iperf3 smoke test prints one QUIC_REALITY_DATAGRAM_METRICS line with bitrate, duration, payload length, QUIC DATAGRAM frame/packet counts, upstream UDP counts, ACK counts, and socket send error counters.
  • Pending counter gap: per-session pending queue depth.
  • Add a repeatable release test matrix:
    • payload sizes: 1000, 1150, 1200-budget-clamped.
    • target rates: 1G, 1.2G, 1.5G, 2G.
    • durations: 3s quick and 15s stability.
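
For the hot-path counters above, the intended mechanism is plain atomic counters plus a cheap point-in-time snapshot. A minimal sketch of that pattern follows; the field names are illustrative, and the real snapshot is the one returned by gateway::quic::quic_datagram_metrics_snapshot().

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Process-wide hot-path counters; Relaxed ordering is enough for
/// monotonic event counts that are only read for reporting.
#[derive(Default)]
pub struct DatagramMetrics {
    pub udp_ingress_datagrams: AtomicU64,
    pub datagram_frames_encoded: AtomicU64,
    pub ack_only_flushes: AtomicU64,
    pub socket_would_block: AtomicU64,
}

/// Plain-value copy taken at one point in time, cheap enough to print per test run.
#[derive(Debug, Clone, Copy)]
pub struct DatagramMetricsSnapshot {
    pub udp_ingress_datagrams: u64,
    pub datagram_frames_encoded: u64,
    pub ack_only_flushes: u64,
    pub socket_would_block: u64,
}

impl DatagramMetrics {
    pub fn snapshot(&self) -> DatagramMetricsSnapshot {
        DatagramMetricsSnapshot {
            udp_ingress_datagrams: self.udp_ingress_datagrams.load(Ordering::Relaxed),
            datagram_frames_encoded: self.datagram_frames_encoded.load(Ordering::Relaxed),
            ack_only_flushes: self.ack_only_flushes.load(Ordering::Relaxed),
            socket_would_block: self.socket_would_block.load(Ordering::Relaxed),
        }
    }
}
```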

Acceptance:

  • One command prints a compact table of throughput/loss/counters.
  • Results can distinguish user-space drops from upstream/socket drops.
  • Baseline is recorded before each optimization phase.

Rollback condition:

  • None. Measurement code should be low risk and test-only or counter-only.

Phase 1: DATAGRAM Hot Path Cleanup

Status: partially complete; continue.

Deliverables:

  • Replace decode_quic_udp_payload returning (session_id, Vec<u8>) with a borrowed decode result for the QUIC receive path (see the sketch after this list).
  • Push owned Vec allocation to the exact boundary that requires ownership, not at frame parse time.
  • Use the borrowed application-frame parser on the client application packet path so DATAGRAM frames do not allocate in quic-core::parse_frames before being copied to the final owner.
  • Evaluate bytes::Bytes / BytesMut for UDP payload handoff to reduce copies through channels.
  • Pre-allocate reusable packet buffers for DATAGRAM frame assembly on both client and server.
  • Keep DATAGRAM frame batching MTU-aware: coalesce only small UDP payloads that fit the packet budget.
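
One possible shape for the borrowed decode result from the first deliverable above, as a sketch only; the session-id type and field names are assumptions, not the current decode_quic_udp_payload contract.

```rust
/// Borrowed view of one decoded UDP-over-DATAGRAM payload.
/// The payload slice points into the received QUIC packet buffer,
/// so nothing is copied at frame-parse time.
pub struct UdpPayloadRef<'a> {
    pub session_id: u64,
    pub payload: &'a [u8],
}

impl<'a> UdpPayloadRef<'a> {
    /// Copy only at the boundary that must own the bytes,
    /// for example just before handing off to another task.
    pub fn to_owned_payload(&self) -> (u64, Vec<u8>) {
        (self.session_id, self.payload.to_vec())
    }
}
```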

Acceptance:

  • 1G / 1150B remains stable below 1% loss.
  • 1.5G / 1150B improves on the current receiver rate of about 1.33G, or reduces loss at the same receiver rate.
  • cargo test -p nexus-agent quic::tests --lib and cargo test -p quic-core pass.

Rollback condition:

  • Any buffer reuse that introduces lifetime complexity or data corruption is reverted unless it gives a clear measured win.

Phase 2: ACK and Packet-Number Accounting

Status: started.

Deliverables:

  • Keep DATAGRAM-only ACK decimation separate from STREAM ACK policy.
  • Replace the BTreeSet-based received-packet tracking in the high-rate application path with a compact range set or ring window (see the sketch after this list).
  • Add explicit upper bound for ACK range count in DATAGRAM-only ACK frames.
  • Avoid building large ACK frames for old DATAGRAM packet numbers that no longer affect recovery.
  • Make ACK delay configurable per profile:
    • low-latency UDP profile.
    • high-throughput L3 tunnel profile.
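
A minimal sketch of the compact range tracker mentioned above. The common in-order case stays O(1); packet numbers reordered below the newest range are ignored here, which a real implementation would handle, or at least count, explicitly.

```rust
/// Received packet numbers kept as coalesced inclusive ranges, newest last.
/// Old ranges are evicted so memory and ACK-frame size stay bounded.
pub struct RecvRanges {
    ranges: Vec<(u64, u64)>, // (start, end), sorted, non-overlapping
    max_ranges: usize,
}

impl RecvRanges {
    pub fn new(max_ranges: usize) -> Self {
        Self { ranges: Vec::new(), max_ranges }
    }

    /// Record a received packet number; the in-order case only touches the last range.
    pub fn insert(&mut self, pn: u64) {
        if let Some(last) = self.ranges.last_mut() {
            if pn == last.1 + 1 {
                last.1 = pn; // extend the newest range
                return;
            }
            if pn <= last.1 {
                // Duplicate, or reordered below the newest range; this sketch
                // simply ignores it rather than merging into earlier ranges.
                return;
            }
        }
        self.ranges.push((pn, pn));
        if self.ranges.len() > self.max_ranges {
            // Forget the oldest range: old DATAGRAM numbers no longer affect recovery.
            self.ranges.remove(0);
        }
    }

    /// Ranges to advertise in a DATAGRAM-only ACK, newest first.
    pub fn ack_ranges(&self) -> impl Iterator<Item = (u64, u64)> + '_ {
        self.ranges.iter().rev().copied()
    }
}
```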

Acceptance:

  • ACK CPU and ACK packet rate stay bounded during 2G target tests.
  • No regression in STREAM recovery tests.
  • 1G / 1150B remains stable; 1.5G / 1150B loss decreases.

Rollback condition:

  • If reduced ACK information causes packet-number recovery bugs for STREAM, gate the optimization behind DATAGRAM-only frame classification.

Phase 3: Dedicated DATAGRAM Session Actor

Status: planned.

Problem:

The current path still has general-purpose forwarding structure inherited from TCP/STREAM work. DATAGRAM needs a packet-rate-oriented actor with bounded queues and batch flush semantics.

Deliverables:

  • Add a QuicDatagramRelay actor per peer or per upstream profile.
  • It owns:
    • one QUIC client session or connection pool.
    • UDP session-id map.
    • inbound batch queue.
    • outbound batch queue.
    • periodic ACK flush and maintenance timer.
  • Use bounded channels or ring buffers instead of unbounded per-packet channels (see the enqueue sketch after this list).
  • Add clear backpressure policy:
    • drop newest.
    • drop oldest.
    • per-session quota.
    • optional priority for control packets.
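
A minimal sketch of the drop-newest variant of that backpressure policy, assuming a bounded tokio mpsc queue per session; drop-oldest would instead need a structure the producer can pop from, such as a mutex-guarded ring buffer.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

use tokio::sync::mpsc;

/// Drop-newest enqueue for the per-session inbound queue: when the bounded
/// queue is full the incoming payload is counted and discarded so the hot
/// receive path never blocks on a slow consumer.
pub fn enqueue_drop_newest(tx: &mpsc::Sender<Vec<u8>>, payload: Vec<u8>, dropped: &AtomicU64) {
    match tx.try_send(payload) {
        Ok(()) => {}
        Err(mpsc::error::TrySendError::Full(_)) => {
            // Intentional, counted drop rather than a hidden socket error.
            dropped.fetch_add(1, Ordering::Relaxed);
        }
        Err(mpsc::error::TrySendError::Closed(_)) => {
            // Session is shutting down; nothing useful to do with the payload.
        }
    }
}
```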

Acceptance:

  • Queue depth counters stay bounded at 1.5G target.
  • Packet drops are intentional and counted, not hidden in socket errors.
  • 1G remains stable and 1.5G improves in loss or throughput.

Rollback condition:

  • If actor separation adds extra task hops without batching benefit, collapse it back into the current loop and keep only the queue/counter pieces.

Phase 4: Multi-Session and Connection Striping

Status: planned.

Problem:

A single QUIC connection has one packet-number space and one serialized session state. For UDP/L3 throughput, aggregate capacity can scale better by striping across multiple QUIC REALITY connections.

Deliverables:

  • Add a QUIC REALITY DATAGRAM pool with N connections per peer.
  • Flow-hash UDP sessions to a stable connection (see the sketch after this list):
    • 5-tuple for UDP proxy mode where available.
    • session-id for current UDP session proxy.
    • IP flow hash for future L3 tunnel mode.
  • Start with configurable datagram_connection_count.
  • Keep each UDP flow ordered within its assigned connection.
  • Add config surface in agent/backend/frontend for a high-throughput DATAGRAM profile.
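
A minimal sketch of the flow-hash step, assuming the session-id (or any hashable flow key) is used directly. std's DefaultHasher is only stable within one process, which is all per-flow connection affinity needs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a flow key to a connection index so each UDP session (or IP flow)
/// always rides the same QUIC REALITY connection in the pool.
fn stripe_index<K: Hash>(flow_key: &K, datagram_connection_count: usize) -> usize {
    debug_assert!(datagram_connection_count > 0);
    let mut hasher = DefaultHasher::new();
    flow_key.hash(&mut hasher);
    (hasher.finish() % datagram_connection_count as u64) as usize
}
```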

Acceptance:

  • Aggregate local receiver throughput reaches 2G+ with multiple UDP sessions.
  • Per-flow ordering is preserved within a connection.
  • A single UDP session still works without striping.

Rollback condition:

  • If striping breaks NAT/session semantics, keep it opt-in for L3 tunnel mode first.

Phase 5: Kernel Send Batching and UDP GSO

Status: planned.

Problem:

At MTU-sized DATAGRAMs, 3G throughput requires hundreds of thousands of packets per second. sendmmsg helps syscall count, but UDP GSO is the larger lever for LAN/high-BDP paths.

Deliverables:

  • Keep current sendmmsg fallback path.
  • Add Linux UDP GSO using UDP_SEGMENT ancillary data (see the sketch after this list).
  • Build a packet scheduler that groups encrypted QUIC short packets with the same segment size where possible.
  • Runtime detect GSO support; fallback to sendmmsg if unsupported.
  • Add metrics:
    • GSO packets sent.
    • segments per GSO send.
    • fallback sends.
    • GSO send errors.
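
A minimal Linux-only sketch of enabling GSO as a socket-level option rather than per-send ancillary data, assuming the libc crate exposes SOL_UDP and UDP_SEGMENT on the target. With this set, the kernel splits each large send into segment_size-byte UDP datagrams on the wire. Per-send granularity via a UDP_SEGMENT cmsg on sendmsg is what the packet-scheduler deliverable refers to and is omitted here for brevity.

```rust
use std::io;
use std::net::UdpSocket;
use std::os::unix::io::AsRawFd;

/// Enable UDP GSO for every send on this socket.
fn enable_udp_gso(socket: &UdpSocket, segment_size: u16) -> io::Result<()> {
    let gso: libc::c_int = segment_size as libc::c_int;
    // SAFETY: the fd is valid for the lifetime of `socket`, and we pass a
    // properly sized c_int for this option.
    let rc = unsafe {
        libc::setsockopt(
            socket.as_raw_fd(),
            libc::SOL_UDP,
            libc::UDP_SEGMENT,
            &gso as *const libc::c_int as *const libc::c_void,
            std::mem::size_of::<libc::c_int>() as libc::socklen_t,
        )
    };
    if rc != 0 {
        // Caller falls back to sendmmsg / plain sends when GSO is unavailable.
        return Err(io::Error::last_os_error());
    }
    Ok(())
}
```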

Acceptance:

  • 2G target improves without raising loss.
  • 3G aggregate becomes reachable on local/LAN profile.
  • WAN profile still respects MTU and can disable GSO.

Rollback condition:

  • Any GSO implementation that depends on oversized IP fragmentation is rejected. GSO must segment in kernel into valid MTU-sized UDP datagrams.

Phase 6: Receive Batching Revisit

Status: deferred.

Notes:

  • A first recvmmsg experiment did not produce stable gains and added unsafe complexity.
  • Revisit only after Phase 0 counters show receive syscall cost as a top bottleneck.

Deliverables if revisited:

  • Isolate Linux receive batching in a small module with unit tests for sockaddr conversion.
  • Avoid raw pointers stored across async .await unless the type has a documented and reviewed Send boundary.
  • Compare:
    • Tokio recv_from + try_recv_from drain.
    • recvmmsg after readiness.
    • dedicated blocking thread with recvmmsg.

Acceptance:

  • Must show repeatable improvement over safe drain in 1.5G and 2G tests.
  • Must not reduce 1G stability.

Rollback condition:

  • Any unstable or neutral result should keep the safe drain implementation.

Phase 7: L3 Tunnel DATAGRAM Mode

Status: planned.

Deliverables:

  • Define the L3 tunnel payload mapping (see the classification sketch after this list):
    • one IP packet per QUIC DATAGRAM when it fits.
    • MTU clamp and ICMP/PMTUD strategy.
    • explicit drop policy for oversized packets.
  • Add route/device integration separately from the QUIC data path.
  • Add per-flow connection striping for L3 mode.
  • Add counters for:
    • IP packets in/out.
    • oversized drops.
    • flow-hash distribution.
    • per-connection throughput.
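
A minimal sketch of the mapping and drop decision for one ingress IP packet; the type and function names are illustrative.

```rust
/// Decision for one ingress IP packet in L3 tunnel mode.
pub enum L3Forward<'a> {
    /// Packet fits: send it as exactly one QUIC DATAGRAM payload.
    Datagram(&'a [u8]),
    /// Packet exceeds the DATAGRAM budget: drop it and count the drop.
    DropOversized { len: usize },
}

/// Map an IP packet to a DATAGRAM or an explicit drop. No fragmentation
/// and no buffering: reliability and PMTUD stay above this layer.
pub fn classify_ip_packet(packet: &[u8], max_datagram_payload: usize) -> L3Forward<'_> {
    if packet.len() <= max_datagram_payload {
        L3Forward::Datagram(packet)
    } else {
        L3Forward::DropOversized { len: packet.len() }
    }
}
```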

Acceptance:

  • L3 mode uses DATAGRAM, not STREAM.
  • IP packet forwarding does not depend on reliable QUIC retransmission.
  • Per-flow ordering remains stable.

Phase 8: Production Safety

Status: planned.

Deliverables:

  • Add profile-level limits (see the sketch after this list):
    • max datagram payload.
    • max queued packets per session.
    • max queued bytes per peer.
    • max DATAGRAM connections per peer.
  • Add overload behavior:
    • counted packet drops.
    • backpressure logs at rate-limited intervals.
    • no unbounded memory growth.
  • Add observability:
    • Prometheus counters for DATAGRAM packets, bytes, drops, ACKs, queue depth, and connection striping distribution.
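
A minimal sketch of the profile-level limit surface; the default values are placeholders, not tuned recommendations.

```rust
/// Per-profile DATAGRAM safety limits enforced by the relay.
#[derive(Debug, Clone)]
pub struct DatagramLimits {
    pub max_datagram_payload: usize,
    pub max_queued_packets_per_session: usize,
    pub max_queued_bytes_per_peer: usize,
    pub max_datagram_connections_per_peer: usize,
}

impl Default for DatagramLimits {
    fn default() -> Self {
        Self {
            // Placeholder values for illustration only.
            max_datagram_payload: 1200,
            max_queued_packets_per_session: 1024,
            max_queued_bytes_per_peer: 8 * 1024 * 1024,
            max_datagram_connections_per_peer: 4,
        }
    }
}
```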

Acceptance:

  • Sustained 15s high-rate tests do not grow memory without bound.
  • Overload degrades by controlled packet loss rather than task stalls.

Immediate Next Actions

  1. Add Phase 0 counters and a compact benchmark parser.
  2. Convert DATAGRAM decode to borrowed payloads and reduce ownership copies.
  3. Replace high-rate application received-packet tracking with a compact range window specialized for ACK generation.
  4. Build an opt-in multi-connection DATAGRAM pool for aggregate throughput.
  5. Start UDP GSO only after the benchmark table proves the single-connection software path has reached its practical ceiling.

Current Engineering Judgment

The next best optimization is not recvmmsg. The next best step is measurement plus ownership/copy cleanup in the DATAGRAM hot path, followed by connection striping and UDP GSO. The current single-session path is already near 1G stable; 3G will require parallelism or kernel segmentation.
