QUIC REALITY DATAGRAM Roadmap
Last updated: 2026-05-10
Purpose
This file tracks the DATAGRAM-specific performance plan for UDP/L3 style payloads over QUIC REALITY. It is separate from the main QUIC REALITY roadmap because DATAGRAM forwarding has different tradeoffs from STREAM forwarding:
- DATAGRAM payloads are message-oriented and unreliable.
- They are still QUIC ack-eliciting packets, but the payload itself must not be retransmitted by the QUIC layer.
- Throughput depends mostly on packet rate, syscall count, packet assembly, crypto cost, task handoff cost, and connection striping.
- For future L3 tunnel mode, IP packets should map to QUIC DATAGRAM frames, not STREAM frames.
Current Baseline
Test chain:
iperf3 UDP client
-> local RAW UDP ingress
-> QUIC REALITY 1-RTT DATAGRAM
-> RAW UDP upstream
-> local iperf3 server on 127.0.0.1:5201

Current test command:
RUST_LOG=nexus_agent::gateway=warn,nexus_agent::gateway::quic=warn,nexus_agent::gateway::udp=warn \
NEXUS_TEST_IPERF3_UDP_BITRATE=1G \
NEXUS_TEST_IPERF3_SECONDS=3 \
NEXUS_TEST_IPERF3_UDP_LEN=1150 \
cargo test --release -p nexus-agent gateway::tests::tcp_udp_udp_over_quic_reality_iperf3_5201_smoke -- --ignored --nocapture

Measured local results after the current optimization pass:
| Target | UDP payload | Receiver | Loss |
|---|---|---|---|
| 1G | 1150B | about 995 Mbit/s | about 0.34% |
| 1.5G | 1150B | about 1.33 Gbit/s | about 5.6% |
Earlier confirmed baseline:
| Target | UDP payload | Receiver | Notes |
|---|---|---|---|
| 1G | 1000B | about 181 Mbit/s | before hot-path fixes |
| 1G | 1150B | about 807 Mbit/s | before ACK-history fix |
| 1G | 1150B | about 995-1000 Mbit/s | current stable target |
Direct UDP baseline on the same machine previously reached roughly line rate at 1G, so the remaining gap is in the QUIC REALITY DATAGRAM data path, not iperf3 itself.
Current Implemented Optimizations
Keep these unless a later benchmark proves a regression:
- Cached QUIC packet AEAD and header-protection ciphers in `quic-core`.
- Short-packet open avoids copying the full packet; it now builds short AAD on the stack and decrypts the original ciphertext slice.
- DATAGRAM frame encoding writes directly into the QUIC payload buffer instead of allocating a temporary `Vec` per UDP payload.
- Server/client DATAGRAM-only ACKs are decimated separately from STREAM ACKs.
- Application ACK history is pruned with a sliding window to avoid repeatedly scanning hundreds of thousands of received packet numbers during high-rate DATAGRAM tests.
- UDP ingress avoids one unnecessary clone in the QUIC REALITY path.
- Server-side QUIC socket handling drains several ready packets per receive wake using safe `try_recv_from` batching.
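The in-place DATAGRAM frame encoding mentioned above can be sketched as follows. This is an illustrative reconstruction, not the actual quic-core API: the helper names, buffer handling, and budget convention are assumptions, while the frame layout follows RFC 9221 (type `0x31` with an explicit length) and RFC 9000 varints.

```rust
// Sketch: encode a QUIC DATAGRAM frame (RFC 9221, type 0x31 with a
// length field) directly into an existing packet buffer, avoiding a
// temporary Vec per UDP payload. Names and budget handling are
// illustrative, not the real quic-core implementation.

/// Append a QUIC variable-length integer (RFC 9000 §16) to `buf`.
fn put_varint(buf: &mut Vec<u8>, v: u64) {
    if v < 1 << 6 {
        buf.push(v as u8);
    } else if v < 1 << 14 {
        buf.extend_from_slice(&((v as u16 | 0x4000).to_be_bytes()));
    } else if v < 1 << 30 {
        buf.extend_from_slice(&((v as u32 | 0x8000_0000).to_be_bytes()));
    } else {
        buf.extend_from_slice(&((v | 0xC000_0000_0000_0000).to_be_bytes()));
    }
}

/// Encoded size of a varint for `v`.
fn varint_len(v: u64) -> usize {
    match v {
        0..=63 => 1,
        64..=16_383 => 2,
        16_384..=1_073_741_823 => 4,
        _ => 8,
    }
}

/// Append a DATAGRAM frame carrying `payload` if it fits the remaining
/// packet `budget`. Returns true when the frame was written; on false
/// the caller flushes the packet and retries.
fn encode_datagram_frame(buf: &mut Vec<u8>, payload: &[u8], budget: usize) -> bool {
    let need = 1 + varint_len(payload.len() as u64) + payload.len();
    if buf.len() + need > budget {
        return false;
    }
    buf.push(0x31); // DATAGRAM frame with explicit length field
    put_varint(buf, payload.len() as u64);
    buf.extend_from_slice(payload);
    true
}
```

The key property is that the payload is copied exactly once, into its final position in the packet buffer.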
Rejected experiments:
- Fast direct upstream `try_send` from the QUIC packet handler reduced receiver throughput and increased loss. Do not reintroduce without a bounded queue and an explicit backpressure model.
- A first `recvmmsg` prototype did not produce stable gains in this data path. Revisit only after packet/accounting instrumentation is in place.
- UDP ingress hot-session direct channel send from the receive loop reduced 1.5G receiver throughput in local tests. It slowed the receive loop more than it helped session lookup. Keep the batch session-table fast path instead.
Bottleneck Analysis
The current single-session ceiling is around 1.3-1.4 Gbit/s for 1150B payloads. At this rate the system is handling roughly 140k-160k QUIC DATAGRAM packets per second. The likely bottleneck stack is:
- Per-packet crypto and header protection.
- Per-packet QUIC frame parsing and session-id decode/copy.
- Per-packet task handoff between UDP ingress, QUIC client session, QUIC listener, and upstream UDP relay.
- Syscall cost for many MTU-sized UDP sends and receives.
- Single QUIC connection/session serialization on one packet-number space.
- ACK bookkeeping and packet-number set maintenance under high packet rate.
- Kernel socket queue drops when user-space processing cannot keep up.
The current 1G result is good enough for basic UDP-over-QUIC REALITY. Reaching 3 Gbit/s requires architectural work, not small constant tweaks.
Full frontend/backend/agent L3 point-to-point tunnel planning lives in l3ptp-reality.md. Keep this file focused on DATAGRAM packet-rate, batching, ACK, socket, and QUIC data-path work. Long-term fully kernel-resident QUIC REALITY DATAGRAM DCO planning lives in quic-reality-kernel-dco.md.
Goals
Short-term:
- Keep 1G / 1150B stable with less than 1% loss on local release tests.
- Raise single-session receiver throughput above 1.5G without increasing loss.
- Add enough counters to explain drops instead of relying only on iperf output.
Mid-term:
- Reach 2G receiver throughput for one UDP session on loopback/LAN profile.
- Support multiple UDP sessions over one QUIC REALITY connection without each session paying a full handshake.
- Make L3-tunnel DATAGRAM forwarding explicit in the transport model.
Long-term:
- Reach 3G aggregate DATAGRAM receiver throughput through connection striping, kernel batching, or both.
- Keep WAN mode MTU-safe: no dependence on large fragmented UDP datagrams.
- Keep DATAGRAM semantics unreliable; do not accidentally convert UDP/L3 payloads into STREAM-like reliable delivery.
Non-Goals
- Do not retransmit QUIC DATAGRAM payloads. UDP/L3 reliability belongs above the DATAGRAM layer if needed.
- Do not optimize by increasing QUIC packet size beyond safe path MTU for WAN mode.
- Do not keep unsafe syscall code unless it demonstrates a reproducible gain and has a narrow, documented safety boundary.
Phase 0: Measurement Harness
Status: in progress.
Deliverables:
- Add a DATAGRAM benchmark helper that records:
- iperf sender bitrate.
- iperf receiver bitrate.
- loss percentage.
- datagram length.
- test duration.
- number of QUIC packets sent/received on client and server.
- number of UDP payloads forwarded to upstream.
- application ACK packets sent in each direction.
- Add hot-path counters in the agent:
- UDP ingress datagrams.
- QUIC DATAGRAM frames encoded/decoded.
- DATAGRAM-only ACK flush count.
- socket send failures / would-block count.
- upstream UDP send/receive count.
- per-session pending queue depth.
- Current implementation has an atomic in-process counter snapshot exposed by `gateway::quic::quic_datagram_metrics_snapshot()` and reset by `gateway::quic::reset_quic_datagram_metrics()`. The ignored iperf3 smoke test prints one `QUIC_REALITY_DATAGRAM_METRICS` line with bitrate, duration, payload length, QUIC DATAGRAM frame/packet counts, upstream UDP counts, ACK counts, and socket send error counters.
- Pending counter gap: per-session pending queue depth.
- Add a repeatable release test matrix:
- payload sizes: 1000, 1150, 1200-budget-clamped.
- target rates: 1G, 1.2G, 1.5G, 2G.
- durations: 3s quick and 15s stability.
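The snapshot/reset counter pattern described in the deliverables can be sketched like this. The struct and field names are hypothetical, not the real `gateway::quic` metrics items; the point is lock-free atomic increments on the hot path with a cheap consistent-enough snapshot for the harness.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counter set mirroring the snapshot/reset pattern above.
/// Field names are illustrative, not the real gateway::quic metrics.
#[derive(Default)]
pub struct DatagramMetrics {
    pub frames_encoded: AtomicU64,
    pub frames_decoded: AtomicU64,
    pub ack_flushes: AtomicU64,
    pub send_would_block: AtomicU64,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MetricsSnapshot {
    pub frames_encoded: u64,
    pub frames_decoded: u64,
    pub ack_flushes: u64,
    pub send_would_block: u64,
}

impl DatagramMetrics {
    /// Relaxed ordering is enough: these are statistics, not synchronization.
    pub fn snapshot(&self) -> MetricsSnapshot {
        MetricsSnapshot {
            frames_encoded: self.frames_encoded.load(Ordering::Relaxed),
            frames_decoded: self.frames_decoded.load(Ordering::Relaxed),
            ack_flushes: self.ack_flushes.load(Ordering::Relaxed),
            send_would_block: self.send_would_block.load(Ordering::Relaxed),
        }
    }

    /// Zero every counter before a benchmark run.
    pub fn reset(&self) {
        for c in [
            &self.frames_encoded,
            &self.frames_decoded,
            &self.ack_flushes,
            &self.send_would_block,
        ] {
            c.store(0, Ordering::Relaxed);
        }
    }
}
```

Hot-path call sites then pay one `fetch_add(1, Ordering::Relaxed)` per event, which is cheap enough to leave enabled in release builds.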
Acceptance:
- One command prints a compact table of throughput/loss/counters.
- Results can distinguish user-space drops from upstream/socket drops.
- Baseline is recorded before each optimization phase.
Rollback condition:
- None. Measurement code should be low risk and test-only or counter-only.
Phase 1: DATAGRAM Hot Path Cleanup
Status: partially complete; continue.
Deliverables:
- Replace `decode_quic_udp_payload` returning `(session_id, Vec<u8>)` with a borrowed decode result for the QUIC receive path.
- Push owned `Vec` allocation to the exact boundary that requires ownership, not at frame parse time.
- Use the borrowed application-frame parser on the client application packet path so DATAGRAM frames do not allocate in `quic-core::parse_frames` before being copied to the final owner.
- Evaluate `bytes::Bytes`/`BytesMut` for UDP payload handoff to reduce copies through channels.
- Pre-allocate reusable packet buffers for DATAGRAM frame assembly on both client and server.
- Keep DATAGRAM frame batching MTU-aware: coalesce only small UDP payloads that fit the packet budget.
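The MTU-aware coalescing rule in the last deliverable can be sketched as a pure budget check. Overhead constants follow the DATAGRAM frame layout (1 type byte plus a 1-2 byte varint length for payloads under 16 KiB); the function names and queue shape are assumptions, not the real agent code.

```rust
// Sketch: take as many queued UDP payloads as fit one QUIC packet
// budget, counting per-frame overhead. A payload is never split across
// packets; an oversized payload simply stops the batch.

/// DATAGRAM frame overhead: type byte + varint length field.
fn frame_overhead(payload_len: usize) -> usize {
    1 + if payload_len < 64 { 1 } else { 2 }
}

/// Returns how many payloads from the front of `queue` fit in `budget`
/// bytes of one packet.
fn coalesce_count(queue: &[Vec<u8>], budget: usize) -> usize {
    let mut used = 0;
    let mut n = 0;
    for p in queue {
        let need = frame_overhead(p.len()) + p.len();
        if used + need > budget {
            break; // flush the packet; this payload starts the next one
        }
        used += need;
        n += 1;
    }
    n
}
```

For 1150B payloads this degenerates to one frame per packet, which is the expected steady state at the current test sizes; coalescing only pays off for small datagrams.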
Acceptance:
- 1G / 1150B remains stable below 1% loss.
- 1.5G / 1150B improves over the current about 1.33G receiver or reduces loss at the same receiver rate.
- `cargo test -p nexus-agent quic::tests --lib` and `cargo test -p quic-core` pass.
Rollback condition:
- Any buffer reuse that introduces lifetime complexity or data corruption is reverted unless it gives a clear measured win.
Phase 2: ACK and Packet-Number Accounting
Status: started.
Deliverables:
- Keep DATAGRAM-only ACK decimation separate from STREAM ACK policy.
- Replace
BTreeSetreceived packet tracking in the high-rate application path with a compact range set or ring window. - Add explicit upper bound for ACK range count in DATAGRAM-only ACK frames.
- Avoid building large ACK frames for old DATAGRAM packet numbers that no longer affect recovery.
- Make ACK delay configurable per profile:
- low-latency UDP profile.
- high-throughput L3 tunnel profile.
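A compact range tracker of the kind proposed above can be sketched as follows. This is a simplified model, not the real quic-core type: it extends the newest range in O(1) for the in-order common case, treats packet numbers at or below the newest range as duplicates, and forgets the oldest range once the bound is hit, matching the sliding-window pruning already in place.

```rust
/// Sketch: merged inclusive packet-number ranges with a bounded count,
/// replacing a BTreeSet of individual packet numbers. Simplified: a
/// number at or below the newest range is treated as a duplicate.
struct RecvRanges {
    ranges: Vec<(u64, u64)>, // sorted, disjoint, inclusive
    max_ranges: usize,
}

impl RecvRanges {
    fn new(max_ranges: usize) -> Self {
        RecvRanges { ranges: Vec::new(), max_ranges }
    }

    fn insert(&mut self, pn: u64) {
        match self.ranges.last_mut() {
            // Common case at high packet rate: next in-order number.
            Some((_, hi)) if pn == *hi + 1 => *hi = pn,
            // Duplicate or reordered below the newest range: ignored here.
            Some((_, hi)) if pn <= *hi => {}
            // Gap: open a new range, evicting the oldest if over budget.
            _ => {
                self.ranges.push((pn, pn));
                if self.ranges.len() > self.max_ranges {
                    self.ranges.remove(0);
                }
            }
        }
    }

    /// Ranges to advertise in a DATAGRAM-only ACK frame.
    fn ack_ranges(&self) -> &[(u64, u64)] {
        &self.ranges
    }
}
```

The bound on `max_ranges` directly enforces the "explicit upper bound for ACK range count" deliverable, since the advertised ranges are exactly the stored ones.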
Acceptance:
- ACK CPU and ACK packet rate stay bounded during 2G target tests.
- No regression in STREAM recovery tests.
- 1G / 1150B remains stable; 1.5G / 1150B loss decreases.
Rollback condition:
- If reduced ACK information causes packet-number recovery bugs for STREAM, gate the optimization behind DATAGRAM-only frame classification.
Phase 3: Dedicated DATAGRAM Session Actor
Status: planned.
Problem:
The current path still has general-purpose forwarding structure inherited from TCP/STREAM work. DATAGRAM needs a packet-rate-oriented actor with bounded queues and batch flush semantics.
Deliverables:
- Add a
QuicDatagramRelayactor per peer or per upstream profile. - It owns:
- one QUIC client session or connection pool.
- UDP session-id map.
- inbound batch queue.
- outbound batch queue.
- periodic ACK flush and maintenance timer.
- Use bounded channels or ring buffers instead of unbounded per-packet channels.
- Add clear backpressure policy:
- drop newest.
- drop oldest.
- per-session quota.
- optional priority for control packets.
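The backpressure policies listed above can be sketched as a small bounded queue where every drop is deliberate and counted. The type and names are illustrative; a production version would sit behind the actor's channel rather than replace it.

```rust
use std::collections::VecDeque;

/// Overflow behavior when the bounded queue is full.
enum OverflowPolicy {
    DropNewest, // reject the incoming packet
    DropOldest, // evict the stalest queued packet
}

/// Sketch of a bounded per-session queue with counted, intentional drops.
struct BoundedQueue<T> {
    buf: VecDeque<T>,
    cap: usize,
    policy: OverflowPolicy,
    dropped: u64, // surfaced as a metric, never a silent loss
}

impl<T> BoundedQueue<T> {
    fn new(cap: usize, policy: OverflowPolicy) -> Self {
        BoundedQueue { buf: VecDeque::with_capacity(cap), cap, policy, dropped: 0 }
    }

    fn push(&mut self, item: T) {
        if self.buf.len() == self.cap {
            self.dropped += 1;
            match self.policy {
                OverflowPolicy::DropNewest => return,
                OverflowPolicy::DropOldest => { self.buf.pop_front(); }
            }
        }
        self.buf.push_back(item);
    }

    fn pop(&mut self) -> Option<T> {
        self.buf.pop_front()
    }
}
```

For unreliable DATAGRAM traffic, drop-oldest is usually the better default: a stale UDP packet is worth less than a fresh one, and the receiver already tolerates loss.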
Acceptance:
- Queue depth counters stay bounded at 1.5G target.
- Packet drops are intentional and counted, not hidden in socket errors.
- 1G remains stable and 1.5G improves in loss or throughput.
Rollback condition:
- If actor separation adds extra task hops without batching benefit, collapse it back into the current loop and keep only the queue/counter pieces.
Phase 4: Multi-Session and Connection Striping
Status: planned.
Problem:
A single QUIC connection has one packet-number space and one serialized session state. For UDP/L3 throughput, aggregate capacity can scale better by striping across multiple QUIC REALITY connections.
Deliverables:
- Add a QUIC REALITY DATAGRAM pool with N connections per peer.
- Flow-hash UDP sessions to a stable connection:
- 5-tuple for UDP proxy mode where available.
- session-id for current UDP session proxy.
- IP flow hash for future L3 tunnel mode.
- Start with configurable
datagram_connection_count. - Keep each UDP flow ordered within its assigned connection.
- Add config surface in agent/backend/frontend for a high-throughput DATAGRAM profile.
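The flow-hash assignment above can be sketched in a few lines with std hashing. The function name is hypothetical; the only requirement is that the mapping is stable for the lifetime of the flow so per-flow ordering is preserved on one connection (`DefaultHasher::new()` is deterministic within a process, which is sufficient here).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Sketch: hash a flow key (session-id, 5-tuple, or IP flow hash) to a
/// stable connection index so every packet of a flow stays on the same
/// QUIC REALITY connection in the pool.
fn stripe_index<K: Hash>(key: &K, connection_count: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() % connection_count.max(1) as u64) as usize
}
```

Because the index depends only on the key and pool size, a single UDP session with `connection_count = 1` trivially degenerates to the non-striped path.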
Acceptance:
- Aggregate local receiver throughput reaches 2G+ with multiple UDP sessions.
- Per-flow ordering is preserved within a connection.
- A single UDP session still works without striping.
Rollback condition:
- If striping breaks NAT/session semantics, keep it opt-in for L3 tunnel mode first.
Phase 5: Kernel Send Batching and UDP GSO
Status: planned.
Problem:
At MTU-sized DATAGRAMs, 3G throughput requires hundreds of thousands of packets per second. sendmmsg helps syscall count, but UDP GSO is the larger lever for LAN/high-BDP paths.
Deliverables:
- Keep current
sendmmsgfallback path. - Add Linux UDP GSO using
UDP_SEGMENTancillary data. - Build a packet scheduler that groups encrypted QUIC short packets with the same segment size where possible.
- Runtime detect GSO support; fallback to
sendmmsgif unsupported. - Add metrics:
- GSO packets sent.
- segments per GSO send.
- fallback sends.
- GSO send errors.
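The segment-size grouping rule for the packet scheduler can be sketched without any syscalls. With `UDP_SEGMENT`, the kernel splits one send buffer into equal-sized datagrams where only the final segment may be shorter, so consecutive encrypted packets must be batched accordingly. The function name and `(start, count)` output shape are assumptions for illustration.

```rust
/// Sketch: group consecutive same-sized packets into GSO batches.
/// Under UDP_SEGMENT every segment except the last must have the same
/// size, so a single shorter trailing packet may close a batch.
/// `max_segs` bounds the number of segments in one send.
fn gso_batches(sizes: &[usize], max_segs: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < sizes.len() {
        let seg = sizes[i];
        let mut n = 1;
        // Extend the batch while packets match the segment size.
        while i + n < sizes.len() && n < max_segs && sizes[i + n] == seg {
            n += 1;
        }
        // One shorter trailing packet may ride at the end of the batch.
        if i + n < sizes.len() && n < max_segs && sizes[i + n] < seg {
            n += 1;
        }
        out.push((i, n));
        i += n;
    }
    out
}
```

Since the DATAGRAM hot path mostly emits full-budget short packets of identical size, batches should be long in practice, which is exactly when GSO pays off.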
Acceptance:
- 2G target improves without raising loss.
- 3G aggregate becomes reachable on local/LAN profile.
- WAN profile still respects MTU and can disable GSO.
Rollback condition:
- Any GSO implementation that depends on oversized IP fragmentation is rejected. GSO must segment in kernel into valid MTU-sized UDP datagrams.
Phase 6: Receive Batching Revisit
Status: deferred.
Notes:
- A first
recvmmsgexperiment did not produce stable gains and added unsafe complexity. - Revisit only after Phase 0 counters show receive syscall cost as a top bottleneck.
Deliverables if revisited:
- Isolate Linux receive batching in a small module with unit tests for sockaddr conversion.
- Avoid raw pointers stored across async
.awaitunless the type has a documented and reviewedSendboundary. - Compare:
- Tokio
recv_from+try_recv_fromdrain. recvmmsgafter readiness.- dedicated blocking thread with
recvmmsg.
- Tokio
Acceptance:
- Must show repeatable improvement over safe drain in 1.5G and 2G tests.
- Must not reduce 1G stability.
Rollback condition:
- Any unstable or neutral result should keep the safe drain implementation.
Phase 7: L3 Tunnel DATAGRAM Mode
Status: planned.
Deliverables:
- Define L3 tunnel payload mapping:
- one IP packet per QUIC DATAGRAM when it fits.
- MTU clamp and ICMP/PMTUD strategy.
- explicit drop policy for oversized packets.
- Add route/device integration separately from the QUIC data path.
- Add per-flow connection striping for L3 mode.
- Add counters for:
- IP packets in/out.
- oversized drops.
- flow-hash distribution.
- per-connection throughput.
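The explicit drop policy for oversized packets can be stated as a tiny classification step. This is a sketch under the assumptions named in the deliverables: an IP packet either fits one QUIC DATAGRAM or is dropped and counted; it is never fragmented or silently upgraded to reliable STREAM delivery. Names are illustrative.

```rust
/// Outcome for one ingress IP packet in L3 tunnel mode.
#[derive(Debug, PartialEq)]
enum L3Verdict {
    Forward,       // one IP packet -> one QUIC DATAGRAM frame
    DropOversized, // counted drop; candidate for an ICMP "packet too big"
}

/// Sketch: clamp against the negotiated max DATAGRAM payload rather
/// than fragmenting. PMTUD signaling happens out of band.
fn classify_ip_packet(ip_len: usize, max_datagram_payload: usize) -> L3Verdict {
    if ip_len <= max_datagram_payload {
        L3Verdict::Forward
    } else {
        L3Verdict::DropOversized
    }
}
```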
Acceptance:
- L3 mode uses DATAGRAM, not STREAM.
- IP packet forwarding does not depend on reliable QUIC retransmission.
- Per-flow ordering remains stable.
Phase 8: Production Safety
Status: planned.
Deliverables:
- Add profile-level limits:
- max datagram payload.
- max queued packets per session.
- max queued bytes per peer.
- max DATAGRAM connections per peer.
- Add overload behavior:
- counted packet drops.
- backpressure logs at rate-limited intervals.
- no unbounded memory growth.
- Add observability:
- Prometheus counters for DATAGRAM packets, bytes, drops, ACKs, queue depth, and connection striping distribution.
Acceptance:
- Sustained 15s high-rate tests do not grow memory without bound.
- Overload degrades by controlled packet loss rather than task stalls.
Immediate Next Actions
- Add Phase 0 counters and a compact benchmark parser.
- Convert DATAGRAM decode to borrowed payloads and reduce ownership copies.
- Replace high-rate application received-packet tracking with a compact range window specialized for ACK generation.
- Build an opt-in multi-connection DATAGRAM pool for aggregate throughput.
- Start UDP GSO only after the benchmark table proves the single-connection software path has reached its practical ceiling.
Current Engineering Judgment
The next best optimization is not recvmmsg. The next best step is measurement plus ownership/copy cleanup in the DATAGRAM hot path, followed by connection striping and UDP GSO. The current single-session path is already near 1G stable; 3G will require parallelism or kernel segmentation.