HPC fabric engineer evaluating InfiniBand vs Ethernet vs Slingshot / Omni-Path for AI training.
Audience Profile
- Age / Experience: 10–25 years in HPC interconnect
- Current role: HPC Fabric Architect / Cluster Lead (national lab / hyperscaler / large AI research org)
- Top pain points:
  - No public benchmarks compare latency tails at 10k+ GPU scale
  - AllReduce performance varies from vendor to vendor
  - Integrating a multi-vendor mixed fabric is costly
- Top decision blockers:
  - Procurement bundling ties the fabric choice to the compute choice
  - Existing operational expertise is concentrated on the NVIDIA side
  - (Only two declared for this segment.)
What This Segment Needs
(No product_needs supplied; derived from pain points + role.)
- Information: Independent p99.9 latency-tail and AllReduce-variance benchmarks at 10k+ GPU scale; per-vendor UEC 1.0 conformance status.
- Tools: Mixed-fabric integration cost models; RoCEv2-vs-InfiniBand congestion-control test harnesses (incast/lossless tuning).
- Services: Vendor-neutral bake-off / POC access decoupled from compute-procurement bundling.
Top 5 Companies for You (Fit Score)
| Rank | Company | Score | Why |
|------|---------|-------|-----|
| 1 | Meta Platforms | 88/100 | UEC founding steering member (UEC 1.0, 2025-06-11) on 24,576-GPU clusters; DSF/FBOSS/RoCE roadmap public at OCP Oct 2025; self-funds buildout ($18.34B net income Q2'25). Standards-facing, builder titles. |
| 2 | HPE | 79/100 | Five 2026-02→05 reqs span Slingshot, 800G switch ASIC, RoCE+InfiniBand, Ultra Ethernet, NCCL/RCCL; Cray exascale heritage. Q3 FY25 ~$9.1B but +19% YoY is Juniper-consolidation-inflated. |
| 3 | Ayar Labs | 74/100 | Full optical-I/O stack (WDM micro-rings, UCIe, 100+Gbps/lane); Staff/Principal IC tracks 2026-02-24→05-08; $155M Series C Dec 2024 (prior, not grounded). Private, pre-scale, partner-concentrated. |
| 4 | Cornelis Networks | 71/100 | Omni-Path CN5000 launched 2025-05-01 (400 Gbps/port, 500k+ endpoints), CN6000 800G 2026; libfabric/OFI + UEC 1.0. DOE single-vertical risk; CN6000 unshipped. |
| 5 | Enfabrica | 67/100 | Five senior silicon reqs (Principal SerDes 2026-02-20 → SuperNIC board 2026-05-07) signal active pre-tapeout cycle; 224G PAM4, 800G MAC, RoCEv2. Pre-revenue, single product vs NVIDIA. |
Deal-Breakers (Your Hard Preferences)
No hard preferences declared for this segment.
How to Evaluate Any Company in this Niche (Checklist)
- [ ] Check growth signals: count senior fabric reqs posted in the last 180 days; target at least 5 distinct silicon-to-collective specialties (for example SerDes, 800G MAC, RoCEv2, NCCL/RCCL).
- [ ] Check comp data: pull levels.fyi + H-1B LCA disclosures for "Network ASIC"/"Fabric Architect" bands; all 5 here have "no comp data" — make it a screening question.
- [ ] Check learning signals: confirm UEC 1.0 tier (founding/steering vs member) and that reqs name RoCEv2 congestion control + libfabric/OFI, not just "InfiniBand experience".
- [ ] Check stability signals: for private vendors ask runway + customer-win count — flag empty business_signals_180d (Ayar, Cornelis, Enfabrica all empty).
- [ ] Check culture signals: request OCP/standards talk links and engineering-blog cadence; no Glassdoor cross-source on any of the 5, so probe attrition directly.
- [ ] Check lock-in: ask if fabric ships decoupled from compute (the procurement-bundling blocker) and get shipped cluster size vs forward GW/endpoint targets.
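
The growth-signal check in the list above can be turned into a small repeatable screen. The following is a minimal sketch, assuming each req has already been hand-labeled with one specialty and a posting date; the field names (`specialty`, `posted`), the function names, and the sample postings are all hypothetical and are not real job data.

```python
from datetime import date, timedelta


def distinct_specialties(reqs, as_of, window_days=180):
    """Return the specialties named by reqs posted within the lookback window."""
    cutoff = as_of - timedelta(days=window_days)
    return {r["specialty"] for r in reqs if cutoff <= r["posted"] <= as_of}


def passes_growth_screen(reqs, as_of, min_distinct=5, window_days=180):
    """True when the window holds at least `min_distinct` distinct specialties."""
    return len(distinct_specialties(reqs, as_of, window_days)) >= min_distinct


# Illustrative postings only (assumed labels and dates, not real job data).
sample_reqs = [
    {"specialty": "SerDes", "posted": date(2026, 1, 12)},
    {"specialty": "800G MAC", "posted": date(2026, 2, 3)},
    {"specialty": "RoCEv2", "posted": date(2026, 3, 17)},
    {"specialty": "NCCL/RCCL", "posted": date(2026, 4, 1)},
    {"specialty": "SerDes", "posted": date(2026, 4, 22)},      # repeat specialty, counted once
    {"specialty": "Switch ASIC", "posted": date(2025, 9, 1)},  # older than 180 days, ignored
]

as_of = date(2026, 5, 8)
print(sorted(distinct_specialties(sample_reqs, as_of)))
print(passes_growth_screen(sample_reqs, as_of))
```

In this sample, six postings collapse to four distinct in-window specialties, so the screen fails the five-specialty target. Counting distinct specialties rather than raw reqs keeps a burst of near-identical postings from passing the screen.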
Reverse-Hype Watch
- Forward capacity targets are unbacked by shipped wins: Meta "toward 5GW", Cornelis "500k+ endpoints", Enfabrica "~3.2 Tbps ahead of BlueField-3" — only Meta's 24,576-GPU cluster is documented as shipping.
- Growth inflation: HPE's +19% YoY includes roughly one month of Juniper consolidation (organic growth ~6%); Meta's Q3'25 carried a one-time $15.93B non-cash tax charge, and its 2026 capex guidance is an open-ended "notably larger".
- Private-vendor traction claims rest on empty business_signals_180d with funding figures reconstructed from AI priors (Ayar, Cornelis, Enfabrica).
Under-reported for this segment: the deciding number — p99.9 latency-tail distribution and AllReduce completion-time variance at 10k+ GPU scale under incast — is exactly what no vendor publishes. Public coverage fixates on peak port bandwidth (400G/800G/1.6T) and gigawatt capacity, while congestion-control behavior, lossless-tuning maturity at multi-GW scale, and real mixed-fabric integration cost stay invisible until you run your own bake-off.
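
For a bake-off, the two numbers named above can be computed directly from per-iteration AllReduce completion times. The sketch below assumes those timings are already collected as a flat list in microseconds; the function names, the nearest-rank quantile definition, and the synthetic sample are illustrative choices, not a vendor or benchmark-suite method.

```python
import statistics


def nearest_rank_quantile(samples, numer, denom):
    """Nearest-rank quantile numer/denom, e.g. 999/1000 for p99.9."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < numer <= denom:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(samples)
    # Integer ceiling of numer * n / denom; avoids float rounding at 99.9%.
    rank = -(-numer * len(ordered) // denom)
    return ordered[rank - 1]


def summarize(completion_times_us):
    """Median, p99.9, and coefficient of variation of completion times."""
    mean = statistics.fmean(completion_times_us)
    return {
        "p50": nearest_rank_quantile(completion_times_us, 1, 2),
        "p99_9": nearest_rank_quantile(completion_times_us, 999, 1000),
        "cv": statistics.pstdev(completion_times_us) / mean,
    }


# Synthetic data (assumption): 0.2% of iterations hit a 9x straggler.
samples_us = [100.0] * 9980 + [900.0] * 20
summary = summarize(samples_us)
print(summary)
```

With this synthetic input the median stays at 100 µs while p99.9 lands on the 900 µs stragglers, which is the gap that peak-bandwidth figures hide. Under the nearest-rank definition, fewer than 1,000 samples make p99.9 equal to the maximum, so a run needs well over 1,000 iterations per configuration for the tail figure to mean anything.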