IBETH-HPC-FABRIC-01

HPC fabric engineer evaluating InfiniBand vs Ethernet vs Slingshot / Omni-Path for AI training.

Audience

  • Experience: 10–25 years
  • Current: HPC Fabric Architect / Cluster Lead
  • Pain: Latency tail comparison at 10k+ GPU scale lacks public benchmarks
  • Pain: AllReduce performance variance vendor-by-vendor

Product Needs

(none)

Channels

(none)

Competitor Lens

(none)

Fit Score weights

30% · 15% · 40% · 10% · 5%

Top 5 for this segment

  1. Meta Platforms: 75/100
  2. HPE: 67/100
  3. Ayar Labs: 63/100
  4. Cornelis Networks: 61/100
  5. Enfabrica: 57/100
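
The five weights listed above sum to 100%, which suggests the Fit Score is a weighted combination of per-dimension sub-scores. The brief does not say how the score is computed or what each weight applies to, so the following is only a minimal sketch, assuming a plain weighted sum of sub-scores on a 0–100 scale. The dimension names (`dim_a` … `dim_e`) and the example sub-scores are hypothetical placeholders.

```python
# Weights as listed in the brief. The dimension names are hypothetical
# placeholders: the source gives the percentages without labels.
WEIGHTS = {
    "dim_a": 0.30,
    "dim_b": 0.15,
    "dim_c": 0.40,
    "dim_d": 0.10,
    "dim_e": 0.05,
}


def fit_score(sub_scores, weights=WEIGHTS):
    """Weighted sum of per-dimension sub-scores (each 0-100), rounded to an int.

    Assumption: a simple weighted sum; the brief does not confirm the formula.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = set(weights) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return round(sum(weights[name] * sub_scores[name] for name in weights))


# Hypothetical sub-scores, for illustration only.
example = {"dim_a": 80, "dim_b": 60, "dim_c": 90, "dim_d": 50, "dim_e": 40}
print(fit_score(example))
```

Under this assumption, shifting weight toward a dimension where a company scores poorly lowers its rank, which is one way to re-prioritize the list for your own criteria.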

Full Persona Brief

Audience Profile

  • Age / Experience: 10–25 years in HPC interconnect
  • Current role: HPC Fabric Architect / Cluster Lead (national lab / hyperscaler / large AI research org)
  • Top pain points:
      • Latency tail comparison at 10k+ GPU scale lacks public benchmarks
      • AllReduce performance variance vendor-by-vendor
      • Multi-vendor mixed-fabric integration cost
  • Top decision blockers:
      • Procurement bundling forces fabric choice with compute choice
      • Existing operational expertise is on the NVIDIA side
      • (Only two declared for this segment.)

What This Segment Needs

(No product_needs supplied; derived from pain points + role.)

  • Information: Independent p99.9 latency-tail and AllReduce-variance benchmarks at 10k+ GPU scale; per-vendor UEC 1.0 conformance status.
  • Tools: Mixed-fabric integration cost models; RoCEv2-vs-InfiniBand congestion-control test harnesses (incast/lossless tuning).
  • Services: Vendor-neutral bake-off / POC access decoupled from compute-procurement bundling.
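
To make the p99.9 latency-tail and AllReduce-variance ask concrete, here is a minimal sketch of how both figures could be derived from raw per-operation completion times collected during a bake-off. It assumes a nearest-rank percentile and uses the coefficient of variation (standard deviation divided by mean) as the variance measure; neither choice comes from the brief. The function names and the synthetic sample data are hypothetical.

```python
import math
import statistics
from fractions import Fraction


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile.

    pct is passed as a string such as "99.9" so the rank is computed with
    exact rational arithmetic instead of floating point.
    """
    if not samples:
        raise ValueError("no samples")
    p = Fraction(pct)
    if not 0 < p <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def summarize(samples_us):
    """Median, tail percentiles, max, and coefficient of variation for one run."""
    mean = statistics.fmean(samples_us)
    return {
        "p50": percentile_nearest_rank(samples_us, "50"),
        "p99": percentile_nearest_rank(samples_us, "99"),
        "p99.9": percentile_nearest_rank(samples_us, "99.9"),
        "max": max(samples_us),
        "cv": statistics.pstdev(samples_us) / mean,
    }


# Synthetic completion times in microseconds, for illustration only:
# most operations are fast, a small fraction hit congestion.
synthetic_us = [100.0] * 9_980 + [400.0] * 15 + [2_500.0] * 5
report = summarize(synthetic_us)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this synthetic run p50 and p99 are identical, and the slow operations only appear at p99.9 and above, which is why the tail percentile rather than the median or mean is the figure worth requesting. Resolving p99.9 needs at least 1,000 samples per configuration.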

Top 5 Companies for You (Fit Score)

| Rank | Company | Score | Why |
|------|---------|-------|-----|
| 1 | Meta Platforms | 88/100 | UEC founding steering member (UEC 1.0, 2025-06-11) on 24,576-GPU clusters; DSF/FBOSS/RoCE roadmap public at OCP Oct 2025; self-funds buildout ($18.34B net income Q2'25). Standards-facing, builder titles. |
| 2 | HPE | 79/100 | Five 2026-02→05 reqs span Slingshot, 800G switch ASIC, RoCE+InfiniBand, Ultra Ethernet, NCCL/RCCL; Cray exascale heritage. Q3 FY25 ~$9.1B but +19% YoY is Juniper-consolidation-inflated. |
| 3 | Ayar Labs | 74/100 | Full optical-I/O stack (WDM micro-rings, UCIe, 100+Gbps/lane); Staff/Principal IC tracks 2026-02-24→05-08; $155M Series C Dec 2024 (prior, not grounded). Private, pre-scale, partner-concentrated. |
| 4 | Cornelis Networks | 71/100 | Omni-Path CN5000 launched 2025-05-01 (400 Gbps/port, 500k+ endpoints), CN6000 800G 2026; libfabric/OFI + UEC 1.0. DOE single-vertical risk; CN6000 unshipped. |
| 5 | Enfabrica | 67/100 | Five senior silicon reqs (Principal SerDes 2026-02-20 → SuperNIC board 2026-05-07) signal active pre-tapeout cycle; 224G PAM4, 800G MAC, RoCEv2. Pre-revenue, single product vs NVIDIA. |

Deal-Breakers (Your Hard Preferences)

No hard preferences declared for this segment.

How to Evaluate Any Company in this Niche (Checklist)

  • [ ] Check growth signals: count senior fabric reqs in the last 180 days; target ≥5 distinct silicon-to-collective specialties (e.g. SerDes, 800G MAC, RoCEv2, NCCL/RCCL).
  • [ ] Check comp data: pull levels.fyi + H-1B LCA disclosures for "Network ASIC"/"Fabric Architect" bands; all 5 here have "no comp data" — make it a screening question.
  • [ ] Check learning signals: confirm UEC 1.0 tier (founding/steering vs member) and that reqs name RoCEv2 congestion control + libfabric/OFI, not just "InfiniBand experience".
  • [ ] Check stability signals: for private vendors ask runway + customer-win count — flag empty business_signals_180d (Ayar, Cornelis, Enfabrica all empty).
  • [ ] Check culture signals: request OCP/standards talk links and engineering-blog cadence; no Glassdoor cross-source on any of the 5, so probe attrition directly.
  • [ ] Check lock-in: ask if fabric ships decoupled from compute (the procurement-bundling blocker) and get shipped cluster size vs forward GW/endpoint targets.
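
The growth-signal check in the list above can be applied mechanically once open reqs have been tagged by specialty. The following is a minimal sketch, assuming each req carries a posting date and a hand-assigned set of specialty tags. The `Req` type, the function names, and the sample reqs are hypothetical and not drawn from any company's actual postings.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Req:
    """One job requisition, tagged by hand with the specialties it names."""
    title: str
    posted: date
    specialties: frozenset


def distinct_recent_specialties(reqs, as_of, window_days=180):
    """Union of specialty tags across reqs posted within the window ending at as_of."""
    cutoff = as_of - timedelta(days=window_days)
    found = set()
    for req in reqs:
        if cutoff <= req.posted <= as_of:
            found |= req.specialties
    return found


def passes_growth_check(reqs, as_of, target=5, window_days=180):
    """True when at least `target` distinct specialties appear in the window."""
    return len(distinct_recent_specialties(reqs, as_of, window_days)) >= target


# Hypothetical reqs, for illustration only.
sample_reqs = [
    Req("Principal SerDes Engineer", date(2026, 2, 20), frozenset({"SerDes"})),
    Req("NIC Firmware Lead", date(2026, 5, 7), frozenset({"800G MAC", "RoCEv2"})),
    Req("Collectives Engineer", date(2025, 1, 10), frozenset({"NCCL/RCCL"})),
]
as_of = date(2026, 6, 1)
print(sorted(distinct_recent_specialties(sample_reqs, as_of)))
print(passes_growth_check(sample_reqs, as_of))
```

Counting distinct specialties rather than raw req volume avoids rewarding a vendor that reposts the same role several times.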

Reverse-Hype Watch

  • Forward capacity targets are unbacked by shipped wins: Meta "toward 5GW", Cornelis "500k+ endpoints", Enfabrica "~3.2 Tbps ahead of BlueField-3" — only Meta's 24,576-GPU cluster is documented as shipping.
  • Growth inflation: HPE's +19% YoY includes roughly one month of Juniper consolidation (organic growth ~6%); Meta's Q3'25 carried a one-time $15.93B non-cash tax charge plus open-ended "notably larger" 2026 capex.
  • Private-vendor traction claims rest on empty business_signals_180d with funding figures reconstructed from AI priors (Ayar, Cornelis, Enfabrica).

Under-reported for this segment: the deciding numbers are the p99.9 latency-tail distribution and AllReduce completion-time variance at 10k+ GPU scale under incast, and these are exactly what no vendor publishes. Public coverage fixates on peak port bandwidth (400G/800G/1.6T) and gigawatt capacity, while congestion-control behavior, lossless-tuning maturity at multi-GW scale, and real mixed-fabric integration cost stay invisible until you run your own bake-off.