Site reliability engineer running AI cluster networks day-to-day.
Audience Profile
- **Age / Experience:** 5-12 years in production networking
- **Current role:** Network SRE / NetOps Lead at hyperscaler, large enterprise, or cloud-native shop
- **Top pain points:**
  - Debugging hung allreduce at 10k+ GPU scale
  - Topology change rollout safety (link drains)
  - Vendor support escalation latency on production fires
- **Top decision blockers:**
  - Single point of failure on closed-source controller
  - Patch + firmware cycle disruption to running clusters
What This Segment Needs
- **Information:** Collective-stall postmortems, controller HA/failover architecture docs, hitless firmware-upgrade runbooks validated on live fabrics.
- **Tools:** Per-flow telemetry that pinpoints allreduce/NCCL stalls, automated link-drain orchestration, RoCEv2/PFC/ECN congestion observability.
- **Services:** SLA-backed escalation with named TAMs, war-room bridge for production fires, controller upgrade support without cluster pause.
Top 5 Companies for You (Fit Score)

| Rank | Company | Score | Why |
|------|---------|-------|-----|
| 1 | Arista Networks | 83/100 | Three quarters of 27-30% growth, guidance raised twice, ~$750M AI back-end target reaffirmed. AI PM/Architect reqs posted 2026-05-06. UEC steering member (UEC 1.0, 2025-06-11); Etherlink on Tomahawk 6 102.4 Tbps. Capped <70: no Glassdoor data. |
| 2 | Cisco | 83/100 | AI infra orders ~$2B FY2025 (2x the $1B target); Q1 FY26 webscale orders >$1.3B in one quarter, full-year guidance raised. Shipped Silicon One P200 51.2 Tbps Oct 2025. RoCEv2/lossless-Ethernet reqs. 2023-24 restructuring history tempers score. |
| 3 | Celestica | 75/100 | FY2024 $9.65B → FY2025 $12.2B (+26.4%); three 2025 guidance raises; FY2026 ~$16B outlook. Tomahawk 6 102.4 Tbps + OCP 1.6T. 5/5 sampled reqs engineering. Risk: sole-source hyperscaler whitebox programs, thin EMS margins. |
| 4 | Credo Technology | 73/100 | Q1 FY2026 record $222.8M, +274% YoY; FY2025 $436.8M, +126%. 1.6T 200G/lane AEC in production ramp; second hyperscaler crossed >10% revenue. But FY2025 10-K: one hyperscaler ~86% of revenue, a severe concentration. |
| 5 | Semtech | 68/100 | Q1 FY2026 ~$251.1M (+22% YoY), record data-center ~$50M; Q2 data center +100% YoY on CopperEdge ACC ramp. But ~$1B+ Sierra Wireless debt, CEO transition ~2024, and growth off a low base vs Marvell/Broadcom. |

Deal-Breakers (Your Hard Preferences)
No hard preferences were declared for this segment, so apply your two decision blockers as live filters instead: treat closed-source single-controller designs and disruptive firmware/patch cycles as disqualifiers, and confirm both in the checklist below before signing.
How to Evaluate Any Company in this Niche (Checklist)
- [ ] **Growth signals:** Look for 3+ consecutive quarters of disclosed AI/data-center revenue growth >25% AND a named back-end/AI-fabric dollar target reaffirmed across earnings calls (e.g. Arista's ~$750M).
- [ ] **Comp data:** No comp supplied here for any of the five — pull levels.fyi + Blind for the specific IC/Staff network-SRE ladder before negotiating; treat "no comp data" as a gap, not a positive.
- [ ] **Learning signals:** Confirm shipped frontier silicon dates (e.g. Silicon One P200 Oct 2025, Tomahawk 6 102.4 Tbps) and UEC/UALink/OCP participation — not just roadmap slides.
- [ ] **Stability signals:** Check 10-K customer concentration; reject >50% single-customer revenue (Credo ~86%) unless mitigants are in production, and watch acquisition debt overhang.
- [ ] **Controller architecture:** Ask in the interview: "Is the fabric controller multi-instance/HA, and can firmware roll cluster-wide without an allreduce pause?" Get the failover RTO in writing.
- [ ] **Escalation reality:** Ask for the named-TAM SLA and median sev-1 time-to-engineer-on-bridge from their last quarter of production fires.
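
The numeric thresholds in the checklist above (3+ consecutive quarters above 25% growth, more than 50% single-customer revenue) and the two controller questions can be turned into a small screening script. What follows is a minimal Python sketch, assuming you gather the inputs yourself from earnings calls, 10-Ks, and interviews. The `Candidate` fields, the function names, and the two sample vendors are hypothetical and are not data about the five companies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    # All fields are hypothetical names chosen for this sketch.
    name: str
    quarterly_growth_pct: List[float]        # oldest first, most recent last
    top_customer_share_pct: Optional[float]  # None if not disclosed
    controller_is_ha: Optional[bool]         # None if not yet confirmed
    hitless_firmware: Optional[bool]         # None if not yet confirmed


def consecutive_quarters_above(growth: List[float], threshold: float = 25.0) -> int:
    """Count the most recent consecutive quarters with growth above threshold."""
    count = 0
    for g in reversed(growth):
        if g > threshold:
            count += 1
        else:
            break
    return count


def screen(c: Candidate) -> List[str]:
    """Return checklist flags; an empty list means no flag was raised."""
    flags = []
    if consecutive_quarters_above(c.quarterly_growth_pct) < 3:
        flags.append("growth: fewer than 3 consecutive quarters above 25%")
    if c.top_customer_share_pct is None:
        flags.append("stability: customer concentration not disclosed")
    elif c.top_customer_share_pct > 50.0:
        # The checklist allows an exception when mitigants are in production;
        # that judgment is left to a human reviewer here.
        flags.append("stability: single customer above 50% of revenue")
    if c.controller_is_ha is not True:
        flags.append("controller: multi-instance/HA not confirmed")
    if c.hitless_firmware is not True:
        flags.append("firmware: hitless upgrade not confirmed")
    return flags


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    vendor_a = Candidate("VendorA", [27.0, 28.0, 30.0], 20.0, True, True)
    vendor_b = Candidate("VendorB", [10.0, 40.0, 60.0], 86.0, None, False)
    for v in (vendor_a, vendor_b):
        print(v.name, screen(v) or "no flags")
```

Unknown answers (`None`) are flagged rather than passed, which matches the guidance to treat missing data as a gap and not as a positive.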
Reverse-Hype Watch
Material warnings (aggregated from the Top 5): (1) **Growth off a low base** — Semtech's ~$50M/quarter data-center run-rate is a fraction of Marvell/Broadcom optical-DSP revenue, so 100%+ YoY overstates absolute scale. (2) **Concentration + debt overhang** — Semtech's CopperEdge inflection rests on a small number of hyperscale scale-up customers against a ~$1B+ Sierra Wireless debt load and substitution risk. (3) **Single-customer dependence** — Credo's ~86% one-hyperscaler revenue and Celestica's sole-source whitebox programs mean roadmap and headcount track one buyer's capex.
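
The low-base point in warning (1) is plain arithmetic: a high percentage on a small base adds fewer dollars than a modest percentage on a large one. A minimal Python sketch, using the ~$50M figure and 100% rate quoted above for the small base; the $1,000M base and 20% rate for the comparison are hypothetical and do not describe any named company.

```python
def absolute_gain(current_revenue_musd: float, yoy_growth_pct: float) -> float:
    """Dollars (in $M) added year over year, given current revenue and YoY growth."""
    prior = current_revenue_musd / (1.0 + yoy_growth_pct / 100.0)
    return current_revenue_musd - prior


# ~$50M quarterly run-rate growing 100% YoY (figures quoted in the text).
small_base = absolute_gain(50.0, 100.0)

# Hypothetical comparison: a $1,000M base growing 20% YoY.
large_base = absolute_gain(1000.0, 20.0)

print(f"small base adds ${small_base:.1f}M, large base adds ${large_base:.1f}M")
```

Under these assumptions the small base adds $25M per quarter while the hypothetical large base adds roughly $167M, despite the much lower growth rate.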
What's under-reported for this segment: public scoring rewards revenue growth and shipped silicon, but says almost nothing about the things that wake a Network SRE at 3am: controller failover behavior under partial fabric failure, whether firmware/EOS/NX-OS upgrades are genuinely hitless on a running 10k-GPU job, and real sev-1 escalation latency. None of the five disclosed comp, and none disclosed operational MTTR or upgrade-disruption data. Treat the absence of operational-resilience evidence as the biggest unknown, not as a neutral signal.