NETWORK-OPS-SRE-01

Site reliability engineer running AI cluster networks day-to-day.

Audience

  • Experience: 5-12 years
  • Current: Network SRE / NetOps Lead
  • Pain: Debugging hung allreduce at 10k+ GPU scale
  • Pain: Topology change rollout safety (link drains)

Fit Score weights (adjust to your priorities): 20%, 20%, 25%, 30%, 5%

Top 5 for this segment
  1. Arista Networks: 67/100
  2. Cisco: 66/100
  3. Celestica: 60/100
  4. Credo Technology: 58/100
  5. Semtech: 55/100

Full Persona Brief

Audience Profile

  • **Age / Experience:** 5-12 years in production networking
  • **Current role:** Network SRE / NetOps Lead at hyperscaler, large enterprise, or cloud-native shop
  • **Top pain points:**
      • Debugging hung allreduce at 10k+ GPU scale
      • Topology change rollout safety (link drains)
      • Vendor support escalation latency on production fires
  • **Top decision blockers:**
      • Single point of failure on closed-source controller
      • Patch + firmware cycle disruption to running clusters

What This Segment Needs

  • **Information:** Collective-stall postmortems, controller HA/failover architecture docs, hitless firmware-upgrade runbooks validated on live fabrics.
  • **Tools:** Per-flow telemetry that pinpoints allreduce/NCCL stalls, automated link-drain orchestration, RoCEv2/PFC/ECN congestion observability.
  • **Services:** SLA-backed escalation with named TAMs, war-room bridge for production fires, controller upgrade support without cluster pause.

Top 5 Companies for You (Fit Score)

| Rank | Company | Score | Why |
|------|---------|-------|-----|
| 1 | Arista Networks | 83/100 | Three quarters of 27-30% growth, guidance raised twice, ~$750M AI back-end target reaffirmed. AI PM/Architect reqs posted 2026-05-06. UEC steering member (UEC 1.0, 2025-06-11); Etherlink on Tomahawk 6 102.4 Tbps. Capped <70: no Glassdoor data. |
| 2 | Cisco | 83/100 | AI infra orders ~$2B FY2025 (2x the $1B target); Q1 FY26 webscale orders >$1.3B in one quarter, full-year guidance raised. Shipped Silicon One P200 51.2 Tbps Oct 2025. RoCEv2/lossless-Ethernet reqs. 2023-24 restructuring history tempers score. |
| 3 | Celestica | 75/100 | FY2024 $9.65B → FY2025 $12.2B (+26.4%); three 2025 guidance raises; FY2026 ~$16B outlook. Tomahawk 6 102.4 Tbps + OCP 1.6T. 5/5 sampled reqs engineering. Risk: sole-source hyperscaler whitebox programs, thin EMS margins. |
| 4 | Credo Technology | 73/100 | Q1 FY2026 record $222.8M, +274% YoY; FY2025 $436.8M, +126%. 1.6T 200G/lane AEC in production ramp; second hyperscaler crossed >10% revenue. But FY2025 10-K: one hyperscaler ~86% of revenue, a severe concentration. |
| 5 | Semtech | 68/100 | Q1 FY2026 ~$251.1M (+22% YoY), record data-center ~$50M; Q2 data center +100% YoY on CopperEdge ACC ramp. But ~$1B+ Sierra Wireless debt, CEO transition ~2024, and growth off a low base vs Marvell/Broadcom. |

Deal-Breakers (Your Hard Preferences)

No hard preferences declared for this segment. Apply your two decision blockers as live filters: closed-source single-controller designs and disruptive firmware/patch cycles are disqualifiers — confirm both in the checklist below before signing.

How to Evaluate Any Company in this Niche (Checklist)

  • [ ] **Growth signals:** Look for 3+ consecutive quarters of disclosed AI/data-center revenue growth >25% AND a named back-end/AI-fabric dollar target reaffirmed across earnings calls (e.g. Arista's ~$750M).
  • [ ] **Comp data:** No comp supplied here for any of the five — pull levels.fyi + Blind for the specific IC/Staff network-SRE ladder before negotiating; treat "no comp data" as a gap, not a positive.
  • [ ] **Learning signals:** Confirm shipped frontier silicon dates (e.g. Silicon One P200 Oct 2025, Tomahawk 6 102.4 Tbps) and UEC/UALink/OCP participation — not just roadmap slides.
  • [ ] **Stability signals:** Check 10-K customer concentration; reject >50% single-customer revenue (Credo ~86%) unless mitigants are in production, and watch acquisition debt overhang.
  • [ ] **Controller architecture:** Ask in the interview: "Is the fabric controller multi-instance/HA, and can firmware roll cluster-wide without an allreduce pause?" Get the failover RTO in writing.
  • [ ] **Escalation reality:** Ask for the named-TAM SLA and median sev-1 time-to-engineer-on-bridge from their last quarter of production fires.
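
The numeric thresholds in the checklist above (3+ consecutive quarters above 25% growth, no single customer above 50% of revenue) and the two decision blockers can be tracked as a small screen while you collect answers. The following is a minimal sketch, not a scoring method used by this brief: the `VendorFacts` fields, the `screen` function, and the sample values for "ExampleCo" are all hypothetical and need to be filled in by hand from filings and interview notes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VendorFacts:
    """Hand-collected facts for one company (all field names are hypothetical)."""
    name: str
    yoy_growth_pct: List[float]              # AI/data-center YoY growth per quarter, oldest first
    top_customer_share_pct: float            # largest single customer as % of revenue (from the 10-K)
    controller_ha: Optional[bool] = None     # None means "not yet confirmed in writing"
    hitless_firmware: Optional[bool] = None  # None means "not yet confirmed in writing"


def growth_streak(yoy_growth_pct: List[float], threshold: float = 25.0) -> int:
    """Count the most recent consecutive quarters with growth above the threshold."""
    streak = 0
    for pct in reversed(yoy_growth_pct):
        if pct <= threshold:
            break
        streak += 1
    return streak


def screen(v: VendorFacts) -> List[str]:
    """Return the checklist items that are failed or still unconfirmed."""
    flags = []
    if growth_streak(v.yoy_growth_pct) < 3:
        flags.append("growth: fewer than 3 consecutive quarters above 25%")
    if v.top_customer_share_pct > 50.0:
        flags.append("stability: single customer above 50% of revenue")
    # Unconfirmed (None) is treated the same as a failure: absence of evidence is a gap.
    if v.controller_ha is not True:
        flags.append("controller: HA/multi-instance not confirmed")
    if v.hitless_firmware is not True:
        flags.append("firmware: hitless upgrade not confirmed")
    return flags


if __name__ == "__main__":
    # Illustrative values only, not data about any real company.
    example = VendorFacts("ExampleCo", [18.0, 27.0, 28.0, 30.0], 86.0, controller_ha=True)
    for flag in screen(example):
        print(flag)
```

Treating an unanswered question as a flag rather than a pass mirrors the checklist's stance that missing data is a gap, not a positive.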

Reverse-Hype Watch

Material warnings (aggregated from the Top 5):

  1. **Growth off a low base:** Semtech's ~$50M/quarter data-center run-rate is a fraction of Marvell/Broadcom optical-DSP revenue, so 100%+ YoY overstates absolute scale.
  2. **Concentration + debt overhang:** Semtech's CopperEdge inflection rests on a small number of hyperscale scale-up customers against a ~$1B+ Sierra Wireless debt load and substitution risk.
  3. **Single-customer dependence:** Credo's ~86% one-hyperscaler revenue and Celestica's sole-source whitebox programs mean roadmap and headcount track one buyer's capex.

What's under-reported for this segment: public scoring rewards revenue growth and shipped silicon, but says almost nothing about the things that wake a Network SRE at 3am — controller failover behavior under partial fabric failure, whether firmware/EOS/NX-OS upgrades are genuinely hitless on a running 10k-GPU job, and real sev-1 escalation latency. None of the five disclosed comp, and none disclosed operational MTTR or upgrade-disruption data. Treat the absence of operational-resilience evidence as the biggest unknown, not a neutral.