ML systems engineer evaluating merchant AI silicon (AMD / Groq / Cerebras / etc.) vs NVIDIA defaults.
Audience Profile
- **Age / Experience:** 5–12 years; mid-to-senior IC.
- **Current role:** ML Systems Engineer / Distributed Training Engineer (AI lab / hyperscaler / GPU-rich startup).
- **Top pain points:**
  - ROCm vs CUDA software-stack parity gap (real status, not vendor claims)
  - Inference-cost-per-token actual numbers, not theoretical TOPS
  - Switching cost between accelerator families
- **Top decision blockers:**
  - Existing CUDA codebase migration effort
  - Vendor support response time on hard bugs
  - Toolchain maturity (debuggers / profilers / multi-node)
What This Segment Needs
- **Information:** Independent ROCm-vs-CUDA parity status, measured cost-per-token, real CUDA→target port case studies — not TOPS decks.
- **Tools:** Mature multi-node debuggers/profilers and an MLIR/LLVM compiler with quantization-aware optimization.
- **Services:** Vendor support with a published bug-response SLA and hands-on CUDA migration assistance.
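
Since this segment wants measured cost-per-token rather than peak TOPS, the arithmetic can be sketched as below. This is a minimal illustration, not any vendor's pricing model: the function name, the `utilization` parameter, and the example inputs are all hypothetical placeholders, and the throughput is assumed to be a sustained figure you measured yourself on your own workload.

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second, utilization=1.0):
    """USD per one million generated tokens, from measured sustained throughput.

    hourly_cost_usd   -- all-in hourly cost of the accelerator or instance (assumed input)
    tokens_per_second -- sustained throughput measured on your workload (assumed input)
    utilization       -- fraction of each hour spent serving tokens, in (0, 1]
    """
    if hourly_cost_usd < 0:
        raise ValueError("hourly_cost_usd must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600.0 * utilization
    return hourly_cost_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs for illustration only; these are not measurements.
    usd = cost_per_million_tokens(hourly_cost_usd=10.0, tokens_per_second=2000.0, utilization=0.5)
    print(f"{usd:.2f} USD per 1M tokens")
```

Running the same function with identical workload settings (model, batch size, sequence lengths) on each accelerator family gives a like-for-like number to set against vendor decks. The `utilization` term is there because idle hours are still billed.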
Top 5 Companies for You (Fit Score)
| Rank | Company | Score | Why |
|------|---------|-------|-----|
| 1 | Groq | 81/100 | $750M at $6.9B post-money (2025-09-17, ~2.5x step-up); OpenAI gpt-oss day-one partner (2025-08-05); HF Inference Provider (5M+ devs). Deterministic software-scheduled LPU; inference-only narrows training-stack breadth. |
| 2 | Cerebras Systems | 81/100 | $1.1B Series G at $8.1B (2025-09-30); Llama 4 Maverick ~2,500 tok/s, Qwen3-235B ~1,500 tok/s. Wafer-scale architecture; revenue leans heavily on G42 + CFIUS exposure. |
| 3 | AMD | 81/100 | Record Q3 FY2025 ~$9.25B rev +36% YoY, data center $4.3B; OpenAI 6 GW Instinct, Oracle 50,000 MI450. Profitable (EPS $1.20), but 5 of 5 open reqs are silicon roles, with zero ROCm/ML-systems reqs despite ROCm being the watch-point. |
| 4 | Tenstorrent | 76/100 | Blackhole GA p150a 140 Tensix cores (2025-05-12) → Galaxy multi-chip (2025-09-18); Samsung SF2 Quasar (2025-07-22). Open-source TT-Metalium/TT-Forge; no disclosed revenue or customer wins. |
| 5 | Rebellions | 76/100 | Series C ~$1.4B post-money (2025-06-12); REBEL on Samsung 4nm + SK hynix HBM3E (2025-08-26); Arm partner (2025-09-30). Staff Compiler (MLIR/LLVM) hiring; zero disclosed customer wins. |
Deal-Breakers (Your Hard Preferences)
No hard preferences declared for this segment.
How to Evaluate Any Company in this Niche (Checklist)
- [ ] **Check growth signals:** Require ≥1 named foundation-model/hyperscaler design win with *deployed* MW or GPU counts in the last 180d — not "target" capacity (e.g. Groq Bell 7 MW live vs 500 MW target).
- [ ] **Check comp data:** None of the 5 disclose comp — pull levels.fyi "ML Systems"/"Compiler Engineer" bands and benchmark offers against NVIDIA L5/L6 before negotiating.
- [ ] **Check learning signals:** Count public MLIR/LLVM + ROCm/TT-Metalium commit and issue-close rates; demand a live multi-node profiler/debugger demo.
- [ ] **Check stability signals:** Identify single-customer concentration (G42, HUMAIN/PIF ~$1.5B, the OpenAI warrant and Oracle order at AMD) and export-control/CFIUS exposure.
- [ ] **Check switching cost:** Request a CUDA→target port case study with engineer-days and measured perf delta.
- [ ] **Check culture signals:** Ask the compiler/ML-systems-to-pure-silicon req ratio and the vendor support SLA on hard kernel bugs.
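
Two of the checklist items reduce to simple ratios, sketched here under stated assumptions. The function names are hypothetical, and the throughput inputs are whatever you measure on the same workload before and after a port. Neither function is a standard metric definition.

```python
def deployed_fraction(deployed_mw, target_mw):
    """Share of announced target capacity that is actually live."""
    if target_mw <= 0:
        raise ValueError("target_mw must be positive")
    if deployed_mw < 0:
        raise ValueError("deployed_mw must be non-negative")
    return deployed_mw / target_mw


def relative_perf_delta(baseline_throughput, ported_throughput):
    """Signed throughput change of a ported workload relative to the CUDA baseline.

    Negative values mean the port is slower than the baseline.
    """
    if baseline_throughput <= 0:
        raise ValueError("baseline_throughput must be positive")
    return (ported_throughput - baseline_throughput) / baseline_throughput


if __name__ == "__main__":
    # Live vs target MW as quoted in the growth-signals checklist item.
    print(f"deployed share: {deployed_fraction(7.0, 500.0):.1%}")
    # Hypothetical port result for illustration only; not a measurement.
    print(f"perf delta: {relative_perf_delta(100.0, 90.0):+.1%}")
```

With the Groq figures quoted in the checklist (7 MW live against a 500 MW target), the deployed share is about 1.4%. Reading the perf delta alongside the engineer-days from a port case study gives a rough switching-cost picture.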
Reverse-Hype Watch
- **Targets sold as capacity:** Groq Bell 500 MW is a target (only ~7 MW live); Cerebras "tens of millions tok/s" is a funded build-out goal, not deployed.
- **Single-customer revenue concentration:** Groq HUMAIN/PIF ~$1.5B; Cerebras G42; AMD OpenAI (160M-share warrant) and Oracle.
- **Scale/positioning unbacked by named customers:** Tenstorrent Galaxy scale-out and Rebellions up-market LLM REBEL show zero customer-win signals.
- **Aspirational TAM as trajectory:** AMD FAD "35%+ growth" / "$1T TAM by 2030" is a management target, not booked revenue.
What's under-reported for this segment: every reasoning block says "no comp data," and not one quantifies ROCm/compiler parity, real inference cost-per-token, or vendor bug-response SLA. Hiring is ~100% RTL/physical-design; AMD explicitly lists zero compiler/ML-systems reqs *despite ROCm being the named watch-point*. The toolchain you'd actually live in is the dimension with the least public evidence, so assume the software stack lags the silicon until proven otherwise.