Franck Cuny

Technical Director, Site Reliability Engineering

San Francisco Bay Area | hi@fcuny.net

Technical Director with 15+ years driving reliability transformations in large-scale, physical infrastructure environments.

Led SRE strategy for infrastructure serving 100M+ daily active users at Roblox, driving the transition to cell-based architecture that achieved 99.95%+ availability across owned and operated bare-metal data centers.

Previously spent 8 years at Twitter scaling one of the world’s largest compute clusters across hundreds of thousands of bare-metal nodes, delivering tens of millions in infrastructure cost savings and establishing the reliability practices that underpinned the platform.

Combines deep technical expertise with a focus on building reliability culture—mentoring engineers, establishing production readiness frameworks, and enabling teams to own operational excellence.

Experience

Roblox, San Mateo

Technical Director, Site Reliability Engineering | Aug 2024 - present
Principal Site Reliability Engineer | Feb 2022 - Aug 2024

Define SRE strategy and technical roadmaps for infrastructure supporting 100M+ daily active users (grown from <50M since 2022) running on owned and operated bare-metal infrastructure across core data centers and edge points of presence. Lead teams of 6 to 10 engineers directly while coordinating cross-functional initiatives involving 20+ engineers across a 40-person reliability organization and a 60+ person compute organization.

Drive both technical transformations and cultural change through mentorship, production readiness frameworks, and failure testing practices.

Key Achievements:

Cell Architecture Transformation: Led the SRE strategy to transition from a single monolithic Nomad cluster to 100+ isolated clusters distributed across 2 core data centers and 20+ edge points of presence, including the adoption of Kubernetes as the next-generation orchestration platform. Architected migration plans, built automation, and established production readiness criteria. Achieved 99.95%+ platform availability while enabling independent failure domains and improved blast radius containment.
Kubernetes Adoption Enablement: Supported and enabled the organization’s transition from Nomad to Kubernetes on bare-metal infrastructure, partnering with platform teams to define operational models, establish reliability standards, and ensure teams could confidently adopt and run Kubernetes workloads in production.
Traffic Scaling & Infrastructure Efficiency: Supported a >5x increase in peak traffic in 2025 without a proportional expansion of infrastructure, driving reliability decisions that balanced aggressive efficiency targets with platform stability. Navigated cross-team tradeoffs to maintain 99.95%+ availability commitments while enabling the organization to absorb significant growth within existing infrastructure capacity.
Multi-Region Failover Framework: Orchestrated cross-team failover strategy across infrastructure and product teams. Developed comprehensive action plans, validation procedures, and automated testing frameworks. Established quarterly disaster recovery exercises that reduced failover complexity and improved team readiness.
Edge Infrastructure Modernization: Led introduction of Envoy at the edge, reducing failure domains and enabling cell-aware traffic routing. Improved request latency and simplified the proxy chain while enabling dynamic traffic steering based on cell health.
GPU Infrastructure: Contributed to the reliability and bring-up of physical GPU infrastructure, including machine configuration, hardware validation procedures, and tooling to detect and triage hardware issues at scale.
Reliability Culture & Enablement: Established a production readiness model adopted across the engineering organization. Mentored 15+ engineers (SRE and SWE) on operational excellence. Popularized failure exercise practices for major infrastructure projects, significantly improving launch quality and team confidence in handling incidents.

Twitter, San Francisco

Senior SRE → Staff SRE → Senior Staff SRE | Aug 2014 - Jan 2022

Led SRE efforts for one of the world’s largest Mesos compute clusters, spanning hundreds of thousands of bare-metal nodes across multiple owned and operated data centers. Served as Tech Lead for a 6-person SRE team supporting compute infrastructure.

Key Achievements:

Platform Reliability: Improved the overall reliability of the compute platform by defining and rolling out SLOs, overhauling incident management practices, and leading systematic root cause analysis of recurring system issues—reducing operational toil and improving platform stability at scale.
Infrastructure Cost Optimization: Designed and implemented hardware utilization strategies that delivered tens of millions of dollars in annual infrastructure savings through improved bin packing, workload optimization, and capacity planning across bare-metal data center environments.
Kubernetes Platform Adoption: Drove architectural decisions and implementation strategy for adopting Kubernetes on-premises and in the cloud, establishing the foundation for the company’s container orchestration platform and enabling teams to modernize workload deployment.
Storage Platform Transformation: Pioneered migration of all pub-sub systems from bare-metal to Aurora/Mesos, reducing operational overhead and deployment times while improving reliability. Established patterns that enabled storage teams to adopt orchestration platforms.
Network Infrastructure Scaling: Championed adoption of 10Gb+ networking in data centers, removing network bottlenecks and enabling significant scaling improvements for distributed storage systems.
Team Process & Culture: Established critical SRE processes including on-call rotations, incident response procedures, and postmortem practices that improved team effectiveness and operational excellence.

Say Media, San Francisco

Senior Software Engineer | Aug 2011 - Aug 2014

Platform engineering and operations tooling development.

Linkfluence, Paris

Senior Software Engineer | July 2007 - July 2011

Early engineer leading crawler development and platform architecture. Contributed to open source projects and represented the company at European conferences.