Franck Cuny
Technical Director, Site Reliability Engineering
San Francisco Bay Area | hi@fcuny.net
Technical Director with 15+ years driving reliability transformations at scale.
Led SRE strategy for infrastructure serving 100M+ daily active users at Roblox, driving the transition to cell-based architecture that achieved 99.95%+ availability.
Previously spent 8 years at Twitter scaling one of the world’s largest compute clusters and delivering tens of millions in infrastructure cost savings.
Combines deep technical expertise with a focus on building reliability culture: mentoring engineers, establishing production readiness frameworks, and enabling teams to own operational excellence.
Experience
Roblox, San Mateo
- Technical Director, Site Reliability Engineering | Aug 2024 - present
- Principal Site Reliability Engineer | Feb 2022 - Aug 2024
Define SRE strategy and technical roadmaps for infrastructure supporting 100M+ daily active users (grown from <50M since 2022). Lead teams of 10+ engineers directly while coordinating cross-functional initiatives involving 20+ engineers across a 40-person reliability organization and a 60+ person compute organization.
Drive both technical transformations and cultural change through mentorship, production readiness frameworks, and failure testing practices.
Key Achievements:
Cell Architecture Transformation: Led the SRE strategy to transition from a single monolithic Nomad cluster to 100+ isolated clusters (Nomad and Kubernetes) distributed across 2 core data centers and 10+ edge points of presence. Architected migration plans, built automation, and established production readiness criteria. Achieved 99.95%+ platform availability while enabling independent failure domains and improved blast radius containment.
Multi-Region Failover Framework: Orchestrated cross-team failover strategy across infrastructure and product teams. Developed comprehensive action plans, validation procedures, and automated testing frameworks. Established quarterly disaster recovery exercises that reduced failover complexity and improved team readiness.
Edge Infrastructure Modernization: Led migration from HAProxy to Envoy at the edge, reducing failure domains and enabling cell-aware traffic routing. Improved request latency and simplified the proxy chain while enabling dynamic traffic steering based on cell health.
Reliability Culture & Enablement: Established a production readiness model adopted across the engineering organization. Mentored 15+ engineers (SRE and SWE) on operational excellence. Popularized failure exercise practices for major infrastructure projects, significantly improving launch quality and team confidence in handling incidents.
Twitter, San Francisco
- Senior SRE → Staff SRE → Senior Staff SRE | Aug 2014 - Jan 2022
Led SRE efforts for one of the world’s largest Mesos compute clusters, spanning hundreds of thousands of nodes across multiple data centers. Served as Tech Lead for a 6-person SRE team supporting compute infrastructure.
Key Achievements:
Infrastructure Cost Optimization: Designed and implemented hardware utilization strategies that delivered tens of millions of dollars in annual infrastructure savings through improved bin packing, workload optimization, and capacity planning.
Kubernetes Platform Adoption: Drove architectural decisions and implementation strategy for adopting Kubernetes on-premises, establishing the foundation for the company’s container orchestration platform and enabling teams to modernize workload deployment.
Storage Platform Transformation: Pioneered migration of all pub-sub systems from bare metal to Aurora/Mesos, reducing operational overhead and deployment times while improving reliability. Established patterns that enabled storage teams to adopt orchestration platforms.
Network Infrastructure Scaling: Championed adoption of 10Gb+ networking in data centers, removing network bottlenecks and enabling significant scaling improvements for distributed storage systems.
Team Process & Culture: Established critical SRE processes including on-call rotations, incident response procedures, and postmortem practices that improved team effectiveness and operational excellence.
Say Media, San Francisco
- Senior Software Engineer | Aug 2011 - Aug 2014
Platform engineering and operations tooling development.
Linkfluence, Paris
- Senior Software Engineer | July 2007 - July 2011
Early engineer leading crawler development and platform architecture. Contributed to open source projects and represented the company at European conferences.