The Pragmatic Engineer
By Kevin O'Donnell
About this collection
The Pragmatic Engineer by *Gergely Orosz*, turned into a queryable knowledge base. Covers software engineering at scale, engineering culture, AI tooling, systems design, and what it's actually like to build and ship at Big Tech and high-growth startups, told by the engineers who were there. Built from podcast transcripts and newsletter content so you can dig across the full archive rather than hunting for a specific episode. Try questions like:
- What do experienced engineers say about integrating AI tools into their workflow?
- How did teams at Uber, WhatsApp, or other companies handle scaling challenges?
- What makes a great engineering culture at a high-growth company?
- How are staff and principal engineers thinking about the shift to agentic development?
Curated Sources
Why Rust is different, with Alice Ryhl - by Gergely Orosz
Rust is gaining traction as a high-performance language for building reliable applications. Alice Ryhl, a software engineer on Google's Android Rust team and maintainer of the Tokio async runtime, explains the core concepts that distinguish Rust from languages like TypeScript or C++. A primary differentiator is the absence of a garbage collector. Instead, Rust uses an ownership model where each value has exactly one owning variable. When that variable goes out of scope, the memory is cleaned up immediately. This avoids the latency spikes common in garbage-collected languages like Java and makes Rust suitable for embedded or high-performance backend use cases. The borrow checker is another fundamental component. It ensures memory safety at compile time by enforcing strict rules on how references are used. It prevents common bugs like dangling references or use-after-free vulnerabilities. In C++, these mistakes often become security vulnerabilities, such as an attacker gaining root access by writing to a reused memory address. Rust also eliminates the billion-dollar mistake of null by using an Option type that forces developers to explicitly handle the absence of a value before the code can compile. Governance in the Rust project is decentralized. Unlike Linux or Python, Rust does not have a benevolent dictator for life. Decisions are made by specialized teams using a Request for Comments (RFC) process. This process includes a final comment period where team members must reach a consensus. The language evolves through editions every few years, allowing for breaking syntax changes while maintaining backwards compatibility between crates. Rust's integration into the Linux kernel marks a significant milestone. It is no longer considered experimental and holds a status similar to C. This shift is driven by the need for memory-safe code in critical systems, especially as governments increase scrutiny on software vulnerabilities.
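The ownership and Option ideas in the summary above can be illustrated with a minimal sketch. It is not taken from the episode; `Buffer`, `name_len`, `find_user`, and `greeting` are hypothetical names chosen for illustration.

```rust
/// A type that announces when its memory is released.
struct Buffer {
    name: String,
}

impl Drop for Buffer {
    fn drop(&mut self) {
        // Runs deterministically when the owner goes out of scope.
        println!("dropping {}", self.name);
    }
}

/// Borrows the buffer instead of taking ownership of it.
fn name_len(buf: &Buffer) -> usize {
    buf.name.len()
}

/// Absence is encoded in the return type instead of a null pointer.
fn find_user(id: u32) -> Option<&'static str> {
    match id {
        1 => Some("alice"),
        _ => None,
    }
}

/// The caller has to handle both cases before the code compiles.
fn greeting(id: u32) -> String {
    match find_user(id) {
        Some(name) => format!("hello, {}", name),
        None => String::from("unknown user"),
    }
}

fn main() {
    {
        let buf = Buffer { name: String::from("scratch") };
        println!("len = {}", name_len(&buf)); // shared borrow; `buf` still owns the data
        let moved = buf; // ownership moves to `moved`
        // println!("{}", buf.name); // would not compile: use of moved value
        println!("owner is now {}", moved.name);
    } // `moved` goes out of scope here, so `drop` runs immediately

    println!("{}", greeting(1));
    println!("{}", greeting(2));
}
```

Uncommenting the marked line should produce a compile error rather than a runtime failure, which is the behaviour the summary attributes to the borrow checker.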
Key Takeaways
- Rust solves the reliability-performance trade-off by replacing garbage collection with a compile-time ownership model, making it suitable for both low-level kernel work and high-level backend APIs.
- The reputation that code works once it compiles stems from the compiler's ability to catch entire classes of logic and security errors, such as null references and data races, before execution.
- Rust's governance model proves that a large-scale technical project can thrive without a single leader by using structured RFCs and consensus-based team decision-making.
- The transition of Rust from experimental to stable in the Linux kernel signals a broader industry shift toward memory-safe systems programming, likely accelerated by government security mandates.
TypeScript, C# and Turbo Pascal with Anders Hejlsberg
Anders Hejlsberg discusses his 40-year career creating foundational programming languages, emphasizing that a language is an entire experience rather than just a compiler. Starting with Turbo Pascal, he highlights how integrating the editor, compiler, and debugger into a single, fast, and affordable package disrupted the market. The transition to Windows led to Delphi, which combined rapid application development with a powerful compiled backend, eventually powering major applications like the original Skype desktop client. The creation of C# was catalyzed by the Sun versus Microsoft lawsuit over Java, leading Microsoft to develop its own managed runtime and language. C# aimed to bridge the gap between the productivity of Visual Basic and the power of C++, introducing features like a unified object system, reflection, and the influential async/await pattern. Hejlsberg explains that async/await solved the state machine problem by allowing the compiler to handle complex asynchronous transformations, a pattern since adopted by JavaScript, Python, and Rust. He notes that while this introduced function coloring, it was the right trade-off for existing event-loop environments. TypeScript emerged as a response to the growing complexity of JavaScript applications. By adding an erasable type system, TypeScript enabled professional-grade tooling like IntelliSense and refactoring, which Hejlsberg argues are essential for scaling large teams. He details the technical architecture of the TypeScript compiler, which functions as a service to provide sub-200ms feedback in IDEs like VS Code. This project also marked a major shift in Microsoft’s culture toward open development on GitHub and the creation of the Language Server Protocol (LSP). Regarding AI, Hejlsberg views programming languages as the necessary deterministic layer for stochastic AI models. 
While AI excels at generating tests and boilerplate, he believes the developer's role is shifting toward architecture and code review. He emphasizes that language design is a long play involving 10-year cycles, where small, expert teams of six or seven people are more effective than large committees at refining a language from version one to a mature, widely adopted ecosystem.
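To make the "state machine problem" concrete, here is a hand-written sketch of the kind of state machine that async/await lets the compiler generate instead. It is an illustration under assumptions, not the transformation any particular compiler performs; `AddTask` and `Step` are hypothetical names. The task it models is roughly "await a first value, await a second value, return their sum".

```rust
/// Result of advancing the task by one step.
#[derive(Debug, PartialEq)]
enum Step {
    Pending,
    Done(u32),
}

/// Each variant records where the task is paused and what it has so far.
enum AddTask {
    WaitingForFirst,
    WaitingForSecond { first: u32 },
    Finished { sum: u32 },
}

impl AddTask {
    /// Advance the task, passing the next input if one has arrived.
    fn poll(&mut self, input: Option<u32>) -> Step {
        // A finished task keeps reporting its result.
        if let AddTask::Finished { sum } = *self {
            return Step::Done(sum);
        }
        // No new input: stay paused in the current state.
        let value = match input {
            Some(v) => v,
            None => return Step::Pending,
        };
        match *self {
            AddTask::WaitingForFirst => {
                *self = AddTask::WaitingForSecond { first: value };
                Step::Pending
            }
            AddTask::WaitingForSecond { first } => {
                let sum = first + value;
                *self = AddTask::Finished { sum };
                Step::Done(sum)
            }
            AddTask::Finished { sum } => Step::Done(sum),
        }
    }
}

fn main() {
    let mut task = AddTask::WaitingForFirst;
    println!("{:?}", task.poll(None)); // nothing has arrived yet
    println!("{:?}", task.poll(Some(2))); // first value is stored in the state
    println!("{:?}", task.poll(Some(40))); // second value completes the task
}
```

Writing this by hand for every asynchronous operation is the bookkeeping that the summary says async/await moved into the compiler.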
Key Takeaways
- A successful programming language is defined by the total developer experience, including the IDE, rather than just the compiler or syntax.
- Programming languages serve as the essential deterministic infrastructure for AI, which is inherently stochastic and non-deterministic.
- The 10-year cycle is the standard timeframe for a language to reach maturity, with version three usually being the point where industry adoption accelerates.
- Small, expert design teams of six to seven people are superior to large committees because they can engage in rigorous, high-level criticism without constant level-setting.
- Software engineering is shifting from a craft of writing code to one of reviewing architecture and overseeing AI-generated output.
Designing Data-Intensive Applications: The Cloud & Doing the Right Thing
This excerpt from the second edition of Martin Kleppmann’s book explores how cloud computing has fundamentally changed backend systems since 2016. The decision between cloud services and self-hosting comes down to core competencies. While the cloud offers speed and handles variable loads well, it introduces risks like vendor lock-in, lack of control during outages, and potential security complications. Cloud-native architecture differs from traditional setups by building on higher-level abstractions and separating storage from compute. Instead of relying on local disks, these systems use dedicated services like object storage, treating local resources as ephemeral caches. This shift changes the role of operations from manual hardware maintenance to automation, cost optimization, and service integration. The second half focuses on the ethical responsibilities of engineers. As data-driven decision-making becomes common, the risk of algorithmic prisons grows. Predictive analytics often rely on historical data that contains human bias, which software then codifies and amplifies. This can systematically exclude people from jobs or financial services without a clear path for appeal. The text also frames modern corporate data collection as a form of mass surveillance, noting that we have built an infrastructure that even totalitarian regimes would envy. Engineers are urged to use systems thinking to consider the unintended consequences of their work, ensuring that technology serves human dignity rather than just business metrics. The new edition also adds context on AI training data, local-first software, and regulatory requirements like GDPR.
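The separation of storage and compute described above can be sketched as a read-through cache in front of an object store. This is a simplified illustration, not code from the book; the `ObjectStore` trait, `FakeStore`, and `ComputeNode` are hypothetical, and an in-memory map stands in for a real object storage service.

```rust
use std::collections::HashMap;

/// Stand-in for a remote object storage service (hypothetical interface).
trait ObjectStore {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// In-memory fake so the sketch runs without a network.
struct FakeStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for FakeStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.get(key).cloned()
    }
}

/// A compute node whose local state is only a cache and can be discarded.
struct ComputeNode<S: ObjectStore> {
    store: S,
    local_cache: HashMap<String, Vec<u8>>,
    remote_reads: usize,
}

impl<S: ObjectStore> ComputeNode<S> {
    fn new(store: S) -> Self {
        ComputeNode { store, local_cache: HashMap::new(), remote_reads: 0 }
    }

    /// Read through the cache, falling back to the durable store on a miss.
    fn read(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(bytes) = self.local_cache.get(key) {
            return Some(bytes.clone());
        }
        self.remote_reads += 1;
        let bytes = self.store.get(key)?;
        self.local_cache.insert(key.to_string(), bytes.clone());
        Some(bytes)
    }

    /// Simulate losing the node's local disk; durable data is unaffected.
    fn lose_local_state(&mut self) {
        self.local_cache.clear();
    }
}

fn main() {
    let mut objects = HashMap::new();
    objects.insert("segment-1".to_string(), b"rows".to_vec());
    let mut node = ComputeNode::new(FakeStore { objects });

    let _ = node.read("segment-1"); // miss: fetched from the store
    let _ = node.read("segment-1"); // hit: served from the local cache
    node.lose_local_state();
    let bytes = node.read("segment-1"); // refetched after the cache is lost
    println!("remote reads: {}, data: {:?}", node.remote_reads, bytes);
}
```

Because every miss crosses the network, a design like this trades local-disk durability concerns for sensitivity to network latency.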
Key Takeaways
- Cloud-native architecture relies on the disaggregation of storage and compute, which allows for better scalability but makes applications highly sensitive to network performance.
- The shift to cloud services transforms traditional capacity planning into a financial exercise where performance optimization is essentially cost optimization.
- Automated systems often act as money laundering for bias by taking discriminatory historical data and outputting seemingly objective but unfair decisions.
- Engineers must treat data collection as a form of surveillance to better understand the power dynamics between the data collector and the individual.
Building Pi, and what makes self-modifying software so fascinating
This discussion features Mario Zechner, creator of the Pi coding agent, and Armin Ronacher, creator of Flask, exploring the shift toward agentic software development. Mario built Pi as a minimalist, stable alternative to tools like Claude Code, which he found increasingly buggy and intrusive. The core innovation of Pi is its self-modifying nature; because it has a minimal core with numerous extension points, users can instruct the agent to rewrite its own UI or add features like MCP support or plan modes. This represents a move toward malleable software that adapts to specific user workflows rather than remaining static. The conversation addresses the decline in software quality caused by the high velocity of AI-generated code, often referred to as vibe slop. Unlike human engineers, agents do not feel the pain of increasing complexity or technical debt. They continue to add code until the system exceeds its own context window, making the software unmaintainable. Armin notes that this lack of back pressure leads to agentic regret, where developers realize too late that they have rubber-stamped complex, low-quality implementations. To combat this, Mario emphasizes the importance of refactoring and maintaining human-in-the-loop verification. Regarding technical architecture, the guests critique the Model Context Protocol (MCP). While useful for enterprise authentication, they argue it is often non-composable and bloats the model's context window. They prefer Command Line Interfaces (CLIs) because they allow for piping and data transformation outside the model's immediate context. The episode also covers the crisis in open source, where maintainers are overwhelmed by autonomous, low-intent pull requests. Mario shares his strategy of using automated bottlenecks to filter out agent-generated spam, ensuring that only human-verified contributions reach his repositories.
They conclude that the industry must slow down to prioritize quality and intentionality over raw token output.
Key Takeaways
- Agents lack a pain response to complexity, meaning they will indefinitely add layers of code until the context window is overwhelmed, creating a technical debt trap that humans must eventually refactor.
- Deliberate friction in engineering processes is a feature that prevents catastrophic errors; removing this friction to increase agent velocity often results in unmaintainable vibe slop.
- The future of software lies in malleability, where tools like Pi allow users to prompt the agent to rewrite the tool's own functionality, effectively ending the era of rigid, static software applications.
- The commoditization of pull request generation necessitates new gatekeeping mechanisms to preserve human intentionality and prevent open-source repositories from being buried under low-quality agentic spam.
Designing Data-intensive Applications with Martin Kleppmann
Martin Kleppmann, author of the influential book Designing Data-Intensive Applications, discusses his journey from startup founder to LinkedIn engineer and academic researcher. The conversation explores the evolution of backend systems and the release of the book's second edition. Kleppmann explains how his experience building Kafka at LinkedIn shaped the original text, which focused on the trade-offs of reliable, scalable, and maintainable systems. The second edition reflects a major shift toward cloud-native architectures, where databases are increasingly built on top of object stores like S3 rather than local disks. It also removes outdated technologies like MapReduce while adding modern concepts like vector indexes for AI and data frames. The discussion covers the inherent difficulties of distributed systems, including network unreliability and clock synchronization issues. Kleppmann advocates for formal verification methods, such as TLA+ and Isabelle, arguing they will become essential as AI-generated code increases the need for automated correctness checks. He also details his current research into local-first software, which aims to reduce user dependence on centralized cloud providers by enabling peer-to-peer data synchronization and better user agency. This work involves solving complex engineering problems like decentralized access control and consistency without a central server. Finally, Kleppmann reflects on the relationship between industry and academia. He notes that while industry focuses on shipping products, academia has the freedom to pursue long-term, idealistic projects that might not be immediately commercial. He encourages engineers to move between these worlds to gain both practical experience and critical thinking skills. The interview concludes with a reminder that engineers have a societal responsibility to consider the ethical implications and risks of the systems they build.
Key Takeaways
- Cloud-native architecture has fundamentally changed database design by shifting the storage layer from local disks to object stores, requiring a rethink of replication and consistency.
- Formal verification is transitioning from a niche academic pursuit to a necessary tool for security and correctness, particularly to audit AI-generated code that humans cannot manually review at scale.
- The local-first software movement represents a strategic shift toward user agency, requiring complex decentralized algorithms to manage data without the kill switch of a centralized subscription model.
- Scaling down is becoming as important as scaling up, with serverless technologies enabling cost-efficient operations for low-load services that were previously difficult to manage on-premise.
Learnings from conducting ~1,000 interviews at Amazon
Steve Huynh, a former Principal Engineer at Amazon who conducted nearly 1,000 interviews, explains why behavioral rounds are often the deciding factor in tech hiring. While technical skills are the "ante" that gets a candidate into the game, behavioral signals determine the final offer and the leveling. A key concept is the Bar Raiser, an interviewer from outside the hiring team tasked with ensuring every new hire improves the company's average talent level. These interviewers often have veto power, and many candidates fail not because of technical gaps, but because of how they present their experience. Three primary learnings emerge from this high volume of interviews. First, candidates over-index on technical prep while neglecting behavioral stories, which have higher leverage. Ten hours of story prep can change an outcome more than eighty hours of coding practice. Second, delivery is critical; rambling or backtracking makes even great accomplishments seem unconvincing. Third, the interview is an audition for daily work life, testing how a candidate handles conflict, ambiguity, and trade-offs. Leveling is assessed through four specific dimensions. Scope measures the number of people affected by the work, ranging from individual productivity at entry levels to organizational strategy at the Principal level. Contribution focuses on individual ownership versus team effort, looking for evidence of leadership as one advances. Impact quantifies what changed for the better, ideally using business or user metrics like revenue or retention. Difficulty evaluates the complexity and constraints managed, such as balancing conflicting stakeholder needs. Fit is categorized into Role Fit, which is about handling specific position challenges, and Company Fit, which is about alignment with organizational values like "bias for action" or "customer obsession." The same story can be a positive signal at a startup but a negative one at a highly regulated enterprise. 
To succeed, candidates should research companies through recruiters, engineering blogs, and current employees to understand which behaviors are actually rewarded in practice.
Key Takeaways
- Behavioral interviews serve as the primary tool for determining a candidate's seniority level and compensation, rather than just being a culture fit check.
- The "Fit" signal is highly contextual, meaning a candidate must tailor their stories to match whether a company prizes speed, stability, or consensus.
- Leveling is a measure of influence and ambiguity; moving from Senior to Staff requires showing impact across multiple teams rather than just technical excellence.
- Recruiters are often underutilized allies who can provide specific competencies and prep materials that reveal the "answer key" for behavioral rounds.
DHH’s new way of writing code - by Gergely Orosz
David Heinemeier Hansson (DHH) explains his transition from an AI skeptic to adopting an agent-first workflow at 37signals. He describes the current state of AI as a "super mech suit" for senior developers, allowing them to act with the power of a dozen arms. By using agent harnesses like OpenCode and frontier models like Claude Opus, DHH has shifted his daily routine from "code first" to "agent first." He now kicks off tasks with agents, reviews the diffs, and only steps in for manual adjustments. This has allowed him to tackle projects that were previously too time-consuming, such as building a CLI for Basecamp or adding dual-boot capabilities to his Omarchy Linux distribution. A central theme is the "Peak Programmer" theory. DHH suggests we have reached the height of the "learned guild" of developers who command high salaries simply by being the bottleneck for implementation. As AI makes the act of writing code a commodity, the value shifts toward "taste" and the ability to verify that software is correct and beautiful. He references the Jevons Paradox, noting that as the cost of producing software drops, the ambition of projects will explode. This means small, senior-led teams can now explore hunches that would have previously required a week of manual labor in just a few minutes. The role of the designer is also evolving. At 37signals, designers are already expected to act as product managers and handle their own HTML and CSS. DHH argues that AI will further empower these "hybrid" roles, making the traditional wall between design and engineering even more obsolete. He emphasizes that Ruby on Rails is having a renaissance in this environment because its token efficiency and readability make it the ideal language for humans to verify agent-generated work. Ultimately, he believes the future belongs to those who treat software as a craft and possess the judgment to direct AI agents effectively.
Key Takeaways
- The "Super Mech Suit" effect means senior developers get a massive productivity boost because they have the "taste" to verify agent output, while junior roles become more tenuous.
- Software development is shifting from an implementation-constrained field to a judgment-constrained one where knowing what to build is the only thing that matters.
- The Unix philosophy of small, piping tools is the best way to build for agents, which is why 37signals is prioritizing CLIs for all their products.
- "Peak Programmer" doesn't mean the end of coding, but the end of the "learned guild" where pure technical skill was enough to guarantee high compensation without business or product empathy.
Scaling Uber with Thuan Pham (Uber’s first CTO)
Thuan Pham joined Uber in 2013 as its first CTO when the company had only 40 engineers and faced frequent system crashes. His journey began as a refugee from Vietnam, eventually leading to MIT and roles at HP Labs, Silicon Graphics, and VMware. His hiring process at Uber involved a 30-hour interview with Travis Kalanick over two weeks, simulating real working conditions and philosophical alignment. Upon joining, Pham identified that the dispatch system would hit a capacity wall within five months. He led a critical rewrite of the Node.js single-threaded dispatch system to a scalable architecture where multiple boxes could power a city and vice versa. This set a pattern of seeing around corners to prevent existential technical failures. The expansion into China was a major milestone, completed in just five months despite industry estimates of 18 months. This required physical data centers on Chinese soil and a complete partitioning of data for security. Pham advocated for launching in the hardest city, Chengdu, first to build team confidence. Uber's move to thousands of microservices was a byproduct of rapid growth; the team could not decompose the monolith fast enough while the business added features, leading to a fan out strategy to maintain velocity. Pham also implemented unique organizational structures, such as splitting the senior engineer level into L5A and L5B to show progress and creating a frictionless internal transfer process to retain talent. He emphasizes that a CTO's primary job is building high-density talent teams and looking two years ahead. Currently the CTO at Faire, Pham is exploring AI-driven productivity, including swarm coding with agents, which has doubled the output of top performers. He maintains that while tools like AI change the landscape, the core traits of exceptional engineers like curiosity, fearlessness, and innovation remain constant.
Key Takeaways
- The 30-hour interview process served as a high-fidelity simulation of a working relationship, ensuring that the CEO and CTO were philosophically aligned before facing high-stakes crises.
- Scaling through violent growth often requires technical debt as a survival strategy; Uber's microservices architecture was a necessity to prevent teams from blocking each other during rapid feature expansion.
- Launching in the most difficult market first, such as Chengdu, is a strategic risk-mitigation tactic that builds psychological momentum and makes subsequent expansions feel routine.
- Talent density is self-protecting; high-performing teams naturally attract and retain A-level players while becoming intolerant of mediocrity.
- AI is shifting the engineering focus from syntax to orchestration and review, where the best engineers are those who can manage higher cognitive loads while leveraging agents for massive output.
What is inference engineering? Deepdive - by Gergely Orosz
Inference engineering is the second phase of an AI model's life, focusing on serving models in production after they have been trained. While training remains expensive and limited to a few big players, the explosion of open models like Llama and DeepSeek means any engineer can now deploy high-quality intelligence. This shift has turned inference engineering into a critical skill for teams wanting to move away from expensive, closed APIs. The work happens across three main layers: runtime, infrastructure, and tooling. At the runtime level, engineers use techniques like quantization to reduce the precision of model weights, which saves memory and speeds up processing. Another method is speculative decoding, where a smaller model guesses tokens that a larger model then validates, effectively increasing the typing speed of the AI. Caching is also vital, specifically prefix caching, which reuses data from previous prompts to skip redundant work. Scaling these models requires a shift in infrastructure. Most teams start with Kubernetes for basic autoscaling, but high-growth products eventually need multi-cloud setups to find enough GPUs and stay close to users. A modern stack often uses disaggregation, which splits the initial prompt processing (prefill) from the token generation (decode) so they can run on separate, specialized hardware. For a SaaS company, the decision to invest in inference engineering is essentially a build vs buy choice. While closed APIs are easier to start with, running your own stack on open models can be 80% cheaper at scale and offers much better control over uptime and latency. As open models continue to match the capabilities of closed ones, the ability to optimize these stacks becomes a major strategic lever for growth and margins.
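The quantization technique mentioned above can be sketched in a few lines. This is a minimal illustration assuming symmetric, per-tensor int8 quantization with a single shared scale; production inference runtimes typically use more elaborate schemes, and the function names `quantize` and `dequantize` are hypothetical.

```rust
/// Quantize f32 weights to i8 with one shared scale (symmetric, per-tensor).
fn quantize(weights: &[f32]) -> (Vec<i8>, f32) {
    // The largest magnitude determines how much range each integer step covers.
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // Avoid dividing by zero when every weight is zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q: Vec<i8> = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Recover approximate f32 weights from the quantized values.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights = [0.5f32, -1.0, 0.25, 0.0];
    let (q, scale) = quantize(&weights);
    let restored = dequantize(&q, scale);

    println!("quantized: {:?} (scale {})", q, scale);
    println!("restored:  {:?}", restored);
    println!(
        "bytes per weight: {} -> {}",
        std::mem::size_of::<f32>(),
        std::mem::size_of::<i8>()
    );
}
```

Each weight shrinks from four bytes to one at the cost of a rounding error of at most about half a scale step, which is the memory-versus-precision trade the summary describes.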
Key Takeaways
- The performance gap between closed and open models has largely vanished, making in-house inference a viable competitive advantage for SaaS companies looking to improve margins.
- Optimization is a game of tradeoffs where engineers must balance latency, throughput, and output quality based on specific product requirements rather than just maximizing one metric.
- Multi-cloud infrastructure is becoming a necessity for high-scale AI products to bypass GPU capacity limits and ensure global availability and data sovereignty.
- Techniques like disaggregation and speculative decoding show that software-level optimizations can provide massive speedups without needing immediate hardware upgrades.
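Speculative decoding, as described in the summary, can be sketched with toy stand-ins for the two models. This is a simplified greedy variant under stated assumptions: each "model" is just a function from the context to the next token, `toy_draft` and `toy_target` are hypothetical, and the verification loop stands in for what a real engine would do in a single batched forward pass of the larger model.

```rust
/// Toy "large" model: the next token is the current context length.
fn toy_target(ctx: &[u32]) -> u32 {
    ctx.len() as u32
}

/// Toy "small" model: agrees with the target except at one position.
fn toy_draft(ctx: &[u32]) -> u32 {
    if ctx.len() == 3 { 99 } else { ctx.len() as u32 }
}

/// One round of greedy speculative decoding.
/// Returns how many of the draft model's guesses were accepted.
fn speculative_step<D, T>(context: &mut Vec<u32>, draft: D, target: T, k: usize) -> usize
where
    D: Fn(&[u32]) -> u32,
    T: Fn(&[u32]) -> u32,
{
    // 1. The cheap draft model guesses k tokens ahead.
    let mut guess_ctx = context.clone();
    let mut guesses = Vec::with_capacity(k);
    for _ in 0..k {
        let token = draft(guess_ctx.as_slice());
        guesses.push(token);
        guess_ctx.push(token);
    }

    // 2. The target model checks each guess in order.
    let mut accepted = 0;
    for &guess in &guesses {
        let expected = target(context.as_slice());
        context.push(expected); // the target's token is always the one kept
        if expected != guess {
            break; // first mismatch: discard the remaining guesses
        }
        accepted += 1;
    }
    accepted
}

fn main() {
    let mut context = vec![0u32];
    let first = speculative_step(&mut context, toy_draft, toy_target, 4);
    println!("accepted {} guesses, context = {:?}", first, context);
    let second = speculative_step(&mut context, toy_draft, toy_target, 4);
    println!("accepted {} guesses, context = {:?}", second, context);
}
```

The output sequence is the same one the target model would have produced on its own; the potential speedup comes from accepting several draft tokens per verification pass when the two models agree.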
Building WhatsApp with Jean Lee - by Gergely Orosz
Jean Lee, the 19th engineer at WhatsApp, details the unconventional engineering culture that allowed a team of just 30 people to support 450 million monthly active users across eight different platforms. The technical foundation relied on Erlang for backend concurrency. This was chosen for its robustness in handling massive message volumes without downtime. Unlike traditional startups, WhatsApp operated with almost zero formal process. There were no code reviews, no stand-ups, and no sprint planning. Instead, the team relied on high trust, internal dogfooding, and a quality-first mindset driven by CEO Jan Koum, who acted as the chief QA officer. A core strategic pillar was ruthlessly saying no to feature requests. WhatsApp prioritized quality and simplicity, specifically targeting a grandma in a remote countryside as the ideal user. This meant keeping the app lightweight and functional on low-end devices like Nokia S40 and Symbian rather than chasing the latest iOS or Android trends. Growth was actually suppressed through a 1 dollar annual fee to keep server and SMS costs manageable. This allowed the company to remain roughly break-even without touching its Sequoia funding. Following the 19 billion dollar acquisition by Facebook in 2014, the culture shifted toward the Meta model. Lee describes the transition from a high-ownership startup to a large organization where visibility became a prerequisite for career advancement. In the Meta calibration system, managers act as lawyers for their reports. Engineers who publicly document their impact on internal platforms like Workplace often see faster career progression. Lee eventually moved to London to help scale WhatsApp’s European engineering presence, transitioning into management by focusing on individual strengths and psychological motivators. Reflecting on the current AI-native landscape, Lee suggests that while AI increases individual efficiency, the fundamental value of small, lean teams remains constant. 
She emphasizes that today’s founders can learn from WhatsApp’s focus on removing distractions and maintaining technical excellence through high individual responsibility.
Key Takeaways
- Organizational scale often creates problems that process is then designed to solve. WhatsApp proved that by keeping the team extremely small, you can bypass the need for Scrum, TDD, or formal code reviews.
- Ruthless feature prioritization is a growth and quality strategy. By saying no to 99 percent of requests, WhatsApp ensured the core messaging experience was flawless even on low-bandwidth, low-end hardware.
- In large-scale organizations like Meta, technical excellence is insufficient for promotion without internal marketing. Visibility through social documentation of impact is what drives consensus in calibration meetings.
- Monetization can be used as a strategic lever for operational stability. The 1 dollar fee was less about revenue and more about suppressing growth to a pace the tiny engineering team could support.
From IDEs to AI Agents with Steve Yegge - by Gergely Orosz
Steve Yegge discusses the radical transformation of software engineering through AI agents and the concept of vibe coding. He outlines eight levels of AI adoption for developers, moving from basic IDE completions to managing parallel agent swarms. Yegge argues that traditional IDEs are becoming obsolete as development shifts toward conversational orchestration. He introduces Gas Town, an open source agent orchestrator designed to handle complex engineering tasks by managing multiple workers. A central theme is the bitter lesson, which suggests that leveraging massive computation and data is more effective than trying to program human domain knowledge into AI. This shift allows for a new workflow where developers can create dozens of prototypes to defer decisions until the right path emerges. The conversation addresses the organizational impact of this hyper-productivity. Yegge predicts that big tech companies are quietly dying because they cannot absorb the massive output of AI-augmented engineers. He suggests that small teams of two to twenty people will soon rival the output of massive corporations. However, this productivity comes with a vampiric effect on developers. AI drains cognitive energy at a much higher rate, meaning a developer might only have three truly productive hours in a day despite producing 100 times more value. This creates a new challenge for value capture and work-life balance. Yegge also warns of heresies in AI codebases, where agents latch onto incorrect architectural ideas that become difficult to weed out without specific documentation and tooling. Ultimately, he remains an optimist, viewing AI as a powerful force that will eventually allow non-programmers to build sophisticated software and enable a new era of personal, bespoke applications.
Key Takeaways
- The abstraction ladder is moving from manual code writing to high level orchestration, mirroring the historical shift from raw pixel manipulation to modern game engines.
- Large organizations face a structural bottleneck where they cannot effectively absorb the massive output of AI-augmented engineers, leading to political friction and potential obsolescence.
- The new work-life balance requires recognizing that AI-driven development is cognitively draining, making three hours of high intensity vibe coding equivalent to a full traditional workday.
- Maintaining AI-generated codebases requires identifying and documenting heresies, which are persistent incorrect architectural patterns that agents tend to replicate across a project.
- Value capture is shifting toward individuals and small teams who can use agents to bypass traditional corporate infrastructure and build bespoke software solutions.
Building Claude Code with Boris Cherny - by Gergely Orosz
Boris Cherny, the engineering lead for Claude Code at Anthropic, describes a fundamental shift in software development where AI agents handle the bulk of code generation and verification. After leading code quality at Meta for seven years, Cherny joined Anthropic and transitioned to a workflow where he ships 20 to 30 pull requests daily without writing a single line of code by hand. This process relies on parallelizing tasks across multiple agentic sessions, often using plan mode to align on logic before execution. At Anthropic, this approach has resulted in Claude writing approximately 80% of the internal codebase. The development of Claude Code itself moved from a simple chatbot to a tool-using agent capable of executing bash commands and editing files. Cherny emphasizes that the model should not be put in a box but rather given tools to solve problems autonomously. This philosophy extends to product development, where the team eschews traditional Product Requirement Documents (PRDs) in favor of rapid prototyping. For instance, the Claude Cowork feature was built in just 10 days by a small team using Claude Code to generate dozens of interactive versions until the user experience felt right. Safety is managed through a Swiss cheese model, incorporating multiple layers of protection including model alignment, runtime classifiers, and sub-agents that summarize web fetches to prevent prompt injection. Cherny compares this era to the invention of the printing press, suggesting that while the scribe (the manual coder) may disappear, the author (the creative generalist) will see their impact expand exponentially. The most valuable skills in this new landscape are methodical hypothesis testing, curiosity, and the ability to switch contexts rapidly across multiple disciplines like engineering, design, and finance.
Key Takeaways
- The role of the software engineer is shifting from a scribe who manually writes syntax to an author who orchestrates agents and verifies outcomes.
- Traditional documentation like PRDs is being replaced by high-velocity prototyping where teams build and test dozens of functional versions in days rather than weeks.
- Safety in agentic systems requires a multi-layered approach rather than a single solution, combining model alignment with runtime monitoring and sub-agent verification.
- Generalists who understand business logic and engineering can now outperform specialized silos by using agents as force multipliers.
Mitchell Hashimoto’s new way of writing code
Mitchell Hashimoto, co-founder of HashiCorp, details the evolution of modern cloud infrastructure tools and his current approach to software engineering. The journey began with a failed university research project called the Seattle Project, where Hashimoto documented unsolved infrastructure problems in a notebook. These notes eventually formed the basis for the HashiStack, including Vagrant for reproducible dev environments, Packer for image building, and Terraform for declarative infrastructure as code. A critical turning point for HashiCorp was the pivot from a failed platform product called Atlas to an Open Core business model. This shift occurred after a disappointing board meeting, leading Hashimoto and co-founder Armon Dadgar to focus on enterprise-grade features like secrets replication for Vault, which aligned better with corporate budget structures. Hashimoto shares the story of a near-acquisition by VMware when he was only 23. Using a regret minimization framework, he and Dadgar set a high price floor to ensure they would not regret losing their dream project to a large corporate machine. The deal ultimately fell through, allowing HashiCorp to eventually go public. He provides a candid assessment of major cloud providers based on years of partnership. He characterizes AWS as historically arrogant and slow to support open-source integrators, while praising Microsoft Azure for its professional, win-win partnership culture despite technical complexity. Google Cloud is described as having superior engineering but lacking business and sales alignment. Currently, Hashimoto is developing Ghostty, a high-performance terminal emulator built in Zig that utilizes GPU rendering. He advocates for a love-of-the-game approach to performance, optimizing render times to microseconds even when the gains are imperceptible to users.
His modern development workflow centers on agentic infrastructure, where he keeps AI agents running in the background for research and boilerplate tasks. He emphasizes that AI should be used to choose what a human thinks about, rather than replacing thinking entirely. To combat the influx of low-quality, AI-generated pull requests in open source, he is implementing a reputation-based vouching system for Ghostty, moving away from the traditional default trust model of public repositories to protect maintainer bandwidth.
Key Takeaways
- The failure of the Atlas platform taught HashiCorp that enterprise software must align with specific departmental budgets like security or networking rather than attempting to cross functional boundaries.
- Successful open-source monetization often requires an Open Core model where scale-related features like replication and legal protections are gated behind commercial licenses while the core remains permissive.
- The always-on agent strategy treats AI as an asynchronous research assistant, allowing engineers to delegate non-thinking tasks like library comparisons or boilerplate generation while they focus on high-level design.
- Open-source maintenance is shifting toward a vouching or reputation-based system to survive the AI slop era where the cost of submitting a pull request has dropped to near zero.
The programming language after Kotlin – with the creator of Kotlin
Andrey Breslav, the creator of Kotlin, discusses the evolution of programming languages from the 2010 launch of Kotlin at JetBrains to his current work on Codespeak. Kotlin was born out of a need for a pragmatic, static language in a Java ecosystem that had stagnated after Java 5. By borrowing successful features from Scala, C#, and Groovy, the team focused on reducing boilerplate and enhancing readability. Key design choices included implementing null safety to address the billion dollar mistake of null pointer exceptions, extension functions, and smart casts. A massive portion of the development effort was dedicated to seamless Java interoperability, which involved complex compiler trickery to handle collections and nullability across language boundaries. This interoperability allowed for incremental adoption, which eventually led to Google's surprise announcement of official Android support in 2017, causing usage to skyrocket into the millions. Breslav's new project, Codespeak, addresses the shift toward AI-driven development. He argues that while AI can write code, current workflows lose the human intent layer because conversations with agents are ephemeral and separate from the committed code. Codespeak aims to be a programming language based on English that uses LLMs as a library, potentially shrinking codebases by 10x. The goal is to elevate the level of abstraction so engineers focus on essential complexity and intent rather than the ceremony of implementation. Breslav emphasizes that while AI will handle the obvious implementation details, humans must remain in charge of managing system complexity and defining behavior, as technological singularity would render human engineering irrelevant.
Key Takeaways
- Kotlin's success was driven by extreme pragmatism and the decision to stand on the shoulders of giants rather than trying to invent entirely new academic concepts.
- The technical moat for Kotlin was its seamless Java interoperability, which required building a Java frontend into the Kotlin compiler to allow mixed-language projects to function without friction.
- The next evolution in programming is the preservation of intent, moving away from ephemeral AI chat histories toward a system where natural language descriptions serve as the primary, persistent source of truth.
- AI is effectively a new type of package manager where the entire world's code is the library, and the challenge is designing a query language that can reliably extract and verify that logic.
- Engineering remains a human-centric discipline focused on managing essential complexity, even as AI eliminates the accidental complexity of boilerplate and syntax.
The third golden age of software engineering – thanks to AI, with Grady Booch
Grady Booch, a pioneer of object-oriented design and co-creator of UML, argues that AI marks the beginning of a third golden age rather than the end of software engineering. He defines software engineering as the discipline of balancing technical, economic, and ethical forces to build optimal solutions in a fluid medium. Booch traces the industry through three distinct epochs defined by rising levels of abstraction. The first golden age (late 1940s to late 1970s) focused on algorithmic abstraction and decoupling software from hardware, driven largely by defense needs like the SAGE project. The second golden age (1980s to 1990s) introduced object-oriented abstraction and the rise of personal computers and early platforms. The third golden age, which Booch argues began around the turn of the millennium, focuses on component and library-level abstraction, leading into the current era of AI agents. Booch explicitly challenges Dario Amodei’s prediction that software engineering will be automated within a year, calling it "utter bullshit." He argues that while AI can automate coding patterns and reduce the distance between intent and execution, it cannot replace the human engineering required to manage complexity, security, and ethical trade-offs. He encourages developers to move up the abstraction ladder, focusing on systems theory and complexity management rather than just syntax.
Key Takeaways
- Coding is merely a subset of software engineering. The core of the profession involves balancing competing forces like physics, economics, and ethics which AI cannot yet navigate.
- AI represents a new level of abstraction similar to the shift from assembly to high-level languages. It frees engineers from tedium rather than making them obsolete.
- The most valuable skill in the AI era is the ability to manage complexity at the systems level rather than focusing on individual application logic.
- History shows that every major technological shift in software creates an existential crisis that ultimately leads to a massive expansion of the industry.
The creator of Clawd: "I ship code I don't read"
Peter Steinberger, the developer behind PSPDFKit, discusses his return to tech through AI-native development and his new project, Clawdbot. He argues that traditional coding is evolving into agentic engineering where the developer acts as a system architect.
Key Concepts:
- Closing the Loop: Agents must validate their own work using automated tests and linters. This feedback loop allows for shipping code without reading every line.
- Prompt Requests: Pull requests are becoming prompt requests where the focus is on intent and architectural vision.
- CLI over MCP: Steinberger prefers CLIs for agents because they allow for better data filtering using tools like jq and avoid context window bloat.
Clawdbot and the Loving Machine: Clawdbot is a personal assistant with a persistent soul and memory. It uses markdown files to store user values and core identity, allowing it to be proactive rather than just reactive. It can manage home automation, schedule meetings, and even make phone calls to businesses. Steinberger describes it as the future of Siri, where the technology blends away into a natural conversation.
Future of Engineering: Steinberger predicts AI will allow companies to run with 70% fewer staff. The most valuable engineers will be those with high agency and taste who can steer multiple agents in parallel. He emphasizes that weaving code into existing structures is the new primary skill, replacing manual line-by-line writing.
Workflow and Productivity: AI-native development is highly addictive, similar to a slot machine where a prompt can yield brilliant results. His workflow involves managing 5-10 agents simultaneously, which is mentally taxing but allows for 600+ commits in a single day. He also highlights that vibe coding only works when paired with rigorous automated validation to ensure the output isn't just slop.
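The Closing the Loop idea can be sketched as a small control loop: the agent makes a change, automated checks run, and any failures are fed back as input to the next attempt. This is a minimal illustration, not Steinberger's actual tooling. The names `run_checks`, `close_the_loop`, and `agent_step` are hypothetical, and `agent_step` stands in for whatever call hands feedback to an agent.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# A check is a label plus a command line, e.g. ("tests", ["pytest", "-q"]).
Check = Tuple[str, Sequence[str]]


def run_checks(checks: List[Check]) -> List[str]:
    """Run each command and return one feedback string per failing check."""
    failures = []
    for label, argv in checks:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            output = (result.stdout + result.stderr).strip()
            failures.append(f"{label} failed:\n{output}")
    return failures


def close_the_loop(
    agent_step: Callable[[List[str]], None],  # hypothetical agent hook
    checks: List[Check],
    max_attempts: int = 3,
) -> bool:
    """Alternate between an agent edit and validation until all checks pass."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        agent_step(feedback)           # agent edits files, guided by prior failures
        feedback = run_checks(checks)  # compile, lint, test: whatever is configured
        if not feedback:
            return True                # all checks passed
    return False                       # attempts exhausted; hand back to a human
```

The cap on attempts is a design choice in this sketch: without it, an agent that cannot satisfy the checks would retry indefinitely.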
Key Takeaways
- The Closing the Loop principle is the differentiator between vibe coding and reliable engineering. By giving agents the tools to compile, lint, and test their own output, engineers can delegate complex refactoring with high confidence.
- CLIs provide a more efficient interface for agents than MCPs. Agents can use standard Unix tools like jq to filter large datasets, preventing context window saturation and reducing token costs. This suggests that current agentic infrastructure might be over-complicating the tool-calling layer.
- Pull requests are evolving into prompt requests. In an AI-first workflow, the senior engineer's role is to review the prompt and the architectural intent rather than the specific lines of code generated.
- AI-native development fundamentally changes the economics of team size and GTM strategy. High-agency builders who can manage multiple agents in parallel can replace large traditional engineering teams, though this requires a total rethink of company structure and role definitions.
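The context-saving argument for CLIs can be illustrated with a small projection step. The sketch below is a Python stand-in for a jq filter, not jq itself: it keeps only the fields an agent needs before the output reaches the context window. The record shape and the `project` helper are assumptions made for illustration.

```python
import json
from typing import Any, Dict, Iterable, List


def project(records: Iterable[Dict[str, Any]], keys: List[str]) -> List[Dict[str, Any]]:
    """Keep only the named keys from each record (roughly jq's `map({id, status})`)."""
    return [{k: r[k] for k in keys if k in r} for r in records]


# Hypothetical tool output: two build records, each carrying a large log field.
raw = json.dumps([
    {"id": 1, "status": "failed", "log": "x" * 5000},
    {"id": 2, "status": "passed", "log": "y" * 5000},
])

# Only the slimmed-down JSON would be handed to the agent.
slim = json.dumps(project(json.loads(raw), ["id", "status"]))
```

Here `slim` is a small fraction of the size of `raw`, which is the effect the takeaway describes: fewer tokens spent on data the agent never needed.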
How S3 is built - by Gergely Orosz - The Pragmatic Engineer
AWS S3 currently manages over 500 trillion objects and hundreds of exabytes of data. The system serves hundreds of millions of transactions per second across millions of servers and tens of millions of hard drives. A major technical milestone was the shift from eventual consistency to strong consistency. This was achieved through a replicated journal and a specialized cache coherency protocol. Remarkably, this transition was completed without increasing latency or costs for users. To ensure correctness at this scale, the team uses automated reasoning and formal methods. These mathematical proofs are built into the code check-in process to prevent regressions in the consistency model. S3 also maintains a durability promise of 11 nines. This is supported by background auditor systems that constantly inspect every byte in the fleet and trigger automated repairs when hardware fails. The service is evolving beyond simple object storage. New primitives like S3 Tables provide native support for Apache Iceberg, allowing users to interact with data using SQL. S3 Vectors now support billions of embeddings for AI applications, offering sub-100 millisecond query times by pre-computing vector neighborhoods. The engineering culture balances two core tenets: respecting what came before while remaining technically fearless. This approach allows the system to function like a living organism that adapts to new data patterns while maintaining its foundational reliability.
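To make the 11 nines figure concrete, here is a small worked calculation. It reads the promise as a 99.999999999% chance that a given object survives a given year and treats object losses as independent, which is a simplification. The function names are illustrative only.

```python
from fractions import Fraction

NINES = 11
# Annual probability of losing a single object: 1 - 0.99999999999 = 10^-11.
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10 ** NINES)


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost in one year at the design target."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


# Storing 10 million objects: one expected loss roughly every 10,000 years.
years = years_per_expected_loss(10_000_000)
```

Exact fractions are used instead of floats because subtracting 0.99999999999 from 1 in floating point introduces rounding error at this precision.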
Key Takeaways
- Strong consistency was delivered at no extra cost and with no added latency for users, built on a replicated journal that chains nodes together sequentially.
- At S3 scale, mathematical proofs through formal methods are the only way to guarantee correctness across infinite edge cases and hardware failure combinations.
- The 'scale is to your advantage' philosophy means the system is designed so that performance and workload decorrelation actually improve as the service grows larger.
- S3 is positioning itself as the permanent home for both structured and unstructured data by integrating SQL and vector search directly into the storage layer.
The Product-Minded Engineer: The importance of good errors and warnings
The role of the product-minded engineer is becoming a baseline requirement for startups, particularly as AI tools handle more of the raw coding work. This shift requires engineers to act as a blend of product manager and developer, focusing on what to build rather than just how to build it. A major part of this transition involves mastering diagnostics, which are often the most important interface a user interacts with. Drew Hoskins, a former Staff Engineer at Stripe and Facebook, outlines a framework for categorizing errors into five types: System, User's Invalid Argument, Precondition, Developer's Invalid Argument, and Assertion. By identifying the specific persona and scenario, engineers can write messages that are both contextual and actionable. For example, instead of a generic 'invalid channel' error, a product-minded approach suggests alternatives or explains nomenclature like the difference between user handles and channel tags. This level of detail is especially critical for autonomous agents, which rely on error messages to correct their own mistakes. If an agent receives a vague error, it fails the task and wastes compute costs. Shifting left is another core strategy, which involves firing diagnostics as early as possible through static checks, upfront validations, and service fakes. These techniques reduce resource usage and save users time by preventing them from proceeding down a doomed path. Ultimately, being product-minded means asking 'why' frequently, switching between system and user viewpoints, and using scenarios to simulate interactions before writing code. Hoskins emphasizes that engineers should spend time on customer support to identify permanent fixes and use AI tools to gather user signals from Slack, GitHub, and Miro. This proactive approach ensures that the engineering team is not just reacting to bugs but actively improving the product's usability and stickiness. 
By treating diagnostics as a primary interface, teams can reduce churn and help users navigate complex workflows without needing constant manual intervention.
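As a minimal sketch of a contextual, actionable diagnostic, the example below replaces a bare 'invalid channel' message with one that explains nomenclature or suggests close matches. The `@` and `#` prefixes, the `resolve_channel` function, and the `InvalidChannelError` class are assumptions made for illustration. They are not taken from Hoskins's framework or any real API.

```python
import difflib
from typing import List


class InvalidChannelError(ValueError):
    """User's Invalid Argument: the caller named a channel that does not exist."""


def resolve_channel(name: str, known_channels: List[str]) -> str:
    """Return the channel name, or raise an error that says how to fix the input."""
    if name in known_channels:
        return name
    if name.startswith("@"):
        # Explain the nomenclature instead of only rejecting the input.
        raise InvalidChannelError(
            f"'{name}' looks like a user handle; channel tags start with '#'."
        )
    # Suggest near matches so a person or an agent can self-correct in one retry.
    suggestions = difflib.get_close_matches(name, known_channels, n=3, cutoff=0.6)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise InvalidChannelError(f"No channel named '{name}'.{hint}")
```

Validating at the point where the argument enters the system is also a small instance of shifting left: the caller learns about the problem before any downstream work starts.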
Key Takeaways
- AI agents change the ROI of error messages. Since agents are billed on usage, vague errors directly increase costs by causing task failure or unnecessary retries.
- The best place to raise an error is at the interface boundary. This is where the system has both the technical details of the failure and the context of what the user was trying to achieve.
- Shifting left through service fakes and dry runs builds trust. Providing a test mode like Stripe allows users to explore edge cases safely before moving to production.
- Product thinking requires a constant shift in perspective. Engineers should move from the system level to the user lens and back again to ensure technical implementation aligns with user goals.
The Pragmatic Engineer in 2025 - by Gergely Orosz
2025 was defined by the rapid evolution of AI tools and a volatile tech job market. The Pragmatic Engineer reached over one million readers, highlighting a massive shift in how software is built. Key technical deep dives focused on Claude Code and Cursor, which moved the needle from simple autocomplete to agentic workflows. The introduction of the Model Context Protocol (MCP) emerged as a critical building block for connecting AI agents to development environments, allowing for much deeper integration between LLMs and local tools. The tech industry faced a strange paradox throughout the year. While OpenAI reached a $500B valuation and expanded into browsers, the broader job market remained tight and difficult to navigate. Job seekers struggled with ghosting and lack of feedback, while employers dealt with sophisticated AI-fueled recruitment scams. Some cases even involved North Korean agents posing as US contractors to infiltrate engineering teams. This friction contributed to a significant decline in remote job listings and an acceleration of return to office mandates at major companies like Amazon and Instagram. AI engineering became a distinct and accessible path for software engineers. Senior engineers leveraged these tools for massive productivity gains, while junior engineers became profitable more quickly due to faster onboarding. However, the gap between senior and junior output widened as experienced devs used AI more efficiently. Traditional resources like StackOverflow continued to decline as developers shifted toward LLMs for problem solving, a trend that seems irreversible. The year also saw the continued success of The Software Engineer's Guidebook, which sold 40,000 copies and saw a unique vertical-reading release in Japan. 
The Pragmatic Engineer podcast surpassed 10 million downloads, featuring deep conversations on Linux development with Greg Kroah-Hartman, Netflix's engineering culture with Elizabeth Stone, and the measurable impact of AI on team productivity. These discussions highlighted how thoughtful software design remains crucial even as AI transforms the daily workflow of coding.
Key Takeaways
- The Model Context Protocol (MCP) is shifting from a niche idea to a foundational building block for AI agents. By standardizing how agents access data and tools, it enables a new layer of agentic infrastructure that goes far beyond simple chat interfaces.
- Recruitment fraud is driving a structural shift in how companies hire. The discovery of North Korean agents and AI-filtered candidates has created a trust tax that is actively reducing remote work opportunities and bringing back high-friction, in-person interview processes.
- AI tools are creating a seniority multiplier effect. While they help juniors onboard faster, the most significant gains are seen by senior engineers who can use their architectural knowledge to direct AI agents, potentially widening the productivity gap between experience levels.
- The decline of StackOverflow marks the end of the community-driven era of troubleshooting. As developers move to private LLM interactions, the industry loses a public, searchable knowledge base, which may change how new engineers learn and how technical information is shared.
The history of servers, the cloud, and what’s next – with Oxide
Bryan Cantrill traces the history of server technology from the 1990s dot-com era to the current shift toward specialized on-premise cloud hardware. During his time at Sun Microsystems, the industry was defined by proprietary systems like SPARC and Solaris. Interestingly, Cantrill argues that the most significant technical breakthroughs, such as ZFS and DTrace, happened during the post-boom bust when teams were forced to innovate out of desperation. As the market shifted, x86 commodity hardware and Linux eventually overtook proprietary RISC architectures. AWS then transformed the landscape by offering elastic, API-driven infrastructure, though they initially hid their high margins to prevent competition. Oxide was founded to bring hyperscale efficiency to the broader market. While giants like Google and Meta build their own custom racks, most companies are stuck with legacy servers from Dell or HP that are not designed for modern cloud operations. Oxide takes a clean sheet approach by integrating hardware and software from the ground up. This includes using DC bus bars for power and blind mating for networking to eliminate messy cabling. They even developed their own programmable switch and a Rust-based operating system called Hubris. The conversation also covers Oxide's unique culture, which features a uniform compensation model where every employee earns the same salary. This transparency helps attract high-quality talent in specialized fields like power engineering and QA. Regarding AI, Cantrill remains skeptical of its ability to solve deep systems problems. While LLMs are helpful for summarizing documents or idiomatic code suggestions, they lack the accountability and first-principles reasoning required for hardware bring-up and complex physical debugging.
Key Takeaways
- Economic busts often produce better technical work than booms because the lack of frothy capital forces engineers to focus on solving core problems with fewer resources.
- The move toward cloud repatriation is an economic inevitability for large-scale companies that outgrow the margins of public cloud providers like AWS.
- True hardware innovation requires moving past reference designs and taking a first-principles approach to eliminate decades of accumulated technical debt in the PC ecosystem.
- Uniform compensation models can act as a powerful filter for attracting principled, mission-driven talent who value culture and technical challenges over individual salary negotiations.
Frictionless: why great developer experience can help teams win in the ‘AI age’
Developer experience (DevEx) is the critical factor determining whether teams actually realize the productivity gains promised by AI. While frameworks like DORA and SPACE provide the research and measurement foundations, the focus is now shifting toward the practical implementation of reducing friction. Friction is defined as the invisible barriers that turn quick wins into delays, such as broken tools, slow pipelines, and high cognitive load. If these foundations are weak, AI tools often amplify the pain rather than masking it. Building a business case for DevEx requires translating technical pain into executive language. This involves three primary levers: recovering time, saving money, and making money. Recovering time can be framed as recaptured productivity dollars or free headcount by calculating hours lost to toil. For example, a team of 50 developers wasting five hours a week on build delays represents over $1.3 million in annual losses. Saving money involves quantifying cost reductions in cloud expenses, vendor consolidation, or reduced production incidents. Making money focuses on accelerating revenue through faster feature velocity and experimentation speed. Case studies from Etsy, Block, and Capital One demonstrate that strategic DevEx investments lead to documented millions in savings and significantly faster time-to-market. The rise of AI-augmented development does not replace existing frameworks but requires new metrics. The SPACE framework remains relevant but must now account for trust, validation effort, and prompting efficiency. Developers spend more time reasoning about information and reviewing AI suggestions than writing code, making traditional output metrics like lines of code even more obsolete. The focus must shift toward outcomes like problem-solving speed and the breadth of exploration. Ultimately, AI amplifies the impact of underlying system inefficiencies, making the removal of friction more urgent than ever for engineering leaders.
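The build-delay figure can be reproduced with a short calculation. The source does not state the inputs behind it, so the $100 fully loaded hourly rate, the 52-week year, and the 40-hour week below are assumptions chosen to land near the quoted $1.3 million.

```python
def annual_cost_of_friction(developers, hours_lost_per_week, loaded_hourly_rate,
                            weeks_per_year=52):
    """Return (hours lost per year, dollar cost per year) for recurring toil."""
    hours = developers * hours_lost_per_week * weeks_per_year
    return hours, hours * loaded_hourly_rate


def free_headcount(hours_lost_per_year, hours_per_week=40, weeks_per_year=52):
    """Express the lost hours as full-time-equivalent developers."""
    return hours_lost_per_year / (hours_per_week * weeks_per_year)


# 50 developers losing 5 hours a week, at an assumed $100 per hour.
hours, dollars = annual_cost_of_friction(50, 5, 100)
fte = free_headcount(hours)
```

Under these assumptions the team loses 13,000 hours a year, worth $1,300,000, which is the equivalent of 6.25 full-time developers. That last number is the free headcount framing.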
Key Takeaways
- AI acts as an amplifier for existing system friction. If your foundations are bad, AI tools will likely highlight those bottlenecks rather than bypass them, making DevEx more critical in the AI age.
- The most effective way to sell DevEx to leadership is by framing it as free headcount. By calculating time lost to build delays or toil, you can show how many additional developers the company effectively gains without increasing the budget.
- Metrics must shift from output to outcomes in the AI era. Since AI can generate massive amounts of code, measuring lines of code is useless; instead, focus on problem-solving speed and validation effort.
- Successful DevEx initiatives like those at Etsy and Capital One succeeded because they translated technical improvements into business terms like reduced deployment wait times and faster customer acquisition.
Being a founding engineer at an AI startup
Michelle Lim, founding engineer at Warp and founder of Flint, shares her journey from interning at Meta, Slack, and Robinhood to building AI-driven startups. She explains that her preference for ownership grew as she moved to smaller companies, eventually leading her to join Warp as the first engineer when no code had been written yet. She chose this path over safer, high-growth Series A options because of the product vision and the opportunity to be mentored by a principal Google engineer. At Warp, she saw the technical stack evolve from TypeScript to Rust to meet performance demands and developer expectations. Michelle details her approach to joining early-stage teams, emphasizing the importance of negotiating for high equity over cash and performing reverse reference checks on managers. She specifically suggests asking potential managers for references from junior engineers they have previously mentored to gauge career growth potential. She also distinguishes between product-first and code-first engineers, arguing that startups need people motivated by user impact who see technology as a tool rather than the primary goal. Now building Flint, Michelle describes a future of autonomous websites that use AI to update themselves in real-time. These sites can generate comparison pages when a competitor launches or morph their content based on whether a visitor is a healthcare executive or an AI agent. She highlights the shift toward an agentic web where sites communicate via protocols like MCP (Model Context Protocol) rather than just HTML. Her advice for aspiring founding engineers is to build AI products on the side and volunteer for the unsexy business tasks that others avoid, as this builds the broad expertise required for future founders.
Key Takeaways
- Individual impact and ownership increase by orders of magnitude as company size decreases, making early-stage roles ideal for those seeking direct line-of-sight to user impact.
- In startups, you are effectively married to your manager, so checking their track record with junior talent is often more important than the technical interview itself.
- Success as a founding engineer often comes from picking up hot potatoes like marketing, sales, or security questionnaires that others avoid, which eventually prepares you for a founder role.
- The next evolution of the internet involves websites acting as autonomous agents that optimize for conversion and communicate directly with other AI agents via JSON or MCP.
Code security for software engineers - by Gergely Orosz
Code security has shifted from a compliance-driven, quarterly audit process to a shared responsibility where developers take the lead. Since vulnerabilities manifest in code, engineers are the only ones capable of fixing them in real-time. Security teams should act as specialized consultants for complex areas like cryptography or authentication, rather than gatekeepers for every minor bug. A recent study of 8 billion lines of code found roughly one security issue for every 1,000 lines, highlighting that common problems like SQL injection, cross-site scripting, and hard-coded secrets remain prevalent. The relationship between code quality and security is often overlooked. Poorly structured or spaghetti code is harder to review and maintain, which keeps the attacker window open longer. This problem is exacerbated by AI coding assistants. While these tools speed up production, they often generate verbose or low-quality code that requires rigorous verification. AI also introduces new attack vectors like prompt injection, where human language acts as the new code injection, and slop-squatting, where AI suggests non-existent libraries that attackers then register with malicious payloads. To manage these risks, teams should use a layered automation approach. Static Application Security Testing (SAST) analyzes code paths and data flows without execution, while Software Composition Analysis (SCA) checks dependencies against known vulnerability databases (CVEs). Dynamic tools like DAST and fuzzing are better suited for security teams as they require a running environment and longer feedback loops. Ultimately, security isn't a product you buy but a process built into the development lifecycle. It focuses on closing windows and doors through consistent hygiene and automation. As AI continues to change the landscape, the bottleneck for engineering teams is no longer writing code, but the speed and accuracy of verifying that code before it reaches production.
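SQL injection, one of the common issues named above, is easy to show next to its fix. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table, data, and function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)


def find_user_unsafe(name):
    # Vulnerable: user input is spliced directly into the SQL text.
    query = f"SELECT name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(name):
    # Parameterized: the value is bound separately and never parsed as SQL.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


payload = "nobody' OR '1'='1"
leaked = find_user_unsafe(payload)   # the injected OR clause matches every row
blocked = find_user_safe(payload)    # no user has this literal name
```

The string-building pattern in `find_user_unsafe` is the kind of data flow a SAST tool is designed to flag, since it can be detected without running the code.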
Key Takeaways
- Developers are the primary owners of code security because they are the only ones who can fix vulnerabilities at the source during the development process.
- Code quality is a direct proxy for security. Complex or unreadable code increases the likelihood of overlooked bugs and significantly slows down the patching process.
- AI tools shift the engineering bottleneck from writing code to verifying it, requiring new deterministic guardrails to catch non-deterministic AI errors.
- Effective security relies on a layered automation strategy using SAST and SCA rather than manual audits or security-as-a-product solutions.
Holiday gift ideas for techies
This 2025 gift guide for tech professionals covers eight categories ranging from office accessories to kitchen tools. It highlights items that solve common developer problems like cold coffee with the Ember mug or poor focus with CO2 monitors. High quality peripherals like the Logitech MX Master 3 and Bose QuietComfort Ultra are recommended for daily use. The guide features several e-ink and focus-oriented devices such as the Boox Palma 2 and the Daylight DC-1 computer. For those interested in hardware and security, the Flipper Zero and YubiKey remain top choices. Gaming suggestions include the Nintendo Switch 2 and the Playdate handheld. The book list includes recent technical releases like AI Engineering and The Engineering Executive's Primer. Practical travel gear like the Bellroy Tech Kit and Cadence magnetic containers are also mentioned. The authors emphasize that the best gifts often help tech workers step away from their screens or improve their physical environment. The list was compiled using personal recommendations and crowdsourced ideas from various tech communities.
Key Takeaways
- The rise of e-ink and reflective LCD technology like the Daylight DC-1 shows a growing demand for high performance tools that reduce eye strain and blue light.
- There is a strong shift toward analog productivity systems like the Analog to-do list to help manage digital overwhelm and maintain focus.
- Specialized hardware like JetKVM and Flipper Zero indicates that techies value tools that provide low-level control over their environments and devices.
- Optimizing the physical workspace with tools like CO2 monitors and temperature-controlled mugs is becoming as important as the digital stack itself.
Martin Fowler - by Gergely Orosz - The Pragmatic Engineer
Martin Fowler discusses the profound shift AI brings to software engineering, comparing it to the historic transition from assembly to high-level languages. The core of this change is the move from deterministic to non-deterministic environments. While traditional programming relies on predictable logic, AI-driven development requires a mindset shift toward engineering tolerances, similar to how structural engineers account for material variations. Fowler highlights that while AI can rapidly generate code, it often produces convoluted or low-quality output that necessitates rigorous refactoring and human oversight. The conversation covers the ThoughtWorks Technology Radar, explaining how the Doppler group identifies emerging tech like the Model Context Protocol (MCP) and the use of GenAI for understanding legacy systems. Fowler is critical of vibe coding, a practice where developers generate code without reviewing it. He argues this breaks the essential learning loop, making it impossible to evolve or maintain software over time. For Fowler, the value of a developer isn't just writing code but understanding what to write through communication with users and stakeholders. Reflecting on his career, Fowler revisits the origins of the Agile Manifesto and his book Refactoring. He notes that while Agile concepts are now mainstream, their execution in large enterprises is often a pale shadow of the original vision. He suggests that software architecture patterns became less fashionable partly because cloud providers now offer pre-architected building blocks, though a shared technical vocabulary remains vital. Fowler concludes by advising junior engineers to seek human mentors and remain skeptical of AI output, as these models frequently hallucinate or provide outdated advice.
Key Takeaways
- The transition from deterministic to non-deterministic systems is the most significant shift in software engineering history, requiring developers to adopt a tolerance-based mindset similar to physical engineering.
- Vibe coding is useful for throwaway prototypes but dangerous for long-term systems because it removes the learning loop, leaving developers unable to tweak or evolve the generated code.
- AI's most effective current use cases are rapid prototyping and interrogating legacy codebases to understand data flows and logic in systems where the original authors have left.
- Software architecture patterns have lost mainstream momentum because cloud hyperscalers provide well-architected managed services, reducing the need for teams to build foundational structures from scratch.
- The core skills of a standout engineer remain unchanged by AI: curiosity, effective communication, and the ability to collaborate across the divide between technical requirements and business needs.
Netflix’s Engineering Culture - by Gergely Orosz
Netflix operates at a massive scale, capturing over a trillion events daily and managing a global infrastructure that includes the Open Connect CDN with 6,000 locations. CTO Elizabeth Stone explains the pitch to play lifecycle, where engineering underlies the entire process from content greenlighting to delivery. A major recent milestone was the Jake Paul vs. Mike Tyson live event, which saw 65 million concurrent streams. This event pushed the limits of their systems, requiring real-time problem-solving and a shift from their traditional video-on-demand model to a high-stakes live environment. Despite the scale, Netflix maintains a culture of high autonomy and unusual responsibility. They famously avoid formal performance reviews, instead relying on continuous, candid feedback and the Keeper Test to maintain high talent density. While they historically only hired senior engineers, they have recently expanded to include early-career talent and interns to bring in fresh perspectives and native AI familiarity. Regarding AI, the company takes a pragmatic approach, using tools for prototyping, documentation, and automating migrations rather than viewing it as a universal solution. They also contribute significantly to open source, with about one in five engineers involved in projects like video encoding, which has earned them multiple technical Emmys. The core philosophy remains focused on curiosity and excellence, ensuring that engineers own their outcomes without being slowed down by rigid top-down processes.
Key Takeaways
- Strategic infrastructure like the Open Connect CDN acts as a massive competitive moat, allowing Netflix to pivot into live events and gaming with lower latency than competitors.
- The Keeper Test and the absence of formal performance reviews prioritize high-trust relationships and immediate feedback over administrative overhead.
- Bottom-up innovation means individual contributors, not top-down architects, drive the design of complex systems like the live streaming stack.
- Netflix uses open source contributions not just for altruism but to up-level the entire industry, which ultimately reduces their own costs, such as bandwidth requirements.
The Software Engineer’s Guidebook: a recap
Gergely Orosz provides a detailed retrospective on his book, The Software Engineer's Guidebook, two years after its release. The book has sold approximately 40,000 copies, generating $611,911 in royalties. This success highlights the financial viability of self-publishing for established tech creators, especially when compared to traditional publishing models where authors typically retain only 7 to 15 percent of revenue. Orosz initially pitched to major publishers like O'Reilly and Manning but ultimately broke ties with Manning due to creative differences and restrictive templates that he felt would dumb down the content. The self-publishing process involved a specific stack of tools: Google Docs and Craft for writing, Vellum for ebooks, and Overleaf for print layouts. Orosz emphasizes the importance of a weighted table of contents to manage scope and page counts. He also notes that starting his newsletter actually delayed the book by two years, as the formats require different levels of depth and timeliness. A significant portion of the recap focuses on the international reach of the book. A 30-person Mongolian startup called Nasha Tech translated the work to uplevel their local tech ecosystem. This experience revealed a vibrant, growing startup scene in Ulaanbaatar with a tech stack similar to Silicon Valley, including tools like Cursor and Claude Code. Financially, the bulk of revenue came from Amazon KDP, which accounted for $470,000 in royalties. Orosz criticizes the monopolistic practices of Amazon and Audible, noting their high take rates of 70 to 75 percent for digital products. He contrasts this with print-on-demand services where Amazon is significantly more cost-effective than competitors like Ingram Spark. The document concludes that while the impact of a book is harder to measure than a newsletter, its long-term value lies in the career growth it facilitates for readers and the structured thinking it requires from the author.
Key Takeaways
- Self-publishing can yield 4 to 8 times higher royalties than traditional publishing but requires the author to manage production, layout, and marketing independently.
- Amazon and Audible dominate the digital book market and enforce take rates as high as 75 percent for audiobooks and 70 percent for ebooks.
- Grassroots translation efforts in smaller markets like Mongolia demonstrate that high-quality technical content has universal demand regardless of local market size.
- A successful technical book creates massive industry leverage, with the author estimating his work has generated roughly $80 million in value by saving engineering time.
- The reporting tools for digital and print sales remain surprisingly outdated, with significant delays and usability issues even on major platforms like Ingram Spark.
From Swift to Mojo and high-performance AI Engineering with Chris Lattner
Chris Lattner details his journey from developing LLVM as a university research project to founding Modular to solve the fragmentation in AI infrastructure. LLVM succeeded by offering a modular alternative to the monolithic GCC compiler, eventually powering Apple's transition to 64-bit mobile chips. Swift followed a similar trajectory, starting as a side project to replace the complexities of C++ and Objective-C with a safer, more modern language. A key insight from these transitions is that experts often resist new tools because their specialized knowledge becomes less valuable, requiring a slow S-curve of adoption driven by clear business value. The current AI landscape suffers from a two-world problem where developers use Python for research but must rewrite code in C++ or CUDA for production performance. This creates massive technical debt and forces companies like Anthropic to maintain separate codebases for different hardware like NVIDIA GPUs and Google TPUs. Mojo is designed to unify these worlds by being a superset of Python that provides the performance of C. It avoids the leaky abstraction of sufficiently smart compilers by giving developers explicit control over hardware features like SIMD and vectorization through library-level abstractions. Lattner explains that Mojo's metaprogramming, inspired by Zig, allows the program and metaprogram to share the same language, making high-performance code easier to debug and scale. This approach empowers domain experts, such as geneticists, to write GPU-accelerated code without needing a background in compiler engineering. Regarding the future of AI-assisted coding, Lattner uses tools like Cursor for mechanical tasks but warns against vibe coding in production. He emphasizes that while AI can handle the how of writing code, human architects must still curate the what to ensure long-term maintainability and clean system design.
Key Takeaways
- Bridging the AI Two-World Gap: The current necessity of using Python for ease and CUDA for speed creates a massive efficiency tax. Mojo's strategy relies on being a Python superset that offers C-level performance, allowing teams to stay in one ecosystem from prototype to production.
- Predictable Performance vs. Opaque Optimizations: Traditional compilers often rely on pattern matching that can fail silently when code changes. Mojo shifts this power to the developer through explicit library features, ensuring that performance is a predictable outcome rather than a compiler magic trick.
- Decoupling Software from Silicon: As the AI chip market diversifies with new entries from AMD, Apple, and cloud-specific ASICs, the industry needs a hardware-agnostic layer. Modular is positioning itself as that essential infrastructure, similar to how LLVM standardized the CPU compiler market 20 years ago.
- The Expertise Invalidation Barrier: Resistance to new languages like Swift or Mojo is often a social rather than technical problem. Understanding that experts protect their prior investments helps in designing adoption strategies that focus on enabling new classes of developers rather than just converting the old guard.
Beyond Vibe Coding with Addy Osmani - by Gergely Orosz
Professional software engineering is transitioning from vibe coding to a more disciplined AI-assisted engineering model. Vibe coding is characterized by high-level prompting and rapid iteration, which is highly effective for prototypes and MVPs but often ignores the underlying code structure. This approach prioritizes speed over correctness and maintainability, making it risky for production environments. In contrast, AI-assisted engineering treats the LLM as a powerful collaborator while keeping the human engineer firmly in control of the architecture, security, and final quality. Addy Osmani emphasizes the importance of spec-driven development, where engineers create clear plans and requirements before prompting. This method, combined with rigorous testing, helps de-risk the use of LLMs and prevents projects from going off the rails. A central concept discussed is the 70% problem, which suggests that while AI can quickly generate the bulk of an application, it frequently struggles with the final 30% involving complex edge cases, security vulnerabilities, and performance optimization. This last mile often results in a two steps back pattern where a single prompt can inadvertently break existing functionality. To combat this, engineers should review the thinking log of models to understand the logic behind generated changes. The emergence of the Model Context Protocol (MCP) is a significant advancement, providing LLMs with eyes into external tools. The Chrome DevTools MCP, for instance, allows an LLM to see what the browser sees, detecting console errors and rendering issues in real time. As AI increases development velocity, human code review becomes the primary bottleneck. The role of the senior engineer is evolving into that of a conductor who orchestrates multiple parallel agents and asynchronous background tasks. This shift requires advanced skills in work decomposition and system design. Rather than just writing code, engineers must focus on context engineering, ensuring the LLM has the right files and institutional knowledge to produce high-quality results. Ultimately, maintaining critical thinking and a growth mindset is essential to prevent over-reliance on non-deterministic tools.
Key Takeaways
- Vibe coding is a creative flow for prototyping, but production engineering requires context engineering to ensure the LLM understands team-specific constraints and history.
- The 70% problem proves that AI velocity without human precision leads to technical debt, especially in security and long-term maintainability.
- MCP (Model Context Protocol) represents a shift toward agentic infrastructure where tools have a shared language to provide LLMs with real-time environment data.
- Senior engineers are evolving into orchestrators who must master the art of decomposing complex specs into small, verifiable chunks for parallel AI agents.
Google’s Engineering Culture
Google operates as a tech island with a completely custom infrastructure built to handle planet-scale operations. Instead of industry standards like GitHub or Kubernetes, engineers use internal tools like Piper for version control, Blaze for builds, and Borg for orchestration. This vertical integration allows for high developer productivity and seamless hooks between planning, coding, and review tools. The company famously uses a monorepo containing billions of files and lines of code, supported by a cloud-based development environment where code rarely lives on local machines. Hiring and career progression follow a rigid L-level structure from L3 for entry level to L11 for SVP. Compensation is significantly higher than local market rates, often including liquid stock and cash bonuses. Performance reviews recently shifted from the PERF system to GRAD, focusing on impact levels like significant or transformational. While Google pioneered the SRE role and offers generous on-call pay, the culture faces criticism for promotion-driven development. This incentive structure encourages engineers to launch new projects to get promoted, often leading to the abandonment of existing tools and the famous Killed by Google product graveyard. The culture is defined by Googliness, which emphasizes thriving in ambiguity, teamwork, and user focus. However, the environment has shifted recently. The first major layoffs in company history in 2023 and 2024 ended the era of absolute job security. Increased focus on revenue, competition from TikTok, and the race for AI dominance have introduced more corporate pressure. Despite these changes, Google remains a top destination for its pedigree, internal mobility, and the ability to work on frontier technologies like LLMs and self-driving cars.
Key Takeaways
- The custom tech stack creates a productivity moat but also a tech island where internal skills like Piper or Borg are not directly transferable to the outside industry.
- Google's promotion system inadvertently rewards launching new features over maintaining old ones, which explains their history of redundant or abandoned products.
- Internal mobility is a core strength that allows engineers to switch teams without manager vetoes, helping retain talent even when specific projects fail.
- The transition from a 'Don't be evil' startup vibe to a profit-focused giant is complete, marked by the 2023 layoffs and a pivot toward protecting ad revenue.
The Pragmatic Engineer 2025 Survey: What’s in your tech stack? Part 3
This survey of over 3,000 engineers highlights the current state of software tooling across observability, incident management, and frontend development. In observability, Datadog remains the market leader, though many companies are moving toward open source stacks like Grafana and Prometheus to manage costs. Sentry has also evolved from simple error tracking into a full application performance monitoring platform. Incident management is seeing a generational shift. While PagerDuty and OpsGenie still hold the most market share, newer Slack-native challengers like incident.io, Rootly, and FireHydrant are gaining ground. These younger tools succeeded by prioritizing Slack integration while the older guard was slow to adapt. Feature flagging is dominated by LaunchDarkly, but it is also the category most likely to be built in-house. Engineers often start with custom solutions because they are easy to launch, though maintenance becomes the long-term challenge. In the frontend world, React and Next.js are the undisputed leaders, with Tailwind CSS becoming the preferred styling choice. Mobile development shows a clear preference for native-adjacent frameworks like SwiftUI for iOS and Jetpack Compose for Android, while React Native leads the cross-platform category. Developer productivity tools like Postman for APIs and Sonar for static analysis remain staples. However, the momentum for developer portals like Backstage seems to have slowed, possibly due to the overhead of microservices or a preference for custom internal discovery tools.
Key Takeaways
- The Slack-native movement is the primary differentiator for new incident management tools, proving that workflow integration matters more than legacy feature sets.
- Feature flagging represents a unique build vs. buy tension where the high cost of vendors like LaunchDarkly often outweighs the initial complexity of building a custom internal flag service.
- Frontend development has reached a period of stability or stagnation, with React and its meta-framework Next.js facing very little meaningful competition in the current market.
- Developer portals like Backstage are struggling to achieve mass adoption, suggesting that service discovery is either being solved through simpler internal tools or that microservice complexity is being consolidated.
Python, Go, Rust, TypeScript and AI with Armin Ronacher
Armin Ronacher discusses how programming languages are evolving and how AI is changing the way engineers work. He looks back at the Python 2 to 3 migration as a massive effort that taught the industry how to handle breaking changes through edition systems. In his current startup work, he compares Python, Go, and Rust based on pragmatism rather than just performance. Python remains the standard for machine learning and infrastructure, while Rust is the best choice for binary data and high-performance tasks despite its slow compile times. He prefers Go for backend services because its simple abstractions are easy for both humans and AI to understand. Ronacher explains his shift from being an AI skeptic to a power user of tools like Claude and Cursor. He uses AI to build custom internal tools, such as log visualizers, that would have taken weeks to build manually in the past. He also uses it to debug complex AWS permission issues by feeding it production logs. On the technical side, he warns that moving from stack frames to promise chains in modern languages has made observability much harder. It is now difficult to carry correlation IDs and context through asynchronous code. Regarding startup culture, he critiques the 996 model, noting that high intensity is possible without the burnout associated with rigid 12-hour shifts. He believes the future of engineering involves humans staying in the loop to provide architectural direction while AI handles the bulk of the implementation. He points out that while TypeScript is unavoidable in the browser, the dependency bloat in the npm ecosystem makes it a liability for backend services. He uses Pulumi with Python for infrastructure and mentions tools like Stainless for automated SDK generation, which reduces the need for unified codebases across the stack. He also mentions that his non-technical co-founder is now able to validate product features using AI tools, which significantly changes the early-stage startup dynamic. This shift allows for rapid iteration where 80 percent of the code might be generated by an agent, provided the human engineer maintains control over the system architecture and maintainability.
Key Takeaways
- AI favors simple languages because Go's lack of complex abstractions leads to higher success rates for AI-generated code compared to Python or Rust.
- Internal tooling is now essentially free since engineers can use AI to build bespoke visualizers and debuggers in minutes, removing the trade-off between product work and support tools.
- Modern observability is at risk because the shift from stack-based execution to promise chains makes it significantly harder to track correlation IDs across services.
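The difficulty of carrying a correlation ID through asynchronous code can be made concrete with a small sketch. Python's standard-library `contextvars` module is one mechanism for this: each asyncio task runs in its own copy of the context, so the ID follows awaited calls without being passed as an argument. The variable name, the `log` helper, and the request IDs below are hypothetical. The sketch only covers propagation inside one process, not across services.

```python
import asyncio
import contextvars

# Hypothetical context variable holding the current request's correlation ID.
correlation_id = contextvars.ContextVar("correlation_id", default="unset")


def log(message: str) -> str:
    # Every log line picks up the ID from the ambient context.
    return f"[{correlation_id.get()}] {message}"


async def query_database() -> str:
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return log("query finished")


async def handle_request(request_id: str) -> str:
    correlation_id.set(request_id)
    # No ID is passed explicitly; it follows the awaited call.
    return await query_database()


async def main() -> list:
    # gather wraps each coroutine in a task with its own context copy,
    # so concurrent requests do not overwrite each other's IDs.
    return await asyncio.gather(
        handle_request("req-1"), handle_request("req-2")
    )


lines = asyncio.run(main())
```

Here `lines` holds one log line per request, each tagged with its own ID even though both requests were interleaved on the same event loop.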
High-growth startups: Uber and CloudKitchens with Charles-Axel Dein
Charles-Axel Dein shares his experience as engineer number 20 at Uber, witnessing its growth to over 2,000 engineers. He describes hypergrowth as a state of "good chaos" where rapid hiring and constant incidents are the norm. At Uber, the focus was on shipping features that impacted people's livelihoods, such as payment systems for drivers. This high-stakes environment required extreme flexibility, often asking engineers to switch tech stacks immediately based on business needs. The conversation highlights the shift from the "zero interest rate" era of massive hiring to the more focused, smaller team approach at CloudKitchens. Dein emphasizes that while hypergrowth might be slowing down, the lessons remain relevant. Effective hiring requires a tight partnership between engineering managers and recruiters, treating the process like a sales funnel with continuous feedback. He advocates for "shadowing" and "reverse shadowing" to train interviewers, ensuring a high quality bar. On personal productivity, Dein recommends the "Getting Things Done" (GTD) method and using flashcards to memorize fundamentals like standard libraries and architecture patterns. He views AI as a useful tool for migrations and as a "coach" for reviewing documents, but warns against outsourcing thinking or copy-pasting AI-generated prose. He argues that writing is thinking, and losing that skill diminishes an engineer's value. Strategic advice for engineers includes taking "extreme ownership" and learning project and product management. By understanding the business context and managing trade-offs between time, cost, and quality, engineers can ensure they aren't building features that fail to move the needle. He concludes that the best engineers are "rakes" rather than "T-shaped," possessing deep expertise across multiple areas and maintaining a sense of humility and humor.
Key Takeaways
- Hypergrowth requires decoupling deployment from release. Feature flags and gradual rollouts let teams maintain velocity without the catastrophic impact that an "all-at-once" release can have when software affects the physical world.
- The "irony of automation" suggests that automating simple tasks often leaves users with only the most complex, un-automatable problems. Engineers should do things manually first to gain the context needed to automate correctly.
- Project management is a superpower for engineers. Simple habits like sending weekly updates with highlights, lowlights, and ETAs manage stakeholder perception and force clarity on goals.
- AI is most effective for "toilsome" tasks like code migrations rather than high-level design. While it can bootstrap a fix in an unfamiliar stack, the "cognitive load" of reviewing AI code is often higher than writing it from scratch.
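The idea of decoupling deployment from release with a gradual rollout can be sketched as a percentage-based feature flag. This is a minimal illustration under stated assumptions, not a description of Uber's system. The flag name, user IDs, and both functions are hypothetical.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    # Hash flag and user together so each flag gets an independent,
    # stable bucket in the range 0..99 for every user.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    # The code is deployed for everyone; the feature is released only
    # to users whose bucket falls under the current percentage.
    return rollout_bucket(flag_name, user_id) < rollout_percent
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the feature at 10 percent still sees it at 50 percent. Setting the percentage back to 0 turns the feature off without a new deployment.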
How tech companies measure the impact of AI on software development
Tech companies are moving beyond lines of code to measure AI impact, focusing instead on how these tools affect quality, speed, and developer experience. While 85% of engineers now use AI tools, 60% of engineering leaders struggle with a lack of clear metrics. Leading companies like Google, Microsoft, and Dropbox use a combination of existing core metrics and new AI-specific indicators to bridge this gap. Dropbox reports a 90% adoption rate, with AI users merging 20% more pull requests while simultaneously reducing their change failure rate. The DX AI Measurement Framework suggests balancing AI adoption metrics like active users, spend, and time saved with core engineering outcomes. Key metrics include PR throughput, cycle time, and change failure rate. Microsoft specifically tracks bad developer days to see if AI reduces daily friction or adds new obstacles like context switching. Glassdoor focuses on innovation by measuring the number of A/B tests per month. A major challenge remains the telemetry moat where vendors like Microsoft or Google hoard detailed usage data. This forces companies to rely on qualitative methods like periodic surveys and experience sampling to capture change confidence and code maintainability. Monzo finds AI particularly effective for code migrations, saving 40% to 60% of manual effort, but they remain cautious about using AI in areas involving sensitive customer data. The industry is shifting toward measuring autonomous agent performance and agent telemetry, which is expected to become a standard in the next year. For now, the most effective strategy involves an experimental mindset: setting baselines, slicing data by developer tenure or role, and ensuring that speed gains do not come at the expense of long-term technical debt.
Key Takeaways
- Core metrics like Change Failure Rate are more important than ever to ensure AI-generated speed does not compromise code quality or maintainability.
- Qualitative data from developer surveys is the only way to measure change confidence since automated system telemetry cannot capture human perception of risk.
- AI tools provide the highest immediate ROI in well-defined, repetitive tasks like code migrations and library updates rather than complex, abstract planning.
- The telemetry moat created by AI vendors makes it difficult for companies to get objective usage data, necessitating internal tracking and custom dashboards.
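Change failure rate, and the advice to slice data by cohort, can be illustrated with a short sketch. The `Deployment` record, its fields, and the sample data are made up for illustration. They are not taken from the DX framework or from any company mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    author_uses_ai: bool   # hypothetical segmentation field
    caused_failure: bool   # True if it led to an incident or rollback


def change_failure_rate(deployments: list) -> float:
    # Share of deployments that caused a failure in production.
    if not deployments:
        return 0.0
    failures = sum(1 for d in deployments if d.caused_failure)
    return failures / len(deployments)


def rate_by_cohort(deployments: list) -> dict:
    # Slice the same metric by AI usage so the cohorts can be compared.
    ai = [d for d in deployments if d.author_uses_ai]
    non_ai = [d for d in deployments if not d.author_uses_ai]
    return {
        "ai": change_failure_rate(ai),
        "non_ai": change_failure_rate(non_ai),
    }


# Made-up sample: 4 deployments by AI users (1 failure), 2 by others (1 failure).
sample = [
    Deployment(True, False),
    Deployment(True, False),
    Deployment(True, False),
    Deployment(True, True),
    Deployment(False, True),
    Deployment(False, False),
]
rates = rate_by_cohort(sample)  # {"ai": 0.25, "non_ai": 0.5}
```

A real comparison would need a baseline period and far more data. The same slicing could be applied to tenure or role, as the summary suggests.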
Code Complete with Steve McConnell - by Gergely Orosz
Steve McConnell explains that Code Complete was born from his own need for a guide on software construction, a phase distinct from high-level design or simple coding. He wrote the 900-page book only five years into his career by focusing on his own knowledge gaps from his first year in the industry. A core concept he shares is the Career Pyramid, which encourages engineers to aim for a specific point on the horizon to ensure their work adds up to cumulative value. This prevents lily pad hopping, where developers move between projects or technologies without actually increasing their long-term marketability or value to an organization. The discussion highlights that software design is often an irrational and sloppy process that developers must eventually fake into a rational one for the sake of documentation and maintenance. McConnell argues that rewriting code multiple times is a valid and often superior path to excellence. The second or third version of a piece of code typically applies lessons that the first version could not have known, leading to much higher quality in a fraction of the time. On organizational growth, he notes that startups often succeed through raw energy and fluid roles where everyone does what is necessary. However, as companies mature, they eventually require more explicit roles and necessary bureaucracy to sustain their scale and reliability for customers. Regarding the impact of AI, he sees it as a new layer of aggregation that handles the happy path well but forces a return to software fundamentals. Because AI is non-deterministic and can be subtly wrong, the engineer's job shifts toward being the final arbiter who ensures the output is exactly right for the messy, real-world requirements that a machine might miss.
Key Takeaways
- Construction is a holistic discipline that covers everything from detailed design to debugging, and treating it as a distinct skill prevents the ignorant cousin syndrome where code quality is an afterthought.
- Strategic career mapping using a pyramid model to triangulate between technology, business domain, and best practices helps avoid stagnant career growth and ensures every project increases your marketability.
- Losing work or intentionally rewriting code often results in a version that is superior to a first draft because the mental model is already refined and the lessons from the first attempt are applied immediately.
- AI acts as a forcing function because its tendency to be subtly off requires developers to become better at defining requirements and test cases, effectively mandating the best practices that many have ignored for decades.
- Organizational success in early stages is driven more by focused personal energy than by process, but sustaining that success as a company ages requires a shift toward explicit roles and reliability.
The Pragmatic Engineer 2025 Survey: What’s in your tech stack? Part 2
This survey analysis of over 3,000 software engineers reveals the dominant tools used for project management, communication, and infrastructure in 2025. JIRA, VS Code, and AWS are the most mentioned tools across the board. Despite being frequently cited as the most disliked tool, JIRA remains the market leader in project management due to its entrenchment in large enterprises. However, Linear has emerged as a major challenger, particularly in companies with fewer than 50 employees where it rivals JIRA in popularity. In the design and collaboration space, Figma has achieved near-total dominance with a 97% mention rate among respondents, far outstripping competitors like Sketch or Penpot. PostgreSQL has solidified its position as the industry standard for databases, used by one in three engineers. The survey highlights that specialized vector databases like Pinecone have seen limited adoption because general-purpose databases like Postgres and MongoDB have successfully integrated vector support, in Postgres's case through extensions. For backend infrastructure, Docker and Kubernetes remain the standard for containerization and orchestration. AWS services like ECS, EKS, and EC2 are ubiquitous, though Microsoft maintains a strong presence through VS Code, GitHub, and Azure DevOps. The report also examines the impact of open source forks. OpenSearch has successfully captured about 25% of the Elasticsearch market share, largely due to AWS's distribution power. In contrast, OpenTofu has struggled to gain significant traction against Terraform because it lacks a similar massive platform backer.
Key Takeaways
- PostgreSQL is now the default database choice for most teams: the architectural question has shifted from why to choose it to what would rule it out.
- The success of open source forks depends heavily on distribution power; OpenSearch succeeded because of AWS backing, while OpenTofu lacks a massive platform to drive adoption.
- Linear is successfully disrupting the project management space by focusing on high-growth startups, though JIRA remains the default for large-scale corporate environments.
- Vector-only databases are being sidelined as mainstream general-purpose databases add vector support, allowing teams to avoid the complexity of managing additional specialized systems.
The state of VC within software and AI startups – with Peter Walker
Startup hiring has plummeted from 73,000 monthly hires in early 2022 to a projected 20,000 in 2025. This shift isn't just about less capital; it's about AI making small teams significantly more productive. Investors have moved away from growth at all costs and are now obsessed with ARR per FTE. For example, the median Series A startup now has about 15 employees compared to 22 a few years ago, yet they are generating nearly $3 million in ARR, which is more than double the 2021 benchmark. While the total amount of VC money remains high, it is concentrated in a few massive AI players like OpenAI. For everyone else, the number of funding rounds has dropped by half since 2021. This has led to longer gaps between rounds, now averaging two and a half to three years. Solo founders are becoming more common, making up a third of new companies, but they still face a funding gap as VCs only back them about 17% of the time. The data on bridge rounds is particularly sobering. The success rate for a company moving from seed to Series A after taking a bridge round has crashed from 33% to just 8%. For employees, understanding equity dilution and liquidation preferences is more important than ever, as headline acquisition prices often mask the fact that investors get paid back first. Technical and commercial advisors remain valuable, but median equity grants have stabilized around 0.25%.
Key Takeaways
- The ARR per FTE metric has replaced raw user growth as the primary filter for VC health and capital efficiency.
- Bridge rounds have become a statistical red flag, with the success rate for reaching Series A dropping from 33% to 8% in two years.
- AI is fundamentally decoupling headcount from revenue, allowing Series A companies to operate with 30% fewer people while doubling their income.
- A massive gap exists between solo founder formation and funding, as VCs still view co-founding teams as a proxy for talent attraction and risk mitigation.
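The ARR per FTE metric and the headcount comparison above reduce to simple arithmetic. The sketch below uses only the figures quoted in the summary (about $3 million in ARR, 15 employees now versus 22 before); the function names are illustrative, and the rounded inputs mean the results are approximations rather than reported data.

```python
# Figures quoted in the summary above. The 2021 ARR is not stated, so it is omitted.
SERIES_A_ARR_USD = 3_000_000   # "nearly $3 million in ARR"
HEADCOUNT_NOW = 15             # median Series A headcount today
HEADCOUNT_BEFORE = 22          # median headcount "a few years ago"


def arr_per_fte(arr_usd: float, headcount: int) -> float:
    """Annual recurring revenue divided by full-time employees."""
    if headcount <= 0:
        raise ValueError("headcount must be positive")
    return arr_usd / headcount


def headcount_reduction(before: int, after: int) -> float:
    """Fractional drop in headcount between two periods."""
    return (before - after) / before


if __name__ == "__main__":
    print(f"ARR per FTE: ${arr_per_fte(SERIES_A_ARR_USD, HEADCOUNT_NOW):,.0f}")
    print(f"Headcount reduction: {headcount_reduction(HEADCOUNT_BEFORE, HEADCOUNT_NOW):.0%}")
```

With these inputs the median company lands at roughly $200,000 of ARR per employee, and the headcount drop works out to about 32%, in line with the "30% fewer people" takeaway.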
Measuring the impact of AI on software engineering – with Laura Tacho
Laura Tacho, CTO at DX, breaks down how to move past AI hype by using data to measure engineering impact. While headlines often claim AI will replace developers or write 30% of all code, the reality is more nuanced. Most companies see AI as a tool to improve Developer Experience (DX) rather than just a way to pump out more lines of code. In fact, source code is often a liability, and generating it faster can increase risk and technical debt if not managed correctly. Data from over 180 companies shows that the biggest time savings do not come from mid-loop code generation, but from stack trace analysis and refactoring existing code. This suggests that AI is currently most effective at removing toil rather than replacing the creative act of programming. However, there is a satisfaction paradox: if AI only automates the parts of the job developers enjoy, they may end up feeling less satisfied as they are left with more administrative work and meetings. The discussion highlights case studies from Booking.com, which achieved 65% adoption through structured enablement, and WorkHuman, which saw an 11% boost in their DX index. For AI to be successful, organizations should focus on structured rollouts, baseline measurements, and architectural improvements like clean API interfaces and AI-first documentation. As the industry moves toward agentic workflows, the cost per developer may rise significantly, shifting the focus from simple licenses to consumption-based models where senior and junior engineers might require different resource allocations.
Key Takeaways
- AI is primarily a Developer Experience (DX) enhancer. Improving DX leads to better business outcomes, but treating AI as a pure output machine often leads to misleading metrics like acceptance rates that do not reflect production reality.
- The most valuable AI use cases are often non-obvious. Stack trace analysis and refactoring save more net time than code generation because they eliminate the cognitive load of debugging complex errors without requiring the same level of manual review as new code.
- Architectural hygiene is becoming a competitive advantage. Clean service boundaries and AI-first documentation focused on code examples rather than visual aids make it easier for both humans and agentic models to navigate a codebase.
- Structured rollouts outperform organic adoption. Highly regulated industries like finance and pharma are seeing better results because they are forced to be intentional about licensing, security, and cohort-based experimentation.
Amazon, Google and Vibe Coding with Steve Yegge
Steve Yegge breaks down the fundamental differences between Amazon and Google engineering cultures based on his long tenures at both tech giants. He reflects on his famous platform rant and explains how Jeff Bezos forced Amazon to become a platform company by mandating internal APIs to solve customer service bottlenecks. Google, despite its superior engineering on individual services like Chubby and Bigtable, never developed the same platform DNA. This cultural gap explains why Google still struggles with developer stories like Flutter versus React Native. The conversation shifts to the current AI revolution and the rise of vibe coding. Yegge describes this as a shift where the AI writes the code while the human maintains the flow and direction. He warns that while this makes developers 100 times more productive, it requires a foundation of absolute distrust. You cannot trust the LLM to get it right, so the job moves from writing code to auditing and guiding agents. He also highlights the massive costs involved, noting that professional-grade agentic workflows can cost thousands of dollars a week in tokens. This makes local inference the next major frontier for sustainable development. Regarding the job market, Yegge argues that while big companies might shed headcount, the total number of developer jobs will explode as software creation becomes commoditized. He points to 2026 as a potential end game where AI employees become a reality, urging engineers to adapt to this new version of the role immediately to stay relevant.
Key Takeaways
- Platform success is cultural rather than just technical. Amazon's API-first mandate was born from a need to unblock customer service teams, while Google's focus on individual product excellence created silos that hinder platform growth.
- Vibe coding is a high-stakes audit. It is not about being lazy. It requires senior-level judgment to catch subtle hallucinations and reward function hacking where the AI claims success without actually solving the problem.
- The economics of AI agents are currently unsustainable for individuals. The high cost of tokens means that local inference at the level of Claude Sonnet is the missing piece for the vibe coding revolution to go mainstream.
- The role of the junior developer is evolving into a mentor for non-technical contributors. Instead of writing basic functions, they will likely spend more time vetting PRs from product managers or designers who use AI to build their own tools.
What is a Principal Engineer at Amazon? With Steve Huynh
Amazon's engineering culture is built on high-scale technical challenges and a unique organizational structure. Steve Huynh, a veteran of over 17 years at the company, details the transition from a C++ monolith that hit a 4GB binary limit in the early 2000s to a massive service-oriented architecture. This shift was necessary for maintainability and growth but introduced significant latency trade-offs. Amazon famously discovered a linear correlation between page load speed and revenue, leading to a relentless focus on performance. Engineering at this scale involves managing brownouts where services remain reachable but return partial or bad results due to dependency chain failures. The Correction of Error (COE) process serves as a blameless post-mortem mechanism that immortalizes these technical lessons for the entire company. The Principal Engineer role (L7) represents a significant hurdle, often described as a two and a half level jump from senior engineer rather than a linear step. This high bar has historically caused a brain drain to competitors like Meta, yet it fosters an elite internal community. Unlike many tech firms, Amazon formally invests in this community through dedicated program managers, offsites, and the Principals of Amazon presentation series. Principals often report directly to VPs and operate with a freedom of responsibility where they are assigned directions rather than specific problems. They act as technical advisors across hundreds of developers, balancing deep technical work with high-level strategy. Cultural pillars like the writing culture are central to operations. The six-page memo format is used for everything from business strategy to system designs. This disciplined approach ensures all stakeholders are aligned through study hall sessions at the start of meetings. This documentation also feeds into a robust patent culture, where technical designs are easily handed to legal teams for defensive IP protection.
Core leadership principles like customer obsession and bias for action serve as axioms that guide decision-making, even when they require burning money to delight a customer.
Key Takeaways
- The jump to Principal at Amazon is intentionally non-linear, acting as a gatekeeper for an elite tier of engineers who function more like internal consultants than traditional individual contributors.
- Amazon's freedom of movement policy created an internal marketplace that forced managers to improve team culture or face 100% attrition, effectively using talent liquidity as a management quality control.
- The writing culture serves as a massive force multiplier for IP generation, allowing technical innovations to be converted into defensive patents with minimal friction between engineering and legal.
- Principled thinking at Amazon functions like mathematical axioms; by fixing certain values as unchangeable, the company reduces decision fatigue and maintains a consistent identity during rapid scaling.
How AI is changing software engineering at Shopify with Farhan Thawar
Shopify's approach to AI began with early access to GitHub Copilot in 2021, well before the ChatGPT surge. Head of Engineering Farhan Thawar describes a culture where AI is not just a tool but a fundamental shift in how work is evaluated. The company has moved beyond standard IDE extensions to deploying Cursor and building an internal LLM proxy. This proxy ensures data security for employee and customer information while allowing the company to track token usage and celebrate high-volume users. A core component of their infrastructure is the Model Context Protocol (MCP), which they use to create a context layer over internal data sources like The Vault (their wiki) and Salesforce. This allows AI agents to answer complex historical questions about product launches or board letters. Interestingly, AI adoption is growing fastest outside of R&D. Sales and finance teams are vibe coding their own tools, such as custom homepages that connect to Salesforce and Google Calendar via MCP servers without engineering intervention. To maintain high standards, Shopify requires coding interviews for all engineering leaders, including VPs, to ensure they can distinguish between high-quality code and AI-generated garbage. The company also recently completed a seven-month Code Red to eliminate technical debt, unique exceptions, and segfaults, proving that AI-first companies still prioritize core system reliability. To accelerate this cultural shift, Shopify is hiring 1,000 interns annually, viewing them as AI centaurs who naturally integrate LLMs into their workflows. They also use tools like Gumloop for web scraping and LibreChat for internal chat interfaces. Leadership emphasizes role modeling, with executives sharing their own prompts and workflows to encourage company-wide adoption.
Key Takeaways
- Shopify uses MCP to turn internal documentation and third-party SaaS data into an accessible context layer for agents, effectively creating an internal Perplexity.
- The company rejects penny-pinching on AI costs, arguing that even a $1,000 monthly spend per engineer is too cheap if it yields a 10% productivity boost.
- Vibe coding by non-technical staff is shifting the internal power dynamic, as PMs and sales teams build their own prototypes instead of waiting for engineering resources.
- The 1,000-intern hiring spree is a strategic move to learn from the interns and inject AI-native habits into the existing corporate culture.
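The internal LLM proxy described above sits between employees and the model provider and records token usage per user. The following is a minimal sketch of that idea, not Shopify's implementation: the class, the `echo_backend` stub, and the whitespace-based token count are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple


class UsageTrackingProxy:
    """Forwards prompts to a backend and tallies approximate token usage per user."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self._backend = backend
        self._tokens_by_user = Counter()

    @staticmethod
    def _count_tokens(text: str) -> int:
        # Crude stand-in: a real proxy would rely on the provider's reported usage.
        return len(text.split())

    def complete(self, user: str, prompt: str) -> str:
        """Send the prompt to the backend and charge the tokens to the user."""
        response = self._backend(prompt)
        self._tokens_by_user[user] += (
            self._count_tokens(prompt) + self._count_tokens(response)
        )
        return response

    def leaderboard(self, top_n: int = 3) -> List[Tuple[str, int]]:
        """Return the heaviest users first."""
        return self._tokens_by_user.most_common(top_n)


def echo_backend(prompt: str) -> str:
    """Hypothetical model stub that simply echoes the prompt."""
    return f"echo: {prompt}"


if __name__ == "__main__":
    proxy = UsageTrackingProxy(echo_backend)
    proxy.complete("ana", "hello world")
    proxy.complete("ben", "hi")
    print(proxy.leaderboard())
```

Routing every request through one object like this is what makes both goals in the summary possible in one place: the same choke point that counts tokens is where redaction or access checks for sensitive data could be added.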
Software engineering with LLMs in 2025: reality check
The current landscape of software engineering with LLMs shows a sharp divide between executive hype and developer reality. While leaders at Anthropic and Microsoft claim AI writes 30% to 90% of code, developers often encounter frustrating bugs or agent fumbles in complex repositories. However, a significant shift occurred in early 2025 with the rise of agentic workflows. Tools like Claude Code, Cursor, and Windsurf now use loops to execute compilers, run tests, and fix errors autonomously. This move from passive autocomplete to active agents has convinced even long-term skeptics like Armin Ronacher and Kent Beck that AI is a fundamental step change in how software is built. In Big Tech, Google and Amazon are taking different paths. Google maintains "internal AI islands" with custom tools like Cider and Critique, preparing for a future where code volume increases tenfold. Amazon is quietly becoming an MCP-first company. By converting their massive library of internal APIs into Model Context Protocol servers, Amazon allows agents to easily navigate their complex ecosystem. This strategy stems from Jeff Bezos' 2002 API mandate, giving them a massive head start in agentic infrastructure. Developers at Amazon are already using these servers to automate ticketing, wikis, and internal service calls. Data from a DX study of 38,000 developers shows that 50% now use AI tools weekly, saving an average of four hours per week. While this 10% productivity boost is meaningful, it falls short of the 10x claims often seen in media. The bottleneck is often organizational. Faster coding doesn't matter if code reviews, testing, and deployments remain slow. Experts like Martin Fowler suggest LLMs represent a new nature of abstraction, forcing engineers to work with non-deterministic tools rather than just higher-level syntax. The consensus among seasoned pros is that code has become "cheap," allowing for more experimentation and ambitious projects. 
Startups like incident.io are already leaning into this by building custom Claude Code projects that contain their specific architectural preferences and documentation.
Key Takeaways
- Agentic loops are the definitive shift. By allowing LLMs to run compilers and tests in a loop, the industry has moved past the toy phase of simple code completion into reliable, autonomous task execution.
- Amazon is leveraging its 2002 API mandate to lead in agentic infrastructure. By converting internal services into MCP servers, they have created a plug-and-play environment where AI agents can interact with thousands of internal tools seamlessly.
- Individual productivity gains of 10% are the current reality, not 10x. The bottleneck has shifted from writing code to the surrounding software pipeline like reviews and deployments, which aren't yet optimized for AI-generated volume.
- LLMs are a lateral move in abstraction. Rather than just another high-level language, LLMs introduce non-deterministic tools that make code cheap to produce, shifting the engineer's role toward high-level orchestration and intent.
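The agentic loop described above (run the compiler and tests, feed the error back, try again) can be sketched in a few lines. This is a toy under stated assumptions: `stub_model` stands in for an LLM call, `run_checks` stands in for a real compiler and test suite, and none of it reflects how Claude Code, Cursor, or Windsurf are actually implemented.

```python
from typing import Callable, Optional, Tuple

BUGGY_SOURCE = "def add(a, b):\n    return a - b\n"


def run_checks(source: str) -> Optional[str]:
    """Execute candidate source and one test; return an error message or None."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # stands in for "compile"
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"  # "run tests"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def stub_model(source: str, error: str) -> str:
    """Hypothetical model: a real agent would send the source and error to an LLM."""
    return source.replace("a - b", "a + b")


def agent_loop(
    source: str,
    check: Callable[[str], Optional[str]],
    propose_fix: Callable[[str, str], str],
    max_iterations: int = 5,
) -> Tuple[str, bool, int]:
    """Check, and on failure ask for a fix, until checks pass or the budget runs out."""
    for attempt in range(1, max_iterations + 1):
        error = check(source)
        if error is None:
            return source, True, attempt
        source = propose_fix(source, error)
    return source, check(source) is None, max_iterations


if __name__ == "__main__":
    fixed, ok, attempts = agent_loop(BUGGY_SOURCE, run_checks, stub_model)
    print(ok, attempts)
```

The iteration cap matters as much as the loop itself: without a budget, an agent that cannot fix the failure would keep spending tokens indefinitely.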
The present, past and future of GitHub with Thomas Dohmke
GitHub remains one of the largest Ruby on Rails monoliths in existence, managing over two million commits and tens of thousands of pull requests within its core application. While the company has diversified its stack to include Go for the Copilot API and .NET for GitHub Actions, the foundational architecture still prioritizes moving fast in a single codebase. This technical pragmatism dates back to its 2007 founding as a bootstrapped startup that had to optimize for cost and speed without venture capital for the first five years. A defining innovation was the invention of the pull request, which transformed the original email-based Git workflow into a collaborative, web-native experience. Today, the platform handles roughly 10 billion API requests per day, or 120,000 per second, reflecting its scale as the central hub for global software development. The company operates with a strict remote-first and async-first culture, famously avoiding internal email in favor of Slack and GitHub itself. Every internal announcement, HR policy change, or legal update is handled as a pull request against a repository called the hub. This dogfooding extends to their AI strategy. Thomas Dohmke describes GitHub as being refounded on Copilot, a shift triggered by the realization that GPT-3 and the subsequent Codex model could write high-quality code across multiple languages without mixing syntax. Since the Microsoft acquisition in 2018, GitHub has grown its annual recurring revenue from $200 million to over $2 billion, proving the viability of developer tools as a massive business driver. Looking forward, the focus is shifting toward agentic workflows where engineers direct AI agents to handle testing, documentation, and security fixes, allowing humans to manage higher levels of system complexity.
Key Takeaways
- The refounding on Copilot represents a fundamental shift where GitHub is moving from a human-to-human collaboration platform to a human-to-agent ecosystem.
- Hiring junior developers is a strategic advantage in the AI era because they lack legacy thinking and adapt to prompting and agentic workflows more naturally than senior counterparts.
- GitHub's 10x revenue growth post-acquisition validates the Microsoft strategy of maintaining brand independence and investing in the developer ecosystem rather than forcing immediate integration.
- The future of engineering is not about autonomous agents replacing humans but about engineers increasing their level of abstraction to manage more complex systems through agent orchestration.
Kent Beck - by Gergely Orosz - The Pragmatic Engineer
Kent Beck discusses the evolution of software engineering, focusing on how AI tools are changing the development process. He introduces the Genie metaphor for AI agents, highlighting their unpredictability and tendency to hallucinate or bypass constraints like tests. Beck argues that TDD is more relevant than ever in an AI-driven world because it provides the necessary guardrails to catch agent errors. He recounts the origins of Extreme Programming and the Agile Manifesto, expressing some regret over the term Agile due to its corporate dilution. The conversation shifts to his time at Facebook from 2011 to 2017, where he observed a culture of extreme ownership and rapid experimentation that often bypassed traditional TDD in favor of high observability and feature flags. He concludes by emphasizing that AI allows for bigger thoughts and more ambitious projects by handling mundane coding details, though it requires a shift toward accepting a higher volume of discarded experiments.
Key Takeaways
- AI agents require immutable annotations or tests they are not allowed to change to prevent them from hallucinating success by modifying the requirements.
- The Genie metaphor highlights that while AI grants wishes, it often does so in ways that ignore critical constraints or design taste, making human oversight of architecture essential.
- Early Facebook succeeded by replacing traditional pre-deployment testing with extreme observability and a culture where nothing is someone else's problem.
- AI tools are shifting the developer's role from syntax expert to high-level architect who manages complexity and sets strategic milestones.
- Organizations must get comfortable throwing away a much higher volume of completed experiments as the cost of generating code artifacts drops toward zero.
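One way to approximate the "tests the agent is not allowed to change" idea from the takeaways is to fingerprint the protected test files before an agent runs and verify them afterwards. The sketch below is an assumption about how such a guard could look, not a mechanism Beck describes; the function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(test_files: List[Path]) -> Dict[str, str]:
    """Record a SHA-256 digest for each protected test file."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in test_files}


def tampered(baseline: Dict[str, str]) -> List[str]:
    """Return protected files whose contents changed or that were deleted."""
    changed = []
    for name, digest in baseline.items():
        path = Path(name)
        if not path.exists():
            changed.append(name)
        elif hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            changed.append(name)
    return sorted(changed)
```

A workflow would call `snapshot` before handing control to the agent and reject the change set if `tampered` returns anything, so a passing test run cannot be the result of a weakened test.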
50 Years of Microsoft and Developer Tools with Scott Guthrie
Microsoft began in 1975 as a developer tools company with a BASIC interpreter for the Altair. This developer-first DNA fueled the success of Windows, as tools like QuickBasic and MFC made it easy to build apps for the OS. Visual Basic later revolutionized the industry by allowing non-technical users to build GUI applications through drag-and-drop interfaces and features like edit and continue. The launch of .NET in 2000 unified disparate languages and frameworks under a common runtime, which helped Microsoft dominate the server market. By 2014, Microsoft faced a relevance crisis. Scott Guthrie and Satya Nadella made three bold decisions: making Visual Studio free for individuals, open-sourcing .NET, and creating VS Code. These moves were risky but necessary to win back developers who had moved to Mac and Linux. VS Code specifically served as a bridge to the open-source community, eventually making the GitHub acquisition possible. Azure's growth followed a similar path of identifying underserved markets. After realizing Amazon dominated the consumer startup space, Microsoft pivoted Azure to be the cloud for modern business, focusing on hybrid needs. Today, the focus is on AI agents. Guthrie views AI as the next major productivity leap, comparable to the introduction of debuggers or garbage collection. He suggests that developers who embrace these tools will deliver more business value and see higher career impact, rather than being replaced by automation.
Key Takeaways
- Platform growth is a byproduct of developer success. Microsoft's history shows that making it easy to build apps on a platform is the only way to ensure the platform itself survives.
- Business models dictate technical strategy. Microsoft's early resistance to open source was a rational response to a license-based revenue model, whereas the shift to a consumption-based cloud model removed the friction for open-source adoption.
- Relevance requires bold cannibalization. Making Visual Studio free and open-sourcing .NET risked existing revenue but was the only way to prevent the melting iceberg of a shrinking developer base.
- AI agents shift the developer's role from syntax mastery to logic and problem-solving. Just as IntelliSense and debuggers were once mocked by purists, AI agents will become the standard for high-output engineering.
From Software Engineer to AI Engineer – with Janvi Kalra
Janvi Kalra shares her journey from a software engineer at Coda to an AI engineer at OpenAI, providing a roadmap for technical professionals looking to enter the AI space. She categorizes the current AI market into three distinct segments: product companies building on top of models (like Cursor or Hebbia), infrastructure companies providing tools for LLM usage (like Pinecone, Modal, or Braintrust), and model companies building the base intelligence (like OpenAI, Anthropic, or Google). Janvi emphasizes that the transition to AI engineering requires a strong foundation in deep learning basics, such as tokens, weights, and embeddings, which she mastered through self-study and participating in hackathons like Buildspace. At Coda, Janvi moved into AI by building side projects and internal prototypes, eventually leading the development of Coda Brain, an enterprise search tool. Her job search involved interviewing at 46 companies, where she applied a rigorous due diligence rubric normally used by investors. This rubric focuses on four pillars: high revenue growth, large market room, obsessed customers, and a clear competitive moat. She notes that engineers should treat their equity as an investment and request sensitive financial data like gross margins and GPU spend once an offer is on the table. Now at OpenAI, Janvi works on safety engineering, managing low-latency classifiers and monitoring model harms. She describes the OpenAI culture as a rare mix of startup speed and massive scale, where services handle 60,000 requests per second but engineers still maintain high agency with minimal red tape. The role of the software engineer is evolving into a more full-stack position that blurs the lines between PM, designer, and data scientist. Janvi argues that while AI makes code generation cheaper, the core skills of debugging, reading code, and designing high-level systems remain essential, especially when models fail in unique edge cases.
Key Takeaways
- Adopt a permissionless career strategy by building side projects and prototypes to prove expertise in new domains before a formal role is available.
- Evaluate startups like an investor by analyzing revenue growth, market size, customer obsession, and competitive moats rather than just headcount or hype.
- Expect an expansion and collapse cycle in AI engineering where you build complex guardrails for model limitations only to scrap them when the underlying model improves.
- The modern full-stack engineer is expected to absorb adjacent roles like PM and data engineering because AI tools lower the barrier to executing these tasks independently.
- OpenAI maintains high shipping velocity by granting engineers significant trust, such as allowing production deployments with only a single reviewer.
The AI Engineering Stack - by Gergely Orosz and Chip Huyen
AI engineering has emerged as a distinct discipline, primarily involving software engineers who integrate large language models into applications. While it evolved from machine learning engineering, it focuses less on training models from scratch and more on adapting foundation models through prompting and fine-tuning. The AI stack consists of three layers: application development, model development, and infrastructure. Application development is currently seeing the most innovation, focusing on prompt engineering, context construction, and user interfaces. Model development involves fine-tuning and inference optimization to reduce the high latency and costs associated with large models. The infrastructure layer remains relatively stable, handling resource management and model serving. A major shift in this field is the move toward a product-first workflow. Instead of starting with data collection and training, engineers can build a functional demo using existing APIs to test product promise before investing in deeper model work. This transition makes AI engineering feel closer to full-stack development, with increasing support for JavaScript and web-centric tools. Evaluation remains a significant challenge because foundation models produce open-ended outputs, making it difficult to establish ground truths compared to traditional close-ended ML tasks like spam detection. The distinction between prompt engineering and fine-tuning is central to the stack. Prompting adapts a model without changing its weights, making it ideal for quick experiments and low-data scenarios. Fine-tuning involves updating weights, which can improve quality and reduce costs but requires more data and compute. As models scale, inference optimization becomes a necessity to meet the 100ms latency expectations of modern web applications. This requires expertise in techniques like quantization and distillation to make large models run efficiently in production environments.
Key Takeaways
- AI engineering shifts the primary focus from model creation to model adaptation and evaluation.
- The product-first workflow lets developers validate ideas with off-the-shelf models before investing in custom data or training.
- Evaluation is the new bottleneck because open-ended outputs are much harder to benchmark than traditional classification tasks.
- AI engineering is converging with full-stack development as JavaScript support grows for LLM frameworks and interfaces.
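Quantization, mentioned above as an inference optimization, trades a little precision for smaller and faster weights. The sketch below shows one simple variant, symmetric per-tensor int8 quantization, in plain Python; production systems use more refined schemes, and the function names here are illustrative.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # The scale is the float value represented by one integer step.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    original = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(original)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(original, restored))
    print(q, f"max error {worst:.5f}")
```

Each weight now fits in one byte instead of four (for float32), and the reconstruction error per weight is bounded by half a quantization step.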
How Kubernetes is Built with Kat Cosgrove
Kubernetes originated from Google's internal tool Borg and was donated to the Cloud Native Computing Foundation (CNCF) nearly 11 years ago. It has since grown into the second largest open source project globally, following Linux. The project functions as an abstraction layer for managing and scaling containerized applications, automating tasks like resource allocation and networking that were previously manual. Kat Cosgrove, leader of the Kubernetes Release Team subproject, explains that the project's success is largely attributed to its rigorous documentation standards and highly structured governance model. Unlike many open source projects, Kubernetes enforces a strict policy where no user-facing feature is allowed into a release unless it is fully documented. This is managed through the Kubernetes Enhancement Proposal (KEP) process, which requires code completion, testing, and production readiness reviews. The organizational structure is a pyramid consisting of Special Interest Groups (SIGs), maintainers, and a rotating cast of contributors. A unique aspect of the project is its release team shadow program, which allows newcomers and students to apply for competitive spots to learn project management and release engineering. The release cycle typically spans 14 to 16 weeks and includes specific freezes for enhancements, code, and documentation. To prevent burnout, the project mandates that release leads take cycles off and prioritizes team well-being over strict deadlines. Regarding modern development trends, Cosgrove expresses skepticism toward generative AI for content creation or documentation, citing accuracy issues and the human-centric nature of people management, though she acknowledges its potential for reducing administrative toil like GitHub labeling.
Key Takeaways
- Documentation is a strategic moat. Kubernetes won the market not just through technical superiority but by mandating that every user-facing change must be documented before it can be included in a release.
- Sustainable open source governance requires active burnout prevention. The project uses a rotating leadership model and mandatory breaks for release leads to ensure the long-term health of its volunteer workforce.
- The contributor ladder is formalized through a shadow program. By allowing anyone to apply for release team spots, Kubernetes creates a high-value networking and career-building entry point that maintains a steady pipeline of new talent.
- Managed services are the recommended path for most companies. Unless a startup has the resources to hire dedicated SREs, rolling a custom cluster is considered a high-risk move compared to using GKE or EKS.
- Project management is the hidden engine of large-scale open source. Every maintainer essentially functions as a project manager, handling people, policy, and cross-team coordination rather than just writing code.
Building Windsurf with Varun Mohan - by Gergely Orosz
Varun Mohan, CEO of Windsurf, details the technical architecture and engineering philosophy behind their AI-powered IDE. The project evolved from GPU virtualization infrastructure into a tool designed for agentic coding workflows. A core technical challenge involves training custom LLMs to handle fill-in-the-middle capabilities. Standard chat models struggle with code because they are optimized for appending text, whereas coding requires inserting logic into the middle of existing functions or lines. This requires specialized tokenization to handle incomplete code states that are typically out of distribution for generic models. Performance is driven by a strict focus on latency, with a target of sub-100ms for initial code suggestions. Mohan explains that GPU memory bandwidth is often the bottleneck rather than raw compute, necessitating optimizations like speculative decoding and model parallelism to maintain speed without sacrificing quality. Their retrieval stack avoids the lossy nature of pure embedding-based search. Instead, it uses a fusion of embeddings, keyword search, and AST-based knowledge graphs to build a high-precision context for the AI agent. This allows the agent to understand complex dependencies across millions of lines of code. The team chose to fork Code OSS to leverage the existing extension ecosystem while maintaining the freedom to modify the core IDE experience. This architecture uses a shared language server binary to support other environments like JetBrains and Vim. Regarding the future of the profession, Mohan suggests that AI will not replace engineers but will instead increase the ROI of technology. As the cost of building software drops, companies will likely increase their ambitions, requiring developers to shift from writing boilerplate to high-level problem solving and architectural reasoning.
Key Takeaways
- Generic LLMs are insufficient for professional coding because they aren't trained for fill-in-the-middle logic or the specific tokenization patterns of incomplete code.
- Effective codebase retrieval requires a hybrid approach: combining vector embeddings with keyword search and AST-derived knowledge graphs to ensure agents don't miss critical dependencies.
- AI agents change the ROI of software development, encouraging companies to build more rather than hire fewer, as the ceiling for what one developer can produce rises significantly.
- The split brain strategy allows startups to ship today while building disruptive tech for tomorrow by balancing incremental features with high-risk R&D.
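The hybrid retrieval idea above (embeddings, keyword search, and an AST-derived graph feeding one context) can be illustrated with a small rank-fusion sketch. The source does not say how Windsurf combines its retrievers; reciprocal rank fusion is swapped in here as one common way to merge ranked lists, and the file names, the three result lists, and the constant `k=60` are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from three retrievers for the same query.
embedding_hits = ["auth/session.py", "auth/token.py", "db/models.py"]
keyword_hits = ["auth/token.py", "api/routes.py", "auth/session.py"]
graph_hits = ["auth/token.py", "db/models.py"]

fused = reciprocal_rank_fusion([embedding_hits, keyword_hits, graph_hits])
print(fused)
```

A file that every retriever surfaces outranks one that only a single retriever ranks first, which is the practical point of not relying on embeddings alone.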
How to work better with Product, as an Engineer with Ebi Atawodi
Ebi Atawodi and Gergely Orosz share lessons from their time at Uber, Netflix, and Google on how to build high-performing product and engineering teams. They argue that the most successful teams operate without rigid silos, where every member acts as a product leader. This means engineers should be deeply familiar with business metrics like gross bookings and conversion rates, while product managers should understand technical hurdles like on-call health and tech debt. One effective tactic they used was the State of the Union meeting, which provided the team with a clear view of their business impact and market trends, fostering a sense of collective ownership. They also utilized a business scorecard to track P0 metrics, making the group feel like a startup within a larger organization. Ebi highlights her onboarding framework of conversations, comprehension, and conviction to build trust quickly. A significant portion of the discussion focuses on secretly bootstrapping initiatives. By solving small but critical problems first, such as the UberPay API or web payment flows, teams can demonstrate value and make a compelling case for additional headcount. They discuss how a tiny team at Uber managed a billion-dollar run rate for cash payments by focusing on underlying latency issues rather than just surface-level features. Standout engineers are characterized as lifelong learners who are not afraid to code, hold strong convictions, and communicate complex ideas in simple terms. Ebi emphasizes that human connection is the bedrock of productivity. She suggests that knowing the person behind the role, including their personal milestones and life goals, makes professional disagreements much easier to navigate. This involves moving beyond small talk to genuine interest in a colleague's growth. Finally, she advises professionals to treat their careers like a project with regular check-ins and to seek out sponsors who will advocate for them when they are not in the room. This long-game approach prioritizes doing great work and building a reputation over simply checking boxes for a promotion.
Key Takeaways
- Blurring the lines between product and engineering roles creates a one team culture where everyone is responsible for business outcomes, not just their specific tasks.
- Bootstrapping high-impact prototypes is often more effective for securing resources than writing a formal business case without proof of concept.
- Building deep personal trust through human connection acts as a lubricant for team efficiency, allowing for direct feedback and faster conflict resolution.
- Career advancement is most effective when treated with the same tactical rigor as a product roadmap, focusing on sponsorship from those who know your work deeply.
Building Reddit’s iOS and Android app - by Gergely Orosz
Reddit underwent a massive mobile modernization starting in 2021 to address severe technical debt and scaling issues. At the time, the Android app had a 13-second startup time and CI builds could take over two hours. The engineering organization grew from 50 to over 200 mobile engineers, managing 2.5 million lines of code and nearly 600 screens. To handle this, they established dedicated platform teams for iOS and Android and introduced the Core Stack. This framework standardized a monorepo structure, GraphQL for all client-server communication, and MVVM architecture. On the UI side, Android bet early on Jetpack Compose while it was still in alpha, while iOS developed an internal framework called Slice Kit before eventually moving toward SwiftUI. The transition from REST to GraphQL was a major pillar, providing better type safety and contract enforcement, though it initially faced a latency tax that required optimization. While the team experimented with server-driven UI, they found it challenging for core features like the feed due to double-fetching issues and synchronization bugs. The modernization effort significantly improved performance, reducing startup times to under four seconds and increasing crash-free rates by 1.5% on Android. Beyond metrics, the shift improved developer sentiment by creating golden paths that allow feature teams to focus on creativity rather than infrastructure. The platform team operates as a service organization, prioritizing humble collaboration and dog-fooding their own tools to ensure they solve real developer pain points.
Key Takeaways
- Platform teams should act as service providers rather than gatekeepers. Their primary goal is removing boring infrastructure work so feature teams can focus on creative product delivery.
- Severe technical debt creates a unique window for high-risk bets. Reddit adopted Jetpack Compose during its alpha phase because the existing system was so broken that the potential upside of a modern framework outweighed the stability risks.
- Server-driven UI is not a silver bullet for mobile flexibility. If the UI definitions are not perfectly synced with the underlying data models, it often leads to double-fetching and a degraded user experience.
- Successful modernization requires a comprehensive plan that addresses the entire stack. Reddit did not just change the UI layer; they simultaneously moved to monorepos, transitioned to GraphQL, and enforced new testing standards to ensure long-term stability.
Dave Anderson - by Gergely Orosz - The Pragmatic Engineer
Amazon's engineering structure starts at Level 4 for entry-level hires and moves up to Level 10 for VPs and Distinguished Engineers. Level 6 is the most common senior role where many careers plateau. Promotions are driven by a heavy document culture, requiring multi-page narratives that prove an engineer is already operating at the next level. The interview process is famous for the Bar Raiser, an experienced interviewer from outside the immediate team who holds veto power to ensure every new hire improves the average. This role ensures that hiring managers do not lower standards just to fill a seat quickly. The performance management system uses an Unregretted Attrition (URA) target, often between 6% and 10%. Managers must identify the least effective members of their teams, even if those individuals are technically meeting their goals. This creates a high-pressure environment where being at the bottom of a high-performing group is a significant risk. On-call culture follows a strict ownership model where the team that writes the code supports it in production. This leads to high operational excellence because the pain of bad software falls directly on the creators, incentivizing them to fix root causes rather than applying temporary patches. Frugality is a core tenet, partly because Amazon's low-margin retail business and massive fulfillment center workforce limit the perks available to corporate employees. This results in a scrappy environment where engineers often have to justify equipment costs with data. However, this decentralized structure allows individual teams to act like small startups, choosing their own tech stacks and processes. Many engineers find that these skills, including high ownership, operational rigor, and product-mindedness, make them highly successful when moving to startups or pursuing financial independence through the company's stock-heavy compensation model.
Key Takeaways
- The Bar Raiser system acts as a decentralized quality control mechanism that prevents hiring managers from lowering standards under hiring pressure.
- Amazon's URA target shifts the focus from absolute performance to relative performance, making the least effective label a structural necessity regardless of team quality.
- The you build it, you run it model creates a feedback loop that prioritizes long-term stability over short-term feature velocity by making engineers responsible for their own production outages.
- Decentralization allows for pockets of horribleness or excellence, meaning the employee experience is almost entirely dependent on the specific manager and team rather than company-wide policies.
The Philosophy of Software Design – with John Ousterhout
John Ousterhout discusses how AI tools like LLMs are rapidly automating low-level coding tasks. This makes high-level software design more critical than ever. He defines software design as a decomposition problem. It involves breaking complex systems into smaller, independent units to manage complexity. A central concept is the deep module. This offers a simple interface while hiding significant internal complexity. This contrasts with shallow modules that have wide interfaces but do little work, which increases the cognitive load on users. Ousterhout critiques the tactical tornado personality type. These are programmers who prioritize speed over clean design. While management often rewards them, they leave behind technical debt that others must clean up. He argues for designing it twice. This is a practice where engineers force themselves to come up with a second, different approach to a problem. This often leads to superior APIs, as seen in his creation of the Tcl/Tk toolkit. The conversation also covers his disagreements with Clean Code principles. He specifically questions the push for extremely short methods and strict test-driven development. He suggests that over-decomposing code into tiny methods can actually increase complexity by creating too many interfaces. Instead, he advocates for general-purpose modules that solve multiple problems at once. Finally, he shares his experience teaching these concepts at Stanford. He uses an iterative feedback loop similar to an English writing class to help students internalize design principles through code reviews and rework.
Key Takeaways
- AI automation of syntax means developers will spend more time on architectural decisions. This makes software design the primary differentiator for engineering talent.
- The most effective way to fight complexity is through deep modules. They provide high leverage by offering massive functionality through a very narrow and simple interface.
- Investing just 1% to 2% of total project time into exploring a second design alternative can prevent months of technical debt. It often leads to much more intuitive APIs.
- Rigid adherence to industry fads like TDD or micro-methods can be counterproductive. They often encourage tactical, point-solution thinking instead of holistic system design.
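The deep-versus-shallow module distinction is easier to see in code. The sketch below is an illustration under assumptions, not an example taken from Ousterhout: `SettingsStore` and `ShallowSettings` are hypothetical classes, and a JSON file on disk is an arbitrary choice of hidden complexity.

```python
import json
import os
import tempfile


class SettingsStore:
    """A 'deep' module: two methods hide caching, serialization, and atomic writes."""

    def __init__(self, path):
        self._path = path
        self._cache = None  # loaded lazily on first access

    def get(self, key, default=None):
        return self._load().get(key, default)

    def put(self, key, value):
        data = self._load()
        data[key] = value
        self._write_atomically(data)

    # Everything below is hidden complexity that callers never see.
    def _load(self):
        if self._cache is None:
            if os.path.exists(self._path):
                with open(self._path, "r", encoding="utf-8") as f:
                    self._cache = json.load(f)
            else:
                self._cache = {}
        return self._cache

    def _write_atomically(self, data):
        # Write to a temp file in the same directory, then rename it over the
        # target so readers never observe a half-written file.
        directory = os.path.dirname(os.path.abspath(self._path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        os.replace(tmp_path, self._path)


class ShallowSettings:
    """A 'shallow' counterpart: the interface exposes as much as it hides."""

    def __init__(self):
        self.data = {}

    def get_data(self):
        return self.data  # callers must still handle storage themselves


store_path = os.path.join(tempfile.mkdtemp(), "settings.json")
store = SettingsStore(store_path)
store.put("theme", "dark")
reloaded = SettingsStore(store_path)  # a fresh instance reads from disk
print(reloaded.get("theme"))
```

The caller of `SettingsStore` learns two method names and nothing about files, whereas the shallow class adds an interface without removing any work from its users.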
Stacked diffs and tooling at Meta with Tomas Reimers
Tomas Reimers, co-founder of Graphite and former Meta engineer, discusses the unique developer ecosystem at Meta and why its custom tooling often outperforms industry standards. Meta's environment is defined by deep integration across its entire platform, where tools like Phabricator for code review, Sandcastle for CI, and OnDemand for dev boxes work together. This integration extends to internal task systems and translation engines, allowing developers to see experiment results, localization status, and rollout percentages directly within a code review interface. A core focus of the conversation is stacked diffs, a workflow where developers create a series of small, dependent changes rather than one large pull request. This approach prevents engineers from being blocked by the review process, as they can continue branching off their own unmerged code. Reimers explains that while Git and GitHub make this difficult due to complex rebasing, internal tools at Meta and Google automate the process. This leads to smaller, more reviewable changes, reduced merge conflicts, and faster velocity. The discussion also covers the industry shift toward monorepos. While open-source projects favor polyrepos to maintain independence, large companies like Meta and Google find that monorepos reduce coordination overhead and simplify dependency management. Reimers notes that even Meta struggled with polyrepos before moving toward a single unified repository for web, mobile, and internal tools. Finally, the interview explores how AI will transform software engineering. Reimers predicts a massive increase in code volume, which will shift the focus of code review from mechanical correctness to high-level intent and architectural alignment. He suggests that AI will eventually handle the minutiae of reviews, freeing senior engineers to focus on business logic and shared learning.
Key Takeaways
- Integration is the primary competitive advantage of Big Tech internal tooling, as it removes the context switching tax by surfacing deployment, testing, and business metrics in a single UI.
- Stacked diffs solve the reviewer frustration problem by breaking large features into small, logical increments that are easier to approve and less likely to cause massive merge conflicts.
- The move from polyrepos to monorepos is a strategic choice to trade off individual repo autonomy for organizational speed and consistent engineering culture.
- AI coding tools will create a review bottleneck because they allow developers to generate code faster than humans can currently vet it, necessitating AI-assisted review systems.
- Engineering metrics like time spent waiting in review are more actionable than simple PR counts because they highlight specific process bottlenecks in distributed teams.
Building Figma Slides with Noah Finer and Jonathan Kaufman
Figma Slides launched in beta in April 2024 and reached 4.5 million slide decks created within months. The engineering team, consisting of about 10 to 12 engineers, built the product in less than a year. The tech stack mirrors Figma Design and FigJam, utilizing a C++ codebase for the canvas renderer (Fullscreen) and TypeScript with React for the UI layers. A critical technical choice was implementing the Single Slide View as a viewport snap on an infinite canvas rather than a separate mode. This allowed the team to leverage existing multiplayer infrastructure and ensure interoperability between Figma products. The development process involved a hack-week-style prototyping phase, focusing first on the complex Grid View before building the more traditional single slide interface. To manage the complexity of multiplayer editing, the team used a minimum mutations strategy. Instead of updating every slide's coordinates during a reorder, they nested slides within slide row and slide grid nodes, using a parent index represented as a float. This approach minimizes data sent over the wire and reduces conflict resolution issues when users are offline. Figma's engineering culture emphasizes internal dogfooding and unique processes like Eng Crits held in FigJam. These 30-minute sessions involve 20 minutes of asynchronous feedback via sticky notes followed by a 10-minute discussion on the most contentious points. For quality assurance, the team runs their entire unit and interaction test suite twice: once with all feature flags off and once with all flags on. This prevents regressions caused by flag interactions, a common issue in large-scale SaaS environments. Debugging is handled through specialized tools like the Chrome DWARF debugging extension, which allows engineers to debug C++ WebAssembly and TypeScript simultaneously within the browser.
Key Takeaways
- Multiplayer efficiency relies on minimizing mutations by using relative positioning and nested node structures rather than absolute coordinates.
- Running the full test suite against both all flags off and all flags on states is a high-leverage practice for preventing complex regressions in feature-rich SaaS products.
- Interoperability between different product editors is maintained by sharing a core C++ WASM bundle while differentiating the UI via unique React trees and mouse behaviors.
- The Eng Crit model in FigJam prioritizes high-density asynchronous feedback over traditional document-based reviews, lowering the barrier for peer input.
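The minimum mutations idea (a float index so a reorder touches one node instead of every slide) can be sketched as follows. This is a simplified illustration rather than Figma's implementation: the slide names, the `position_between` and `move_slide` helpers, and the flat dictionary are hypothetical, and the nesting into slide rows and slide grids is left out.

```python
def position_between(before, after):
    """Return a float sort key that falls strictly between two neighbours.

    `None` means there is no neighbour on that side.
    """
    if before is None and after is None:
        return 1.0
    if before is None:
        return after - 1.0
    if after is None:
        return before + 1.0
    return (before + after) / 2.0


def move_slide(positions, slide_id, new_index):
    """Move one slide and return the single mutation that would be sent.

    `positions` maps slide ID -> float sort key and is updated in place.
    """
    others = sorted((p, s) for s, p in positions.items() if s != slide_id)
    before = others[new_index - 1][0] if new_index > 0 else None
    after = others[new_index][0] if new_index < len(others) else None
    positions[slide_id] = position_between(before, after)
    return {"slide": slide_id, "position": positions[slide_id]}


def ordered(positions):
    """List slide IDs in display order."""
    return [s for s, _ in sorted(positions.items(), key=lambda item: item[1])]


slides = {"intro": 1.0, "agenda": 2.0, "demo": 3.0, "qa": 4.0}
mutation = move_slide(slides, "qa", 1)  # move the last slide to second place
print(mutation)
print(ordered(slides))
```

Only the moved slide's key changes, so two users reordering different slides offline produce independent mutations. Repeated midpoint insertion eventually exhausts float precision, so a real system would need some rebalancing or a different key encoding.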
How Linux is built with Greg Kroah-Hartman
Linux powers nearly everything from 4 billion Android devices and web servers to 5G modems and the International Space Station. The kernel consists of roughly 40 million lines of code, though only about 5% is the core used by everyone. The rest provides hardware support through drivers. Mobile versions are significantly more complex than server versions, often requiring three times the code to manage power, clocks, and diverse buses. A server kernel is relatively simple, focusing on CPU, network, and storage, while a phone kernel must handle battery control and complex system-on-a-chip interactions. The development process follows a rigid nine-week time-based release cycle. A two-week merge window allows maintainers to submit new features, followed by seven weeks of strictly bug and regression fixes. This predictable cadence removes pressure on maintainers to accept unfinished features. The project operates through a trust-based hierarchy of roughly 800 maintainers and 4,000 annual contributors. There are no product or project managers. Instead, automation and a pyramid of maintainers handle the flow of patches. Most contributors are paid by their employers, such as Intel, IBM, or Google, because it is more cost-effective to improve Linux than to build a proprietary operating system. A core rule of development is never breaking user space. While the kernel is monolithic, meaning a bug in a driver can crash the system, this model allows for massive code refactoring and commonality across drivers. The kernel is currently integrating Rust to improve memory safety, particularly for drivers where object lifecycles are complex. While C remains the primary language, Rust helps eliminate common bugs like memory leaks and improper locking. The project continues to evolve based on new hardware needs rather than a centralized master plan, maintaining a philosophy of evolution over intelligent design.
Key Takeaways
- The trust model is the primary scaling mechanism where maintainers don't just review code but rely on the submitter's commitment to fix future regressions.
- Strict time-based releases prevent feature creep by ensuring that if a feature misses one window, the next opportunity is only two months away.
- Open source creates a selfish collaboration loop where companies contribute to solve their own hardware problems, but the generic solutions benefit the entire ecosystem.
- The absence of project managers is possible because planning happens at the company level before code ever reaches the kernel mailing lists.
- Integrating Rust is a strategic move to satisfy government mandates for memory-safe languages and to simplify complex driver development.
Developer Experience at Uber with Gautam Korlam
Gautam Korlam shares insights from his ten years at Uber, where he rose from an early Android engineer to a Principal Engineer. He details the creation of Uber's unique internal stack, including the Java monorepo and the Submit Queue. The Submit Queue was a novel system that used machine learning models to serialize commits and ensure the main branch stayed green despite thousands of daily contributions. This system speculatively tested changes in combination to prevent merge conflicts and broken builds. He also explains Local Developer Analytics (LDA), which tracked everything from IDE indexing times to developer funnels to identify bottlenecks. This tool allowed the platform team to see exactly where engineers were losing time during the coding process. Another major win was Dev Pods. These cloud development environments achieved six-second boot times by using pre-indexed containers and standardized home directories. This allowed engineers to switch contexts instantly without waiting for local builds or indexing. On the career side, Gautam emphasizes building social capital by helping others and running office hours. He views the Principal Engineer role as a partnership with management where technical depth meets business strategy. He notes that senior engineers often act as a peer to their managers by load balancing technical priorities. Looking forward, he discusses the impact of AI and a concept called vibe coding. This involves using AI agents to prototype rapidly based on desired outcomes rather than specific implementation details. He believes junior engineers will thrive with these tools because they can bridge knowledge gaps quickly. Senior engineers will shift focus toward system architecture and product taste. His new startup, Gitar, aims to automate the entire software development lifecycle using agents that handle maintenance and on-call tasks. He argues that as code generation becomes a commodity, the real differentiator for engineers will be their understanding of business value and end-user experience.
Key Takeaways
- Treating internal platform teams like product teams with strict SLOs and customer obsession is the only way to scale developer productivity effectively.
- Monorepos act as a forcing function for better interface design and prevent the long-term debt of fragmented library versions across hundreds of repositories.
- The future of engineering shifts from writing code to exercising taste and business judgment as AI agents take over the grunt work of maintenance and boilerplate.
- Successful cloud development environments require deep optimization of the local workflow, such as pre-indexing and low-latency compute, rather than just basic containerization.
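The core guarantee of a submit queue (every change is tested against what has already landed, so main stays green) can be shown with a toy model. This is a sketch under assumptions and not Uber's system: it tests candidates strictly one after another and omits the machine learning and speculative parallel builds described above. The change names and the `build_passes` check are hypothetical.

```python
def run_submit_queue(main, pending, build_passes):
    """Land changes one at a time, keeping `main` green.

    Each candidate is tested on top of everything already accepted, so a
    change that only breaks in combination with an earlier one is rejected.
    """
    landed, rejected = list(main), []
    for change in pending:
        candidate = landed + [change]
        if build_passes(candidate):
            landed = candidate
        else:
            rejected.append(change)
    return landed, rejected


# Hypothetical build check: "rename-api" and "call-old-api" each pass alone
# but break the build when both are present.
def build_passes(changes):
    return not {"rename-api", "call-old-api"} <= set(changes)


landed, rejected = run_submit_queue(
    main=["base"],
    pending=["rename-api", "fix-typo", "call-old-api"],
    build_passes=build_passes,
)
print(landed)
print(rejected)
```

Testing serially like this is too slow at thousands of commits per day, which is why a production queue would predict likely outcomes and build several combinations ahead of time.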
Design-first software engineering: Craft – with Balint Orosz
Balint Orosz, founder of the text editor Craft, discusses the unique engineering philosophy required to build high-fidelity, user-facing software. He highlights a common industry gap where engineering leadership is typically backend-oriented, often overlooking the technical complexity of fluid UI and UX. At Craft, this led to a radical design-first approach where the team prioritizes control over native framework convenience. For example, they avoid Apple's AutoLayout in favor of manual coordinate-based layouts to ensure animations remain perfectly smooth and performant. This level of control allows them to animate every element in the document without the 'jerkiness' common in standard apps. The technical architecture of Craft is equally unconventional. It utilizes a 99% shared codebase across iOS, Mac, iPad, and Vision Pro by leveraging Mac Catalyst. This allows a tiny team of only three to four engineers to maintain the entire native application suite. Organizationally, Craft operates without dedicated product managers, instead relying on small, domain-specific platform teams of four to five people. Balint argues that this structure minimizes communication overhead and allows engineers to maintain a holistic understanding of the codebase, which is often lost in larger feature-team models. Regarding data, Craft focuses on a local-first model. Because personal software deals with data volumes that fit on a single device, much of the compute can be offloaded from the cloud to the user's hardware. This shift not only improves privacy and offline access but also changes the cost structure of running a SaaS business. Balint also shares how AI is evolving from a boilerplate tool to a capability expander. He recently used OpenAI's o1 model to implement a complex shape-recognition algorithm for the Apple Pencil in hours, a task that would have previously required weeks of specialized mathematical research.
Key Takeaways
- Owning low-level primitives like manual layouts and custom toolbars is essential for building 'top 1%' user experiences that native OS frameworks often restrict.
- A shared codebase across platforms is viable for complex apps if you treat the UI as a canvas and build adaptive components rather than relying on platform-specific defaults.
- Small domain-specific teams of 4-5 people often outperform larger organizations because they maintain technical context and avoid the coordination tax of feature-based squads.
- AI is shifting from a coding assistant to a reasoning partner, enabling engineers to implement specialized algorithms in fields where they lack deep domain expertise, such as advanced geometry or shader code.
Trimodal Nature of Tech Compensation in the US, UK and India
Software engineering compensation follows a trimodal distribution across major markets like the US, UK, and India, rather than a simple bell curve. This model categorizes employers into three distinct tiers based on who they compete with for talent. Tier 1 consists of local companies and traditional industries where engineering is often a cost center. Tier 2 includes Big Tech and large public companies like Google, Meta, and Uber. Tier 3 represents the highest payers, typically elite hedge funds, quant firms, and top-tier late-stage scaleups like OpenAI or Databricks. In the US, senior software engineer medians range from $180,000 in Tier 1 to over $430,000 in Tier 3. A unique finding is that the US is the only market where a tier exists above Big Tech; in the UK and India, Big Tech usually represents the ceiling. US-headquartered companies significantly drive up compensation in international markets, often paying 2 to 4 times the local median. Total compensation (TC) includes base salary, cash bonuses, and equity. While Big Tech offers liquid equity that can be sold immediately, scaleups often offer higher paper TC that remains illiquid until an IPO or buyback. Hedge funds differ by offering no equity but providing massive cash bonuses that can reach 100% of base salary. The gap between these tiers widens significantly as engineers gain seniority. While Tier 3 roles offer the highest pay, they often come with higher stress and a Silicon Valley-style culture emphasizing high autonomy and longer hours.
Key Takeaways
- The US market is the only one where elite hedge funds and top scaleups consistently outpay Big Tech, creating a distinct third tier that doesn't exist as clearly in the UK or India.
- US-headquartered companies act as a massive catalyst for salary growth in international markets, frequently paying multiples of what local-HQ firms offer for the same seniority.
- Seniority acts as a multiplier for compensation divergence, with the gap between Tier 1 and Tier 3 becoming much more pronounced at the senior level than at entry-level.
- High total compensation at private scaleups carries significant liquidity risk, as the equity portion may remain inaccessible for years compared to the liquid stocks of Big Tech.
The man behind the Big Tech comics – with Manu Cornet
Manu Cornet, a software engineer who spent 14 years at Google before moving to Twitter, discusses the stories behind his viral tech comics. His most famous work, an illustration of big tech organizational structures, captured the essence of companies like Microsoft and Apple so accurately that Satya Nadella referenced it in his book. Cornet explains that Google's notorious naming confusion and product duplication stem from a bottom-up culture where engineers have significant power. This freedom allows for innovation but often results in competing efforts and a lack of focus on the end customer. The discussion contrasts Google's employee-friendly environment with Amazon's customer-obsessed but high-pressure culture. Cornet illustrates this with a comic showing Google pointing roses at employees and guns at customers, while Amazon does the opposite. While Google provides chill on-call rotations and SRE support, Amazon prioritizes customer support to an extreme degree, even for mid-sized B2B clients. Cornet also touches on the decline of Google's 20% time, noting that as the company matured, it became more siloed and traditional, moving away from the "not a traditional company" ideal set by its founders. Cornet's Google Graveyard comic proved prophetic for Stadia, which he predicted would fail due to a lack of network effects. He argues that even superior products fail if they cannot displace existing ecosystems like Steam or Facebook. Regarding Twitter, Cornet describes the pre-acquisition culture as a younger, less bureaucratic version of Google. However, the Elon Musk takeover brought massive, seemingly random layoffs that claimed even the highest-output developers. Cornet reflects on the death of ideals in tech, suggesting that the massive cash flows of the early search era allowed for a level of employee freedom that is increasingly rare in a more mature, capitalist-driven industry.
Key Takeaways
- Bottom-up engineering cultures naturally produce product duplication because there is little top-down pressure to consolidate competing ideas.
- Google and Amazon represent opposite ends of the employee vs. customer priority spectrum, with Google historically favoring employee comfort.
- Network effects are the ultimate wall for tech giants; even with billions in funding, products like Stadia and Google Plus failed because they couldn't displace existing user bases.
- The transition from an idealistic startup to a traditional corporation is often marked by increased siloing and the removal of organic innovation programs like 20% time.
Developer productivity with Dr. Nicole Forsgren (the creator of DORA)
Dr. Nicole Forsgren, creator of DORA and SPACE frameworks, breaks down the complexities of measuring and improving engineering performance. Traditional productivity definitions fail in software because output is not linear. While metrics like PR counts or diff frequency are common, they are often misleading when viewed in isolation. Senior engineers frequently show lower PR volume because their value lies in unblocking others, architectural design, and mentoring. A holistic approach requires a constellation of metrics across dimensions like satisfaction, activity, and efficiency to capture the full context of a developer's lived experience. The DORA metrics (deployment frequency, lead time, change failure rate, and recovery time) remain essential signals for delivery pipeline health. However, the SPACE framework expands this by including qualitative data like developer satisfaction and communication patterns. Developer experience is increasingly critical, as friction in systems like security compliance or poor documentation increases cognitive load and kills flow. One of the most telling indicators of a team's efficiency is onboarding time. High-performing teams use tactics like dummy pull requests in the first week to ensure new hires can navigate the entire toolchain immediately. AI and LLMs are shifting the engineering landscape from a writing exercise to a review and guidance exercise. This transition introduces new challenges, such as the need for trust in automated outputs and the potential for AI suggestions to interrupt flow states. Successful engineering leaders focus on reducing paper cuts and honoring the reality of their developers' daily workflows. Cultural transformation in tech rarely happens through top-down mandates. Instead, it occurs when leaders change the actual tools and processes developers use, which eventually shifts the broader organizational culture.
Key Takeaways
- Single metrics like PR counts are dangerous because they ignore the high-value invisible work of senior engineers like mentoring and architecture.
- Onboarding speed serves as a canary in the coal mine for engineering health, revealing hidden friction in documentation and toolchains.
- Cultural change in engineering is most effective when it starts with improving the daily lived experience and tools rather than abstract cultural values.
- The shift from writing code to reviewing AI-generated code requires developers to develop new mental models for maintaining flow and verifying trust.
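As a rough illustration of how the four DORA signals mentioned above could be derived from deployment records, here is a minimal Python sketch. The `Deployment` record shape, its field names, and the `dora_summary` helper are assumptions made for this example and are not part of any DORA tooling; the sample data is invented.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA signals over a window of `window_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_recovery_time": median(recoveries) if recoveries else None,
    }


# Invented sample: four deployments over a two-day window, one of which failed.
t0 = datetime(2024, 1, 1)
sample = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=4), failed=True,
               restored_at=t0 + timedelta(hours=5)),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=8)),
]
summary = dora_summary(sample, window_days=2)
```

In line with the episode's argument, numbers like these describe delivery pipeline health only and would be read alongside satisfaction and the other SPACE dimensions rather than in isolation.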
Live streaming at world-record scale with Ashutosh Agrawal
Ashutosh Agrawal, former software architect at JioCinema, details the engineering behind a world-record 32 million concurrent live streams during the Indian Premier League (IPL). The architecture relies on a complex pipeline starting from the stadium production control room (PCR), through contribution encoders, to cloud-based distribution encoders that generate over 500 stream variants across 13 languages. These streams use HLS and DASH protocols, breaking video into four to six second segments. A critical component is the orchestrator, which manages playback URLs and CDN endpoints to ensure users receive the correct format for their device and network. The system prioritizes stream smoothness over absolute low latency, typically maintaining a five to ten second buffer to prevent buffering icons. Capacity planning is a massive, year-long effort because cloud resources are finite at this scale. It involves working with providers to upgrade physical data centers, power, and network backbones in specific cities like Mumbai and Bangalore. Standard auto-scaling is rejected in favor of custom systems that scale based on concurrency metrics, as automated tools cannot react fast enough to sudden surges during match events like innings breaks or key wickets. Unique APAC challenges include a mobile-intensive audience. Engineers must account for users switching between 5G and 4G towers while moving, as well as battery preservation for viewers watching on older devices at the end of the day. To prepare, the team runs Game Day simulations where they flood the system with synthetic traffic and follow strict operational protocols without warning the engineering teams, ensuring the platform can handle the unpredictable nature of live production.
Key Takeaways
- Standard cloud auto-scaling is too slow for the massive, instantaneous spikes of live sports. Custom scaling engines tied to concurrency metrics are necessary to prep systems before the traffic actually hits.
- At world-record scale, the infinite cloud is a myth. Physical infrastructure constraints like regional fiber capacity and data center power require pessimistic planning and hardware procurement cycles that start a year in advance.
- Reliability in mobile-first markets requires optimizing for device-side constraints like battery life and tower-switching rather than just server-side throughput.
- High-stakes readiness depends on Game Day drills that simulate not just traffic, but the full operational chaos of a live event, including withholding information from teams to test their monitoring and response.
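The idea of scaling on concurrency rather than on lagging signals can be sketched as follows. This is a minimal illustration, not JioCinema's system: the function names, the per-server capacity, the 30% headroom, and the rule of never scaling down during an event are all assumptions chosen for the example.

```python
def target_servers(concurrent_viewers: int, viewers_per_server: int,
                   headroom_pct: int = 30, min_servers: int = 1) -> int:
    """Servers needed for the current concurrency plus a safety margin.

    Uses integer arithmetic so the ceiling is exact.
    """
    if viewers_per_server <= 0:
        raise ValueError("viewers_per_server must be positive")
    numerator = concurrent_viewers * (100 + headroom_pct)
    denominator = 100 * viewers_per_server
    needed = -(-numerator // denominator)  # integer ceiling division
    return max(min_servers, needed)


def scaling_plan(concurrency_samples: list[int], viewers_per_server: int,
                 headroom_pct: int = 30) -> list[int]:
    """Fleet size after each concurrency sample.

    Assumption for this sketch: capacity is held at its high-water mark
    during the event, so a brief dip never triggers a scale-down.
    """
    plan, current = [], 0
    for viewers in concurrency_samples:
        wanted = target_servers(viewers, viewers_per_server, headroom_pct)
        current = max(current, wanted)
        plan.append(current)
    return plan


# Invented samples: a surge from 1M to 5M viewers, then a dip to 2M.
plan = scaling_plan([1_000_000, 5_000_000, 2_000_000],
                    viewers_per_server=10_000, headroom_pct=30)
```

The point of the sketch is that the input is the concurrency figure itself, which moves at the moment of a key wicket or innings break, instead of CPU or memory utilization, which only moves after the traffic has already arrived.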
AI Engineering with Chip Huyen - by Gergely Orosz
AI engineering represents a shift from building models to building applications. In traditional machine learning, you needed massive datasets and specialized expertise to train models before reaching the product stage. Today, foundation models accessible via APIs allow engineers to start with a product demo and work backward. This change lowers the barrier to entry and places more emphasis on product design and engineering precision than on model training. Building these applications follows a specific progression. Most teams should start with prompt engineering and few-shot learning. If the model needs more context, the next step is Retrieval-Augmented Generation (RAG). A common mistake is jumping straight to complex vector databases when simple keyword search or better data chunking would provide better results. Fine-tuning is treated as a last resort because it introduces significant maintenance overhead and risks being quickly outdated by newer base models. Evaluation remains the biggest challenge in the field. As AI becomes more coherent, it gets harder for humans to spot subtle errors. Effective evaluation requires a mix of functional correctness, using AI as a judge, and comparative testing. However, human oversight is still necessary to ensure the AI is solving the right problem. For example, a meeting summarizer might be factually correct but useless if it misses the specific action items a user actually needs. The future of software engineering is about solving problems rather than just writing syntax. While AI can generate code snippets, it lacks the ability to understand complex business contexts and edge cases. Engineers will likely manage much larger and more complex systems by using AI to handle the repetitive parts of coding while they focus on system design and precision.
Key Takeaways
- The AI engineering workflow is product-first. You start with a demo using APIs and only invest in data collection or custom models once the product value is proven.
- RAG performance is driven by data quality and retrieval logic. Many teams over-engineer their vector database setup while ignoring simpler, more effective methods like keyword search or better metadata tagging.
- Evaluation must be grounded in user intent. A model can be technically accurate but fail if it doesn't prioritize the specific information, like action items or specific error codes, that the user is looking for.
- Fine-tuning is a maintenance trap. Because base models improve so rapidly, custom fine-tuned models often lose their competitive edge within months, making it a last resort strategy.
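To make the point about simple retrieval concrete, here is a minimal sketch of a keyword-based retrieval step feeding a prompt, with no vector database involved. The scoring rule (raw term counts), the helper names, and the sample chunks are assumptions for illustration; a production system would more likely use an established ranking function such as BM25.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query: str, chunk: str) -> int:
    """Count occurrences in the chunk of each distinct query term."""
    counts = Counter(tokenize(chunk))
    return sum(counts[term] for term in set(tokenize(query)))


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return up to k chunks with a non-zero score, best first."""
    scored = [(keyword_score(query, c), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return [c for _, c in scored[:k]]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Assemble a prompt that grounds the model in the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Invented knowledge-base chunks.
docs = [
    "A refund is processed within five business days.",
    "Our office is closed on public holidays.",
    "To request a refund, email billing with your order number.",
]
question = "How do I request a refund?"
top = retrieve(question, docs)
prompt = build_prompt(question, docs)
```

A baseline like this is cheap to build and easy to inspect, which makes it a useful reference point before investing in embeddings or fine-tuning.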
Building a best-selling game with a tiny team – with Jonas Tyroller
Jonas Tyroller, co-creator of the hit game Thronefall, shares the technical and strategic process behind building a million-selling title with just two people. The stack centers on Unity and C#, utilizing Blender for 3D modeling and the Unity Asset Store for commodities like shaders to save time. A core part of their workflow involves a rigorous prototyping phase where they build a new mini-game every day for months, eventually narrowing down hundreds of ideas to one that is both fun to build and commercially viable. Tyroller advocates for technical pragmatism, admitting that indie projects often involve spaghetti code and a lack of unit tests because speed and maintainability requirements differ from large-scale enterprise systems. He emphasizes that the goal is a functional, fun product rather than architectural perfection. Key technical hurdles included implementing complex pathfinding using the A* algorithm and balancing the snowball effect common in strategy games, where early leads become insurmountable. On the business side, Steam remains the dominant platform for indie revenue, often accounting for 90% of sales. Porting to consoles like the Nintendo Switch is typically outsourced to specialized firms in exchange for a revenue share, allowing the core team to stay focused on new development. Tyroller also highlights the role of AI, specifically using ChatGPT to generate skeleton code and explain unfamiliar areas like shader programming, which significantly accelerates the development cycle for a small team.
Key Takeaways
- Prioritize the payoff-to-effort ratio. Success in indie games comes from focused, high-quality experiences rather than sheer scale or technical complexity.
- Decouple gameplay from visuals during prototyping. Testing mechanics with programmer art allows for faster iteration before committing to expensive assets.
- Embrace good-enough engineering. At a small scale, pushing directly to the main branch and debugging with print statements can be more efficient than enterprise-grade CI/CD pipelines.
- Outsource console porting for revenue shares. This allows tiny teams to access the Nintendo Switch or PlayStation markets without the technical and bureaucratic overhead of doing it in-house.
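For readers unfamiliar with the A* algorithm mentioned above, here is a minimal sketch on a 4-connected grid. Thronefall itself is built in Unity and C#, and nothing here reflects its actual implementation; the grid encoding (1 for a wall), the unit step cost, and the Manhattan-distance heuristic are assumptions for the example.

```python
import heapq


def a_star(grid, start, goal):
    """Shortest path on a 4-connected grid, or None if the goal is unreachable.

    grid[r][c] == 1 marks a wall; every step costs 1.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance never overestimates on a 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # (estimated total, cost so far, cell)
    came_from = {}
    best_cost = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = [cell]
            while cell in came_from:  # walk back to the start
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry; a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    heapq.heappush(
                        open_heap,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None


# A wall forces the path around the right-hand side of the map.
level = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
route = a_star(level, (0, 0), (2, 0))
blocked = a_star([[0, 1], [1, 0]], (0, 0), (1, 1))
```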
Taking on Google search: Perplexity and Kagi
Perplexity and Kagi are emerging as significant alternatives to Google Search, which has increasingly prioritized advertising revenue over user experience. Perplexity, a VC-backed startup valued at approximately $9 billion, functions as an AI-powered answer engine. It uses a modern JavaScript stack with React and Next.js, and it has transitioned from being a wrapper for third-party models to developing its own search index and custom LLMs. Speed is a core differentiator for Perplexity, achieved by running multiple optimized models for every query. Culturally, the company emphasizes in-person work and high ownership, with a lean management structure where engineers possess strong product sense. Kagi takes a different path as a bootstrapped, user-funded search engine that is entirely ad-free. It focuses on privacy and high-quality results by crawling a non-commercial web. Technically, Kagi uses the Crystal programming language and a backend based on Flow-Based Programming (FBP), which allows for observable, concurrent systems through message-passing. Their indexing strategy involves skipping sites with excessive ads and trackers, which they find correlates with lower content quality. Kagi operates as a remote-first team, utilizing trunk-based development and a marble-chiseling iterative approach to product design. Both companies use Linear for project management and prioritize direct engineering ownership over heavy product management layers.
Key Takeaways
- Perplexity's strategy of starting as a wrapper for third-party APIs allowed them to find product-market fit rapidly before investing in proprietary search indexes and custom LLMs.
- Kagi's technical stack, specifically the Crystal language and Flow-Based Programming, provides a unique advantage in managing complex concurrent tasks with high observability for a small team.
- The two companies represent opposite ends of the growth spectrum: Perplexity is scaling aggressively with VC backing toward an Action Engine, while Kagi remains user-funded and focused on meticulous search quality.
- Both startups demonstrate that lean engineering teams with high product autonomy can effectively challenge incumbents by focusing on specific user pain points like ad-clutter and slow response times.
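The message-passing style behind Flow-Based Programming can be sketched in a few lines. Kagi's backend is written in Crystal, so the Python below only illustrates the general pattern and is not their code: independent components connected by bounded queues, each reading messages from an inbox and writing results to an outbox. The `component` and `run_pipeline` names and the queue size are assumptions for this example.

```python
import queue
import threading

DONE = object()  # sentinel that marks the end of the stream


def component(func, inbox, outbox):
    """Start a thread that applies `func` to each inbox message and forwards the result."""
    def run():
        while True:
            msg = inbox.get()
            if msg is DONE:
                outbox.put(DONE)  # propagate shutdown downstream
                return
            outbox.put(func(msg))

    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_pipeline(items, stages):
    """Wire the stages together with bounded queues and collect the output in order."""
    queues = [queue.Queue(maxsize=8) for _ in range(len(stages) + 1)]

    def feed():
        for item in items:
            queues[0].put(item)  # blocks when the first stage falls behind
        queues[0].put(DONE)

    threads = [threading.Thread(target=feed)]
    threads[0].start()
    threads += [component(f, queues[i], queues[i + 1]) for i, f in enumerate(stages)]

    results = []
    while (msg := queues[-1].get()) is not DONE:
        results.append(msg)
    for thread in threads:
        thread.join()
    return results


cleaned = run_pipeline(["  Rust ", "CRYSTAL"], [str.strip, str.lower])
numbers = run_pipeline(range(100), [lambda x: x + 1, lambda x: x * 2])
```

Because every hand-off goes through a queue, each connection is a natural place to attach counters or logging, which is one way to read the observability benefit described above. The sketch omits error handling: a stage that raises would stall the pipeline.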
Observability: the present and future, with Charity Majors
Charity Majors, co-founder of Honeycomb, outlines the transition from legacy monitoring to Observability 2.0. The traditional three pillars model of metrics, logs, and traces is increasingly viewed as inefficient because it forces engineers to correlate data across disconnected silos. This creates a cost multiplier where every request is stored multiple times in different formats, leading to dead ends during debugging. Observability 2.0 moves toward unified storage using wide structured events and columnar databases. This architecture allows for high cardinality data, such as specific user IDs or social security numbers, without the massive price hikes typical of legacy vendors like Datadog. This shift transforms observability from a reactive operations tool into a proactive development feedback loop. By using tools like feature flags and progressive deployments, engineers can observe their code's impact in production in real time. Majors also discusses the role of OpenTelemetry in breaking vendor lock-in. It standardizes telemetry collection and allows companies to switch backends more easily, forcing vendors to compete on value rather than proprietary lock-in. For leadership, observability serves as a translation layer. It helps engineering teams explain their work in the language of business and money, which is often missing in the CTO role. In the context of AI, she argues that observability is essential for managing software of unknown origin, which includes AI-generated code. Monitoring LLMs is fundamentally a tracing and high-cardinality problem rather than a standalone AI problem. The goal is to capture enough rich telemetry all the time so that engineers can understand the intersection of code, systems, and users without needing to pre-define every question in advance.
Key Takeaways
- High cardinality data is the most identifying and useful for debugging. Legacy metrics tools often make it prohibitively expensive, but modern columnar storage solves this by allowing for wide logs with deep context.
- The most effective engineering teams use Service Level Objectives as their primary entry point for understanding system health. This provides a clear boundary for autonomy and prevents management from meddling if targets are met.
- Observability should be integrated during the development phase rather than being added in production. It functions similarly to automated testing to accelerate the time to insight and value.
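A wide structured event can be illustrated with a small sketch: one record per request that carries every field of interest, including high-cardinality ones such as a user ID, so that questions can be asked afterwards without pre-defining them. This is not Honeycomb's API; the `handle_request` and `count_by` helpers, the field names, and the sample requests are assumptions for the example, and a real system would send the event to a columnar store instead of keeping it in a list.

```python
import time


def handle_request(user_id, endpoint, handler):
    """Run `handler` and return one wide event describing the whole request."""
    event = {"user_id": user_id, "endpoint": endpoint}
    start = time.perf_counter()
    try:
        # The handler may return extra context to attach to the event.
        event.update(handler() or {})
        event["status"] = "ok"
    except Exception as exc:  # recorded on the event; a real service would also respond or re-raise
        event["status"] = "error"
        event["error_type"] = type(exc).__name__
    event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
    return event


def count_by(events, field, **filters):
    """Group events by any field, optionally filtering on other fields."""
    counts = {}
    for event in events:
        if all(event.get(k) == v for k, v in filters.items()):
            key = event.get(field)
            counts[key] = counts.get(key, 0) + 1
    return counts


def failing_handler():
    raise TimeoutError("database did not respond")


# Invented traffic: one healthy request and two failures from the same user.
events = [
    handle_request("u1", "/cart", lambda: {"cart_items": 3}),
    handle_request("u2", "/cart", failing_handler),
    handle_request("u2", "/pay", failing_handler),
]
errors_by_user = count_by(events, "user_id", status="error")
errors_by_endpoint = count_by(events, "endpoint", status="error")
```

The same events answer both "which users are failing" and "which endpoints are failing" because all the context travels together, instead of being split across separate metrics, logs, and traces.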
“The Coding Machine” at Meta with Michael Novati
Michael Novati discusses his eight-year career at Meta, where he moved from intern to E7 level in just six years. He is known for being the original Coding Machine, an engineering archetype Meta created to recognize individual contributors who drive massive impact through high-volume code output and rapid refactoring. Unlike the Fixer archetype, who might solve a single ten-million-dollar problem with one line of code, a Coding Machine accelerates entire teams by unblocking projects and cleaning up legacy systems at scale. Novati explains that Meta uses these archetypes to ensure fairness in promotions by comparing candidates against established patterns of success. The conversation covers Meta's unique internal tools culture. Because the company built its own infrastructure from scratch, it also built custom tools for everything from code reviews to meeting room bookings. These tools were treated as internal products, allowing for rapid iteration and a culture where even interns could ship major UI changes in their first week. Novati also shares his experiences with Zuck Reviews, which were short, high-pressure product meetings with Mark Zuckerberg focused on detail and trust. Regarding hiring, Novati describes the Meta hiring committee process. Recruiters present a packet to a group of directors and VPs. This packet includes the candidate's interview performance and the historical data of the interviewers themselves, such as whether they tend to be hard or easy graders. This data helps the committee calibrate feedback and maintain a consistent hiring bar. Now the co-founder of Formation, Novati applies these lessons to help engineers master stack-agnostic problem solving for technical interviews at top-tier tech companies.
Key Takeaways
- Archetypes enable diverse career paths. By defining roles like the Coding Machine or Fixer, Meta allows high-level individual contributors to progress based on their specific strengths rather than forcing them into a one-size-fits-all management track.
- Impact is the ultimate metric. Seniority at Meta is not about years of experience but about the scale of influence. A Coding Machine is valued because their output replaces the need for much larger teams or removes massive technical debt that would otherwise slow the entire organization.
- Data-driven hiring reduces bias. Meta's use of interviewer histograms to calibrate feedback shows a sophisticated approach to hiring. By knowing whether an interviewer's past verdicts are historically binary or lean yes, the committee can adjust for individual subjectivity.
- Internal tools as a competitive advantage. Treating internal infrastructure as a product allowed Meta to maintain a move-fast culture. This ownership meant tools were closely aligned with their custom hardware and specific developer workflows.
Confessions of a Big Tech recruiter – with Blake Stockman
Blake Stockman shares insights from 12 years of recruiting at Google, Meta, Uber, and Flexport. The conversation focuses on the mechanics of tech hiring, the relationship between recruiters and managers, and strategies for candidates. A central theme is the importance of the recruiter-manager partnership. Stockman argues that the most effective hiring happens when recruiters are deeply embedded in teams rather than acting as a transactional service. This involves a calibration phase where managers provide specific examples of ideal hires to align the search. Stockman also details the differences in decision-making structures, noting that while Google and Meta use decentralized committees, companies like Uber keep decisions closer to the hiring manager to maintain speed. On the candidate side, the discussion covers negotiation and equity. Stockman advises candidates to avoid being the first to state a salary number. Instead, they should ask the company to provide an offer that reflects the value they bring. He notes that almost every offer has room for negotiation, especially when a candidate has multiple offers to leverage. For startup roles, he emphasizes that equity should be treated as a professional investment. Candidates should ask about strike prices, valuations, and financial projections. If a recruiter cannot provide these details, candidates should request to speak with the CEO or finance team. Finally, the interview touches on the risks of desperate hiring. Stockman warns that hiring simply to fill a seat or to avoid losing a headcount budget often leads to poor long-term outcomes and cultural friction.
Key Takeaways
- Calibration is the most critical part of the hiring process. Managers should spend time upfront showing recruiters specific profiles of people they would hire immediately to ensure the top of the funnel is high quality.
- Recruiters can act as strategic allies in borderline hiring decisions. A recruiter who has built trust with a candidate can provide context on their potential and cultural fit that a standard coding interview might miss.
- Startup equity requires the same due diligence as a professional investment. Candidates should not hesitate to ask for financial transparency regarding valuations and future funding rounds before signing.
- Desperate hiring is a major trap for early-stage companies. Hiring to meet a deadline or save a budget often results in bad hires that are far more costly than leaving the seat empty.
How AI-assisted coding will change software engineering: hard truths
AI-assisted coding has reached a tipping point with 75% of developers using these tools, yet the industry faces a 70% problem. While tools like Bolt, v0, and Cursor allow users to reach a functional prototype quickly, the final 30% required for production readiness remains a significant barrier. This gap exists because AI often generates house-of-cards code that looks complete but lacks the architectural integrity to handle real-world edge cases. Two distinct usage patterns have emerged: bootstrappers who use AI to go from zero to MVP, and iterators who use it for daily refactoring and testing. A critical finding is the knowledge paradox, where AI tools actually provide more leverage to senior engineers than to beginners. Seniors use their mental models to verify, constrain, and refactor AI output, whereas juniors often accept incorrect suggestions that lead to technical debt. The next phase of innovation is agentic software engineering. Unlike simple autocomplete, agentic tools like Devin or Claude Code can plan, execute, and iterate autonomously. This shifts the developer's role toward system design and precise requirement specification in natural language. Despite these advances, software engineering remains a broader discipline than just writing code. Historical data from Fred Brooks suggests coding accounts for only about 15-20% of the total effort. The remaining 80-85% involves planning, testing, production readiness, and maintenance. True software quality is defined by polish and empathy for the user experience, which AI cannot yet replicate. As AI lowers the barrier to creating code, the demand for experienced engineers who can manage the resulting complexity and ensure security is likely to increase. The future of the field relies on treating software as a craft where AI handles the routine boilerplate, allowing humans to focus on high-level architecture and user delight.
Key Takeaways
- The knowledge paradox means AI acts as a force multiplier for seniors who can aim the tool, while it often creates a dependency trap for juniors who lack debugging mental models.
- Agentic workflows represent a shift from AI as a responder to AI as a collaborator that can proactively identify issues and validate its own fixes.
- Coding speed was never the primary bottleneck for high-quality software: the real challenges remain requirement clarity, edge-case handling, and long-term system maintenance.
- The rise of English-first development environments makes precise communication and architectural thinking more valuable than syntax knowledge.
Wrapped: The Pragmatic Engineer in 2024 - by Gergely Orosz
The Pragmatic Engineer newsletter reached over 866,000 readers in 2024, adding 300,000 in a single year. The primary macro trend defining the period was the end of the Zero Interest Rate Policy (ZIRP) era. This shift moved the industry focus toward efficiency, reduced venture capital allocation, and created a tighter job market for software engineers. Despite the funding slowdown, Generative AI (GenAI) created a massive counter-trend, with OpenAI and Databricks raising record-breaking rounds. In software development, GenAI adoption is nearly universal, with over 75% of engineers using these tools. While GitHub Copilot and ChatGPT remain dominant, new IDEs like Cursor, Windsurf, and Zed are gaining significant momentum. The rise of AI coding agents, such as Devin from Cognition AI, represents the next wave of innovation. This shift is fundamentally altering the hiring landscape. Junior roles are becoming harder to find because AI tools often perform at the level of an intern. Consequently, companies are prioritizing senior engineers who can leverage these tools for higher output. The newsletter expanded its research capabilities by adding Elin Nilsson as a Tech Industry Researcher. This led to deep dives on engineering cultures at companies like Stripe, Bluesky, and Anthropic, as well as investigations into Gen Z developer preferences and bug management. A new podcast was also launched, featuring industry veterans like Grady Booch and Simon Willison. These conversations highlighted that while tools change, fundamentals like communication and project estimation remain the biggest challenges in large teams. Gergely Orosz's self-published work, The Software Engineer's Guidebook, sold over 33,000 copies in its first year. The data shows a strong preference for print (87%) over e-books. Interestingly, 40% of e-book sales occurred directly through a personal store rather than Amazon, despite Amazon's market dominance. 
The book is now being translated into multiple languages, including German, Korean, and Japanese, to address global demand for structured career development resources in tech.
Key Takeaways
- The end of ZIRP has permanently replaced the growth-at-all-costs mindset with one of operational efficiency, making senior-level expertise more valuable than ever.
- AI is not just a tool but a filter for hiring, as entry-level engineers now face a higher barrier to entry because AI replaces traditional junior tasks.
- The success of The Software Engineer's Guidebook proves that high-quality, niche technical content can thrive through self-publishing, bypassing traditional gatekeepers who may want to simplify complex topics.
- New IDEs like Cursor are successfully challenging incumbents like Microsoft by integrating AI more natively into the developer workflow.
Shipping projects at Big Tech with Sean Goedecke
Sean Goedecke defines shipping as a socially constructed fact rather than a technical one. A project is officially shipped when the management chain acknowledges it and is satisfied. This often means aligning with business goals that might not be obvious to junior engineers, such as regulatory compliance or strategic market positioning. Projects in large organizations like GitHub or Zendesk are complex, involving 5 to 25 teams. They naturally tend toward failure unless a single engineer acts as the Directly Responsible Individual (DRI) and maintains the entire technical context in their head. While soft skills are necessary for navigating politics, technical strength remains a superpower. It allows engineers to create demos that cut through weeks of debate and gives them the authority to challenge roadblocks in unfamiliar codebases. GenAI tools like GitHub Copilot accelerate this by handling the volume of work and helping engineers contribute to repos they do not usually touch. For remote teams, a follow-the-sun model between US and APAC squads can double delivery speed for bug fixes and investigations, provided the culture is async-first and relies on written artifacts.
Key Takeaways
- Shipping is defined by management perception rather than just code deployment or closing tickets.
- Technical depth is a political tool that allows engineers to bypass bureaucracy by creating demos that prove what is possible.
- The DRI model is essential because large projects default to failure without a single person holding the full technical context.
- Follow-the-sun workflows turn time zone differences into a competitive speed advantage for bug fixes and investigations.
- GenAI increases engineering ambition by lowering the barrier to entry for unfamiliar languages and complex codebases.
Notion: going native on iOS and Android - by Gergely Orosz
Notion transitioned its mobile apps from a WebView-based architecture to native components to solve critical performance issues, specifically app start time. Originally built using Cordova and later a React Native wrapper, the app faced a performance ceiling because it had to boot a full JavaScript bundle before rendering. To compete with native tools like Apple Notes, the team moved to a native stack using SwiftUI for iOS and Jetpack Compose for Android. Despite having over 100 million users, the mobile engineering team remains small with only 11 engineers. They adopted an incremental rewrite strategy rather than a total overhaul, starting with the Home Tab as the most impactful entry point. This project took nine months and required building a native service layer for networking and data persistence. The editor remains largely web-based due to its extreme complexity, including recursive data models and embedded collections, though the team is slowly porting rich text blocks to native. Notion's engineering culture relies heavily on an RFC process where engineers drive product changes and share documents company-wide for feedback. They maintain high velocity through weekly releases, nightly public betas, and a hybrid update model where the web-based parts of the app can update independently of the native shell. Performance is treated as a top-tier priority, with weekly reviews of top-line metrics and dedicated sessions to investigate regressions. The data model uses a transaction-based system written to SQLite to support a reactive, real-time experience across devices.
Key Takeaways
- Incrementalism prevents rewrite failure. By focusing only on the Home Tab first, the team delivered measurable performance gains in nine months without the multi-year risk of a total 'big bang' rewrite.
- Native is required to break the performance ceiling. WebView-based shells, such as the Cordova and React Native wrappers Notion used, introduce a boot-up lag because a full JavaScript bundle must load before rendering, which makes it very hard to match the speed of built-in tools like Apple Notes or Google Keep.
- Small, senior-heavy teams can drive massive scale. Notion manages tens of millions of mobile users with just 11 engineers by acting as internal consultants for web teams and focusing on high-leverage platform infrastructure.
- Local-first architecture is a competitive moat. Building a transaction-based syncing layer with SQLite allows for the real-time reactivity and offline-capable feel that users expect from a modern productivity tool.
- Early adoption of modern frameworks pays off. Choosing SwiftUI and Jetpack Compose before they were fully mature allowed the team to avoid a future migration and align with the long-term direction of the iOS and Android ecosystems.
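The idea of a transaction-based data model written to SQLite can be sketched as follows. Notion's actual schema and operation format are not described in the source, so the `blocks` table, the `("set" | "delete", id, text)` operation tuples, and the helper names are all hypothetical; the sketch only shows a batch of operations landing atomically.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a local store with a single hypothetical `blocks` table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS blocks (id TEXT PRIMARY KEY, text TEXT NOT NULL)"
    )
    return db


def apply_transaction(db, operations):
    """Apply a batch of operations atomically: every one lands, or none does."""
    with db:  # commits on success, rolls back if an exception escapes
        for op, block_id, text in operations:
            if op == "set":
                db.execute(
                    "INSERT OR REPLACE INTO blocks (id, text) VALUES (?, ?)",
                    (block_id, text),
                )
            elif op == "delete":
                db.execute("DELETE FROM blocks WHERE id = ?", (block_id,))
            else:
                raise ValueError(f"unknown operation: {op}")


db = open_store()
apply_transaction(db, [("set", "a", "Hello"), ("set", "b", "World")])

try:
    # The second operation is invalid, so the write to "a" is rolled back too.
    apply_transaction(db, [("set", "a", "Changed"), ("bogus", "x", "")])
except ValueError:
    pass

rows = dict(db.execute("SELECT id, text FROM blocks ORDER BY id"))
```

Grouping edits into all-or-nothing batches means the local database never exposes a half-applied change to the UI, which is one property a reactive, syncing client depends on.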
Software architecture with Grady Booch - by Gergely Orosz
Software engineering history is a continuous climb through rising levels of abstraction. In the early days, machines were more expensive than humans, forcing developers to optimize every instruction. This led to the first golden age of algorithmic decomposition. As systems became distributed, the focus shifted to the second golden age of object-oriented analysis and design. Grady Booch pioneered this shift with the Booch method, which eventually merged with other methodologies to create UML. Legacy systems are a permanent fixture of the industry because code never dies until it is actively killed. Organizations like the IRS still run assembly code from the 1960s on emulators because that code contains essential business rules. Today, the role of the software architect has evolved. Instead of focusing on low-level software design, architects now make systemic and economic decisions about cloud services, messaging platforms, and frameworks. Architecture is defined by significant design decisions where the primary metric is the cost of change. The current excitement around large language models (LLMs) requires a cautious perspective. While LLMs are powerful, they are essentially stochastic parrots and unreliable narrators. They lack a theory of mind and the embodiment necessary for true intelligence. Scaling alone will likely hit diminishing returns. Future progress probably lies in neuro-symbolic architectures that combine neural networks with symbolic reasoning. This approach mirrors the complexity of the human brain, which uses entangled architectures and hormonal systems rather than simple layers. Current software engineers should not fear AI tools like Copilot. The tools have changed, but the need for informed decision-making remains. Success in the modern market requires finding niche domains and maintaining a broad understanding of computational thinking. 
Grady is currently documenting as-built architectures for major systems like Photoshop and Wikipedia to provide a practical handbook for the next generation of architects.
Key Takeaways
- Architecture is defined by the cost of change. If a decision is expensive to reverse, it is an architectural one, regardless of whether it involves code or cloud providers.
- The modern architect is a systems economist. They no longer just design code structures but instead choose between high-level service providers and frameworks based on long-term costs.
- LLMs are sensory-sparse and lack embodiment. True AGI likely requires systems that can interact with and respond to the physical world, similar to the neuro-symbolic models used in space exploration.
- Software entropy is inevitable without a strong hand. Systems like the Linux kernel maintain integrity because a chief decider prevents the natural drift that occurs when original design rationale is lost.
Move fast with little process: Linear (with Engineering Manager Sabin Roman)
Linear's engineering philosophy centers on high quality and minimal overhead. Sabin Roman, the company's first engineering manager, explains how the team of 60 people (25 engineers) operates without internal email. They rely on Slack for urgent pings and Linear itself for structured, asynchronous work. This setup forces engineers to be more intentional and autonomous. A core pillar of their culture is the zero bug policy. While the official SLA is one week, the team aims to fix bugs within 24 hours. Every morning starts with clearing issues before moving to new product work. This is supported by a goalie rotation where two engineers per week handle all customer support and triage, keeping them directly connected to user pain points. Linear hires for product sense rather than just technical skill. Their process includes a paid, remote work trial week where candidates build greenfield projects. This ensures engineers can make independent judgment calls on UX and features without needing a middleman. Unlike the heavy process at Uber, where Sabin and host Gergely Orosz previously worked, Linear avoids rigid PRDs and complex promotion rubrics. Instead, they use a fluid "talk, do, show" iteration cycle. Teams are formed for specific projects and dissolved afterward. There are no titles or levels: everyone is simply an engineer. This lack of hierarchy keeps the focus on the product rather than career ladder climbing. Sabin notes that while this requires more effort from managers to maintain trust and connection in a remote setting, it results in a more creative and focused environment. Success for a manager at Linear is measured by how much they have improved the product, not by the size of their team or the complexity of their reporting.
Key Takeaways
- Quality is a strategic choice. By enforcing a zero bug policy, Linear avoids the technical debt that slows down larger organizations.
- The "Product-Minded Engineer" reduces the need for heavy management. When engineers have good taste and judgment, you can replace rigid PRDs with fluid iterations.
- Removing internal email eliminates "meta-work." This forces communication into transparent channels like Slack or the project management tool itself.
- Process should be principle-based, not guidebook-based. Linear uses strict processes for reliability and incidents but keeps product development creative and unstructured.
How to debug large, distributed systems: Antithesis
Antithesis addresses the stagnation in debugging technology by providing Deterministic Simulation Testing (DST) as a service. While traditional debugging has remained largely unchanged since the 1980s, distributed systems have become significantly more complex. These systems face unique challenges like meaningless timestamps, unreliable networks, and concurrency issues that are nearly impossible to reproduce in standard environments. Antithesis solves this by making the hypervisor itself deterministic. This allows any software running on the platform to benefit from DST without requiring developers to build custom simulation frameworks from scratch. The core of the technology is a multiverse debugger that enables developers to rewind and branch the state of a system. If a crash occurs, a developer can move back in time, add logging, or even change the code to see how the system reacts. This capability transforms debugging from a guessing game into a precise science. The platform uses a combination of fuzzing to inject random inputs and fault injection to simulate hardware or network failures. This helps identify one-in-a-million bugs that occur frequently in high-traffic production environments but are elusive in staging. The engineering culture at Antithesis is notably unconventional. The team works five days a week in the office using desktops rather than laptops to maintain a clear boundary between work and home life. They have a strong preference for building their own tools, including a custom hypervisor and a specialized database designed to handle the tree-like event structures generated by a branching multiverse. This approach is driven by the need for extreme performance and specific functionality that off-the-shelf tools like BigQuery cannot provide. Their bug management philosophy emphasizes catching errors immediately after they are introduced, arguing that the cost of fixing a bug increases exponentially the longer it remains in the codebase. By prioritizing new bugs over old ones, teams can move toward a zero-bug state and maintain higher long-term productivity.
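The idea behind deterministic simulation can be illustrated with a toy sketch. This is not how Antithesis works internally (their determinism lives in the hypervisor). It only assumes a hypothetical two-replica counter whose message delivery is driven by a single seeded random number generator, so any failing run can be replayed exactly from its seed. The names `simulate`, `find_failing_seed`, and `drop_rate` are illustrative assumptions.

```python
import random
from typing import List, Optional


def simulate(seed: int, steps: int = 50, drop_rate: float = 0.1) -> List[str]:
    """Run a toy two-replica counter under a seeded, simulated network."""
    rng = random.Random(seed)  # all nondeterminism flows from this one seed
    primary, replica = 0, 0
    trace: List[str] = []
    for step in range(steps):
        primary += 1
        if rng.random() < drop_rate:
            # Injected fault: the replication message is lost.
            trace.append(f"step {step}: drop")
        else:
            # Deliberate bug: the replica applies an increment instead of
            # receiving full state, so a lost message is never repaired.
            replica += 1
            trace.append(f"step {step}: deliver")
        if primary != replica:
            trace.append(f"step {step}: DIVERGED primary={primary} replica={replica}")
            break
    return trace


def find_failing_seed(max_seeds: int = 1000) -> Optional[int]:
    """Fuzz over seeds until one produces a divergence."""
    for seed in range(max_seeds):
        if "DIVERGED" in simulate(seed)[-1]:
            return seed
    return None


failing_seed = find_failing_seed()
assert failing_seed is not None
# Replaying the same seed reproduces the identical event trace.
assert simulate(failing_seed) == simulate(failing_seed)
```

Because every random decision comes from the seed, a failure found by fuzzing is a reproducible artifact rather than a one-off observation, which is the property that makes rewinding and re-running with extra logging possible.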
Key Takeaways
- Hypervisor-level determinism is the key to making DST accessible. By moving the complexity of simulation from the application code to the infrastructure layer, Antithesis allows teams to test complex distributed systems without the massive overhead of building bespoke testing environments.
- The cost of a bug is tied directly to the time elapsed since its introduction. Fixing a bug immediately costs nearly zero engineer hours, while fixing it months later requires weeks of investigation, justifying a fanatical focus on new bugs over old backlogs.
- Distributed systems turn rare edge cases into frequent production failures. In a system processing millions of requests a day, a one-in-a-million bug is a daily occurrence, and deterministic simulation is the only reliable way to capture and reproduce these specific timing issues.
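The one-in-a-million claim is simple arithmetic, sketched below. The request volumes are hypothetical and failures are assumed to be independent, which real systems only approximate.

```python
def expected_failures(requests: int, failure_rate: float) -> float:
    """Expected number of failures across a number of independent requests."""
    return requests * failure_rate


def prob_at_least_one(requests: int, failure_rate: float) -> float:
    """Probability that at least one request hits the bug."""
    return 1.0 - (1.0 - failure_rate) ** requests


ONE_IN_A_MILLION = 1e-6

# Hypothetical daily volumes: a staging environment versus production.
for daily_requests in (10_000, 1_000_000, 50_000_000):
    print(
        daily_requests,
        expected_failures(daily_requests, ONE_IN_A_MILLION),
        round(prob_at_least_one(daily_requests, ONE_IN_A_MILLION), 4),
    )
```

At ten thousand requests a day the bug shows up on roughly one day in a hundred, which is why staging rarely sees it. At one million requests a day there is about a 63% chance of hitting it on any given day, and at fifty million it is expected around fifty times daily.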
Promotions and tooling at Google (with Irina Stanescu, Ex-Google)
Google's engineering culture centers on rigorous documentation and standardized code quality. Design docs are a non-negotiable prerequisite for any project, serving as the primary vehicle for cross-team coordination and architectural alignment. This contrasts sharply with faster-moving environments like Uber, where documentation was historically more optional. Internal tools like Code Search and Critique facilitate a high-velocity environment where engineers can quickly find examples or get feedback. A standout feature is the readability review, a certification process requiring engineers to demonstrate mastery of language-specific style guides before they can approve code for others. This ensures the codebase remains maintainable as it scales. Promotion cycles at Google traditionally relied on centralized committees to reduce manager bias. However, this decoupling can lead to rejections if an engineer's work is niche or lacks visible impact relative to high-traffic services. Irina Stanescu highlights that moving from a failed promotion to a senior role in just one year required a shift in strategy: choosing high-visibility projects and explicitly aligning work with next-level competencies. She notes that while Uber initially used manager-led promotions, they eventually iterated toward a committee-based model similar to Google's to ensure fairness. Influence in a technical organization is less about politics and more about building social capital and credibility. Being influential is a state where peers proactively seek your input because of your track record and helpfulness. Tactical advice for engineers includes troubleshooting the manager relationship, asking for specific feedback on building trust, and partnering with product managers who value technical input for strategy. Ultimately, the most difficult part of software engineering is the human collaboration and leadership required to move projects forward, rather than the coding itself. Success at senior levels depends on the ability to affect system outputs through feedback, disagreement, and negotiation.
Key Takeaways
- The Readability certification at Google treats code style as a first-class engineering skill, reducing the cognitive load for the entire organization at the cost of a steeper onboarding curve.
- Centralized promotion committees prioritize impact over mere execution, which can penalize engineers in niche or infrastructure roles unless they proactively translate their work into broader organizational value.
- Influence is a push and pull dynamic; the most effective engineers build a reputation for being helpful and reliable so that their input is pulled into decisions rather than pushed onto others.
- Transitioning from a move-fast culture like Uber to a process-heavy one like Google requires a translation layer for tools and workflows to maintain productivity.
How to become a more effective engineer
Effectiveness in software engineering, as Cindy Sridharan explains, is about more than just writing code. It is really about learning how to navigate the people and systems inside your company. A lot of engineers get frustrated because they follow advice from social media that does not match their actual workplace. For example, people complain about technical debt but wait for a manager to fix it. The best engineers just handle it themselves through small, iterative wins that do not slow down the team. You have to understand how your specific org works to get anything done. This means finding the informal hierarchy. Even if someone is not a manager, they might have a ton of influence because of their history or expertise. If you do not respect what came before, your new ideas will probably get rejected. You also need to know if your company is top-down or bottom-up. In a top-down place, you have to align with what the boss values. In a bottom-up place, you have to convince your peers. Either way, you have to get used to the mess. Most codebases and orgs are not perfect. Productive engineers learn to move fast even when documentation is missing or incentives are weird. Do not try to fix everything at once. Big culture shifts like DEI or changing the promotion process usually need the executives to lead the way. Focus on your main job first and build up your credibility with small wins. If you understand the politics and keep shipping, you become a force multiplier. This approach helps you avoid the trap of being stuck in your career. Many engineers fail to move up because they ignore the sociotechnical side of the job. They treat politics as a dirty word instead of a hard skill. By learning how decisions are actually made and who holds the real power, you can drive much more impact. This is especially true in a post-ZIRP world where companies are laser-focused on efficiency and profitability.
Key Takeaways
- The informal hierarchy often carries more weight than the official org chart when it comes to technical decision making and project approval.
- Technical debt is a sociotechnical problem that requires iterative, low-visibility fixes rather than waiting for a dedicated mandate from leadership.
- Productivity in messy environments is a competitive advantage; the ability to build a mental model of a system quickly without perfect documentation is a high-value skill.
- Career stagnation often stems from a refusal to accept organizational practicalities in favor of maintaining rigid, idealistic standards of how things should work.
Twisting the rules of building software: Bending Spoons (the team behind Evernote)
Bending Spoons is a Milan-based tech company that has built a $700 million revenue business by acquiring established software products like Evernote, Meetup, and WeTransfer. Unlike typical VC-backed startups, they have been profitable every year since 2013, initially relying on bank debt and cash flow because raising equity in Italy was difficult. This scarcity of resources forced them to develop a culture of extreme efficiency and radical simplicity. When they acquire a company, they enter an in-depth learning phase to identify patterns and technical debt. For Evernote, this meant addressing a massive Java 11 monolith running on 750 manually provisioned virtual machines. They migrated the data to managed databases and rewrote the backend into a cloud-native microservices architecture. A key technical shift was moving from a heavy polling mechanism to an event-driven synchronization system, which significantly improved performance and reduced data conflicts. Their operational philosophy is unconventional. They aim to eliminate on-call rotations by engineering systems to be robust enough that fallbacks aren't needed. They use a 100% fixed pay model with no bonuses, believing that performance should be driven by talent and ownership rather than financial incentives. Internally, they avoid titles and complex career ladders to minimize administrative friction. Hiring focuses on high-talent juniors and new graduates rather than experienced professionals. This allows them to mold engineers into their unique culture without the baggage of how other companies operate. This strategy has resulted in an exceptionally low unwanted churn rate of about 1% per year. In the AI space, their product Remini serves over 100 million monthly users. Managing this requires handling up to 8,000 inferences per second using 4,000 GPUs. They developed custom predictive algorithms to balance GPU availability with cost efficiency, ensuring they remain profitable even with high infrastructure demands. This focus on engineering excellence and lean operations allows them to rebuild and scale legacy products that other firms might consider too far gone.
Key Takeaways
- Profitability as a Moat: Their lack of early VC access forced a profitability-first mindset that became a competitive advantage in resource allocation and operational efficiency.
- Radical Simplicity as an Operating System: By removing complexity like bonuses, titles, and on-call, they reduce cognitive load and administrative overhead, allowing for higher talent density and faster execution.
- Technical Debt as Value Arbitrage: Bending Spoons views legacy monoliths as opportunities for value creation through total backend rewrites and UX modernization rather than just maintenance.
- Long-term Talent Compounding: Their hiring strategy bets on long-term talent potential over immediate experience, creating a culture that is difficult for outsiders to replicate and results in extremely low attrition.
Why techies leave Big Tech - by Gergely Orosz
Big Tech used to be the ultimate career destination for software engineers, offering unmatched stability and top-tier pay. That reality has shifted significantly since 2022. Mass layoffs at Meta, Google, Amazon, and Microsoft have removed the sense of job security that once defined these roles. Engineers now face a landscape where even high performers can be hit by unpredictable cuts or forced out through performance improvement plans used as headcount reduction tools. Beyond stability, professional stagnation is a major driver for departures. Many techies feel like cogs in a massive machine, spending their time on exploitation of existing systems rather than exploration of new ones. This leads to a startup shock where engineers realize they have only learned company-specific tools and lost the ability to build from scratch. Career paths also play a role. Moving from middle management to an executive level is nearly impossible at a giant like Google, but scaleups actively recruit Big Tech managers for VP and C-level roles to bring in seasoned experience. Compensation remains the strongest tether, often referred to as golden handcuffs. However, this tether breaks at the four-year vesting cliff. When initial equity grants fully vest, total compensation can drop by 30% or more if refreshers don't match the original package. Falling stock prices also trigger exits, as seen with Meta in late 2022. Conversely, massive stock growth at companies like NVIDIA makes it nearly impossible for competitors to hire their talent away. The culture within these giants has also become increasingly political. Promotion-driven development often prioritizes internal optics over actual product value. Even successful scaleups eventually hit a point where they become too Big Tech, introducing committees and rigid processes that drive away the original builders. Finally, the market has become bifurcated. Senior talent with specialized skills still sees high demand and can move between top-tier companies like an old aristocracy, while junior engineers face a much tougher hiring environment.
Key Takeaways
- The four-year cliff creates a predictable cycle of talent churn that Big Tech companies often use as a silent filter for retention.
- Scaleups offer a unique career arbitrage opportunity for Big Tech middle managers to leapfrog into executive roles.
- The shift from exploration to exploitation in mature tech companies is the primary catalyst for high-performer burnout and departure.
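The four-year cliff can be made concrete with a small calculation. Every figure below is hypothetical, and the model is deliberately simplified: it assumes a constant stock price, an initial grant that vests evenly, and refreshers that pay out as a flat yearly amount.

```python
def yearly_comp(base: float, initial_grant: float, refresher_per_year: float,
                year: int, vest_years: int = 4) -> float:
    """Total compensation in a given year (1-indexed), at a constant stock price."""
    # The initial grant vests evenly and contributes nothing after vest_years.
    initial_equity = initial_grant / vest_years if year <= vest_years else 0.0
    return base + initial_equity + refresher_per_year


def cliff_drop(base: float, initial_grant: float, refresher_per_year: float,
               vest_years: int = 4) -> float:
    """Fractional drop in total compensation in the first year after the cliff."""
    before = yearly_comp(base, initial_grant, refresher_per_year, vest_years, vest_years)
    after = yearly_comp(base, initial_grant, refresher_per_year, vest_years + 1, vest_years)
    return (before - after) / before


# Hypothetical package: 200k base, 600k initial grant, 30k per year in refreshers.
drop = cliff_drop(base=200_000, initial_grant=600_000, refresher_per_year=30_000)
print(f"{drop:.1%}")
```

With these assumed numbers, year four pays 380k and year five pays 230k, a drop of roughly 39%, which is in line with the "30% or more" described above when refreshers are small relative to the initial grant.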
Efficient scaleups in 2024 vs 2021: Sourcegraph (with CEO & Co-founder Quinn Slack)
Quinn Slack, CEO of Sourcegraph, reflects on the transition from the peacetime growth of 2021 to the efficiency-focused reality of 2024. Sourcegraph, a code intelligence platform with $248 million in funding and 180 employees, experienced 30x revenue growth during the pandemic but faced challenges with unrealistic expectations and a lack of focus. Slack admits to regretting many 2021 decisions, noting that the abundance of capital led to shipping features that should not have made the cut. To pivot the company toward AI and their new product, Cody, they implemented a job fair system where employees could choose to work on the highest-priority projects, effectively shocking the system to break old habits and realign the team. The discussion highlights a pragmatic approach to AI. Rather than chasing 100 percent autonomous agents that often fail in complex legacy codebases, Sourcegraph focuses on automating toil like changelog generation and providing better context. Slack emphasizes that AI cannot safely review its own code without ground truth from tests, builds, and performance data. He predicts a turnover in the dev tool landscape where legacy tools might be replaced by AI-native versions optimized for machine consumption rather than human UIs. This shift requires faster feedback loops, such as builds that run in milliseconds, to allow AI to iterate effectively. On the organizational side, Sourcegraph recently moved from location-independent pay to zone-based pay. Slack explains that while global flat pay was simpler initially, it created hiring inefficiencies in high-cost hubs and misaligned incentives for shareholders. He argues that as a company scales past 200 people, treating employees as shareholders requires being real about market rates to ensure long-term survival. Slack also advocates for technical CEOs to stay close to the code, noting that his daily coding practice helps him make better strategic decisions and stay fluent in the rapidly changing AI landscape.
Key Takeaways
- The Job Fair model serves as a high-friction but effective mechanism for forcing a GTM pivot. By allowing teams to dissolve and reform around new priorities, Sourcegraph successfully shifted focus to AI faster than traditional top-down restructuring would allow.
- Autonomous AI agents face a ground truth ceiling in enterprise environments. Without integration into fast builds, test suites, and feature flag systems, AI cannot safely manage the complex rollouts required in large-scale production systems.
- Location-independent compensation is a scaling liability. While it works for early-stage startups to cast a wide net, it eventually prevents hiring in top-tier tech hubs and creates a financial reckoning as the company matures.
- Technical fluency is a leadership moat in the AI era. A CEO who codes daily can better distinguish between demo-ware and actual utility, ensuring the product roadmap stays grounded in developer reality rather than market hype.
The Pragmatic Engineer: Three Years - by Gergely Orosz
Gergely Orosz reflects on three years of The Pragmatic Engineer, which has grown to over 759,000 subscribers. The publication now delivers two weekly articles: a Tuesday deep dive and a Thursday industry analysis called The Pulse. A significant shift occurred when Orosz moved away from the name The Scoop to avoid a gossip-oriented reputation, focusing instead on analytical software engineering topics. This change helped build trust with engineering teams at companies like OpenAI, Stripe, and Meta, allowing for exclusive inside look articles. The team expanded by hiring Elin Nilsson as a Tech Industry Researcher, enabling more intensive projects on topics like Gen Z developers and AI coding agents. Popular content from the past year focused on the end of zero interest rate policies (ZIRP) and its impact on the tech industry. Other major hits included a critique of McKinsey's developer productivity metrics and an exploration of OpenAI's internal execution strategies. Orosz also released The Software Engineer's Guidebook, which sold over 30,000 copies and is being translated into multiple languages. Looking forward, the publication is launching The Pragmatic Engineer Podcast to feature deep-dive interviews with engineering leaders every two weeks. The newsletter remains focused on providing practical, research-backed insights for software engineers and managers across Big Tech and startups.
Key Takeaways
- Moving from news-breaking to deep analysis improved access. By rebranding The Scoop to The Pulse, Orosz signaled he was not looking for leaks, which made engineering leaders more comfortable sharing internal cultural details.
- Macroeconomics now dictates engineering strategy. The high engagement with the ZIRP series suggests that engineers are increasingly concerned with how interest rates and market shifts affect hiring and project funding.
- Specialization drives newsletter scaling. Hiring a dedicated researcher allowed the publication to tackle complex, multi-month projects that a solo creator could not sustain, maintaining high quality during rapid subscriber growth.
Paying down tech debt - by Gergely Orosz and Lou Franco
Technical debt should be viewed as a productivity booster rather than just a long-term maintenance task. Lou Franco defines tech debt as any problem in the codebase that makes it harder for programmers to make necessary changes. While many managers fear that addressing debt slows down feature delivery, Franco found at Atalasoft that ignoring it leads to developer frustration and higher attrition. By rewriting a tangled build system and installer, the team achieved faster CI builds and easier modifications immediately, proving that debt reduction can pay off in the short term. At Trello, the team learned the danger of over-rewriting. They spent months on a new navigation paradigm that was ultimately overkill. Franco now uses a specific heuristic: only pay down debt if it increases developer productivity and delivers business value right now. This involves making small, behavior-preserving changes using unit tests and refactoring. For larger initiatives, he suggests coupling debt fixes with the delivery of high-priority product changes. This way, stakeholders see the resulting feature rather than the invisible engineering work. To get management buy-in, Franco recommends making the effects of debt visible through dashboards. At Atlassian, he used an Errors Per Million (EPM) metric to track sync reliability. By defining excellent, acceptable, and unacceptable thresholds, the team could treat high error rates as incidents, forcing prioritization. This visibility transformed a vague engineering concern into a quantifiable business problem. For large-scale rewrites, Franco warns that they usually fail without heavyweight support from executive leadership. Successful rewrites, like the one at ISO-NE, require planning for a temporary increase in headcount, maintaining the legacy system in parallel, and branding the project alongside visible user-facing improvements. Ultimately, the goal is to reduce feedback loops, lower cognitive load, and help engineers stay in a state of flow.
Key Takeaways
- Prioritize debt that shortens feedback loops like build times and PR reviews to see immediate ROI.
- Use quantifiable metrics like Errors Per Million to turn invisible technical issues into urgent business priorities.
- Avoid the architect's urge to bulldoze code: incremental renovation is usually more cost-effective than a full rewrite.
- Link technical improvements directly to user-facing features so stakeholders can see the value in every sprint.
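The Errors Per Million approach described above can be sketched as a small classifier. The summary does not state the actual thresholds or how the metric was computed at Atlassian, so the threshold values, the `Thresholds` type, and the function names here are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    excellent_max: float   # EPM at or below this is "excellent"
    acceptable_max: float  # EPM at or below this is "acceptable"


def errors_per_million(errors: int, operations: int) -> float:
    """Normalize an error count to errors per one million operations."""
    if operations <= 0:
        raise ValueError("operations must be positive")
    return errors / operations * 1_000_000


def classify(epm: float, thresholds: Thresholds) -> str:
    """Map an EPM value onto the three bands."""
    if epm <= thresholds.excellent_max:
        return "excellent"
    if epm <= thresholds.acceptable_max:
        return "acceptable"
    return "unacceptable"  # in this sketch, the band that would open an incident


# Hypothetical thresholds and a hypothetical day of sync traffic.
sync_thresholds = Thresholds(excellent_max=100, acceptable_max=1_000)
epm = errors_per_million(errors=42, operations=120_000)
print(round(epm), classify(epm, sync_thresholds))
```

Normalizing to a per-million rate keeps the number comparable as traffic grows, and fixed bands give the team a pre-agreed trigger for treating reliability work as an incident rather than as optional cleanup.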
Leading Effective Engineering Teams: a Deepdive
Effective engineering teams are built on specific dynamics rather than just technical talent. Google's Project Aristotle identified five key factors for success, with psychological safety being the most critical. This refers to a team's comfort level in taking risks and being vulnerable without fear of judgment. Other vital factors include dependability, structure and clarity, meaning, and impact. While Google's research suggested team composition did not matter as much, other studies highlight that smaller teams under 10 people often perform better due to reduced communication complexity. Leadership roles in these environments typically fall into three categories: Technical Lead (TL), Engineering Manager (EM), and Tech Lead Manager (TLM). The TL is a hands-on role focused on architecture, coding standards, and mentoring. They bridge the gap between development and management. The EM focuses on people management, hiring, process, and aligning the team with company goals. They act as a development abstraction layer, shielding engineers from infrastructure and administrative friction. The TLM is a hybrid role common at Google but rare elsewhere. It combines technical leadership with people management, usually for smaller teams. TLMs must balance hands-on coding with coaching and cross-functional coordination. This role requires high technical aptitude to push the team while managing individual career growth. Effective leadership across all these roles involves enabling, empowering, and expanding the team's capabilities. Research also shows that agility, diversity, and clear communication are essential for maintaining a high-performing culture. While remote work offers flexibility, colocation is often cited as a driver for innovation through serendipitous knowledge sharing.
Key Takeaways
- Psychological safety is the primary predictor of team success, directly correlating with higher revenue and better employee retention.
- The development abstraction layer concept suggests that a manager's primary value is removing operational friction so engineers can focus entirely on technical execution.
- The TLM role serves as a critical bridge for scaling nascent products, allowing for deep technical oversight without losing the human element of management.
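The point about communication complexity in smaller teams is often illustrated by counting pairwise communication channels. The formula below is a common model, popularized by Fred Brooks in The Mythical Man-Month, and is not taken from the studies summarized above.

```python
def communication_channels(team_size: int) -> int:
    """Number of distinct person-to-person channels in a team: n * (n - 1) / 2."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


for size in (5, 9, 10, 20):
    print(size, communication_channels(size))
```

A team of 5 has 10 channels and a team of 10 has 45, while a team of 20 has 190. Doubling headcount roughly quadruples the coordination surface, which is one plausible reason smaller teams tend to perform better.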
The biggest-ever global outage: lessons for software engineers
The July 2024 CrowdStrike outage impacted 8.5 million Windows machines, causing widespread disruption across airlines, banks, and hospitals. The crash originated from a routine content update to CrowdStrike's Falcon sensor. Specifically, a new configuration file designed to detect malicious named pipes triggered a logic error. This caused the CSAgent.sys driver to attempt an invalid memory operation, resulting in the Blue Screen of Death. Because the software operates at the kernel level, a single faulty instruction brought down the entire operating system. Recovery was notoriously difficult because it required manual intervention on every affected device. IT staff had to boot machines into Safe Mode to delete the offending file. While CrowdStrike bears primary responsibility for failing to use staged rollouts or canary testing for what they labeled as content, regulatory history played a role. A 2009 agreement with the European Commission prevents Microsoft from walling off its kernel from third-party security vendors, a restriction Apple does not face. For software engineers, the incident highlights the danger of treating configuration differently than code. CrowdStrike likely bypassed standard deployment safeguards because the update was classified as data rather than logic. The outage serves as a reminder to quantify the potential blast radius of any change and to implement automated, multi-stage rollout pipelines. It also underscores the importance of blameless postmortems, focusing on organizational and systemic failures rather than individual mistakes.
Key Takeaways
- Classifying updates as content rather than code creates a dangerous blind spot in deployment safety. Even binary configuration files can trigger catastrophic logic errors when processed by kernel-level drivers.
- Regulatory environments can dictate technical architecture and system stability. Microsoft's inability to lock down its kernel stems from a 2009 EU antitrust agreement, showing how legal constraints impact global software resilience.
- Manual recovery is the ultimate bottleneck in modern distributed systems. When an outage requires hands-on access to millions of machines, recovery time is bound by how many technicians are available, and automation cannot shorten it.
- Organizational culture is the primary defense against systemic failure. The lack of staged rollouts for critical updates suggests a process gap where speed or perceived simplicity was prioritized over safety.
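A multi-stage rollout of the kind recommended above can be sketched in a few lines. This is a generic illustration, not CrowdStrike's pipeline: the stage fractions, the callback names, and the simulated health check are all assumptions, and the fleet size simply reuses the 8.5 million figure as an example.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[float],
    deploy: Callable[[float], None],
    healthy: Callable[[float], bool],
    rollback: Callable[[], None],
) -> bool:
    """Deploy to growing fractions of the fleet, halting at the first unhealthy stage."""
    for fraction in stages:
        deploy(fraction)
        if not healthy(fraction):
            rollback()
            return False
    return True


FLEET_SIZE = 8_500_000  # illustrative fleet size
exposed_hosts: List[int] = []


def deploy(fraction: float) -> None:
    # Record how many hosts have received the update at this stage.
    exposed_hosts.append(round(FLEET_SIZE * fraction))


def healthy(fraction: float) -> bool:
    # Simulate an update that crashes every host it reaches.
    return False


def rollback() -> None:
    # Hypothetical hook: stop distribution and restore the last known-good file.
    pass


completed = staged_rollout([0.001, 0.01, 0.1, 1.0], deploy, healthy, rollback)
print(completed, exposed_hosts)
```

In this simulation the rollout stops after the first stage, so the faulty update reaches 0.1% of the fleet instead of all of it. The same gate applies whether the payload is labeled code or content.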
What is Old is New Again - by Gergely Orosz
Gergely Orosz breaks down the massive shifts in the tech industry over the last two years, moving from the hottest hiring market in history to a period of mass layoffs and funding droughts. The core driver is a fundamental change in interest rates. For over a decade, the Zero Interest Rate Policy (ZIRP) made risky startups attractive because keeping money in the bank offered no return. Now that rates have jumped from 0% to 5%, investors demand immediate cash flow and profitability over long-term growth bets. This shift explains why even profitable giants like Google and Meta are cutting costs and managing out low performers. The 2010s were a unique golden age where rock-bottom rates coincided with the smartphone and cloud revolutions. These provided cheap scaling and massive new distribution channels. In contrast, the current GenAI boom is happening in a high-interest environment without a new free distribution channel, making the hurdle for success much higher. For software engineers, this new reality means a tougher job market and slower career progression. Engineering practices are also pivoting. There is a growing preference for boring technology, monolithic architectures over complex microservices, and a shift-left approach where developers take on more operational and testing responsibilities.
Key Takeaways
- Startups in the ZIRP era were effectively a macro bet on interest rates rather than just pure innovation.
- The GenAI revolution lacks the free distribution advantage that smartphones gave to apps like Uber and WhatsApp, making its ROI harder to prove.
- Big Tech companies have shifted from hoarding talent to strategically cutting cost centers to maintain valuations in a high-rate environment.
- Pragmatic engineering now favors fullstack versatility and simpler tech stacks to reduce overhead and speed up delivery.
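The effect of interest rates on long-term growth bets can be shown with a standard present-value calculation. The cash flow and time horizon below are hypothetical, and the 0% and 5% rates are the ones mentioned above.

```python
def present_value(cash_flow: float, rate: float, years: int) -> float:
    """Discount a single future cash flow back to today's value."""
    return cash_flow / (1.0 + rate) ** years


# Hypothetical: a startup expected to return 100 (in any currency unit) in 10 years.
for rate in (0.0, 0.05):
    print(rate, round(present_value(100.0, rate, 10), 2))
```

At a 0% rate, a payoff of 100 in ten years is worth 100 today, so waiting costs nothing. At 5% the same payoff is worth about 61 today, which is why investors in a high-rate environment favor near-term cash flow over distant growth.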
What do GenZ software engineers really think?
GenZ software engineers prioritize flexibility, flat hierarchies, and transparent company cultures. This group, mostly aged 24 to 27, values autonomy and modern tech stacks. They are direct in their communication and prefer text over meetings. Many would quit over meaningless work or bad company values. They are sensitive to business health, often spotting negative cash flow or poor product-market fit before leadership does. Regarding older colleagues, GenZ sees them as either coasting or work-obsessed. They often find senior staff's written communication poor and lacking context. However, they respect the deep engineering knowledge and war stories that experienced engineers bring to design reviews. While they are bullish on AI, they feel frustrated by legacy tech debt and red tape. They use developer-influencers like Theo or Primeagen for learning, which older generations often overlook. Onboarding and mentorship are areas where they feel companies fail, leaving them to figure out systems without documentation. They like managers who act as supportive partners. Most respondents hold computer science degrees but feel undervalued in maintenance roles. They switch jobs for better pay or growth, knowing tenure is often less rewarding than moving to a profit-center role. This generation guards their free time strictly and views work as a part of life rather than their core identity. They are ambitious and have been coding since high school, making them highly capable but demanding of their employers. They appreciate remote or hybrid work and expect a high level of transparency from leadership regarding company decisions and financial stability. When these expectations are not met, they are quick to look for new opportunities that offer better alignment with their personal and professional values.
Key Takeaways
- GenZ engineers have a bimodal distribution of ambition, either being highly driven hacker types or strictly guarding their work-life balance.
- There is a significant disconnect in digital communication styles, where GenZ finds older colleagues' written responses lacking context or being unnecessarily formal.
- Career progression is the primary driver for job switching, with many young engineers realizing early that tenure matters less than skill acquisition and salary bumps.
- The generation gap is most visible between GenZ and GenX, often due to differing views on hierarchy, bureaucracy, and the role of work in one's personal identity.
Getting an Engineering Executive Job - by Gergely Orosz
This document outlines the unique path to becoming an engineering executive, specifically focusing on CTO, VP of Engineering, and Head of Engineering positions. It features insights from Will Larson's book, The Engineering Executive's Primer. A core concept is that every executive role and hiring process is "one of one," meaning they are bespoke and lack the standardization found in middle management hiring. Internal promotions are surprisingly rare because companies often seek a different skill set than what exists in the current team, and internal transitions can be fraught with peer friction. Most external roles are found through a pipeline that starts with internal candidates and personal networks before moving to investors and finally executive recruiters. Relying solely on recruiters often limits candidates to second-tier opportunities that were hard to fill. The interview process is frequently chaotic because the hiring manager (often the CEO) may lack technical expertise, shifting the focus toward perceived fit, prestige, and communication skills. Candidates should expect a multi-stage process involving recruiter screens, CEO discussions, peer interviews, and a final 90-day plan presentation. Negotiation is a critical phase where almost everything is on the table, including equity acceleration (single or double trigger), severance packages, and organizational support like budget or executive assistants. Before accepting, candidates must perform due diligence by meeting the board, checking with former executives, and ensuring the CEO is someone they trust.
Key Takeaways
- Executive hiring is non-standardized and "one of one," requiring candidates to adapt to bespoke processes rather than relying on typical management interview frameworks.
- The most desirable executive roles often never reach recruiters, making a strong network and visibility within investor circles more effective than traditional job searching.
- Interviewing for these roles is less about technical proficiency and more about managing the perceptions of non-technical stakeholders like CEOs and board members.
- Negotiation is the primary window to secure the specific resources and organizational support, such as budget or an EA, that determine long-term success in the role.
Building Bluesky: a Distributed Social Network (Real-World Engineering Challenges)
Bluesky scaled to 25 million users with a team of just 15 engineers, maintaining a lean operation reminiscent of early Instagram. The network is built on the Authenticated Transfer (AT) Protocol, a decentralized framework designed to prevent platform lock-in and give users ownership of their data. Initially, the team used a standard stack of PostgreSQL and AWS to move quickly during the experimentation phase. However, as user growth spiked during Elon Musk Events (surges caused by Twitter changes), they hit scaling limits that required a total architectural rethink. The architecture evolved from a centralized Postgres setup to a distributed model. They migrated the read-heavy Appview service to ScyllaDB for horizontal scalability and transitioned Personal Data Servers (PDS) to a PDS in a box model using SQLite. This unique approach gives every user their own SQLite database, which drastically reduces operational overhead and removes the need for complex database management like replicas or failovers. A major strategic shift involved moving from AWS to on-premise hardware. By becoming cloud-agnostic and using bare-metal servers through Vultr, they achieved 10x the performance at a significantly lower cost. The network functions through several modular components: PDSs host user data, Relays crawl these servers to create a firehose of events, and Appviews provide the frontend logic. This modularity allows developers to build custom feeds, algorithms, and moderation tools via the Ozone service that plug directly into the open ecosystem.
Key Takeaways
- Small, high-seniority teams can outperform massive organizations by prioritizing modularity and avoiding 'not invented here' syndrome until standard tools truly fail.
- Moving off-cloud is a viable strategy for high-scale SaaS once load patterns are predictable: Bluesky reports roughly 10x the performance at a significantly lower cost after leaving AWS.
- The single-tenant SQLite architecture for PDSs is a masterclass in operational simplicity, turning a complex distributed problem into a manageable file-based system.
- Decentralization enables permissionless innovation, allowing third-party developers to build custom algorithms and moderation layers without platform owner approval.
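The single-tenant PDS storage model described above can be illustrated with a minimal sketch. It assumes one SQLite file per user inside a data directory; the `open_user_db` helper, the `records` table, and the file naming are hypothetical and are not Bluesky's actual schema or code.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_user_db(data_dir: Path, user_id: str) -> sqlite3.Connection:
    """Open (or create) the SQLite file that belongs to exactly one user."""
    # Hypothetical safety rule: keep IDs filename-safe for this sketch.
    if not user_id.isalnum():
        raise ValueError(f"unsupported user id: {user_id!r}")
    conn = sqlite3.connect(data_dir / f"{user_id}.sqlite")
    # Hypothetical schema: one table of keyed records per user.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (rkey TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def demo() -> tuple:
    """Write a record for one user and show the other user's file is untouched."""
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp)
        alice = open_user_db(data_dir, "alice")
        bob = open_user_db(data_dir, "bob")
        with alice:  # commits the insert on success
            alice.execute("INSERT INTO records VALUES (?, ?)", ("post1", "hello"))
        alice_rows = alice.execute("SELECT COUNT(*) FROM records").fetchone()[0]
        bob_rows = bob.execute("SELECT COUNT(*) FROM records").fetchone()[0]
        db_files = len(list(data_dir.glob("*.sqlite")))
        alice.close()
        bob.close()
        return alice_rows, bob_rows, db_files


if __name__ == "__main__":
    print(demo())
```

In a layout like this, each user's data lives in a single file with no shared database server to replicate or fail over, which is the kind of operational simplicity the summary points to.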
Scaling ChatGPT: Five Real-World Engineering Challenges
OpenAI's rapid growth to 100 million weekly users required a complete rethink of traditional scaling. The core challenge stems from the Transformer architecture's self-attention mechanism, which scales quadratically with sequence length. To manage this, engineers use a KV cache to store the key and value results computed for prior tokens in GPU RAM. Because High Bandwidth Memory (HBM) is significantly faster than PCIe, keeping data on the GPU is essential, making GPU RAM the most valuable commodity and the most frequent bottleneck. The team manages efficiency through arithmetic intensity, specifically the ratio of floating point operations to data movement. For an NVIDIA H100, this ratio is 591:1. If the batch size is too small, the GPU wastes compute cycles waiting for memory. Conversely, if the batch size is too large, it hits memory bandwidth limits. Traditional metrics like GPU utilization proved misleading because they do not account for KV cache misses or memory starvation. Hardware scarcity also dictates strategy. Since there is no infinite cloud for GPUs, OpenAI cannot autoscale in the traditional sense. They must find GPUs globally, leading to a multi-region, multi-cluster setup where a well-balanced fleet is prioritized over geographic proximity to the user. This shift means latency from network round-trips is less critical than having a GPU ready to stream tokens. Ultimately, scaling LLMs requires jumping between low-level CUDA kernels and high-level global data center strategies.
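To make the effect of the KV cache concrete, the toy count below compares how many per-token key/value computations a decoder performs while generating a sequence, with and without a cache. It is a counting sketch only, assuming one key/value computation per token per step; `count_kv_computations` is a hypothetical helper and no real model or GPU is involved.

```python
def count_kv_computations(num_tokens: int, use_cache: bool) -> int:
    """Count key/value computations needed to generate num_tokens tokens."""
    cached_positions = []  # stands in for key/value tensors kept in GPU RAM
    computations = 0
    for position in range(num_tokens):
        if use_cache:
            # Only the newest token is processed; earlier results are reused.
            cached_positions.append(position)
            computations += 1
        else:
            # Every earlier token is processed again at each step.
            computations += position + 1
    return computations


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, count_kv_computations(n, False), count_kv_computations(n, True))
```

Without the cache the count grows as n(n+1)/2; with it the count grows as n, at the price of keeping every cached entry in memory. That trade is why GPU RAM ends up as the constrained resource.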
Key Takeaways
- Memory bandwidth and GPU RAM capacity are more critical constraints than raw compute power for modern LLM inference.
- Traditional cloud scaling assumptions like infinite autoscaling and edge-priority latency do not apply when hardware is physically scarce.
- Effective LLM monitoring requires tracking KV cache utilization and arithmetic intensity rather than simple processor load.
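The 591:1 figure in the summary above can be reproduced as a ratio of peak compute to memory bandwidth. The sketch below is a roofline-style reading under stated assumptions: the two hardware constants are assumed H100 figures chosen because they reproduce the quoted ratio, and `hardware_ratio` and `limiting_resource` are hypothetical helper names.

```python
# Assumed H100 figures; they are not taken from the summary itself.
PEAK_FLOPS_PER_SECOND = 1.979e15   # assumption: peak floating point throughput
MEMORY_BYTES_PER_SECOND = 3.35e12  # assumption: HBM memory bandwidth


def hardware_ratio(
    peak_flops: float = PEAK_FLOPS_PER_SECOND,
    bandwidth: float = MEMORY_BYTES_PER_SECOND,
) -> float:
    """Floating point operations the chip can perform per byte it can load."""
    return peak_flops / bandwidth


def limiting_resource(flops: float, bytes_moved: float) -> str:
    """Classify a workload by comparing its arithmetic intensity to the chip's ratio."""
    intensity = flops / bytes_moved  # operations performed per byte moved
    if intensity < hardware_ratio():
        return "memory-bound"  # compute units sit idle waiting for data
    return "compute-bound"     # data arrives faster than it can be processed


if __name__ == "__main__":
    print(round(hardware_ratio()))
    print(limiting_resource(flops=100.0, bytes_moved=1.0))
```

A utilization percentage alone cannot tell these two cases apart, which is one way to read the takeaway about tracking arithmetic intensity instead of processor load.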
Measuring Developer Productivity: Real-World Examples
Tech companies like Google and LinkedIn use a combination of qualitative and quantitative data to track engineering efficiency rather than relying on a single silver bullet metric. Google uses the Goals, Signals, Metrics (GSM) framework, focusing on three dimensions: speed, ease, and quality. They combine system logs with developer surveys and diary studies to ensure objective data matches subjective experience. LinkedIn employs a Developer Insights Hub to track real-time feedback and system-based metrics like build times and deployment success. They use Winsorized means to account for outliers and calculate a Developer Experience Index as an aggregate score. Peloton focuses on engagement, velocity, quality, and stability, using a sampling method for surveys to reduce fatigue. Scaleups with 100 to 1,000 engineers prioritize moveable metrics like Ease of Delivery and Time Loss. Time Loss is particularly effective for business buy-in because it can be translated into dollar amounts based on engineering payroll. The DORA and SPACE frameworks are used selectively as components rather than wholesale replacements for custom strategies. A significant trend is the tracking of focus time or deep work as a top-level metric at companies like Stripe and Uber. For engineering leaders, the advice is to frame metrics around business impact, system performance, and engineering effectiveness to demonstrate stewardship of resources.
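A Winsorized mean, as mentioned above for LinkedIn, clamps extreme values to a chosen percentile before averaging instead of dropping them. The sketch below is one common way to compute it; the 10% default and the sample build times are assumptions, since the summary does not say which cutoff LinkedIn uses.

```python
from statistics import mean
from typing import List, Sequence


def winsorized_mean(values: Sequence[float], fraction: float = 0.1) -> float:
    """Mean after clamping the lowest and highest `fraction` of values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values clamped on each side
    low, high = ordered[k], ordered[len(ordered) - k - 1]
    clamped: List[float] = [min(max(v, low), high) for v in ordered]
    return mean(clamped)


if __name__ == "__main__":
    # Hypothetical build times in seconds, with one pathological outlier.
    build_times = [50, 52, 55, 58, 60, 61, 63, 65, 70, 900]
    print(mean(build_times), winsorized_mean(build_times, 0.1))
```

With these made-up numbers the plain mean is 143.4 seconds while the Winsorized mean is 60.6 seconds, so a single stuck build no longer dominates the reported figure.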
Key Takeaways
- Qualitative data is as critical as quantitative logs. If developers feel a process is slow despite good numbers, the subjective friction is a real productivity blocker that needs addressing.
- Moveable metrics like Time Loss bridge the gap between engineering and finance. By quantifying hours lost to environment obstacles, leaders can justify infrastructure investments in terms of recovered payroll value.
- Focus time is emerging as a leading indicator of burnout and output. Tracking days with sufficient focus time allows teams to predict productivity drops before they show up in PR counts or velocity charts.
- The GSM framework prevents metric chasing by forcing teams to define what success looks like and what evidence would prove it before selecting specific data points.
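The Time Loss takeaway above rests on simple arithmetic: the share of engineering time lost, multiplied by payroll. A minimal sketch follows, in which every number and the `time_loss_cost` helper are hypothetical.

```python
def time_loss_cost(engineers: int, avg_annual_cost: float, time_loss_fraction: float) -> float:
    """Annual payroll value of the time engineers report losing to obstacles."""
    if not 0 <= time_loss_fraction <= 1:
        raise ValueError("time_loss_fraction must be between 0 and 1")
    return engineers * avg_annual_cost * time_loss_fraction


if __name__ == "__main__":
    # Hypothetical inputs: 200 engineers, $150,000 fully loaded cost each.
    before = time_loss_cost(200, 150_000, 0.20)
    after = time_loss_cost(200, 150_000, 0.15)
    print(f"Lost per year at 20%: ${before:,.0f}")
    print(f"Recovered by reaching 15%: ${before - after:,.0f}")
```

Framed this way, a five-point reduction in reported Time Loss becomes a dollar figure that can be compared directly against the cost of the infrastructure work proposed to achieve it.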
The Pragmatic Engineer in 2023 - by Gergely Orosz
2023 saw a pivot in the tech landscape. Profitable Big Tech firms like Google, Microsoft, and Amazon executed massive layoffs. Google’s dismissal of 12,000 employees was historic, marking only its third layoff since 1998. The industry shifted from a growth mindset to a focus on efficiency. This led to fewer middle management roles and tougher performance reviews. Companies aggressively cut vendor spending on cloud, observability, and SaaS. Coinbase notably built its own observability stack to reduce a $65M annual Datadog bill. The Silicon Valley Bank collapse in March signaled the end of easy capital. Startup funding returned to 2019 levels across all stages. Engineering culture research moved toward direct sourcing from CTOs and current staff at companies like Stripe, OpenAI, and Figma. The most popular articles included a critique of McKinsey’s developer productivity metrics and an exploration of stacked diffs as a superior engineering workflow. Other highlights covered Stripe’s engineering culture, lessons from bootstrapped companies, and the internal shipping practices at OpenAI. Personal favorites from the year touched on staying hands-on as an engineering manager, the productivity impact of AI coding tools, and the distinction between wartime and peacetime environments at tech companies. Generative AI drove massive revenue for NVIDIA, which saw an 88% quarterly revenue jump. Conversely, Stack Overflow saw a decline in traffic and questions as developers pivoted to AI assistants. Open source companies like HashiCorp moved to restrictive Business Source Licenses to protect revenue. Regulators became more active, blocking the $20B Adobe-Figma deal and forcing Meta to sell Giphy. Despite the turmoil, Meta staged a turnaround by optimizing ad revenue against Apple’s privacy changes. The year ended with recovered market caps for Big Tech, though the hiring market for senior leadership remained challenging.
Key Takeaways
- The transition to wartime operations has made non-technical management roles increasingly difficult to maintain.
- Regulators in the US and EU are now major obstacles for Big Tech acquisitions of potential competitors.
- Open source companies are abandoning permissive licenses for restrictive models to survive high interest rates.
- AI coding tools are replacing community Q&A platforms as the primary source for programming advice.
Dead Code, Getting Untangled, and Coupling versus Decoupling
Kent Beck's book Tidy First? focuses on the practical and economic aspects of software design, specifically the relationship between tidying code and changing its behavior. The book is structured into three distinct parts: Tidyings (the "what"), Managing (the "how"), and Theory (the "why"). Beck argues that software design is primarily about human relationships and managing the cost of change over time. In the chapter on Dead Code, the advice is simple: delete it. Version control serves as the safety net, so there is no need to keep unused code around "just in case." If you are unsure if code is used, Beck recommends logging its use in production before removal. When feature work and tidying become tangled, Beck suggests three options: shipping the mess, manually untangling it into separate pull requests, or discarding the work to start over with a tidying-first approach. While discarding work feels counterintuitive due to the sunk cost fallacy, it often leads to better insights during re-implementation and results in a more coherent commit history. The goal is to explain your intentions to other humans, not just to instruct a computer. Regarding coupling and decoupling, the text explains that coupling is often an economic choice to ship faster, but it eventually becomes a cost. Decoupling also has costs and can sometimes increase coupling in other areas. The fundamental challenge of software design is navigating this trade-off space where the exact costs of coupling and decoupling are not known in advance. Beck notes that some coupling is inevitable and that trying to remove every bit of it is rarely worth the effort. The book series is planned to expand from individual design decisions to team dynamics and eventually the relationship between business and technology. This first volume emphasizes that the design decisions you make affect your own productivity and clarity first.
Key Takeaways
- Tidying code is a strategic move to make future behavior changes easier and more cost-effective.
- Deleting dead code reduces cognitive load and relies on version control for historical recovery rather than keeping clutter in the active codebase.
- The most effective way to handle tangled refactoring is often to discard the work and restart with a clear sequence of tidyings followed by changes.
- Software design is an economic trade-off where you balance the immediate revenue of coupled code against the long-term flexibility of decoupling.
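The advice above to log the use of suspected dead code in production before deleting it could look like the following sketch. The `suspected_dead` decorator, the logger name, and the `legacy_discount` function are hypothetical illustrations, not something taken from the book.

```python
import functools
import logging

logger = logging.getLogger("dead_code_probe")


def suspected_dead(func):
    """Log a warning every time a function believed to be unused is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.warning("suspected-dead code called: %s", func.__qualname__)
        return func(*args, **kwargs)  # behavior is unchanged; only usage is recorded
    return wrapper


@suspected_dead
def legacy_discount(price_cents: int) -> int:
    """Hypothetical old pricing rule that nobody is sure is still in use."""
    return price_cents - price_cents // 10


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(legacy_discount(1000))
```

If the warning never appears over a representative period, the function can be deleted with more confidence, and version control remains the fallback if that judgment turns out to be wrong.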
Holiday Season Gift Ideas for Techies - by Gergely Orosz
Selecting gifts for software engineers requires a mix of technical utility, creative stimulation, and physical wellness. These recommendations, crowdsourced from the engineering community, highlight tools that range from advanced hardware hacking to simple office comforts. In the category of gadgets, the Flipper Zero is the most frequently suggested item. It functions as a multi-tool for security professionals and hobbyists, allowing them to interact with radio protocols, NFC, and infrared signals. For those who prefer building from scratch, Arduino kits and 3D printers like the Bambu P1P offer platforms for physical prototyping. Creative blocks can be addressed with Oblique Strategies, a deck of cards providing prompts to encourage lateral thinking. Subscriptions also play a major role, whether for educational platforms like Frontend Masters or industry-specific newsletters such as Stratechery and The Information. Office productivity is another focus, with items like the Ember mug designed to keep coffee at a constant temperature during long coding sessions. Ergonomic tools, including the Carpio 2.0 wrist rest and noise-canceling headphones, help manage the physical demands of desk work. Beyond digital tools, there is a strong emphasis on wellbeing and tactile experiences. This includes biometric trackers like WHOOP, adult-focused LEGO sets, and the Playdate handheld console. Board games that mirror engineering logic, such as Dominion and Zendo, provide social ways to engage the mind away from a screen. Finally, non-tech gifts like indoor climbing sessions or simple house plants are noted for their ability to help engineers disconnect and recharge.
Key Takeaways
- Tech professionals often value gifts that facilitate a transition from abstract digital work to tactile or physical activities.
- The Flipper Zero has emerged as a primary interest for engineers looking to explore hardware security and signal hacking.
- Small environmental improvements in the office, such as temperature-controlled mugs or ergonomic rests, provide significant daily utility.
- Logic-based board games and creative prompt cards offer ways to exercise engineering skills in a social or non-digital context.
What is OpenAI, Really? - by Gergely Orosz
The five-day leadership crisis at OpenAI in November 2023 highlighted a fundamental tension between the company's nonprofit origins and its aggressive commercial trajectory. The board of directors fired CEO Sam Altman on a Friday, citing a lack of candor. This move triggered an immediate revolt, with cofounder Greg Brockman resigning and 743 out of 778 employees signing a petition to join Microsoft unless the board resigned. Microsoft CEO Satya Nadella stepped in by offering to hire the entire staff, effectively demonstrating that Microsoft holds indirect control over the entity despite not having a board seat at the time. The conflict stems from an exotic corporate structure where a 501(c)(3) nonprofit governs a capped-profit subsidiary. While OpenAI claims a 100x return cap for investors, the math suggests this is an illusory constraint. With over $12 billion raised, the profit cap sits around $1.2 trillion. For context, this exceeds the cumulative historical profits of Apple, Google, or Microsoft. This structure was designed to attract talent with Profit Participation Units (PPUs). Median compensation at the company reached $905,000 per year, creating a workforce whose financial interests are tied directly to commercial success rather than purely academic research or safety. Internal friction peaked following the rapid success of ChatGPT, which reached 100 million weekly users within a year. Chief Scientist Ilya Sutskever and other board members reportedly felt Altman pushed commercialization too fast, prioritizing product launches over safety audits. The board's failure to communicate specific reasons for the firing allowed Altman to leverage his charisma and employee loyalty to force a return. The resolution saw Altman reinstated as CEO with a new board featuring Bret Taylor and Larry Summers. This outcome solidified OpenAI's shift toward a traditional Silicon Valley for-profit model, regardless of its official nonprofit charter.
Key Takeaways
- The capped profit model functions more as a PR shield than a financial limit, as the $1.2 trillion threshold is virtually unreachable even for the most profitable companies in history.
- High-density talent retention is tied to Profit Participation Units, meaning the mission of benefiting humanity is structurally at odds with a compensation model that requires aggressive monetization to make employee equity valuable.
- Microsoft executed a strategic move by positioning itself as the lender of last resort for OpenAI's talent, proving that infrastructure and capital provide more governance power than official board seats.
I Wrote a Book on Growing as a Software Engineer
Gergely Orosz has released "The Software Engineer's Guidebook," a comprehensive resource for career progression in tech companies and startups. The book covers the full career trajectory from junior developer to staff and principal levels. It distinguishes itself by balancing technical skills with soft skills like communication, influence, and strategic thinking. With 135,000 words across 413 pages, it is nearly double the length of typical non-fiction books. The content includes modern engineering practices such as AI coding tools, post-commit code reviews, developer portals, and advanced deployment strategies like canary deploys and feature flagging. The author structured the book into parts based on seniority levels: Developer, Senior, Tech Lead, and Staff or Principal. This project took four years to complete and was self-published to maintain the author's specific voice after several traditional publishers passed or requested too many changes. Writing "The Pragmatic Engineer" newsletter alongside the book allowed the author to update sections in real time, ensuring the advice remains relevant for today's fast-moving tech environment. The book addresses common challenges faced by engineers at fast-growing startups and large tech firms, offering mental models for things that are usually unwritten. It covers technical topics like debugging, testing, refactoring, and productionizing systems alongside organizational topics like performance reviews and team dynamics. Industry leaders like Tanya Reilly from Squarespace and James Stanier from Shopify have praised the book for demystifying the unwritten rules of the tech industry. Specific technical subsections explore logging, architecture debt, secure development, and multi-tenancy. The guidebook serves as a reference for engineers looking to increase their impact and navigate the complexities of the modern tech industry. 
It also includes bonus materials for newsletter subscribers, such as a 100-page PDF of additional chapters and templates. The physical book is available in multiple international markets, including the US, UK, Europe, India, and Australia, with plans for ebook and audiobook versions in the future.
Key Takeaways
- The book fills a gap in the market by focusing on how to be a better engineer within a business context, rather than just focusing on writing better code.
- Career progression beyond the senior level depends heavily on non-technical skills like influencing peers, owning career growth, and navigating company politics.
- Modern engineering efficiency is increasingly tied to mastering the toolbelt, which now includes AI assistants, developer portals, and sophisticated deployment workflows.
- The author's decision to self-publish highlights a trend of experts prioritizing creative control and direct distribution over traditional publishing prestige.
Three Cloud Providers, Three Outages: Three Different Responses
In 2023, AWS, Azure, and Google Cloud all hit regional outages, giving us a clear look at how their infrastructure and communication styles stack up. The triggers were diverse: GCP dealt with a fire and water leak in Paris, AWS had a Lambda capacity failure in its massive us-east-1 region, and Azure suffered a fiber cut in the Netherlands caused by a major storm. The way these providers define a region is a major factor in their resilience. AWS is the most rigid, requiring three physically separate Availability Zones (AZs) with their own power and cooling. Azure is more vague, defining regions by latency perimeters rather than physical distance. Google Cloud has the loosest setup, where zones are just logical abstractions. This came back to haunt GCP during the Paris fire: because two zones were actually running out of the same building, the fire knocked out the whole region. This proves that logical separation does not help if the physical hardware is in the same room. When it comes to talking to customers, Azure is currently winning on transparency. They now promise preliminary reports within three days and final reviews within 14 days. They even host live video retrospectives where their engineers explain what went wrong. AWS takes the opposite approach. They rarely publish public postmortems anymore, choosing instead to share details privately with big customers through Technical Account Managers or private health dashboards. This seems like a defensive move to protect their top market spot and keep competitors from using outages as sales leverage. For engineering teams, the big takeaway is that you cannot always trust a green status page. AWS often hides smaller outages from the public dashboard, only showing them to impacted users internally. As Azure plays offense by being radically open, they are setting a new standard for cloud accountability that might force the others to follow suit.
Key Takeaways
- Physical infrastructure beats logical abstraction: GCP's regional failure happened because their independent zones shared a single physical building, a design choice AWS avoids.
- Transparency is a GTM strategy: Azure is using high-accountability reporting and video retrospectives to position itself as the most reliable partner for enterprise SaaS.
- The market leader tax: AWS prioritizes brand protection over public transparency, keeping detailed failure analysis behind a paywall of enterprise support.
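The lesson above about logical zones sharing one building can be expressed as a simple check. This sketch assumes a team has some mapping from zone name to physical facility; the zone and building names and the `shared_facilities` helper are hypothetical, and cloud providers do not necessarily expose this information.

```python
from collections import defaultdict
from typing import Dict, List


def shared_facilities(zone_to_facility: Dict[str, str]) -> Dict[str, List[str]]:
    """Return facilities that host more than one zone, with the zones they host."""
    zones_by_facility: Dict[str, List[str]] = defaultdict(list)
    for zone, facility in zone_to_facility.items():
        zones_by_facility[facility].append(zone)
    # Any facility with two or more zones is a single physical point of failure.
    return {
        facility: sorted(zones)
        for facility, zones in zones_by_facility.items()
        if len(zones) > 1
    }


if __name__ == "__main__":
    # Hypothetical region layout: two "independent" zones share a building.
    layout = {"zone-a": "building-1", "zone-b": "building-1", "zone-c": "building-2"}
    print(shared_facilities(layout))
```

A non-empty result means that spreading replicas across those zones would not survive an incident such as a fire in the shared building.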
Lessons from Bootstrapped Companies Founded by Software Engineers
Bootstrapping is becoming a primary path for entrepreneurial software engineers as venture capital funding declines following the end of zero percent interest rates. This analysis profiles five successful bootstrapped firms: Fern Creek Software, Ticket Tailor, Formspree, Friendly Captcha, and Secta Labs. These companies were all founded or cofounded by engineers, many with Big Tech backgrounds, and have grown into profitable businesses without external investment. Most of these ventures began as side projects or low-risk experiments, such as Ticket Tailor starting as a way for a freelance developer to stop selling his time and Secta Labs emerging from GenAI image model experiments. A common thread is the 'de-risked' leap: founders often waited until the business could support them before leaving full-time employment. The technical approaches across these firms favor pragmatism over technical elegance. Most rely on monolith architectures and 'boring' but proven technologies like PHP, Python, Go, and Postgres. For instance, Ticket Tailor processes 19% of Eventbrite's global volume using a modularized PHP monolith, while Formspree avoids complex control planes like Kubernetes to maintain speed. Engineering cultures in these environments are distinct from Big Tech, often skipping mandatory code reviews and product manager roles to achieve delivery speeds up to 10x faster than larger organizations. While these companies typically stay small (under 30 people) and offer lower total compensation than Big Tech, they provide significantly more flexibility, remote-first cultures, and stability during economic downturns because they lack investor pressure for hypergrowth.
Key Takeaways
- Pragmatism is a forced constraint for bootstrapped firms: they choose boring technology and monoliths because they cannot afford the luxury of over-engineering for hypothetical hypergrowth scenarios.
- The side project to full-time pipeline is the most effective de-risking strategy: successful founders typically wait for revenue to hit a survival threshold before quitting their day jobs.
- Efficiency is driven by the removal of corporate layers: by eliminating PMs and mandatory reviews, small engineer-led teams often outpace VC-funded competitors in shipping speed.
- Bootstrapped growth is intentionally reactive: these companies only hire when the pain of being understaffed is undeniable, leading to leaner, more stable organizations that can better withstand market shifts.
Cloud Development Environment Vendors - by Gergely Orosz
Cloud development environments (CDEs) are moving from niche tools to the engineering standard. The landscape features 28 solutions, including 23 vendors and 5 open-source projects. Market growth spiked between 2021 and 2022, fueled by remote work, monorepos, and security needs. Gitpod leads the pack with a Dedicated plan that lets customers use their own infrastructure. StackBlitz uses a unique approach, running a micro-OS in WebAssembly inside the browser to skip cloud VMs. DevZero, built by former Uber engineers, focuses on production-symmetric environments and supports all IDEs via SSH. Crafting extends Kubernetes to manage memory spikes during builds, serving high-growth firms like Faire. Microsoft dominates with GitHub Codespaces and Dev Box, leveraging Windows and Office 365 ecosystems. Google Cloud Workstations provides a lower-cost alternative with strong security. Amazon's Cloud9 has struggled due to poor documentation and a web-first UI that fell behind VS Code. The Bring Your Own Infrastructure (BYOI) model is now the standard for mid-market and enterprise deals to avoid lock-in and use existing cloud discounts. Latency is the main technical hurdle, usually solved by placing VMs in regions near developers. Security teams in finance use CDEs to control open-source contributions and prevent data leaks. The industry is hitting an inflection point where remote setups will likely outpace local development at major tech companies within a few years.
Key Takeaways
- The Bring Your Own Infrastructure (BYOI) model is the key revenue driver for enterprise CDE adoption. It solves security hurdles and lets companies use their existing cloud compute discounts.
- Latency is the biggest barrier to adoption. StackBlitz solves this with browser-based WebAssembly, while others rely on multi-region VM placement to stay close to the user.
- CDEs are moving beyond code editors to become production-symmetric environments. Startups like DevZero and Crafting provide ephemeral clusters for CI and staging that mirror production better than any local machine.
Measuring developer productivity? A response to McKinsey, Part 2
Measuring developer productivity through individual metrics often backfires by incentivizing people to game the system. When outcomes like sales quotas or recruitment closes are the only metrics that matter, employees may hoard prospects or make false promises to hit targets, ultimately harming the company. In software engineering, individual performance does not directly predict team success. Just as a soccer team of stars can lose to a cohesive underdog, engineering teams thrive on collaboration that individual metrics often discourage. The plus-minus indicator used in hockey to track goal differentials doesn't translate to software because engineering lacks clear, frequent scoring systems and strict time limits. Engineering leaders should reframe the productivity question from how productive are we to how well are we stewarding the investment. This involves showing business impact, system performance (reliability and speed), and developer effectiveness (ease and satisfaction). Goodhart’s Law applies here: once a measure becomes a target, it ceases to be a good measure. For example, measuring story points or pull requests leads to inflated estimates or fragmented code changes rather than better software. Instead of a factory model, engineering investment should follow an R&D or oil drilling approach. Companies should place small, inexpensive bets on various initiatives and double down on those showing tangible promise. The most effective way to track performance is at the team level, focusing on impact metrics like revenue generated, contribution to profit, or cost reduction. High-performing teams are best identified by hands-on leaders who stay technical enough to spot execution issues rather than relying solely on dashboards. Metrics should be used to debug issues with outcomes rather than serving as the primary incentive structure. 
Capturing impact in a simple format, like a quarterly wiki page of completed projects and their results, is often more effective than complex consultancy frameworks.
Key Takeaways
- Individual incentives frequently work against long-term profitability by discouraging collaboration and encouraging metric hoarding or deceptive practices.
- The plus-minus indicator from sports fails in software because engineering projects lack the clear scoring systems and strict time limits found in games like hockey.
- Treating engineering as a factory leads to the cost center trap; it is more effective to view it as a series of exploratory R&D bets where success is unpredictable but high-value.
- Effective stewardship means reporting on a full picture that includes business impact and developer satisfaction rather than just output volume.
Measuring developer productivity? A response to McKinsey 2
Software engineering productivity cannot be reduced to simple activity metrics without causing significant damage to culture and output. While sales and recruitment have clear outcome-based metrics, applying the same logic to engineering often leads to people gaming the system. For instance, a star salesperson might hoard prospects to hit quotas, or a recruiter might make false promises to close candidates. These behaviors maximize personal gain at the expense of the company's long-term health. The document argues that team performance is far more important than individual performance. Using sports analogies, it notes that a cohesive team often beats a group of superstars. In hockey, the plus-minus stat tracks a player's impact on the team's success while they are on the ice, but software lacks such clear, frequent scoring events. This makes individual measurement in engineering notoriously unreliable. To address executive concerns about engineering costs, the authors suggest a comparison exercise: imagine spending 0% versus 100% on engineering to find the right balance. Instead of viewing engineering as a factory, leaders should treat it like oil drilling or R&D. This involves making small, inexpensive bets and doubling down on what works. Kent Beck provides a framework for leaders: be clear about power dynamics, encourage self-measurement for improvement, and avoid creating incentives around metrics because they inevitably lead to corrupted data. True accountability comes from the weekly delivery of value that customers actually appreciate.
Key Takeaways
- Individual incentives discourage collaboration. When people chase personal quotas, they ignore cross-team opportunities and hoard resources to hit their own targets, which hurts the company's overall profitability.
- Team performance is easier to track and more valuable than individual output. Just like in sports, a group of average players working in sync usually outperforms a disjointed group of senior stars.
- Engineering should be treated as R&D rather than a factory. Executives should treat software investment like exploratory oil drilling, where you fund multiple small bets to find the ones that pay off.
- Incentives tied to metrics corrupt data. Once you attach rewards or status to a specific measure, you lose the ability to get an honest look at what is actually happening in the organization.
Measuring developer productivity? A response to McKinsey
Software engineering productivity cannot be reduced to simple activity metrics without causing systemic harm. Kent Beck and Gergely Orosz argue that McKinsey’s proposed framework focuses heavily on effort and output, using metrics like Developer Velocity Benchmark Index, Contribution Analysis, and Talent Capability. This approach is fundamentally flawed because the act of measurement changes developer behavior. When metrics like survey scores or commit counts are tied to performance reviews and status, engineers inevitably game the system. This leads to a legibility trap where data appears clear to executives but no longer reflects reality. A specific example from Facebook shows how a sentiment survey was eventually rolled up into manager scores and used for performance goals, leading to directors cutting teams based on gamed numbers rather than organizational sense. A key mental model consists of four stages: Effort, Output, Outcome, and Impact. While sales and recruitment teams are held accountable through outcome and impact metrics like revenue or heads filled, engineering often struggles to provide similar clarity. This creates a vacuum that non-technical CEOs and CFOs fill with external consultancy frameworks. Sales leaders can explain a revenue miss and provide a plan for the next quarter, whereas engineering updates often focus on shipped features and technical debt that non-engineers do not understand. The tradeoff in measurement is clear: the earlier in the cycle you measure, the easier it is to collect data but the higher the risk of unintended consequences. Measuring later in the cycle ensures alignment with company goals but makes it difficult to attribute success to specific individuals. To maintain a high-performing culture, leaders should focus on DORA and SPACE metrics or specific goals like pleasing a customer once per week. Relying on McKinsey's effort-based metrics risks damaging engineering culture in ways that take years to repair. 
High-performing teams are those where developers satisfy customers and do not feel measured by senseless metrics that work against solving problems.
Key Takeaways
- Measuring effort or output instead of impact creates a perverse incentive for engineers to prioritize looking busy over solving actual customer problems.
- The transition of a metric from a helpful signal to a performance goal is the exact moment it loses its value and begins to corrupt organizational culture.
- Engineering leaders must proactively define outcome-based metrics to prevent non-technical executives from imposing naive activity-based frameworks.
- High-performing teams succeed by focusing on the end of the value chain, specifically how their work changes customer behavior and generates business value.
Measuring developer productivity? A response to McKinsey
McKinsey recently claimed to have a methodology for measuring software developer productivity, sparking significant pushback from the engineering community. This response, co-authored by Gergely Orosz and Kent Beck, argues that the McKinsey approach is fundamentally flawed because it focuses on effort and output rather than outcome and impact. To understand why this matters, one must look at the software engineering cycle through a four-stage mental model. Effort includes activities like planning and coding. Output consists of tangible items like features or design documents. Outcome is the resulting change in customer behavior. Impact is the ultimate value returned to the business, such as revenue or referrals. The primary issue with measuring effort or output is that it inevitably changes developer behavior in ways that can harm the organization. For example, when Facebook turned developer sentiment surveys into performance metrics, managers began negotiating scores with reports to ensure high ratings, rendering the data useless. Similarly, when Uber introduced a dashboard tracking diff counts, engineers started creating many small, unnecessary diffs to boost their numbers, which significantly increased continuous integration costs without improving the product. In contrast, departments like sales and recruitment are held accountable through outcome and impact metrics, such as revenue targets or positions filled. Engineering leaders often struggle to provide similar clarity, leading CEOs and CFOs to seek external frameworks like McKinsey's. However, measuring earlier in the cycle is easier but leads to gaming the system. The authors suggest that high-performing teams should instead focus on delivering at least one customer-facing improvement per week or meeting specific business impact commitments. 
While attributing individual contributions to broad business goals is difficult, it is preferable to the cultural damage caused by micro-measuring lines of code or commit frequency. The goal is to create an environment where developers solve customer problems rather than optimizing for metrics.
Key Takeaways
- Measuring developer activity acts as a behavioral intervention that often incentivizes engineers to prioritize metric optimization over actual product value.
- The Attribution Paradox shows that while impact metrics like profit align the whole company, they make individual performance tracking difficult, tempting leaders to use flawed effort-based metrics instead.
- Engineering must bridge the accountability gap with non-technical executives by adopting outcome-based metrics similar to those used in sales and recruitment.
Building a Simple Game
This technical guide explores the transition from traditional software engineering to game development using the Unity engine. It breaks down core concepts like GameObjects, which serve as the fundamental entities in a game, and Scenes, which act as levels or containers for these objects. A major focus is placed on Prefabs, which are reusable templates for objects like coins or enemies, allowing for efficient runtime instantiation. The article explains that Unity programming relies on C# scripts inheriting from MonoBehaviour. This framework often bypasses standard object-oriented principles by using reflection to trigger lifecycle events like Start and Update instead of traditional method overrides. Performance is a central theme, specifically the challenge of maintaining 60 frames per second, which leaves only about 16 milliseconds (1,000 ms / 60) per frame for all logic and rendering to complete. The guide includes a practical tutorial for building a coin collection game, demonstrating player movement via keyboard input and collision detection using triggers. It also addresses the complexities of threading, noting that while Unity is multi-threaded, it is not thread-safe, requiring specific approaches like the Job System or ECS for parallel processing. Finally, it advocates for using the MVC pattern to keep game logic separate from visual representation, ensuring better maintainability and easier debugging.
Key Takeaways
- Unity uses a unique execution model in which the engine calls script lifecycle methods via reflection, so empty lifecycle methods left in the code can add performance overhead.
- The shift from SaaS development to game development involves moving from a continuous maintenance mindset to a performance-critical build phase where every millisecond of the frame loop counts.
- Object pooling is a vital optimization technique because creating and destroying GameObjects at runtime is memory intensive and can cause noticeable frame rate drops.
- Applying MVC architecture prevents the common mistake of bundling all game logic into a single player script, making the codebase much easier to scale and debug.
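The object pooling takeaway above can be illustrated outside Unity. The following is a minimal sketch in Python, not Unity or C# code: the `Coin` class and `ObjectPool` helper are hypothetical names introduced for illustration and do not come from the article or the Unity API. The idea is that objects are allocated once up front, then deactivated and handed back for reuse instead of being created and destroyed during play.

```python
from dataclasses import dataclass


@dataclass
class Coin:
    """Hypothetical stand-in for a pooled game object."""
    x: float = 0.0
    y: float = 0.0
    active: bool = False


class ObjectPool:
    """Pre-allocates objects once and recycles them instead of re-creating them."""

    def __init__(self, factory, size):
        self._factory = factory
        self._free = [factory() for _ in range(size)]
        self.created = size  # total number of objects ever allocated

    def acquire(self):
        # Reuse a free object if one exists; allocate only when the pool is empty.
        if self._free:
            obj = self._free.pop()
        else:
            obj = self._factory()
            self.created += 1
        obj.active = True
        return obj

    def release(self, obj):
        # Deactivate the object and return it to the pool rather than destroying it.
        obj.active = False
        self._free.append(obj)


pool = ObjectPool(Coin, size=2)
first = pool.acquire()
pool.release(first)
second = pool.acquire()  # the released instance is handed out again
```

In this sketch `pool.created` stays at the initial size as long as demand never exceeds the pool, which is the property that keeps allocation work out of the frame loop.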
Interesting Learning from Outages (Real-World Engineering Challenges #10)
Outages provide expensive but vital lessons for engineering teams. While most companies keep incident reviews internal, public postmortems help restore customer trust and educate the broader tech community. Internal versions are usually the most detailed, while public ones are edited to remove jargon and confidential data. Adevinta dealt with a month-long investigation into intermittent 5xx errors. They cycled through suspects like ingress controllers and logging agents before identifying a DNS issue. The root cause was a combination of low concurrent query limits in their DNS cache, specific internal requests not being cached, and a flood of non-existent DNS requests. This highlights why reliable service level indicators (SLIs) are essential. If indicators are flaky, teams stop trusting them, which turns critical data into noise. GitHub experienced a brief outage during a live failover test. They were validating a second internet edge facility when a network pathing configuration error was exposed. Even though the test caused the downtime, the exercise was successful in identifying a single point of failure before a real disaster occurred. Regular failover and failback testing is necessary to ensure systems can actually handle regional data center failures. It is better to trigger a controlled two-minute rollback than to find out your backup site is broken during a genuine emergency. Reddit faced a five-hour outage during a Kubernetes upgrade. The team had to decide between live debugging or a full production restore. After two hours of failed fixes, they committed to the restore process, which they found to be incredibly stressful despite previous simulations. The failure was traced to inconsistent, bespoke infrastructure configurations that had accumulated as the company grew. This infrastructure debt is a common byproduct of autonomous teams moving fast, but it eventually requires a dedicated effort to standardize environments and prevent systemic failures.
Key Takeaways
- Reliable SLIs are the foundation of fast incident response because inconsistent indicators lose their utility and lead teams down the wrong path during high-pressure investigations.
- Controlled failover testing is worth the risk of brief downtime since discovering a configuration mismatch during a planned exercise allows for a quick two-minute mitigation instead of a catastrophic failure during a real disaster.
- Production restores are never as smooth as simulations and the psychological stress of a full restore often leads to hesitation, meaning teams should expect to improvise even with a documented backup plan.
- Infrastructure debt is the hidden cost of team autonomy where bespoke configurations help teams move fast initially but eventually become a primary source of outages during routine upgrades.
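To make the SLI point concrete, here is a minimal sketch of an availability indicator, assuming the common definition of an SLI as good events divided by total events and treating any 5xx response as a bad event. The function name `availability_sli` and the sample status codes are illustrative assumptions, not taken from any of the postmortems.

```python
def availability_sli(status_codes):
    """Fraction of requests that did not end in a 5xx server error."""
    if not status_codes:
        return None  # no traffic means no signal, not 100% availability
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


# Illustrative sample: two of ten requests fail with a server error.
codes = [200, 200, 503, 200, 404, 500, 200, 200, 200, 200]
sli = availability_sli(codes)
```

Returning `None` for an empty window is a deliberate choice in this sketch: reporting a perfect score when no data arrived is one way an indicator becomes flaky and loses the team's trust.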
A new way to measure developer productivity – from the creators of DORA and SPACE
Measuring developer productivity has historically relied on activity counts like pull requests or commits, but these often miss the mark. A new framework developed by the creators of DORA and SPACE shifts the focus toward Developer Experience (DevEx). This approach emphasizes three core dimensions to understand what actually drives productivity. While DORA measures delivery performance and SPACE provides a broad productivity lens, this new framework prioritizes perceptual data, which includes the attitudes and opinions of developers themselves. The research team includes industry experts like Nicole Forsgren, Margaret-Anne Storey, Abi Noda, and Michaela Greiler, who have spent years advising companies like Microsoft and GitHub. System data from telemetry provides precision for things like build times and on-call ticket volume. However, it cannot capture whether a developer feels blocked by a slow review process or if the feedback they receive is high quality. Surveys allow organizations to gather data on these human factors much faster than instrumenting new system metrics. The researchers argue that developers are the users of the engineering system. A user-centric approach is required to improve software delivery. Companies like eBay and Pfizer have already begun implementing these survey-based approaches to gain a more holistic view of their engineering health. The transition from measurement to improvement often fails because teams feel the metrics are inactionable. To solve this, the authors suggest a constraints-based approach. This involves identifying the specific roadblocks that cause the most frustration. This might involve asking what people are swearing at most often to find the real bottlenecks. Implementing this effectively requires expertise in psychometrics to ensure surveys are valid and reliable. Validity ensures the survey measures what it intends to, while reliability ensures consistent results over time. 
Leading tech companies like Google and Microsoft have used these survey-based methods for years to maintain high engineering efficiency and reduce developer burnout.
Key Takeaways
- Perceptual data often reveals friction points that system telemetry misses, such as the quality of code reviews versus just the speed of completion.
- Treating developers as the primary users of internal systems allows leaders to apply user-centric design principles to engineering workflows.
- A constraints-based approach is necessary to turn high-level metrics like DORA into actionable improvements by targeting specific developer blockers.
- Reliable survey design is a technical discipline that requires strict wording consistency and statistical validation to be useful for long-term tracking.
Inside DataDog’s $5M Outage (Real-World Engineering Challenges #8)
Datadog experienced its first global outage on March 8, 2023, resulting in over 24 hours of downtime and $5 million in lost revenue. The incident was triggered by a routine security update to the Ubuntu 22.04 operating system. Specifically, a patch for systemd vulnerabilities, including CVE-2022-3821 and CVE-2022-4415, caused the systemd process to re-execute itself. This restart cascaded to sub-processes like systemd-networkd, which inadvertently cleared network routes. These routes were critical for Cilium, the eBPF-based container routing control plane managing Datadog's Kubernetes clusters. The outage became global because Datadog's base OS images had a legacy security update channel enabled. This channel triggered updates automatically across all regions and cloud providers, including AWS, GCP, and Azure, within the same one-hour UTC window. This synchronized update broke regional control planes simultaneously, creating a circular dependency where the infrastructure needed to repair the system was itself offline. Recovery efforts varied significantly by cloud provider. Some providers simply rebooted unhealthy nodes, allowing for faster recovery of stateful workloads. Others automatically replaced nodes, leading to data loss and a thundering herd effect that hit regional rate limits. Datadog eventually prioritized live data and alerts over historical backfilling to restore core functionality. Post-incident analysis highlighted significant communication failures. While Datadog declared the incident quickly, updates were often repetitive and lacked substance for the first 14 hours. Furthermore, the company was slow to release a public postmortem, initially sharing it only with select customers. To prevent recurrence, Datadog disabled automatic OS updates, moved to manual rollouts, and modified systemd-networkd configurations to preserve Cilium routes during updates. 
The incident serves as a case study in the risks of synchronized automated updates and the importance of maintaining independent regional control planes.
Key Takeaways
- Multi-cloud redundancy fails when shared configurations exist. Even with three different providers, the identical OS update policy created a single point of failure that bypassed regional isolation.
- Circular dependencies remain a critical infrastructure risk. When the control plane managing the network relies on that same network to function, a minor routing error can escalate into a total system lockout.
- Communication strategy is as vital as technical recovery for brand trust. Datadog's inconsistent postmortem distribution and generic status updates damaged customer relationships, proving transparency must be centralized.
- Automated security updates require staggered rollout schedules. The decision to move to manual, controlled rollouts highlights that for massive fleets, the risk of a synchronized failure outweighs the benefits of instant patching.
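The staggered rollout idea in the last takeaway can be sketched as a schedule that gives every region its own update time. This is a minimal illustration under stated assumptions: the region names, the start time, the 24-hour gap, and the `staggered_schedule` helper are all hypothetical and do not describe Datadog's actual tooling.

```python
from datetime import datetime, timedelta, timezone


def staggered_schedule(regions, start, gap):
    """Assign each region its own update time, `gap` apart, so no two regions patch together."""
    return {region: start + i * gap for i, region in enumerate(regions)}


# Hypothetical region names and times, for illustration only.
regions = ["region-a", "region-b", "region-c"]
start = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
schedule = staggered_schedule(regions, start, gap=timedelta(hours=24))

for region, when in schedule.items():
    print(region, when.isoformat())
```

With a gap between regions, a bad patch surfaces in the first region while the others are still healthy, which is the isolation the synchronized one-hour window removed.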
The Full Circle on Developer Productivity with Steve Yegge
Steve Yegge's career trajectory highlights a persistent obsession with developer productivity and the tools that enable it. Starting at GeoWorks in 1992, he experienced a high-performance Assembly debugger that set a lifelong benchmark for developer nirvana. At Amazon, he learned the art of delivery as a Technical Program Manager (TPM) and witnessed Jeff Bezos's famous mandate for service-oriented architecture, which transformed Amazon into a platform company. His 13-year tenure at Google provided insights into bottom-up innovation during its golden age, though he later criticized the company for creating barriers between engineers and customers. Yegge notes that while tech giants like Google and Amazon build incredible internal tools, they are structurally incapable of building these tools for the external market. This gap led him to Sourcegraph, where he now addresses the Big Data challenge of modern codebases. He argues that code has become as unmanageable as data lakes were a decade ago, requiring a code intelligence platform rather than just a traditional IDE. Sourcegraph operates as a flashlight in this dark cavern of massive repositories, using AI and machine learning to enable large-scale refactoring and better code reviews. The discussion also touches on the customer obsession mindset. Yegge advocates for engineers to spend time in call centers or sitting with users to turn rough edges of a product into sharp edges they feel compelled to fix. He emphasizes that the most successful engineering cultures prioritize this empathy and invest heavily in the intangible results of superior developer tooling.
Key Takeaways
- Large-scale codebases have evolved into a Big Data problem that traditional IDEs cannot solve, necessitating a centralized intelligence layer for search and refactoring.
- The TPM muscle is essential for senior engineers because it forces them to manage cross-functional complexity beyond just writing code.
- Google's process-heavy evolution created a disconnect between engineers and Cloud customers, a trap that high-growth startups must avoid by maintaining direct customer contact.
- World-class developer tools are rarely built for the public by big tech firms because they lack the appetite for the long tail of external integrations.
Inside Uber’s move to the Cloud: Part 1 - by Gergely Orosz
Uber operated its own data centers for nearly a decade before announcing a long-term shift to Google Cloud and Oracle in early 2023. Initially, Uber outsourced its infrastructure to Peak Hosting in 2013, running on roughly a dozen Dell PowerEdge servers. By 2014, the company began building its first internal data center, SJC in San Jose, followed by DCA near Washington DC. This strategic choice mirrored moves by Google and Facebook, driven by rapid growth and available capital. However, maintaining private infrastructure presented significant hardware hurdles. Uber transitioned from OEM suppliers like Dell to ODM manufacturing with vendors like Quanta, Wiwynn, and Foxconn around 2018 to capture economies of scale. This shift introduced quality control issues, as Uber lacked the massive engineering staff required to manage custom firmware and BIOS testing compared to giants like Meta or AWS. SSD failures were common due to low Drive Writes Per Day (DWPD) ratings on cheaper drives. The push toward the cloud intensified under CEO Dara Khosrowshahi. The COVID-19 pandemic highlighted the rigidity of private data centers: while Rides demand plummeted, energy and maintenance costs remained fixed, creating waste that cloud elasticity might have mitigated. Financially, the shift from Capital Expenditure (CapEx) to Operating Expense (OpEx) became attractive for cash flow management and investor predictability. Acquisitions also played a role. Postmates, acquired in 2020, demonstrated significantly lower infrastructure costs as a percentage of revenue while running on AWS. Furthermore, internal software tools like the Schemaless storage platform began lagging behind commercial offerings like Google Spanner and Looker. While Uber developed the Crane project over five years to facilitate a hybrid cloud approach, the move signifies a broader trend where only a handful of companies find it economically viable to operate at the physical hardware layer.
Key Takeaways
- The move from OEM to ODM hardware demands a huge internal engineering team for firmware and quality checks, which Uber found hard to sustain compared to hyperscalers like Meta or AWS.
- Postmates acted as an internal case study, proving that cloud-native setups could hit better cost-to-revenue ratios than Uber's own private data centers.
- Moving from CapEx to OpEx is a financial play for public market stability, helping manage cash flow during unpredictable events like the pandemic.
- Proprietary software like Schemaless eventually becomes a burden because commercial tools like Google Spanner often move faster and offer more features than internal teams can build.
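The summary above mentions SSD failures linked to low Drive Writes Per Day (DWPD) ratings. As a rough illustration of what that rating means, DWPD is the number of full-capacity writes a drive is rated to absorb each day over its warranty period, so rated endurance scales as DWPD × capacity × 365 × warranty years. The capacities, ratings, and warranty length below are hypothetical values chosen for the arithmetic, not figures from Uber.

```python
def rated_terabytes_written(dwpd, capacity_tb, warranty_years):
    """Total data a drive is rated to absorb: full-drive writes per day over the warranty period."""
    return dwpd * capacity_tb * 365 * warranty_years


# Hypothetical drives of equal capacity, differing only in endurance rating.
low_endurance = rated_terabytes_written(dwpd=0.3, capacity_tb=1.92, warranty_years=5)
high_endurance = rated_terabytes_written(dwpd=3.0, capacity_tb=1.92, warranty_years=5)

print(f"0.3 DWPD: {low_endurance:.1f} TB written")
print(f"3.0 DWPD: {high_endurance:.1f} TB written")
```

Under these assumed numbers the cheaper drive is rated for a tenth of the writes, so a write-heavy workload can exhaust it well before the warranty period ends.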
Behind the Scenes with React.js: the Documentary
React.js is currently the dominant web framework, but its origins at Facebook were far from certain. This documentary, produced by filmmaker Ida Lærke Bechtle and funded by the developer job platform Honeypot, chronicles the library's journey from an internal project called FBolt to a global standard. The film features interviews with pivotal figures including creator Jordan Walke, Dan Abramov, and Christopher Chedeau. It highlights how React was initially an underdog story, facing internal skepticism before being open sourced in 2013. The production involved extensive travel to tech hubs like San Francisco, London, and Boston to capture the perspectives of the original core team. Despite concerns about its hour-long runtime, the documentary gained over 250,000 views in its first week on YouTube, proving a strong appetite for high-quality, long-form technical storytelling. The project also serves as a case study for community-focused marketing, as Honeypot funded the film to give back to the developer ecosystem while building brand awareness. The premiere took place at JSWorld in Amsterdam, followed by a series of intimate screenings across Europe.
Key Takeaways
- React's success was a bottom-up movement rather than a top-down corporate directive, showing how grassroots engineering projects can disrupt established standards.
- The documentary proves that technical audiences value deep, long-form storytelling on platforms like YouTube when the production quality and narrative arc are high.
- Strategic content sponsorship by companies like Honeypot demonstrates a shift toward giving back to open-source communities as a viable alternative to traditional lead generation.
Real-world Engineering Challenges #8: Breaking up a Monolith
Khan Academy successfully transitioned a one-million-line Python monolith into approximately 40 Go services over a 3.5-year period. The primary drivers were the end-of-life for Python 2 and the need for better performance and cost efficiency. They adopted a federated GraphQL architecture, replacing their legacy REST endpoints. A key strategy was the field-by-field migration, where individual data fields were moved to new services while others remained in the monolith, enabled by a GraphQL federation hub. This allowed for side-by-side testing where both systems were called, results compared, and differences logged before fully switching traffic. The project was split into two phases. The first focused on the Minimum Viable Experience (MVE), which prioritized features essential to the site's identity like content delivery and user management. This phase took two years and handled 95% of traffic. The second phase, Endgame, involved migrating internal tools and remaining features. Despite the common preference for agile, the team treated this as a fixed-scope, fixed-timeline project with a massive burndown chart. This approach ensured they met their hard deadline, finishing just four days early. The move to Go resulted in significant performance gains, with service hour costs dropping by up to 10x compared to Python. However, the transition was not without friction. The team had to learn Go from scratch, leading to initial mistakes and a temporary halt in new feature development. This caused some attrition in product and design roles. Ultimately, the project succeeded by maintaining a rhythm of incremental shipping and enforcing a strict rule that only one service could own and write any specific piece of data.
Key Takeaways
- The Minimum Viable Experience (MVE) framework is more effective than a traditional MVP for migrations because it focuses on preserving core brand identity and essential user flows rather than building a barebones product.
- Fixed-scope and fixed-timeline management can outperform agile methodologies for large-scale migrations where the problem space is well-defined and the primary goal is execution rather than discovery.
- The up-to-10x reduction in service hour costs demonstrates the ROI that moving from an interpreted language like Python to a compiled language like Go can deliver for high-traffic platforms.
- Long-running technical migrations create significant innovation debt that can lead to attrition in non-engineering functions like product and design due to the lack of new feature development.
- Enforcing a strict data ownership rule where only one service is permitted to write to a specific data set is critical for maintaining system integrity during a distributed services transition.
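The side-by-side testing step described in the summary (call both systems, compare results, log differences) can be sketched as follows. This is a minimal illustration, not Khan Academy's implementation: the `side_by_side` helper, the field names, and the dictionary-backed fetchers standing in for the monolith and a new service are all hypothetical.

```python
import logging

logger = logging.getLogger("side_by_side")


def side_by_side(field, legacy_fetch, new_fetch, mismatches):
    """Call both implementations, record any difference, and keep serving the legacy result."""
    legacy_value = legacy_fetch(field)
    try:
        new_value = new_fetch(field)
    except Exception as exc:  # a failure in the new service must not break the request
        mismatches.append((field, legacy_value, repr(exc)))
        logger.warning("new service failed for %s: %r", field, exc)
        return legacy_value
    if new_value != legacy_value:
        mismatches.append((field, legacy_value, new_value))
        logger.warning("mismatch for %s: %r != %r", field, legacy_value, new_value)
    return legacy_value


# Hypothetical stand-ins for the monolith and a new service.
legacy = {"title": "Algebra basics", "duration": 300}.get
new = {"title": "Algebra basics", "duration": 301}.get

mismatches = []
title = side_by_side("title", legacy, new, mismatches)
duration = side_by_side("duration", legacy, new, mismatches)
```

The sketch always returns the legacy value, so users see no change while the mismatch log accumulates evidence; traffic for a field would only be switched once that log stays empty.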
The Pragmatic Engineer in 2022 - by Gergely Orosz
Gergely Orosz summarizes a year of deep-dive reporting for The Pragmatic Engineer, a period that saw the publication of over 100 issues totaling 550,000 words. The review highlights how the newsletter tracked the dramatic shift in the tech industry from a boiling hiring market to a tech winter characterized by widespread layoffs and rescinded offers. Key reporting included the Atlassian outage, the collapse of Fast, and the onset of hiring freezes at Meta and Apple. Orosz notes that his Scoop series often served as a leading indicator, identifying trends like the Big Tech hiring slowdown weeks or months before mainstream outlets like the New York Times. The year's most popular content focused on practical engineering challenges such as shipping to production, managing complex migrations, and choosing technology stacks. The review also points to deep dives into the engineering cultures of Meta, Amazon, and Uber, providing a rare look at internal processes, career ladders, and performance calibrations. For engineering leaders, the newsletter expanded its library of templates covering oncall compensation, layoff preparation, and promotion checklists. Looking toward 2023, the focus shifts toward developer productivity, security engineering, and how startups can operate more efficiently with smaller budgets.
Key Takeaways
- Independent reporting on internal company dynamics often predicts major market shifts like hiring freezes and valuation cuts long before they hit the mainstream news.
- The industry is pivoting from a growth-at-all-costs mindset to a focus on engineering efficiency and measuring productivity accurately without resorting to vanity metrics.
- Understanding the distinction between profit centers and cost centers is becoming a critical factor for software engineers navigating career growth during a market downturn.
The Staff Engineer’s Path: You’re a Role Model Now (Sorry)
The staff engineer role represents a technical leadership path that differs from management. While the manager's path is well-documented, the staff engineer's journey is often ambiguous. A core aspect of this seniority is becoming an involuntary role model. Because people assume staff engineers know what they are talking about, their casual comments can be mistaken for project mandates. Engineering culture is ultimately defined by what staff engineers do rather than what a company writes in its values. If a senior leader takes shortcuts or ignores code reviews, the rest of the team will follow suit. Effective staff leadership requires a shift toward a longer time horizon. Software engineering is essentially programming integrated over time, meaning impact often lasts five to ten years. Planning for this involves telegraphing future changes early, such as announcing system deprecations long before they happen to prevent wasted investment. It also requires tidying up the production environment to ensure future velocity. This includes writing tests, following style guides, and removing traps like dangerous scripts. Creating institutional memory is vital for handling staff turnover. Documenting decision records, system diagrams, and context-rich code comments ensures that the rationale behind a system survives even after the original creators leave. Furthermore, staff engineers must expect and plan for failure through chaos engineering and disaster drills. A major shift in mindset at this level is prioritizing maintenance over creation. Since software is maintained far longer than it takes to build, engineers should optimize for understandability and simplicity. Complex solutions are often a cost to bear rather than a sign of prowess. Building with decommissioning in mind also leads to more modular and maintainable systems. Finally, staff engineers must create space for junior engineers to solve difficult problems, effectively building the next generation of leaders.
Key Takeaways
- Staff engineers exert massive passive influence because their daily actions and shortcuts become the de facto engineering standards for the whole team.
- High-velocity teams prioritize keeping tools sharp by investing in build speed and deployment reliability to reduce the long-term cost of every outage.
- Institutional memory acts as a safeguard against turnover; without written decision records, the rationale for complex systems disappears when key people leave.
- Simplicity is a deliberate engineering choice that reduces maintenance burdens, whereas complexity is often an unnecessary cost disguised as technical prowess.
Real-World Engineering Challenges #7: Choosing Technologies
Trello transitioned from RabbitMQ to Kafka to manage its websocket infrastructure after facing significant reliability issues. RabbitMQ struggled with network partitioning, which forced full resets and caused message loss during failovers. The team evaluated several alternatives including Amazon SNS/SQS, Kinesis, and Redis Streams. They ultimately selected Kafka because it offered the necessary throughput of 2,000 messages per second and proved to be unexpectedly cheaper to operate. Birdie adopted a Micro Frontend architecture to solve the bottleneck of slow automated tests in their monolithic React application. As the app grew, running the full suite of unit, integration, and end-to-end tests for every small change became a major drag on development speed. By breaking the app into independent components wrapped in a shell, they improved developer autonomy and allowed teams to test pieces in isolation. The transition was not without hurdles. It required unpicking complex Redux global state and moving away from cross-feature dependencies that had built up over years of rapid startup growth. MetalBear, a startup building open-source tools for backend developers, standardized on Rust for its core components like the Agent and CLI. The decision was driven by low-level technical requirements such as Linux namespace switching and the need for a small memory footprint without the overhead of a garbage collector. Beyond technical merits, the founders used Rust as a strategic hiring lever. They correctly gambled that using a modern, high-performance language would attract top-tier engineers who were eager to work with Rust professionally. Motive implemented Kotlin Multiplatform Mobile (KMM) to synchronize business logic between its iOS and Android apps. Their iOS version was perpetually lagging behind Android by several months, leading to inconsistent features and logic. 
By sharing code through KMM while maintaining native UIs, they increased development speed by roughly 30 percent. This approach allowed them to keep native performance for map-heavy features while ensuring that core business logic only had to be written and tested once. The recurring theme across these case studies is that technology shifts must solve a large enough pain point to justify the cost. Successful teams often start small by prototyping a single feature or component before committing to a wider rollout.
Key Takeaways
- Technology choices serve as a powerful recruiting tool. MetalBear's use of Rust shows that picking a modern, high-performance language can attract talent even as a small startup.
- Incremental migration beats the big bang approach. Both Birdie and Motive avoided full rewrites, instead opting to build new features in the new architecture while slowly unpicking legacy dependencies.
- Performance and native control often dictate cross-platform choices. Motive chose KMM over Flutter or React Native specifically to maintain native UI performance and leverage existing JVM expertise.
- Operational costs and reliability are the primary drivers for infrastructure pivots. Trello's move to Kafka was not just about features. It was about the high hardware costs and fragility of RabbitMQ at scale.
The Story of Linear as told by its CTO - by Gergely Orosz
Tuomas Artman, CTO and co-founder of Linear, details the company's journey from its inception in 2019 to becoming a profitable, high-efficiency startup. After years at Uber and Groupon, Artman and his co-founders identified a gap in the market: project management tools were built for managers, not the individual contributors (ICs) who actually use them. Linear was designed to prioritize speed, craft, and the IC experience. The company operates with a remarkably small team of 30 people, with 18 focused on product, and achieved profitability early by intentionally hiring slowly and maintaining a high talent bar. Their technical foundation relies on a proprietary sync engine that manages data replication and offline mode, allowing engineers to build complex features without worrying about backend infrastructure or networking. The rest of the tech stack is 'very normal': TypeScript, React, Node, and Postgres running on GCP with Kubernetes. Linear rejects common Big Tech practices like heavy A/B testing and complex user stories, opting instead for 'The Linear Method' which emphasizes plain language tasks, public changelogs, and building in close collaboration with users. Artman highlights that their Big Tech pedigree from Uber, Airbnb, and Coinbase made early fundraising from Sequoia and Index Ventures straightforward, but their success stems from avoiding the 'hypergrowth' headcount trap that often leads to brittle infrastructure and communication overhead.
Key Takeaways
- Infrastructure as a force multiplier: By building a complex sync engine and Kubernetes setup before hitting product-market fit, Linear enabled a tiny team to ship features faster than much larger competitors.
- The hiring speed constraint: Profitability was an accidental byproduct of being extremely picky with talent. Hiring only senior, remote-ready engineers eliminated the need for heavy management and mentoring overhead.
- Product over process: Linear replaces traditional 'user stories' with simple tasks and relies on intuition and craft rather than A/B testing, which they view as a funnel-optimization tool rather than a product-building tool.
- Strategic use of pedigree: The founders used their Big Tech backgrounds to de-risk the investment for VCs, allowing them to raise money on favorable terms without even needing a formal pitch deck initially.
Leaving big tech to build the #1 technology newsletter | Gergely Orosz (The Pragmatic Engineer)
Gergely Orosz explains his transition from an engineering manager at Uber to running the top technology newsletter on Substack. He details the financial reality of leaving a high-paying tech job, noting that his current earnings exceed his Uber total compensation of $330,000. The newsletter, The Pragmatic Engineer, reached 189,000 subscribers with growth accelerating to nearly 1,000 new signups daily. This success is attributed to six years of consistent blogging and the introduction of Substack's recommendation engine. He operates a two-post-per-week schedule: a deep-dive technical piece on Tuesdays and a market-focused "Scoop" on Thursdays. To maintain this output, he uses strict productivity hacks like blocking social media via host files and using the Centered app. He emphasizes that credibility is the foundation of a successful newsletter, advising aspiring writers to build deep expertise in a field before attempting to teach others. While the life of a solo creator offers freedom and an empty calendar, it also brings loneliness and the constant pressure of a never-ending production cycle. He views his work as a one-person business rather than just content creation, focusing on long-term sustainability over quick exits. He also touches on the "Fisherman's Story" as a motivation for choosing this path over the traditional venture-backed startup route.
Key Takeaways
- The Fisherman's Story highlights that many founders grind for years to reach a lifestyle they could achieve now through independent writing and consulting.
- Public deadlines act as a necessary forcing function for productivity when you no longer have a corporate manager or structured schedule to keep you on track.
- Credibility is non-negotiable: people pay for insights derived from real-world experience at scale, like Uber or Skype, rather than generic reporting or interviews.
- Newsletters offer a unique "daily raise" mechanism: every new paid subscription earned by high-quality content immediately increases recurring revenue.
- A newsletter is a one-person business that is difficult to exit because the value is tied to the individual, unlike a traditional tech startup.
The State of Frontend in 2022 - by Gergely Orosz
The State of Frontend 2022 survey, with over 3,700 respondents from 125 countries, highlights a significant maturation in frontend engineering. Remote work has become the standard, with 56% of engineers working fully remotely and only 5% remaining in-office. The data shows a shift toward larger frontend teams, with 50% of respondents working at companies with 10 or more frontend engineers. Engineering practices once reserved for the backend are now mainstream: 75% of frontend developers write unit tests, 79% use continuous integration (CI), and 80% follow code review processes. There is a strong correlation between these practices, suggesting that teams adopting one typically adopt all three to manage increasing codebase complexity. TypeScript has established itself as the de facto language of the field, used by 84% of respondents. This shift toward static typing reflects a desire to reduce production errors caused by JavaScript's weak type system. In the framework landscape, React remains dominant at 76% usage, while Next.js has seen a meteoric rise to 43%. Other frameworks like Vue and Svelte are seeing stable or slightly declining interest. For developer tools, Visual Studio Code has effectively won the market, used by the vast majority of developers. Microsoft's strategic position is remarkably strong, as it now owns the primary language (TypeScript), the editor (VS Code), the repository hosting (GitHub), and the package manager (npm) used by the community. In the vendor space, Vercel is gaining significant momentum, growing from 6% to 25% market share in two years and overtaking Netlify. This growth is driven by the popularity of Next.js and Vercel's focus on edge computing. Conversely, Gatsby is struggling to maintain relevance despite significant funding, largely due to past performance issues and the dominance of Next.js. 
While AWS remains the most popular cloud provider overall, specialized frontend hosting platforms like Vercel and Netlify are capturing a large portion of the deployment market, particularly for tech-first companies.
Key Takeaways
- Microsoft has become the silent enabler of the entire web development ecosystem by owning the core tools including TypeScript, VS Code, GitHub, and npm.
- The rise of Next.js demonstrates a successful 'framework-to-infrastructure' funnel where Vercel uses a popular open-source tool to drive adoption of its hosting and edge computing services.
- Frontend engineering is undergoing a professionalization phase where complex business logic and rigorous DevOps practices like CI/CD and unit testing are now standard requirements.
- There is an emerging trend toward 'unified frontend' teams where companies consolidate web and mobile development under a single stack like React and React Native to increase efficiency.
- Gatsby serves as a cautionary tale of how performance issues and a slow response to market shifts can lead to a loss of developer mindshare even with significant venture backing.
Real-World Engineering Challenges #6: Migrations
Engineering migrations are a critical but often overlooked aspect of scaling software systems. Successful migrations at companies like Box, Pinterest, and Stripe follow a rigorous phased approach to ensure zero downtime and data integrity. A common technical pattern involves dual writing, where data is written to both the old and new systems simultaneously. This usually starts with asynchronous writes to minimize latency impact, followed by data backfilling and validation. Once the new system is verified, the team switches to synchronous dual writes before finally making the new system the primary source of truth and retiring the legacy infrastructure. Box specifically highlighted the importance of performance validation, choosing SSDs over HDDs for their Cloud Bigtable move after discovering a 20x difference in latency. Beyond technical execution, large-scale migrations involving hundreds of engineers require significant coordination and cultural alignment. LinkedIn manages this by treating migrations as horizontal initiatives, focusing on clear documentation, automated tooling, and progress dashboards to reduce friction for participating teams. Spotify takes this further by product-ifying migrations. They assign product managers to lead these efforts, ensuring the value is communicated clearly to stakeholders and using gamification, such as leaderboards, to encourage completion. Automation is also a major factor at Spotify, where automated pull requests are used to handle component upgrades across the organization. Safety mechanisms are essential when migrations involve both client and server changes. DoorDash utilized kill switches and magic cookies during their move from a monolith to microservices for session management. These tools allowed for immediate rollbacks and manual testing in production without impacting the broader user base. 
Measuring business metrics during the rollout, such as conversion rates, provides data-driven guardrails that take the guesswork out of when to ramp up traffic to a new service. Ultimately, the most successful migrations treat other engineers as customers, providing them with the support and tools needed to transition without disrupting their primary product work.
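The phased dual-write pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used by Box, Pinterest, or Stripe: the `Phase` enum, the `DualWriteStore` class, and the in-memory dictionaries standing in for the old and new databases are all hypothetical.

```python
from enum import Enum


class Phase(Enum):
    OLD_ONLY = 1     # before the migration starts
    ASYNC_DUAL = 2   # write old synchronously, new in the background
    SYNC_DUAL = 3    # write both synchronously, old is still primary
    NEW_PRIMARY = 4  # new system is the source of truth


class DualWriteStore:
    """Hypothetical wrapper that routes reads and writes by migration phase."""

    def __init__(self):
        self.old = {}      # stand-in for the legacy database
        self.new = {}      # stand-in for the new database
        self.pending = []  # writes not yet applied to `new`
        self.phase = Phase.OLD_ONLY

    def write(self, key, value):
        if self.phase is Phase.NEW_PRIMARY:
            self.new[key] = value
            return
        self.old[key] = value
        if self.phase is Phase.ASYNC_DUAL:
            # Deferred so the new system's latency never blocks the caller.
            self.pending.append((key, value))
        elif self.phase is Phase.SYNC_DUAL:
            self.new[key] = value

    def drain_pending(self):
        """Apply queued async writes (a background worker in a real system)."""
        while self.pending:
            key, value = self.pending.pop(0)
            self.new[key] = value

    def backfill(self):
        """Copy rows written before dual writes began; keep newer data intact."""
        for key, value in self.old.items():
            self.new.setdefault(key, value)

    def validate(self):
        """Return keys whose values differ between the two systems."""
        return sorted(k for k in self.old if self.new.get(k) != self.old[k])

    def read(self, key):
        source = self.new if self.phase is Phase.NEW_PRIMARY else self.old
        return source.get(key)
```

A migration would advance `phase` one step at a time and only move past `SYNC_DUAL` once `validate()` returns an empty list.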
Key Takeaways
- Zero-downtime data migrations require a multi-step phased approach that separates asynchronous and synchronous dual writes to protect system availability.
- The most effective large-scale migrations treat the process as a product, utilizing product managers to handle stakeholder communication, value alignment, and user feedback.
- Automation is the primary lever for scaling migrations in autonomous organizations, as seen with Spotify's use of automated pull requests for version upgrades.
- Safety in complex client-server migrations depends on robust rollback mechanisms like kill switches and the use of experimentation frameworks to monitor business metrics in real-time.
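To make the kill-switch idea concrete, here is a small sketch of a percentage ramp that can be switched off instantly. It is not DoorDash's implementation: `RolloutFlag`, `handle_session`, and the hash-based bucketing are assumptions chosen for illustration.

```python
import hashlib


class RolloutFlag:
    """Hypothetical feature flag with a percentage ramp and a kill switch."""

    def __init__(self, percent=0):
        self.percent = percent  # share of users sent to the new service (0-100)
        self.killed = False     # when True, everyone goes back to the old path

    def kill(self):
        self.killed = True

    def uses_new_path(self, user_id):
        if self.killed:
            return False
        # Stable hash so a given user always lands in the same bucket.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.percent


def handle_session(user_id, flag, new_service, legacy_service):
    """Route to the new service when allowed; fall back to legacy otherwise."""
    if flag.uses_new_path(user_id):
        return new_service(user_id)
    return legacy_service(user_id)
```

Raising `percent` in steps while watching a business metric such as conversion rate, and calling `kill()` if it drops, mirrors the guardrail approach the summary describes.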
My learnings a year into writing a paid newsletter
Gergely Orosz reflects on the first year of The Pragmatic Engineer, which became the top technology newsletter on Substack. He transitioned from an engineering manager role at Uber to full-time writing after identifying a gap in the market for practical, in-depth software engineering insights. The publication reached over 150,000 total subscribers and hit 1,000 paid subscribers within just six weeks of launch. A core pillar of the business is strict editorial independence, refusing all sponsorships, ads, or affiliate links to ensure the reader remains the primary customer. The content strategy evolved from a single weekly deep dive to a multi-column format. Tuesdays focus on long-form educational articles, while Thursdays feature "The Scoop," a news-oriented column providing exclusive industry insights. Orosz notes that his background in Big Tech allows him to bridge information gaps for those outside that bubble. He highlights that in-depth, niche writing often outperforms broad content and that his long-standing engineering blog served as the primary driver for paid conversions, surpassing social media platforms like Twitter and LinkedIn. Reader feedback indicates a strong preference for deep dives into engineering cultures at companies like Facebook and Amazon, alongside practical topics like incident reviews and project leadership. Looking ahead, the newsletter will expand into related fields like data engineering and ML, incorporate guest writers, and implement a formal holiday policy to maintain sustainable production. The business model proves that niche, high-quality technical content can thrive independently of mainstream media structures.
Key Takeaways
- Niche authority beats broad reach because high-quality, deep-dive technical content can build a sustainable business without mainstream media backing or broad-market appeal.
- Owned assets drive the most value since long-term personal blogs proved to be more effective at converting paid subscribers than social media platforms like Twitter or LinkedIn.
- Independence acts as a competitive advantage by explicitly rejecting sponsorships, which builds higher trust and allows for a focus on depth that creates a loyal subscriber base.
- Identifying the information gap around Big Tech culture allowed Orosz to find product-market fit: he explains how things work inside that bubble to the broader engineering community.
Real-World Engineering Challenges #5 - by Gergely Orosz
This analysis covers technical architectures and management frameworks from several high-growth companies. Shopify's payment infrastructure emphasizes resilience through idempotency keys to prevent double charges and circuit breakers like Semian to stop cascading failures. Capacity planning is treated as a prerequisite for production readiness, focusing on QPS and resource utilization. Grab handles millions of food orders by separating transactional and analytical queries. They use DynamoDB as an OLTP database for high write loads and strong consistency, while leveraging MySQL RDS as an OLAP database for read-intensive analytical tasks. This separation allows them to manage hot-key traffic and maintain stability during peak hours. Yelp's analytics infrastructure team highlights the unique constraints of platform engineering, where code must be initialized first but cannot depend on client repositories. Their onpoint rotation system manages the high volume of internal support requests. Instagram's recommendation engine functions as a two-stage information retrieval system. It first generates candidates using embeddings and co-occurrence matrices for high recall, then selects content through ranking algorithms like multi-task multi-label neural nets and gradient-boosted decision trees. Airbnb streamlined incident management by building a custom Slack bot that integrates PagerDuty and JIRA, allowing engineers to declare incidents and page responders without switching tools. Finally, Honeycomb proposes an Engineering Manager Bill of Rights to address the lack of support for the management track compared to individual contributors. This framework advocates for transparent feedback loops, market-aligned compensation, and clear advancement paths to prevent talented managers from reverting to IC roles due to burnout or lack of clarity.
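The idempotency-key technique mentioned for payments can be illustrated with a short sketch. It is not Shopify's code: `PaymentProcessor` and its in-memory storage are hypothetical, and a production system would persist keys durably and guard against concurrent requests that carry the same key.

```python
class PaymentProcessor:
    """Hypothetical in-memory processor that deduplicates by idempotency key."""

    def __init__(self):
        self.results = {}  # idempotency key -> stored response
        self.charges = []  # every charge actually executed

    def charge(self, idempotency_key, customer_id, amount_cents):
        # A retry with a key we have already seen returns the stored
        # response instead of charging the customer a second time.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        charge_id = len(self.charges) + 1
        self.charges.append((charge_id, customer_id, amount_cents))
        response = {"charge_id": charge_id, "status": "succeeded"}
        self.results[idempotency_key] = response
        return response
```

The client generates the key once per logical payment and reuses it on every retry, so a timeout followed by a resend cannot produce a double charge.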
Key Takeaways
- Separating OLTP and OLAP workloads is essential for high-growth companies to maintain transactional integrity while allowing for complex data analysis without performance degradation.
- Platform teams face a visibility gap where their most critical work is invisible and hard to validate, requiring specific internal feedback loops and sample apps to ensure stability.
- Effective recommendation systems rely on a two-stage funnel of broad candidate generation followed by precise ranking, using both positive and negative engagement signals to tune Bayesian models.
- The unwinnable game of engineering management can be mitigated by decoupling manager compensation from IC tracks and ensuring leadership provides the same level of transparency they expect from their teams.
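The two-stage funnel can be sketched as follows. This is a toy illustration, not Instagram's system: cosine similarity over small vectors stands in for embedding-based candidate generation, and a hand-weighted sum stands in for the neural and gradient-boosted rankers. All function names, features, and weights are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def generate_candidates(user_vec, item_vecs, k):
    """Stage 1: cheap, recall-oriented lookup of the k most similar items."""
    by_similarity = sorted(
        item_vecs,
        key=lambda item: cosine(user_vec, item_vecs[item]),
        reverse=True,
    )
    return by_similarity[:k]


def rank(candidates, features, weights):
    """Stage 2: score only the shortlisted items with a stand-in model."""
    def score(item):
        return sum(weights[name] * value for name, value in features[item].items())

    return sorted(candidates, key=score, reverse=True)
```

Giving a negative weight to a signal such as a predicted "hide" probability lets the ranking stage push down items that looked similar in stage one but are likely to be disliked.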
The Pragmatic Engineer: Year One - by Gergely Orosz
This collection indexes the first year of The Pragmatic Engineer newsletter, spanning August 2021 to August 2022. During this period, Gergely Orosz published 86 issues totaling approximately 500,000 words. The content is categorized to help engineering managers and senior engineers navigate challenges at Big Tech and high-growth startups. A significant portion of the archive focuses on engineering approaches, including project management best practices, incident reviews, and the use of RFCs and design documents. Notably, it explores why many Big Tech companies avoid Scrum in favor of more flexible, engineer-led project management styles. The archive features deep dives into the engineering cultures of major players like Amazon, Facebook, and Uber. These articles provide granular details on internal operations, such as the platform and program split at Uber or the specific engineering philosophies that drive Facebook. Another core theme is the evolution of the tech hiring market. Through The Scoop series, the newsletter tracks the industry's move from a perfect storm of aggressive hiring in late 2021 to the tech winter characterized by freezes and layoffs at companies like Klarna, Netflix, and Microsoft in 2022. Career development is also a major focus, with articles covering the seniority rollercoaster, the transition from individual contributor to engineering manager, and strategies for remote compensation. The index includes contributions from other industry experts on topics like engineering productivity and payment system design. For those in leadership, the collection offers guidance on hiring diverse teams, managing attrition, and preparing for annual planning. The first year also produced 29 templates and resources designed to provide practical inspiration for software engineers in their daily workflows.
Key Takeaways
- The archive captures a pivotal moment in tech history, documenting the rapid shift from a candidate-driven hiring market to widespread industry layoffs and budget freezes.
- Effective engineering organizations often replace rigid frameworks like Scrum with document-heavy cultures relying on RFCs, ADRs, and design docs to maintain speed and clarity.
- Success in senior engineering roles depends heavily on soft technical skills, particularly writing, project leading, and understanding the business distinction between profit and cost centers.
- Platform engineering is treated as a strategic necessity for scaling, with specific organizational structures like the platform/program split being used to manage developer productivity.
Preparing for Layoffs in Tech - by Gergely Orosz
Layoffs should be treated as a tool of last resort because they cause deep, often permanent damage to company culture and psychological safety. Beyond the immediate loss of staff, companies frequently experience a secondary wave of attrition where top performers leave due to broken trust or survivor guilt. Productivity typically drops for months as remaining employees feel unsafe and distracted. In some regions, particularly Europe, the legal complexity and mandatory consultation periods make layoffs a multi-year strategic decision rather than a quick cost-cutting fix. Before reaching for headcount reductions, leaders should exhaust alternatives. These include cutting non-essential spend like travel and perks, instituting hiring freezes, disallowing backfills, and postponing promotions. More aggressive measures include replacing cash bonuses with stock or implementing temporary pay cuts for senior leadership. If layoffs are unavoidable, the most effective strategy is to cut once and cut deep to avoid the morale-killing cycle of multiple small rounds. Execution requires a meticulous communication plan that prioritizes internal staff over the press. Leaders must own the decision personally and avoid making the announcement about their own emotional difficulty. Providing a goodbye window on internal systems like Slack allows departing staff to trade contact details and say farewell, which preserves some humanity in the process. Post-layoff support should include generous severance, extended benefits, and active job placement assistance through talent hubs or investor introductions. Case studies like Deliveroo show that transparency about selection criteria and frequent Q&A sessions can help a company eventually recover, whereas poor execution at firms like Klarna or Better.com leads to long-term reputational damage.
Key Takeaways
- The survivor effect often triggers the departure of high performers who have the most market mobility and the least tolerance for a perceived breach of the employment contract.
- US-based leaders frequently underestimate the regulatory hurdles in Europe, where mandatory consultations can turn an intended quick cut into a multi-month process that effectively halts all work.
- Transparency regarding selection criteria, such as stack ranking or specific business unit pivots, is more effective at preserving trust than vague or performance-masked explanations.
- Small logistical choices, like providing a 48-hour Slack access window or creating an external alumni talent hub, significantly impact long-term employer branding and the mental health of the entire team.
Inside the Longest Atlassian Outage of All Time
In April 2022, Atlassian experienced its most significant outage to date, impacting around 400 companies and between 50,000 and 800,000 users. The incident disabled core services like Jira, Confluence, and OpsGenie for over a week. The root cause was a faulty maintenance script intended to retire a legacy plugin. Instead of marking specific data for deletion, the script ran in a permanent delete mode with an incorrect list of IDs, effectively wiping out entire customer sites. This mistake was compounded by the fact that many of these customers were in the middle of migrating to the cloud at Atlassian's urging. Restoration proved exceptionally slow because Atlassian lacked the automation to selectively restore individual tenants within a shared environment. While they could restore the entire cloud to a previous state, doing so would have wiped out fresh data for the 99.8% of customers who were not affected. This forced engineering teams to perform manual restorations in small batches of about 60 tenants at a time. Each batch took several days to verify, resulting in a recovery timeline of up to two weeks for some organizations. The handling of the incident drew heavy fire for a lack of transparency and executive ownership. Atlassian remained largely silent for the first eight days, providing only generic status updates that lacked technical depth. Customers found themselves in a catch-22 where they could not even open support tickets because the system did not recognize their deleted domains. This communication breakdown was particularly damaging because it left technical decision-makers at client companies unable to provide answers to their own internal stakeholders or defend the vendor's reliability. From a business perspective, the timing was poor. Atlassian was actively retiring its Server product and pushing users toward the Cloud, using reliability as a primary selling point. 
The outage gave competitors like Linear and PagerDuty an opening to win over frustrated customers by offering free service or highlighting their own uptime. The event serves as a critical case study in the importance of having a disaster recovery runbook that includes clear communication protocols and the technical capability for granular data recovery in multi-tenant SaaS environments.
Key Takeaways
- Multi-tenant recovery requires granular tools. Having a global backup is insufficient if you cannot restore specific customers without impacting the rest of the user base.
- Communication is a technical requirement. When a vendor goes silent or provides vague updates, they undermine the CTOs and engineering managers who championed their product.
- The SaaS trap is real. Companies realized they had no independent backups for mission-critical data stored in Jira and Confluence, leaving them completely at the mercy of Atlassian's manual recovery process.
- Reputational damage outlasts technical fixes. Failing to follow their own published incident management standards created a trust gap that competitors will use in sales pitches for years.
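The difference between a global rollback and granular recovery is easy to see in code. The sketch below is a simplified assumption, not Atlassian's tooling: `live` and `snapshot` are hypothetical dictionaries mapping a tenant id to that tenant's site data.

```python
import copy


def restore_tenants(live, snapshot, tenant_ids):
    """Restore only the listed tenants from a snapshot.

    Tenants not listed keep their current (possibly newer) data.
    Returns the ids that were absent from the snapshot and could not be restored.
    """
    missing = []
    for tenant_id in tenant_ids:
        if tenant_id not in snapshot:
            missing.append(tenant_id)
            continue
        # Deep copy so later edits to live data never mutate the backup.
        live[tenant_id] = copy.deepcopy(snapshot[tenant_id])
    return missing
```

Replacing `live` wholesale with `snapshot` would bring the deleted tenants back too, but it would also discard every write that unaffected tenants made after the snapshot was taken, which is the trade-off the summary describes.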
The Scoop: Inside Fast’s Rapid Collapse - by Gergely Orosz
Fast, a one-click checkout startup, collapsed in April 2022 despite raising a $102M Series B led by Stripe. The company went from a $500M valuation to total shutdown in just six days after reports revealed it generated only $600K in revenue for 2021 while burning $10M per month. Fast hired aggressively from Big Tech by offering top-of-market base salaries between $200K and $240K and massive sign-on bonuses. They used a specific equity spreadsheet to show candidates potential multi-million dollar payouts based on a hypothetical $12B valuation. Internal warning signs existed but were often ignored. Daily revenue was frequently below $6K, and the platform served fewer than 500,000 button renders per day. Engineering leadership received daily sales summaries showing these low numbers, but the broader workforce remained unaware of the impending bankruptcy. The CEO, Dominic Holland, prioritized a high-volume sales strategy targeting small merchants, which required heavy engineering customization for minimal return. A hiring freeze was quietly instituted in January 2022 by a new CFO, yet the CEO continued to publicly boast about growth. When the Series C funding failed to materialize, the company ran out of cash. Following the collapse, Affirm negotiated to hire about 100 of Fast's 150 engineers. This failure serves as a case study in reckless capital burn and the risks of joining startups without verifying runway, revenue, and core business metrics.
Key Takeaways
- The company prioritized vanity growth metrics like headcount over sustainable unit economics, leading to a $10M monthly burn on negligible revenue.
- A flawed GTM strategy focused on small merchants created an engineering bottleneck because each client required custom integration work for very low transaction volume.
- Leadership maintained a facade of success through lavish perks and misleading equity projections, even after a quiet hiring freeze was instituted in early 2022.
- The collapse highlights the importance of reverse interviewing founders and demanding transparency regarding runway, burn rate, and real-time business metrics before joining a late-stage startup.
Real-World Engineering Challenges #4 - by Gergely Orosz
DAZN organizes its platform engineering into four distinct areas: Cloud Engineering, Developer Experience (DX), SRE, and Core Services. While this setup improves collaboration and treats developers as customers, the team noted a significant gap in service metadata. They lacked a centralized way to track service ownership, on-call details, and dependencies. Razorpay addressed scaling pains by forming a Frontend Platform team. As their engineering headcount tripled, they struggled with inconsistent designs and complex release cycles. They developed a Nirvana workflow to standardize the entire lifecycle from project setup to production monitoring. This includes building custom CLI scaffolding, automated versioning, and observability tooling to ensure a consistent developer experience. Gojek tackled the challenge of food delivery ETAs using a regression machine learning system called Tensoba. To improve accuracy, they divided the delivery process into three segments: the time from booking to driver arrival (T1), arrival to pickup (T2), and pickup to final delivery (T3). By modeling these separately, they reduced estimation errors by 23%, which directly impacts customer retention. LinkedIn manages over 1,000 microservices using a system called Hodor. This tool detects service overload and automatically remediates it through load shedding. Hodor consists of detectors, a load shedder, and an adapter that translates request data into a generic format. Their core philosophy is to do no harm, ensuring that traffic is only dropped when absolutely necessary to prevent total service failure. Pinterest migrated 3,000 workflows from their legacy Pinball system to an Airflow-based platform called Spinner. The key to their success was building a custom UI migration tool. This allowed engineering teams to move their workflows in minutes without writing custom code, demonstrating the value of internal tooling for massive infrastructure shifts.
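The segment-based ETA idea can be shown as a simple composition of three predictors. This is only a sketch: the three model functions below are hand-written formulas standing in for trained regressors, and the feature names (`driver_distance_km`, `items`, `delivery_distance_km`) are assumptions, not Gojek's actual inputs.

```python
def predict_eta(order, t1_model, t2_model, t3_model):
    """Sum three independently modeled segments into one delivery ETA (minutes)."""
    t1 = t1_model(order)  # booking -> driver arrives at the restaurant
    t2 = t2_model(order)  # driver arrival -> food picked up
    t3 = t3_model(order)  # pickup -> delivered to the customer
    return {"t1": t1, "t2": t2, "t3": t3, "total": t1 + t2 + t3}


# Stand-in "models": fixed formulas in place of trained regressors.
def t1_model(order):
    return 2.0 + 3.0 * order["driver_distance_km"]


def t2_model(order):
    return 4.0 + 1.5 * order["items"]


def t3_model(order):
    return 1.0 + 3.0 * order["delivery_distance_km"]
```

Keeping the segments separate means each model can use the features that matter for its leg, and errors can be measured per segment rather than only on the total.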
Key Takeaways
- Internal developer platforms are shifting toward a developer as customer model, prioritizing guidance over strict governance.
- Breaking down complex estimation problems into chronological phases (T1, T2, T3) allows for more granular and accurate ML model performance.
- Custom self-service migration tools are essential for moving thousands of workflows without overwhelming engineering resources.
- Automated load shedding systems like Hodor prevent cascading failures in microservice architectures by prioritizing system stability over total traffic.
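A minimal load shedder illustrates the "do no harm" principle: requests are rejected only while the service is judged overloaded. This is not Hodor's design. The `LoadShedder` class is hypothetical, and a fixed concurrency limit is assumed here as the overload detector purely for illustration.

```python
class LoadShedder:
    """Hypothetical shedder that rejects requests only while overloaded."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.dropped = 0

    def overloaded(self):
        # Detector stand-in: a fixed limit on concurrent requests.
        return self.in_flight >= self.max_in_flight

    def try_acquire(self):
        """Admit the request if there is capacity; otherwise shed it."""
        if self.overloaded():
            self.dropped += 1
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Mark a request as finished, freeing capacity for the next one."""
        self.in_flight = max(0, self.in_flight - 1)
```

A request handler would call `try_acquire()` before doing work, return an error such as HTTP 503 when it gets `False`, and call `release()` in a `finally` block once the request completes.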
War in Ukraine - and Its Impact on the Tech Industry
The Russian invasion of Ukraine on February 24, 2022, represents a massive geopolitical shift with immediate consequences for the international technology sector. Ukraine has long been a primary destination for senior engineering talent due to its high skill levels and flexible contracting norms. Major firms like EPAM, Softserve, and GlobalLogic employ tens of thousands of developers in the region, serving global giants such as Google, Microsoft, and Barclays. While large companies may absorb the impact through distributed teams, startups and firms with concentrated Ukrainian hubs face significant operational halts. Corporate responses have been swift and varied. Companies like Wix, AppsFlyer, and Fiverr initiated evacuations before the invasion, while others like Hopin and Lemon.io provided emergency funds, salary advances, and job security for employees joining the defense forces. A notable trend is the rise of corporate sanctions, where tech companies like Apple, Intel, AMD, and even smaller startups like MessageBird and Omnisend have voluntarily suspended services or sales in Russia and Belarus, often moving faster than government mandates. The conflict is triggering a great tech migration. Ukrainians are fleeing to neighboring EU countries like Poland, which has seen over 124,000 arrivals in the first three days. Simultaneously, a Russian tech brain drain is accelerating as professionals flee economic collapse and political repression, seeking visas in Spain or Dubai. The devaluation of the Ruble and exclusion from the SWIFT payment system are primary drivers for this exodus. For the broader European tech market, this influx of highly skilled talent presents both a humanitarian challenge and a significant hiring opportunity if immigration hurdles are lowered. This migration could fundamentally alter the competitive landscape of the European software industry for the next decade.
Key Takeaways
- Tech firms are now active geopolitical actors, implementing voluntary sanctions that often exceed government mandates and prioritize values over immediate economic interests.
- The crisis exposes the deep reliance of Silicon Valley on Ukrainian engineering hubs, forcing a strategic re-evaluation of geographic concentration in technical teams.
- A permanent shift in the European tech landscape is likely as talent migrates from Ukraine, Russia, and Belarus toward more stable EU hubs like Poland.
- The rapid devaluation of the Ruble and SWIFT exclusions are driving an unprecedented brain drain of Russia's most employable technical professionals.
What TPMs Do and What Software Engineers Can Learn From Them
The Technical Program Manager (TPM) role is a critical function in high-growth tech companies, acting as a force multiplier for large-scale engineering efforts. While Product Managers focus on the why and what, and Engineering Managers focus on the how, TPMs own the when and who. They manage complex, cross-team initiatives that do not fit into a single product or platform team. Key responsibilities include leading long-running projects like GDPR compliance, managing large-scale migrations, and handling technical debt that spans multiple departments. TPMs often serve as a right hand to the CTO, overseeing engineering strategy, branding, and operational processes like incident management or developer onboarding. The role originated at Microsoft and has evolved as startups reach a size, typically 50 or more engineers, where coordination becomes a bottleneck. Unlike traditional project managers, TPMs require deep technical knowledge to understand system constraints, reliability, and architectural trade-offs. For engineers and managers, working with a TPM is a partnership. TPMs handle the breadth of a project by syncing with other teams, updating leadership, and resolving deadlocks. This allows engineers to focus on depth and delivery. Senior engineers can learn organizational influence and program management from TPMs to advance their own careers.
Key Takeaways
- TPMs act as force multipliers by preventing large-scale programs from stalling, which is common when dozens of teams and hundreds of people are involved.
- The role fills a coordination gap that naturally appears when a company scales beyond a few teams and the overhead of cross-functional work exceeds what engineering managers can handle.
- Technical depth is a requirement for success because a strong TPM must understand system architecture and trade-offs to influence engineering decisions and represent teams to leadership.
- TPM is a common career path for outliers who transition from engineering, testing, or consulting and enjoy solving hard technical problems through people and process.
Real-World Engineering Challenges #3 - by Gergely Orosz
Roblox suffered a 73-hour outage in October 2021 that affected 50 million daily users. The failure originated in their HashiStack infrastructure, specifically within Consul, which serves as their service discovery and service mesh solution. Because Nomad and Vault both depended on Consul, the entire system collapsed. Investigation revealed two primary root causes: a Consul streaming feature that blocked writes under high load and a bug in the BoltDB persistence library that failed to delete old log entries. A critical strategic error was that Roblox's telemetry system was built in-house and depended on the very infrastructure that failed, leaving engineers flying blind for nearly two days. This highlights the necessity of keeping monitoring systems independent from core production infrastructure. At Amazon, Principal Engineers (PEs) represent only about 3% of the engineering workforce. They act as the glue for a company that often operates like a collection of chaotic startups. Amazon's PE community is unique for its company-wide design review process and high level of influence compared to peers at Google or Microsoft. Meanwhile, Uber manages payments fraud through a system called RADAR. This hybrid model combines AI pattern monitoring with human intervention for new threats. Interestingly, Uber uses Jira to create audit trails for human decisions, ensuring explainability in fraud mitigation. LinkedIn manages content abuse through a massive cross-functional effort involving twelve different teams. Their funnel starts with AI models making filtering decisions within 300ms, followed by human reviews of flagged content and user reports. The success of this system relies on setting specific metrics for different layers of the funnel rather than just focusing on the technical build. Finally, GitHub Mobile maintains a weekly release cadence by clearly mapping which steps are automated and which require manual intervention, providing a blueprint for mobile teams to identify automation opportunities.
Key Takeaways
- Telemetry systems must remain decoupled from the primary infrastructure to prevent total visibility loss during major outages.
- Using standard project management tools like Jira for fraud audit trails provides a pragmatic way to achieve explainability in complex AI systems.
- Defining specific metrics for different stages of a funnel is more critical for system success than the underlying technical architecture.
- Principal Engineering roles at scale function less as individual contributors and more as organizational glue that bridges disparate business units.
Gergely Orosz | Substack
This collection of essays and deep dives by Gergely Orosz explores the inner workings of major tech companies and the rapid evolution of software engineering in the age of AI. A significant portion of the content focuses on how companies like Google, Meta, Amazon, and Uber structure their engineering teams, manage performance, and maintain reliability at massive scale. It highlights a shift from peacetime growth to a more wartime focus on efficiency following the end of the zero interest rate period (ZIRP). The material provides a detailed look at the emerging AI engineering stack, specifically the rise of agentic workflows and tools like Cursor and Claude Code. It introduces concepts like vibe coding, where developers use natural language and intuition to guide AI, and the Model Context Protocol (MCP) as a new standard for connecting AI models to local data and tools. There is a strong emphasis on the practical impact of these tools, noting that while they boost productivity for senior engineers, they also introduce new risks like sloppy software and increased token costs that CTOs are now struggling to manage. Beyond AI, the archive covers essential SaaS and GTM topics, including the trimodal nature of tech compensation, the role of growth engineering, and the challenges of scaling infrastructure like AWS S3 or WhatsApp. It also touches on the fractional leadership trend, offering insights into how experienced executives are navigating the current market. The content serves as a tactical guide for anyone looking to understand how the best engineering organizations operate and how the craft of coding is being fundamentally rewritten by automation.
Key Takeaways
- AI is shifting the engineer's role from a writer of code to an orchestrator of agents, which requires a deeper focus on system design and human judgment rather than syntax.
- The end of the ZIRP era has forced a move toward flatter organizations and player-coach leadership models, making efficiency a core engineering metric.
- Protocols like MCP are becoming the essential glue for the agentic era, allowing AI tools to interact safely and effectively with complex internal environments.
- High-growth startups are increasingly using AI to disrupt established moats, as seen with engineers rewriting entire frameworks in days using agentic tools.
Holiday Tech and Business Book Recommendations
This collection features over 100 book recommendations curated specifically for software engineers, engineering managers, and tech leaders. The list is categorized into distinct domains including engineering management, leadership and organization design, career growth, technical interviews, product strategy, and general business. High-quality books are presented as a way to access years of hard-earned experience compressed into a few hours of reading, providing depth that short-form content often lacks. Key additions for 2022 include The Staff Engineer's Path by Tanya Reilly, which addresses the transition to staff-level roles, and Effective Software Testing by Maurício Aniche. In the management category, titles like An Elegant Puzzle by Will Larson and The Manager's Path by Camille Fournier are highlighted for their practical insights into high-growth tech environments. For organizational design, Team Topologies and The DevOps Handbook are noted for their impact on how teams are structured and how they deliver software. The list also dives into specific technical domains, recommending Designing Data-Intensive Applications for distributed systems and API Security in Action for modern security design. Product and strategy recommendations focus on creating successful tech products, featuring Marty Cagan's Empowered and Inspired, alongside strategy essentials like 7 Powers and Good Strategy/Bad Strategy. Leadership and business books round out the collection, emphasizing psychological safety, communication, and negotiation with titles such as The Fearless Organization, Radical Candor, and Never Split the Difference. Each recommendation includes a brief testimonial or review snippet to explain its relevance to tech professionals.
Key Takeaways
- The emergence of dedicated literature for staff-level roles like The Staff Engineer's Path indicates that technical leadership is now recognized as a distinct career track requiring specific non-coding skills.
- Modern engineering management is shifting focus toward organizational design and team dynamics, with frameworks like Team Topologies becoming essential for scaling high-growth startups.
- There is a growing crossover between engineering and product management, where understanding customer discovery through books like The Mom Test is vital for engineers to avoid the build trap.
- Strategic business frameworks such as the 7 Powers are increasingly relevant for technical leaders who need to align engineering efforts with long-term company defensibility and monetization.
Working with Product Managers: Advice from PMs
Product development functions best when engineering and product management operate as a unified partnership, often described as a Yin and Yang relationship. Ebi Atawodi, a product leader at Netflix, highlights that engineers must see themselves as owners of the product and its outcomes. This means moving beyond just writing code to being customer-obsessed, understanding business context, and tracking key metrics. A shared North Star vision, built collaboratively, ensures both sides are excited about the same goals. Ross McNairn from TravelPerk points out that silos often form when teams distinguish between technical and product roadmaps. He argues that every engineering task, including tackling technical debt, should be communicated through its impact on the user experience or development speed. By owning the "how" of the operation, including planning and estimation, engineers free up PMs to focus more deeply on user needs and market analysis. This operational ownership keeps engineers closer to the problem space and prevents them from becoming mere order-takers. The importance of context is a recurring theme. Leaders like Juan Pablo and Lizzie Matusov urge engineers to constantly ask "why" for every proposal. This empathy for the user allows engineers to anticipate how features will actually be used, leading to more resilient code. Dipti Desai notes that PMs should be leveraged as a resource for understanding internal dynamics and business objectives. Shreef from Ankorstore emphasizes that the best partnerships involve investing in each other's success through regular 1:1s and daily updates. Execution requires high levels of transparency and communication. Krishna Nandakumar suggests that engineers should present a range of solutions, from quick hacks to highly scalable architectures, while over-communicating on blockers. Kyle Johnson recommends a practice where EMs and PMs explain each other's domains in their own words to verify alignment. Finally, Martijn Visser and Willem Spruijt advise that technical debt and migrations must be framed in terms of customer impact or sales influence to be prioritized effectively. When both roles maintain visibility into decision-making, the partnership becomes a multiplicative force.
Key Takeaways
- The most effective teams eliminate the boundary between technical and product roadmaps by tying every engineering task to a specific customer or business outcome.
- Engineers who take full ownership of the operational "how" create a strategic advantage by allowing their PM counterparts to focus exclusively on market discovery and user research.
- Asking "why" is a technical requirement because deep empathy for user needs directly informs better architectural and coding decisions.
- Framing technical debt as a risk to user experience or team velocity makes it a shared priority rather than a hidden engineering burden.
The Scoop: the Hiring Market - by Gergely Orosz
The tech industry reached a critical inflection point regarding remote work as office return dates from major players like Google, Amazon, and Apple were repeatedly delayed throughout 2021. This uncertainty led many engineers to move permanently, forcing companies to either embrace remote-first policies or lose senior talent to competitors like Shopify and Twitter. The shift is particularly visible in the attrition struggles at Amazon and AWS. Historically, Amazon used a frugality-based compensation model that prioritized equity over cash, assuming steady stock growth. However, as Amazon's stock price flattened and market demand for senior engineers spiked, many employees left for higher cash offers elsewhere. Amazon's notoriously stressful culture, including a mandatory 5-6% Performance Improvement Plan (PIP) target for managers, further exacerbated these attrition woes. To combat this, Amazon has resorted to aggressive counteroffers known as Dive and Save situations and significantly higher sign-on packages for new hires. In Europe, a specific hiring opportunity has emerged due to rigid pay structures at legacy multinationals like Mercedes, BMW, and Bosch. These companies often tie software engineering salaries to union-negotiated bands (such as EG 12 or 13 in Germany), which prevents them from matching the 20-35% market increases seen globally. Because these organizations cannot raise pay for engineers without raising it for all employees in the same grade, they are losing high-quality talent to remote-first startups and tech-native firms. This creates a strategic opening for hiring managers to poach senior engineers from sectors like automotive, pharma, and manufacturing where technology is not the core business. As the market remains heated, companies that can adjust compensation quickly and offer permanent remote flexibility hold a significant advantage in acquiring top-tier engineering talent.
Key Takeaways
- The repeated delay of office returns transformed remote work from a temporary measure into a permanent lifestyle choice for senior engineers, making it a non-negotiable benefit for retention.
- Amazon's frugality-led compensation model failed when stock growth stagnated, proving that equity-heavy packages are a liability in a high-demand, flat-market environment.
- Rigid, union-backed pay scales in European legacy industries create a massive talent arbitrage opportunity for agile tech companies that can offer market-rate salaries.
- The mandatory PIP culture at Amazon acts as a significant push factor for talent when external market compensation significantly outpaces internal rewards.
- Hiring managers can find high-quality, undervalued senior talent by targeting large multinationals where software is a secondary department rather than the core product.
Hiring (and Retaining) a Diverse Engineering Team
Building diverse engineering teams requires a dual focus on inclusive recruitment and long-term retention. Leaders from organizations like the Financial Times, Stripe, and SAP highlight that visible diversity in leadership is a primary driver for attracting underrepresented talent. At the Financial Times, Sarah Wells notes that the share of women and non-binary engineers rose from 5% to 35% through inclusive job descriptions that avoid "rockstar" terminology and by partnering with non-traditional coding routes like Makers Academy. Retention is supported by employee groups like FT Embrace and FT Women, alongside transparent promotion boards that use data to identify pay or advancement biases. Samuel Adjei emphasizes that diverse leaders naturally attract diverse pipelines because they understand the specific hurdles minorities face. He suggests evaluating candidates as individuals rather than comparing them to traditional benchmarks like specific universities or bootcamps. Uma Chingunde, formerly of Stripe, advocates for "structured opportunities." This involves creating formal processes for project leads or management roles instead of "tapping someone on the shoulder," which often favors the majority. This structure provides legitimacy to underrepresented hires and prevents the "tokenism" stigma. Other successful tactics include being lenient at the screening stage to absorb more risk for non-traditional profiles, as practiced at Prolific, and ensuring interview panels are diverse to counter affinity bias. Gabrielle Tang of SheSharp points out that while diverse teams may face more initial conflict due to differing perspectives, they ultimately produce superior creative outcomes. The core takeaway is that diversity is not a "box-ticking" exercise but a strategic commitment involving data-driven OKRs, bias training for leadership, and a culture of psychological safety where every voice is heard.
Key Takeaways
- Diverse leadership is the strongest magnet for diverse talent; candidates need to see role models in senior positions to believe a culture is truly inclusive.
- Structure is the enemy of bias. Formalizing how opportunities like project leads or promotions are assigned prevents 'shoulder-tapping' that naturally favors existing majorities.
- Retention hinges on 'belonging' rather than just representation. If the environment doesn't support healthy conflict and diverse thinking, underrepresented hires will leave for better cultures.
- Data-driven accountability is essential. Treating diversity like any other business metric using OKRs and salary surveys moves it from a 'nice-to-have' to a core operational priority.
Real-World Engineering Challenges Roundup
This issue highlights technical solutions to scaling and maintenance problems at major tech companies. Snap took over maintenance of Djinni, a C++ bridging tool originally open-sourced by Dropbox, to improve string marshalling and buffer performance for mobile apps. Snap's fork focused on better performance for large strings and zero-copy buffers using references over binary types, which reduced crashes caused by Java finalizers in the Android garbage collector. Stripe developed an internal tool for designing accessible color systems using perceptually uniform color spaces, moving beyond manual selection to ensure visual accessibility for impaired users. This tool allows designers to manipulate colors in a way that maintains visual consistency while meeting accessibility standards. Zalando utilized the parallel run migration pattern to extract business logic from their monolith, allowing them to verify new service results against production data before full rollout. This pattern includes a crucial cleanup step often missed by engineering teams. Lyft's journey in developer environments shows a progression from manual EC2 provisioning to sophisticated virtual machines called Devbox and Onebox, and finally to Kubernetes. This evolution emphasizes the need for dedicated platform teams as engineering headcount grows to maintain efficiency. Pinterest improved search relevance by implementing SearchSage, which uses two-tower models and query embeddings to better understand user intent across 15 search products. This system handles query polysemy more effectively than previous clustering methods. Finally, Grab applied the Jobs to Be Done (JTBD) framework, famously used by McDonald's in the 1990s to sell more milkshakes, to identify and build high-impact features like food bundles for GrabFood. This highlights how established business frameworks remain relevant for modern tech innovation.
Key Takeaways
- Open source sustainability often relies on passing the torch where a new primary user like Snap takes over maintenance of a tool when the original creator's needs change.
- The parallel run pattern is a high-confidence migration strategy because it allows for side-by-side verification of legacy and new systems with real production traffic before the final cutover.
- Developer productivity is a scaling bottleneck that requires dedicated investment; Lyft's evolution shows that infrastructure must become more automated and shareable as teams cross the 100-engineer mark.
- Applying non-tech frameworks like Jobs to Be Done can solve modern product problems by focusing on the underlying job a customer is hiring a service to perform.
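The parallel run pattern described above can be sketched as a small wrapper. This is an illustrative outline under stated assumptions, not Zalando's implementation: the `parallel_run` helper, its parameters, and the in-memory `mismatches` list are hypothetical names chosen for the example. The legacy result is always the one served, while the new service's result is only compared and recorded.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("parallel_run")


def parallel_run(
    legacy: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    request: Any,
    mismatches: List[Tuple[Any, Any, Any]],
) -> Any:
    """Serve the legacy result; run the candidate on the side and record differences."""
    legacy_result = legacy(request)
    try:
        candidate_result = candidate(request)
        if candidate_result != legacy_result:
            mismatches.append((request, legacy_result, candidate_result))
            logger.warning("mismatch for %r: %r != %r",
                           request, legacy_result, candidate_result)
    except Exception as exc:
        # A failure in the new service must never affect the caller.
        mismatches.append((request, legacy_result, exc))
        logger.warning("candidate failed for %r: %r", request, exc)
    return legacy_result
```

Once the recorded mismatches stay at zero for long enough, the cutover can happen, and the wrapper and the legacy code path can be deleted, which corresponds to the cleanup step mentioned above.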
Incident Review and Postmortem Best Practices
Incident reviews reveal the actual state of products and organizations, which often differs from the idealized version held by leadership. Most tech companies follow a standard lifecycle starting with detection and declaration, followed by mitigation and a decompression period before analysis begins. Common tools for this process include Slack for communication, PagerDuty for alerting, and Jira for tracking follow-up actions. While standard practices focus on generating action items, high-performing teams are shifting toward a learning-centric approach. This involves moving away from the Five Whys method, which often oversimplifies complex failures and inadvertently points toward individual blame. Instead, experts like John Allspaw and teams at Honeycomb advocate for socio-technical systems analysis. This approach examines both technical failures and the social context, such as how an engineer's mental model or organizational pressure influenced their decisions. A key insight is that tech has a massive data advantage over industries like healthcare or aviation because engineers have immediate access to millisecond-level logs and configuration history. However, documentation alone is insufficient for building expertise. Simulation-based training, where teams practice responding to mock outages, is often more effective than reading past reports. Effective incident handling also requires a culture where raising alarms is encouraged even when uncertain, and where roles like the Incident Commander are clearly defined to manage communication and mitigation separately during high-severity events.
Key Takeaways
- Focusing on learning is more valuable than tracking action items. Honeycomb found that incident reviews are most effective when they explore how systems surprised the team rather than just generating a list of tasks that would have been done anyway.
- The Five Whys method can be a trap. It often narrows the investigation to a single point of failure and encourages a search for a person to blame instead of broadening the understanding of systemic complexity.
- Simulation beats documentation for building expertise. Practical exercises and incident 'war games' create the tacit knowledge and muscle memory required for effective response, which cannot be gained by simply reading archived reports.
- Socio-technical analysis provides deeper resilience. By treating human error as a starting point rather than a conclusion, organizations can identify gaps in training, tooling, and on-call schedules that contribute to outages.
Real-World Engineering Challenges Roundup
This breakdown of real-world engineering challenges highlights how top-tier tech companies solve infrastructure bottlenecks during rapid growth. Shopify improved app performance by 20 percent by implementing backend caching for Rails database queries. Their approach involved choosing between Redis and Memcached while focusing on cache invalidation and write-through strategies to manage latency. DoorDash transitioned to a multi-tenancy model for user data by adding tenant IDs across their tables, allowing them to support merchant-specific storefronts within a shared architecture. Airbnb addressed developer friction in their 1.5 million line iOS codebase by moving to the Buck build system and introducing Dev Apps, which are on-demand workspaces for single modules that reduce build times for their 75 engineers. GitHub successfully partitioned its core MySQL database to reduce load and incidents. They achieved this without downtime by using virtual partitions, schema domains, and a write-cutover process supported by custom SQL linters. Finally, Nubank replaced their slow and flaky end-to-end test suite with consumer-driven contract testing, using queuing theory to demonstrate how traditional E2E tests cause exponential build delays as a system grows.
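To make consumer-driven contract testing concrete, here is a minimal sketch of the idea, not Nubank's tooling. The contract format, the `provider_get_account` stand-in, and all field names are assumptions invented for the example. The consumer declares the fields it relies on, and the provider's build checks its own response against that declaration without starting the whole system.

```python
from typing import Dict, List

# Hypothetical contract published by a consumer service: the fields it reads
# from the provider's response and the types it expects.
CONSUMER_CONTRACT: Dict[str, type] = {
    "account_id": str,
    "balance_cents": int,
    "currency": str,
}


def provider_get_account(account_id: str) -> dict:
    """Stand-in for the provider's handler; a real check would call the actual one."""
    return {
        "account_id": account_id,
        "balance_cents": 1250,
        "currency": "BRL",
        "internal_note": "extra fields are fine",
    }


def contract_violations(response: dict, contract: Dict[str, type]) -> List[str]:
    """List every way the response breaks what the consumer depends on."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems
```

Because the check only needs the provider and the contract file, it runs in the provider's own pipeline, so no shared end-to-end environment sits in the path of every build.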
Key Takeaways
- Scaling mobile development requires treating the developer experience as a platform product, evidenced by Airbnb's dedicated infrastructure team and custom tooling to generate module-specific workspaces.
- Zero-downtime database migrations are possible even at GitHub's scale by using virtual boundaries and strict SQL linting to prevent cross-partition query issues before they hit production.
- Traditional end-to-end testing often reaches a point of diminishing returns where flakiness and maintenance costs make contract testing a more sustainable choice for complex microservice environments.
- Caching is rarely just about speed; the most difficult aspects involve the rollout strategy and ensuring cache invalidation doesn't create data consistency issues for millions of users.
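The write-through and invalidation concerns in the caching takeaway can be illustrated with a toy example. This is a sketch under simplifying assumptions, not Shopify's design: the `WriteThroughCache` class is hypothetical, and plain dictionaries stand in for both the database and a cache such as Redis or Memcached.

```python
class WriteThroughCache:
    """Toy write-through cache in front of a slower store."""

    def __init__(self, store: dict):
        self.store = store     # stand-in for the database
        self.cache = {}        # stand-in for Redis or Memcached
        self.store_reads = 0   # counts how often reads fall through to the store

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        # Write-through: update the store first, then the cache, so the cache
        # never holds a value the store does not have.
        self.store[key] = value
        self.cache[key] = value

    def invalidate(self, key):
        # Needed when the store changes through a path that bypasses put().
        self.cache.pop(key, None)
```

The hard part in practice is the last method: any write that bypasses `put` leaves a stale entry until something calls `invalidate`, which is why rollout and invalidation strategy matter more than raw speed.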
How Big Tech Runs Tech Projects and the Curious Absence of Scrum
Big Tech companies like Google, Meta, and Uber typically avoid formal frameworks like Scrum, favoring autonomous, engineer-led project management. A survey of over 100 companies shows that while Scrum remains popular in non-tech firms and consultancies, high-growth tech organizations prioritize flexibility and speed. The success of a company often has little to do with its specific methodology. For instance, WhatsApp out-executed Skype despite Skype's heavy investment in formal Scrum training. In Big Tech, engineers usually lead projects rather than dedicated project managers. Technical Program Managers (TPMs) only step in for complex, cross-team initiatives, often at a ratio of one TPM per fifty engineers. This model relies on an organizational structure that treats engineers as problem solvers rather than resources. These companies invest heavily in developer tooling and platform teams, which can comprise up to 40% of the engineering workforce. This infrastructure allows teams to ship code daily behind feature flags, making traditional Scrum rituals like bi-weekly demos redundant. Product Managers in these environments focus on strategy and defining the what and why, while Engineering Managers handle execution. Satisfaction among engineers correlates strongly with team autonomy and the ability to choose their own workflows. Conversely, low satisfaction is often linked to mandated processes, lack of involvement in estimations, and heavy reliance on JIRA, which received a staggeringly low NPS of -83 in one high-growth company. While Scrum can help kitchen sink teams manage stakeholder interruptions or help new teams find their footing, it often becomes a bottleneck for high-performing teams capable of continuous delivery. The core differentiator is the empowerment of teams to solve business problems rather than just completing assigned tasks.
Key Takeaways
- High-caliber talent and high-trust environments are the primary drivers that allow Big Tech to function without rigid project management frameworks.
- Investment in platform engineering is a strategic necessity for removing process overhead, as it automates the feedback loops that Scrum rituals usually provide manually.
- Scrum often functions as a defensive mechanism for teams in low-trust or chaotic environments to protect their focus from external interruptions.
- The decoupling of strategy and execution allows Product Managers to focus on market fit while engineers own the technical delivery and timeline.
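Shipping code daily behind feature flags, as described above, usually comes down to a deterministic check at the point where behavior changes. The sketch below is one common way to do a percentage rollout, assuming hypothetical names (`is_enabled`, `checkout`, the `new_checkout` flag); it does not represent any specific company's flag system.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Place each user in a stable bucket from 0 to 99 and compare it to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    # The new code is deployed to production but only runs for users in the rollout.
    if is_enabled("new_checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "old checkout flow"
```

Hashing the flag name together with the user ID keeps each user's experience stable between requests, and raising `rollout_percent` only ever adds users to the new path, so a rollout can grow gradually or be switched off by setting it to 0.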
The Perfect Storm Causing an Insane Tech Hiring Market
The tech hiring market in late 2021 reached an unprecedented level of intensity, characterized by extreme competition for talent and skyrocketing compensation. Hiring managers reported a significant drop in candidate applications and a high frequency of candidates declining offers for better deals, even after verbal acceptance. This environment was driven by a perfect storm of six factors. First, the pandemic forced every industry to go all-in on digital, increasing tech budgets by up to 25% at traditional retailers like Best Buy. Second, capital markets were flush with cash, leading to record-breaking IPOs and VC funding that startups used to aggressively hire. Third, pent-up demand from 2020 hiring freezes flooded the market in 2021. Fourth, the shift to remote work meant local companies suddenly had to compete with Silicon Valley salaries. Fifth, the line between tech and non-tech companies blurred as legacy firms like IKEA and Walmart began hiring like Big Tech. Finally, a shrinking supply of senior talent occurred as experienced workers took sabbaticals or left to found their own startups. While senior talent saw raises of 50% to 150%, junior developers struggled because companies found it difficult to onboard them in remote settings. Data from Ukraine showed recruiter inbound requests up 250% while job applications dropped 35%. To survive, companies must prioritize retention through out-of-cycle raises, flexible remote policies, and reduced work stress. For hiring, speed is the most critical factor. Successful startups are using C-level executives to close candidates and moving away from rigid, coding-heavy interview processes in favor of faster, more personal conversations.
Key Takeaways
- Remote work effectively broke regional salary silos, forcing companies in lower-cost areas to compete with global Tier 1 compensation packages.
- The junior talent gap is a structural failure of remote onboarding; companies that invest in remote mentorship gain access to a less competitive talent pool.
- Retention is significantly more cost-effective than backfilling, making out-of-cycle promotions and sabbaticals essential strategic tools rather than just perks.
- The competitive landscape has shifted as traditional industries now treat tech as a profit center rather than a cost center, matching Big Tech salaries to survive.
- Hiring speed and personal touch from leadership are now more effective closing tools than high-friction, standardized technical assessments.
The Platform and Program Split at Uber - by Gergely Orosz
Uber's 2014 organizational pivot replaced a project-based structure with a permanent split between Program and Platform teams. Program teams are cross-functional units optimized for rapid execution and business innovation. They typically include engineers, product managers, designers, and data scientists working toward a specific mission like marketplace efficiency or regional growth. These teams serve external customers and are measured by business metrics like gross bookings or user retention. In contrast, Platform teams provide the technical building blocks that allow Program teams to move faster. They focus on specialized domains like storage, compute, or developer experience. Their customers are internal engineering teams, and they prioritize non-functional requirements like reliability, latency, and security. Uber implemented this change using a landing party approach, restructuring the entire organization at once even before teams were fully staffed. This bold move relied on high confidence in future hiring and business growth. Today, Uber maintains a significant investment in platforms, with 30% to 40% of its engineering workforce dedicated to foundational work. This structure helps prevent technical debt from stalling growth and ensures that common features, such as tipping or messaging, are built once as a service rather than duplicated across different product verticals. While the split improves efficiency and standardization, it introduces challenges like double-reporting for non-engineering functions and the risk of platform teams becoming disconnected from actual customer needs.
Key Takeaways
- Platform teams are a strategic necessity for scaling but often face pushback from business leaders who view them as invisible costs.
- The landing party rollout strategy shows that organizational restructuring can precede hiring if there is high confidence in business growth.
- Product platforms for shared features like ratings or payments are just as critical as infrastructure platforms for reducing engineering duplication.
- A 30-40% engineering allocation to platforms is a massive but necessary investment to maintain high leverage and prevent attrition caused by technical debt.
The Seniority Rollercoaster - by Gergely Orosz
The tech industry often presents a trade-off between job titles and compensation, a phenomenon called the seniority rollercoaster. When moving between companies, especially when jumping to a higher tier like Big Tech, engineers frequently face down-leveling. This happens because expectations vary across the industry. A senior role at a small agency or non-tech firm rarely matches the scale and complexity required for a senior title at a company like Google or Meta. Other factors include interview performance, changing technology stacks, and the rapid rise of market compensation, which can leave long-tenured employees behind. Candidates should research what a specific level actually entails at a new company by reviewing career ladders and impact expectations. For example, a Principal Engineer at one firm might manage 15 people, while at another, they influence thousands. If a candidate feels mis-leveled, they should respectfully challenge the decision with evidence of past impact or leverage competing offers. Sometimes, rejecting a down-leveled offer is the right move to preserve career signaling. For engineering managers, down-leveling is a retention and morale risk. Managers should stay active in the hiring loop and personally extend offers to gauge a candidate's reaction to their assigned level. If a down-level is necessary, managers must be transparent about the rationale and set realistic promotion timelines. Promising a quick promotion to fix a down-level often backfires if the manager does not have total control over the process. Finally, while some established professionals claim titles do not matter, they are vital for underrepresented groups. For these individuals, a high-level title serves as a necessary signal of technical credibility that might otherwise be unfairly questioned.
Key Takeaways
- Down-leveling is often a strategic trade-off where you sacrifice a title for significantly higher compensation and a more challenging environment.
- Titles are not standardized across the industry, so you must evaluate a role based on its impact radius and the specific company's career ladder.
- For underrepresented groups, titles act as a critical shield against bias by providing immediate proof of technical competence.
- Managers who ignore a new hire's frustration with their level risk long-term morale issues and early turnover.
Advice, observations and inspiration for engineering leaders
The Pragmatic Engineer is a top-rated Substack newsletter tailored for engineering managers and senior engineers. Authored by Gergely Orosz, who brings experience from Uber, Microsoft, and various high-growth startups, the publication focuses on providing actionable advice for navigating the tech industry. The content specifically targets the inner workings of big tech companies and fast-moving startups, exploring why certain organizations outpace others. A core tenet of the newsletter is its independent viewpoint; Orosz does not accept sponsorships or ads, ensuring unbiased observations. Subscribers can choose between a free tier, which offers monthly updates, and a paid tier. Paid members receive weekly long-form articles, access to a full archive, and exclusive resources like templates for engineering managers. The newsletter also covers high-level industry trends, such as the absence of Scrum in big tech, the complexities of the tech hiring market, and engineering career paths. For professionals looking to use company budgets, the platform provides receipts and invoices suitable for corporate expensing, including EU VAT-compliant options. Group subscriptions and discounts for students or those in lower-income regions via purchasing power parity are also available.
Key Takeaways
- The newsletter serves as a primary resource for benchmarking engineering practices against industry leaders like Uber and Microsoft.
- Its ad-free, sponsorship-free model is a strategic choice to maintain editorial integrity and trust with a technical audience.
- The content strategy prioritizes actionable insights over theoretical concepts, aiming for immediate ROI in team efficiency or career progression.
Frequently Asked Questions
- Given that AI tools like Claude Code and Cursor enable rapid 'vibe coding' and high-volume output, how can engineering teams prevent the accumulation of 'vibe slop' and 'agentic regret' when models lack the first-principles reasoning required for long-term system design?
- In light of McKinsey's push for individual developer productivity metrics, how should engineering leaders reconcile the demand for quantifiable output with Dr. Nicole Forsgren's warning that measuring PR counts creates perverse incentives, especially when AI tools are shifting the developer's role from writing code to reviewing it?
- While AWS and Azure offer elastic infrastructure and managed services, how does the transition of companies like Bluesky to bare-metal servers on Vultr for '10x the performance' validate Oxide's clean-sheet approach to on-premise hardware and challenge the long-term economic viability of default cloud adoption?
- Given the '70% problem' where AI accelerates the creation of prototypes but struggles with production-ready polish, how can junior engineers avoid accepting 'house of cards code' when they lack the deep architectural mental models that senior engineers use to effectively curate AI output?
- How does the heavy process and 'promotion-driven development' of Google's L-level structure and Amazon's 'Unregretted Attrition (URA)' targets contrast with the 'talk, do, show' iteration cycle of startups like Linear, and what does this mean for engineers experiencing 'startup shock' after leaving Big Tech?
- Considering that code security is shifting to a 'shared responsibility' model where developers must verify AI-generated code, how can teams effectively mitigate new attack vectors like 'prompt injection' and 'slop-squatting' without reverting to the slow, compliance-driven gatekeeping of traditional security audits?
- Given Chris Lattner's observation of the 'two-world problem' in AI infrastructure—where researchers use Python but production requires C++ or CUDA—how do emerging languages like Mojo and Codespeak balance the need for high-performance hardware control with the rising abstraction levels of 'agentic workflows'?
- With the shift toward 'Observability 2.0' and wide structured events advocated by Charity Majors, how must traditional monitoring frameworks evolve to handle the 'non-deterministic' nature of AI-generated code and the 'high-cardinality' problems introduced by autonomous agent swarms?