/Scale

How Zapier Automates Billions Of Tasks

- Neo Kim tl;dr: Neo takes a look at Zapier's architecture, highlighting its use of Nginx, Python Django, MySQL, Redis, AWS Lambda, RabbitMQ, and Celery for automating billions of tasks. It details Zapier's tech stack, asynchronous processing, scalability strategies, and how they handle task execution and history tracking, using technologies like GraphQL, Next.js, AWS S3, Kafka, and Elasticsearch for efficiency and scalability. 

featured in #493


1.5+ Million PDFs In 25 minutes

- Sarat Chandra, Karan Sharma tl;dr: “In this blog post, we describe our journey of building an architecture from scratch which now enables us to process, generate, digitally sign, and e-mail out 1.5+ million PDF contract notes in about 25 minutes, incurring only negligible costs. We self-host all elements of this architecture relying on raw EC2 instances for compute and S3 for ephemeral storage. In addition, the concepts used for orchestration of this particular workflow can now be used for orchestrating many different kinds of distributed jobs within our infrastructure.”

featured in #492


Scaling ChatGPT: Five Real-World Engineering Challenges

- Gergely Orosz, Evan Morikawa tl;dr: An interview with Evan Morikawa, who led the OpenAI Applied Engineering team as ChatGPT launched and scaled. Evan describes five engineering challenges and the lessons learned from each: (1) KV cache and GPU RAM. (2) Optimizing batch size. (3) Finding the right metrics to measure. (4) Finding GPUs wherever they are. (5) Inability to autoscale.

featured in #491


Ledger: Stripe’s System For Tracking And Validating Money Movement

- Ilya Ganelin tl;dr: “Ledger models internal data-producing systems with common patterns, and it relies on proactive alerting to surface issues and proposed solutions. Each day, Ledger sees five billion events and 99.99% of our dollar volume is fully ingested and verified within four days. Of that activity, 99.999% is monitored, categorized, and triaged through rich investigative tooling — while the remaining long-tail is reliably handled through manual analysis.” This post shares technical details on how Stripe built this money movement tracking system, and how teams interact with the data quality metrics that underlie Stripe's global payments network.

featured in #490


How Disney+ Hotstar Delivered 5 Billion Emojis in Real Time

- Neo Kim tl;dr: This post outlines how Disney+ Hotstar delivered billions of emojis in real time during the cricket World Cup in India to create a more engaging live experience. It describes how emojis were received from clients, processed, and delivered at scale.

featured in #489


Data-Caching Techniques For 1.2 Billion Daily API Requests

- Guillermo Pérez tl;dr: The cache needs to achieve three things: (1) Low latency: It needs to be fast. If a cache server has issues, you can’t retry. (2) Up and warm: It needs to hold the majority of the critical data. If you lose it, the resulting load would surely bring down the backend systems. (3) Consistency: It should never hold stale or incorrect data. “A lot of the techniques mentioned in this article are supported by our open source meta-memcache cache client.”

featured in #486


Meta's Serverless Platform Processes Trillions Of Function Calls A Day

- Leonardo Creed tl;dr: XFaaS is Meta’s serverless platform, which processes trillions of function calls per day on more than 100,000 servers spread across tens of datacenter regions. It is Meta’s internal counterpart to public Function-as-a-Service (FaaS) options such as AWS Lambda, Google Cloud Functions, and Azure Functions. Leonardo shares his high-level takeaways and lessons, followed by a more detailed walkthrough of the architecture behind XFaaS.

featured in #485


Sensenmann: Code Deletion At Scale

- Phil Norman tl;dr: “What if we could clean up dead code automatically? That was exactly what people started thinking several years ago, during the Zürich Engineering Productivity team's annual hackathon. The Sensenmann project, named after the German word for the embodiment of Death, has been highly successful. It submits over 1000 deletion changelists per week, and has so far deleted nearly 5% of all C++ at Google. Its goal is simple (at least, in principle): automatically identify dead code, and send code review requests to delete it.” Phil discusses its logic. 

featured in #482


How Apple Built iCloud To Store Billions Of Databases

- Leonardo Creed tl;dr: Apple uses FoundationDB and Cassandra for iCloud and CloudKit, their cloud backend service, running one of the largest Cassandra deployments in the world. This deployment includes over 300,000 instances or nodes, managing hundreds of petabytes of data, possibly extending to exabytes. Each cluster in this deployment can handle over two petabytes of data, and there are thousands of such clusters. This indicates a highly distributed and scalable storage system spread across multiple data centers. 

featured in #480


How Discord Serves 15-Million Users On One Server

- Alex Xu tl;dr: “Internally, each Discord community is called a ‘guild’. A dedicated Elixir ‘guild process’ handles coordination and routing for each guild. This tracks all connected users to the guild. Every online user has a separate Elixir ‘session process’. When the guild process gets a new message, event, or update, it fans out this information to the relevant session processes. These session processes then push the update over WebSocket to the Discord clients.”

featured in #479
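
The guild-to-session fan-out described in the Discord entry above can be sketched roughly as follows. This is a minimal, single-process illustration in Python, not Discord's Elixir implementation: the `Guild` and `Session` classes, their methods, and the `outbox` list (standing in for a WebSocket push) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Stands in for one online user's session process (hypothetical)."""
    user_id: str
    # In the real system the update is pushed over a WebSocket;
    # here it is simply appended to a list so it can be inspected.
    outbox: list = field(default_factory=list)

    def push(self, event: dict) -> None:
        self.outbox.append(event)


class Guild:
    """Stands in for one guild process: tracks sessions and fans out events."""

    def __init__(self, guild_id: str) -> None:
        self.guild_id = guild_id
        self.sessions: dict[str, Session] = {}

    def connect(self, session: Session) -> None:
        # Track a newly connected user.
        self.sessions[session.user_id] = session

    def disconnect(self, user_id: str) -> None:
        # Stop tracking a user; ignore unknown ids.
        self.sessions.pop(user_id, None)

    def publish(self, event: dict) -> int:
        # Fan out: deliver the event to every connected session.
        for session in self.sessions.values():
            session.push(event)
        return len(self.sessions)


if __name__ == "__main__":
    guild = Guild("guild-1")
    alice, bob = Session("alice"), Session("bob")
    guild.connect(alice)
    guild.connect(bob)
    delivered = guild.publish({"type": "message", "text": "hello"})
    print(delivered, alice.outbox, bob.outbox)
```

The cost of `publish` grows with the number of connected sessions, which is why fan-out becomes the pressure point as a single guild gets very large.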