/ML

Introducing Natural Language Search For Podcast Episodes

- Alexandre Tamborrino tl;dr: "To enable users to find more relevant content with less effort, we started investigating a technique called Natural Language Search, also known as Semantic Search. In a nutshell, Natural Language Search matches a query and a textual document that are semantically correlated instead of needing exact word matches. It matches synonyms, paraphrases, etc., and any variation of natural language that express the same meaning."  

featured in #336


The Berkeley Crossword Solver

tl;dr: "The BCS uses a two-step process to solve crossword puzzles. First, it generates a probability distribution over possible answers to each clue using a question answering (QA) model; second, it uses probabilistic inference, combined with local search and a generative language model, to handle conflicts between proposed intersecting answers."

featured in #331


In Search Of The Least Viewed Article On Wikipedia

- Colin Morris tl;dr: "Based on our findings above, the least viewed articles on Wikipedia are not going to be merely about topics with little popular interest - they must also be “unlucky” in the sense of having very small random gaps... Of these 600,000 least lucky articles, all received at least a few views in 2021. The booby prize for least popular article of 2021 is shared by two articles which received exactly 3 probably-human pageviews."

featured in #322


Evolution Of ML Fact Store

- Vivek Kaushal tl;dr: "This post will focus on the large volume of high-quality data stored in Axion — our fact store that is leveraged to compute ML features offline. We built Axion primarily to remove any training-serving skew and make offline experimentation faster. We will share how its design has evolved over the years and the lessons learned while building it."

featured in #321


How DALL-E 2 Actually Works

- Ryan O'Connor tl;dr: Ryan walks through how DALL-E 2 generates images from text: (1) A text encoder maps the prompt into a representation space. (2) A model called the prior maps that text encoding to a corresponding image encoding. (3) An image decoder, a diffusion model, generates an image from that encoding.

featured in #311


Real World Recommendation System - Part 1

- Nikhil Garg tl;dr: "FAANG and other top tech companies have independently converged on a common architecture for production grade recommendation systems." This architecture is domain- and vertical-agnostic and can power all sorts of applications, from e-commerce and feeds to search and notifications. Nikhil starts from the basics, explains the nuances, and describes this universal architecture.

featured in #310


On Owning A Software Problem

- Vicki Boykis tl;dr: Which small, low-friction details go unnoticed by most people but, once noticed, signal craftsmanship, expertise, and pride in one's work? Vicki has created a list relevant to ML and data science: (1) Python code has type annotations. (2) Accurate documentation of a repo and an easy, reproducible way to run the project. (3) Formatted and linted SQL statements. And more.

featured in #293


How We Optimized Python API Server Code 100x

- Vadim Markovtsev tl;dr: "Some of the tricks we used to speed up calls to our analytical API written in Python: played with asyncio, messed with SQLAlchemy, hacked deep in asyncpg, rewrote parts in Cython, found better data structures, replaced some pandas with pure numpy."

featured in #291


Red Hot: The 2021 Machine Learning, AI and Data (MAD) Landscape

- Matt Turck tl;dr: Matt covers the macro view: making sense of the ecosystem’s complexity, financings, IPOs and M&A, a landscape of the ecosystem, key trends and more.

featured in #258


Machine Learning Is Going Real-time

- Chip Huyen tl;dr: Chip discusses two approaches: (1) Online predictions, where an ML system makes predictions in real time. (2) Online learning, where an ML system incorporates new data and updates its models in real time.

featured in #219