I wanted to track what I was eating without paying for an app or fighting a confusing data-entry UI. That, of course, turned out to require building the whole thing from scratch. The first challenge was finding a large database of food products with nutritional information, since I was not planning to type in detailed nutrition facts from the label of every food I tracked. After some searching, the obvious choice was Open Food Facts, a community-maintained database of over 3 million products with full nutritional data, released under an open license.
The challenge then became turning that dataset into a fast, pleasant search experience on a phone.
Open Food Facts exposes a search endpoint. The first version of Vital called it on every keystroke:
GET https://world.openfoodfacts.org/cgi/search.pl?search_terms=greek+yogurt&json=1
This failed in three ways. First, latency: a round-trip to France on a mobile connection takes 300–800 ms per keystroke. The UI felt laggy before I'd typed two characters. Second, ranking: the API returns results ordered by last-edit date, not relevance. "Greek yogurt" returned obscure regional products before the major brands. Third, rate limits: the API is not designed for high-frequency per-keystroke queries from a mobile app.
The API is great for fetching a specific product by barcode. It's not great for search-as-you-type.
The solution: ship a search index inside the app bundle. The user downloads it once, and every subsequent search is a local file read—zero network round-trips, zero rate limits, full control over ranking.
The two obvious options were SQLite FTS5 (full-text search built into SQLite) and a custom BM25 index. I went with BM25. A custom binary index is more compact than an SQLite database for the same data—no row overhead or page alignment waste. More importantly, I could tune tokenization, stopword handling, and the k1/b parameters directly, without fighting FTS5's defaults. And writing it from scratch meant I understood exactly why a result ranked where it did, which turned out to matter when I started debugging surprising rankings.
The index is built offline from the Open Food Facts CSV export (~750K English-language products after filtering). The Python build script tokenizes product names, builds an inverted index, and serializes it to six index files that ship in the app bundle.
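The filtering pass over the CSV export can be sketched roughly as follows. This is illustrative only: the column names (`code`, `product_name`, `lang`) and the exact language filter are assumptions, not the real export schema or build script.

```python
import csv
import io

def filter_rows(reader, name_field="product_name", lang_field="lang"):
    """Keep rows that have a non-empty name and are tagged English.
    Illustrative filter; the real build script's criteria may differ."""
    for row in reader:
        if row.get(lang_field) == "en" and row.get(name_field, "").strip():
            yield row["code"], row[name_field]

# Tiny synthetic sample in a tab-separated shape (hypothetical data):
sample = "code\tproduct_name\tlang\n123\tGreek Yogurt\ten\n456\t\ten\n789\tYaourt\tfr\n"
rows = list(filter_rows(csv.DictReader(io.StringIO(sample), delimiter="\t")))
# rows == [("123", "Greek Yogurt")]
```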
import re

STOPWORDS = {"a", "an", "the", "of", "in", "and", "with", "for", "to", "g", "ml"}

def tokenize(text: str) -> list[str]:
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]
The inverted index maps each term to a list of (doc_id, term_frequency) pairs. Each entry is packed as 6 bytes: a 4-byte offset into the doc list (tf_offset) and a 2-byte document length (total tokens in that product name), allowing BM25 normalization without loading the full document:
import struct
# For each (doc_id, tf) pair in a term's posting list:
entry = struct.pack('<IH', tf_offset, doc_length) # 6 bytes
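The step that produces those posting lists can be sketched like this, using the (doc_id, term_frequency) framing from above. This is a simplified illustration, not the actual build script; the function names are mine.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "the", "of", "in", "and", "with", "for", "to", "g", "ml"}

def tokenize(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

def build_inverted_index(names: list[str]) -> dict[str, list[tuple[int, int]]]:
    """Map each term to a posting list of (doc_id, term_frequency) pairs."""
    index: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for doc_id, name in enumerate(names):
        for term, tf in Counter(tokenize(name)).items():
            index[term].append((doc_id, tf))
    return dict(index)

index = build_inverted_index(["Greek Yogurt", "Greek Yogurt Vanilla", "Vanilla Ice Cream"])
# index["greek"]   -> [(0, 1), (1, 1)]
# index["vanilla"] -> [(1, 1), (2, 1)]
```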
The six output files are: products_ids.txt (one product ID per line), products_display.txt (name + kcal display string per line), products_docs.bin (concatenated posting lists), term_offsets.bin (byte offset of each term's posting list in products_docs.bin), term_lengths.bin (number of entries per term), and vocab.txt (sorted vocabulary, one term per line). The vocab file is binary-searched at query time to find a term's index.
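The write-out step for the binary sidecar files can be sketched like this. It is a simplified illustration under assumptions: the per-entry payload here is (doc_id, tf) for clarity, and the offsets/lengths encodings are guesses at the layout, not the real serializer.

```python
import struct

def serialize_index(index: dict[str, list[tuple[int, int]]]):
    """Pack posting lists into three blobs mirroring products_docs.bin,
    term_offsets.bin, and term_lengths.bin (illustrative layout)."""
    vocab = sorted(index)  # vocab.txt order: sorted, binary-searched at query time
    docs, offsets, lengths = bytearray(), bytearray(), bytearray()
    for term in vocab:
        postings = index[term]
        offsets += struct.pack("<I", len(docs))       # byte offset into docs blob
        lengths += struct.pack("<I", len(postings))   # number of 6-byte entries
        for doc_id, tf in postings:
            docs += struct.pack("<IH", doc_id, tf)    # 6 bytes per entry
    return vocab, bytes(docs), bytes(offsets), bytes(lengths)

vocab, docs, offsets, lengths = serialize_index(
    {"yogurt": [(0, 1), (1, 2)], "greek": [(0, 1)]}
)
# vocab == ["greek", "yogurt"]; docs is 3 entries * 6 bytes = 18 bytes
```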
The Swift side loads all index files from the app bundle at startup into memory-mapped data. At query time, each query term is binary-searched in the vocabulary, its posting list is loaded, and BM25 scores are accumulated:
func bm25Score(
    queryTerms: [String],
    termFreqs: [String: Int],  // term -> tf in this document
    docLength: Int,
    meta: IndexMeta            // N, avgdl, idf per term
) -> Double {
    let k1 = 1.2
    let b = 0.75
    var score = 0.0
    for term in queryTerms {
        guard let tf = termFreqs[term],
              let idf = meta.idf[term] else { continue }
        let norm = Double(docLength) / meta.avgDocLength
        let tfNorm = Double(tf) * (k1 + 1.0) /
            (Double(tf) + k1 * (1.0 - b + b * norm))
        score += idf * tfNorm
    }
    return score
}
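Two pieces the scoring loop relies on are the vocabulary binary search and the per-term IDF in meta.idf. Both can be sketched in Python; the smoothed IDF formula below is one common BM25 variant and an assumption on my part, since the post doesn't spell out which form the index uses.

```python
import math
from bisect import bisect_left

def term_index(vocab: list[str], term: str) -> int:
    """Binary-search the sorted vocabulary; return -1 if the term is absent."""
    i = bisect_left(vocab, term)
    return i if i < len(vocab) and vocab[i] == term else -1

def idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed BM25 IDF: rarer terms score higher, never negative."""
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

vocab = ["greek", "vanilla", "yogurt"]
# term_index(vocab, "yogurt") -> 2; term_index(vocab, "banana") -> -1
```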
For a query of 2–3 terms, this scores the union of matching documents very quickly on my iPhone—fast enough to run on every keystroke with no debounce.
The local index stores only what's needed for search and display: barcode, product name, and calorie count. Full nutritional data (macros, micronutrients, serving sizes, ingredients) is fetched from the Open Food Facts API only when the user taps a specific product to log it.
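That per-product call is a single GET by barcode against Open Food Facts' product read endpoint. A minimal sketch (the barcode shown is illustrative, and the structure of the JSON response is not assumed here):

```python
import json
import urllib.request

def product_url(barcode: str) -> str:
    """Open Food Facts read endpoint for a single product by barcode."""
    return f"https://world.openfoodfacts.org/api/v0/product/{barcode}.json"

def fetch_product(barcode: str) -> dict:
    """Fetch the full, current record for one product (makes a network call)."""
    with urllib.request.urlopen(product_url(barcode), timeout=10) as resp:
        return json.load(resp)

# Example barcode (hypothetical lookup):
# fetch_product("737628064502")
```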
This hybrid approach keeps the bundle small (~40 MB for the index) while ensuring the nutritional detail shown to the user is always current—Open Food Facts is continuously updated by its community, and the per-product API call gets the latest version of that record.