
Parola: Building an AI Language Learning App with On-Device LLMs

Jan 2025 · 8 min read

The Motivation

My parents are learning a new language. Their main bottleneck isn't vocabulary lists or grammar rules—it's that reading anything natural, like a news article or a recipe, requires stopping to look up every third word. By the time you've looked up four words in a paragraph, you've lost the thread of the sentence.

So I thought: why not build something that removes this friction? That idea became Parola, an app that chains AI workflows to generate short stories calibrated to the learner's vocabulary level, then surfaces the hard words inline with translations and quizzes. The learner reads in context, and because the stories are generated fresh each time, there's no static content to exhaust and no boring topics to slog through.

Why On-Device?

Apple recently rolled out Apple Intelligence, and with it system-level Foundation Models APIs. The FoundationModels framework (available on iOS 26+) exposes an on-device LLM that runs entirely on the Neural Engine: no API key, no server, no data leaving the device. I was curious what a model running on your phone could actually do—whether it could produce coherent stories in a foreign language, handle per-word translation, and stay within tight context limits. That question was the other big reason I built this.

The catch is capability, and I learned that the hard way. On-device models are far smaller than GPT-4—roughly in the 3B–7B parameter range. They're capable of coherent short-form text generation, but they have very real limitations:

  - A short context window that fills up fast.
  - Prompt comprehension that degrades as prompts get longer.
  - Unreliable structured output, especially JSON.
  - Truncated responses once the context is full.

Working with these constraints shaped every design decision in Parola.

The Story-First Approach

An earlier prototype went vocabulary-first: the user selected words they wanted to practice, and the app generated a story using those words. This produced stilted, unnatural text—the model was visibly wedging vocabulary items in wherever they would fit.

So I flipped it to story-first: generate a natural story, then extract the vocabulary from it. The flow became:

  1. Generate a short story (50-60 words for local models) at the target difficulty level.
  2. Tokenize the story into unique content words.
  3. For each word, ask the model for a translation and an example sentence.
  4. Run a validation loop: if the translation looks wrong or is in the wrong language, retry.

This produces vocabulary that actually appeared in a natural context—the learner can always scroll back to see the word used in the sentence they just read.
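The flow above can be sketched with the model call injected as a closure, so the pipeline logic stays independent of any particular backend. The names here are illustrative, not Parola's actual types:

```swift
import Foundation

struct VocabularyItem {
    let word: String
    let translation: String
}

// Story-first pipeline sketch: `generate` stands in for whatever
// backend produces text from a prompt (on-device or cloud).
func buildLesson(
    topic: String,
    generate: (String) async throws -> String
) async throws -> (story: String, vocabulary: [VocabularyItem]) {
    // 1. One short request for the story itself.
    let story = try await generate("Write a 50-60 word story about \(topic).")

    // 2. Tokenize into unique content words, preserving order.
    var seen = Set<String>()
    let words = story
        .lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { $0.count > 3 && seen.insert($0).inserted }

    // 3. One small request per word, instead of one giant prompt.
    var vocabulary: [VocabularyItem] = []
    for word in words {
        let translation = try await generate("Translate \"\(word)\" to English.")
        vocabulary.append(VocabularyItem(word: word, translation: translation))
    }
    return (story, vocabulary)
}
```

The `$0.count > 3` filter is a crude stand-in for real content-word detection, but it shows the shape: short prompts, one concern per request.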

Working with Small LLMs: The Real Challenges

The most important constraint is context length—both because the local foundation model supports only a short context, and because the model's grasp of the prompt degrades with length. Asking the on-device model for a story plus vocabulary extraction in a single prompt produces unreliable output. Parola breaks it into multiple smaller requests:

  1. One request for the story itself.
  2. One request per vocabulary word for its translation.
  3. One request per word for an example sentence or quiz item.
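In FoundationModels terms, "smaller requests" means giving each task its own short-lived session, so no single context window has to hold the whole workflow. A sketch, assuming the iOS 26 `LanguageModelSession` API shape:

```swift
import FoundationModels

// One fresh session per small task — each request starts
// from an empty context instead of accumulating history.
func translateWord(_ word: String) async throws -> String {
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Translate the Italian word \"\(word)\" into English."
    )
    return response.content
}
```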

Unreliable JSON output is the other major challenge. The on-device model frequently produces things like:

Sure! Here is the translation:
```json
{"translation": "ciao", "example": "Ciao, come stai?"}
```

Or truncated output when the context fills up. Robust JSON parsing is not optional.
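A tolerant decoder helps here. This sketch scans for the outermost braces before decoding, so chatty preambles and markdown fences don't break parsing (the type name is illustrative):

```swift
import Foundation

struct Translation: Decodable {
    let translation: String
    let example: String
}

// Scan for the outermost `{...}` span, then decode only that slice.
// Anything the model wrapped around the JSON is ignored.
func decodeTranslation(from raw: String) -> Translation? {
    guard let start = raw.firstIndex(of: "{"),
          let end = raw.lastIndex(of: "}"),
          start < end else { return nil }
    let json = raw[start...end]
    return try? JSONDecoder().decode(Translation.self, from: Data(json.utf8))
}
```

Returning `nil` rather than throwing keeps the call site simple: a `nil` result just feeds back into the retry loop described next.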

The Retry Strategy

Along the same lines as JSON parsing, every LLM call in Parola goes through a retry helper. The parameters were tuned empirically: five attempts handle the vast majority of transient failures, and a 500 ms backoff gives the on-device model time to "cool down" between attempts without making the UX feel broken.

```swift
func retry<T>(
    attempts: Int = 5,
    delay: TimeInterval = 0.5,
    operation: () async throws -> T
) async throws -> T {
    var lastError: Error?
    for attempt in 0..<attempts {
        do {
            return try await operation()
        } catch {
            lastError = error
            // Back off before the next attempt, but not after the last one.
            if attempt < attempts - 1 {
                try await Task.sleep(nanoseconds: UInt64(delay * 1_000_000_000))
            }
        }
    }
    throw lastError!
}
```

Prompting in the Target Language

One counterintuitive finding: asking the model to produce English output works better when the prompt itself is written in English—but asking for stories in the learner's target language (if it differs from English) works better when the prompt is in that language too. An Italian prompt asking for an Italian story produces more natural phrasing than an English prompt saying "write a story in Italian."

For this reason, Parola relies on prompts in the target language. The translation validation loop checks that the returned translation is actually in the user's native language rather than accidentally in the target language—a failure mode that occurs more often than you'd expect with small models.
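The validation loop can be sketched with the language check injected as a closure—in the real app that could be `NLLanguageRecognizer` or a simple word-list heuristic. Both function names here are hypothetical:

```swift
import Foundation

// Ask for a translation up to `attempts` times; reject any answer
// that came back in the target language instead of the user's
// native language, and return nil if no attempt passes.
func validatedTranslation(
    of word: String,
    attempts: Int = 3,
    requestTranslation: (String) async throws -> String,
    isNativeLanguage: (String) -> Bool
) async throws -> String? {
    for _ in 0..<attempts {
        let candidate = try await requestTranslation(word)
        if isNativeLanguage(candidate) { return candidate }
    }
    return nil
}
```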

The Hybrid Option

In the end, on-device models could only produce very short stories, which may not satisfy every reader. So for users who want higher quality, Parola offers an optional GPT-4o-mini backend via subscription. Both backends use the same prompt structure—the only difference is the API call site.
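The shared call site can be sketched as a small protocol—names illustrative, assuming both backends simply take a prompt and return text:

```swift
import Foundation

// Both backends conform to the same interface, so the prompt
// construction code never knows which one it is talking to.
protocol StoryBackend {
    func complete(_ prompt: String) async throws -> String
}

func generateStory(level: String, using backend: StoryBackend) async throws -> String {
    // Identical prompt regardless of backend — only the call site differs.
    try await backend.complete(
        "Scrivi una breve storia (50-60 parole) di livello \(level)."
    )
}
```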

Lessons Learned