LiberaGPT White Paper
March 2026
A Local-First Cognitive Interface for iPhone iOS (iPhone 17 Pro Architecture)
Abstract
LiberaGPT is a native iPhone system for private inference, grounded retrieval, local voice interaction, and disciplined user control. Its central claim is not that a handset can host a frontier model in the server sense. Its claim is stricter and more consequential: a phone-class system becomes a serious personal reasoning surface when computation, memory, interface, and safety are co-designed around the limits and strengths of the device rather than treated as scaled-down versions of cloud infrastructure.
The hardware substrate is now credible. Apple specifies the iPhone 17 Pro and Pro Max around the A19 Pro, with a 6-core CPU, a 6-core GPU with Neural Accelerators, and a 16-core Neural Engine. Apple exposes an on-device foundation model to developers through the Foundation Models framework, while Core ML provides compression, stateful execution, and transformer-specific support for fully on-device deployment. The system is engineered for this substrate.
This paper describes a completed system, not a proposal. LiberaGPT implements a layered runtime in which a compact on-device language core, explicit local memory with retrieval-augmented generation, thermal-aware computation routing, native voice ingress, and a governance-first interface operate together to produce a tool that is private, legible, and aware of its own hardware limits. The system monitors its own thermal state and memory pressure in real time and adapts its inference behaviour accordingly. It parses, chunks, embeds, and retrieves from the user's own imported documents entirely on-device. It streams generation token by token with cooperative cancellation at every boundary. It logs errors locally under encryption and exposes full diagnostic state to the user. Nothing leaves the device. Nothing is hidden from the user.
Verified claims: The hardware description of iPhone 17 Pro class devices, the availability of Apple's Foundation Models framework, and Core ML support for on-device generative deployment are all directly grounded in Apple documentation. The statements about transformer foundations, retrieval-augmented generation, and low-precision inference are likewise grounded in primary technical sources. The described system capabilities reflect the implemented codebase as of March 2026.
Design thesis: The claim that LiberaGPT defines a new category of personal reasoning surface is a design position. It is presented as an engineering conclusion drawn from the constraints and affordances of current handset hardware, not as a market prediction.
1. Problem Setting
Mobile computation has matured faster than mobile software architecture. The contemporary handset is capable of serious local inference, but most products still treat the device as a terminal for remote computation. This leaves a structural gap. Users want immediacy, privacy, and continuity of context, yet cloud-first systems introduce network dependency, externalised storage, and weak legibility around what is remembered and why. Apple's public framing of on-device intelligence points in the same direction: local execution is valuable not merely for latency, but for responsiveness and privacy. Core ML states this plainly by emphasising that models run strictly on-device and remove any need for a network connection while keeping user data private.
The technical problem is to construct a mobile cognitive interface that respects device limits without collapsing into triviality. Apple's own description of the Foundation Models framework is revealing here. The on-device model at the core of Apple Intelligence is approximately 3 billion parameters and is described as effective for summarisation, extraction, refinement, and short dialog, while Apple explicitly notes that it is not designed as a chatbot for general world knowledge. That statement is not a limitation in the ordinary sense. It is an architectural cue. The correct design is an application-specific reasoning system with local memory and controlled scope, not an unconstrained general world oracle.
But the problem extends beyond model selection. A phone-class inference system must also contend with thermal dynamics, memory pressure, battery budget, and the user's right to understand what the system is doing at every moment. These are not secondary concerns. They are governing constraints. Ignore thermal state and the device throttles unpredictably. Ignore memory pressure and it crashes under load. Hide the reasoning path and the user stops trusting the output. LiberaGPT addresses all three.
The objective is maximum trust per watt, maximum clarity per screen, and maximum useful reasoning per byte of context.
Verified claims: Core ML does explicitly position strict on-device execution as a privacy and responsiveness advantage. Apple does explicitly describe the on-device foundation model as strong at text tasks but not intended as a general world knowledge chatbot. Thermal throttling is a documented constraint on sustained workloads on iPhone-class devices.
Design criteria:The phrases “maximum trust per watt” and “maximum useful reasoning per byte of context” are conceptual design criteria for LiberaGPT, not measured benchmarks.
2. Technical Basis
The foundational model layer is a routing problem rather than a monolith. Apple exposes a production-grade on-device language model through the Foundation Models framework, suitable for summarisation, text understanding, extraction, refinement, and short dialog inside application-specific workflows. At the same time, open model families aimed at everyday devices have matured. Google's Gemma 3n is explicitly designed for efficient execution on low-resource devices and supports multimodal input across text, image, video, and audio with open weights for pretrained and instruction-tuned variants. Families such as Qwen 2.5 and Phi-3 offer additional trade-offs in context length, parameter count, and quantisation format.
Apple's on-device model provides immediate native capability and a stable deployment path on Apple silicon. An open compact model family provides a route to tunability, domain specificity, and longer-term independence from any single vendor. One model does not replace the other. A serious mobile system must separate its inference providers from its orchestration and interface so that models can be swapped, upgraded, or run in parallel without rewriting the product surface. Apple's Core ML stack supports model compression, stateful execution, and efficient transformer operations, all of which are central to keeping a local language system responsive under handset constraints.
The memory layer must be explicit rather than latent. Lewis et al. showed that retrieval-augmented generation combines parametric memory with an external dense index and can yield responses that are more specific and factual than parametric-only baselines on knowledge-intensive tasks. On a handset where the model is small and the context window finite, then a local non-parametric memory is not an accessory. It is the mechanism by which the system becomes accountable to the user's own corpus rather than merely fluent.
Dettmers et al. demonstrate that low-precision matrix multiplication sharply reduces inference memory demands while preserving strong performance characteristics, and Apple's own deployment guidance for Apple silicon treats compression as a first-class engineering concern with explicit trade-offs among storage size, latency, power usage, and accuracy. On a phone, this governs every deployment decision. LiberaGPT is engineered with quantisation-aware model management, where each model pack declares its compression format, memory footprint, and expected throughput, and the system validates these against available device resources before loading.
Verified claims: Apple's framework capabilities, Gemma 3n device orientation and multimodal scope, retrieval-augmented generation as an explicit memory method, and low-precision inference as a memory efficiency strategy are all grounded in the cited sources. Qwen 2.5 and Phi-3 are publicly documented compact model families with published model cards.
Implementation claims: The model pack management system with declared footprints, compression formats, and device validation is implemented in the current codebase. Model packs include SHA-256 integrity verification on download.
3. Inference Routing and Dual-Mode Execution
LiberaGPT implements two operational modes that govern the entire inference pipeline: Fast and Deep. This is not a cosmetic toggle. Each mode produces materially different runtime behaviour across retrieval depth, context length, system prompt construction, and thermal budget allocation.
Fast modefavours low latency and short context. It restricts retrieval to a small number of highly relevant document chunks, applies a higher relevance threshold to filter weak matches, and issues concise system instructions that prioritise accuracy over elaboration. When Apple's native Foundation Models provider is available and the device supports it, Fast mode routes to that path for maximum throughput on Apple silicon.
Deep mode allows heavier retrieval, broader context windows, and longer system instructions that permit more thorough analysis of retrieved material. When a compact local model is installed and sufficient memory is available, Deep mode can route to that provider for domain-specific reasoning with larger context. This is an operational distinction between short-path and high-context execution, not a claim about hidden chain of thought.
Both modes are subject to real-time thermal and memory constraints. If the device enters a critical thermal state during Deep mode operation, the system can force a fallback to Fast mode to protect the device and the user experience. If available memory drops below the threshold required to hold a loaded model, the system blocks new model loading and may unload existing models. These constraints are not imposed after the fact. They are checked before every generation request and enforced continuously during streaming.
Provider selection is exposed to the user. The system checks device compatibility, operating system version, Neural Engine availability, physical memory, and available storage before declaring a provider available. The user always knows which provider is active, which mode is selected, and what constraints are in effect. There is no silent fallback without visibility.
Implementation claims: The dual-mode retrieval parameters, thermal fallback logic, memory-gated model loading, and provider availability checking are all implemented in the current system. Device compatibility checking includes hardware model identification, OS version verification, and real-time resource validation.
Design claims: The specific retrieval thresholds and chunk limits represent current tuning parameters, not experimentally optimised values. They are subject to adjustment as real-world usage data accumulates.
4. Local Memory and Retrieval-Augmented Generation
The memory system in LiberaGPT is not latent. It is an explicit, structured pipeline that converts user-imported documents into retrievable evidence for generation. The pipeline operates entirely on-device and consists of five stages: parsing, chunking, embedding, storage, and retrieval.
Parsing.The system accepts PDF documents and images. PDF content is extracted with full-text analysis including page structure and metadata. Images are processed through optical character recognition using the platform's Vision framework with language correction, yielding structured text from photographed documents, handwritten notes, screenshots, and similar visual material. Plain text files are ingested directly. The parser detects file type and routes to the appropriate extraction path automatically.
Chunking. Extracted text is divided into segments of controlled size with overlap between adjacent chunks to preserve contextual continuity across segment boundaries. The chunker respects sentence boundaries and handles abbreviations, decimal numbers, and other edge cases that would otherwise produce malformed segments. Each chunk carries metadata including word count, token estimate, character offsets, and its position within the source document.
Embedding. Each chunk is converted into a dense vector representation suitable for similarity comparison. Embedding is performed in batches to manage memory consumption during ingestion of large documents. The embedding model targets 384-dimensional vectors compatible with cosine similarity search, following the MiniLM family of compact sentence transformers.
Storage.Vectors and their associated chunk metadata are persisted in a local database with full ACID transaction guarantees. The database operates in write-ahead logging mode for crash safety and concurrent read access. All stored data is protected under the platform's strongest file encryption, rendering it inaccessible when the device is locked. Foreign key constraints ensure referential integrity between documents, chunks, and their vector representations. Deletion of a document cascades through all dependent records automatically.
Retrieval. When a user sends a query, the system embeds the query, searches the vector store for the most similar chunks, loads the corresponding document metadata, and injects the retrieved content into the generation prompt. The number of chunks retrieved and the minimum similarity threshold are governed by the active mode (Fast or Deep) and further constrained by the current thermal state of the device. Retrieved content is presented to the model as reference material under explicit instructions not to treat it as authoritative instruction. Source citations are displayed to the user alongside the generated response, showing which documents were consulted and the relevance score of each retrieved chunk.
Query results are cached to avoid redundant embedding computation for repeated or similar queries within a session. The semantic search engine maintains a bounded cache that is cleared when the underlying document corpus changes.
Verified claims: Retrieval-augmented generation as a method for grounding language model output in external evidence is established by Lewis et al. MiniLM-class sentence transformers are a documented family of compact embedding models with published benchmarks.
Implementation claims: The five-stage pipeline (parse, chunk, embed, store, retrieve), ACID-compliant local storage with write-ahead logging, cascading deletes, file encryption at rest, sentence-aware chunking, batch embedding, query caching, and per-chunk source citation display are all implemented in the current system. The Vision framework OCR integration for image-based document ingestion is functional.
5. Thermal-Aware Computation
An iPhone is not a server. It has no fan, no vapour chamber, and no active cooling of any kind. Its thermal headroom is limited to what passive dissipation provides. Any system that runs sustained inference on a phone without monitoring thermal state is building on unstable ground. The device will throttle the CPU, GPU, and Neural Engine as thermal pressure rises, and the user will experience this as unpredictable slowdowns, degraded output quality, or application termination. LiberaGPT treats thermal management as a first-class system concern, not a background optimisation.
The system monitors the device's thermal state continuously at runtime and classifies it into operational tiers. Under nominal and fair thermal conditions, the system operates at full capacity: unrestricted retrieval depth, unrestricted generation length, and full model access. As thermal pressure increases to a serious level, the system reduces retrieval depth, introduces inter-token generation delays to lower sustained load, defers non-critical embedding operations, and restricts the number of context chunks injected into prompts. Under critical thermal conditions, the system blocks heavy inference entirely and can force a mode downgrade from Deep to Fast to protect the device.
Memory management operates on a parallel axis. The system tracks available device memory in real time using operating system primitives, maintains a registry of loaded model footprints, and enforces hard thresholds for model loading and retention. When available memory drops below a warning threshold, the system blocks new model loads. When memory pressure reaches critical levels, the system actively unloads models by executing registered cleanup callbacks. When the application moves to the background, all models are unloaded immediately. When the operating system issues a memory warning, the system responds by unloading all registered models without delay.
A reasoning system that is unaware of its own thermal and memory state cannot make reliable promises to the user about response quality, latency, or availability. LiberaGPT exposes its current thermal state, memory pressure level, and active constraints to the user through a diagnostics surface. The user can see exactly what the system can and cannot do at any moment and why.
Verified claims: iOS devices do throttle CPU, GPU, and Neural Engine performance under thermal pressure. The ProcessInfo thermal state API and os_proc_available_memory function are documented Apple platform capabilities. Memory warnings are a documented system signal on iOS.
Implementation claims: Real-time thermal monitoring with tiered quality adaptation, memory-pressure-aware model lifecycle management with registered unload callbacks, background model unloading, and user-facing diagnostics for thermal and memory state are all implemented in the current system. The system instruments all critical paths with OS-level signposts for profiling in Instruments.
6. Interface Doctrine
A mobile reasoning system fails first in the interface. If the user cannot see state, cannot distinguish memory from model output, or cannot predict the consequences of import and deletion, then technical sophistication is wasted. LiberaGPT adopts a severe visual grammar: a quiet neutral field, sparse accent colour for state, a single button language, a single card language, and rigid typography on an 8-point spatial grid. The purpose is minimising cognitive switching costs in a context where the primary task is thought, not interface interpretation.
The main screen contains only the core surfaces: message transcript, text composer, document import, voice input, mode selection, and settings access. The visible hierarchy answers three questions at all times. What mode is active. What the system is doing. What local evidence supports the answer. Any element that does not improve one of those three disclosures is treated as visual debt.
Voice input operates through two interaction patterns: tap to toggle recording, and press-and-hold for push-to-talk. During recording, a real-time waveform visualisation reflects audio level, a duration counter tracks elapsed time, and partial transcripts appear as speech is recognised. Voice activity detection can automatically stop recording after sustained silence, configurable by the user. The voice subsystem targets on-device speech recognition using the WhisperKit framework, preserving the local-first thesis by avoiding remote transcription as an unnecessary privacy dependency.
Every assistant response carries metadata: which provider generated it, which mode was active, whether retrieval was used, and how many document chunks were consulted. When retrieval is active, source citation chips appear below the response showing the document title and relevance score of each consulted chunk. The user can tap any citation to inspect the retrieved passage. This is how a small local model earns trust: by showing what it consulted, not by concealing its limits.
Settings are part of the product, not a maintenance annex. The settings surface exposes provider selection with real-time availability indicators, mode selection, retrieval parameters (depth and minimum relevance threshold), voice configuration, haptic feedback controls, and privacy settings. A dedicated diagnostics view reveals active provider capabilities, token throughput, thermal state, memory pressure, storage statistics, and a full local error log categorised by subsystem. A separate data management surface provides explicit destructive controls—clear conversations, delete imported documents, purge error logs, reset settings to defaults—each behind a confirmation dialog. Apple requires a privacy policy URL and privacy practice disclosure in App Store Connect. That external compliance burden implies an internal design requirement: settings must clearly reveal what the system stores, how it behaves, and how the user can destroy their data.
Accessibility is structural. Every interactive element carries a VoiceOver label and hint describing its function and the gesture required to activate it. Message rows announce their content, provider, mode, and source count to assistive technology. Reduced motion preferences are respected throughout: animations, transitions, and waveform visualisation all degrade gracefully. Haptic feedback—categorised into light, medium, success, warning, and error intensities—is applied at meaningful state transitions and can be disabled globally. Dynamic type support ensures the interface scales with the user's system font size preference.
Verified claims: Apple requires a privacy policy URL and app privacy disclosure in App Store Connect. WhisperKit is a documented on-device speech recognition framework for Apple platforms. Apple's Human Interface Guidelines and accessibility documentation describe the VoiceOver, reduced motion, and dynamic type requirements referenced here.
Implementation claims: The 8-point spatial grid, source citation chips with relevance scores, provider metadata on responses, diagnostics view with thermal and memory state, data management with cascading deletion, VoiceOver labels on all interactive elements, reduced motion support, haptic feedback with intensity classification, dynamic type scaling, dual voice input patterns (tap and push-to-talk), waveform visualisation, and configurable voice activity detection are all implemented in the current system.
7. Privacy, Safety, and Governance
A local-first system does not become safe merely by being local. It acquires a different attack surface. Imported documents can contain hostile instructions. Memory can become contaminated. Unsafe outputs can still be generated entirely on-device. For this reason, safety in LiberaGPT is architectural rather than rhetorical. The system implements concrete control points at specific boundaries in the inference pipeline.
Retrieval boundary sanitisation.Before any retrieved document content enters the generation prompt, it passes through a sanitisation layer that strips known instruction markers used by major language model families—role delimiters, system prompt tags, instruction formatting tokens, and similar control sequences. This prevents imported documents from injecting false role boundaries or system-level instructions into the prompt stream. The sanitisation operates on a pattern-matching basis covering the instruction formats of widely deployed model families. Retrieved content is further bracketed by explicit directives instructing the model to treat it as reference material only and to disregard any embedded commands, role overrides, or instruction-like content found within it.
Instruction hierarchy.The prompt construction follows a strict ordering: system policy first, then mode-specific operational instructions, then sanitised retrieved context, then user input. This ordering is maintained consistently across both Fast and Deep modes. The rationale is direct: prompt injection attacks exploit the fact that natural language data and natural language instructions are processed in the same channel, and OWASP now treats prompt injection as a primary risk class for language model applications. By establishing a fixed ordering and sanitising the untrusted layer (imported documents), the system reduces the attack surface without relying solely on the model's ability to distinguish instruction from data.
Data minimisation and lifecycle. The UK GDPR principles of purpose limitation, data minimisation, storage limitation, integrity, confidentiality, and accountability remain directly relevant even when data stays on-device. Local storage, diagnostic logs, and deletion flows still create a regulated data lifecycle. LiberaGPT stores only the data necessary for its scoped functions. Conversations are persisted in a local database with ACID guarantees. Error logs are maintained in a bounded circular buffer with a fixed maximum entry count, persisted locally under full file encryption, and clearable by the user at any time. All diagnostic logging uses privacy-annotated output that redacts message content in release builds. There is no telemetry, no remote error reporting, and no analytics collection of any kind.
Device security alignment.Apple's platform security documentation states that Data Protection secures data stored in flash storage on Apple devices and that each file is protected with a unique per-file key wrapped by class keys according to access requirements. LiberaGPT aligns its storage design with this model. The local database and error log files use the strongest available file protection class, rendering all persistent data inaccessible when the device is locked. The system does not implement a separate application-layer encryption scheme. Security is strongest when the app composes with the platform's own trust model rather than inventing ad hoc alternatives.
Deletion semantics.Every category of user data—conversations, imported documents, error logs, and application settings—can be destroyed independently through the data management surface. Conversation deletion cascades through all associated messages and attachments via foreign key constraints enforced at the database level. Document deletion cascades through all associated chunks and vector embeddings. Settings reset restores defaults without affecting stored conversations or documents. All deletions are immediate and permanent. There is no soft delete, no trash, and no recovery mechanism. The user's instruction to destroy data is executed without ambiguity.
Verified claims: Prompt injection is a documented and material security risk in deployed language applications. UK GDPR principles do include purpose limitation, minimisation, storage limitation, integrity, confidentiality, and accountability. Apple platform security does describe Data Protection and per-file keying on device. OWASP does classify prompt injection as a primary risk for LLM applications.
Implementation claims: Retrieval boundary sanitisation covering major model instruction formats, fixed prompt construction ordering, bounded circular error log with privacy annotations, complete file protection on persistent data, cascading deletion via foreign key constraints, and independent destruction of each data category are all implemented in the current system. The absence of telemetry and remote reporting is a verified design constraint, not an omission.
Limitations disclosed: The sanitisation layer operates on pattern matching and explicit textual instruction. It is a heuristic defence, not a formal guarantee against all adversarial prompt constructions. The system does not currently implement language-model-based prompt injection detection or output content classification. These represent areas for future hardening.
8. Strategic Conclusion
The opportunity opened by current handset hardware is narrower and more serious than the market usually admits. The question is no longer whether a phone can run a language model. It can. The deeper question is whether a phone can host a trustworthy personal reasoning system whose virtues arise from boundedness rather than from scale. LiberaGPT is built on the conviction that the answer is yes, provided the product is constructed as a local-first composition of compact inference, explicit memory, disciplined retrieval, visible provenance, thermal-aware resource management, and platform-aligned security.
The system is not a chatbot. It is not a miniature cloud service. It is a cognitive interface for the device itself. Its technical novelty lies in the union of several capabilities that do not typically coexist in mobile software: on-device retrieval-augmented generation with full document parsing and source citation, real-time thermal and memory state awareness that governs inference behaviour, dual-mode execution that adapts retrieval depth and provider selection to task demands, cooperative cancellation across every boundary from voice input through generation streaming, and a governance-first interface that never hides state from the user.
Each of these capabilities is individually understood in the research literature. Their union in a single native iPhone application, operating entirely on-device under real hardware constraints, is not. That union is the contribution. It is what makes the system defensible: not any single technique, but the discipline of composing them correctly under the thermal, memory, and interface constraints of a phone-class device.
Apple's current stack makes this technically plausible, and compact open models such as Gemma 3n, Qwen 2.5, and Phi-3 make it strategically extensible beyond any single vendor's model roadmap. The system is designed so that inference providers can be added, replaced, or removed without altering the orchestration, memory, or interface layers. The model layer changes faster than any other layer in a language product. The separation accounts for that.
If the union of local execution, explicit memory, thermal awareness, and interface transparency is preserved, the product occupies a position that cloud-dependent assistants cannot replicate and that toy on-device demos do not approach. If that union is abandoned in favour of ornamental interaction or indiscriminate model breadth, the project dissolves into the general noise of assistant software. The system is intentionally narrow because narrowness is the price of seriousness on-device.
References
- Vaswani et al., Attention Is All You Need, NeurIPS 2017.
- Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020.
- Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022.
- Apple, iPhone 17 Pro and iPhone 17 Pro Max Technical Specifications.
- Apple, Updates to Apple's On-Device and Server Foundation Models, 2025.
- Apple Developer, Foundation Models Documentation.
- Apple Developer, Core ML.
- Apple Machine Learning Research, On-Device Llama 3.1 with Core ML, 2024.
- Google AI for Developers, Gemma 3n Model Card, 2025.
- Argmax, WhisperKit.
- Liu et al., Prompt Injection Attack against LLM-Integrated Applications, 2023.
- OWASP, Top 10 for Large Language Model Applications and LLM01 Prompt Injection.
- ICO, A Guide to the Data Protection Principles.
- Apple Developer, App Privacy Details and Manage App Privacy.
- Apple Platform Security, Data Protection Overview and Data Protection in Apple Devices.
- Reimers & Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, EMNLP 2019.
- Wang et al., MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, NeurIPS 2020.