LiberaGPT: Technical Architecture for Bounded, Accountable AI on iPhone
Stephen J. Pereira
March 2026
A Local-First Cognitive Interface for iPhone iOS (iPhone 17 Pro Architecture)
Abstract
Modern iPhones ship with serious compute capabilities. The iPhone 17 Pro runs the A19 Pro chip with a 16-core Neural Engine, and Apple provides an on-device foundation model through the Foundation Models framework. Core ML handles compression and stateful execution for local deployment.
LiberaGPT implements a layered runtime: compact on-device language models and thermal-aware computation routing. The app monitors thermal state and memory pressure in real time and adapts inference accordingly in the background. Generation streams token by token with cooperative cancellation at every boundary. Nothing leaves the device.
Verified claims: The hardware description of the iPhone 17 Pro and its A19 Pro chipset, and the availability of native Core ML support for on-device generative deployment, are directly grounded in Apple documentation. The statements about transformer foundations and low-precision inference are likewise grounded in primary technical sources. The described capabilities reflect the implemented codebase as of the launch of iOS 26 in late 2025.
Design thesis: The claim that LiberaGPT defines a new category of personal reasoning surface is not only a design position but an engineering conclusion drawn from the constraints and affordances of current handset hardware, not a market prediction.

Advanced Coding on iPhone
Watch Nvidia's AceInstruct, a lightweight 1.5 billion parameter language model, generate a simple HTML/CSS website from a single conversational prompt. No server calls. No network dependency. The entire generation happens locally on the iPhone 17 Pro's A19 Pro chip, streaming tokens in real time as the Neural Engine processes the request.
The model constructs semantic HTML markup and responsive CSS layouts while the app monitors device thermals, adjusting inference speed to prevent throttling. What you're seeing is the intersection of language understanding, code generation, and thermal-aware computation running entirely within the constraints of a handheld device. The output is immediate, private, and consistent, and it never leaves the device's local memory.
Model: Nvidia AceInstruct (1.5B parameters, 4-bit quantisation)
1. Problem Setting
Mobile computation has matured faster than mobile software architecture. Current handsets can run serious local inference, but most products still treat the device as a terminal for remote computation. Users want immediacy, privacy, and continuity of context. Cloud-first systems introduce network dependency, external storage, and unclear memory semantics. Core ML documentation emphasises that on-device models remove network requirements while keeping user data private.
The technical problem is building a mobile cognitive interface that respects device limits without becoming trivial. Apple describes the on-device model at the core of Apple Intelligence as approximately 3 billion parameters, effective for summarisation, extraction, refinement, and short dialogue, but not designed as a chatbot for general world knowledge. The correct design is an application-specific reasoning system with local memory and controlled scope.
A phone-class inference system must handle thermal dynamics, memory pressure, battery budget, and transparent operation. Ignore thermal state and the device throttles unpredictably. Ignore memory pressure and it crashes. Hide the reasoning path and the user stops trusting the output.
LiberaGPT optimises for trust, clarity, and useful reasoning under device constraints.
Verified claims: Core ML does explicitly position strict on-device execution as a privacy and responsiveness advantage. Apple does explicitly describe the on-device foundation model as strong at text tasks but not intended as a general world knowledge chatbot. Thermal throttling is a documented constraint on sustained workloads on iPhone-class devices.
Design criteria: The phrases “maximum trust per watt” and “maximum useful reasoning per byte of context” are conceptual design criteria for LiberaGPT, not measured benchmarks.
2. Technical Basis
Open model families offer trade-offs in context length, parameter count, and quantisation format. LiberaGPT currently supports SmolLM3, Phi-4 Mini, StableLM Zephyr, EXAONE Deep, and AceInstruct.
Apple's on-device model provides native capability and stable deployment on Apple silicon. Open compact models provide tunability, domain specificity, and vendor independence. LiberaGPT separates inference models from orchestration and interface so models can be swapped, upgraded, or run in parallel without rewriting the product surface. Core ML supports model compression, stateful execution, and efficient transformer operations.
Dettmers et al. demonstrate that low-precision matrix multiplication reduces inference memory demands while preserving performance. Apple's deployment guidance for Apple silicon treats compression as a first-class concern with explicit trade-offs among storage size, latency, power, and accuracy. LiberaGPT implements quantisation-aware model management. Each publicly available model pack declares its compression format, memory footprint, and expected throughput. The app validates these against available device resources before loading.
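As a sketch of what that pre-load validation could look like, assuming the documented os_proc_available_memory call (the ModelPack type, its fields, and the safety margin are illustrative, not LiberaGPT's actual API):

```swift
import Foundation
import os

// Illustrative model pack descriptor; the names are hypothetical,
// not LiberaGPT's actual API.
struct ModelPack {
    let name: String
    let compressionFormat: String     // e.g. "Q4_K"
    let memoryFootprintBytes: Int
    let expectedTokensPerSecond: Double
}

enum LoadDecision {
    case permitted
    case blocked(reason: String)
}

// Compare the declared footprint against what the OS reports as available,
// with an assumed safety margin for KV cache and activations.
func validateLoad(of pack: ModelPack, safetyMargin: Double = 1.2) -> LoadDecision {
    let available = os_proc_available_memory()
    let required = Int(Double(pack.memoryFootprintBytes) * safetyMargin)
    guard available >= required else {
        return .blocked(reason: "\(pack.name) requires \(required) bytes; \(available) available")
    }
    return .permitted
}
```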
Verified claims: Apple's framework capabilities and low-precision inference as a memory efficiency strategy are all grounded in the cited sources. The listed open model families are publicly documented with published model cards.
Implementation claims: The model pack management system with declared footprints, compression formats, and device validation is implemented in the current codebase. Model packs include SHA-256 integrity verification on download.
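A minimal sketch of such an integrity check using CryptoKit's SHA256; the function name and manifest flow are assumptions, not the shipping code:

```swift
import CryptoKit
import Foundation

// Check a downloaded pack against the digest published in its manifest.
// For multi-gigabyte files, a streaming hash over file chunks would be
// preferable to loading the whole pack into memory at once.
func verifyIntegrity(of fileURL: URL, expectedHexDigest: String) throws -> Bool {
    let data = try Data(contentsOf: fileURL)
    let digest = SHA256.hash(data: data)
    let hex = digest.map { String(format: "%02x", $0) }.joined()
    return hex == expectedHexDigest.lowercased()
}
```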
3. Model Specialisation Over Scale
The dominant paradigm in commercial AI is consolidation: train a single model with hundreds of billions of parameters until it handles every task acceptably. GPT-4 class models are estimated at roughly 1.8 trillion parameters. Claude, Gemini, and their peers occupy similar scales. These models require datacenter-grade hardware, drawing substantial power under load across clusters of accelerators, each carrying tens of gigabytes of high-bandwidth memory. They cannot yet be deployed locally to a phone, a tablet, or a lightweight laptop.
LiberaGPT takes the opposite approach: multiple compact models, each selected for a specific task domain. Rather than compressing one enormous generalist into a device that cannot sustain it, the app maintains a catalogue of models between 1 and 4 billion parameters, each quantised to 4-bit precision, each occupying between 0.9 and 2.5 gigabytes on disk. The user loads whichever model matches their current task. Only one model occupies memory at a time.
A 1.5 billion parameter model trained specifically for code generation (AceInstruct) will produce better-structured code output than a 3 billion parameter generalist asked to do the same. A 2.4 billion parameter model trained on mathematical reasoning (EXAONE Deep) will handle formal logic and step-by-step derivation more reliably than a larger model that treats mathematics as one task among thousands. Specialisation at small scale routinely outperforms generalism at moderate scale on the tasks each specialist was trained for.
The practical consequence is that a phone carrying five specialist models has broader effective capability than a phone carrying one generalist of equivalent total size. SmolLM3 handles extended conversational context. Phi-4 Mini handles analytical reasoning. EXAONE Deep handles mathematics and coding logic. AceInstruct handles code generation. StableLM Zephyr handles lightweight quick responses. The user selects the right tool for the job. The app manages the rest: model loading, memory lifecycle, context window budgeting, and thermal adaptation.
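A minimal sketch of the single-model-in-memory lifecycle described above; LanguageModel and ModelSlot are illustrative stand-ins for whatever runtime handle the app actually holds:

```swift
import Foundation

// Stand-in protocol for the app's actual model runtime handle.
protocol LanguageModel {
    init(contentsOf url: URL) throws
    func unload()
}

// At most one model's weights are resident at any time: loading a
// replacement first releases the current occupant.
final class ModelSlot<Model: LanguageModel> {
    private var current: Model?

    func swap(to url: URL) throws -> Model {
        current?.unload()
        current = nil                  // release before allocating the replacement
        let model = try Model(contentsOf: url)
        current = model
        return model
    }

    func unloadAll() {
        current?.unload()
        current = nil
    }
}
```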
This architecture also decouples the product from any single model vendor's release schedule. When a better coding model ships, it replaces AceInstruct without touching the reasoning or conversational models. When a new efficient architecture emerges at the 2 billion parameter class, it slots into the existing catalogue. The model layer evolves independently of the interface and orchestration layers. In a field where model capabilities shift quarterly, this modularity is a structural advantage that monolithic architectures cannot replicate.
Verified claims: GPT-4 class parameter counts and datacenter hardware requirements are documented in published technical reports and industry analyses. The performance advantages of task-specific fine-tuned models over larger generalists on domain tasks are supported by published benchmark comparisons across model families. Quantisation to 4-bit precision (Q4_K) as a viable deployment strategy for compact models is documented in Dettmers et al. and Apple's Core ML deployment guidance.
Implementation claims: The five-model catalogue, single-model-in-memory lifecycle, per-model memory footprint declarations, and hot-swap model loading are all implemented in the current app. Model selection is a user-facing choice, not an automated routing decision.
4. Thermal-Aware Computation
iPhones have no fan, no vapour chamber, and no active cooling. Thermal headroom is limited to passive dissipation. Sustained inference without monitoring thermal state leads to CPU, GPU, and Neural Engine throttling, manifesting as unpredictable slowdowns, degraded output quality, or application termination. LiberaGPT treats thermal management as a first-class concern.
The app monitors thermal state continuously and classifies it into operational tiers. Under nominal and fair thermal conditions, the app operates at full capacity: unrestricted generation length and full model access. As thermal pressure increases to a serious level, the app introduces inter-token generation delays to lower sustained load and restricts generation length. Under critical thermal conditions, the app blocks heavy inference and can force a mode downgrade from Deep to Fast.
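ProcessInfo's documented thermal states map directly onto these tiers. The sketch below is illustrative: the delay and token-limit values are assumptions, not the app's tuned parameters.

```swift
import Foundation

// Per-tier inference policy; the numbers here are placeholders.
struct InferencePolicy {
    var interTokenDelay: TimeInterval   // seconds inserted between tokens
    var maxTokens: Int?                 // nil = unrestricted
    var allowHeavyInference: Bool
}

func policy(for state: ProcessInfo.ThermalState) -> InferencePolicy {
    switch state {
    case .nominal, .fair:
        return InferencePolicy(interTokenDelay: 0, maxTokens: nil, allowHeavyInference: true)
    case .serious:
        return InferencePolicy(interTokenDelay: 0.05, maxTokens: 512, allowHeavyInference: true)
    case .critical:
        return InferencePolicy(interTokenDelay: 0.2, maxTokens: 128, allowHeavyInference: false)
    @unknown default:
        return InferencePolicy(interTokenDelay: 0.05, maxTokens: 256, allowHeavyInference: true)
    }
}

// React to thermal transitions as they happen.
let observer = NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil, queue: .main
) { _ in
    let current = policy(for: ProcessInfo.processInfo.thermalState)
    // apply `current` to the active generation loop
    _ = current
}
```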
Memory management operates in parallel. The app tracks available device memory in real time using OS primitives, maintains a registry of loaded model footprints, and enforces hard thresholds for model loading and retention. When available memory drops below a warning threshold, the app blocks new model loads. When memory pressure reaches critical levels, the app actively unloads models via registered cleanup callbacks. When the application moves to the background, all models are unloaded immediately. When the OS issues a memory warning, the app unloads all registered models.
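A minimal sketch of this memory governance, using the documented os_proc_available_memory call and UIKit notifications; the threshold value and callback registry are assumptions, not the app's tuned implementation:

```swift
import UIKit
import os

// Registry of unload callbacks keyed by model name (names illustrative).
final class MemoryGovernor {
    static let shared = MemoryGovernor()
    private var unloadCallbacks: [String: () -> Void] = [:]
    private var observers: [NSObjectProtocol] = []
    let warningThresholdBytes = 500 * 1024 * 1024   // assumed threshold

    func register(model name: String, unload: @escaping () -> Void) {
        unloadCallbacks[name] = unload
    }

    // Block new loads when available memory falls below the warning line.
    var canLoadNewModel: Bool {
        os_proc_available_memory() > warningThresholdBytes
    }

    func unloadEverything() {
        unloadCallbacks.values.forEach { $0() }
        unloadCallbacks.removeAll()
    }

    private init() {
        let center = NotificationCenter.default
        // OS memory warning: drop every registered model.
        observers.append(center.addObserver(
            forName: UIApplication.didReceiveMemoryWarningNotification,
            object: nil, queue: .main) { [weak self] _ in
            self?.unloadEverything()
        })
        // Backgrounding: unload immediately, as described above.
        observers.append(center.addObserver(
            forName: UIApplication.didEnterBackgroundNotification,
            object: nil, queue: .main) { [weak self] _ in
            self?.unloadEverything()
        })
    }
}
```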
All thermal and memory management runs transparently in the background. The user is never interrupted by technical diagnostics. The app adjusts its own behaviour automatically based on device conditions.
Verified claims: iOS devices do throttle CPU, GPU, and Neural Engine performance under thermal pressure. The ProcessInfo thermal state API and os_proc_available_memory function are documented Apple platform capabilities. Memory warnings are a documented system signal on iOS.
Implementation claims: Real-time thermal monitoring with tiered quality adaptation, memory-pressure-aware model lifecycle management with registered unload callbacks, and background model unloading are all implemented in the current app. The app instruments all critical paths with OS-level signposts for profiling in Instruments.
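For illustration, a signposted interval around a generation call might look like the following, using the documented OSSignposter API; the subsystem and interval names are placeholders:

```swift
import os

// Emits begin/end signposts visible in Instruments' os_signpost track.
let signposter = OSSignposter(subsystem: "com.example.liberagpt", category: "inference")

func generateWithSignpost(_ body: () throws -> String) rethrows -> String {
    let id = signposter.makeSignpostID()
    let state = signposter.beginInterval("token-generation", id: id)
    defer { signposter.endInterval("token-generation", state) }
    return try body()
}
```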
5. Silicon Alignment: Composing with the A19 Pro
The A19 Pro is not a general-purpose processor that happens to run machine learning. It is a heterogeneous compute system with dedicated silicon for neural network inference. The chip contains a 16-core Neural Engine rated at over 35 TOPS (trillions of operations per second) for integer and low-precision workloads, a 6-core GPU with Metal 3 support, and a multi-core CPU cluster. These are not shared resources competing for the same transistors. They are physically distinct execution units with separate power domains and scheduling characteristics.
LiberaGPT's inference workload maps directly onto this architecture. A 4-bit quantised transformer model is dominated by integer matrix multiplications, exactly the operation class the Neural Engine is designed to accelerate. Core ML compiles models into chip-specific execution plans that distribute operations across the Neural Engine, GPU, and CPU based on the characteristics of each layer. Attention heads may run on the GPU via Metal compute shaders while feedforward layers route through the Neural Engine's integer pipelines. The developer does not manually schedule this. The Core ML runtime and the chip's hardware scheduler handle dispatch.
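A minimal sketch of handing dispatch to Core ML rather than pinning execution to one unit, using the documented MLModelConfiguration API:

```swift
import CoreML
import Foundation

// Load a compiled model (.mlmodelc) and let Core ML's execution plan
// distribute layers across the Neural Engine, GPU, and CPU.
func loadModel(at url: URL) throws -> MLModel {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = .all   // permit ANE + GPU + CPU dispatch
    return try MLModel(contentsOf: url, configuration: configuration)
}
```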
The critical architectural advantage is unified memory. Unlike desktop and server systems where data must be copied between CPU RAM and GPU VRAM across a PCIe bus, the A19 Pro shares a single memory pool across all compute units. A model loaded once is accessible to the Neural Engine, GPU, and CPU without transfer overhead. For a 2.5 gigabyte quantised model, this eliminates what would otherwise be the dominant bottleneck in inference startup: the time and energy spent moving weights between memory domains.
This hardware design also explains why 4-bit quantisation is not merely a compression compromise but an architectural alignment. The Neural Engine's integer compute paths are purpose-built for low-precision arithmetic. A Q4_K quantised model does not simply use less memory. It executes on silicon that was designed for exactly its precision class, achieving higher throughput per watt than higher-precision formats running on the same chip. The result is faster token generation with lower thermal output, which directly extends the sustainable inference window before thermal throttling intervenes.
LiberaGPT composes with this hardware rather than fighting it. The app does not implement custom compute kernels or bypass the platform runtime. It uses Core ML for model execution, Metal for GPU acceleration where Core ML dispatches it, and the Neural Engine through Apple's own scheduling. When the hardware improves, whether through higher TOPS counts, wider memory bandwidth, or more efficient power delivery, the app benefits automatically without code changes. This is a deliberate design constraint: stay aligned with the platform's execution model, and inherit its improvements.
Verified claims: The A19 Pro's heterogeneous architecture with dedicated Neural Engine, GPU, and CPU clusters is documented in Apple's technical specifications. Unified memory architecture eliminating CPU-GPU transfer overhead is a documented feature of Apple silicon since M1/A-series. Core ML's ability to compile models into chip-specific execution plans with automatic dispatch across compute units is documented in Apple's Core ML and WWDC materials. The Neural Engine's integer compute capabilities and TOPS ratings are published hardware specifications.
Design claims: The characterisation of 4-bit quantisation as an architectural alignment rather than purely a compression strategy is an engineering interpretation supported by the hardware design but not independently benchmarked in this paper. The claim that composing with platform runtimes rather than custom kernels produces better long-term outcomes is a design position informed by Apple's deployment guidance.
6. Privacy, Safety, and Governance
Local-first systems acquire a different attack surface. Safety in LiberaGPT is implemented through concrete control points at specific boundaries in the inference pipeline.
Prompt construction follows a strict ordering: system policy first, then model-specific operational instructions, then user input. This ordering is maintained across all model configurations. Prompt injection attacks exploit the fact that natural language data and natural language instructions are processed in the same channel. OWASP treats prompt injection as a primary risk class for language model applications. By establishing fixed ordering, the app reduces the attack surface without relying solely on the model's ability to distinguish instruction from data.
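A minimal sketch of that fixed ordering; the section tags are illustrative, since real chat templates vary per model family:

```swift
// Policy first, then model instructions, then user input,
// concatenated in one direction only.
struct PromptBuilder {
    let systemPolicy: String          // fixed, app-controlled
    let modelInstructions: String     // per-model operational guidance

    func prompt(for userInput: String) -> String {
        """
        <|system|>
        \(systemPolicy)
        \(modelInstructions)
        <|user|>
        \(userInput)
        <|assistant|>
        """
    }
}
```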
Data protection principles of purpose limitation (for example, under UK GDPR), data minimisation, storage limitation, integrity, confidentiality, and accountability remain relevant when data stays on-device. Local storage and deletion flows create a regulated data lifecycle. LiberaGPT stores only the data necessary for its scoped functions. Conversations are persisted in a local database with ACID guarantees. Error logs are maintained in a bounded circular buffer with a fixed maximum entry count, persisted locally under file encryption, and clearable by the user. All diagnostic logging uses privacy-annotated output that redacts message content in release builds. There is no telemetry, no remote error reporting, and no analytics collection.
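For illustration, privacy-annotated logging with the unified logging system looks like the following; the subsystem and category strings are placeholders:

```swift
import os

let log = Logger(subsystem: "com.example.liberagpt", category: "errors")

func recordError(_ message: String, conversationTitle: String) {
    // The static text survives in the log; values marked .private are
    // redacted in release builds unless the device opts into private-data logging.
    log.error("generation failed for \(conversationTitle, privacy: .private): \(message, privacy: .private)")
}
```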
The local database and error log files use the strongest available file protection class, rendering all persistent data inaccessible when the device is locked. The app does not implement a separate application-layer encryption scheme. Security is strongest when the app composes with the platform's trust model rather than inventing ad hoc alternatives.
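A minimal sketch of applying that protection class with documented Foundation APIs:

```swift
import Foundation

// Write with the strongest Data Protection class: the file is encrypted
// with a key that is unavailable while the device is locked.
func persist(_ data: Data, to url: URL) throws {
    try data.write(to: url, options: [.atomic, .completeFileProtection])
}

// For an existing file, the attribute can also be set after the fact.
func protectExistingFile(at url: URL) throws {
    try FileManager.default.setAttributes(
        [.protectionKey: FileProtectionType.complete],
        ofItemAtPath: url.path
    )
}
```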
Every category of user data can be destroyed independently through the data management surface. This includes conversations, error logs, and application settings. Conversation deletion cascades through all associated messages via foreign key constraints enforced at the database level. Settings reset restores defaults without affecting stored conversations. All deletions are immediate and permanent. There is no soft delete, no trash, and no recovery mechanism.
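As an illustration of cascade deletion enforced at the database level, assuming a SQLite-backed store (the tables and columns below are hypothetical, not LiberaGPT's published schema):

```swift
// SQLite only enforces foreign keys when the pragma is enabled per connection.
let schema = """
PRAGMA foreign_keys = ON;

CREATE TABLE IF NOT EXISTS conversation (
    id    TEXT PRIMARY KEY,
    title TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS message (
    id              TEXT PRIMARY KEY,
    conversation_id TEXT NOT NULL
        REFERENCES conversation(id) ON DELETE CASCADE,
    role            TEXT NOT NULL,
    content         TEXT NOT NULL
);

-- Deleting a conversation removes every associated message in one step:
-- DELETE FROM conversation WHERE id = ?;
"""
```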
Verified claims: Prompt injection is a documented and material security risk in deployed language applications. UK GDPR principles do include purpose limitation, minimisation, storage limitation, integrity, confidentiality, and accountability. Apple platform security does describe Data Protection and per-file keying on device. OWASP does classify prompt injection as a primary risk for LLM applications.
Implementation claims: Fixed prompt construction ordering, bounded circular error log with privacy annotations, complete file protection on persistent data, cascading deletion via foreign key constraints, and independent destruction of each data category are all implemented in the current app. The absence of telemetry and remote reporting is a verified design constraint, not an omission.
7. Strategic Conclusion
Current handset hardware can run language models. The question is whether a phone can host a trustworthy personal reasoning system whose virtues arise from boundedness rather than scale. LiberaGPT is built on the premise that it can, provided the product is constructed as a local-first composition of compact inference, thermal-aware resource management, and platform-aligned security.
The technical contribution is the union of several capabilities that do not typically coexist in mobile software: a catalogue of specialist models selected per task rather than one generalist compressed to fit, real-time thermal and memory state awareness governing inference in the background, cooperative cancellation across every boundary in the generation pipeline, and composition with Apple silicon's heterogeneous compute architecture rather than against it.
Each of these capabilities is individually understood in the research literature. Their integration in a single native iPhone application, operating under real hardware constraints with real models producing real output, is the contribution. The app is defensible through the discipline of composing these techniques correctly under the thermal, memory, and compute constraints of a phone.
Apple's current stack makes this technically plausible. Compact open models make it strategically extensible beyond any single vendor's roadmap. The app is designed so that inference models can be added, replaced, or removed without altering the orchestration or interface layers. The model layer changes faster than any other layer in a language product. When better models appear at the 1 to 4 billion parameter class, they slot in immediately.
The name carries the thesis. Libera is Latin for free. Language models today are overwhelmingly bound to datacentres: vast facilities consuming megawatts of power, accessible only through network connections, governed by terms of service that change without notice. The models themselves are capable. The confinement is the problem. As quantisation techniques improve and silicon grows more efficient, the same models that today require a rack of accelerators will run on the devices people already carry. LiberaGPT is built for that transition. The architecture assumes that on-device capability will only increase, and that the correct engineering response is to build the orchestration, safety, and interface layers now so that each generation of hardware and each generation of compact models improves the product without rebuilding it.
The goal was never to replicate a cloud assistant on a phone. It was to build something a cloud assistant cannot be: private by design, capable through model diversity, and available anywhere, regardless of connectivity.
References
- Vaswani et al., Attention Is All You Need, NeurIPS 2017.
- Dettmers et al., LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022.
- Apple, iPhone 17 Pro and iPhone 17 Pro Max Technical Specifications.
- Apple, Updates to Apple's On-Device and Server Foundation Models, 2025.
- Apple Developer, Foundation Models Documentation.
- Apple Developer, Core ML.
- Apple Machine Learning Research, On-Device Llama 3.1 with Core ML, 2024.
- Liu et al., Prompt Injection Attack against LLM-Integrated Applications, 2023.
- OWASP, Top 10 for Large Language Model Applications and LLM01 Prompt Injection.
- ICO, A Guide to the Data Protection Principles.
- Apple Developer, App Privacy Details and Manage App Privacy.
- Apple Platform Security, Data Protection Overview and Data Protection in Apple Devices.