On-Device AI Models

LiberaGPT uses compact, optimised AI models that run fully offline on your iPhone. Download only the models you need. Each model has its own strengths and is tuned for the Neural Engine, with quantisation that balances quality and speed.

Choose the models you need, download them, and use them offline.

3D render of LiberaGPT

Available Models

Five quantised (Q4_K) models optimised for on-device inference via llama.cpp. Download what you need in Settings. Tuned for the A19 Pro chip.

SmolLM3 3B Instruct

3B params · 128K context · Q4_K_M · 1.9 GB · ~25 tok/sec

HuggingFace's compact but capable model with a massive context window. Suited to extended conversations, long-document analysis, and detailed reasoning. Balanced performance for most tasks.

Use case: Long context, extended conversations, document analysis
License: Apache 2.0
Model on HuggingFace →

Phi-4 Mini 3.8B Instruct

3.8B params · 128K context · Q4_K_M · 2.5 GB · ~18 tok/sec

Microsoft's latest Phi generation. Strong reasoning, instruction-following, and language understanding. Best-in-class quality. Deep Mode recommended.

Use case: Deep reasoning, complex instructions, high-quality responses
License: MIT
Model on HuggingFace →

StableLM Zephyr 1.6B

1.6B params · 4K context · Q4_K_S · 989 MB · ~28 tok/sec

Stability AI's lightweight instruction model. An excellent speed-to-quality ratio for everyday tasks. Minimal memory footprint, ideal for older devices or battery-sensitive workflows.

Use case: Fast Mode, battery efficiency, older devices
License: Apache 2.0
Model on HuggingFace →

EXAONE Deep 2.4B

2.4B params · 32K context · Q4_K_M · 1.6 GB · ~22 tok/sec

LG AI Research's bilingual (English/Korean) model. Strong multilingual and reasoning capabilities, optimised for technical and analytical tasks with extended context support.

Use case: Multilingual (Korean), technical tasks, analytical reasoning
License: Apache 2.0
Model on HuggingFace →

AceInstruct 1.5B

1.5B params · 128K context · Q4_K_M · 1.2 GB · ~26 tok/sec

An instruction-tuned variant with strong instruction-following. Reliable for structured outputs, task execution, and complex multi-step instructions. A good balance of size and capability.

Use case: Instruction following, structured tasks, multi-step workflows
License: Apache 2.0
Model on HuggingFace →
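As a rough cross-check on the download sizes listed above, a Q4_K_M file averages somewhere around 4.8 to 4.9 bits per weight (an assumed figure; the exact tensor mix varies by model, and real files also carry embeddings and metadata). A sketch of the back-of-envelope arithmetic, with that bits-per-weight value as an assumption:

```python
def approx_gguf_gb(params_billions: float, bits_per_weight: float = 4.85) -> float:
    """Order-of-magnitude GGUF file-size estimate for a quantised model.

    bits_per_weight=4.85 is an assumed average for Q4_K_M mixes,
    in which some tensors stay at higher precision; treat the
    result as an estimate only.
    """
    # params are in billions, so the result is in GB (billions of bytes)
    return params_billions * bits_per_weight / 8

# 3B at Q4_K_M comes out near the 1.9 GB listed for SmolLM3
print(round(approx_gguf_gb(3.0), 1))  # 1.8
```

The same formula lands near the listed sizes for the other models, which is about as much agreement as a bits-per-weight heuristic can give.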

Model Benchmarks

This benchmark evaluates the five small language models available in LiberaGPT under actual mobile inference conditions rather than cloud-hosted assumptions. The test context matters. LiberaGPT presents itself as a fully on-device iPhone system using 4-bit quantised GGUF models, with no cloud execution, and states that it is optimised for A18 and A19 Pro class devices with Neural Engine and Metal GPU acceleration. On the Apple side, iPhone 17 Pro is built on the A19 Pro, not the base A19, and Apple lists a 16-core Neural Engine together with Neural Accelerators integrated into each GPU core. The correct framing is therefore constrained local inference on iPhone 17 Pro hardware, not generic model performance in the abstract.

Method

Each model was tested against the same ten prompts across five categories: reasoning, factual knowledge, instruction following, code, and judgment under ambiguity. Each category contained two prompts. Every response was scored on eight criteria: factual correctness, reasoning validity, completeness, instruction compliance, clarity, precision, hallucination resistance, and efficiency. Category-specific weighting was then applied so that reasoning tasks rewarded logic, factual tasks rewarded accuracy and fabrication control, instruction tasks rewarded constraint obedience, code tasks rewarded technical correctness, and judgment tasks rewarded sound prioritisation.

Scores were normalised to 100. Category scores are the mean of the two prompts in that category. Overall score is the mean across all five category scores. Hard failures materially reduced scores where the response was truncated, contradicted itself, failed explicit formatting constraints, or produced content that was plainly wrong.
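The aggregation step described above (category score = mean of the two prompts, overall score = unweighted mean of the five category scores) can be sketched in a few lines. The SmolLM3 question-level numbers below are taken from the heatmap at the end of the post:

```python
from statistics import mean

# Question-level scores for SmolLM3, from the question-level heatmap.
smollm3 = {
    "Reasoning":   [99.4, 86.9],
    "Knowledge":   [77.9, 94.7],
    "Instruction": [78.8, 74.4],
    "Code":        [95.3, 95.3],
    "Judgment":    [90.9, 92.4],
}

# Category score = mean of the two prompts in that category;
# overall score = mean across the five category scores.
category_scores = {cat: mean(vals) for cat, vals in smollm3.items()}
overall = mean(category_scores.values())

print(round(overall, 1))  # 88.6, matching SmolLM3's reported overall
```

The per-criterion weighting happens one level earlier, inside each question score, so it does not appear in this aggregation step.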

Top Line Result

SmolLM3 was the strongest model in the benchmark and the only system that performed at a high level across all major categories without a major collapse. Phi-4 Mini ranked second and was the most controlled model on strict instruction-following tasks. AceInstruct ranked third and remained strong in code and general knowledge, but it was less reliable under tighter constraint pressure and weaker on the second reasoning problem.

EXAONE Deep showed real upside in knowledge and judgment, but repeated output-control failures materially reduced its operational score. StableLM Zephyr was the weakest model in the set, with major failures in basic reasoning and factual reliability.

Overall Weighted Score

SmolLM3 (3.0B): 88.6
Phi-4 Mini (3.8B): 82.3
AceInstruct (1.5B): 81.7
EXAONE Deep (2.4B): 63.4
StableLM Zephyr (1.6B): 48.6

Ranked by overall weighted performance across all five benchmark categories.

Scored Results

Model | Overall | Grade | Reasoning | Knowledge | Instruction | Code | Judgment
SmolLM3 (3.0B) | 88.6 | Best in cohort | 93.1 | 86.3 | 76.6 | 95.3 | 91.6
Phi-4 Mini (3.8B) | 82.3 | High-performing | 72.8 | 75.1 | 90.4 | 93.3 | 79.9
AceInstruct (1.5B) | 81.7 | High-performing | 70.6 | 85.7 | 78.8 | 92.5 | 80.7
EXAONE Deep (2.4B) | 63.4 | Limited utility | 64.1 | 84.3 | 38.1 | 49.3 | 81.3
StableLM Zephyr (1.6B) | 48.6 | Weak | 15.0 | 43.1 | 49.4 | 62.8 | 72.9

Category Comparison

Reasoning: SmolLM3 93.1 · Phi-4 Mini 72.8 · AceInstruct 70.6 · EXAONE Deep 64.1 · StableLM 15.0
Knowledge: SmolLM3 86.3 · Phi-4 Mini 75.1 · AceInstruct 85.7 · EXAONE Deep 84.3 · StableLM 43.1
Instruction: SmolLM3 76.6 · Phi-4 Mini 90.4 · AceInstruct 78.8 · EXAONE Deep 38.1 · StableLM 49.4
Code: SmolLM3 95.3 · Phi-4 Mini 93.3 · AceInstruct 92.5 · EXAONE Deep 49.3 · StableLM 62.8
Judgment: SmolLM3 91.6 · Phi-4 Mini 79.9 · AceInstruct 80.7 · EXAONE Deep 81.3 · StableLM 72.9

Performance breakdown across the five benchmark categories. Higher scores indicate better performance.

Consistency by Model

SmolLM3: 7.5
AceInstruct: 8.1
Phi-4 Mini: 9.2
EXAONE Deep: 20.0
StableLM: 22.1

Standard deviation of category scores. Lower is better. Green = highly consistent, yellow = moderate variance, red = erratic.
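Assuming the consistency figure is the sample standard deviation (n-1 denominator) of each model's five category scores, the reported values can be reproduced from the scored-results table; a sketch for two of the models:

```python
from statistics import stdev

# Category scores (Reasoning, Knowledge, Instruction, Code, Judgment)
# per model, taken from the scored-results table above.
category_scores = {
    "SmolLM3":     [93.1, 86.3, 76.6, 95.3, 91.6],
    "AceInstruct": [70.6, 85.7, 78.8, 92.5, 80.7],
}

# Consistency = sample standard deviation of the five category scores.
consistency = {m: round(stdev(s), 1) for m, s in category_scores.items()}
print(consistency)  # {'SmolLM3': 7.5, 'AceInstruct': 8.1}
```

The other models' figures may have been computed from unrounded category scores, so reproducing them from the rounded table can be off by a tenth.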

Number of Weak Responses

SmolLM3: 0
AceInstruct: 1
Phi-4 Mini: 1
EXAONE Deep: 4
StableLM: 7

Count of question-level scores below 50. Green = no failures, yellow = 1-2 failures, red = 3+ failures.
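The weak-response counts follow directly from the question-level heatmap at the end of the post; a minimal sketch for the two weakest models:

```python
# Question-level scores (R1..J2, in heatmap order) for the two weakest
# models, taken from the question-level heatmap below.
heatmap = {
    "EXAONE Deep": [90.0, 38.1, 88.2, 80.3, 27.6, 48.5, 13.3, 85.3, 87.4, 75.3],
    "StableLM":    [18.1, 11.9, 41.5, 44.7, 49.7, 49.1, 82.5, 43.1, 65.6, 80.3],
}

# A weak response is any question-level score below 50.
weak = {m: sum(1 for s in scores if s < 50) for m, scores in heatmap.items()}
print(weak)  # {'EXAONE Deep': 4, 'StableLM': 7}
```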

Findings

1. Best overall model

SmolLM3 was the best model in the test set. It led the benchmark overall and finished first in reasoning, knowledge, code, and judgment. Its main weakness was strict format compliance. It missed some tight instruction constraints, but unlike the weaker models it did not collapse on logic, code, or factual grounding. It is the clearest candidate for default on-device use where broad reliability matters more than one specialised strength.

2. Best controlled model

Phi-4 Mini was the cleanest model under explicit constraints. It scored first in instruction following and remained near the top in code. It was not the strongest factual model and it lost ground on the astronomy question, but it was disciplined, technically competent, and less erratic than EXAONE Deep or StableLM. For tasks where structure matters as much as content, Phi-4 Mini has a strong claim.

3. Strong but slightly brittle performer

AceInstruct finished close behind Phi-4 Mini. It performed very well in code and general knowledge and was generally clear. Its weaker points were exact compliance and combinatorial reasoning. It handled straightforward tasks well, but when the prompt demanded precise bounded output or a tighter search strategy, its performance dropped.

4. High upside, weak output control

EXAONE Deep is the most polarised model in the set. On substance alone it can look strong. Its knowledge and judgment scores were competitive, and some final answers were good. The problem was operational reliability. Several responses exposed internal reasoning text, one code answer never reached a usable solution, one instruction task failed completely, and one reasoning answer was cut off mid-stream. In a production setting that is a serious weakness. The issue is not intelligence alone; it is answer control.

5. Weakest benchmarked model

StableLM Zephyr was the weakest model overall. It failed the simple price-calculation task, failed the heavier-ball reasoning task, produced major factual errors in the astronomy answer, and missed strict formatting requirements badly. It was not unusable everywhere; some judgment and code responses were serviceable. But the benchmark does not support positioning it as a reliable default model inside a high-trust mobile product.

What the Benchmark Shows

The main result is that parameter count alone did not determine usefulness. The best-performing model here was not the largest, and the weakest model was not the smallest. The decisive factors were reasoning stability, factual restraint, and output control under mobile constraints.

The second result is that format discipline is a real separator in small on-device models. Phi-4 Mini won that category clearly. SmolLM3, despite winning overall, still lost points where the task required exact bullet lengths or strict word-count boundaries.

The third result is that output control matters as much as intelligence in a mobile app. EXAONE Deep produced some strong underlying content, but repeated chain leakage and truncation made it harder to trust operationally than the raw category scores alone would suggest.

Question-Level Performance Heatmap

Question | SmolLM3 | Phi-4 Mini | AceInstruct | EXAONE Deep | StableLM
R1 | 99.4 | 98.8 | 99.4 | 90.0 | 18.1
R2 | 86.9 | 46.9 | 41.9 | 38.1 | 11.9
K1 | 77.9 | 66.5 | 93.2 | 88.2 | 41.5
K2 | 94.7 | 83.8 | 78.2 | 80.3 | 44.7
I1 | 78.8 | 89.4 | 87.9 | 27.6 | 49.7
I2 | 74.4 | 91.5 | 69.7 | 48.5 | 49.1
C1 | 95.3 | 95.3 | 93.1 | 13.3 | 82.5
C2 | 95.3 | 91.4 | 91.9 | 85.3 | 43.1
J1 | 90.9 | 74.1 | 81.2 | 87.4 | 65.6
J2 | 92.4 | 85.6 | 80.3 | 75.3 | 80.3

R = Reasoning, K = Knowledge, I = Instruction, C = Code, J = Judgment. Green = excellent (≥90), Turquoise = good (≥75), Yellow = acceptable (≥50), Red = poor (<50).
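The colour bands above reduce to a simple threshold function; a sketch:

```python
def band(score: float) -> str:
    """Map a question-level score to the heatmap's colour band."""
    if score >= 90:
        return "green"      # excellent
    if score >= 75:
        return "turquoise"  # good
    if score >= 50:
        return "yellow"     # acceptable
    return "red"            # poor

# e.g. EXAONE Deep's C1 score of 13.3 lands in the red band
print(band(13.3))  # red
```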