On-Device AI Models

LiberaGPT uses compact, optimized AI models that run entirely offline on iPhone. Download only the models you need. Each model has its own strengths and is optimized for the Neural Engine with quantization that balances quality and speed.

Choose the models you need, download them, and use them offline.


Available Models

Five quantized models (Q4_K family) optimized for on-device inference via llama.cpp. Download the ones you need from Settings. Optimized for the A19 Pro chip.

SmolLM3 3B Instruct

3B params · 128K context · Q4_K_M · 1.9 GB · ~25 tok/sec

A compact but capable model from HuggingFace with a very large context window. Suited to long conversations, long-document analysis, and detailed reasoning. Balanced performance for most tasks.

Use cases: long context, extended conversations, document analysis
License: Apache 2.0
HuggingFace model → © 2025 HuggingFace

Phi-4 Mini 3.8B Instruct

3.8B params · 128K context · Q4_K_M · 2.5 GB · ~18 tok/sec

Microsoft's latest Phi generation. Strong reasoning, instruction following, and language understanding. The best quality for its size. Recommended for Deep mode.

Use cases: deep reasoning, complex instructions, high-quality responses
License: MIT
HuggingFace model → © 2025 Microsoft

StableLM Zephyr 1.6B

1.6B params · 4K context · Q4_K_S · 989 MB · ~28 tok/sec

Stability AI's lightweight instruction model. An excellent speed-to-quality ratio for everyday tasks. Minimal memory footprint, ideal for older devices or battery-sensitive workflows.

Use cases: fast mode, battery efficiency, older devices
License: Apache 2.0
HuggingFace model → © 2024 Stability AI

EXAONE Deep 2.4B

2.4B params · 32K context · Q4_K_M · 1.6 GB · ~22 tok/sec

A bilingual (English/Korean) model from LG AI Research. Strong multilingual capability and reasoning. Extended context support, optimized for technical and analytical work.

Use cases: multilingual (Korean), technical tasks, analytical reasoning
License: Apache 2.0
HuggingFace model → © 2024 LG AI Research

AceInstruct 1.5B

1.5B params · 128K context · Q4_K_M · 1.2 GB · ~26 tok/sec

An instruction-tuned variant with strong compliance. Reliable for structured output, task execution, and complex multi-step instructions. A good balance of size and capability.

Use cases: instruction following, structured tasks, multi-step workflows
License: Apache 2.0
HuggingFace model → © 2024 NVIDIA

Model Benchmarks

This benchmark evaluates the five small language models available in LiberaGPT under actual mobile inference conditions rather than cloud-hosted assumptions. The test context matters. LiberaGPT presents itself as a fully on-device iPhone system using 4-bit quantised GGUF models, with no cloud execution, and states that it is optimised for A18 and A19 Pro class devices with Neural Engine and Metal GPU acceleration. On the Apple side, iPhone 17 Pro is built on the A19 Pro, not the base A19, and Apple lists a 16-core Neural Engine together with Neural Accelerators integrated into each GPU core. The correct framing is therefore constrained local inference on iPhone 17 Pro hardware, not generic model performance in the abstract.

Method

Each model was tested against the same ten prompts across five categories: reasoning, factual knowledge, instruction following, code, and judgment under ambiguity. Each category contained two prompts. Every response was scored on eight criteria: factual correctness, reasoning validity, completeness, instruction compliance, clarity, precision, hallucination resistance, and efficiency. Category-specific weighting was then applied so that reasoning tasks rewarded logic, factual tasks rewarded accuracy and fabrication control, instruction tasks rewarded constraint obedience, code tasks rewarded technical correctness, and judgment tasks rewarded sound prioritisation.

Scores were normalised to 100. Category scores are the mean of the two prompts in that category. Overall score is the mean across all five category scores. Hard failures materially reduced scores where the response was truncated, contradicted itself, failed explicit formatting constraints, or produced content that was plainly wrong.
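The scoring pipeline described above can be sketched in code. The exact criterion weights are not published, so the emphasis values below are illustrative assumptions only, not the benchmark's actual parameters:

```python
# Sketch of the per-response scoring scheme: eight criteria, category-specific
# weighting, scores normalised to 100. Weights here are hypothetical.

CRITERIA = [
    "factual_correctness", "reasoning_validity", "completeness",
    "instruction_compliance", "clarity", "precision",
    "hallucination_resistance", "efficiency",
]

def category_weights(emphasis: dict[str, float]) -> dict[str, float]:
    """Up-weight the criteria a category rewards, then renormalise to sum to 1."""
    raw = {c: emphasis.get(c, 1.0) for c in CRITERIA}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}

def score_response(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion ratings in [0, 1], normalised to 100."""
    return 100 * sum(weights[c] * ratings[c] for c in CRITERIA)

# Example: a reasoning prompt up-weights reasoning validity.
w = category_weights({"reasoning_validity": 3.0})
ratings = {c: 0.8 for c in CRITERIA}  # a rater gives 0.8 on every criterion
print(round(score_response(ratings, w), 1))  # uniform ratings stay at 80.0
```

Hard-failure penalties (truncation, self-contradiction, format violations) would be applied on top of this base score.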

Top Line Result

SmolLM3 was the strongest model in the benchmark and the only system that performed at a high level across all major categories without a major collapse. Phi-4 Mini ranked second and was the most controlled model on strict instruction-following tasks. AceInstruct ranked third and remained strong in code and general knowledge, but it was less reliable under tighter constraint pressure and weaker on the second reasoning problem.

EXAONE Deep showed real upside in knowledge and judgment, but repeated output-control failures materially reduced its operational score. StableLM Zephyr was the weakest model in the set, with major failures in basic reasoning and factual reliability.

Overall Weighted Score

SmolLM3 (3.0B)          88.6
Phi-4 Mini (3.8B)       82.3
AceInstruct (1.5B)      81.7
EXAONE Deep (2.4B)      63.4
StableLM Zephyr (1.6B)  48.6

Ranked by overall weighted performance across all five benchmark categories.

Scored Results

Model                   Overall  Grade            Reasoning  Knowledge  Instruction  Code  Judgment
SmolLM3 (3.0B)          88.6     Best in cohort   93.1       86.3       76.6         95.3  91.6
Phi-4 Mini (3.8B)       82.3     High-performing  72.8       75.1       90.4         93.3  79.9
AceInstruct (1.5B)      81.7     High-performing  70.6       85.7       78.8         92.5  80.7
EXAONE Deep (2.4B)      63.4     Limited utility  64.1       84.3       38.1         49.3  81.3
StableLM Zephyr (1.6B)  48.6     Weak             15.0       43.1       49.4         62.8  72.9

Category Comparison

[Bar chart: per-model scores in Reasoning, Knowledge, Instruction, Code, and Judgment; values match the Scored Results table above.]

Performance breakdown across five benchmark categories. Higher bars indicate better performance.

Consistency by Model

SmolLM3       7.5
AceInstruct   8.1
Phi-4 Mini    9.2
EXAONE Deep  20.0
StableLM     22.1

Standard deviation of category scores. Lower is better. Green = highly consistent, yellow = moderate variance, red = erratic.
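The consistency figures appear to be the sample standard deviation (n − 1 denominator) of the five category scores; the sketch below reproduces them to within ±0.1, with small drift attributable to rounding in the published category values:

```python
from statistics import stdev  # sample standard deviation (n - 1 denominator)

# Category scores from the Scored Results table, in the order
# [reasoning, knowledge, instruction, code, judgment].
category_scores = {
    "SmolLM3":     [93.1, 86.3, 76.6, 95.3, 91.6],
    "Phi-4 Mini":  [72.8, 75.1, 90.4, 93.3, 79.9],
    "AceInstruct": [70.6, 85.7, 78.8, 92.5, 80.7],
    "EXAONE Deep": [64.1, 84.3, 38.1, 49.3, 81.3],
    "StableLM":    [15.0, 43.1, 49.4, 62.8, 72.9],
}

for model, scores in category_scores.items():
    print(f"{model:12s} {stdev(scores):5.1f}")
```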

Number of Weak Responses

SmolLM3       0
AceInstruct   1
Phi-4 Mini    1
EXAONE Deep   4
StableLM      7

Count of question-level scores below 50. Green = no failures, yellow = 1-2 failures, red = 3+ failures.

Findings

1. Best overall model

SmolLM3 was the best model in the test set. It led the benchmark overall and finished first in reasoning, knowledge, code, and judgment. Its main weakness was strict format compliance. It missed some tight instruction constraints, but unlike the weaker models it did not collapse on logic, code, or factual grounding. It is the clearest candidate for default on-device use where broad reliability matters more than one specialised strength.

2. Best controlled model

Phi-4 Mini was the cleanest model under explicit constraints. It scored first in instruction following and remained near the top in code. It was not the strongest factual model and it lost ground on the astronomy question, but it was disciplined, technically competent, and less erratic than EXAONE Deep or StableLM. For tasks where structure matters as much as content, Phi-4 Mini has a strong claim.

3. Strong but slightly brittle performer

AceInstruct finished close behind Phi-4 Mini. It performed very well in code and general knowledge and was generally clear. Its weaker points were exact compliance and combinatorial reasoning. It handled straightforward tasks well, but when the prompt demanded precise bounded output or a tighter search strategy, its performance dropped.

4. High upside, weak output control

EXAONE Deep is the most polarised model in the set. On substance alone it can look strong. Its knowledge and judgment scores were competitive, and some final answers were good. The problem was operational reliability. Several responses exposed internal reasoning text, one code answer never reached a usable solution, one instruction task failed completely, and one reasoning answer was cut off mid-stream. In a production setting that is a serious weakness. The issue is not intelligence alone; it is answer control.

5. Weakest benchmarked model

StableLM Zephyr was the weakest model overall. It failed the simple price-calculation task, failed the heavier-ball reasoning task, produced major factual errors in the astronomy answer, and missed strict formatting requirements badly. It was not unusable everywhere; some judgment and code responses were serviceable. But the benchmark does not support positioning it as a reliable default model inside a high-trust mobile product.

What the Benchmark Shows

The main result is that parameter count alone did not determine usefulness. The best-performing model here was not the largest, and the weakest model was not the smallest. The decisive factors were reasoning stability, factual restraint, and output control under mobile constraints.

The second result is that format discipline is a real separator in small on-device models. Phi-4 Mini won that category clearly. SmolLM3, despite winning overall, still lost points where the task required exact bullet lengths or strict word-count boundaries.

The third result is that output control matters as much as intelligence in a mobile app. EXAONE Deep produced some strong underlying content, but repeated chain leakage and truncation made it harder to trust operationally than the raw category scores alone would suggest.

Question-Level Performance Heatmap

Question  SmolLM3  Phi-4 Mini  AceInstruct  EXAONE Deep  StableLM
R1        99.4     98.8        99.4         90.0         18.1
R2        86.9     46.9        41.9         38.1         11.9
K1        77.9     66.5        93.2         88.2         41.5
K2        94.7     83.8        78.2         80.3         44.7
I1        78.8     89.4        87.9         27.6         49.7
I2        74.4     91.5        69.7         48.5         49.1
C1        95.3     95.3        93.1         13.3         82.5
C2        95.3     91.4        91.9         85.3         43.1
J1        90.9     74.1        81.2         87.4         65.6
J2        92.4     85.6        80.3         75.3         80.3

R = Reasoning, K = Knowledge, I = Instruction, C = Code, J = Judgment. Green = excellent (≥90), Turquoise = good (≥75), Yellow = acceptable (≥50), Red = poor (<50).
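As a cross-check, the heatmap is internally consistent with the earlier tables: each category score is the mean of its two prompts, the overall score is the mean across categories, and the weak-response counts are the cells below 50. A minimal sketch (the dictionary layout is my own; expect ±0.1 rounding drift against the published category scores):

```python
# Question-level scores from the heatmap, per model, in the order
# R1, R2, K1, K2, I1, I2, C1, C2, J1, J2.
heatmap = {
    "SmolLM3":     [99.4, 86.9, 77.9, 94.7, 78.8, 74.4, 95.3, 95.3, 90.9, 92.4],
    "Phi-4 Mini":  [98.8, 46.9, 66.5, 83.8, 89.4, 91.5, 95.3, 91.4, 74.1, 85.6],
    "AceInstruct": [99.4, 41.9, 93.2, 78.2, 87.9, 69.7, 93.1, 91.9, 81.2, 80.3],
    "EXAONE Deep": [90.0, 38.1, 88.2, 80.3, 27.6, 48.5, 13.3, 85.3, 87.4, 75.3],
    "StableLM":    [18.1, 11.9, 41.5, 44.7, 49.7, 49.1, 82.5, 43.1, 65.6, 80.3],
}

def category_means(scores):
    """Mean of each adjacent prompt pair: reasoning, knowledge, instruction, code, judgment."""
    return [(scores[i] + scores[i + 1]) / 2 for i in range(0, 10, 2)]

def weak_responses(scores):
    """Count of question-level scores below 50."""
    return sum(s < 50 for s in scores)

for model, scores in heatmap.items():
    cats = [round(c, 1) for c in category_means(scores)]
    overall = sum(scores) / len(scores)  # equals the mean of category means
    print(f"{model:12s} {cats} overall={overall:.1f} weak={weak_responses(scores)}")
```

Running this reproduces the weak-response counts (0, 1, 1, 4, 7) and overall scores from the tables above.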