Defying the gravitational pull of massive AI models
Run massive, trillion-parameter-scale AI models directly on modest hardware like a Mac Mini. The ultimate open-source inference engine, powered by Ternary Quantization, Dynamic Sparsity, and MMap Layer Streaming.
The Gravity-Defying Architecture
Extreme Quantization
Features state-of-the-art 1.58-bit ternary quantization, collapsing 16-bit weights down to just {-1, 0, +1} for roughly 10x compression.
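The core idea can be sketched in a few lines. This is a minimal illustration of absmean ternary rounding (the scheme popularized by BitNet b1.58), not Graviton's actual kernel: scale each tensor by its mean absolute value, then snap every weight to -1, 0, or +1.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize a float weight tensor to {-1, 0, +1} plus one scale.

    Absmean scheme: divide by the mean absolute weight, round to the
    nearest integer, and clip into the ternary range.
    """
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the ternary codes."""
    return q.astype(np.float32) * scale

w = np.array([[0.9, -0.05, -1.2],
              [0.3,  0.0,  -0.4]], dtype=np.float32)
q, s = ternary_quantize(w)
# q holds only -1, 0, +1; each value needs ~1.58 bits (log2 of 3 states)
```

Each ternary weight carries log2(3) ≈ 1.58 bits of information, which is where the 1.58-bit figure and the ~10x ratio versus 16-bit weights come from.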
Dynamic Sparsity
Replaces dense computation with Top-K activation zeroing and Mixture-of-Experts (MoE) routing, dynamically skipping 70%+ of the compute per token.
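Top-K zeroing is simple to picture: keep only the largest-magnitude activations before the next matmul and zero the rest. A minimal sketch (illustrative, not Graviton's implementation), where keep_frac=0.3 prunes roughly 70% of the vector:

```python
import numpy as np

def topk_sparsify(x, keep_frac=0.3):
    """Zero all but the largest-magnitude keep_frac of activations."""
    k = max(1, int(x.size * keep_frac))
    # indices of the k entries with the largest absolute value
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

x = np.array([0.1, -2.0, 0.05, 3.0, -0.2,
              0.5, 1.5, -0.01, 0.3, -1.0])
sparse = topk_sparsify(x, keep_frac=0.3)
# only the 3 largest-magnitude entries (3.0, -2.0, 1.5) survive
```

Because multiplying by zero contributes nothing, a sparse-aware kernel can skip those rows of the following weight matrix entirely; MoE routing applies the same idea at the granularity of whole expert sub-networks.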
Layer Streaming
Limits, shattered. Bypasses physical RAM constraints by memory-mapping (mmap) weights from NVMe SSD and streaming layers asynchronously into the computation engine.
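The mechanism behind layer streaming fits in a short sketch. The checkpoint layout and names below are illustrative, not Graviton's actual on-disk format: a flat file of per-layer float32 matrices is mapped one layer at a time, so the OS pages in only the bytes actually touched.

```python
import os
import tempfile
import numpy as np

HIDDEN = 64          # toy hidden size (real models are far larger)
N_LAYERS = 2
BYTES_PER_LAYER = HIDDEN * HIDDEN * 4  # float32 = 4 bytes per weight

# Write a fake two-layer "checkpoint" to disk for the demo.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
rng = np.random.default_rng(0)
rng.standard_normal((N_LAYERS, HIDDEN, HIDDEN)).astype(np.float32).tofile(path)

def load_layer(idx):
    """Memory-map a single layer's weight matrix from the checkpoint.

    np.memmap does not read the file eagerly; pages are faulted in on
    access, keeping peak RAM near one layer's size, not the whole model's.
    """
    return np.memmap(path, dtype=np.float32, mode="r",
                     offset=idx * BYTES_PER_LAYER,
                     shape=(HIDDEN, HIDDEN))

w1 = load_layer(1)   # maps layer 1 without loading layer 0
```

A real engine would additionally prefetch the next layer asynchronously while the current one computes, hiding most of the SSD latency.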
Speculative Decoding
Accelerates generation 2-3x by having a small draft model propose tokens that the large target model then verifies, sidestepping the memory-bandwidth wall of autoregressive decoding.
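The draft/verify loop can be shown with a toy example. This is a sketch of the control flow only, not Graviton's decoder: real systems accept draft tokens probabilistically and verify all k proposals in a single batched target forward pass, whereas both "models" here are deterministic integer toys.

```python
def target_model(prefix):
    # hypothetical expensive model: next token is last + 1, skipping
    # multiples of 5 (a quirk the draft model does not know about)
    nxt = prefix[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def draft_model(prefix, k):
    # hypothetical cheap model: plain counting from the last token
    return [prefix[-1] + i for i in range(1, k + 1)]

def speculative_step(prefix, k=4):
    """Propose k draft tokens; keep the prefix the target agrees with,
    plus one corrected token at the first disagreement."""
    accepted = []
    for tok in draft_model(prefix, k):
        expected = target_model(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft guessed right: free token
        else:
            accepted.append(expected)  # correct the draft and stop
            break
    return accepted

out = speculative_step([2], k=4)  # drafts [3, 4, 5, 6]; target fixes 5 -> 6
```

When the draft agrees often, several tokens are emitted per verification round instead of one per target pass, which is where the 2-3x speedup comes from.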
Actual Benchmarks
TinyLlama-1.1B Memory Footprint
- 🔴 Baseline (FP16): 2.05 GB
- 🟢 Graviton INT4: 0.24 GB (8.4x smaller)
- 🟣 Graviton Ternary (1.58-bit): 0.24 GB (8.4x smaller)
Extreme Stress Test (140B Scale)
- 💻 Hardware: Apple M1 Max (64GB)
- 🔴 Original FP16 Model: ~280 GB (OOM Crash)
- 🟢 Graviton Ternary Model: ~35.0 GB (Fits in RAM)
- ⚡ Quantization Speed: 0.98 GB/s
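The stress-test footprints follow from simple arithmetic: bytes ≈ parameters × bits-per-weight / 8. A quick sanity check, assuming a 140B-parameter model and a 2-bit packed encoding for the ternary weights (1.58 bits is the information content; 2 bits is a practical storage bound):

```python
params = 140e9  # assumed parameter count for the 140B-scale model

fp16_gb = params * 16 / 8 / 1e9     # 2 bytes per weight
ternary_gb = params * 2 / 8 / 1e9   # ternary packed at ~2 bits per weight
```

This reproduces the ~280 GB FP16 footprint (which overflows 64 GB of RAM) and the ~35 GB ternary footprint (which fits), up to small overheads such as embeddings and per-tensor scales.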
Initialize in Seconds
# 1. Clone the Graviton core
git clone https://github.com/opengraviton/graviton.git
cd graviton
# 2. View your hardware capabilities and theoretical model bounds
python3 -m graviton.cli.main info
# 3. Enter orbit and start generating
python3 -m graviton.cli.main run "mixtral-8x22b" -p "Explain quantum gravity"