Defying the gravitational pull of massive AI models
Run massive, trillion-parameter-scale AI models directly on modest hardware like a Mac Mini. The ultimate open-source inference engine, powered by Ternary Quantization, Dynamic Sparsity, and MMap Layer Streaming.
The Gravity-Defying Architecture
Extreme Quantization
Features state-of-the-art 1.58-bit ternary quantization, collapsing 16-bit weights down to just {-1, 0, +1} for roughly 10x compression.
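The core idea can be sketched in a few lines. This is a minimal illustration of absmean ternary rounding (the scheme popularized by BitNet b1.58), not Graviton's actual kernel: scale each tensor by its mean absolute value, then snap every weight to -1, 0, or +1.

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize a float weight tensor to {-1, 0, +1} plus one scale.

    Absmean scheme: divide by the mean absolute weight, round to the
    nearest integer, and clip into the ternary range.
    """
    scale = np.abs(w).mean() + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from the ternary codes."""
    return q.astype(np.float32) * scale

w = np.array([[0.9, -0.05, -1.2],
              [0.3,  0.0,  -0.4]], dtype=np.float32)
q, s = ternary_quantize(w)
# q holds only -1, 0, +1; each value needs ~1.58 bits (log2 of 3 states)
```

Each ternary weight carries log2(3) ≈ 1.58 bits of information, which is where the 1.58-bit figure and the ~10x ratio versus 16-bit weights come from.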
Dynamic Sparsity
Replaces dense computation with Top-K activation zeroing and Mixture-of-Experts (MoE) routing, dynamically skipping 70%+ of the compute per token.
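Top-K zeroing is simple to picture: keep only the largest-magnitude activations before the next matmul and zero the rest. A minimal sketch (illustrative, not Graviton's implementation), where keep_frac=0.3 prunes roughly 70% of the vector:

```python
import numpy as np

def topk_sparsify(x, keep_frac=0.3):
    """Zero all but the largest-magnitude keep_frac of activations."""
    k = max(1, int(x.size * keep_frac))
    # indices of the k entries with the largest absolute value
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

x = np.array([0.1, -2.0, 0.05, 3.0, -0.2,
              0.5, 1.5, -0.01, 0.3, -1.0])
sparse = topk_sparsify(x, keep_frac=0.3)
# only the 3 largest-magnitude entries (3.0, -2.0, 1.5) survive
```

Because multiplying by zero contributes nothing, a sparse-aware kernel can skip those rows of the following weight matrix entirely; MoE routing applies the same idea at the granularity of whole expert sub-networks.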
Layer Streaming
Limits, shattered. Bypasses physical RAM constraints by memory-mapping (mmap) weights from NVMe SSD and streaming layers asynchronously into the computation engine.
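The mechanism behind layer streaming fits in a short sketch. The checkpoint layout and names below are illustrative, not Graviton's actual on-disk format: a flat file of per-layer float32 matrices is mapped one layer at a time, so the OS pages in only the bytes actually touched.

```python
import os
import tempfile
import numpy as np

HIDDEN = 64          # toy hidden size (real models are far larger)
N_LAYERS = 2
BYTES_PER_LAYER = HIDDEN * HIDDEN * 4  # float32 = 4 bytes per weight

# Write a fake two-layer "checkpoint" to disk for the demo.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
rng = np.random.default_rng(0)
rng.standard_normal((N_LAYERS, HIDDEN, HIDDEN)).astype(np.float32).tofile(path)

def load_layer(idx):
    """Memory-map a single layer's weight matrix from the checkpoint.

    np.memmap does not read the file eagerly; pages are faulted in on
    access, keeping peak RAM near one layer's size, not the whole model's.
    """
    return np.memmap(path, dtype=np.float32, mode="r",
                     offset=idx * BYTES_PER_LAYER,
                     shape=(HIDDEN, HIDDEN))

w1 = load_layer(1)   # maps layer 1 without loading layer 0
```

A real engine would additionally prefetch the next layer asynchronously while the current one computes, hiding most of the SSD latency.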
Speculative Decoding
Accelerates generation 2-3x by having a small draft model propose tokens that the large target model then verifies, sidestepping the memory-bandwidth wall of autoregressive decoding.
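The draft/verify loop can be shown with a toy example. This is a sketch of the control flow only, not Graviton's decoder: real systems accept draft tokens probabilistically and verify all k proposals in a single batched target forward pass, whereas both "models" here are deterministic integer toys.

```python
def target_model(prefix):
    # hypothetical expensive model: next token is last + 1, skipping
    # multiples of 5 (a quirk the draft model does not know about)
    nxt = prefix[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def draft_model(prefix, k):
    # hypothetical cheap model: plain counting from the last token
    return [prefix[-1] + i for i in range(1, k + 1)]

def speculative_step(prefix, k=4):
    """Propose k draft tokens; keep the prefix the target agrees with,
    plus one corrected token at the first disagreement."""
    accepted = []
    for tok in draft_model(prefix, k):
        expected = target_model(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft guessed right: free token
        else:
            accepted.append(expected)  # correct the draft and stop
            break
    return accepted

out = speculative_step([2], k=4)  # drafts [3, 4, 5, 6]; target fixes 5 -> 6
```

When the draft agrees often, several tokens are emitted per verification round instead of one per target pass, which is where the 2-3x speedup comes from.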
Actual Benchmarks
TinyLlama-1.1B Memory Footprint
- 🔴 Baseline (FP16): 2.05 GB
- 🟢 Graviton INT4: 0.24 GB (8.4x smaller)
- 🟣 Graviton Ternary (1.58-bit): 0.24 GB (8.4x smaller)
Extreme Stress Test (140B Scale)
- 💻 Hardware: Apple M1 Max (64GB)
- 🔴 Original FP16 Model: ~280 GB (OOM Crash)
- 🟢 Graviton Ternary Model: ~35.0 GB (Fits in RAM)
- ⚡ Quantization Speed: 0.98 GB/s
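The stress-test footprints follow from simple arithmetic: bytes ≈ parameters × bits-per-weight / 8. A quick sanity check, assuming a 140B-parameter model and a 2-bit packed encoding for the ternary weights (1.58 bits is the information content; 2 bits is a practical storage bound):

```python
params = 140e9  # assumed parameter count for the 140B-scale model

fp16_gb = params * 16 / 8 / 1e9     # 2 bytes per weight
ternary_gb = params * 2 / 8 / 1e9   # ternary packed at ~2 bits per weight
```

This reproduces the ~280 GB FP16 footprint (which overflows 64 GB of RAM) and the ~35 GB ternary footprint (which fits), up to small overheads such as embeddings and per-tensor scales.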
Initialize in Seconds
# 1. Clone the Graviton core
git clone https://github.com/opengraviton/graviton.git
cd graviton
# 2. View your hardware capabilities and theoretical model bounds
python3 -m graviton.cli.main info
# 3. Enter orbit and start generating
python3 -m graviton.cli.main run "mixtral-8x22b" -p "Explain quantum gravity"