Defying the gravitational pull of massive AI models

Run massive, trillion-parameter-scale AI models directly on modest hardware like a Mac Mini. An open-source inference engine powered by Ternary Quantization, Dynamic Sparsity, and mmap Layer Streaming.

The Gravity-Defying Architecture

🗜️

Extreme Quantization

State-of-the-art 1.58-bit ternary quantization collapses 16-bit weights down to just {-1, 0, +1}, yielding roughly 10x compression.
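A minimal NumPy sketch of absmean ternary quantization (the scheme popularized by BitNet b1.58). Function names and the per-tensor scale are illustrative; Graviton's actual kernels may differ:

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize FP weights to {-1, 0, +1} with a per-tensor scale.

    Absmean scheme (BitNet b1.58 style) -- an illustrative sketch,
    not Graviton's exact kernel.
    """
    scale = float(np.mean(np.abs(w))) + 1e-8   # per-tensor scaling factor
    w_q = np.clip(np.round(w / scale), -1, 1)  # snap to {-1, 0, +1}
    return w_q.astype(np.int8), scale

def dequantize(w_q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct an FP32 approximation for compute.
    return w_q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
w_q, scale = ternary_quantize(w)
```

Stored as 2-bit codes, the int8 ternary tensor is what drives the ~10x size reduction versus FP16.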

✂️

Dynamic Sparsity

Replaces dense computation with Top-K activation zeroing and Mixture-of-Experts (MoE) routing, dynamically skipping 70%+ of the compute per token.
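A NumPy sketch of Top-K activation zeroing, keeping 30% of entries to match the 70%+ pruning figure above. The function name and `keep_ratio` parameter are illustrative, not Graviton's API:

```python
import numpy as np

def topk_activations(x: np.ndarray, keep_ratio: float = 0.3) -> np.ndarray:
    """Zero all but the largest-magnitude activations.

    Downstream matmuls can then skip the zeroed entries entirely.
    """
    k = max(1, int(x.size * keep_ratio))
    # Threshold at the k-th largest |x|; everything below it is zeroed.
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    return np.where(np.abs(x) >= thresh, x, 0.0)

x = np.random.randn(1024).astype(np.float32)
y = topk_activations(x, keep_ratio=0.3)
sparsity = float(np.mean(y == 0))  # ~0.70 of entries are pruned
```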

💿

Layer Streaming

Bypasses physical RAM limits by memory-mapping (mmap) weights and streaming them asynchronously from NVMe SSD straight into the compute engine.
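The idea can be sketched with NumPy's memmap; the file name and on-disk layout here are stand-ins, not Graviton's real format:

```python
import os
import tempfile
import numpy as np

# Write a stand-in layer-weight file (name/layout are illustrative).
path = os.path.join(tempfile.mkdtemp(), "layer_00.bin")
np.random.randn(1024, 1024).astype(np.float16).tofile(path)

# Memory-map the file read-only: the OS pages weights in from the SSD
# on first touch and can evict them under memory pressure, so the
# model's total size may exceed physical RAM.
weights = np.memmap(path, dtype=np.float16, mode="r", shape=(1024, 1024))

# Compute touches only the rows it needs; those pages stream in lazily.
activation = np.asarray(weights[:64]) @ np.ones(1024, dtype=np.float16)
```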

🧠

Speculative Decoding

Accelerates generation 2-3x with a draft-and-verify scheme: a small draft model proposes several tokens that the large target model verifies in a single pass, sidestepping the memory-bandwidth wall of token-by-token autoregressive decoding.
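A toy sketch of the draft-and-verify loop, simplified to greedy decoding where acceptance reduces to exact token match. The callables stand in for real models; full speculative decoding uses a probabilistic accept/reject rule over sampling distributions:

```python
def speculative_step(draft, target, prefix, k=4):
    """One round of draft-and-verify decoding (greedy simplification).

    `draft` and `target` are callables mapping a token prefix to the
    next token -- stand-ins for real models. The cheap draft proposes
    k tokens; the target checks them and keeps the longest agreeing
    run, plus its own token at the first disagreement.
    """
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)          # cheap model speculates ahead
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target(ctx) == t:    # target agrees: token accepted for free
            accepted.append(t)
            ctx.append(t)
        else:                   # first mismatch: take the target's token
            accepted.append(target(ctx))
            break
    return accepted

# Toy deterministic "models": when draft == target, all k tokens land.
draft = target = lambda ctx: len(ctx) % 7
out = speculative_step(draft, target, prefix=[1, 2, 3], k=4)  # [3, 4, 5, 6]
```

In the real scheme the target scores all k drafted tokens in one batched forward pass, which is what recovers the 2-3x wall-clock speedup.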

Actual Benchmarks

TinyLlama-1.1B Memory Footprint

  • 🔴 Baseline (FP16): 2.05 GB
  • 🟢 Graviton INT4: 0.24 GB (8.4x smaller)
  • 🟣 Graviton Ternary (1.58-bit): 0.24 GB (8.4x smaller)
* Tested natively on Apple Silicon using Graviton's custom Metal & C++ tensor unpacking.
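The unpacking step can be sketched in NumPy: ternary values are stored four per byte at 2 bits each. The 0b00/0b01/0b10 encoding here is an assumption for illustration; Graviton's Metal/C++ kernels may use a different layout:

```python
import numpy as np

def pack_ternary(w_q: np.ndarray) -> np.ndarray:
    """Pack int8 values in {-1, 0, +1} four-per-byte (2 bits each).

    Assumed encoding: 0b00 = 0, 0b01 = +1, 0b10 = -1.
    """
    codes = np.select([w_q == 1, w_q == -1], [1, 2], default=0).astype(np.uint8)
    codes = codes.ravel()
    codes = np.pad(codes, (0, (-len(codes)) % 4))  # pad to a multiple of 4
    b = codes.reshape(-1, 4)
    return (b[:, 0] | (b[:, 1] << 2) | (b[:, 2] << 4) | (b[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = ((packed[:, None] >> shifts) & 0b11).ravel()[:n]
    lut = np.array([0, 1, -1, 0], dtype=np.int8)  # decode table
    return lut[codes]

w_q = np.random.choice(np.array([-1, 0, 1], dtype=np.int8), size=4096)
packed = pack_ternary(w_q)  # 4096 weights -> 1024 bytes
```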

Extreme Stress Test (140B Scale)

  • 💻 Hardware: Apple M1 Max (64GB)
  • 🔴 Original FP16 Model: ~280 GB (OOM Crash)
  • 🟢 Graviton Ternary Model: ~35.0 GB (Fits in RAM)
  • Quantization Speed: 0.98 GB/s
* Synthetic tensor test verifying that a 140-billion-parameter weight set fits within Apple's unified memory via pure 1.58-bit packing.
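The 140B figures follow directly from bits-per-weight arithmetic: 1.58 bits is log2(3), the information content of a trit, while practical packing stores each weight in 2 bits:

```python
params = 140e9  # 140 billion parameters

fp16_bytes = params * 2         # 16 bits = 2 bytes per weight
ternary_bytes = params * 2 / 8  # packed at 2 bits per weight

print(f"FP16:    {fp16_bytes / 1e9:.0f} GB")    # 280 GB -- exceeds any Mac's unified memory
print(f"Ternary: {ternary_bytes / 1e9:.0f} GB") # 35 GB -- fits comfortably in 64 GB
```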

Initialize in Seconds

bash

# 1. Clone the Graviton core
git clone https://github.com/opengraviton/graviton.git
cd graviton

# 2. View your hardware capabilities and theoretical model bounds
python3 -m graviton.cli.main info

# 3. Enter orbit and start generating
python3 -m graviton.cli.main run "mixtral-8x22b" -p "Explain quantum gravity"