Graviton-Native: Efficient LLM Architectures for 32GB RAM
1. Introduction
The scaling of large language models has created a memory crisis: a 70B-parameter model needs about 140 GB for its weights alone in FP16, which effectively limits access to cloud providers and well-funded institutions. Post-training quantization (INT8, INT4) reduces memory by 2–4× relative to FP16 but leaves the architecture unchanged, because the model was still trained for full-precision weights. We argue that architectural change, meaning models trained natively with efficient representations, is necessary to democratize AI.
Graviton-Native introduces two complementary approaches: BitNet b1.58 (ternary weights) and Mixture of Experts (MoE). Both are designed for training from scratch, and the resulting checkpoints load directly in the Graviton inference engine.
2. BitNet b1.58 — Native Ternary Weights
We adopt the BitNet b1.58 formulation [1]: weights are constrained to {-1, 0, +1} during training and inference. This yields:
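- ~10× memory reduction vs FP16 (log₂ 3 ≈ 1.58 bits vs 16 bits per parameter, assuming near-ideal packing)
- Add/subtract-only matmul: no floating-point multiply in the core operation, apart from applying the per-group scale
- Energy efficiency: additions and subtractions cost less energy than floating-point multiply-accumulates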
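To make the add/subtract-only claim concrete, the following is a minimal, dependency-free sketch of a matrix-vector product with ternary weights. The function name `ternary_matvec` and the single `scale` argument are illustrative assumptions, not the BitLinear implementation; a real kernel would operate on packed tensors.

```python
def ternary_matvec(weights, x, scale=1.0):
    """Compute scale * (W @ x) for W with entries in {-1, 0, +1}.

    The inner loop only adds or subtracts activations. The one multiply
    by `scale` happens once per output row, not once per weight.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing and is skipped entirely
        out.append(scale * acc)
    return out


# Toy example (hypothetical values): a 2x3 ternary matrix and 3 activations.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
y = ternary_matvec(W, x, scale=0.25)
```

Zero weights also skip work altogether, so sparsity in the ternary matrix reduces the number of additions.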
Quantization uses absmean thresholding: threshold = α × mean(|W|). Weights whose magnitude exceeds the threshold are mapped to sign(W); all others are mapped to 0. A per-group scale factor preserves the magnitude lost by ternarization.
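The thresholding rule can be sketched in plain Python as follows. Several details here are assumptions made for illustration, since the text does not fix them: the value `alpha=0.7`, the group size, computing the threshold per group rather than globally, and choosing each group's scale as the mean magnitude of the weights that survive thresholding.

```python
def absmean_ternarize(weights, alpha=0.7, group_size=4):
    """Ternarize a flat list of floats, one group at a time.

    Returns (ternary, scales). `ternary` holds values in {-1, 0, +1};
    `scales` holds one float per group. alpha, group_size and the scale
    rule are illustrative assumptions, not values from the paper.
    """
    ternary, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        mean_abs = sum(abs(w) for w in group) / len(group)
        threshold = alpha * mean_abs
        # Above threshold -> sign(w); at or below threshold -> 0.
        q = [(1 if w > 0 else -1) if abs(w) > threshold else 0 for w in group]
        kept = [abs(w) for w, t in zip(group, q) if t != 0]
        scales.append(sum(kept) / len(kept) if kept else 0.0)
        ternary.extend(q)
    return ternary, scales


# Hypothetical weights: two groups of four.
weights = [0.9, -0.1, 0.05, -1.1, 0.2, 0.2, -0.2, 0.0]
ternary, scales = absmean_ternarize(weights)
```

An approximate weight is recovered as `scale * t` for each ternary value `t` in the corresponding group.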
| Model | FP16 | Ternary (1.58-bit) | Reduction |
|---|---|---|---|
| 350M preset (336M params) | 672 MB | ~66 MB | ~10× |
| 2B params | 4 GB | ~400 MB | ~10× |
| 70B params | 140 GB | ~14 GB | ~10× |
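The table entries follow from bits-per-parameter arithmetic, which the sketch below reproduces. It counts weight storage only (no activations or KV cache) and uses 336M for the smallest model, the parameter count reported for the 350M preset in Section 5. The 2-bit column is an added assumption showing what a simple four-weights-per-byte packing would cost compared with the ideal 1.58 bits.

```python
import math


def weight_megabytes(num_params, bits_per_param):
    """Weight storage in MB (10**6 bytes), ignoring activations and KV cache."""
    return num_params * bits_per_param / 8 / 1e6


# Information content of one ternary value, about 1.585 bits.
TERNARY_BITS = math.log2(3)

for name, n in [("336M", 336e6), ("2B", 2e9), ("70B", 70e9)]:
    fp16 = weight_megabytes(n, 16)
    ideal = weight_megabytes(n, TERNARY_BITS)
    packed = weight_megabytes(n, 2)  # assumption: naive 2-bit packing
    print(f"{name}: FP16 {fp16:,.0f} MB, ternary {ideal:,.0f} MB, "
          f"2-bit packed {packed:,.0f} MB")
```

With naive 2-bit packing the reduction relative to FP16 is 8× rather than ~10×; reaching the ideal figure requires a denser encoding.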
3. Mixture of Experts (MoE)
MoE architectures [2] allow total parameter counts far exceeding available memory by activating only a subset of experts per token. We use top-k routing: each token is routed to the k experts with the highest router logits. With k=2 and 8 experts, only 25% of the expert parameters are active for any given token; shared parameters such as attention and embeddings are always active.
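Top-k routing as described above can be sketched without any framework. The function names, the toy experts, and the choice to normalise gate weights with a softmax over only the selected logits are assumptions of this sketch; the text specifies only that the k highest-logit experts are chosen.

```python
import math


def top_k_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k largest logits.

    Gate weights are a softmax over the selected logits only, so they
    sum to 1. That normalisation is an assumption of this sketch.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, router_logits, experts, k=2):
    """Gate-weighted sum of the k selected experts; the others never run."""
    out = [0.0] * len(x)
    for idx, gate in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


# Eight toy "experts" (hypothetical): expert i scales its input by i.
experts = [lambda v, s=i: [s * vi for vi in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]
routes = top_k_route(logits, k=2)
y = moe_forward([1.0, -1.0], logits, experts, k=2)
active_fraction = 2 / len(experts)  # k / num_experts
```

Only the selected experts are ever called, which is what lets the unselected experts' weights stay out of memory.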
For 500B total parameters with 10B active per token:
- Total model: 500B × 4 bit ≈ 250 GB (stored on disk, streamed)
- Active per token: 10B × 4 bit ≈ 5 GB (fits in 32GB RAM with overhead)
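This makes 500B-parameter models feasible in principle on consumer hardware when combined with Graviton's streaming loader, which keeps inactive experts on disk and loads only the experts a token is routed to.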
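The figures above are straightforward to check. The sketch below reproduces them and computes the RAM left over for activations, KV cache, and the operating system; the helper name and the headroom calculation are illustrative additions.

```python
def gigabytes(num_params, bits_per_param):
    """Weight storage in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


total_gb = gigabytes(500e9, 4)    # full model, kept on disk
active_gb = gigabytes(10e9, 4)    # parameters touched by one token
ram_gb = 32
headroom_gb = ram_gb - active_gb  # left for activations, KV cache, OS

print(f"total {total_gb:.0f} GB, active {active_gb:.0f} GB, "
      f"headroom {headroom_gb:.0f} GB of {ram_gb} GB")
```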
4. Implementation
Graviton-Native is implemented in Python/PyTorch and provides:
- BitLinear — ternary linear layer with efficient forward pass
- BitNetBlock — transformer block with ReLU² activation
- MoELayer — top-k router + expert FFNs
- Training pipeline — HuggingFace datasets, WikiText, C4, custom JSONL
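For reference, the ReLU² (squared ReLU) activation used in BitNetBlock is simply the ReLU output squared. The helper below is a plain-Python illustration on a list of floats, not the library's tensor implementation.

```python
def relu_squared(x):
    """Squared ReLU: max(0, v)**2 applied elementwise."""
    return [max(0.0, v) ** 2 for v in x]


activations = relu_squared([-2.0, 0.0, 1.5, 3.0])
```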
Checkpoints are compatible with Graviton's inference engine. The engine auto-detects BitNet (via use_ternary_weights or model_type: bitnet) and MoE (via num_experts) and loads the appropriate model class.
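The detection rule can be sketched as a small function over a checkpoint's config dictionary. The keys `use_ternary_weights`, `model_type`, and `num_experts` come from the description above; the function name, the order in which the checks run, and the `"dense"` fallback are assumptions, and the actual engine logic may differ.

```python
def detect_architecture(config):
    """Classify a checkpoint config dict as 'bitnet', 'moe', or 'dense'.

    Checking BitNet before MoE and falling back to 'dense' are
    assumptions of this sketch, not documented engine behaviour.
    """
    if config.get("use_ternary_weights") or config.get("model_type") == "bitnet":
        return "bitnet"
    if config.get("num_experts"):
        return "moe"
    return "dense"


# Hypothetical configs for illustration.
print(detect_architecture({"model_type": "bitnet"}))
print(detect_architecture({"num_experts": 8}))
print(detect_architecture({"hidden_size": 512}))
```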
5. Results
We validate the framework with small-scale experiments; larger configurations are currently available as presets:
| Architecture | Params | Memory / active params | Status |
|---|---|---|---|
| BitNet 350M | 336M | ~66 MB | ✓ Trained, inference verified |
| BitNet 2B | 2B | ~400 MB | ✓ Preset available |
| MoE small | 61M | ~3M active/token | ✓ Trained, inference verified |
| MoE large | 500M+ | ~20M active/token | Preset available |
6. Conclusion
Graviton-Native shows how architectural innovation, in the form of native ternary weights and MoE, can bring large language models within reach of widely accessible hardware. By training models efficiently from scratch rather than compressing after the fact, we move toward a future where AI is not confined to data centers.
References
[1] Ma et al., "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits," arXiv:2402.17764, 2024.
[2] Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR 2017.
Code: github.com/opengraviton/graviton-native
Inference: github.com/opengraviton/graviton