← Back to Graviton

Graviton-Native: Efficient LLM Architectures for 32GB RAM

OpenGraviton Community · Technical Report · 2025

Abstract. Large language models (LLMs) increasingly require hundreds of gigabytes of memory, excluding most users from running state-of-the-art AI locally. We present Graviton-Native, a framework for training and deploying efficient LLM architectures that achieve 500B+ parameter capacity on consumer hardware with 32GB RAM. Our approach combines (1) native ternary (1.58-bit) weight training inspired by BitNet, (2) Mixture-of-Experts (MoE) with top-k routing for sparse activation, and (3) integration with the Graviton inference engine for streaming and quantization. We demonstrate that a 350M parameter BitNet-style model requires ~66 MB (vs 672 MB FP16), and MoE architectures enable 500B total parameters with ~10B active per token. Our work enables AI democratization by making large models accessible on hardware users already own.

1. Introduction

The scaling of large language models has created a memory crisis: models with 70B+ parameters require 140GB+ of RAM, effectively limiting access to cloud providers and well-funded institutions. Post-training quantization (INT8, INT4) reduces memory by 2–4× but does not address the fundamental architectural inefficiency. We argue that architectural change—training models natively with efficient representations—is necessary to democratize AI.

Graviton-Native introduces two complementary approaches: BitNet b1.58 (ternary weights) and Mixture of Experts (MoE). Both are designed for training from scratch and integrate seamlessly with the Graviton inference engine.

2. BitNet b1.58 — Native Ternary Weights

We adopt the BitNet b1.58 formulation [1]: weights are constrained to {-1, 0, +1} during training and inference. This yields:

Quantization uses absmean thresholding: threshold = α × mean(|W|), with values above threshold mapped to sign(W) and below to 0. Per-group scaling preserves magnitude.

ModelFP16Ternary (1.58-bit)Reduction
350M params672 MB~66 MB~10×
2B params4 GB~400 MB~10×
70B params140 GB~14 GB~10×

3. Mixture of Experts (MoE)

MoE architectures enable total parameter counts far exceeding available memory by activating only a subset of experts per token. We use top-k routing: each token is routed to the k experts with highest router logits. With k=2 and 8 experts, only 25% of parameters are active per forward pass.

For 500B total parameters with 10B active per token:

This makes 500B models feasible on consumer hardware when combined with Graviton's streaming loader.

4. Implementation

Graviton-Native is implemented in Python/PyTorch and provides:

Checkpoints are compatible with Graviton's inference engine. The engine auto-detects BitNet (via use_ternary_weights or model_type: bitnet) and MoE (via num_experts) and loads the appropriate model class.

5. Results

We validate the framework with small-scale experiments:

ArchitectureParamsMemoryStatus
BitNet 350M336M~66 MB✓ Trained, inference verified
BitNet 2B2B~400 MB✓ Preset available
MoE small61M~3M active/token✓ Trained, inference verified
MoE large500M+~20M active/tokenPreset available

6. Conclusion

Graviton-Native demonstrates that architectural innovation—native ternary weights and MoE—enables large language models to run on hardware accessible to everyone. By training models efficiently from scratch rather than compressing after the fact, we move toward a future where AI is not confined to data centers.

References

[1] Liu et al., "BitNet b1.58: Scaling 1-bit Transformers," arXiv:2402.17764, 2024.

[2] Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR 2017.

Code: github.com/opengraviton/graviton-native

Inference: github.com/opengraviton/graviton