Parallax

The World Inference Engine

LLMs are reshaping how we think, build, and create, but their demand for tokens is outpacing what centralized infrastructure can deliver. Chips are saturated, power grids are strained, and intelligence remains locked behind high-cost silos.

Parallax reimagines model inference as a global, collaborative process: models are no longer chained to centralized infrastructure, but are instead recomposed, executed, and verified across a global mesh of compute.

The engine introduces three foundational shifts:

  1. Intelligence sovereignty: serve models from the hardware you trust

  2. Composable inference: GPUs, Apple Silicon, and desktops working in harmony

  3. Latent compute: activate the world’s untapped compute

The Architecture

Parallax is a system specifically engineered for high-performance structured generation on decentralized machine networks. The system consists of three main layers: Runtime, Communication, and Worker.

Runtime Layer

The Parallax Runtime Layer is the core orchestration engine for high-throughput, server-side LLM serving on distributed, heterogeneous networks. It's composed of an Executor (control loop), Model Shard Holder, Request Manager, Scheduler, and Paged KV Cache Manager. As the first framework of its kind for the MLX ecosystem, the Parallax runtime pioneers professional-grade serving on Apple Silicon.
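To make the division of labor among these components concrete, here is a minimal sketch of how an executor control loop, a request queue, and a paged KV cache manager could fit together. All class and method names are hypothetical illustrations, not Parallax's actual API; the "model step" is faked with a placeholder token.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    request_id: str
    prompt_tokens: list
    generated: list = field(default_factory=list)
    max_new_tokens: int = 4

class PagedKVCacheManager:
    """Toy paged KV cache: hands out fixed-size block ids per request."""
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.free_blocks = deque(range(num_blocks))
        self.block_size = block_size
        self.table = {}  # request_id -> list of allocated block ids

    def allocate(self, request_id: str) -> int:
        block = self.free_blocks.popleft()
        self.table.setdefault(request_id, []).append(block)
        return block

    def release(self, request_id: str) -> None:
        for block in self.table.pop(request_id, []):
            self.free_blocks.append(block)

class Executor:
    """Control loop: admit requests, step the batch, retire finished ones."""
    def __init__(self, cache: PagedKVCacheManager):
        self.waiting = deque()   # request manager's admission queue
        self.running = []
        self.finished = []
        self.cache = cache

    def submit(self, request: Request) -> None:
        self.waiting.append(request)

    def step(self) -> None:
        # Scheduler: admit waiting requests while cache blocks remain.
        while self.waiting and self.cache.free_blocks:
            req = self.waiting.popleft()
            self.cache.allocate(req.request_id)
            self.running.append(req)
        # The model shard holder would run a forward pass here; we fake one token.
        for req in self.running:
            req.generated.append(0)
        still_running = []
        for req in self.running:
            if len(req.generated) >= req.max_new_tokens:
                self.cache.release(req.request_id)  # free blocks for new requests
                self.finished.append(req)
            else:
                still_running.append(req)
        self.running = still_running
```

The key design point the sketch illustrates is that admission is gated by cache capacity: a request only enters the running batch when the paged KV cache can back it, which is what lets a serving engine sustain high throughput without overcommitting memory.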

Communication Layer

Manages gRPC and tensor streaming between peers, ensuring forward and backward passes succeed even if some nodes fail. It is built on Hivemind’s Distributed Hash Table (DHT), a decentralized key-value system that distributes values across peers and enables fast, reliable information sharing without a central coordinator.
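The core idea behind a DHT is that keys are hashed into an identifier space and each peer is responsible for the keys nearest its own identifier, so any peer can locate a value without asking a central server. A minimal single-process sketch of that ownership rule (not Hivemind's implementation, which runs over the network with replication and routing tables):

```python
import hashlib

def key_id(key: str, bits: int = 32) -> int:
    # Hash a key into the DHT's identifier space.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (1 << bits)

class ToyDHT:
    """Minimal DHT: each peer stores the keys whose ids land closest to its own id."""
    def __init__(self, peer_names):
        self.peers = {name: {} for name in peer_names}
        self.peer_ids = {name: key_id(name) for name in peer_names}

    def _owner(self, key: str) -> str:
        # Any peer can compute the owner locally -- no central coordinator needed.
        kid = key_id(key)
        return min(self.peer_ids, key=lambda p: abs(self.peer_ids[p] - kid))

    def store(self, key: str, value) -> str:
        owner = self._owner(key)
        self.peers[owner][key] = value
        return owner

    def get(self, key: str):
        return self.peers[self._owner(key)].get(key)
```

Because ownership is a pure function of the hash, peers that join or leave only shift responsibility for a slice of the key space, which is what makes the scheme self-healing.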

Worker Layer

Executes inference tasks across heterogeneous hardware platforms, ensuring optimal performance and scalability via a dual-platform approach:

  • GPU Workers: Run a modified version of SGLang, a fast LLM serving framework that uses PyTorch with CUDA kernels to harness the full computational power of NVIDIA GPUs, extended with asynchronous batching for heterogeneous compute.

  • Apple Workers: We have engineered a state-of-the-art serving engine built on our pioneering MLX-compatible runtime, integrating highly optimized Metal kernels, including a Paged Flash Attention kernel, to unlock new levels of inference efficiency and throughput on Apple Silicon.
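The dual-platform approach amounts to one worker interface with two backends selected by the hardware at hand. A hedged sketch of that dispatch (the class names, the `backend` labels, and the trivial `run_shard` bodies are illustrative stand-ins, not Parallax's real worker API):

```python
from abc import ABC, abstractmethod

class Worker(ABC):
    """Common interface both platform backends implement."""
    @abstractmethod
    def run_shard(self, hidden_states: list) -> list: ...

class GPUWorker(Worker):
    backend = "sglang+cuda"
    def run_shard(self, hidden_states):
        # A real worker would call into SGLang's batched CUDA path here.
        return [h * 2.0 for h in hidden_states]

class AppleWorker(Worker):
    backend = "mlx+metal"
    def run_shard(self, hidden_states):
        # A real worker would invoke the MLX runtime's Metal kernels here.
        return [h * 2.0 for h in hidden_states]

def pick_worker(hardware: str) -> Worker:
    # Route each node to the platform-appropriate engine.
    return GPUWorker() if hardware == "cuda" else AppleWorker()
```

Keeping the interface identical across backends is what lets the scheduler treat NVIDIA GPUs and Apple Silicon machines as interchangeable shard hosts.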

The Swarm

Parallax runs on a distributed architecture called the Swarm: a fully distributed machine mesh where your prompt is tokenized, segmented, and routed across nodes holding model shards.

Each node executes its assigned layers of the LLM, passing hidden states forward until the full inference is complete.
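The flow above is pipeline parallelism: the model's layers are split into contiguous shards, and each node applies its shard to the hidden states before handing them to the next. A toy sketch, with network hops replaced by local function calls and shards reduced to simple transforms (all names are illustrative):

```python
def embed(tokens):
    # Toy embedding: one float per token id.
    return [float(t) for t in tokens]

def make_shard(weight):
    # Each shard stands in for a contiguous block of transformer layers.
    def shard(hidden):
        return [h * weight for h in hidden]
    return shard

def run_pipeline(tokens, shards):
    """Tokenized prompt flows shard to shard until inference completes."""
    hidden = embed(tokens)
    for shard in shards:  # in the Swarm, each hop is a tensor stream to the next node
        hidden = shard(hidden)
    return hidden
```

For example, `run_pipeline([1, 2], [make_shard(2.0), make_shard(3.0)])` runs both shards in sequence, just as hidden states would traverse two nodes in the mesh.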

Optimal nodes are selected based on availability, compute, and latency. Coordination happens peer-to-peer via a DHT, enabling efficient routing, self-healing, and fault tolerance.
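One simple way to rank candidates on those three criteria is to filter out unavailable nodes and score the rest by compute per unit of latency. This scoring rule is an illustrative assumption, not Parallax's published selection algorithm:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    available: bool
    tflops: float      # advertised compute
    latency_ms: float  # measured round-trip latency

def score(node: Node) -> float:
    # More compute and less latency score higher; +1 avoids division by zero.
    return node.tflops / (1.0 + node.latency_ms)

def select_nodes(nodes, k):
    """Pick the k best available nodes for hosting shards."""
    candidates = [n for n in nodes if n.available]
    return sorted(candidates, key=score, reverse=True)[:k]
```

In a real mesh these metrics would be refreshed continuously via the DHT, so the ranking adapts as nodes join, leave, or degrade.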

Learn More

Read the blog for more details. Learn more about Parallax's architecture and benchmark results in the research paper.
