# Parallax

{% hint style="success" %}
Experience distributed inference with the [Gradient Chatbot](https://chat.gradient.network), powered by Parallax.

Models available:

`qwen3-235b-a22b`

`gpt-oss-120b`
{% endhint %}

{% hint style="success" %}
Research Paper: [published](https://gradient.network/parallax.pdf)
{% endhint %}

LLMs are reshaping how we think, build, and create, but their demand for tokens is outpacing what centralized infrastructure can deliver. Chips are saturated, power grids are strained, and intelligence remains locked behind high-cost silos.

Parallax reimagines model inference as a global, collaborative process: models are no longer chained to centralized infrastructure but are instead recomposed, executed, and verified across a global mesh of compute.

The engine introduces three foundational shifts:

1. Intelligence sovereignty: serve models from the hardware you trust
2. Composable inference: GPUs, Apple Silicon, and desktops working in harmony
3. Latent compute: activate the world’s untapped compute

## The Architecture

Parallax is a system specifically engineered for high-performance structured generation on decentralized machine networks. The system consists of three main layers: Runtime, Communication, and Worker.

<figure><img src="/files/4DpKYOuKv7F3A13PwvKr" alt=""><figcaption></figcaption></figure>

### Runtime Layer

The Parallax Runtime Layer is the core orchestration engine for high-throughput, server-side LLM serving on distributed, heterogeneous networks. It's composed of an Executor (control loop), Model Shard Holder, Request Manager, Scheduler, and Paged KV Cache Manager.\
\
As the first framework of its kind for the MLX ecosystem, the Parallax runtime pioneers professional-grade serving on Apple Silicon.
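The interplay of these runtime components can be sketched as a simple admission loop: the Scheduler pulls waiting requests into the running batch whenever the Paged KV Cache Manager can allocate pages for them. This is an illustrative sketch only; the class and method names (`PagedKVCacheManager`, `Scheduler.step`, etc.) are assumptions, not Parallax's actual API.

```python
from collections import deque

class PagedKVCacheManager:
    """Hands out fixed-size KV-cache pages to requests (toy model)."""
    def __init__(self, num_pages, page_size):
        self.free_pages = list(range(num_pages))
        self.page_size = page_size

    def alloc(self, num_tokens):
        needed = -(-num_tokens // self.page_size)  # ceiling division
        if needed > len(self.free_pages):
            return None  # not enough pages; the request must wait
        return [self.free_pages.pop() for _ in range(needed)]

class Scheduler:
    """Admits queued requests into the running batch while pages remain."""
    def __init__(self, cache):
        self.cache = cache
        self.waiting = deque()
        self.running = []

    def submit(self, request_id, prompt_len):
        self.waiting.append((request_id, prompt_len))

    def step(self):
        # One iteration of the executor control loop: admit as many
        # waiting requests as the KV cache can hold, in arrival order.
        while self.waiting:
            rid, n = self.waiting[0]
            pages = self.cache.alloc(n)
            if pages is None:
                break
            self.waiting.popleft()
            self.running.append((rid, pages))
        return [rid for rid, _ in self.running]
```

For example, with four 16-token pages, a 20-token prompt is admitted (two pages) while a 40-token prompt (three pages) waits until memory frees up.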

### Communication Layer

Manages gRPC and tensor streaming between peers, ensuring forward and backward passes succeed even if some nodes fail. It is built on Hivemind’s Distributed Hash Table (DHT), a decentralized key-value system that distributes values across peers and enables fast, reliable information sharing without a central coordinator.
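The core DHT idea can be illustrated in a few lines: each key is owned by the peer whose hashed id is closest to the key's hash, so any peer can locate a value without asking a central coordinator. This is a deliberately simplified sketch; real DHTs such as the Kademlia variant underlying Hivemind use XOR distance and iterative peer-to-peer routing, and `ToyDHT` is a hypothetical name.

```python
import hashlib

def node_id(name: str) -> int:
    # Hash names into a shared id space.
    return int(hashlib.sha256(name.encode()).hexdigest(), 16)

class ToyDHT:
    def __init__(self, peer_names):
        self.peers = {name: {} for name in peer_names}

    def _owner(self, key: str) -> str:
        # The peer whose id is numerically closest to hash(key) owns the key.
        kid = node_id(key)
        return min(self.peers, key=lambda p: abs(node_id(p) - kid))

    def store(self, key: str, value):
        self.peers[self._owner(key)][key] = value

    def get(self, key: str):
        # Any peer can recompute the owner and fetch the value directly.
        return self.peers[self._owner(key)].get(key)
```

In a swarm, keys like shard assignments or peer addresses would be stored this way, so routing information survives without any single coordinating node.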

### Worker Layer

Executes inference tasks across heterogeneous hardware platforms, ensuring optimal performance and scalability via a dual-platform approach:

* GPU Workers: use a modified version of SGLang (a fast LLM serving framework that builds on PyTorch with CUDA kernels to harness the full computational power of NVIDIA GPUs), extended with asynchronous batching for heterogeneous compute.
* Apple Workers: use a state-of-the-art serving engine built on our pioneering MLX-compatible runtime, integrating highly optimized Metal kernels, including a Paged Flash Attention kernel, to unlock high inference efficiency and throughput on Apple Silicon.
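The dual-platform approach amounts to dispatching each worker to the right backend based on its accelerator. A minimal sketch, where the backend labels mirror the text (SGLang-based GPU path, MLX-based Apple path) but the interface itself is a hypothetical simplification:

```python
from dataclasses import dataclass

@dataclass
class WorkerSpec:
    hostname: str
    accelerator: str  # e.g. "cuda" for NVIDIA GPUs, "mps" for Apple Silicon

def select_backend(spec: WorkerSpec) -> str:
    # Route each worker to the serving engine suited to its hardware.
    if spec.accelerator == "cuda":
        return "sglang-gpu"   # PyTorch + CUDA kernels, async batching
    if spec.accelerator == "mps":
        return "mlx-apple"    # MLX runtime with optimized Metal kernels
    raise ValueError(f"unsupported accelerator: {spec.accelerator}")
```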

## The Swarm

<figure><img src="/files/AWnWhQ5qV551VV7hBPaq" alt=""><figcaption></figcaption></figure>

Parallax runs on a distributed architecture called the Swarm: a fully distributed machine mesh where your prompt is tokenized, segmented, and routed across nodes holding model shards.

Each node executes its assigned layers of the LLM, passing hidden states forward until the full inference is complete.

Optimal nodes are selected based on availability, compute, and latency. Coordination happens peer-to-peer via a DHT, enabling efficient routing, self-healing, and fault tolerance.
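The shard-to-shard flow above can be sketched as a toy pipeline: each node runs its slice of layers and hands the hidden state to the next. The "layers" here are stand-in functions rather than real transformer blocks, and the helper names are illustrative assumptions.

```python
def make_node(layer_fns):
    # A node holds a contiguous slice of the model's layers.
    def run(hidden):
        for fn in layer_fns:
            hidden = fn(hidden)
        return hidden
    return run

def swarm_forward(nodes, hidden):
    # Route the activation through each node's shard in pipeline order;
    # the last node's output is the completed forward pass.
    for node in nodes:
        hidden = node(hidden)
    return hidden
```

With stand-in layers like `h + 1`, `h * 2` on node one and `h - 3` on node two, an input of 5 flows through the pipeline as ((5 + 1) * 2) - 3 = 9.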

## Learn More

Read the [blog](https://gradient.network/blog/parallax-world-inference-engine) for more details.\
\
Learn more about Parallax's architecture and benchmark results in the [research paper](https://gradient.network/parallax.pdf).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.gradient.network/the-open-intelligence-stack/parallax.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
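Constructing the query URL requires percent-encoding the question. A small sketch using only the standard library (fetching the URL would additionally need an HTTP client and network access, so only URL construction is shown):

```python
from urllib.parse import urlencode

BASE = "https://docs.gradient.network/the-open-intelligence-stack/parallax.md"

def ask_url(question: str) -> str:
    # urlencode percent-encodes the question for use as the `ask` parameter.
    return f"{BASE}?{urlencode({'ask': question})}"
```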
