Ecosystem – PyTorch

Mooncake Joins PyTorch Ecosystem

The Mooncake Team — Thu, 12 Feb 2026 22:38:35 +0000

We are thrilled to announce that Mooncake has officially joined the PyTorch Ecosystem! By integrating Mooncake’s high-performance KVCache transfer and storage capabilities with PyTorch-native inference engines like SGLang, and vLLM, and TensorRT-LLM, we are unlocking new levels of throughput and scalability for large language model deployments.

To view the PyTorch Ecosystem, see the PyTorch Landscape. Learn more about how projects can join the PyTorch Ecosystem.

About Mooncake

Mooncake is designed to solve the “memory wall” in LLM serving. As context lengths grow and models scale, the static binding of Key-Value (KV) cache to specific GPU workers becomes a primary bottleneck.

Mooncake empowers inference engines to break this binding, unlocking four critical capabilities:

(Encoder) Prefill-Decode Disaggregation: Mooncake’s high-performance Mooncake Transfer Engine separates heavy computation (prefill/encoder) from latency-sensitive generation (decoding) into distinct clusters.
Global KVCache Reuse: By acting as a distributed shared memory for KV blocks, Mooncake Store enables valid cache to be reused globally across different requests and engine instances.
Elastic Expert Parallelism: By decoupling experts from specific workers, Mooncake-EP enables elastic and resilient serving where experts of Mixture-of-Experts (MoE) models can be dynamically routed or recovered, ensuring high availability even during partial node failures.
PyTorch Distributed Backend: Mooncake Backend serves as a fault-tolerant PyTorch distributed backend. It provides robust collective communication primitives capable of continuing operation seamlessly in the presence of rank failures.
Weighs Updating: Mooncake Store enables rapid weight updates for RL and checkpoint scenarios by storing weights internally. It offers tensor-native and zero-copy APIs.

Wide Industry Adoption

Mooncake originated from a research collaboration between Moonshot AI and Tsinghua University. It was born from the need to solve the “memory wall” in serving massive-scale models like Kimi. Since open-sourcing, it has evolved into a thriving community-driven project.

Mooncake’s architecture has been battle-tested in some of the world’s most demanding production environments. Its ability to decouple compute from memory has led to wide adoption across leading organizations, including Moonshot AI (Kimi), Alibaba Cloud, Ant Group, JD.com, Tencent, Meituan, Approaching.AI and LightSeek Foundation.

These organizations utilize Mooncake to maximize GPU utilization and ensure smooth serving for millions of concurrent users.

In Action: A Joint Solution

To demonstrate the full potential of this architecture, we present a joint solution that combines Mooncake with the ecosystem’s leading inference engines and orchestration tools.

In this architecture, we will use RoleBasedGroup (RBG, https://github.com/sgl-project/rbg) to orchestrate the entire topology, defining the relationships and startup order of the cluster. It deploys Shepherd Model Gateway (SMG, https://github.com/lightseekorg/smg) as the critical routing layer, which intelligently directs incoming requests to the appropriate workers based on cache locality and system load. The heavy lifting is then performed by SGLang (https://github.com/sgl-project/sglang) or vLLM (https://github.com/vllm-project/vllm) instances serving as compute workers, while Mooncake functions as the high-speed data plane: its Transfer Engine pushes prefilled KV cache via RDMA/NVLink, and its Store persists that cache for global reuse by decoding nodes.

1. Deployment with SGLang + Mooncake + SMG

Below is the RBG configuration that immediately deploys a complete SGLang architecture. In this case, both Prefill-Decode Disaggregation and Global KVCache-Reuse are enabled. The Prefill instances utilize Mooncake TE to transfer kvcache to Decode instances, while Mooncake Store facilitates reusing KVCache across different requests within the Prefill instance (more details in KEP-74 Mooncake Integration and pd-disaggregated-with-mooncake.yaml).

    YAML

# Joint Solution: RBG + SMG + SGLang + Mooncake (Production Ready)
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: sglang-mooncake-smg-v2
spec:
  roles:
    # 1. Mooncake Master: Centralized Metadata Server for TE and Store
    - name: mooncake-master
      replicas: 1
      template:
        spec:
          containers:
            - name: master
              image: lmsysorg/sglang:latest
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              command: ["mooncake_master"]
              args:
                - --enable_http_metadata_server=true
                - --rpc_address=$(POD_IP)
                - --rpc_port=50051
                - --http_metadata_server_host=$(POD_IP)
                - --http_metadata_server_port=8080
                - --metrics_port=9003

    # 2. Mooncake Store: Distributed KVCache Storage Nodes
    - name: mooncake-store
      replicas: 3
      dependencies: ["mooncake-master"]
      template:
        spec:
          containers:
            - name: store-node
              image: lmsysorg/sglang:latest
              env:
                - name: MOONCAKE_MASTER
                  value: "s-sglang-mooncake-smg-v2-mooncake-master:50051"
                - name: MOONCAKE_TE_META_DATA_SERVER
                  value: "http://s-sglang-mooncake-smg-v2-mooncake-master:8080/metadata"
                - name: MOONCAKE_GLOBAL_SEGMENT_SIZE
                  value: "45gb"
                - name: MOONCAKE_PROTOCOL
                  value: "rdma" # Use RDMA for zero-copy KVCache transfer
              command: ["python3", "-m", "mooncake.mooncake_store_service"]
              resources:
                limits:
                  memory: "50Gi"
                  rdma/hca: 1 # Required for high-speed TE transfer
                requests:
                  memory: "50Gi"
                  rdma/hca: 1

    # 3. Prefill Worker (SGLang): High-throughput Prefill with Mooncake Push
    - name: prefill-worker
      replicas: 1
      dependencies: ["mooncake-master", "mooncake-store"]
      template:
        spec:
          containers:
            - name: sglang-prefill
              image: lmsysorg/sglang:latest
              env:
                - name: MOONCAKE_MASTER
                  value: "s-sglang-mooncake-smg-v2-mooncake-master:50051"
                - name: MOONCAKE_TE_META_DATA_SERVER
                  value: "http://s-sglang-mooncake-smg-v2-mooncake-master:8080/metadata"
                - name: MOONCAKE_PROTOCOL
                  value: "rdma"
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path /models/Qwen3
                - --tp 4
                - --disaggregation-mode prefill
                - --disaggregation-transfer-backend mooncake # Activates Mooncake TE for KVCache Push
                - --enable-hierarchical-cache # Enables KVCache offloading
                - --hicache-storage-backend mooncake # Uses Mooncake as the L2/L3 cache backend
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1

    # 4. Decode Worker (SGLang): Low-latency Generation with Mooncake Pull
    - name: decode-worker
      replicas: 2
      dependencies: ["mooncake-master", "prefill-worker"]
      template:
        spec:
          containers:
            - name: sglang-decode
              image: lmsysorg/sglang:latest
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path /models/Qwen3
                - --tp 4
                - --disaggregation-mode decode # Pulls shared KVCache from Mooncake Store
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1

    # 5. Shepherd Model Gateway (SMG): Intelligent PD-Disaggregation Router
    - name: smg-router
      replicas: 1
      dependencies: ["prefill-worker", "decode-worker"]
      template:
        spec:
          containers:
            - name: router
              image: lightseekorg/smg:latest
              command:
                - smg
                - --pd-disaggregation
                - --prefill http://s-sglang-mooncake-smg-v2-prefill-worker:8000
                - --decode http://s-sglang-mooncake-smg-v2-decode-worker:8000
                - --host 0.0.0.0
                - --port 8000

2. Deployment with vLLM + Mooncake

vLLM has also integrated Mooncake support, allowing users to leverage Mooncake connectors for seamless KV transfer. Below is the equivalent rbg() solution for deploying vLLM in a disaggregated setup using Mooncake connectors.

YAML

# Joint Solution: RBG + vLLM + Mooncake Connector
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: vllm-pd-with-mooncake-demo
spec:
  roles:
    # 1. Gateway: Routing to vLLM instances (SMG or vLLM Proxy)
  - name: proxy
    dependencies: [ "prefill", "decode" ]
    replicas: 1
    template:
      spec:
        containers:
        - name: proxy
          image: lightseekorg/smg:latest
          command:
          - smg
          - --prefiller-host
          - http://vllm-pd-with-mooncake-demo-prefill-0.s-vllm-pd-with-mooncake-demo-prefill
          - --prefiller-port
          - "8000"
          - --decoder-host
          - http://vllm-pd-with-mooncake-demo-decode-0.s-vllm-pd-with-mooncake-demo-decode
          - --decoder-port
          - "8000"

    # 2. Prefill Worker (vLLM): Producer role
  - name: prefill
    replicas: 1
    template:
      spec:
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwen2.5-7b
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
        containers:
        - name: prefill
          image: vllm/vllm-openai:latest
          command:
          - sh
          - -c
          - |
            pip install mooncake-transfer-engine && \
            vllm serve /models/Qwen2.5-7B-Instruct \
              --port 8000 \
              --tensor-parallel-size 4 \
              --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
          ports:
          - containerPort: 8000
            name: http
          readinessProbe:
            initialDelaySeconds: 30
            periodSeconds: 10
            tcpSocket:
              port: 8000
          resources:
            limits:
              nvidia.com/gpu: "4"
              rdma/hca: 1
              memory: "100Gi"
            requests:
              nvidia.com/gpu: "4"
              rdma/hca: 1
              memory: "100Gi"
          volumeMounts:
          - mountPath: /models/Qwen2.5-7B-Instruct
            name: model
          - mountPath: /dev/shm
            name: dshm

    # 3. Decode Worker (vLLM): Consumer role
  - name: decode
    replicas: 1
    template:
      spec:
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: qwen2.5-7b
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 30Gi
        containers:
        - name: decode
          image: vllm/vllm-openai:latest
          command:
          - sh
          - -c
          - |
            pip install mooncake-transfer-engine && \
            vllm serve /models/Qwen2.5-7B-Instruct \
              --port 8000 \
              --tensor-parallel-size 4 \
              --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
          ports:
          - containerPort: 8000
            name: http
          readinessProbe:
            initialDelaySeconds: 30
            periodSeconds: 10
            tcpSocket:
              port: 8000
          resources:
            limits:
              nvidia.com/gpu: "4"
              rdma/hca: 1
              memory: "100Gi"
            requests:
              nvidia.com/gpu: "4"
              rdma/hca: 1
              memory: "100Gi"
          volumeMounts:
          - mountPath: /models/Qwen2.5-7B-Instruct
            name: model
          - mountPath: /dev/shm
            name: dshm
---

apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-pd-with-mooncake-demo
  name: vllm-pd-with-mooncake-demo
  namespace: default
spec:
  ports:
  - name: http
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    rolebasedgroup.workloads.x-k8s.io/name: vllm-pd-with-mooncake-demo
    rolebasedgroup.workloads.x-k8s.io/role: proxy
  type: ClusterIP

Conclusion

Mooncake adds a vital layer of memory virtualization to the open-source AI stack. By enabling PyTorch engines—whether SGLang, vLLM, or TensorRT-LLM —to adopt KVCache-centric architectures, we are paving the way for more efficient, scalable, and lower-latency LLM services.

We invite you to explore the project and start building:

Mooncake GitHub: https://github.com/kvcache-ai/Mooncake

Mooncake Project Doc: https://kvcache-ai.github.io/Mooncake/

PyTorch on Kubernetes: Kubeflow Trainer Joins the PyTorch Ecosystem

Mon, 28 Jul 2025 21:52:53 +0000

We’re thrilled to announce that the Kubeflow Trainer project has been integrated into the PyTorch ecosystem! This integration ensures that Kubeflow Trainer aligns with PyTorch’s standards and practices, giving developers a reliable, scalable, and community-backed solution to run PyTorch on Kubernetes.

To view the PyTorch Ecosystem, see the PyTorch Landscape. Learn more about how projects can join the PyTorch Ecosystem.

About Kubeflow Trainer

Kubeflow Trainer is a Kubernetes-native project enabling scalable, distributed training of AI models and purpose-built for fine-tuning large language models (LLMs). It simplifies the scale-out of training workloads on multiple nodes, managing large datasets efficiently and ensuring fault-tolerance.

The core features include:

Simplify Kubernetes complexity: Kubeflow Trainer APIs are designed for two primary user personas – AI practitioners – ML engineers and data scientists who develop AI models using the Kubeflow Python SDK and TrainJob APIs, platform admins – administrators and DevOps engineers responsible for managing Kubernetes clusters and Kubeflow Trainer runtimes APIs. AI practitioners can focus on the application code in PyTorch without worrying about infrastructure details. Meanwhile, platform admins can flexibly schedule workload resources for maximum cluster utilization and cost efficiency. To support these roles, Kubeflow Trainer specifies purpose-built Kubernetes Custom Resource Definitions (CRDs) that streamline model training and infrastructure management.

Python SDK: A Pythonic interface designed for AI practitioners, abstract the details of interacting directly with Kubernetes APIs. It enables users to focus on developing PyTorch models without worrying about Kubernetes YAML configurations.
Blueprints for LLMs fine-tuning on Kubernetes: With built-in trainers, Kubeflow Trainer enables AI practitioners to seamlessly fine-tune their favorite LLMs using the desired configuration for datasets, LoRA parameters, learning rate, etc. In the first release, it implements recipes to support various fine-tuning strategies, including Supervised Fine-Tuning (SFT), Knowledge Distillation, DPO, PPO, GRPO, and Quantization-Aware Training. Community is working towards adding more builtin trainers powered by LLaMA-Factory, Unsloth, HuggingFace TRL to enable efficient LLMs fine-tuning.
Optimized GPU utilization: Kubeflow Trainer maximizes GPU efficiency by streaming large-scale data directly to distributed GPUs using an in-memory distributed data cache powered by Apache Arrow and Apache DataFusion
Advanced scheduling capabilities: Kubeflow Trainer supports gang scheduling through the PodGroupPolicy API, enabling coordinated scheduling of pods across nodes. It also integrates with Kubernetes schedulers such as Kueue, Coscheduling, Volcano, and KAI Scheduler to ensure all required resources are allocated before training jobs start.
Accelerate MPI workloads on Kubernetes: Kubeflow Trainer supports MPI-based runtimes such as DeepSpeed and MLX. It handles all necessary orchestration of MPI workloads with SSH-based optimization to boost MPI performance.
Improved resilience and fault-tolerance: By leveraging Kubernetes-native APIs like Jobs and JobSets, Kubeflow Trainer improves reliability and efficiency of AI workloads. With support for the PodFailurePolicy API, users can reduce cost by avoiding unnecessary restarts. Additionally, the SuccessPolicy API allows training jobs to complete early once the target objective is achieved.

Background and Evolution

This project was originally started as a distributed training operator for TensorFlow (e.g. TFJob), and later we merged efforts from other Kubeflow Training Operators (e.g. PyTorchJob, MPIJob) to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We’d also like to thank everyone who’s contributed to and maintained the original operators.

By joining the PyTorch Ecosystem, we strive to apply best practices of deploying distributed PyTorch applications on Kubernetes and bring first-class PyTorch support in Kubeflow Trainer.

Integrations with PyTorch Ecosystem

Kubeflow Trainer is deeply integrated with the PyTorch ecosystem, supporting a broad range of tools and libraries—including torch, DeepSpeed, HuggingFace, Horovod, and more.

It empowers PyTorch users to implement advanced distributed training strategies such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP & FSDP2), and Tensor Parallelism, enabling efficient large-scale model training on Kubernetes.

Additionally, Kubeflow Trainer supports data parallelism using PyTorch IterableDatasets, streaming data directly from distributed in-memory data cache nodes. This allows scalable training even with massive datasets that exceed local memory capacity.

Quick Start

Follow the steps below to quickly deploy Kubeflow Trainer and run your first training job.

Prerequisites

Installed kubectl
Installed kind

Install Kubeflow Trainer

Deploy Kubeflow Trainer control plane on your local kind cluster:

$ kind create cluster

$ kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"


# Ensure that JobSet and Trainer controller manager are running.
$ kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s


# Deploy the Kubeflow Trainer runtimes.
$ kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"

# Install Kubeflow SDK
$ pip install git+https://github.com/kubeflow/sdk.git@64d74db2b6c9a0854e39450d8d1c0201e1e9b3f7#subdirectory=python

Define PyTorch Training Function

After installing the Kubeflow Trainer, define your PyTorch training function that contains end-to-end training script:

def train_pytorch():
    import os
    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, transforms, models

    # [1] Configure CPU/GPU device and distributed backend.
    device, backend = ("cuda", "nccl") if torch.cuda.is_available() else ("cpu", "gloo")
    dist.init_process_group(backend=backend)
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    device = torch.device(f"{device}:{local_rank}")
    
    # [2] Get the pre-defined model.
    model = models.shufflenet_v2_x0_5(num_classes=10)
    model.conv1 = torch.nn.Conv2d(1, 24, kernel_size=3, stride=2, padding=1, bias=False)
    model = torch.nn.parallel.DistributedDataParallel(model.to(device))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
   
    # [3] Get the FashionMNIST dataset and distribute it across all available devices.
    if local_rank == 0: # Download dataset only on local_rank=0 process.
        dataset = datasets.FashionMNIST("./data", train=True, download=True, transform=transforms.Compose([transforms.ToTensor()]))
    dist.barrier()
    dataset = datasets.FashionMNIST("./data", train=True, download=False, transform=transforms.Compose([transforms.ToTensor()]))
    train_loader = DataLoader(dataset, batch_size=100, sampler=DistributedSampler(dataset))

    # [4] Define the PyTorch training loop.
    for epoch in range(3):
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)
            # Forward and Backward pass
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(f"Epoch {epoch} [{batch_idx * len(inputs)}/{len(train_loader.dataset)}] "
                    f"Loss: {loss.item():.4f}"
                )

Run PyTorch on Kubernetes with TrainJob

After defining the training function, use the Kubeflow SDK to create TrainJob:

from kubeflow.trainer import TrainerClient, CustomTrainer

job_id = TrainerClient().train(
    trainer=CustomTrainer(
        func=train_pytorch,
        num_nodes=2,
        resources_per_node={
            "cpu": 3,
            "memory": "3Gi",
            # "gpu": 2, # Uncomment this line if you have GPUs.
        },
    ),
    runtime=TrainerClient().get_runtime("torch-distributed"),
)

Get the TrainJob Results

After creating the TrainJob, you should be able to list it:

for job in TrainerClient().list_jobs():
    print(f"TrainJob: {job.name}, Status: {job.status}")

TrainJob: q33a18f65635, Status: Created

It may take a few minutes for the TrainJob to pull the PyTorch image the first time. Once the image is pulled, the TrainJob‘s steps should transition to Running status. Each step represents a training node, and the number of devices per step corresponds to the number of devices on that node.

for s in TrainerClient().get_job(name=job_id).steps:
    print(f"Step: {s.name}, Status: {s.status}, Devices: {s.device} x {s.device_count}")

Step: node-0, Status: Running, Devices: cpu x 3
Step: node-1, Status: Running, Devices: cpu x 3

After steps are running, you can check the TrainJob logs. The dataset of 60,000 samples has been evenly distributed across 6 CPUs, with each device processing 10,000 samples: 60,000 / 6 = 10,000

print(TrainerClient().get_job_logs(name=job_id)["node-0"])

...
Epoch 0 [8000/60000] Loss: 0.4476
Epoch 0 [9000/60000] Loss: 0.4784
Epoch 1 [0/60000] Loss: 0.3909
Epoch 1 [1000/60000] Loss: 0.4888
Epoch 1 [2000/60000] Loss: 0.4100
...

Congratulations, you created your first distributed training job with PyTorch and Kubeflow Trainer!

What’s next

Kubeflow Trainer has exciting roadmap including the following items:

Local TrainJob Execution – run Kubeflow Trainer jobs locally without Kubernetes.
Distributed Data Cache – stream in-memory distributed data powered by Apache Arrow and Apache DataFusion.
Advanced scheduling capabilities – improve resources management and gang-scheduling capabilities by integrating with Kueue, KAI Scheduler, Volcano.
Support for JAX runtime.
Automate Checkpointing for GPU-accelerated workloads.

Call to Action

We are excited to welcome Kubeflow Trainer to the PyTorch ecosystem! Kubeflow Trainer democratizes AI model training on Kubernetes and significantly improves the development experience for AI practitioners. We invite you to explore the following resources to learn more about the project:

Read the Kubeflow Trainer v2 announcement blog post and release notes.
Explore the official Kubeflow Trainer documentation.
Join the conversations in the #kubeflow-trainer Slack channel.
Attend our bi-weekly Kubeflow Trainer community calls every Wednesday.
Share your use cases or feature proposals by opening an issue on the GitHub repository.
Tell your store by writing a Kubeflow blog post or speaking at upcoming Kubeflow Events.
Explore the Kubeflow Python SDK for AI practitioners.

We can’t wait to see what you’ll build with Kubeflow Trainer!

FlagGems Joins the PyTorch Ecosystem: Triton-Powered Operator Library for Universal AI Acceleration

FlagGems Team — Wed, 25 Jun 2025 16:53:26 +0000

In the race to accelerate large language models across diverse AI hardware, FlagGems delivers a high-performance, flexible, and scalable solution. Built on Triton language, FlagGems is a plugin-based PyTorch operator and kernel library designed to democratize AI compute. Its mission: to enable a write-once, JIT-everywhere experience, so developers can deploy optimized kernels effortlessly across a wide spectrum of hardware backends. FlagGems recently joined the PyTorch Ecosystem upon acceptance by the PyTorch Ecosystem Working Group.

With over 180 operators already implemented—spanning native PyTorch ops and widely used custom ops for large models—FlagGems is evolving fast to keep pace with the generative AI frontier.

To view the PyTorch Ecosystem, see the PyTorch Landscape and learn more about how projects can join the PyTorch Ecosystem.

Key Features

Extensive Operator Library: 180+ PyTorch-compatible operators and growing
Performance Optimized: Select operators hand-tuned for speed
Torch.compile Independent: Fully functional in eager mode
Pointwise Operator Codegen: Auto-generates kernels for arbitrary input types and layouts
Fast Kernel Dispatching: Per-function runtime dispatch logic
C++ Triton Dispatcher: In development for even faster execution
Multi-Backend Ready: Works across 10+ hardware platforms with a backend-neutral runtime API

Architecture

FlagGems extends the PyTorch dispatch system out-of-tree through a multi-backend library powered by Triton. It intercepts ATen operator calls and provides backend-specific Triton implementations, making it easy to support alternative GPUs and domain-specific accelerators (DSAs).

Plug-and-Play

Registers with PyTorch’s dispatch system
Intercepts ATen operator calls
Seamlessly replaces CUDA operator implementations

Write Once, Compile Anywhere

Unified operator library code
Compilable on any backend with Triton support
Supports GPUs and heterogeneous chips like DSAs

Getting Started in 3 Steps

1. Install dependencies

pip install torch>=2.2.0  # 2.6.0 preferred

pip install triton>=2.2.0 # 3.2.0 preferred

2. Install FlagGems

git clone https://github.com/FlagOpen/FlagGems.git

cd FlagGems

pip install --no-build-isolation .

or editable install:

pip install --no-build-isolation -e .

3. Enable FlagGems in your project

import flag_gems

flag_gems.enable()  # Replaces supported PyTorch ops globally

Prefer finer control? Use a managed context:

with flag_gems.use_gems():

    output = model.generate(**inputs)

Need explicit ops?

out = flag_gems.ops.slice_scatter(inp, dim=0, src=src, start=0, end=10, step=1)

Automatic Codegen for Pointwise Ops

With the @pointwise_dynamic decorator, FlagGems can auto-generate efficient kernels with broadcast, fusion, and memory layout support. Here’s an example implementing fused GeLU and element-wise multiplication:

@pointwise_dynamic(promotion_methods=[(0, 1, “DEFAULT”)])

@triton.jit

def gelu_tanh_and_mul_kernel(x, y):

    x_fp32 = x.to(tl.float32)

    x_gelu = 0.5 * x_fp32 * (1 + tanh(x_fp32 * 0.79788456 * (1 + 0.044715 * pow(x_fp32, 2))))

    return x_gelu * y

Performance Validation

FlagGems includes built-in testing and benchmarking:

Accuracy Testing

cd tests

pytest test__ops.py --ref cpu

Performance Benchmarks

cd benchmark

pytest test__perf.py -s  # CUDA microbenchmarks

pytest test__perf.py -s --mode cpu  # End-to-end comparison

Benchmark Results

Initial benchmark results of FlagGems showcase its performance against PyTorch’s native operator implementations. The results represent the average measured speedups, a value greater than 1 indicating that FlagGems is faster than the native PyTorch operator. For a vast majority of operators, FlagGems either matches or significantly surpasses the performance of PyTorch’s native implementations.

For a significant portion of the 180+ operators, FlagGems achieves a speedup close to 1.0, indicating performance on par with the native PyTorch implementations.

Some of the core operations like LAYERNORM, CROSS_ENTROPY_LOSS, ADDMM and SOFTMAX also show impressive speedups.

Multi-Backend Support

FlagGems is vendor-flexible and backend-aware:

Set the desired backend with:

export GEMS_VENDOR=

Check active backend in Python:

import flag_gems

print(flag_gems.vendor_name)

Summary

FlagGems delivers a unified kernel library for large models acceleration that bridges software portability and hardware performance. With broad backend support, a growing op set, and advanced codegen features, it’s your go-to Triton playground for pushing the limits of AI compute.

Introducing the PyTorch Ecosystem Working Group and Project Spotlights

PyTorch Ecosystem Working Group — Thu, 05 Jun 2025 19:57:09 +0000

The PyTorch Ecosystem goes back several years, with some of its earliest projects like Hugging Face, Fast.ai, and PyTorch Lightning going on to grow incredible communities of their own. The goal from the beginning was to bring together innovative open source AI projects that extend, integrate with, or build upon PyTorch. Some of the key aspects we looked at were, for example, that they were well tested and maintained (including CI), were easy to onboard as a user, and there was a growing community. Fast forward several years, and the ecosystem continues to thrive with a vibrant landscape of dozens of projects spanning privacy, computer vision, to reinforcement learning. Enter the PyTorch Ecosystem Working Group.

In early 2025, the PyTorch Foundation created the PyTorch Ecosystem Working Group to showcase projects that could be of interest to the community and represent projects that are mature and healthy, standing out in their respective domain. The working group, composed of members across the ecosystem, was tasked with defining a clear bar including functional requirements (e.g., CI, licensing…), measurable requirements (e.g., commits and contributors), and the implementation of best practices for how to structure their repos. The working group also implemented a streamlined submission and review process and a transparent lifecycle. It’s still very early, but the reception from the community has been great, with 21 submissions so far and a strong pipeline of projects in review. You can learn more about this working group’s goals here, including the requirements and application process.

As part of this new blog series, every quarter we will update the community on new entries in the PyTorch Ecosystem, as well as highlight up and coming projects that are in consideration that will benefit from more eyes and contributors.

Ecosystem Project Spotlights

We’re happy to welcome SGlang and docTR to the PyTorch Ecosystem. Here’s a short intro to both.

SGLang

SGLang is a fast-serving engine for large language models and vision language models. It makes the interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

The core features include:

Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ).
Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
Active Community: SGLang is open source and backed by an active community with industry adoption.

SGLang is famous for its fast speed. It can often significantly outperform other state-of-the-art frameworks in terms of serving throughput and latency. Learn more.

docTR

docTR is an Apache 2.0 project developed and distributed by Mindee to help developers integrate OCR capabilities into applications with no prior knowledge required.

To quickly and efficiently extract text information, docTR uses a two-stage approach:

First, it performs text detection to localize words.
Then, it conducts text recognition to identify all characters in a word.

Detection and recognition are performed by state-of-the-art models written in PyTorch. Learn more.

Up and Coming Project Spotlights

As part of this series, we highlight projects that are in consideration for the PyTorch Ecosystem, and that we believe will benefit from more eyes and contributors. This time it’s the turn of EIR and torchcvnn.

EIR

EIR is a comprehensive deep learning framework built on PyTorch that enables researchers and developers to perform supervised modeling, sequence generation, image/array generation, and survival analysis across multiple data modalities. EIR specializes in handling complex data types, including genotype, tabular, sequence, image, array, and binary inputs. While it has particular strengths in genomics and biomedical applications, its versatile handling of these diverse data types allows for broader applications across various sectors. For example, EIR’s multi-modal approach can enhance tasks such as detecting manufacturing defects by linking images with equipment readings (e.g., for an imperfect silicon wafer), monitoring infrastructure by analyzing site photos along with operational logs (e.g., to identify cracks in a pipeline), or improving retail insights by combining product images with their descriptions and sales figures. This demonstrates how EIR’s multi-modal capabilities can bring value to a wide range of industries.

The framework provides a high-level, yet modular API that reduces the amount of boilerplate code and pre-processing required to train models, allowing users to focus on their end goals rather than implementation details. To learn more and explore practical examples, please refer to the documentation.

Key features include:

Multi-modal inputs: Seamless integration of genotype, tabular, sequence, image, array, and binary data.
Varied modeling options: Use any of the input modalities above for supervised learning, sequence generation, image/array generation, and survival analysis.
Scaling: Capabilities for custom data streaming for model training.
Explainability: Built-in explainability functionality for when performing supervised learning and survival analysis.
Model Deployment: Serve any of your trained models with just one command, allowing you or others to interact with your models via web services.

To explore EIR and consider how it might enhance your work with multi-modal data:

Install: Installation is simple:
- pip install eir-dl
- Please refer to the README for more information.
Explore tutorials: Documentation available at eir.readthedocs.io, examples include:
- Genetic ancestry prediction: Explore the basics of how EIR works by training a model to predict genetic ancestry.
- Multi-modal training: Combine tabular, text, and image data for pet adoption prediction.
- Sequence generation: Train and serve an image captioning model.
- Image Generation: Use guided diffusion for image generation.
- Scaling: Implement custom data streaming for GPT-style training.
Contribute: Please visit our GitHub repository.

torchcvnn

torchcvnn is a library that helps researchers, developers, and organizations to easily experiment with Complex-valued Neural Networks (CVNNs)! In several domains, data are naturally represented in real-imaginary form, for instance, remote sensing, MRI, and many more. These domains would benefit from direct complex-valued computations, giving understanding about critical physical characteristics to the neural networks during the learning process.

torchcvnn gives you easy access to:

Standard datasets for both remote sensing (SLC and ALOS2 formats) and MRI, and for different tasks (classification, segmentation, reconstruction, super-resolution)

Various activation functions, either operating independently on the real/imaginary components or fully exploiting the complex nature of the representations,

Normalization layers with the complex-valued BatchNorm of Trabelsi et al.(2018), LayerNorm, RMSNorm,

Complex-valued attention layer as introduced in Eilers et al. (2023),

PyTorch already supports optimization of complex-valued neural networks by implementing Wirtinger Calculus. However, there are still complex-valued building blocks missing to really be able to explore the capabilities of complex-valued neural networks. The objective of torchcvnn is to fill in this gap and to provide a library helping the PyTorch users to dig into the realm of complex-valued neural networks.

torchcvnn warmly welcomes contributions to both the core torchcvnn library or to the examples’ repository for whether spotting a bug, having suggestions for improvements, or even wanting to contribute to the source code. All the components are described in the documentation of the project. The torchcvnn team will be present at IJCNN 2025 in July in Rome during the special session on “Complex- and Hypercomplex-valued Neural Networks.”

How to Join the PyTorch Ecosystem

If you’re developing a project that supports the PyTorch community, you’re welcome to apply for inclusion in the Ecosystem. Please review the PyTorch Ecosystem review process to ensure that you meet the minimum expectations before applying.

Cheers!

The PyTorch Ecosystem Working Group

SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine

SGLang Team — Wed, 19 Mar 2025 15:34:29 +0000

We’re thrilled to announce that the SGLang project has been integrated into the PyTorch ecosystem! This integration ensures that SGLang aligns with PyTorch’s standards and practices, providing developers with a reliable and community-supported framework for fast and flexible serving of LLMs.

To view the PyTorch Ecosystem, see the PyTorch Landscape and learn more about how projects can join the PyTorch Ecosystem.

About SGLang

SGLang is a fast-serving engine for large language models and vision language models. It makes the interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

The core features include:

Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ).
Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
Active Community: SGLang is open source and backed by an active community with industry adoption.

SGLang is famous for its fast speed. It can often significantly outperform other state-of-the-art frameworks in terms of serving throughput and latency. You can learn more about the underlying techniques from the past release blog posts: v0.2 blog, v0.3 blog, v0.4 blog.

SGLang has been widely adopted by leading industry companies and frontier research labs. For example, xAI uses SGLang to serve its flagship model, Grok 3, which is currently the best model according to the Chatbot Arena leaderboard. Microsoft Azure uses SGLang to serve DeepSeek R1 on AMD GPUs, which is currently the best open source model.

Serving DeepSeek Models

You can easily launch a Docker container to serve a DeepSeek model with the following command:

# Pull the latest image
docker pull lmsysorg/sglang:latest

# Launch a server
docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000

Then you can query the server with the OpenAI-compatible API

import openai
client = openai.Client(base_url=f"http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

The server launch command above works for 8xH200. You can find detailed instructions for other hardware (MI300X, H100, A100, H20, L40S) at https://docs.sglang.ai/references/deepseek.html.

SGLang integrates DeepSeek-specific optimizations, such as MLA throughput optimizations, MLA-optimized kernels, data-parallel attention, multi-token prediction, and DeepGemm, making it the top choice for serving DeepSeek models by dozens of companies, including AMD, NVIDIA, and many cloud providers. The team is actively working on integrating more optimizations following the 2025 H1 roadmap below.

Serving Llama Models

Similarly, you can launch the server for a Llama 3.1 text model with:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct

Or a Llama 3.2 multimodal model with:

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct  --chat-template=llama_3_vision

Roadmap

This year, the SGLang team will continue to push the boundaries of system efficiency. You can find the roadmap of 2025H1 here. The focus is

Throughput-oriented large-scale deployment similar to the DeepSeek inference system
Long context optimizations
Low latency speculative decoding
Reinforcement learning training framework integration
Kernel optimizations

Community

SGLang has been deployed to large-scale production, generating trillions of tokens every day. It has an active community with over three hundred contributors on GitHub. It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, iFlytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.

Conclusion

We’re excited to welcome SGLang to the PyTorch ecosystem. SGLang accelerates the serving of large language and vision language models. It’s widely adopted by industry, powering the large-scale online serving of frontier models like Grok and DeepSeek.

We invite you to explore the SGLang GitHub repo, join the community on Slack, and reach out to contact@sglang.ai for inquiries or collaboration opportunities. Together, we can make powerful AI models accessible to everyone.

docTR joins PyTorch Ecosystem: From Pixels to Data, Building a Recognition Pipeline with PyTorch and docTR

Olivier Dulcy & Sebastian Olivera, Mindee — Wed, 18 Dec 2024 13:35:00 +0000

We’re thrilled to announce that the docTR project has been integrated into the PyTorch ecosystem! This integration ensures that docTR aligns with PyTorch’s standards and practices, giving developers a reliable, community-backed solution for powerful OCR workflows.

For more information on what it means to be a PyTorch ecosystem project, see the PyTorch Ecosystem Tools page.

About docTR

docTR is an Apache 2.0 project developed and distributed by Mindee to help developers integrate OCR capabilities into applications with no prior knowledge required.

To quickly and efficiently extract text information, docTR uses a two-stage approach:

First, it performs text detection to localize words.
Then, it conducts text recognition to identify all characters in a word.

Detection and recognition are performed by state-of-the-art models written in PyTorch. To learn more about this approach, you can refer to the docTR documentation.

docTR enhances the user experience in PyTorch projects by providing high-performance OCR capabilities right out of the box. Its specially designed models require minimal to no fine-tuning for common use cases, allowing developers to quickly integrate advanced document analysis features.

Local installation

docTR requires Python >= 3.10 and supports Windows, Mac and Linux. Please refer to our README for necessary dependencies for MacBook with the M1 chip.

pip3 install -U pip
pip3 install "python-doctr[torch,viz]"

This will install docTR along with the latest version of PyTorch.

Note: docTR also provides docker images for an easy deployment, such as a part of Kubernetes cluster.

Text recognition

Now, let’s try docTR’s OCR recognition on this sample:

The OCR recognition model expects an image with only one word on it and will output the predicted word with a confidence score. You can use the following snippet to test OCR capabilities from docTR:

python
from doctr.io import DocumentFile
from doctr.models import recognition_predictor

doc = DocumentFile.from_images("/path/to/image")

# Load the OCR model
# This will download pre-trained models hosted by Mindee
model = recognition_predictor(pretrained=True)

result = model(doc)
print(result)

Here, the most important line of code is model = recognition_predictor(pretrained=True). This will load a default text recognition model, crnn_vgg16_bn, but you can select other models through the arch parameter. You can check out the available architectures.

When run on the sample, the recognition predictor retrieves the following data: [('MAGAZINE', 0.9872216582298279)]

Note: using the DocumentFile object docTR provides an easy way to manipulate PDF or Images.

Text detection

The last example was a crop on a single word. Now, what about an image with several words on it, like this one?

A text detection model is used before the text recognition to output a segmentation map representing the location of the text. Following that, the text recognition is applied on every detected patch.

Below is a snippet to run only the detection part:

from doctr.io import DocumentFile
from doctr.models import detection_predictor
from matplotlib import pyplot as plt
from doctr.utils.geometry import detach_scores
from doctr.utils.visualization import draw_boxes

doc = DocumentFile.from_images("path/to/my/file")
model = detection_predictor(pretrained=True)

result = model(doc)

draw_boxes(detach_scores([result[0]["words"]])[0][0], doc[0])
plt.axis('off')
plt.show()

Running it on the full sample yields the following:

Similarly to the text recognition, detection_predictor will load a default model (fast_base here). You can also load another one by providing it through the arch parameter.

The full implementation

Now, let’s plug both components into the same pipeline.

Conveniently, docTR provides a wrapper that does exactly that for us:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

doc = DocumentFile.from_images("/path/to/image")

model = ocr_predictor(pretrained=True, assume_straight_pages=False)

result = model(doc)
result.show()

The last line should display a matplotlib window which shows the detected patches. Hovering the mouse over them will display their contents.

You can also do more with this output, such as reconstituting a synthetic document like so:

import matplotlib.pyplot as plt

synthetic_pages = result.synthesize()
plt.imshow(synthetic_pages[0])
plt.axis('off')
plt.show()

The pipeline is highly customizable, where you can modify the detection or recognition model behaviors by passing arguments to the ocr_predictor. Please refer to the documentation to learn more about it.

Conclusion

We’re excited to welcome docTR into the PyTorch Ecosystem, where it seamlessly integrates with PyTorch pipelines to deliver state-of-the-art OCR capabilities right out of the box.

By empowering developers to quickly extract text from images or PDFs using familiar tooling, docTR simplifies complex document analysis tasks and enhances the overall PyTorch experience.

We invite you to explore the docTR GitHub repository, join the docTR community on Slack, and reach out at contact@mindee.com for inquiries or collaboration opportunities.

Together, we can continue to push the boundaries of document understanding and develop even more powerful, accessible tools for everyone in the PyTorch community.