LLaMA: 10x Faster Inference on TPU
How to achieve 6‑10× faster inference for LLaMA‑65B using PyTorch/XLA on TPU, covering implementation details, performance results, and practical tips.
The Path to Achieve Ultra-Low Inference Latency With LLaMA 65B on PyTorch/XLA: we present our latest work on accelerating LLaMA, the well-known large language model, using PyTorch/XLA on TPU systems, with a 6‑10× gain in inference performance.