Triton Serverless Inference Optimization

Learn to deploy models serverlessly with Triton Inference Server, gaining auto-scaling and optimized inference speed with no infrastructure to manage.

Overview

Serverless deployment promises auto-scaling, granular billing, and multi-framework support, but these benefits often come at the cost of inference speed. In this short talk, you’ll learn how to build a true serverless experience on top of Triton Inference Server, one that keeps Triton’s optimized inference speed while adding auto-scaling from 0 to N servers without infrastructure management. We’ll also cover how to use Triton’s packaging capabilities to optimize any model for inference based on its utilization, batching, and concurrency needs.
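As a rough illustration of the batching and concurrency knobs mentioned above, the sketch below renders a Triton model configuration (`config.pbtxt`) from a few parameters. The abstract does not say which settings the talk uses, so the helper `render_triton_config`, the model name, the backend, and all default values are assumptions made for illustration. The fields themselves (`max_batch_size`, `dynamic_batching`, `instance_group`) are standard Triton model configuration options.

```python
def render_triton_config(
    model_name: str,
    backend: str = "onnxruntime",
    max_batch_size: int = 8,
    max_queue_delay_us: int = 100,
    instance_count: int = 1,
) -> str:
    """Return the text of a Triton config.pbtxt (hypothetical helper).

    max_batch_size and max_queue_delay_us control dynamic batching;
    instance_count controls how many copies of the model run concurrently.
    All defaults are illustrative assumptions, not values from the talk.
    """
    if max_batch_size < 1 or instance_count < 1 or max_queue_delay_us < 0:
        raise ValueError("batch size and instance count must be >= 1, delay >= 0")

    return (
        f'name: "{model_name}"\n'
        f'backend: "{backend}"\n'
        f"max_batch_size: {max_batch_size}\n"
        # Let Triton group individual requests into batches, waiting at most
        # max_queue_delay_us microseconds for a batch to fill.
        "dynamic_batching {\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
        # Run several model instances on the GPU to serve requests in parallel.
        "instance_group [\n"
        "  {\n"
        f"    count: {instance_count}\n"
        "    kind: KIND_GPU\n"
        "  }\n"
        "]\n"
    )


if __name__ == "__main__":
    # "example_model" is a placeholder name, not a model from the talk.
    print(render_triton_config("example_model", max_batch_size=16, instance_count=2))
```

A larger `max_batch_size` with a small queue delay tends to favor throughput on a busy model, while a higher instance count helps when many requests arrive concurrently; the right values depend on the model and its traffic.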

Links

https://www.inferless.com/
Inferless deploys auto-scaling ML models instantly on serverless GPU infrastructure.

Tech stack