Triton Serverless Inference Optimization
Learn to deploy models serverlessly on Triton Inference Server, gaining auto-scaling while keeping Triton’s optimized inference speed, with no infrastructure to manage.
Serverless deployment promises auto-scaling, granular billing, and multi-framework support, but often at the cost of inference speed. In this short talk, you’ll learn how to build a true serverless experience on top of Triton Inference Server. The approach keeps Triton’s optimized inference speed while adding auto-scaling from 0 to N servers, with no infrastructure to manage. We’ll cover how to use Triton’s model packaging to tune any model for inference according to its utilization, batching, and concurrency needs.
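As a rough illustration of what tuning for batching and concurrency can look like, the sketch below renders a Triton `config.pbtxt` from Python. It is not taken from the talk: the helper `render_triton_config`, the model name, and every numeric value are assumptions chosen for illustration. The field names (`max_batch_size`, `dynamic_batching`, `instance_group`) are standard Triton model configuration settings.

```python
from __future__ import annotations


def render_triton_config(
    model_name: str,
    backend: str = "onnxruntime",
    max_batch_size: int = 8,
    preferred_batch_sizes: tuple[int, ...] = (4, 8),
    max_queue_delay_us: int = 100,
    instance_count: int = 2,
) -> str:
    """Return config.pbtxt text for one model (hypothetical helper, illustrative values)."""
    preferred = ", ".join(str(size) for size in preferred_batch_sizes)
    lines = [
        f'name: "{model_name}"',
        f'backend: "{backend}"',
        # Largest batch Triton may form for this model.
        f"max_batch_size: {max_batch_size}",
        # Dynamic batching: group individual requests into larger batches,
        # waiting at most max_queue_delay_microseconds to fill one.
        "dynamic_batching {",
        f"  preferred_batch_size: [ {preferred} ]",
        f"  max_queue_delay_microseconds: {max_queue_delay_us}",
        "}",
        # Concurrency: number of model instances executing in parallel per GPU.
        "instance_group [",
        "  {",
        f"    count: {instance_count}",
        "    kind: KIND_GPU",
        "  }",
        "]",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "example_model" is a placeholder name, not one from the talk.
    print(render_triton_config("example_model"))
```

In this sketch, a latency-sensitive model might use a small queue delay and more instances, while a throughput-oriented model might prefer larger batches; the right values depend on measured utilization.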
Inferless deploys auto-scaling ML models instantly on serverless GPU infrastructure.