Run:ai releases advanced model serving functionality to help enterprises simplify AI deployment

Run:ai, a leader in compute orchestration for AI workloads, has announced new features for its Atlas platform, including two-step model deployment — making it easier and faster to bring machine learning models into production. The company also announced a new integration with NVIDIA Triton Inference Server. These capabilities are focused on helping organizations deploy and serve AI models for inference on NVIDIA accelerated computing, so they can deliver accurate, real-time responses. The features solidify Run:ai Atlas as a single, unified platform where AI teams, from data scientists to MLOps engineers, can build, train, and manage models in production from one simple interface.

AI models can be challenging to deploy in production; despite the time and effort that goes into building and training them, most never leave the lab. Creating a model, connecting it to data and containers, and allocating only the compute it requires are major barriers to bringing AI into production. Publishing a model usually requires manually editing and uploading tedious YAML configuration files. Run:ai's new two-step deployment makes the process simple, enabling organizations to quickly switch between models, optimize GPU usage economically, and ensure models run efficiently in production.

“With the new advanced inference capabilities, Run:ai's Atlas platform now offers a solution for the entire AI lifecycle – from building to training to inference – all delivered on one platform,” said Ronen Dar, chief technology officer and co-founder of Run:ai. “Instead of using many different MLOps and orchestration tools, data scientists can take advantage of one unified, robust platform to manage all of their AI infrastructure needs.”

Run:ai also announced full integration with NVIDIA Triton Inference Server, which allows organizations to deploy multiple models — or multiple instances of the same model — and run them in parallel within a single container. NVIDIA Triton Inference Server, included in the NVIDIA AI Enterprise software suite, is fully supported and optimized for AI development and deployment. Run:ai's orchestration runs on top of NVIDIA Triton and adds automatic scaling, allocation, and prioritization on a per-model basis, so each Triton instance is sized automatically. Using Run:ai Atlas with NVIDIA Triton increases compute utilization while simplifying AI infrastructure. The Run:ai Atlas platform is an NVIDIA AI Accelerated application, indicating that it was developed on the NVIDIA AI platform for performance and reliability.
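
For a concrete sense of what serving through Triton looks like from the client side, the sketch below sends a single inference request using NVIDIA's tritonclient Python package. The model name ("resnet50"), tensor names ("input__0", "output__0"), and server address are placeholders for whatever a team has actually loaded; nothing here depends on Run:ai itself.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton Inference Server endpoint (address is illustrative).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build one FP32 input tensor; the model and tensor names below are
# placeholders for whatever models the server is hosting.
inp = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))

# Because a single Triton container can host many models (or many instances
# of one model), switching models is just a different model_name per request.
result = client.infer(model_name="resnet50", inputs=[inp])
print(result.as_numpy("output__0").shape)
```

In this setup, Triton answers individual requests while an orchestration layer such as Run:ai decides how many Triton instances exist and which GPUs back them.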

Running inference workloads in production requires fewer resources than training, which consumes large amounts of GPU compute and memory. Organizations sometimes run inference workloads on CPUs rather than GPUs, but this can mean increased latency. In many AI use cases, the end user needs a real-time response: detecting a stop sign, recognizing a face on a phone, or transcribing voice dictation, for example. CPU-based inference can be too slow for these applications.

Using GPUs for inference workloads delivers lower latency and higher throughput, but this can be costly and wasteful when GPUs are not fully utilized. Run:ai's model-centric approach automatically adapts allocations to the demands of diverse workloads. With Run:ai, dedicating a full GPU to a single lightweight workload is no longer required, saving significant cost while maintaining low latency.
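
As an illustration of what "less than a full GPU" can look like in practice, the sketch below creates a Kubernetes pod that asks the Run:ai scheduler for half a GPU. Treat the scheduler name and the gpu-fraction annotation as assumptions to verify against Run:ai's documentation; the namespace, pod name, and container image are placeholders.

```python
from kubernetes import client, config

# Load the local kubeconfig (assumes kubectl access to the cluster).
config.load_kube_config()

# Pod that requests half a GPU via the Run:ai scheduler. The scheduler name
# and the "gpu-fraction" annotation are assumptions, not verified API; the
# image and namespace are placeholders.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(
        name="triton-lightweight",
        annotations={"gpu-fraction": "0.5"},
    ),
    spec=client.V1PodSpec(
        scheduler_name="runai-scheduler",
        containers=[
            client.V1Container(
                name="triton",
                image="nvcr.io/nvidia/tritonserver:23.05-py3",
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="inference", body=pod)
```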

Other new Run:ai Atlas features for inference workloads include:

  • Visibility and control – New inference-focused metrics and dashboards provide insight into the health and performance of AI models in production.
  • Deploying models on fractional GPUs – Right-sizing and deploying models on fractions of a GPU avoids wasted resources and ensures performance requirements are met.
  • Auto-scaling – Allows organizations to scale models up or down automatically based on predefined thresholds, using built-in or custom GPU metrics. This ensures that model SLAs (in terms of latency) are met.
  • Scale-to-zero – Automatically scales deployments down to zero resources when possible, which reduces cost and frees valuable resources for reallocation to other workloads (a simplified sketch of this threshold-and-scale-to-zero logic follows this list).
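
The threshold-driven behavior described in the last two items can be pictured with a short, purely illustrative Python sketch. It is not Run:ai code: get_gpu_utilization() and scale_to() stand in for whatever metric source and scaling hook the platform actually exposes, and the threshold values are arbitrary.

```python
import time

TARGET_UTILIZATION = 0.70   # scale up when average GPU utilization exceeds this
IDLE_UTILIZATION = 0.05     # scale down toward zero when the deployment sits idle
MAX_REPLICAS = 4

def autoscale(get_gpu_utilization, scale_to, replicas=1):
    """Illustrative control loop; the two callables are stand-ins, not Run:ai APIs."""
    while True:
        util = get_gpu_utilization()           # e.g. averaged over the last minute
        if util > TARGET_UTILIZATION and replicas < MAX_REPLICAS:
            replicas += 1                      # add a replica to protect latency SLAs
        elif util < IDLE_UTILIZATION and replicas > 0:
            replicas -= 1                      # shrink; reaching 0 frees the GPU entirely
        scale_to(replicas)
        time.sleep(60)
```

The same loop shape applies whether the metric is GPU utilization, request latency, or queue depth; the key point is that an idle model can drop to zero replicas and release its GPU for other workloads.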

“The flexibility and portability of NVIDIA Triton Inference Server, available with NVIDIA AI Enterprise support, enable fast, simple scaling and deployment of trained AI models from any framework on any GPU- or CPU-based infrastructure,” said Shankar Chandrasekaran, senior product manager at NVIDIA. “Triton Inference Server's high performance and ease of use, combined with orchestration from the Run:ai Atlas platform, make it an ideal foundation for deploying AI models.”

Sign up for the free insideBIGDATA newsletter.

Join us on Twitter: @InsideBigData1 – https://twitter.com/InsideBigData1