Ray Serve is a library for building scalable online inference APIs. It is framework-agnostic: you can serve models built with PyTorch, TensorFlow, Keras, or Scikit-Learn alongside arbitrary Python business logic. Notable features include response streaming, dynamic request batching, and multi-node/multi-GPU serving, making it well suited for Large Language Models.
Ray Serve is particularly well suited for model composition: combining multiple ML models and business logic into a single Python application. Because it is built on Ray, it scales easily across machines and offers flexible scheduling, including fractional GPU support, so many models can share hardware cost-effectively.