API Reference

Understanding API Endpoints

Every model you select receives a dedicated API endpoint—a URL that serves as the communication channel between your application and the model.

Endpoint Structure

Your endpoint will look like:

https://inference.ai.neevcloud.com/v1/chat/completions

This URL is where you send all inference requests for the selected model.
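As an illustration, a request to this endpoint might look like the following Python sketch. It assumes an OpenAI-style chat completions request body, a placeholder model identifier, and a placeholder API key; adjust these to match the model you selected.

import requests

# Endpoint from above; the API key and model name below are placeholders.
ENDPOINT = "https://inference.ai.neevcloud.com/v1/chat/completions"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",  # placeholder key
    "Content-Type": "application/json",
}
PAYLOAD = {
    "model": "your-selected-model",  # placeholder model identifier
    "messages": [{"role": "user", "content": "Hello!"}],
}

response = requests.post(ENDPOINT, headers=HEADERS, json=PAYLOAD, timeout=60)
response.raise_for_status()
print(response.json())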

How Endpoints Work

Single Point of Access

The endpoint abstracts away all complexity. Your application doesn't need to know which GPU is running the model, how many instances are active, or where they're located geographically. You simply send HTTP requests to the endpoint URL.

Authentication and Security

Every request must include your API key in the Authorization header (a short sketch follows the list below). This ensures:

  • Only authorized applications can access the model

  • Usage is tracked to your account for billing

  • Rate limits and quotas are enforced correctly

  • Requests are logged for debugging and monitoring
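The sketch below shows one way to attach the key. The environment variable name and the bearer-token scheme are illustrative assumptions, not a statement of NeevCloud's exact header format.

import os

# Illustrative only: the variable name NEEVCLOUD_API_KEY is an assumption.
api_key = os.environ["NEEVCLOUD_API_KEY"]

headers = {
    "Authorization": f"Bearer {api_key}",  # assumed bearer-token scheme
    "Content-Type": "application/json",
}

Keeping the key in an environment variable (or a secrets manager) rather than in source code avoids leaking it through version control.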

Automatic Scaling

Behind the endpoint, NeevCloud manages a pool of GPU instances running your model (a client-side retry sketch follows the list below). When you send requests:

  • The endpoint distributes them across available instances

  • If demand increases, more instances spin up automatically

  • If demand decreases, instances scale down to reduce costs

  • Load balancing ensures consistent response times
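Scale-up happens entirely on NeevCloud's side, but a request can occasionally see a transient error (for example a 429 or 503) while extra instances come online. One hedged client-side response, sketched below, is to retry a few times with exponential backoff; the status codes and delays are assumptions, not documented behavior.

import time
import requests

def post_with_retry(url, headers, payload, max_retries=3):
    # Retry assumed-transient status codes with exponential backoff: 1s, 2s, 4s.
    for attempt in range(max_retries + 1):
        response = requests.post(url, headers=headers, json=payload, timeout=60)
        if response.status_code not in (429, 503) or attempt == max_retries:
            return response
        time.sleep(2 ** attempt)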

Concurrent Request Handling

You can send multiple requests simultaneously to the same endpoint (a concurrency sketch follows the list below). The infrastructure handles:

  • Request queuing when all instances are busy

  • Parallel processing across multiple GPUs

  • Fair scheduling across different users

  • Timeout management for long-running requests
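For example, a thread pool is a simple way to send several requests in parallel from Python. This sketch reuses the placeholder endpoint, API key, and model name from earlier.

from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "https://inference.ai.neevcloud.com/v1/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}

def ask(prompt):
    payload = {
        "model": "your-selected-model",  # placeholder
        "messages": [{"role": "user", "content": prompt}],
    }
    return requests.post(ENDPOINT, headers=HEADERS, json=payload, timeout=120).json()

prompts = ["Summarize topic A", "Summarize topic B", "Summarize topic C"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(ask, prompts))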

No Manual Deployment

Traditional ML deployment requires you to:

  • Build a container image with your model

  • Deploy to a Kubernetes cluster or cloud service

  • Configure autoscaling policies

  • Set up monitoring and logging

  • Manage model versions and updates

With NeevCloud endpoints, all of this is handled for you. You never touch infrastructure.

Endpoint Lifecycle

When you select a model:

  • The endpoint is instantly available for you to use

  • The first request may take a few extra seconds as the model loads into GPU memory

  • Subsequent requests are fast because the model stays loaded

  • If inactive for an extended period, the model may unload to free resources

  • The next request triggers automatic reload

This lifecycle ensures you're not paying for idle resources while maintaining fast response times.
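In practice, the main part of this lifecycle you may notice is the cold start on the first request, or on the first request after a period of inactivity. One hedged way to account for it on the client side is to allow a longer timeout for that request; the timeout values below are illustrative, not prescribed.

import requests

def chat(payload, headers, may_be_cold=False):
    # Allow extra time when the model may still be loading into GPU memory.
    timeout = 120 if may_be_cold else 30  # illustrative values
    response = requests.post(
        "https://inference.ai.neevcloud.com/v1/chat/completions",
        headers=headers,
        json=payload,
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()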
