API Reference
Understanding API Endpoints
Every model you select receives a dedicated API endpoint—a URL that serves as the communication channel between your application and the model.
Endpoint Structure
Your endpoint will look like:
```
https://inference.ai.neevcloud.com/v1/chat/completions
```
This URL is where you send all inference requests for the selected model.
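As a minimal sketch, a request to this endpoint from Python might look like the following. The Bearer authorization scheme, the `model` field in the request body, and the shape of the JSON response are assumptions based on the common chat-completions convention; use the model identifier and key from your own dashboard.

```python
import os
import requests

# Endpoint for the selected model (shown above).
ENDPOINT = "https://inference.ai.neevcloud.com/v1/chat/completions"

# Assumption: the key is read from an environment variable you set yourself.
api_key = os.environ["NEEVCLOUD_API_KEY"]

payload = {
    "model": "your-model-id",  # placeholder: substitute the ID of the model you selected
    "messages": [
        {"role": "user", "content": "Hello, world!"}
    ],
}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {api_key}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# Assumption: the response follows the usual chat-completions shape.
print(response.json()["choices"][0]["message"]["content"])
```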
How Endpoints Work
Single Point of Access
The endpoint abstracts away all complexity. Your application doesn't need to know which GPU is running the model, how many instances are active, or where they're located geographically. You simply send HTTP requests to the endpoint URL.
Authentication and Security
Every request must include your API key in the Authorization header (see the sketch after this list). This ensures:
Only authorized applications can access the model
Usage is tracked to your account for billing
Rate limits and quotas are enforced correctly
Requests are logged for debugging and monitoring
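A minimal sketch of how a client might attach the key and react to a rejected request is shown below. The Bearer scheme, the `NEEVCLOUD_API_KEY` environment variable name, and the 401 status for failed authentication are conventional assumptions, not values confirmed by this page.

```python
import os
import requests

ENDPOINT = "https://inference.ai.neevcloud.com/v1/chat/completions"

def auth_headers() -> dict:
    """Build the Authorization header from an environment variable.

    Keeping the key out of source code avoids leaking it through version
    control; NEEVCLOUD_API_KEY is a name chosen for this sketch only.
    """
    return {"Authorization": f"Bearer {os.environ['NEEVCLOUD_API_KEY']}"}

resp = requests.post(
    ENDPOINT,
    headers=auth_headers(),
    json={"model": "your-model-id",  # placeholder model ID
          "messages": [{"role": "user", "content": "ping"}]},
    timeout=30,
)

if resp.status_code == 401:
    # Assumption: a missing, malformed, or revoked key is rejected before
    # the request ever reaches the model.
    raise RuntimeError("Authentication failed - check your API key")
resp.raise_for_status()
```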
Automatic Scaling
Behind the endpoint, NeevCloud manages a pool of GPU instances running your model. When you send requests:
The endpoint distributes them across available instances
If demand increases, more instances spin up automatically
If demand decreases, instances scale down to reduce costs
Load balancing ensures consistent response times
Concurrent Request Handling
You can send multiple requests simultaneously to the same endpoint (see the sketch after this list). The infrastructure handles:
Request queuing when all instances are busy
Parallel processing across multiple GPUs
Fair scheduling across different users
Timeout management for long-running requests
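Because each call is an independent HTTP request, an ordinary thread pool is enough to keep several requests in flight from the client side; the endpoint takes care of queuing and load balancing on its side. The model identifier and response shape below are placeholders.

```python
import os
import requests
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "https://inference.ai.neevcloud.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['NEEVCLOUD_API_KEY']}"}

def ask(prompt: str) -> str:
    """Send one chat completion request and return the reply text."""
    resp = requests.post(
        ENDPOINT,
        headers=HEADERS,
        json={"model": "your-model-id",  # placeholder model ID
              "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = [
    "Summarize TCP in one line.",
    "Summarize UDP in one line.",
    "Summarize HTTP in one line.",
]

# Three requests run concurrently against the same endpoint URL.
with ThreadPoolExecutor(max_workers=3) as pool:
    for prompt, answer in zip(prompts, pool.map(ask, prompts)):
        print(prompt, "->", answer)
```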
No Manual Deployment
Traditional ML deployment requires you to:
Build a container image with your model
Deploy to a Kubernetes cluster or cloud service
Configure autoscaling policies
Set up monitoring and logging
Manage model versions and updates
With NeevCloud endpoints, all of this is handled for you. You never touch infrastructure.
Endpoint Lifecycle
When you select a model:
The endpoint is instantly available for you to use
The first request may take a few extra seconds as the model loads into GPU memory
Subsequent requests are fast because the model stays loaded
If inactive for an extended period, the model may unload to free resources
The next request triggers automatic reload
This lifecycle ensures you're not paying for idle resources while maintaining fast response times.
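A client can account for this lifecycle by allowing a generous timeout and retrying the first request while the model warms up. The sketch below assumes cold starts surface as slow responses or transient 5xx errors; the exact status codes are not specified on this page.

```python
import os
import time
import requests

ENDPOINT = "https://inference.ai.neevcloud.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['NEEVCLOUD_API_KEY']}"}

def complete_with_retry(payload: dict, attempts: int = 4) -> dict:
    """Call the endpoint, retrying with backoff while the model warms up.

    Assumption: a cold endpoint may respond slowly or with a transient
    5xx while the model loads into GPU memory.
    """
    for attempt in range(attempts):
        try:
            resp = requests.post(ENDPOINT, headers=HEADERS, json=payload,
                                 timeout=120)
            if resp.status_code < 500:
                resp.raise_for_status()  # surface 4xx errors immediately
                return resp.json()
        except requests.exceptions.Timeout:
            pass  # the first request can be slow while the model loads
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s
    raise RuntimeError("Endpoint did not become ready in time")

result = complete_with_retry({
    "model": "your-model-id",  # placeholder model ID
    "messages": [{"role": "user", "content": "Hello after a cold start"}],
})
print(result["choices"][0]["message"]["content"])
```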