API Reference

Understanding API Endpoints

Every model you select receives a dedicated API endpoint—a URL that serves as the communication channel between your application and the model.

Endpoint Structure

Your endpoint will look like:

https://inference.ai.neevcloud.com/v1/chat/completions

This URL is where you send all inference requests for the selected model.
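As an illustration, a request to this endpoint might look like the following Python sketch. It assumes an OpenAI-style chat completions request body, a placeholder model identifier, and a placeholder API key; adjust these to match the model you selected.

import requests

# Endpoint from above; the API key and model name below are placeholders.
ENDPOINT = "https://inference.ai.neevcloud.com/v1/chat/completions"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",  # placeholder key
    "Content-Type": "application/json",
}
PAYLOAD = {
    "model": "your-selected-model",  # placeholder model identifier
    "messages": [{"role": "user", "content": "Hello!"}],
}

response = requests.post(ENDPOINT, headers=HEADERS, json=PAYLOAD, timeout=60)
response.raise_for_status()
print(response.json())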

How Endpoints Work

Single Point of Access

The endpoint abstracts away all complexity. Your application doesn't need to know which GPU is running the model, how many instances are active, or where they're located geographically. You simply send HTTP requests to the endpoint URL.

Authentication and Security

Every request must include your API key in the Authorization header (a short sketch follows the list below). This ensures:

  • Only authorized applications can access the model

  • Usage is tracked to your account for billing

  • Rate limits and quotas are enforced correctly

  • Requests are logged for debugging and monitoring
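The sketch below shows one way to attach the key. The environment variable name and the bearer-token scheme are illustrative assumptions, not a statement of NeevCloud's exact header format.

import os

# Illustrative only: the variable name NEEVCLOUD_API_KEY is an assumption.
api_key = os.environ["NEEVCLOUD_API_KEY"]

headers = {
    "Authorization": f"Bearer {api_key}",  # assumed bearer-token scheme
    "Content-Type": "application/json",
}

Keeping the key in an environment variable (or a secrets manager) rather than in source code avoids leaking it through version control.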

Automatic Scaling

Behind the endpoint, NeevCloud manages a pool of GPU instances running your model (a client-side retry sketch follows the list below). When you send requests:

  • The endpoint distributes them across available instances

  • If demand increases, more instances spin up automatically

  • If demand decreases, instances scale down to reduce costs

  • Load balancing ensures consistent response times
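Scale-up happens entirely on NeevCloud's side, but a request can occasionally see a transient error (for example a 429 or 503) while extra instances come online. One hedged client-side response, sketched below, is to retry a few times with exponential backoff; the status codes and delays are assumptions, not documented behavior.

import time
import requests

def post_with_retry(url, headers, payload, max_retries=3):
    # Retry assumed-transient status codes with exponential backoff: 1s, 2s, 4s.
    for attempt in range(max_retries + 1):
        response = requests.post(url, headers=headers, json=payload, timeout=60)
        if response.status_code not in (429, 503) or attempt == max_retries:
            return response
        time.sleep(2 ** attempt)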

Concurrent Request Handling

You can send multiple requests simultaneously to the same endpoint (a concurrency sketch follows the list below). The infrastructure handles:

  • Request queuing when all instances are busy

  • Parallel processing across multiple GPUs

  • Fair scheduling across different users

  • Timeout management for long-running requests
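For example, a thread pool is a simple way to send several requests in parallel from Python. This sketch reuses the placeholder endpoint, API key, and model name from earlier.

from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "https://inference.ai.neevcloud.com/v1/chat/completions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY", "Content-Type": "application/json"}

def ask(prompt):
    payload = {
        "model": "your-selected-model",  # placeholder
        "messages": [{"role": "user", "content": prompt}],
    }
    return requests.post(ENDPOINT, headers=HEADERS, json=payload, timeout=120).json()

prompts = ["Summarize topic A", "Summarize topic B", "Summarize topic C"]
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(ask, prompts))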

No Manual Deployment

Traditional ML deployment requires you to:

  • Build a container image with your model

  • Deploy to a Kubernetes cluster or cloud service

  • Configure autoscaling policies

  • Set up monitoring and logging

  • Manage model versions and updates

With NeevCloud endpoints, all of this is handled for you. You never touch infrastructure.

Endpoint Lifecycle

When you select a model:

  • The endpoint is instantly available for you to use

  • The first request may take a few extra seconds as the model loads into GPU memory

  • Subsequent requests are fast because the model stays loaded

  • If inactive for an extended period, the model may unload to free resources

  • The next request triggers automatic reload

This lifecycle ensures you're not paying for idle resources while maintaining fast response times.
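In practice, the main part of this lifecycle you may notice is the cold start on the first request, or on the first request after a period of inactivity. One hedged way to account for it on the client side is to allow a longer timeout for that request; the timeout values below are illustrative, not prescribed.

import requests

def chat(payload, headers, may_be_cold=False):
    # Allow extra time when the model may still be loading into GPU memory.
    timeout = 120 if may_be_cold else 30  # illustrative values
    response = requests.post(
        "https://inference.ai.neevcloud.com/v1/chat/completions",
        headers=headers,
        json=payload,
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()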
