Running an LLM using the vLLM AI Template

This tutorial shows you how to deploy a GPU instance using the vLLM AI Template on NeevCloud and run LLM inference end to end.

By the end of this tutorial, you will:

  • Deploy a GPU using the vLLM AI Template

  • Connect to the instance using SSH

  • Start a vLLM inference server

  • Send test requests using curl


Prerequisites

Before you start, make sure you have:

  • A NeevCloud account

  • Basic knowledge of SSH

No prior vLLM experience is required.


Step 1: Deploy a GPU using the vLLM AI Template

  1. Log in to NeevCloud

  2. Go to AI Templates

  3. Use the search bar and type vLLM

  4. Select vLLM Inference and click Deploy with this Template

  5. Review the recommended configuration and make changes if needed

  6. Add the SSH key required to access the instance

  7. Click Deploy GPU to start the instance.


Step 2: Connect to the GPU instance

  1. Once deployment is complete, the GPU instance appears in the Running state.

  2. Connect using SSH:
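    For example, assuming the default user on the template image is ubuntu (replace the key path and IP with your own values):

      ssh -i ~/.ssh/your_key ubuntu@<INSTANCE_PUBLIC_IP>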

  3. Verify GPU is available:
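    A quick check with the NVIDIA driver utility:

      nvidia-smi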

    You should see your GPU listed.


Step 3: Verify vLLM installation

The vLLM AI Template already includes the vLLM inference engine, so no manual installation is required.

  1. Check installation:
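    One way to confirm the package is present (this assumes vLLM is installed as a Python package in the default environment):

      python3 -c "import vllm; print(vllm.__version__)"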

  2. Check Python version:
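    For example:

      python3 --version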

  3. If the template provides a virtual environment, activate it:
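    The path below is only an example; check the template's documentation or home directory for the actual location:

      source /opt/vllm/venv/bin/activate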


Step 4: Start the vLLM inference server

  1. For beginners, use the simplest working command:
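    A minimal sketch using the standard vllm serve entry point; the model name below is a placeholder, so substitute the model configured in your template:

      # placeholder model; replace with the model your template provides
      vllm serve Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8090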

    What this does:

    • Loads the LLM model

    • Starts an OpenAI-compatible API server

    • Listens on port 8090

  2. Wait until you see logs showing the model is loaded.


Step 5: Test the server locally

  1. Open a new terminal on the same machine or another machine with network access.

  2. Run this command:
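    For example, listing the models the server exposes (use the instance's IP instead of localhost when testing from another machine):

      curl http://localhost:8090/v1/models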

    If you get a JSON response, the server is running correctly.


Step 6: Send a chat completion request

Since the deployed model is an instruct (chat-tuned) model, use the chat completions endpoint.

  1. Send the request:
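    A sample request against the OpenAI-compatible endpoint; again, the model name is a placeholder for whatever model your server is running:

      curl http://localhost:8090/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "Qwen/Qwen2.5-7B-Instruct",
          "messages": [{"role": "user", "content": "Explain vLLM in one sentence."}],
          "max_tokens": 128
        }'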

  2. You should receive a response with generated text.


Step 7: Increase throughput when ready

After basic testing, you can tune the server for higher throughput.

  1. Run the optimized command:
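    One possible tuned invocation; the flags are standard vLLM options, but the values below are only a starting point and depend on your GPU and workload:

      vllm serve Qwen/Qwen2.5-7B-Instruct \
        --host 0.0.0.0 \
        --port 8090 \
        --gpu-memory-utilization 0.90 \
        --max-model-len 8192 \
        --max-num-seqs 128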

    Use this only when:

    • You understand your GPU memory

    • You expect multiple concurrent users


Step 8: Monitor GPU usage

  1. Run this command to monitor GPU load:
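    For example, refreshing the standard NVIDIA status output every second:

      watch -n 1 nvidia-smi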

    This helps you:

    • Check GPU utilization

    • Detect memory issues

    • Validate performance gains


Step 9: Expose the API externally

  1. Make sure port 8090 is allowed in the deployment settings.

  2. You can now access the API using:
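    For example (replace the placeholder with your instance's public IP):

      curl http://<INSTANCE_PUBLIC_IP>:8090/v1/models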

  3. This allows integration with:

    • Backend services

    • Web applications

    • Internal tools


Step 10: Stop or delete the instance

When you are done:

  1. Stop the GPU instance to save cost, or

  2. Delete it if no longer needed.


What you learned

You successfully:

  • Used AI Templates during GPU deployment

  • Ran vLLM without manual setup

  • Served an LLM through an OpenAI-compatible API

  • Tested and tuned inference performance
