Running an LLM using the vLLM AI Template

This tutorial shows you how to deploy a GPU instance using the vLLM AI Template on NeevCloud and run LLM inference end to end.

By the end of this tutorial, you will:

  • Deploy a GPU using the vLLM AI Template

  • Connect to the instance using SSH

  • Start a vLLM inference server

  • Send test requests using curl


Prerequisites

Before you start, make sure you have:

  • A NeevCloud account

  • Basic knowledge of SSH

No prior vLLM experience is required.


Step 1: Deploy a GPU using the vLLM AI Template

  1. Log in to NeevCloud

  2. Go to AI Templates

  3. Use the search bar and type vLLM

  4. Select vLLM Inference and click Deploy with this Template

  5. Review the recommended configuration and make changes if needed

  6. Add the SSH key required to access the instance

  7. Click Deploy GPU to start the instance.


Step 2: Connect to the GPU instance

  1. Once deployment is complete, the GPU instance appears in the Running state.

  2. Connect using SSH:
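    For example, assuming the default user on the template image is ubuntu (replace the key path and IP with your own values):

      ssh -i ~/.ssh/your_key ubuntu@<INSTANCE_PUBLIC_IP>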

  3. Verify GPU is available:
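    A quick check with the NVIDIA driver utility:

      nvidia-smi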

    You should see your GPU listed.


Step 3: Verify vLLM installation

The vLLM AI Template already includes the vLLM inference engine, so no manual installation is required.

  1. Check installation:
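    One way to confirm the package is present (this assumes vLLM is installed as a Python package in the default environment):

      python3 -c "import vllm; print(vllm.__version__)"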

  2. Check Python version:
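    For example:

      python3 --version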

  3. If the template provides a virtual environment, activate it:
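    The path below is only an example; check the template's documentation or home directory for the actual location:

      source /opt/vllm/venv/bin/activate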


Step 4: Start the vLLM inference server

  1. For beginners, use the simplest working command:
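    A minimal sketch using the standard vllm serve entry point; the model name below is a placeholder, so substitute the model configured in your template:

      # placeholder model; replace with the model your template provides
      vllm serve Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 8090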

    What this does:

    • Loads the LLM model

    • Starts an OpenAI-compatible API server

    • Listens on port 8090

  2. Wait until you see logs showing the model is loaded.


Step 5: Test the server locally

  1. Open a new terminal on the same machine or another machine with network access.

  2. Run this command:
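    For example, listing the models the server exposes (use the instance's IP instead of localhost when testing from another machine):

      curl http://localhost:8090/v1/models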

    If you get a JSON response, the server is running correctly.


Step 6: Send a chat completion request

Since the deployed model is an instruct (chat-tuned) model, use the chat completions endpoint.

  1. Send the request:
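    A sample request against the OpenAI-compatible endpoint; again, the model name is a placeholder for whatever model your server is running:

      curl http://localhost:8090/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "Qwen/Qwen2.5-7B-Instruct",
          "messages": [{"role": "user", "content": "Explain vLLM in one sentence."}],
          "max_tokens": 128
        }'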

  2. You should receive a response with generated text.


Step 7: Increase throughput when ready

After basic testing, you can tune the server for higher throughput.

  1. Run the optimized command:
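    One possible tuned invocation; the flags are standard vLLM options, but the values below are only a starting point and depend on your GPU and workload:

      vllm serve Qwen/Qwen2.5-7B-Instruct \
        --host 0.0.0.0 \
        --port 8090 \
        --gpu-memory-utilization 0.90 \
        --max-model-len 8192 \
        --max-num-seqs 128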

    Use this only when:

    • You understand your GPU memory

    • You expect multiple concurrent users


Step 8: Monitor GPU usage

  1. Run this command to monitor GPU load:
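    For example, refreshing the standard NVIDIA status output every second:

      watch -n 1 nvidia-smi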

    This helps you:

    • Check GPU utilization

    • Detect memory issues

    • Validate performance gains


Step 9: Expose the API externally

  1. Make sure port 8090 is allowed in the deployment settings.

  2. You can now access the API using:
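    For example (replace the placeholder with your instance's public IP):

      curl http://<INSTANCE_PUBLIC_IP>:8090/v1/models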

  3. This allows integration with:

    • Backend services

    • Web applications

    • Internal tools


Step 10: Stop or delete the instance

When you are done:

  1. Stop the GPU instance to save cost, or

  2. Delete it if no longer needed.


What you learned

You successfully:

  • Used AI Templates during GPU deployment

  • Ran vLLM without manual setup

  • Served an LLM through an OpenAI-compatible API

  • Tested and tuned inference performance
