Storage Types
When you provision a GPU instance, you need to decide what type of storage best fits your workflow. Here's how to choose and configure your storage.
Ephemeral Storage
Ephemeral storage is temporary storage that exists only while your GPU instance is running. When you terminate the instance, this storage and all its contents are permanently deleted.
When to Use Ephemeral Storage
You're running quick experiments or proof-of-concept tests.
You're processing data that can be easily regenerated.
You're prototyping a new model architecture.
Your workflow downloads data at runtime from external sources.
You want to avoid storage costs for temporary work.
Key Characteristics
Cost: Free (included with your GPU instance).
Lifecycle: Automatically deleted on instance termination.
Performance: Fast local access.
Use case: Temporary workloads, testing, quick iterations.
Example Scenario: You're testing different hyperparameters for a small model using publicly available datasets. Since you can re-download the data anytime and don't need to keep the experimental checkpoints, ephemeral storage is ideal.
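This scenario is simple enough to sketch in a few lines. The snippet below (Python; the dataset URL and the /tmp/experiments scratch path are placeholders, since the actual ephemeral mount point depends on your instance) stages a re-downloadable dataset on ephemeral storage at runtime:

```python
import urllib.request
from pathlib import Path

# Hypothetical scratch location on the instance's ephemeral disk;
# adjust to wherever your provider exposes ephemeral storage.
SCRATCH_DIR = Path("/tmp/experiments")
SCRATCH_DIR.mkdir(parents=True, exist_ok=True)

# Placeholder public dataset URL -- re-downloadable at any time,
# so losing it on termination only costs a repeat download.
DATASET_URL = "https://example.com/datasets/sample.tar.gz"
local_copy = SCRATCH_DIR / "sample.tar.gz"

if not local_copy.exists():
    urllib.request.urlretrieve(DATASET_URL, local_copy)

print(f"Dataset staged on ephemeral storage at {local_copy}")
```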
Persistent Local Disk Storage
Persistent local disk storage resides on the same physical node as your GPU instance. This gives you the fastest possible read/write speeds because data doesn't travel over the network.
When to Use Persistent Local Disk
You're training models that require extremely fast data loading (a high rate of I/O operations).
You're working with datasets that have random access patterns.
You need minimal latency between storage and compute (1-5 ms).
You're running single-node training jobs.
You need to preserve data briefly but with maximum performance.
Key Characteristics
Latency: 1-5 ms (ultra-low).
Location: Same physical node as your GPU.
Lifecycle: Deleted when GPU instance is removed.
Performance: Highest I/O throughput.
Access: Single GPU instance only.
Important Note: Even though this storage is "persistent," it gets deleted when you terminate the GPU instance. Think of it as persistent during your instance lifecycle, but not after termination.
Example Scenario: You're training a large language model where loading batches of data needs to happen as fast as possible. The dataset fits on a single node, and you'll export the final model to external storage before terminating the instance.
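One way to implement the "export before terminating" step is a plain copy from the fast local volume to a longer-lived one. A minimal sketch, assuming hypothetical mount points /local-disk and /data/exports; substitute the paths your instances actually use:

```python
import shutil
from pathlib import Path

# Assumed mount points -- replace with the paths your instance actually uses.
LOCAL_DISK = Path("/local-disk")    # fast node-local volume (deleted with the instance)
EXPORT_DIR = Path("/data/exports")  # network volume that outlives the instance

checkpoint = LOCAL_DISK / "checkpoints" / "model_final.pt"

# ... training loop writes checkpoints to LOCAL_DISK for maximum I/O speed ...

# Before terminating the instance, copy anything worth keeping off the local disk.
EXPORT_DIR.mkdir(parents=True, exist_ok=True)
shutil.copy2(checkpoint, EXPORT_DIR / checkpoint.name)
print(f"Exported {checkpoint.name} to {EXPORT_DIR}")
```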
Persistent Network Storage
Network storage is shared across multiple GPU nodes and remains available even after you delete the GPU instance that created it. Data travels over the network, which adds some latency but provides crucial benefits for collaborative and distributed workflows.
When to Use Persistent Network Storage
You're running distributed training across multiple GPU instances.
Your team needs to access the same datasets or model checkpoints.
You want your data to survive GPU instance termination.
You're implementing a workflow where different instances process the same data.
You need to preserve training checkpoints for later resumption.
Key Characteristics
Latency: 5-20 ms (slightly higher than local disk).
Location: Network-attached, accessible from multiple nodes.
Lifecycle: Persists after GPU deletion (until you explicitly delete it).
Performance: Good for most AI/ML workloads.
Access: Multiple GPU instances can mount the same volume.
Example Scenario: You're training a distributed model across 4 GPU instances. All instances need access to the same training dataset, and you want your checkpoints to persist so you can resume training later or analyze results after the GPUs are terminated.
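A common pattern on a shared volume is to let every instance read the dataset from the same path while restricting checkpoint writes to a single rank, so concurrent workers don't overwrite each other's files. A minimal sketch, assuming the volume is mounted at /data on every instance and that your launcher sets a RANK environment variable (both assumptions, not guarantees of any particular setup):

```python
import os
import json
from pathlib import Path

# Assumed shared mount point; all four instances see the same files under /data.
SHARED = Path("/data")
DATASET_DIR = SHARED / "datasets" / "train"   # every instance reads the same dataset path
CHECKPOINT_DIR = SHARED / "checkpoints"

# RANK is a common convention set by distributed launchers; yours may differ.
rank = int(os.environ.get("RANK", "0"))

# All ranks can read the shared dataset directly from the network volume.
shards = sorted(DATASET_DIR.glob("*"))
print(f"rank {rank}: {len(shards)} training shards visible")

def save_checkpoint(step: int, state: dict) -> None:
    """Only rank 0 writes, so concurrent instances don't clobber each other's checkpoints."""
    if rank != 0:
        return
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    (CHECKPOINT_DIR / f"step_{step:06d}.json").write_text(json.dumps(state))

save_checkpoint(1000, {"loss": 0.42, "learning_rate": 3e-4})
```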
Existing Storage
If you've already created storage volumes in previous sessions, you can attach them to new GPU instances instead of creating new storage.
When to Use Existing Storage
You're resuming training from a previous checkpoint.
You're reusing preprocessed datasets across multiple experiments.
You've stored pre-trained models that you want to fine-tune.
You're continuing work from a previous session.
Key Characteristics
Mount point: Automatically mounted at /data (default path).
Setup: Zero additional configuration needed.
Use case: Leveraging previously created data, models, or checkpoints.
Example Scenario: Yesterday you preprocessed a large image dataset and saved it to network storage. Today you're starting a new GPU instance to train on that data; simply attach the existing storage and your preprocessed data is immediately available at /data.
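A quick sanity check at the start of the new session confirms the volume is attached where you expect it. A minimal sketch, assuming the default /data mount point and a hypothetical imagenet_preprocessed folder left over from the earlier session:

```python
from pathlib import Path

DATA_ROOT = Path("/data")                            # default mount point for attached storage
PREPROCESSED = DATA_ROOT / "imagenet_preprocessed"   # hypothetical folder from yesterday's run

if not DATA_ROOT.exists():
    raise RuntimeError("Expected existing storage mounted at /data -- was it attached to this instance?")

# The preprocessed dataset is immediately usable; no copying or re-processing needed.
files = sorted(PREPROCESSED.glob("*.npy"))
print(f"Found {len(files)} preprocessed shards under {PREPROCESSED}")
```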
Decision Guide
Here's a quick reference to help you choose:
Testing or temporary work: Ephemeral Storage
Fastest possible I/O, single node: Persistent Local Disk
Data must survive GPU termination: Persistent Network Storage
Multi-node or team access: Persistent Network Storage
Reusing existing datasets/models: Existing Storage
Distributed training: Persistent Network Storage
Cost optimization for experiments: Ephemeral Storage
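If it helps to see the same guide expressed as decision logic, here is a small illustrative function (Python; the function name and flags are made up for this sketch and are not part of any provider API):

```python
def choose_storage(survives_termination: bool,
                   multi_node_or_shared: bool,
                   reuse_existing_volume: bool,
                   needs_fastest_io: bool) -> str:
    """Encodes the decision guide above; purely illustrative."""
    if reuse_existing_volume:
        return "Existing Storage"
    if multi_node_or_shared or survives_termination:
        return "Persistent Network Storage"
    if needs_fastest_io:
        return "Persistent Local Disk"
    return "Ephemeral Storage"

# A distributed training job that must keep its checkpoints:
print(choose_storage(True, True, False, False))   # -> Persistent Network Storage
```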
Configuration Parameters
When you configure storage for your GPU instance, you'll set these parameters:
Storage Name
You assign a friendly, identifiable name to your storage volume. This helps you track and manage multiple storage volumes across different projects.
Storage Size
You define the capacity of your storage volume in gigabytes (GB).
Constraints:
Minimum: 10 GB
Maximum: 2 TB (2,048 GB); the exact limit depends on the instance type.
Sizing Guidance:
For datasets:
Calculate your raw data size.
Add 20-30% buffer for intermediate files and processing artifacts.
Consider whether you'll store multiple versions or augmented data.
For model checkpoints:
Estimate checkpoint size (model parameters + optimizer states).
Multiply by number of checkpoints you want to retain.
Add space for final model exports.
For general AI/ML work:
Start with 50-100 GB for small experiments.
Use 200-500 GB for typical training workflows.
Go 500 GB+ for large-scale datasets or distributed training.
Example Calculation:
You have a 30 GB dataset, expect to generate 50 GB of checkpoints, and need 15 GB for logs and outputs. Recommended size: (30 + 50 + 15) GB × 1.2 (20% buffer) ≈ 115 GB → Provision 120-150 GB.
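The same arithmetic generalizes into a small helper. A minimal sketch (Python; the 20% buffer and rounding up to the nearest 10 GB are the assumptions from the guidance above):

```python
import math

def recommended_storage_gb(dataset_gb: float,
                           checkpoints_gb: float,
                           outputs_gb: float,
                           buffer: float = 0.20,
                           round_to: int = 10) -> int:
    """Sum the known components, add a safety buffer, round up to a tidy size."""
    raw = dataset_gb + checkpoints_gb + outputs_gb
    padded = raw * (1 + buffer)
    return math.ceil(padded / round_to) * round_to

# The example above: 30 GB data + 50 GB checkpoints + 15 GB logs/outputs
print(recommended_storage_gb(30, 50, 15))   # -> 120
```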
Important: You pay based on allocated size, not used space. Right-size your storage to avoid paying for unused capacity.