Storage Types
When you provision a GPU instance, you need to decide what type of storage best fits your workflow. Here's how to choose and configure your storage.
Ephemeral Storage
Ephemeral storage is temporary storage that exists only while your GPU instance is running. When you terminate the instance, this storage and all its contents are permanently deleted.
When to Use Ephemeral Storage
You're running quick experiments or proof-of-concept tests.
You're processing data that can be easily regenerated.
You're prototyping a new model architecture.
Your workflow downloads data at runtime from external sources.
You want to avoid storage costs for temporary work.
Key Characteristics
Cost: Free (included with your GPU instance).
Lifecycle: Automatically deleted on instance termination.
Performance: Fast local access.
Use case: Temporary workloads, testing, quick iterations.
Example Scenario: You're testing different hyperparameters for a small model using publicly available datasets. Since you can re-download the data anytime and don't need to keep the experimental checkpoints, ephemeral storage is ideal.
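This scenario is simple enough to sketch in a few lines. The snippet below (Python; the dataset URL and the /tmp/experiments scratch path are placeholders, since the actual ephemeral mount point depends on your instance) stages a re-downloadable dataset on ephemeral storage at runtime:

```python
import urllib.request
from pathlib import Path

# Hypothetical scratch location on the instance's ephemeral disk;
# adjust to wherever your provider exposes ephemeral storage.
SCRATCH_DIR = Path("/tmp/experiments")
SCRATCH_DIR.mkdir(parents=True, exist_ok=True)

# Placeholder public dataset URL -- re-downloadable at any time,
# so losing it on termination only costs a repeat download.
DATASET_URL = "https://example.com/datasets/sample.tar.gz"
local_copy = SCRATCH_DIR / "sample.tar.gz"

if not local_copy.exists():
    urllib.request.urlretrieve(DATASET_URL, local_copy)

print(f"Dataset staged on ephemeral storage at {local_copy}")
```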
Persistent Local Disk Storage
Persistent local disk storage resides on the same physical node as your GPU instance. This gives you the fastest possible read/write speeds because data doesn't travel over the network.
When to Use Persistent Local Disk
You're training models that require extremely fast data loading (a high rate of I/O operations).
You're working with datasets that have random access patterns.
You need minimal latency between storage and compute (1-5 ms).
You're running single-node training jobs.
You need to preserve data briefly but with maximum performance.
Key Characteristics
Latency: 1-5 ms (ultra-low).
Location: Same physical node as your GPU.
Lifecycle: Deleted when GPU instance is removed.
Performance: Highest I/O throughput.
Access: Single GPU instance only.
Important Note: Even though this storage is "persistent," it gets deleted when you terminate the GPU instance. Think of it as persistent during your instance lifecycle, but not after termination.
Example Scenario: You're training a large language model where loading batches of data needs to happen as fast as possible. The dataset fits on a single node, and you'll export the final model to external storage before terminating the instance.
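One way to implement the "export before terminating" step is a plain copy from the fast local volume to a longer-lived one. A minimal sketch, assuming hypothetical mount points /local-disk and /data/exports; substitute the paths your instances actually use:

```python
import shutil
from pathlib import Path

# Assumed mount points -- replace with the paths your instance actually uses.
LOCAL_DISK = Path("/local-disk")    # fast node-local volume (deleted with the instance)
EXPORT_DIR = Path("/data/exports")  # network volume that outlives the instance

checkpoint = LOCAL_DISK / "checkpoints" / "model_final.pt"

# ... training loop writes checkpoints to LOCAL_DISK for maximum I/O speed ...

# Before terminating the instance, copy anything worth keeping off the local disk.
EXPORT_DIR.mkdir(parents=True, exist_ok=True)
shutil.copy2(checkpoint, EXPORT_DIR / checkpoint.name)
print(f"Exported {checkpoint.name} to {EXPORT_DIR}")
```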
Persistent Network Storage
Network storage is shared across multiple GPU nodes and remains available even after you delete the GPU instance that created it. Data travels over the network, which adds some latency but provides crucial benefits for collaborative and distributed workflows.
When to Use Persistent Network Storage
You're running distributed training across multiple GPU instances.
Your team needs to access the same datasets or model checkpoints.
You want your data to survive GPU instance termination.
You're implementing a workflow where different instances process the same data.
You need to preserve training checkpoints for later resumption.
Key Characteristics
Latency: 5-20 ms (slightly higher than local disk).
Location: Network-attached, accessible from multiple nodes.
Lifecycle: Persists after GPU deletion (until you explicitly delete it).
Performance: Good for most AI/ML workloads.
Access: Multiple GPU instances can mount the same volume.
Example Scenario: You're training a distributed model across 4 GPU instances. All instances need access to the same training dataset, and you want your checkpoints to persist so you can resume training later or analyze results after the GPUs are terminated.
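A common pattern on a shared volume is to let every instance read the dataset from the same path while restricting checkpoint writes to a single rank, so concurrent workers don't overwrite each other's files. A minimal sketch, assuming the volume is mounted at /data on every instance and that your launcher sets a RANK environment variable (both assumptions, not guarantees of any particular setup):

```python
import os
import json
from pathlib import Path

# Assumed shared mount point; all four instances see the same files under /data.
SHARED = Path("/data")
DATASET_DIR = SHARED / "datasets" / "train"   # every instance reads the same dataset path
CHECKPOINT_DIR = SHARED / "checkpoints"

# RANK is a common convention set by distributed launchers; yours may differ.
rank = int(os.environ.get("RANK", "0"))

# All ranks can read the shared dataset directly from the network volume.
shards = sorted(DATASET_DIR.glob("*"))
print(f"rank {rank}: {len(shards)} training shards visible")

def save_checkpoint(step: int, state: dict) -> None:
    """Only rank 0 writes, so concurrent instances don't clobber each other's checkpoints."""
    if rank != 0:
        return
    CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
    (CHECKPOINT_DIR / f"step_{step:06d}.json").write_text(json.dumps(state))

save_checkpoint(1000, {"loss": 0.42, "learning_rate": 3e-4})
```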
Existing Storage
If you've already created storage volumes in previous sessions, you can attach them to new GPU instances instead of creating new storage.
When to Use Existing Storage
You're resuming training from a previous checkpoint.
You're reusing preprocessed datasets across multiple experiments.
You've stored pre-trained models that you want to fine-tune.
You're continuing work from a previous session.
Key Characteristics
Mount point: Automatically mounted at /data (default path).
Setup: Zero additional configuration needed.
Use case: Leveraging previously created data, models, or checkpoints.
Example Scenario: Yesterday you preprocessed a large image dataset and saved it to network storage. Today you're starting a new GPU instance to train on that data; simply attach the existing storage and your preprocessed data is immediately available at /data.
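A quick sanity check at the start of the new session confirms the volume is attached where you expect it. A minimal sketch, assuming the default /data mount point and a hypothetical imagenet_preprocessed folder left over from the earlier session:

```python
from pathlib import Path

DATA_ROOT = Path("/data")                            # default mount point for attached storage
PREPROCESSED = DATA_ROOT / "imagenet_preprocessed"   # hypothetical folder from yesterday's run

if not DATA_ROOT.exists():
    raise RuntimeError("Expected existing storage mounted at /data -- was it attached to this instance?")

# The preprocessed dataset is immediately usable; no copying or re-processing needed.
files = sorted(PREPROCESSED.glob("*.npy"))
print(f"Found {len(files)} preprocessed shards under {PREPROCESSED}")
```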
Decision Guide
Here's a quick reference to help you choose:
Testing or temporary work: Ephemeral Storage
Fastest possible I/O, single node: Persistent Local Disk
Data must survive GPU termination: Persistent Network Storage
Multi-node or team access: Persistent Network Storage
Reusing existing datasets/models: Existing Storage
Distributed training: Persistent Network Storage
Cost optimization for experiments: Ephemeral Storage
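If it helps to see the same guide expressed as decision logic, here is a small illustrative function (Python; the function name and flags are made up for this sketch and are not part of any provider API):

```python
def choose_storage(survives_termination: bool,
                   multi_node_or_shared: bool,
                   reuse_existing_volume: bool,
                   needs_fastest_io: bool) -> str:
    """Encodes the decision guide above; purely illustrative."""
    if reuse_existing_volume:
        return "Existing Storage"
    if multi_node_or_shared or survives_termination:
        return "Persistent Network Storage"
    if needs_fastest_io:
        return "Persistent Local Disk"
    return "Ephemeral Storage"

# A distributed training job that must keep its checkpoints:
print(choose_storage(True, True, False, False))   # -> Persistent Network Storage
```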
Configuration Parameters
When you configure storage for your GPU instance, you'll set these parameters:
Storage Name
You assign a friendly, identifiable name to your storage volume. This helps you track and manage multiple storage volumes across different projects.
Storage Size
You define the capacity of your storage volume in gigabytes (GB).
Constraints:
Minimum: 10 GB
Maximum: 2 TB (2,048 GB); the exact limit depends on the instance type.
Sizing Guidance:
For datasets:
Calculate your raw data size.
Add 20-30% buffer for intermediate files and processing artifacts.
Consider whether you'll store multiple versions or augmented data.
For model checkpoints:
Estimate checkpoint size (model parameters + optimizer states).
Multiply by number of checkpoints you want to retain.
Add space for final model exports.
For general AI/ML work:
Start with 50-100 GB for small experiments.
Use 200-500 GB for typical training workflows.
Go 500 GB+ for large-scale datasets or distributed training.
Example Calculation:
You have a 30 GB dataset, expect to generate 50 GB of checkpoints, and need 15 GB for logs and outputs. Recommended size: (30 + 50 + 15) GB × 1.2 (20% buffer) ≈ 115 GB → Provision 120-150 GB.
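The same arithmetic generalizes into a small helper. A minimal sketch (Python; the 20% buffer and rounding up to the nearest 10 GB are the assumptions from the guidance above):

```python
import math

def recommended_storage_gb(dataset_gb: float,
                           checkpoints_gb: float,
                           outputs_gb: float,
                           buffer: float = 0.20,
                           round_to: int = 10) -> int:
    """Sum the known components, add a safety buffer, round up to a tidy size."""
    raw = dataset_gb + checkpoints_gb + outputs_gb
    padded = raw * (1 + buffer)
    return math.ceil(padded / round_to) * round_to

# The example above: 30 GB data + 50 GB checkpoints + 15 GB logs/outputs
print(recommended_storage_gb(30, 50, 15))   # -> 120
```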
Important: You pay based on allocated size, not used space. Right-size your storage to avoid paying for unused capacity.