Performance Metrics
The output panel also displays critical performance metrics alongside the actual generated response. These metrics help you understand how efficiently the model processes your requests, and understanding them is essential for optimizing your deployment and troubleshooting performance issues.
Tokens Generated
This metric shows the total number of tokens produced in the model's response. It serves multiple purposes in your analysis:
Cost estimation: In production API usage, you are typically billed per token generated, so this number maps directly to your per-request cost (see the cost sketch after this list).
Verbosity monitoring: If you notice consistently high token counts, you may want to adjust your prompts to encourage more concise responses, or increase your max tokens limit if responses are being truncated.
Performance correlation: Longer responses take more time to generate. You can use this metric to understand the relationship between output length and total latency.
Important Note: This count includes only the generated output tokens, not the input tokens from your prompt. Your total token usage for billing purposes would be input tokens plus output tokens.
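As a concrete illustration, per-request cost is just the token counts multiplied by your provider's rates. A minimal sketch in Python; the rates below are placeholder assumptions, not real pricing:

```python
# Estimate the billable cost of a single request.
# The per-token rates here are hypothetical; substitute your provider's pricing.
INPUT_RATE_PER_1K = 0.0005   # USD per 1,000 input tokens (assumed)
OUTPUT_RATE_PER_1K = 0.0015  # USD per 1,000 output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Total billed cost: input and output tokens, each at its own rate."""
    return ((input_tokens / 1000) * INPUT_RATE_PER_1K
            + (output_tokens / 1000) * OUTPUT_RATE_PER_1K)

# A 1,200-token prompt that produced a 350-token response:
print(f"${estimate_cost(1200, 350):.6f}")  # -> $0.001125
```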
Time to First Token (TTFT)
Time to First Token (TTFT) measures the latency between when you submit your request and when the model begins generating output. This is one of the most critical metrics for user experience in interactive applications.
What TTFT Includes
The TTFT measurement encompasses several processing steps (a sketch for measuring it yourself follows this list):
Request queuing: Time spent waiting in queue if the system is handling multiple concurrent requests.
Model loading: If the model needs to be loaded into GPU memory (cold start scenario), this adds significant latency.
Prompt processing: The model must process your entire input context before it can begin generating. This scales with your input length.
First token generation: Computing the probability distribution and sampling the first output token.
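If your endpoint supports streaming, you can measure TTFT directly by timestamping the first token you receive. A minimal sketch, with a simulated `stream_completion` generator standing in for your client library's actual streaming call:

```python
import time

def stream_completion(prompt: str):
    """Stand-in for your client's streaming call: here it simulates a model
    that processes the prompt for 300 ms, then emits a token every 20 ms."""
    time.sleep(0.3)                       # simulated queueing + prompt processing
    for token in ["Hello", ",", " world", "!"]:
        yield token
        time.sleep(0.02)                  # simulated per-token decode time

def measure_ttft(prompt: str):
    """Measure time from request submission to the first streamed token."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for token in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token observed
        pieces.append(token)
    return first_token_at - start, "".join(pieces)

ttft, text = measure_ttft("Say hello.")
print(f"TTFT: {ttft * 1000:.0f} ms; response: {text!r}")
```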
Strategies to Reduce TTFT
Reduce input length: Shorter prompts require less processing time. Eliminate unnecessary context or use summarization for long documents; a trimming sketch follows this list.
Select smaller models for simple tasks: If the task does not require the full capability of a 120B model, using a smaller variant can significantly reduce TTFT.
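For the first strategy, one simple approach is to trim the context to a fixed token budget before sending it. A minimal sketch using the tiktoken library; the `cl100k_base` encoding is an assumption, so substitute the tokenizer that matches your model so the counts line up with what you are billed:

```python
import tiktoken  # OpenAI's tokenizer library

def trim_to_budget(text: str, max_input_tokens: int = 2048) -> str:
    """Keep only the last `max_input_tokens` tokens of the context.

    Keeping the tail preserves the most recent turns of a conversation;
    for single documents you may prefer summarization instead.
    """
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    if len(tokens) <= max_input_tokens:
        return text
    return enc.decode(tokens[-max_input_tokens:])

if __name__ == "__main__":
    doc = "word " * 5000                    # stand-in for a long document
    print(len(trim_to_budget(doc, 512)))    # trimmed to at most 512 tokens
```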