Performance Metrics
The output panel also displays critical performance metrics alongside the actual generated response. These metrics help you understand how efficiently the model processes your requests, and understanding them is essential for optimizing your deployment and troubleshooting performance issues.
Tokens Generated
This metric shows the total number of tokens produced in the model's response. It serves multiple purposes in your analysis:
Cost estimation: In production API usage, you are typically billed per token generated, so this number maps directly to your per-request cost (see the cost sketch after this list).
Verbosity monitoring: If you notice consistently high token counts, you may want to adjust your prompts to encourage more concise responses, or increase your max tokens limit if responses are being truncated.
Performance correlation: Longer responses take more time to generate. You can use this metric to understand the relationship between output length and total latency.
Important Note: This count includes only the generated output tokens, not the input tokens from your prompt. Your total token usage for billing purposes would be input tokens plus output tokens.
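As a concrete illustration, per-request cost is just the token counts multiplied by your provider's rates. A minimal sketch in Python; the rates below are placeholder assumptions, not real pricing:

```python
# Estimate the billable cost of a single request.
# The per-token rates here are hypothetical; substitute your provider's pricing.
INPUT_RATE_PER_1K = 0.0005   # USD per 1,000 input tokens (assumed)
OUTPUT_RATE_PER_1K = 0.0015  # USD per 1,000 output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Total billed cost: input and output tokens, each at its own rate."""
    return ((input_tokens / 1000) * INPUT_RATE_PER_1K
            + (output_tokens / 1000) * OUTPUT_RATE_PER_1K)

# A 1,200-token prompt that produced a 350-token response:
print(f"${estimate_cost(1200, 350):.6f}")  # -> $0.001125
```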
Time to First Token (TTFT)
Time to First Token (TTFT) measures the latency between when you submit your request and when the model begins generating output. This is one of the most critical metrics for user experience in interactive applications.
What TTFT Includes
The TTFT measurement encompasses several processing steps (a sketch for measuring it yourself follows this list):
Request queuing: Time spent waiting in queue if the system is handling multiple concurrent requests.
Model loading: If the model needs to be loaded into GPU memory (cold start scenario), this adds significant latency.
Prompt processing: The model must process your entire input context before it can begin generating. This scales with your input length.
First token generation: Computing the probability distribution and sampling the first output token.
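If your endpoint supports streaming, you can measure TTFT directly by timestamping the first token you receive. A minimal sketch, with a simulated `stream_completion` generator standing in for your client library's actual streaming call:

```python
import time

def stream_completion(prompt: str):
    """Stand-in for your client's streaming call: here it simulates a model
    that processes the prompt for 300 ms, then emits a token every 20 ms."""
    time.sleep(0.3)                       # simulated queueing + prompt processing
    for token in ["Hello", ",", " world", "!"]:
        yield token
        time.sleep(0.02)                  # simulated per-token decode time

def measure_ttft(prompt: str):
    """Measure time from request submission to the first streamed token."""
    start = time.perf_counter()
    first_token_at = None
    pieces = []
    for token in stream_completion(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token observed
        pieces.append(token)
    return first_token_at - start, "".join(pieces)

ttft, text = measure_ttft("Say hello.")
print(f"TTFT: {ttft * 1000:.0f} ms; response: {text!r}")
```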
Strategies to Reduce TTFT
Reduce input length: Shorter prompts require less processing time. Eliminate unnecessary context or use summarization for long documents; a trimming sketch follows this list.
Select smaller models for simple tasks: If the task does not require the full capability of a 120B model, using a smaller variant can significantly reduce TTFT.
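For the first strategy, one simple approach is to trim the context to a fixed token budget before sending it. A minimal sketch using the tiktoken library; the `cl100k_base` encoding is an assumption, so substitute the tokenizer that matches your model so the counts line up with what you are billed:

```python
import tiktoken  # OpenAI's tokenizer library

def trim_to_budget(text: str, max_input_tokens: int = 2048) -> str:
    """Keep only the last `max_input_tokens` tokens of the context.

    Keeping the tail preserves the most recent turns of a conversation;
    for single documents you may prefer summarization instead.
    """
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    if len(tokens) <= max_input_tokens:
        return text
    return enc.decode(tokens[-max_input_tokens:])

if __name__ == "__main__":
    doc = "word " * 5000                    # stand-in for a long document
    print(len(trim_to_budget(doc, 512)))    # trimmed to at most 512 tokens
```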