Configuration Panel
The Configuration Panel gives you control over model selection, three critical inference parameters (temperature, top-p, and max tokens), and the system prompt. Understanding how to adjust these settings is essential for optimizing the model's behavior for specific tasks.
Model Selection
The model selector allows you to choose from available large language models. Each model has different characteristics in terms of:
Model capacity: Larger models (indicated by parameter count, e.g., 120B = 120 billion parameters) typically demonstrate better reasoning capabilities and broader knowledge.
Inference latency: Larger models require more computational resources and typically have higher time-to-first-token (TTFT) and overall generation time.
Specialization: Some models may be fine-tuned for specific domains (e.g., code generation, scientific reasoning, multilingual tasks).
Cost implications: In production environments, larger models generally incur higher per-token costs.
Temperature Parameter
Temperature controls the randomness of the model's output. This parameter fundamentally affects how the model selects the next token in its generation sequence.
How Temperature Works
During inference, the model produces a probability distribution over all possible next tokens. Temperature modifies this distribution:
Lower temperature (approaching 0.0): Makes the distribution more peaked, causing the model to favor high-probability tokens more strongly. This results in more deterministic, focused outputs.
Higher temperature (approaching 2.0): Flattens the distribution, giving lower-probability tokens a better chance of being selected. This increases randomness and creativity but may reduce coherence.
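To make this concrete, the short sketch below applies temperature scaling to a made-up set of logits for a four-token vocabulary. The numbers are purely illustrative, and softmax_with_temperature is a helper written for this example rather than a function from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw next-token logits into probabilities, rescaled by temperature."""
    # Dividing logits by the temperature sharpens (<1.0) or flattens (>1.0) the distribution.
    scaled = [logit / temperature for logit in logits]
    max_logit = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - max_logit) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for a made-up 4-token vocabulary.
logits = [4.0, 2.0, 1.0, 0.5]

print(softmax_with_temperature(logits, 0.2))   # peaked: the top token dominates
print(softmax_with_temperature(logits, 0.7))   # moderate randomness (the panel's default region)
print(softmax_with_temperature(logits, 2.0))   # flattened: tail tokens gain probability
```

Running this shows the top token's probability climbing toward 1.0 at low temperature, while at high temperature the four probabilities move closer together and the tail tokens become realistic candidates.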
Practical Guidelines
You should adjust temperature based on your use case. Tasks that reward precision and reproducibility, such as code generation, generally work best with low values (around 0.2), while open-ended tasks such as creative writing benefit from higher values (around 0.8).
Note: The current setting of 0.70 represents a moderate temperature suitable for general-purpose applications where you want some variation in outputs while maintaining coherence and relevance.
Top-p Sampling
Top-p sampling, also known as nucleus sampling, provides an alternative approach to controlling output randomness. Instead of considering all possible tokens (as with pure temperature sampling), top-p dynamically selects from a subset of tokens whose cumulative probability exceeds the threshold p.
How Top-p Works
At each generation step, the algorithm:
Sorts all tokens by their probability in descending order.
Computes the cumulative probability starting from the highest probability token.
Selects the smallest set of tokens whose cumulative probability reaches the threshold p.
Samples the next token from this filtered set, after renormalizing its probabilities.
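The following is a minimal sketch of these steps over a toy five-token distribution. Both top_p_filter and sample_top_p are illustrative helpers written for this example, not functions from a specific library.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of token indices whose cumulative probability reaches p."""
    # Sort token indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:          # stop once the nucleus covers probability mass p
            break
    return kept

def sample_top_p(probs, p):
    """Sample a token index from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    total = sum(probs[i] for i in nucleus)
    weights = [probs[i] / total for i in nucleus]   # renormalize within the nucleus
    return random.choices(nucleus, weights=weights, k=1)[0]

# Toy next-token distribution over a made-up 5-token vocabulary.
probs = [0.45, 0.25, 0.15, 0.10, 0.05]
print(top_p_filter(probs, 0.9))   # -> [0, 1, 2, 3]: the first four tokens cover 95% >= 90%
print(sample_top_p(probs, 0.9))   # one of those four indices, chosen at random
```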
Practical Examples
p = 1.0 (current setting): All tokens are considered (no filtering). The behavior is determined entirely by temperature. This is the default setting when you want temperature alone to control randomness.
p = 0.9: Only considers the smallest set of tokens whose cumulative probability is at least 90%. This typically includes 10-50 tokens, filtering out unlikely options while preserving diversity.
p = 0.5: Very restrictive. Only the most probable tokens (usually 3-10) are considered. Results in more focused, predictable outputs.
p = 0.1: Extremely restrictive. Typically selects only the 1-2 most probable tokens, resulting in nearly deterministic behavior regardless of temperature.
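As a rough illustration of how p changes the size of the filtered set, the snippet below applies the same cumulative-probability cutoff to a made-up ten-token distribution. Real vocabularies contain tens of thousands of tokens, so the absolute counts in practice will be larger than in this toy example.

```python
# How many tokens survive the top-p cutoff at different values of p (toy distribution).
probs = [0.40, 0.20, 0.12, 0.08, 0.06, 0.05, 0.04, 0.03, 0.015, 0.005]

for p in (1.0, 0.9, 0.5, 0.1):
    kept, cumulative = 0, 0.0
    for prob in sorted(probs, reverse=True):
        kept += 1
        cumulative += prob
        if cumulative >= p:
            break
    print(f"p={p}: {kept} of {len(probs)} tokens kept")
```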
Temperature vs Top-p: When to Use Each
You can use both parameters simultaneously, though they interact in complex ways. Here are some guidelines for your configuration strategy:
Use temperature alone (top-p = 1.0): When you want consistent behavior across different contexts. Temperature affects all tokens uniformly.
Use top-p alone (temperature = 1.0): When you want adaptive filtering that preserves high-probability tokens but filters out the long tail of unlikely options.
Use both: For fine-grained control. A common pattern is temperature = 0.7 with top-p = 0.9, which provides moderate randomness while filtering extreme outliers.
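As a sketch only, the payload below shows how these settings are commonly combined in a chat-completion style request. The field names follow widespread conventions but are not tied to a specific provider, and example-model-120b is a placeholder identifier.

```python
import json

# Hypothetical request payload; verify exact parameter names against your provider's API reference.
request = {
    "model": "example-model-120b",   # placeholder model identifier
    "temperature": 0.7,              # moderate randomness
    "top_p": 0.9,                    # filter the long tail of unlikely tokens
    "max_tokens": 512,               # hard ceiling on output length
    "messages": [
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain nucleus sampling in two sentences."},
    ],
}

print(json.dumps(request, indent=2))
```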
Max Tokens
Max Tokens defines the maximum length of the generated response measured in tokens. This parameter serves as a hard limit on the model's output length and directly impacts both latency and cost.
Understanding Tokens
Tokens are the atomic units that language models process. They are not equivalent to words or characters:
Token-to-word ratio: In English, one token corresponds to roughly 0.75 words on average, so 2048 tokens is approximately 1500-1600 words.
Token composition: Common words are typically single tokens. Longer or less common words may be split into multiple tokens. For example, 'running' might be one token, while 'extraordinary' could be two or three.
Non-English text: Languages with different character systems may have different tokenization ratios. For instance, Chinese characters or Arabic script may use more tokens per semantic unit.
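If a tokenizer is available locally, you can measure token counts directly rather than estimating from word counts. The sketch below assumes the tiktoken package and its cl100k_base encoding purely as an example; the encoding that matches your chosen model may differ.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is used only as an example encoding; pick the one that matches your model.
encoding = tiktoken.get_encoding("cl100k_base")

for text in ["running", "extraordinary", "The quick brown fox jumps over the lazy dog."]:
    token_ids = encoding.encode(text)
    print(f"{text!r} -> {len(token_ids)} token(s)")
```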
Practical Implications
Response truncation: If the model reaches the max tokens limit before naturally concluding its response, the output will be cut off mid-sentence. You will not receive any special indicator that truncation occurred.
Latency impact: Higher max tokens allows for longer responses but does not directly increase latency unless the model actually generates those tokens. The parameter sets a ceiling, not a target length.
Cost considerations: In production API deployments, you are typically charged per token generated. Setting max tokens too high may increase costs if the model produces unnecessarily verbose responses.
Context window management: The combined length of your input prompt plus max tokens must not exceed the model's context window. For instance, if a model has a 4096 token context window and your prompt is 2000 tokens, you can only set max tokens to 2096 or less.
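A small helper like the one below makes the budgeting arithmetic from the example above explicit. It is an illustrative sketch, not part of any SDK; max_output_budget and safety_margin are names invented for this snippet.

```python
def max_output_budget(context_window: int, prompt_tokens: int, safety_margin: int = 0) -> int:
    """Largest max tokens value that still fits alongside the prompt in the context window."""
    budget = context_window - prompt_tokens - safety_margin
    if budget <= 0:
        raise ValueError("The prompt already fills or exceeds the context window.")
    return budget

print(max_output_budget(4096, 2000))                    # -> 2096, matching the example above
print(max_output_budget(4096, 2000, safety_margin=64))  # leave headroom for special tokens
```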
Best Practice: Start with a conservative max tokens value based on your expected response length, then increase if you observe truncation. This approach optimizes for both performance and cost.
System Prompt
The System Prompt is a special type of input that defines the model's role, behavior, and constraints before any user interaction occurs. Think of it as the initial instructions that establish the model's persona and operating parameters for the entire conversation session.
Purpose and Function
System prompts serve several critical functions in controlling model behavior:
Role definition: You can instruct the model to assume a specific identity or expertise, such as 'You are a Python programming expert' or 'You are a medical advisor specializing in cardiology.'
Behavioral constraints: You can establish boundaries for what the model should or should not do, like 'Never provide financial advice' or 'Always cite sources when making factual claims.'
Output formatting: You can specify the structure of responses, such as 'Always respond in JSON format' or 'Use bullet points for lists.'
Tone and style: You can define the communication style, like 'Use a professional, formal tone' or 'Be concise and avoid technical jargon.'
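The sketch below combines all four functions into a single system prompt and pairs it with a user message, using the common role/content message format. The exact request structure depends on your provider, and the prompt text itself is only an example.

```python
# Example system prompt combining role definition, behavioral constraints,
# output formatting, and tone into one instruction block.
SYSTEM_PROMPT = """\
You are a Python programming expert.
- Provide step-by-step solutions with code examples.
- Never provide financial advice.
- Always respond in JSON format with the keys "explanation" and "code".
- Use a professional, formal tone and avoid unnecessary jargon.
"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "How do I read a CSV file into a list of dictionaries?"},
]
```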
Best Practices for System Prompts
Be specific and explicit: Vague instructions like 'be helpful' are less effective than concrete directives like 'provide step-by-step solutions with code examples.'
Include examples when needed: If you want a specific output format, show an example in the system prompt. The model learns better from demonstrations than from abstract descriptions.
Keep it focused: While comprehensive instructions are important, extremely long system prompts (over 1000 tokens) may dilute the model's attention. Prioritize the most critical instructions.
Test iteratively: System prompts require experimentation. Start with a basic version, observe the model's behavior, then refine based on where it deviates from your expectations.