Baseten Support

GenAI Bench supports Baseten model endpoints, including multiple request and response formats.

Overview

The Baseten backend can handle both the OpenAI-compatible chat format and a simple prompt format, allowing you to benchmark both instruct-tuned and base models.

Key Features

Dual Request Format Support

OpenAI-Compatible Format (Default)

  • Uses the {"messages": [{"role": "user", "content": "..."}]} structure
  • Compatible with instruct-tuned models
  • Supports image content for vision models

Simple Prompt Format

  • Uses the {"prompt": "..."} structure
  • Suitable for non-instruct models
  • Enabled via {"use_prompt_format": true} in additional_request_params

Streaming Control

  • Supports both streaming and non-streaming responses
  • Streaming is controlled by the global --disable-streaming flag (consistent with other backends)
  • Automatically filters the stream parameter out of additional_request_params (see Parameter Filtering below)

Response Format Flexibility

  • Handles OpenAI-compatible JSON responses
  • For non-OpenAI formats, automatically detects and parses common JSON field names (text, output, response, generated_text) or falls back to treating the body as plain text

Usage Examples

Basic OpenAI-Compatible Format

genai-bench benchmark \
  --api-backend baseten \
  --api-base "your-endpoint-url" \
  --api-key "your-baseten-api-key" \
  --api-model-name "Qwen3-30B-A3B-Instruct-2507-FP8" \
  --model-tokenizer "Qwen/Qwen3-30B-A3B-Instruct-2507" \
  --task text-to-text \
  --max-requests-per-run 200 \
  --num-concurrency 8 \
  --max-time-per-run 600 \
  --additional-request-params '{"temperature": 0.7}'

Simple Prompt Format for Non-Instruct Models

genai-bench benchmark \
  --api-backend baseten \
  --api-base "your-endpoint-url" \
  --api-key "your-baseten-api-key" \
  --api-model-name "Mistral-7B-v0.1" \
  --model-tokenizer "mistralai/Mistral-7B-v0.1" \
  --task text-to-text \
  --additional-request-params '{"use_prompt_format": true, "temperature": 0.7}' \
  --num-concurrency 1 \
  --traffic-scenario "N(100,100)/(100,100)"

Non-Streaming Mode

genai-bench benchmark \
  --api-backend baseten \
  --api-base "your-endpoint-url" \
  --api-key "your-baseten-api-key" \
  --api-model-name "test-model" \
  --model-tokenizer "test/tokenizer" \
  --task text-to-text \
  --disable-streaming \
  --additional-request-params '{"use_prompt_format": true, "temperature": 0.7}'

Image-to-Text Benchmarking

genai-bench benchmark \
  --api-backend baseten \
  --api-base "your-endpoint-url" \
  --api-key "your-baseten-api-key" \
  --api-model-name "vision-model" \
  --model-tokenizer "vision/tokenizer" \
  --task image-text-to-text \
  --dataset-path /path/to/images \
  --max-requests-per-run 50 \
  --max-time-per-run 10

Request Format Details

OpenAI-Compatible Format Payload

{
  "model": "model-name",
  "messages": [
    {
      "role": "user",
      "content": "Hello, world!"
    }
  ],
  "max_tokens": 100,
  "temperature": 0.7,
  "ignore_eos": true,
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}

Simple Prompt Format Payload

{
  "prompt": "Hello, world!",
  "max_tokens": 100,
  "temperature": 0.7,
  "stream": true
}

Image Content Format

{
  "model": "vision-model",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe this image"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,base64_image_data"
          }
        }
      ]
    }
  ],
  "max_tokens": 100,
  "temperature": 0.7,
  "stream": true
}

Response Handling

OpenAI-Compatible Response

{
  "choices": [
    {
      "message": {
        "content": "This is the generated response"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 50,
    "total_tokens": 60
  }
}

Simple Text Response

This is a plain text response

JSON Response with Alternative Fields

{
  "text": "Response from text field"
}

{
  "output": "Response from output field"
}

{
  "response": "Response from response field"
}
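
The fallback order can be sketched in a few lines of Python. This is illustrative only; the function name and exact logic are assumptions, not the backend's actual code:

import json

def extract_text(raw: str) -> str:
    # Illustrative sketch of the fallback order described above;
    # not the backend's actual implementation.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON at all: treat the body as a plain text response.
        return raw
    if isinstance(data, dict):
        # OpenAI-compatible shape takes priority.
        choices = data.get("choices")
        if choices:
            return choices[0].get("message", {}).get("content", "")
        # Then try the alternative field names, in order.
        for key in ("text", "output", "response", "generated_text"):
            if key in data:
                return data[key]
    # Unknown structure: fall back to the raw body.
    return raw

print(extract_text('{"output": "Response from output field"}'))
print(extract_text("This is a plain text response"))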

Parameter Filtering

The Baseten backend automatically filters out certain parameters from additional_request_params:

  • stream: Always controlled by the global --disable-streaming flag
  • use_prompt_format: Used internally for format selection; not sent to the API

Other parameters (temperature, top_p, and so on) are passed through to the API unchanged.
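
As a rough illustration, the filtering behaves like the following Python sketch (the function name and structure are hypothetical, not the backend's actual code):

def build_payload(base: dict, additional_params: dict, streaming: bool) -> dict:
    # Hypothetical sketch of the filtering described above.
    # Drop the two reserved keys before merging user params.
    passthrough = {
        k: v
        for k, v in additional_params.items()
        if k not in ("stream", "use_prompt_format")
    }
    payload = {**base, **passthrough}
    # stream is always set from the global --disable-streaming flag.
    payload["stream"] = streaming
    return payload

# use_prompt_format is consumed for format selection; stream is overridden.
print(build_payload(
    {"model": "test-model", "max_tokens": 100},
    {"temperature": 0.7, "stream": False, "use_prompt_format": True},
    streaming=True,
))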

Error Handling

The backend provides robust error handling for:

  • Network connection issues
  • HTTP error responses
  • Malformed JSON responses
  • Plain text responses
  • Missing or invalid authentication

Environment Variables

You can use environment variables for authentication:

export MODEL_API_KEY=your-baseten-api-key
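
If your setup does not pick up the variable automatically, you can pass it explicitly through shell expansion:

genai-bench benchmark \
  --api-backend baseten \
  --api-key "$MODEL_API_KEY" \
  ...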

Supported Tasks

  • text-to-text: Text generation with both formats
  • image-text-to-text: Vision tasks with OpenAI format
  • text-to-embeddings: Embedding generation (if supported by model)
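
As a sketch, an embeddings run might look like the following, assuming your deployed model actually serves embeddings (the flags mirror the text-to-text examples above):

genai-bench benchmark \
  --api-backend baseten \
  --api-base "your-endpoint-url" \
  --api-key "your-baseten-api-key" \
  --api-model-name "embedding-model" \
  --model-tokenizer "embedding/tokenizer" \
  --task text-to-embeddings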

Best Practices

  1. Use the OpenAI format for instruct models: Models like Qwen-Instruct, Llama-Instruct, etc.
  2. Use the prompt format for base models: Models like Mistral-7B-v0.1, base Llama models, etc.
  3. Set an appropriate temperature: Avoid temperature: 0.0, which may cause model server errors
  4. Test with small scenarios first: Use --num-concurrency 1 and small traffic scenarios for initial testing (see the example after this list)
  5. Monitor logs: Watch for warnings about token estimation and response parsing
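
For example, a minimal first run might look like this (the flags are taken from the examples above; adjust the limits as needed):

genai-bench benchmark \
  --api-backend baseten \
  --api-base "your-endpoint-url" \
  --api-key "your-baseten-api-key" \
  --api-model-name "your-model" \
  --model-tokenizer "your/tokenizer" \
  --task text-to-text \
  --num-concurrency 1 \
  --max-requests-per-run 10 \
  --max-time-per-run 5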

Troubleshooting

Common Issues

  1. Endpoint issues: Verify that the full URL is correct and that the model is deployed and running

  2. Authentication issues: Ensure your API key is correct and has the proper permissions

  3. Temperature errors: If you see ValueError: temperature (=0.0) has to be a strictly positive float, set a positive temperature (for example, temperature: 0.7) in additional_request_params, as shown after this list

  4. Response parsing errors: The backend automatically handles various response formats, but check the logs for parsing warnings
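
For the temperature error (issue 3), the fix is the same flag used in the usage examples above:

--additional-request-params '{"temperature": 0.7}'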

Debug Mode

Enable debug logging to see detailed request/response information:

export LOG_LEVEL=DEBUG
genai-bench benchmark --api-backend baseten ...