# Performance Troubleshooting Guide
When benchmark results show unexpected performance, check these common issues before escalating.
## Quick Checklist

### 1. Apples-to-Apples Comparison
Ensure you're comparing equivalent configurations:
#### Quantization
- Verify both deployments use the same quantization (FP16, INT8, FP8, etc.)
- Different quantization levels significantly impact throughput and latency
#### Input/Output Token Counts
- Check actual request and response sizes in your deployment metrics
- Quick token estimation: `response_size_bytes / 4 ≈ token_count` (1 token ≈ 4 characters ≈ 4 bytes)
- Streaming overhead: Response size may be larger than the actual token count due to streaming mode's extra statistics output. Disable with `--disable-streaming` for accurate measurements
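
As a quick sanity check, here is a minimal shell sketch of that estimate. The endpoint URL and request payload are placeholders for your deployment, not genai-bench defaults:

```bash
# Rough token estimate from raw response size, using the ~4 bytes/token heuristic.
# API_URL and request.json are hypothetical placeholders for your deployment.
response_bytes=$(curl -s "$API_URL" -H 'Content-Type: application/json' -d @request.json | wc -c)
echo "approx tokens: $((response_bytes / 4))"
```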
#### Latency Metrics
- TTFT (Time to First Token): Measures prompt processing speed
- E2E (End-to-End) Latency: Total request time including all token generation
- Don't compare TTFT with E2E latency—they measure different things
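
A rough decomposition, assuming a steady decode rate: `E2E ≈ TTFT + (output_tokens − 1) × TPOT`, where TPOT denotes the time per output token.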
#### Queuing Latency
- Compare End-to-end response time vs Inference time in deployment metrics
- Large differences indicate queuing delays (requests waiting for available capacity)
- High queuing latency suggests the deployment needs more replicas or the concurrency is too high
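
As a worked example of this comparison (metric names and values here are illustrative, not tied to any specific metrics backend):

```bash
# A request showing 1450 ms end-to-end but only 900 ms of inference time
# spent roughly 550 ms queuing for an available slot.
e2e_ms=1450
inference_ms=900
echo "approx queuing delay: $((e2e_ms - inference_ms)) ms"
```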
### 2. Speculative Decoding Considerations

When speculative decoding is enabled (e.g., `speculative_decoding_mode: LOOKAHEAD_DECODING` in `config.yaml`):
- Use real datasets: Speculative decoding performs well only with realistic data patterns
- Random data = poor performance: Acceptance rate is low with random tokens, negating the benefits of speculative decoding
- Always benchmark with production-representative data when speculative decoding is enabled
#### Common Mistake: Token Distribution Override

The `--traffic-scenario` distribution format (e.g., `N(2700,2300)/(350,150)`) overwrites the actual dataset content with synthetic data. This defeats the purpose of using a real dataset for speculative decoding benchmarks.
Correct approach:
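
A minimal sketch (these flags appear in full in Example 1 below; the `max_tokens` value is illustrative):

```bash
# Dataset-driven inputs; output length capped via max_tokens.
# Combine with the usual backend/model flags, as in Example 1 below.
genai-bench benchmark \
  --traffic-scenario dataset \
  --dataset-config "data_config.json" \
  --additional-request-params '{"max_tokens": 512}'
```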
This uses the real dataset for input tokens while controlling max output tokens.
## Using Real Datasets

When benchmarking speculative decoding or production-representative workloads, use `--traffic-scenario dataset` instead of token distribution formats.
### Example 1: HuggingFace Dataset
```bash
export MODEL="zed-industries/zeta"
genai-bench benchmark \
  --api-backend openai \
  --qps-level 4 \
  --api-base "http://localhost:8000/v1/chat/completions" \
  --dataset-config "data_config.json" \
  --task "text-to-text" \
  --api-model-name "${MODEL}" \
  --model-tokenizer "${MODEL}" \
  --num-workers 8 \
  --max-requests-per-run 5000 \
  --additional-request-params '{"max_tokens": 512}' \
  --max-time-per-run 20 \
  --disable-streaming \
  --execution-engine async \
  --traffic-scenario dataset
```
Where `data_config.json` contains:
```json
{
  "source": {
    "type": "huggingface",
    "path": "zed-industries/zeta",
    "huggingface_kwargs": {
      "split": "train",
      "revision": "main"
    }
  },
  "prompt_column": "input"
}
```
### Example 2: Local CSV Dataset
```bash
genai-bench benchmark \
  --api-backend openai \
  --api-base "http://localhost:8000/v1/chat/completions" \
  --task "text-to-text" \
  --api-model-name "your-model" \
  --model-tokenizer "your-model" \
  --dataset-path shared_dataset_input100_output100.csv \
  --dataset-prompt-column "prompt" \
  --traffic-scenario dataset \
  --additional-request-params '{"max_tokens": 100, "ignore_eos": true}' \
  --max-time-per-run 10 \
  --max-requests-per-run 1000
```
### Key Points
| Scenario | Traffic Scenario | Result |
|---|---|---|
| Synthetic load testing | `D(100,100)`, `N(480,240)/(300,150)` | Generates random tokens matching distribution |
| Real dataset benchmarking | `dataset` | Uses actual dataset prompts |
| Speculative decoding testing | `dataset` | Required for realistic acceptance rates |
## Summary
Before reporting unexpected performance:
- Verify quantization matches between compared deployments
- Check actual token counts (not just configured values)
- Distinguish TTFT vs E2E latency
- Look for queuing latency in deployment metrics
- Use `--traffic-scenario dataset` for speculative decoding benchmarks
- Use `--disable-streaming` for accurate response size measurements