Evaluation Methodology
Each model receives identical prompts and must generate Stata code that executes successfully and produces correct numerical outputs. Evaluation is fully automated with deterministic grading.
Prompt Format
Models receive a natural language description of the task along with context about the data environment. No Stata-specific hints or syntax examples are provided in the prompt.
You are a Stata expert. Write a program called 'solve'
that takes no arguments and returns the answer in r(res).
Task: {task_description}
The data environment is already set up. Write only the
program definition.
Response Format
Models must output a Stata program definition. The generated code is extracted, executed in a fresh Stata session with pre-configured test data, and its return value is compared against the expected output for each test case.
program define solve, rclass
    // regress the outcome on the predictor in the pre-loaded data
    quietly regress y x
    // return the slope coefficient as the graded answer
    return scalar res = _b[x]
end
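As an illustration of this execution step, the sketch below shows how a harness might load one test case, run the extracted program, and read back its result. The file names are hypothetical and not part of the benchmark's actual harness.

* Hypothetical grading step: file names are illustrative only
clear all
use "testcase_1.dta"        // pre-configured data for one test case
do "candidate.do"           // defines the model's solve program
solve                       // executes it; rclass results land in r()
display r(res)              // the value that gets graded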
Test Cases
Each task includes 3-5 test cases with different data configurations. A task is marked as passed only if every test case produces a correct output within the specified tolerance (a comparison sketch follows the list below):
- Exact match for integer results
- Relative tolerance (1e-6) for floating-point results
- Wider tolerance for stochastic methods
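A minimal sketch of the comparison step, assuming solve has already been run in the current session, is shown below. The expected value and the exact tolerance formula are illustrative assumptions, not the grader's actual logic.

* Hypothetical relative-tolerance check; the numbers are placeholders
quietly solve                        // assumes solve is defined and data loaded
scalar got      = r(res)
scalar expected = 1.234567           // illustrative expected value
scalar reltol   = 1e-6
* max(abs(expected), 1) guards against expected values near zero;
* the grader's exact formula is not specified in this document
if abs(got - expected) <= reltol * max(abs(expected), 1) {
    display "PASS"
}
else {
    display "FAIL"
}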
Execution Environment
Code runs in Stata/SE with a 30-second timeout per test case. The environment includes common packages, but models cannot install additional dependencies. Executions that fail (syntax errors, runtime errors) are scored as failed test cases.
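One way such failures could be detected is to wrap execution in Stata's capture prefix and inspect the _rc return code; the sketch below is illustrative rather than the benchmark's actual implementation, and the file name is hypothetical.

* Hypothetical failure detection: capture suppresses the error and
* stores Stata's return code in _rc (0 means success)
capture noisily do "candidate.do"    // compile the submitted program
if _rc == 0 {
    capture noisily solve            // run it against the loaded test data
}
if _rc != 0 {
    display "test case failed: error code " _rc
}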
Scoring
Pass rate is computed as the percentage of tasks for which all test cases succeed. Category scores break performance down by task category. For models with multiple evaluation runs, results are aggregated across runs.
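As a hypothetical illustration of this aggregation, the sketch below assumes a long-format results dataset with one row per (task, test case) and variables task and correct (1 if that test case produced the right answer). These variable names are assumptions, not the benchmark's actual schema.

* Hypothetical scoring sketch over an assumed results dataset
bysort task: egen passed_all = min(correct)   // 1 only if every test case passed
egen tag = tag(task)                          // count each task once
quietly summarize passed_all if tag
display "pass rate: " %5.1f 100*r(mean) "%"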
Limitations
The benchmark measures code-generation accuracy, not the quality of reasoning explanations or interactive debugging. Tasks are self-contained; multi-turn interaction and file I/O are not tested. Results reflect model capabilities at evaluation time and may change as models are updated.