Evaluation Methodology
Each model receives identical prompts and must generate Stata code that executes successfully and produces correct numerical outputs. Evaluation is fully automated with deterministic grading.
Prompt Format
Models receive a natural language description of the task along with context about the data environment. No Stata-specific hints or syntax examples are provided in the prompt.
You are a Stata expert. Write a program called 'solve'
that takes no arguments and returns the answer in r(res).
Task: {task_description}
The data environment is already set up. Write only the
program definition.
Response Format
Models must output a Stata program definition. The generated code is extracted, executed in a fresh Stata session with pre-configured test data, and its return value is compared against the expected output for each test case.
program define solve, rclass
    // regress the outcome on the predictor in the pre-loaded data
    quietly regress y x
    // return the slope coefficient as the graded answer
    return scalar res = _b[x]
end
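As an illustration of this execution step, the sketch below shows how a harness might load one test case, run the extracted program, and read back its result. The file names are hypothetical and not part of the benchmark's actual harness.

* Hypothetical grading step: file names are illustrative only
clear all
use "testcase_1.dta"        // pre-configured data for one test case
do "candidate.do"           // defines the model's solve program
solve                       // executes it; rclass results land in r()
display r(res)              // the value that gets graded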
Test Cases
Each task includes 3-5 test cases with different data configurations. A task is marked as passed only if every test case produces a correct output within the specified tolerance (a comparison sketch follows the list below):
- Exact match for integer results
- Relative tolerance (1e-6) for floating-point results
- Wider tolerance for stochastic methods
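A minimal sketch of the comparison step, assuming solve has already been run in the current session, is shown below. The expected value and the exact tolerance formula are illustrative assumptions, not the grader's actual logic.

* Hypothetical relative-tolerance check; the numbers are placeholders
quietly solve                        // assumes solve is defined and data loaded
scalar got      = r(res)
scalar expected = 1.234567           // illustrative expected value
scalar reltol   = 1e-6
* max(abs(expected), 1) guards against expected values near zero;
* the grader's exact formula is not specified in this document
if abs(got - expected) <= reltol * max(abs(expected), 1) {
    display "PASS"
}
else {
    display "FAIL"
}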
Execution Environment
Code runs in Stata/SE with a 30-second timeout per test case. The environment includes common packages, but models cannot install additional dependencies. Executions that fail (syntax errors, runtime errors) are scored as failed test cases.
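One way such failures could be detected is to wrap execution in Stata's capture prefix and inspect the _rc return code; the sketch below is illustrative rather than the benchmark's actual implementation, and the file name is hypothetical.

* Hypothetical failure detection: capture suppresses the error and
* stores Stata's return code in _rc (0 means success)
capture noisily do "candidate.do"    // compile the submitted program
if _rc == 0 {
    capture noisily solve            // run it against the loaded test data
}
if _rc != 0 {
    display "test case failed: error code " _rc
}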
Scoring
Pass rate is computed as the percentage of tasks for which all test cases succeed. Category scores break performance down by task category. For models with multiple evaluation runs, results are aggregated across runs.
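As a hypothetical illustration of this aggregation, the sketch below assumes a long-format results dataset with one row per (task, test case) and variables task and correct (1 if that test case produced the right answer). These variable names are assumptions, not the benchmark's actual schema.

* Hypothetical scoring sketch over an assumed results dataset
bysort task: egen passed_all = min(correct)   // 1 only if every test case passed
egen tag = tag(task)                          // count each task once
quietly summarize passed_all if tag
display "pass rate: " %5.1f 100*r(mean) "%"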
Limitations
The benchmark measures code-generation accuracy, not the quality of reasoning explanations or interactive debugging. Tasks are self-contained; multi-turn interaction and file I/O are not tested. Results reflect model capabilities at evaluation time and may change as models are updated.