Evaluation Methodology

Each model receives identical prompts and must generate Stata code that executes successfully and produces correct numerical outputs. Evaluation is fully automated with deterministic grading.

Prompt template:

    You are a Stata expert. Write a program called 'solve'
    that takes no arguments and returns the answer in r(res).

    Task: {task_description}

    The data environment is already set up. Write only the
    program definition.
Example model response:

    program define solve, rclass
        quietly regress y x
        return scalar res = _b[x]
    end
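The harness presumably wraps each generated program in a do-file that runs it and echoes r(res) in a parseable form. A minimal sketch of that wrapping step (the sentinel string, function name, and the batch-mode invocation mentioned in the comment are assumptions, not documented parts of the benchmark):

```python
def build_dofile(program_code: str) -> str:
    """Wrap a model-generated program definition in a do-file that
    executes it and prints r(res) behind a sentinel the grader can
    grep out of the Stata log.  Run in batch mode with something
    like `stata -b do run.do` (binary name/flags are assumptions)."""
    return "\n".join([
        program_code,                      # the `program define solve ... end` block
        "solve",                           # execute it; the answer lands in r(res)
        'display "RESULT=" r(res)',        # sentinel line parsed from run.log
        "",
    ])
```

Parsing the log for the `RESULT=` line then yields the numeric answer to compare against the reference value.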
Numerical answers are graded against reference values as follows:
  • Exact match for integer results
  • Relative tolerance (1e-6) for floating-point results
  • Wider tolerance for stochastic methods
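The grading rules above can be sketched as a single comparison function. This is an illustrative implementation, assuming the source's 1e-6 relative tolerance; the benchmark only says "wider tolerance" for stochastic methods, so the 1e-2 value here is a hypothetical placeholder:

```python
def grade(expected: float, actual: float, kind: str = "float",
          rel_tol: float = 1e-6, stochastic_tol: float = 1e-2) -> bool:
    """Deterministically grade one numeric answer against the reference.

    kind: "integer"    -> exact match
          "float"      -> relative tolerance rel_tol
          "stochastic" -> wider relative tolerance stochastic_tol (assumed value)
    """
    if kind == "integer":
        return actual == expected              # exact match for integers
    tol = stochastic_tol if kind == "stochastic" else rel_tol
    if expected == 0:
        return abs(actual) <= tol              # fall back to absolute tolerance at zero
    return abs(actual - expected) / abs(expected) <= tol
```

For example, `grade(1.0, 1.0000001)` passes (relative error 1e-7), while `grade(1.0, 1.001)` fails under the 1e-6 tolerance but would pass under the stochastic one.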
The benchmark comprises 250 total tasks across 4 categories, totaling roughly 800 test cases.

Limitations

The benchmark measures code generation accuracy, not reasoning explanation or interactive debugging. Tasks are self-contained; multi-turn interactions and file I/O are not tested. Results reflect model capabilities at evaluation time and may change with updates.