Task Types

The benchmark evaluates LLM capability on 250 Stata programming tasks spanning four categories. Each task requires the model to generate executable Stata code that produces specific numerical outputs. The example prompts below illustrate the kinds of tasks included.

Example

"Reshape the data from wide to long format where each row represents a country-year observation. Return the number of rows after reshaping."
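
A minimal sketch of one way to answer a prompt like this is shown below. The dataset, the variable names (country, gdp2000 to gdp2002), and the scalar name n_rows are hypothetical and chosen only for illustration; they are not taken from the benchmark.

```stata
* Hypothetical wide dataset: one row per country, one gdp column per year
clear
input str3 country gdp2000 gdp2001 gdp2002
"USA" 10.3 10.6 10.9
"FRA" 1.4 1.4 1.5
"JPN" 4.9 4.3 4.1
end

* Reshape to long: the stub "gdp" becomes one variable,
* and the numeric suffix becomes the new variable "year"
reshape long gdp, i(country) j(year)

* Count the rows after reshaping (3 countries x 3 years = 9)
count
scalar n_rows = r(N)
display n_rows
```

Here the answer is read from r(N) after count; the system variable _N would give the same value.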

Example

"Calculate the Pearson correlation coefficient between variables x and y. Return r(rho)."
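
As a rough illustration, a prompt like this could be answered with correlate, which stores the coefficient in r(rho). The five data points and the scalar name rho are made up for this sketch.

```stata
* Hypothetical data for x and y
clear
input x y
1 2
2 4
3 5
4 4
5 5
end

* correlate stores the Pearson coefficient of the first two variables in r(rho)
correlate x y
scalar rho = r(rho)
display rho
```

Copying r(rho) into a scalar right away is a defensive habit, since later r-class commands overwrite the stored results.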

Example

"Estimate a two-stage least squares regression of y on x, instrumenting x with z. Return the coefficient on x."
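
The sketch below shows one plausible solution using ivregress 2sls. The simulated data, the seed, and the scalar name b_x are assumptions made so the example is self-contained; the data are generated with a true coefficient of 2 on x, so the estimate should land close to that value.

```stata
* Simulated data (hypothetical): x is endogenous because it shares u with y
clear
set obs 1000
set seed 12345
generate z = rnormal()
generate u = rnormal()
generate x = 0.8*z + 0.5*u + rnormal()
generate y = 2*x + u + rnormal()

* Two-stage least squares: instrument x with z
ivregress 2sls y (x = z)

* The coefficient on x is available as _b[x] after estimation
scalar b_x = _b[x]
display b_x
```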

Example

"Given a 2x2 matrix A, compute and return the trace (sum of diagonal elements). Store in scalar 'trace_result'."
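
A minimal sketch for this kind of task, assuming an arbitrary example matrix (the entries below are not from the benchmark), uses Stata's built-in trace() matrix function:

```stata
* Hypothetical 2x2 matrix; rows are separated by a backslash
matrix A = (1, 2 \ 3, 4)

* trace() returns the sum of the diagonal elements as a scalar (1 + 4 = 5)
scalar trace_result = trace(A)
display trace_result
```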

Difficulty Distribution

Tasks range from straightforward single-command operations to multi-step procedures requiring domain knowledge, grouped into three difficulty levels. The benchmark intentionally includes tasks where naive approaches fail but proper Stata idioms succeed.

  • Basic: Direct application of common commands
  • Intermediate: Combining multiple commands, handling edge cases
  • Advanced: Panel methods, IV estimation, survival analysis, spatial econometrics
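
One well-known case where a naive approach fails, given here as an illustration and not as an actual benchmark task, is the handling of missing values: Stata treats numeric missing (.) as larger than any number, so an unguarded comparison silently includes missing observations. The variable name income and the scalar names are hypothetical.

```stata
* Hypothetical data with one missing value
clear
input income
50
120
.
200
end

* Naive: missing is treated as larger than any number, so this counts 3 rows
count if income > 100
scalar n_naive = r(N)

* Idiomatic: exclude missing values explicitly, which counts 2 rows
count if income > 100 & !missing(income)
scalar n_correct = r(N)

display n_naive
display n_correct
```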