Task Types

The benchmark evaluates LLM capability across 250 Stata programming tasks spanning four categories. Each task requires generating executable Stata code that produces specific numerical outputs.

Data Manipulation

Tasks involve reshaping data, merging datasets, collapsing observations, manipulating strings, and handling dates and times. They test the fundamental data-wrangling skills that empirical research depends on.

Example

"Reshape the data from wide to long format where each row represents a country-year observation. Return the number of rows after reshaping."

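A task like this one might be solved along the following lines. This is a minimal sketch, not the benchmark's reference solution: the dataset, the country identifier, and the gdp stub are hypothetical, since the task text does not name the variables.

```stata
* Hypothetical wide dataset: one row per country, one gdp column per year
clear
input str3 country gdp2000 gdp2001 gdp2002
"USA" 10.3 10.6 10.9
"FRA"  1.4  1.4  1.5
"JPN"  4.9  4.3  4.1
end

* i() identifies the unit; j() names the new variable that receives
* the numeric suffix of each gdp* column
reshape long gdp, i(country) j(year)

* _N is the number of observations currently in memory
display _N
```

With three countries and three years in the assumed data, the reshaped dataset has nine rows, one per country-year.
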
Summary Statistics

Tasks involve computing descriptive statistics, tabulations, hypothesis tests, and correlation analyses. Models must apply the correct statistical command and extract the right value from Stata's stored results, such as r() and e().

Example

"Calculate the Pearson correlation coefficient between variables x and y. Return r(rho)."

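A possible solution to this task is sketched below. The simulated data are an assumption made only so that the example runs on its own; the task would supply x and y.

```stata
* Simulated data standing in for the task's x and y (assumption)
clear
set obs 100
set seed 12345
generate x = rnormal()
generate y = 0.5*x + rnormal()

* correlate leaves the coefficient for the first two variables in r(rho)
correlate x y

* Copy r(rho) into a scalar right away: the next r-class command
* would overwrite the stored results
scalar rho = r(rho)
display rho
```

Saving the value in a scalar immediately after the command is the design choice worth noting, because stored results persist only until another command of the same class runs.
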
Regression

Tasks cover OLS regression, panel data methods (fixed effects, random effects), instrumental variables, time series analysis, and advanced econometric techniques. This category tests understanding of Stata's estimation commands and postestimation tools.

Example

"Estimate a two-stage least squares regression of y on x, instrumenting x with z. Return the coefficient on x."

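One way to approach this task uses ivregress, sketched here under stated assumptions. The data-generating process is hypothetical and exists only to make the example self-contained; the task would supply y, x, and z.

```stata
* Simulated data (assumption): u makes x endogenous, z is a valid instrument
clear
set obs 500
set seed 2024
generate z = rnormal()
generate u = rnormal()
generate x = 0.8*z + 0.5*u + rnormal()
generate y = 2*x + u + rnormal()

* Endogenous regressor and its instrument go inside the parentheses
ivregress 2sls y (x = z)

* _b[x] is the estimated coefficient on x from the last estimation
display _b[x]
```

The same coefficient is also available in the e(b) matrix; _b[x] is simply the shorter way to read a single estimate.
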
Programming

Tasks cover macro manipulation, loops, matrix operations, and Mata programming. They evaluate fluency with Stata's programming constructs beyond basic data analysis commands.

Example

"Given a 2x2 matrix A, compute and return the trace (sum of diagonal elements). Store in scalar 'trace_result'."

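A short sketch of this task follows. The entries of A are hypothetical, since the task would define the matrix, and the scalar trace_mata is an illustrative name that is not part of the task.

```stata
* Hypothetical 2x2 matrix; the task would supply A
matrix A = (1, 2 \ 3, 4)

* trace() is a built-in matrix function that sums the diagonal
scalar trace_result = trace(A)
display trace_result

* The same computation through Mata, stored under an illustrative name
mata: st_numscalar("trace_mata", trace(st_matrix("A")))
display trace_mata
```

For the assumed matrix the diagonal elements are 1 and 4, so both displays show 5.
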
Difficulty Distribution

Tasks range from straightforward single-command operations to multi-step procedures requiring domain knowledge. The benchmark intentionally includes tasks where naive approaches fail but proper Stata idioms succeed.

Basic: Direct application of common commands
Intermediate: Combining multiple commands, handling edge cases
Advanced: Panel methods, IV estimation, survival analysis, spatial econometrics
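
The gap between a naive approach and a proper Stata idiom can be illustrated with a well-known pitfall. The example below is an illustration only and is not taken from the benchmark's task set; the variable x and its values are hypothetical. Stata treats a numeric missing value as larger than any number, so an unguarded comparison silently counts missing observations.

```stata
* Hypothetical data with one missing value
clear
input x
50
150
.
200
end

* Naive: the missing value satisfies x > 100, so three rows are counted
count if x > 100
display r(N)

* Idiomatic: exclude missing values explicitly, leaving two rows
count if x > 100 & !missing(x)
display r(N)
```

The first count returns 3 and the second returns 2, so a solution that omits the missing() guard would report the wrong number.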