Running Stata in OpenAI Codex

Setting up a licensed Stata installation in Codex without committing large binaries to your repository.

Codex runs your code in disposable Linux containers. Each session starts fresh, which poses two challenges for Stata: the 700 MB installation and the license file. You cannot commit either to version control: the binaries are too large for Git, and the license contains sensitive credentials tied to your institution.

The solution: build Stata once in Docker on your local machine, upload the tarball to cloud storage, and pull it into Codex during setup using secrets for the license. This keeps your repository clean while giving Codex everything it needs to run Stata commands.

What you need

Before starting, gather these from your existing Stata installation:

  • The Stata Linux installer tarball (e.g., Stata19Linux64.tar.gz); download this from your Stata account portal
  • Your stata.lic file from an existing installation (usually in /usr/local/stata19/ on Linux or /Applications/Stata/ on Mac)
  • Docker Desktop (on Apple Silicon, enable Rosetta emulation in Docker Desktop settings)
  • A Google Drive account (or similar cloud storage) for hosting the built tarball

Building the tarball

The Stata installer is interactive, but we can run it once in a Docker container and capture the installed files. This command mounts your current directory, runs the installer, and packages the result:

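# Build in an x86-64 Ubuntu container with the current directory mounted at /mnt
docker run -it --platform=linux/amd64 \
  -v "$(pwd):/mnt" \
  ubuntu:22.04 bash -c '
    set -e  # stop at the first failing step
    apt-get update && apt-get install -y libncurses5
    # Unpack the installer and run it interactively
    cd /mnt && tar -xzf Stata*Linux*.tar.gz
    cd /mnt/stata* && ./install
    # Package the installed tree for upload
    tar -czf /mnt/stata19.tgz -C /usr/local stata19
  '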

On Apple Silicon, the --platform flag forces x86 emulation, as Stata lacks ARM Linux builds; without this flag, Docker builds ARM images on M-series Macs, which will not run Stata. The installer will prompt you for the installation directory; accept the default /usr/local/stata19. When it finishes, you will have a ~720 MB tarball containing the full Stata installation minus the license file.
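
Before uploading, it may be worth confirming that the tarball has the expected layout and that no license file slipped into it. A minimal sketch, assuming the archive was built with the command above; check_tarball is a hypothetical helper name, not part of Stata or Docker:

```shell
# check_tarball ARCHIVE: succeed only if the archive contains a top-level
# stata19/ tree and does not contain a stata.lic file.
check_tarball() {
  listing="$(tar -tzf "$1")" || return 1
  # The installed tree should sit under a top-level stata19/ directory.
  if ! printf '%s\n' "$listing" | grep -q '^stata19/'; then
    echo "no stata19/ directory in $1" >&2
    return 1
  fi
  # The license travels separately, so it should not be in the archive.
  if printf '%s\n' "$listing" | grep -q '/stata\.lic$'; then
    echo "stata.lic found in $1; remove it and rebuild" >&2
    return 1
  fi
}
```

For example, check_tarball stata19.tgz && echo "ready to upload" prints the message only when both checks pass.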

Hosting the tarball

Upload stata19.tgz to Google Drive and enable "Anyone with the link can view" sharing. Copy the share URL; you will need it for the Codex secrets. Google Drive works well because it offers free storage and fast downloads, but any cloud storage with direct download links will work (S3, Dropbox, etc.).

One caveat: Google Drive shows a "virus scan" warning for large files, which breaks curl and wget. Use gdown instead; it handles these confirmation pages automatically.

Codex secrets

Codex supports environment secrets that persist across sessions but stay out of your repository. Open your project settings in the Codex web interface and add two secrets:

  • STATA_TREE_URL — the Google Drive share link (the full URL, not just the file ID)
  • STATA_LIC_B64 — your license file, base64-encoded (see below)

Base64 encoding converts the license file into a single line of text that can be stored as an environment variable. Run this command on your local machine where stata.lic exists:

base64 -i stata.lic | tr -d "\n"

Copy the output (a single long line of base64 text) into the STATA_LIC_B64 secret. The tr -d "\n" removes newlines so the entire license fits in one environment variable.
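
To catch copy-and-paste damage before it reaches Codex, you can check that the encoded text decodes back to the original file. A sketch, assuming a base64 that accepts --decode (both GNU coreutils and macOS do); encode_license and verify_license are hypothetical helper names:

```shell
# encode_license FILE: print FILE as a single line of base64.
encode_license() {
  base64 < "$1" | tr -d '\n'
}

# verify_license FILE: succeed if the encoded text decodes back to FILE
# byte for byte.
verify_license() {
  encode_license "$1" | base64 --decode | cmp -s - "$1"
}
```

Running verify_license stata.lic && encode_license stata.lic prints the string to paste only if the round trip succeeds.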

Setup script

Codex lets you define setup commands that run when your environment initializes. Add this script to your Codex setup configuration (or save it as setup.sh and reference it there). The script downloads Stata from your cloud storage, extracts it, writes the license, and creates symlinks so you can run stata-mp from anywhere:

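#!/bin/bash
set -euo pipefail

# Install libncurses5 for Stata, plus pip for installing gdown
sudo apt-get update && sudo apt-get install -y libncurses5 python3-pip
pip install gdown

# Download and unpack the Stata tree; --fuzzy lets gdown accept a full share link
gdown --fuzzy "$STATA_TREE_URL" -O /tmp/stata.tgz
sudo tar -xzf /tmp/stata.tgz -C /usr/local
rm /tmp/stata.tgz

# Write the license from the secret and link the Stata executables onto PATH
echo "$STATA_LIC_B64" | base64 -d | sudo tee /usr/local/stata19/stata.lic > /dev/null
sudo ln -sf /usr/local/stata19/stata* /usr/local/bin/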

The set -euo pipefail line makes the script exit on any error, which prevents partial installations. Codex caches the environment after setup completes successfully, so the script runs only once: on the first session, or again after you explicitly clear the cache.
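
With set -u, an unset secret aborts the script with a fairly terse message. A small preflight check near the top can make a missing secret easier to diagnose. This is a sketch under the assumption that Codex exposes secrets as ordinary environment variables during setup; require_secrets is a hypothetical helper:

```shell
# require_secrets NAME...: fail with a readable message if any of the named
# environment variables is unset or empty.
require_secrets() {
  missing=0
  for name in "$@"; do
    # printenv prints nothing for a variable that is not in the environment.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing Codex secret: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling require_secrets STATA_TREE_URL STATA_LIC_B64 before the download step stops the script early and names whichever secret is absent.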

Testing

Once setup finishes, verify everything works by running:

stata-mp -b -e "about"

The -b and -e flags run Stata in batch mode (no GUI) on the command that follows, here about, which prints license information. In batch mode Stata writes its output to a log file in the current directory rather than to the terminal, so look there for the result. If you see your serial number and licensed modules, the installation succeeded. If you get a license error, double-check that your stata.lic matches the Stata version and edition (MP, SE, BE) in your tarball.
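
If the command is not found at all, the problem lies in the setup step rather than in the license. A quick check of the files the setup script should have produced, assuming the paths used above; check_install is a hypothetical helper that takes the install directory as an argument so it can be pointed elsewhere:

```shell
# check_install [DIR]: succeed if DIR (default /usr/local/stata19) holds an
# executable stata-mp and a non-empty stata.lic.
check_install() {
  dir="${1:-/usr/local/stata19}"
  if [ ! -x "$dir/stata-mp" ]; then
    echo "no executable stata-mp in $dir" >&2
    return 1
  fi
  if [ ! -s "$dir/stata.lic" ]; then
    echo "stata.lic missing or empty in $dir" >&2
    return 1
  fi
}
```

The error message tells you which half of the setup to revisit: the tarball download or the license secret.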

Large datasets

Research projects often involve datasets too large for Git. The same pattern works here: upload your .dta files to cloud storage, store the URLs as Codex secrets, and add download commands to your setup script. For multiple files, consider creating a single tarball of your data directory to reduce setup time.
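
Packing the data directory is a single tar call on each side. A sketch, assuming your .dta files live in a directory named data at the project root; pack_data and unpack_data are hypothetical helper names:

```shell
# pack_data PROJECT_DIR ARCHIVE: bundle PROJECT_DIR/data into ARCHIVE.
pack_data() {
  tar -czf "$2" -C "$1" data
}

# unpack_data ARCHIVE PROJECT_DIR: restore the data/ directory under
# PROJECT_DIR, creating PROJECT_DIR if needed.
unpack_data() {
  mkdir -p "$2"
  tar -xzf "$1" -C "$2"
}
```

You would run pack_data locally before uploading, then call unpack_data in the setup script after downloading the archive.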

If your data changes frequently, you might prefer downloading it on demand rather than during setup. This avoids re-running setup every time you update a file, but it means slower first access in each session.
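
One way to download on demand is a helper that fetches a file only when it is not already present. A sketch, assuming gdown was installed by the setup script and that the URL comes from a secret of your choosing; ensure_file, DATA_URL, and survey.dta are hypothetical names, and --fuzzy is gdown's option for accepting a full share link:

```shell
# ensure_file TARGET URL: download URL to TARGET unless a non-empty copy
# already exists.
ensure_file() {
  target="$1"
  url="$2"
  # Skip the download when the file is already in place.
  if [ -s "$target" ]; then
    return 0
  fi
  mkdir -p "$(dirname "$target")"
  gdown --fuzzy "$url" -O "$target"
}
```

Calling ensure_file data/survey.dta "$DATA_URL" before running a do-file pays the download cost once per session and is a no-op afterwards.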

Updating

When StataCorp releases a new version, rebuild the tarball using the new installer and upload it to the same Google Drive location (or update STATA_TREE_URL if you change the file). For license renewals, just re-encode the new stata.lic and update STATA_LIC_B64. The setup script stays the same in both cases; you only need to clear your Codex environment cache to trigger a fresh setup.