Corpus playground for running local tools against popular Hex.pm packages

Hex Playground

Corpus playground for running local tools against large sets of Hex.pm packages.

Setup

cd ~/Development/hex-playground
mix deps.get

Run the fetcher as a Mix task:

mix hex_playground.fetch --mode latest --limit 300 --concurrency 8

Or build a standalone escript:

mix escript.build
./hex_playground fetch --mode latest --limit 300

Fetch a corpus

Fetch and extract the latest release of packages from the signed Hex repository registry:

mix hex_playground.fetch --mode latest --limit 300 --concurrency 8

This creates:

  • manifest.json — package metadata, paths, mirror used, and file-extension counts
  • sources/<package>-<version>/ — extracted package sources
  • tarballs/<package>-<version>.tar — cached Hex tarballs

Useful modes:

# Latest release of every public Hex package
mix hex_playground.fetch --mode latest --concurrency 16 --prune-non-elixir

# Every public package version. Large: currently ~150k releases.
mix hex_playground.fetch --mode all --concurrency 16

# Top packages by downloads, using the Hex HTTP API for ranking
mix hex_playground.fetch --mode top --limit 1000 --concurrency 16

latest and all use the Hex repository endpoint:

https://repo.hex.pm/versions

Tarballs are downloaded from:

https://repo.hex.pm/tarballs/<name>-<version>.tar

and unpacked with hex_core.
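
The tarball URL pattern above can be sketched as a small shell helper. `tarball_url` is a hypothetical name used only for illustration and is not part of hex_playground; the sketch assumes a mirror serves tarballs under the same `/tarballs/` path as the official repository.

```shell
# Hypothetical helper: build the tarball URL for one package release.
# Usage: tarball_url BASE_URL NAME VERSION
tarball_url() {
  base="${1%/}"   # drop one trailing slash so the URL has no "//"
  printf '%s/tarballs/%s-%s.tar\n' "$base" "$2" "$3"
}

# Example call (performs a download, so it is left commented out):
# curl -fsSLO "$(tarball_url https://repo.hex.pm a1 0.25.0)"
```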

Mirror balancing

Tarball downloads can be balanced across multiple repository mirrors. Registry discovery still uses --registry-url so the signed Hex.pm registry remains the source of truth.

mix hex_playground.fetch \
  --mode latest \
  --limit 1000 \
  --concurrency 16 \
  --mirror https://repo.hex.pm \
  --mirror https://cdn.jsdelivr.net/hex \
  --mirror-strategy round_robin

You can also pass mirrors comma-separated:

mix hex_playground.fetch \
  --mirror https://repo.hex.pm,https://cdn.jsdelivr.net/hex \
  --mirror-strategy random

Available strategies:

  • round_robin — distribute package tarball attempts across mirrors
  • random — pick a random starting mirror per package

If a mirror fails for a tarball, the downloader falls back to the remaining mirrors. Only https://repo.hex.pm is the official Hex.pm repository; other mirrors are useful for public tarballs but should be treated as untrusted.
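
One way to picture round-robin selection with fallback is to rotate the mirror list by the package's position in the work queue and then try mirrors in that order. The sketch below is an assumption about the general technique, not the actual hex_playground implementation; `mirror_order` is a hypothetical name.

```shell
# Hypothetical helper: print the mirrors in the order they would be tried.
# Usage: mirror_order PACKAGE_INDEX MIRROR...
mirror_order() {
  index="$1"
  shift
  count="$#"
  [ "$count" -gt 0 ] || return 1
  # Rotate the list so the package at PACKAGE_INDEX starts on mirror
  # number (index mod count); the rest follow as fallbacks.
  start=$((index % count))
  while [ "$start" -gt 0 ]; do
    first="$1"
    shift
    set -- "$@" "$first"
    start=$((start - 1))
  done
  printf '%s\n' "$@"
}
```

Under the same assumption, a random strategy would pass a random number instead of the package index and keep the same fallback order.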

Build a serveable Hex.pm-compatible mirror

Mirror the signed Hex registry files and package tarballs into a static-file layout compatible with Hex clients:

mix hex_playground.mirror \
  --out mirror \
  --concurrency 32 \
  --package-concurrency 16 \
  --mirror https://repo.hex.pm \
  --mirror https://cdn.jsdelivr.net/hex

This creates:

  • mirror/names
  • mirror/versions
  • mirror/public_key
  • mirror/packages/<name>
  • mirror/tarballs/<name>-<version>.tar
  • mirror/.hex_playground/manifest.ndjson
  • mirror/.hex_playground/failures.ndjson when downloads fail
  • mirror/.hex_playground/summary.json

Registry metadata is always fetched from --registry-url, defaulting to the official https://repo.hex.pm. Tarball downloads are balanced across --mirror URLs with fallback when a mirror fails. Existing valid tarballs are reused unless --force is passed.

For a small test run:

mix hex_playground.mirror --out mirror-test --limit 20 --concurrency 4

Serve the mirror with any static HTTP server rooted at mirror/:

cd mirror
python3 -m http.server 8080

The served paths must match Hex's repository paths exactly:

/names
/versions
/public_key
/packages/<name>
/tarballs/<name>-<version>.tar
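
A quick way to confirm that a static server exposes these paths is to request each one and print the HTTP status. This is a minimal sketch; `mirror_paths` and `check_mirror` are hypothetical helpers, not hex_playground commands, and the package passed in must be one that was actually mirrored.

```shell
# Hypothetical helper: list the paths a client would request for one release.
# Usage: mirror_paths NAME VERSION
mirror_paths() {
  printf '%s\n' /names /versions /public_key "/packages/$1" "/tarballs/$1-$2.tar"
}

# Hypothetical helper: request each path and print "<status> <path>".
# Usage: check_mirror BASE_URL NAME VERSION
check_mirror() {
  mirror_paths "$2" "$3" | while read -r p; do
    status=$(curl -s -o /dev/null -w '%{http_code}' "${1%/}$p")
    printf '%s %s\n' "$status" "$p"
  done
}

# Example call against the local server started above (left commented out):
# check_mirror http://localhost:8080 a1 0.25.0
```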

Verify a completed or partial mirror:

mix hex_playground.mirror.verify --out mirror

The verifier checks required registry files, package metadata files referenced by the manifest, tarball presence, and tarball unpacking. Hex tarballs whose metadata files exceed hex_core's in-memory unpack safety limit are treated as valid, because they are still serveable by a mirror and fetchable by Hex clients. The verifier writes its results to mirror/.hex_playground/verify-summary.json.

To use the mirror as a drop-in replacement for the default Hex repo in an isolated Mix home:

MIX_HOME=/tmp/hex-mirror-mix \
  mix hex.repo set hexpm \
  --url http://localhost:8080 \
  --public-key mirror/public_key

Then ordinary Hex commands use the mirror:

MIX_HOME=/tmp/hex-mirror-mix mix hex.package fetch a1 0.25.0

If you add the mirror under a new repo name instead of overriding hexpm, Hex will reject upstream registry metadata unless you set HEX_NO_VERIFY_REPO_ORIGIN=1, because the signed package records still declare their origin as hexpm.
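
The separate-repo variant might look like the sketch below. The repo name `local_mirror`, the Mix home path, and both function names are assumptions made for illustration; `mix hex.repo add` and `HEX_NO_VERIFY_REPO_ORIGIN` are standard Hex features, but check their behaviour against your Hex version.

```shell
# Hypothetical repo name and Mix home; neither comes from hex_playground.
MIRROR_REPO="local_mirror"
MIRROR_MIX_HOME="/tmp/hex-mirror-mix"

# Register the mirror under its own name, leaving hexpm untouched.
add_mirror_repo() {
  MIX_HOME="$MIRROR_MIX_HOME" mix hex.repo add "$MIRROR_REPO" \
    http://localhost:8080 --public-key mirror/public_key
}

# Resolve dependencies with the origin check disabled for this one command.
fetch_deps_from_mirror() {
  MIX_HOME="$MIRROR_MIX_HOME" HEX_NO_VERIFY_REPO_ORIGIN=1 mix deps.get
}
```

For Hex to consult the new repo, the project's dependencies would presumably need to name it with the `repo:` option. Setting the variable per command rather than exporting it keeps the origin check enabled everywhere else.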

Run tools against every package

Use scripts/run_tool.exs with the command to run after --. The command may contain these placeholders:

  • {name} — Hex package name
  • {version} — package version
  • {path} — relative source path
  • {abs_path} — absolute source path
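
To make the placeholder semantics concrete, the sketch below expands a command template with sed. It illustrates the idea only and is not how scripts/run_tool.exs is implemented; `expand_placeholders` is a hypothetical name, and values containing `|`, `&`, or backslashes would need escaping first.

```shell
# Hypothetical helper: substitute the four placeholders in a command template.
# Usage: expand_placeholders TEMPLATE NAME VERSION PATH ABS_PATH
expand_placeholders() {
  template="$1"
  name="$2"
  version="$3"
  rel_path="$4"
  abs_path="$5"
  # In sed's basic regular expressions a bare "{" is a literal character.
  printf '%s\n' "$template" | sed \
    -e "s|{name}|$name|g" \
    -e "s|{version}|$version|g" \
    -e "s|{path}|$rel_path|g" \
    -e "s|{abs_path}|$abs_path|g"
}
```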

Examples:

./scripts/run_tool.exs --limit 20 -- elixir -e 'IO.puts(System.get_env("HEX_PLAYGROUND_PACKAGE"))'

./scripts/run_tool.exs --limit 300 -- bash -lc 'find lib src -type f 2>/dev/null | wc -l'

./scripts/run_tool.exs --limit 300 -- bash -lc 'mix ex_dna --format json 2>/dev/null || true'

Each run writes:

  • runs/<timestamp>/results.ndjson
  • runs/<timestamp>/summary.json
  • one log file per package
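
Because NDJSON stores one JSON object per line, a run can be sized up without knowing the record schema. The helper below is a hypothetical sketch that counts the non-empty lines in a results file; the field names inside each record are not documented here, so it does not inspect them.

```shell
# Hypothetical helper: count the records in an NDJSON file.
# Usage: count_records FILE
count_records() {
  # grep -c . counts lines with at least one character, skipping blank lines.
  grep -c . "$1"
}

# Example call (replace <timestamp> with an actual run directory):
# count_records "runs/<timestamp>/results.ndjson"
```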

Notes

This directory is intentionally data-heavy. Keep generated corpus data out of git unless explicitly needed.