One factory. One recipe. Every mfsig reproducible byte-for-byte.
MFFactory is the unified DFT-first compute pipeline that produces every mfsig on the platform: monomers, dimers, hetero-pairs, pocket fragments. One recipe, one binary, one provenance stamp. Klamt-purged. ALCOA-grade audit trail. The production atlas is v0.91.1 (def2-SVP); the v0.92.0 def2-TZVPP migration is smoke-tested but not yet rolled out at scale.
Five input types, one factory, five output schemas
Every cohort flows through the same MFFactory.compute() dispatcher. No cohort-specific recipe overrides, no parallel pipelines, no copy-paste forks. Same recipe → same provenance leg → same kernel-input contract downstream.
The factory's only conditional logic is around object_kind + phase_label: monomers run 1 SCF, BSSE-CP dimers run 5 SCFs in v28-compliant Boys-Bernardi order. Everything else (functional, basis, solvent, grid, convergence) is read from the registry and is identical across cohorts.
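The 1-SCF versus 5-SCF branching can be sketched as a small planning function. This is a minimal illustration, not the factory's code: the phase labels other than "single", the ordering of the five phases, and the helper names `plan_scf_phases` and `cp_interaction_energy` are all assumptions, and only the four object_kind values named on this page are handled. The energy formula is the standard Boys-Bernardi counterpoise expression, in which both monomers are evaluated in the full dimer basis.

```python
from typing import Dict, List


def plan_scf_phases(object_kind: str, n_frag: int, n_drug: int) -> List[Dict]:
    """Return one entry per SCF call: phase label, real atoms, ghost atoms.

    Atom indices assume the fragment (or polymer) block comes first and the
    drug block second. Phase labels and their order are hypothetical.
    """
    if object_kind in ("drug", "pocket_fragment"):
        atoms = list(range(n_frag + n_drug))
        return [{"phase_label": "single", "real_atoms": atoms, "ghost_atoms": []}]
    if object_kind not in ("dimer_phase", "hetero_phase"):
        raise ValueError(f"unknown object_kind: {object_kind!r}")

    frag = list(range(n_frag))
    drug = list(range(n_frag, n_frag + n_drug))
    return [
        {"phase_label": "dimer", "real_atoms": frag + drug, "ghost_atoms": []},
        {"phase_label": "frag_in_dimer_basis", "real_atoms": frag, "ghost_atoms": drug},
        {"phase_label": "drug_in_dimer_basis", "real_atoms": drug, "ghost_atoms": frag},
        {"phase_label": "frag_alone", "real_atoms": frag, "ghost_atoms": []},
        {"phase_label": "drug_alone", "real_atoms": drug, "ghost_atoms": []},
    ]


def cp_interaction_energy(e_dimer: float, e_frag_ghost: float, e_drug_ghost: float) -> float:
    """Counterpoise-corrected interaction energy (Hartree in, Hartree out)."""
    return e_dimer - e_frag_ghost - e_drug_ghost
```

In this sketch the two ghost-atom phases supply the counterpoise-corrected energy, while the two monomer-only phases are what make the size of the BSSE correction itself recoverable.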
MF_FACTORY_REGISTRY.json · versions.0.92.0
A single JSON file holds every knob the SCF pipeline reads. The file is SHA-256-hashed; that hash becomes the third leg of the four-way provenance stamp on every mfsig produced.
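A minimal sketch of how such a recipe hash could be computed, assuming the digest is taken over the raw bytes of the file as stored on disk (the page does not say whether the JSON is canonicalised first). The helper names are illustrative only.

```python
import hashlib


def sha256_of_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path: str, chunk_size: int = 1 << 16) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Under that assumption even a whitespace-only edit to the registry changes the digest, so `sha256_of_bytes(b'{"a": 1}')` and `sha256_of_bytes(b'{"a":1}')` differ although the JSON content is the same.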
| knob | v0.92.0 value | why |
|---|---|---|
| functional | wB97M-V | range-separated hybrid with built-in VV10 NLC dispersion · solves the deep-gap pathology that broke B3LYP on charged pockets |
| basis | def2-TZVPP | the basis wB97M-V was parameterised for (Mardirossian & Head-Gordon 2016) · ~1 kcal/mol energetics uncertainty on drug-like organics |
| aux_basis | def2-tzvp-jkfit | density fitting for J + K · ~3× SCF speedup, sub-mEh accuracy preserved |
| solvent | C-PCM | Conductor-like Polarizable Continuum Model · induces the polarization σ-charges on the cavity surface |
| epsilon | 80.0 | water-like dielectric · matches the cohort calibration anchors (FreeSolv, ASD) |
| radii | Bondi (P1-patched) | van-der-Waals radii table · P1 patch excludes ghost atoms from cavity construction |
| grid | (75, 302) Lebedev | atom-grid radial × angular — production-quality DFT grid |
| conv_tol | 1e-06 | SCF energy convergence (Hartree) · 6-digit DFT energies |
| density_fit | true | always on · TZVPP without DF would 10× the cost |
| L0 max_cycle | 60 (single) · 80 (BSSE) | DIIS budget before falling back to L1 |
| L3 fallback basis | def2-QZVPP | ONLY when L0/L1/L2 all fail · escape valve for pathological systems, NEVER silent |
Source of truth: refs/MF_FACTORY_REGISTRY.json · v0.92.0 SHA-256 cd88bda5… · predecessor: 0.91.1 (def2-SVP).
Four-level fallback chain — NEVER silent
The default DIIS solver at the production basis handles 96/96 production drugs at L0. When it fails, the orchestrator records the fallback level in the mfsig so the reader can audit WHICH solver produced the number — no opaque retries, no quietly-different basis.
Every mfsig records scf.fallback_level (0–3) and scf.method (DIIS / DIIS+SOSCF / DAMP+SOSCF). If level > 0 for a "normal" drug, that's a red flag worth investigating before downstream use.
No silent basis upgrade. No hidden retries with relaxed thresholds. If all four levels fail the SCF returns converged=false and the file is REJECTED at sealing time — never shipped to disk, never enters a kernel.
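The escalate-and-record behaviour can be sketched as follows. This is a hedged illustration under stated assumptions: the mapping of solver methods to levels L0 to L2 follows the order in which the methods are listed above, the solver used at L3 is a guess (the page only fixes its basis), and `ScfRecord`, `run_with_fallback`, and `seal` are hypothetical names. The `solve` callable stands in for a real SCF driver.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# (method, basis) per fallback level. ASSUMPTION: level-to-method mapping.
FALLBACK_LEVELS: Tuple[Tuple[str, str], ...] = (
    ("DIIS", "def2-TZVPP"),        # L0
    ("DIIS+SOSCF", "def2-TZVPP"),  # L1
    ("DAMP+SOSCF", "def2-TZVPP"),  # L2
    ("DAMP+SOSCF", "def2-QZVPP"),  # L3: fallback basis; solver choice is a guess
)

# solve(method, basis) -> (converged, energy_hartree)
Solver = Callable[[str, str], Tuple[bool, float]]


@dataclass(frozen=True)
class ScfRecord:
    converged: bool
    fallback_level: Optional[int]
    method: Optional[str]
    basis: Optional[str]
    energy_hartree: Optional[float]


def run_with_fallback(solve: Solver, levels=FALLBACK_LEVELS) -> ScfRecord:
    """Try each level in order and record which one produced the result."""
    for level, (method, basis) in enumerate(levels):
        converged, energy = solve(method, basis)
        if converged:
            return ScfRecord(True, level, method, basis, energy)
    return ScfRecord(False, None, None, None, None)


def seal(record: ScfRecord) -> dict:
    """Refuse to emit an scf block for an unconverged calculation."""
    if not record.converged:
        raise ValueError("SCF did not converge at any level; mfsig rejected")
    return {
        "scf": {
            "converged": True,
            "fallback_level": record.fallback_level,
            "method": record.method,
        }
    }
```

The point of the design is that the level is part of the returned record rather than a log line, so a reader can see it without access to the run that produced the file.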
Raw σ, no empirical thresholds
The legacy σ-profile workflow ran q_sym through a 0.537 Å r_av spatial average then binned with a 0.0084 e/Ų H-bond threshold — both are empirical fits to small-molecule training sets, both inflate the distribution variance by ~30× over the raw electrostatic signal. v0.91.1+ removed both. The new sigma_moments are pure area-weighted statistics on the raw per-segment σ.
σ_i = q_sym_i / area_i              # raw, per-segment, no spatial average
w_i = area_i / Σ area_j             # tile-area weight
mean = Σ w_i · σ_i
m_k = Σ w_i · (σ_i − mean)^k        # k-th weighted central moment
variance = m₂
skewness = m₃ / variance^(3/2)
kurtosis = m₄ / variance² − 3       # excess
Four numbers per molecule. Deterministic. Re-derivable bit-exact from sigma_per_segment + segment_areas stored in every mfsig. Zero empirical parameters.
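The recomputation can be sketched in a few lines of standard-library Python. The inputs correspond to the sigma_per_segment and segment_areas arrays; the function name and the error handling are assumptions, and because floating-point summation order affects the last bits, this sketch is not guaranteed to match a reference implementation bit-for-bit unless the same summation scheme is used.

```python
from math import fsum
from typing import Dict, Sequence


def sigma_moments(
    sigma_per_segment: Sequence[float],
    segment_areas: Sequence[float],
) -> Dict[str, float]:
    """Area-weighted mean, variance, skewness and excess kurtosis of raw sigma."""
    if len(sigma_per_segment) != len(segment_areas) or len(segment_areas) == 0:
        raise ValueError("need equal-length, non-empty sigma and area arrays")
    total_area = fsum(segment_areas)
    if total_area <= 0.0:
        raise ValueError("total cavity area must be positive")

    weights = [a / total_area for a in segment_areas]
    mean = fsum(w * s for w, s in zip(weights, sigma_per_segment))

    def central(k: int) -> float:
        # k-th weighted central moment about the area-weighted mean
        return fsum(w * (s - mean) ** k for w, s in zip(weights, sigma_per_segment))

    variance = central(2)
    if variance == 0.0:
        raise ValueError("all segments share one sigma; skewness and kurtosis undefined")
    return {
        "mean": mean,
        "variance": variance,
        "skewness": central(3) / variance ** 1.5,
        "kurtosis": central(4) / variance ** 2 - 3.0,
    }
```

As a sanity check, two equal-area segments with σ = −1 and σ = +1 give mean 0, variance 1, skewness 0, and excess kurtosis −2.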
Four-leg stamp on every mfsig
Every artefact carries four independent provenance legs (a factory version, two SHA-256 digests, and a pinned image tag) that, together, let a reader twenty years from now reconstruct the exact compute context byte-for-byte. If any one of them fails to match, the file is rejected.
| leg | field | value |
|---|---|---|
| 1 | mf_factory_version | 0.92.0 |
| 2 | vendor_tree_sha256 | 461e2aac… |
| 3 | registry_recipe_sha256 | cd88bda5… |
| 4 | image_tag | worker:v29 |
Plus an Ed25519 signature over the canonicalised payload (legs 1–4 are inputs to the SCF; the signature authenticates that the file was actually sealed by molforge's signing key). Verification is offline: no network call is required to confirm authenticity, only the public key at /.well-known/mfsig-keys.json.
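A reader-side check of the four legs might look like the sketch below. It is illustrative only: the canonicalisation shown (sorted keys, compact separators, UTF-8) is an assumption, since the page does not specify the scheme, and the Ed25519 step is left out because it needs a third-party library. `canonical_bytes` and `mismatched_legs` are hypothetical helper names.

```python
import json
from typing import Dict, List

LEG_KEYS = (
    "mf_factory_version",
    "vendor_tree_sha256",
    "registry_recipe_sha256",
    "image_tag",
)


def canonical_bytes(payload: Dict) -> bytes:
    """One possible canonical form: sorted keys, no extra whitespace, UTF-8.

    ASSUMPTION: the real canonicalisation scheme may differ.
    """
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def mismatched_legs(provenance: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Names of the provenance legs that are missing or differ from the expected values."""
    return [
        key
        for key in LEG_KEYS
        if key not in provenance or provenance[key] != expected.get(key)
    ]
```

An empty list from `mismatched_legs` means all four legs agree with the expected compute context; any non-empty result names the legs to investigate before the signature is even considered.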
Five object kinds, one schema family
MFFactory.compute() accepts five values for object_kind. Each cohort has its own orchestrator that calls compute() one or more times with the appropriate phase_label + ghost_atoms. The output schema is identical across cohorts — only the suffix changes.
drug atlas (monomer)
SMILES → RDKit ETKDG geometry → MFFactory single-phase SCF → σ-profile, dipole, HOMO–LUMO, Mulliken charges. The base unit of every downstream kernel.
compute_drug_mfsig(smiles, inchi_key_14, output_path) → MFFactory.compute(atoms, charge, spin, object_kind="drug", options={phase_label:"single"})
pocket fragment
Single-residue or MFCC-capped fragment from a PDB pocket → MFFactory single-phase SCF. Heavy-atom-only (descriptor H positions added by upstream capper); spin-0 closed-shell.
compute_pocket_fragment_mfsig(fragment_id, z_list, positions_aa, charge, spin, extras) → MFFactory.compute(atoms, ..., object_kind="pocket_fragment", options={phase_label:"single"})
hetero dimer (drug × polymer)
Drug × polymer-segment Boys-Bernardi 5-SCF BSSE-CP for ASD (amorphous solid dispersion) Flory-Huggins χ prediction. Atom order locked: [polymer_atoms..., drug_atoms...].
compute_hetero_bsse_cp(pair_id, drug_atoms, polymer_atoms, ...) → 5× MFFactory.compute(..., object_kind="hetero_phase", options={phase_label, ghost_atoms})
protein × ligand dimer (CL3)
Ligand × pocket fragment Boys-Bernardi 5-SCF BSSE-CP from the PDBBind CL3 benchmark cohort. The load-bearing input for the binding-affinity kernel.
compute_dimer_bsse_cp(pair_id, frag_atoms, drug_atoms, ...) → 5× MFFactory.compute(..., object_kind="dimer_phase", options={phase_label, ghost_atoms})
self-assembled (homodimer)
Same-molecule pair — Boys-Bernardi 5-SCF BSSE-CP via the dimer orchestrator with frag = monomer_A, drug = monomer_B. Used for melting-point / cohesion-energy prediction. Phases 2–5 are arithmetically redundant but kept for provenance symmetry.
compute_dimer_bsse_cp(pair_id, frag_atoms=mono_a, drug_atoms=mono_b, ...) → identical 5-SCF chain; symmetry is exploited by the reader, not by the writer.
What's in every .mfsig.json
Five top-level blocks, the same structure on every cohort. The Anatomy page documents each field; this is the executive summary.
| block | contents |
|---|---|
| provenance | 4-way SHA stamp + image_patch_chain + computed_at_utc + schema (mfsig/v0.91.1) |
| chemistry_and_geometry | elements / atomic_numbers / positions_aa / smiles / inchi_key_14 / total_charge / spin / ghost_mask |
| scf | converged / method / fallback_level / cycles / wall_s / n_basis · forensic SCF metadata |
| energies | total_hartree / g_polar_hartree / homo_hartree / lumo_hartree / gap_ev |
| moments + cavity_and_sigma | dipole_debye / atomic_charges_mulliken / cavity_area_aa2 / n_segments / sigma_per_segment (raw, ~3000–11000) / segment_areas / sigma_moments {mean, variance, skewness, kurtosis} |
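A downstream consumer could triage a parsed mfsig before use along the lines below. This is a sketch under assumptions: only the four block names that are unambiguous above are checked (the exact key naming of the moments and cavity block is not pinned down here), and `triage_mfsig` is a hypothetical helper, not part of the platform.

```python
from typing import Dict, List

# Block names taken from the summary above; the fifth block is omitted
# because its exact top-level key is not specified on this page.
REQUIRED_BLOCKS = ("provenance", "chemistry_and_geometry", "scf", "energies")


def triage_mfsig(doc: Dict) -> List[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = [f"missing block: {name}" for name in REQUIRED_BLOCKS if name not in doc]

    scf = doc.get("scf", {})
    if scf.get("converged") is not True:
        findings.append("scf.converged is not true")

    level = scf.get("fallback_level")
    if isinstance(level, bool) or not isinstance(level, int) or not 0 <= level <= 3:
        findings.append(f"scf.fallback_level out of range: {level!r}")
    elif level > 0:
        findings.append(
            f"fallback_level={level} (method={scf.get('method')}): review before downstream use"
        )
    return findings
```

Typical use would be `triage_mfsig(json.load(fh))` on an opened .mfsig.json file, treating any non-empty result as a reason to stop and inspect the file.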
How the factory actually runs
The pipeline runs on H100 SECURE pods orchestrated through a single REST-API spawner. One generic runner script per cohort, one shared spawn script, one SHA-pinned image.
worker:v29-prod-2026-05-27
vendor/gpu4pyscf (patched) + src/molforge (MFFactory + 5 orchestrators) + refs/MF_FACTORY_REGISTRY.json. SHA-256 pinned and stamped into every mfsig.
infra/v4/spawn_cohort_fleet.py
--cohort {drug|hetero|cl3|pocket|homodimer} --n 5 --gpu H100. Spawns N pods via the Runpod REST API (NOT GraphQL: GraphQL podTerminate is silently broken), then prints the SCP, SSH, and launch commands.
infra/v4/runpod_kill.py
REST DELETE /v1/pods/<id>, the only reliable termination path. A 404 is verified after each kill. Never leaves zombie pods burning $.
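The kill-then-verify pattern can be sketched without committing to any HTTP client. Everything here is an assumption beyond the DELETE path and the 404 check named above: the follow-up GET on the same path is one plausible way to obtain that 404, and `kill_pod` with its injected `request` callable is a hypothetical shape, chosen so that base URL and authentication stay outside the sketch.

```python
from typing import Callable

# request(method, path) -> HTTP status code. Injected so the sketch does not
# depend on a particular HTTP library, base URL, or auth scheme.
Request = Callable[[str, str], int]


def kill_pod(pod_id: str, request: Request) -> None:
    """Delete a pod, then confirm it is gone; raise if either step fails."""
    path = f"/v1/pods/{pod_id}"

    status = request("DELETE", path)
    # A 404 on DELETE means the pod was already gone, which is acceptable.
    if status >= 400 and status != 404:
        raise RuntimeError(f"DELETE {path} failed with HTTP {status}")

    check = request("GET", path)
    if check != 404:
        raise RuntimeError(f"pod {pod_id} still visible after kill (HTTP {check})")
```

Raising on anything other than a 404 in the second step is what turns a silent zombie pod into a loud failure of the kill script.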
Verification you can run yourself
Every claim on this page is auditable from the bytes — no internal logs required.
# 1. Verify the recipe SHA matches what was used at compute time
sha256sum refs/MF_FACTORY_REGISTRY.json
# expect : cd88bda5… (v0.92.0)
# 2. Verify every mfsig in a cohort matches the registry recipe
python refs/_v091_regen_audit.py
# expect : ALL GREEN — AST + orchestrator alignment + parse + extract/ignore
# 3. Verify a single mfsig's 5-tier quality gates
python refs/_audit_v091_1_n100_quality.py
# tier 1: schema integrity (5/5 expected)
# tier 2: physics sanity (5/5)
# tier 3: klamt-purge enforcement
# tier 4: cross-file consistency (uniform recipe SHA)
# tier 5: sigma_moments recomputation (machine-epsilon)
# 4. Verify a downstream prediction comes from the right mfsig
jq '.provenance | {mf_factory_version, vendor_tree_sha256, registry_recipe_sha256, image_tag}' \
refs/atlas_v031_v091_1/<inchi_key_14>.mfsig.json
Open a real mfsig in the 3D viewer
See provenance, SCF diagnostics, σ-profile, and Mulliken charges live.