Roadmap
Where the .mfsig generation engine, the open format, the viewer, and the standards are heading. Public roadmap, honest status flags, dependencies declared. We ship what we promise. (Downstream prediction kernels are on the separate kernels site.)
Physics fidelity
Three numerics fixes: charge-conservation renorm on the kept segment set, NaN-safe σ-moments, segment-area floor. Plus Grimme D3BJ dispersion. Klamt radii experiment.
Reverted Klamt → Bondi radii after the calibration mismatch dropped the HB-signal correlation r from +0.65 to +0.09. Cohort: 58 drugs + 26 polymers · 84/84 PASS all 7 gates. Charge conservation at machine ε on every drug.
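As a rough illustration of the three numerics fixes named above, the sketch below shows one plausible reading of each: an area-weighted shift that restores the total charge on the kept segments, a σ-moment that skips non-finite segments, and an area floor that avoids division by zero. The function names, the floor value, and the area-weighted form of the correction are assumptions for illustration, not the engine's actual code.

```python
import math

def renormalise_charges(areas, charges, target_total):
    """Shift kept-segment charges, weighted by area, so they sum to target_total.

    Assumption: an additive area-weighted correction. The real engine may
    renormalise differently.
    """
    total_area = math.fsum(areas)
    if total_area <= 0.0:
        raise ValueError("kept segment set has no surface area")
    residual = target_total - math.fsum(charges)
    return [q + residual * a / total_area for a, q in zip(areas, charges)]

def sigma_moment(areas, charges, order, area_floor=1e-6):
    """Area-weighted moment sum_i a_i * sigma_i**order over finite segments only.

    area_floor (hypothetical value) clamps tiny areas so sigma = q / a stays finite.
    """
    terms = []
    for a, q in zip(areas, charges):
        if not (math.isfinite(a) and math.isfinite(q)):
            continue  # NaN-safe: ignore corrupted segments instead of poisoning the sum
        a_safe = max(a, area_floor)
        sigma = q / a_safe
        terms.append(a_safe * sigma ** order)
    return math.fsum(terms)
```

With this form, a neutral molecule whose kept segments sum to a small residual charge is pulled back to zero without changing the sign pattern of the surface.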
Autonomous overnight queue, sorted small→big by heavy-atom count. Skip-if-exists + restart-safe. Every entry follows the v0.28.3 gates. Expanding the calibration anchor set for tighter ΔG corrections.
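The queue behaviour described above (small→big ordering, skip-if-exists, restart-safe) can be sketched in a few lines. This is a minimal illustration, assuming one output file per molecule named `<name>.mfsig.json`; `plan_queue` and `write_atomically` are hypothetical helper names, not part of the actual pipeline.

```python
import os
import tempfile
from pathlib import Path

SUFFIX = ".mfsig.json"  # file suffix as used elsewhere on this page

def plan_queue(jobs, out_dir):
    """Return job names still to run, smallest heavy-atom count first.

    jobs is an iterable of (name, heavy_atom_count) pairs. Entries whose
    output file already exists are skipped, which makes a restart cheap.
    """
    out = Path(out_dir)
    pending = [(n_heavy, name) for name, n_heavy in jobs
               if not (out / (name + SUFFIX)).exists()]
    return [name for _, name in sorted(pending)]

def write_atomically(out_dir, name, payload):
    """Write to a temp file in the same directory, then rename into place.

    A crash mid-write leaves only a .tmp file, so a half-written result is
    never mistaken for a finished one on the next run.
    """
    out = Path(out_dir)
    fd, tmp_path = tempfile.mkstemp(dir=out, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(payload)
    final_path = out / (name + SUFFIX)
    os.replace(tmp_path, final_path)
    return final_path
```

Running the small molecules first means an interrupted night still yields the largest possible number of finished entries.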
Atlas generation at the v0.29-directqm-svp recipe. Covers drug-like chemistry beyond the FreeSolv-overlapping validation cohort. Powers the /atlas page screening workflows: candidate ranking, solvent screen, patent leads, portfolio risk, CDMO QC. Coverage-driven (not anchor-aligned); /benchmarks stays on the v028 canonical cohort.
Address the documented +0.62 D systematic dipole bias at B3LYP/def2-SVP. TZVPD adds diffuse functions on aromatic π-systems where the bias is worst (aniline, nitrobenzene, imidazole, caffeine).
Today ~5% of the atlas fails with "electron number X and spin 0 inconsistent" — radicals, ammonium cations, organometallics. v0.28.5 adds an automatic UHF/UKS branch when RDKit detects odd-electron / charged species, recovering bretylium, choline, pyridostigmine and the rest.
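The failure mode quoted above comes down to electron-count parity: an odd electron count cannot be closed-shell. The sketch below shows the decision in isolation, assuming the atomic numbers and formal charge have already been extracted (in the pipeline these would come from RDKit). `pick_scf_branch` is a hypothetical name, and the lowest-spin guess is an assumption that will not suit high-spin organometallics.

```python
def pick_scf_branch(atomic_numbers, formal_charge=0):
    """Choose a restricted or unrestricted branch from electron-count parity.

    spin follows the 2S convention (number of unpaired electrons).
    Assumption: lowest-spin state, i.e. 0 for even counts and 1 for odd counts.
    """
    n_electrons = sum(atomic_numbers) - formal_charge
    if n_electrons <= 0:
        raise ValueError("no electrons left after applying the formal charge")
    unpaired = n_electrons % 2
    return {
        "n_electrons": n_electrons,
        "spin": unpaired,
        "branch": "UKS" if unpaired else "RKS",
    }

# Choline cation, C5H14NO+: with the +1 charge supplied the count is even.
choline = [6] * 5 + [1] * 14 + [7, 8]
print(pick_scf_branch(choline, formal_charge=1))
# Dropping the charge gives an odd count, which is inconsistent with spin 0.
print(pick_scf_branch(choline, formal_charge=0))
```

The second call illustrates why quaternary ammonium drugs trip the error: the molecule itself is closed-shell, but only once the formal charge is passed through.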
HPMCAS_L (123 atoms) currently overshoots the single-GPU VRAM ceiling at the JK-contraction stage. Heavy-Track introduces staged DFT with chunked builders + next-generation GPU fallback, lifting the size cap to ~200 atoms.
ETKDG embed → UFF preopt → xtb rank → DFT top-k → Boltzmann avg over canonical ensemble. Each conformer is its own audit-signed .mfsig; the ensemble carries an additional ensemble hash binding them.
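The final step of that pipeline, the Boltzmann average, is simple enough to show directly. This is a minimal sketch assuming relative conformer energies in kcal/mol and a per-conformer scalar property; the function names are illustrative only.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def boltzmann_weights(energies_kcal, temperature=298.15):
    """Return normalised Boltzmann populations for conformer energies in kcal/mol."""
    e_min = min(energies_kcal)  # shift by the minimum to avoid overflow in exp
    factors = [math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies_kcal]
    partition = math.fsum(factors)
    return [f / partition for f in factors]

def ensemble_average(values, weights):
    """Population-weighted average of a per-conformer property."""
    return math.fsum(v * w for v, w in zip(values, weights))

# Three conformers at 0.0, 0.5 and 2.0 kcal/mol with illustrative dipoles in D.
weights = boltzmann_weights([0.0, 0.5, 2.0])
print(weights, ensemble_average([2.1, 2.6, 3.4], weights))
```

At room temperature RT is about 0.59 kcal/mol, so conformers more than 2 to 3 kcal/mol above the minimum contribute only a few percent, which is why a DFT top-k cut is usually safe.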
For ionisable drugs (≈45% of pharma library), persist a per-pH microspecies fan: neutral, anion, cation, zwitterion. Each is a full .mfsig keyed by pKa source. The user requests "drug at pH 7.4" and gets the weighted blend.
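The "weighted blend" at a given pH can be sketched with the standard acid-base equilibrium expressions. The example below assumes macroscopic pKa values, so it resolves protonation states but does not separate a zwitterion from its neutral tautomer; `microspecies_fractions` and `blend` are hypothetical names, and the pKa of 3.5 used for aspirin is an approximate literature value chosen for illustration.

```python
import math

def microspecies_fractions(pkas, ph):
    """Populations of protonation states for a polyprotic molecule at a given pH.

    Index k of the result is the state with k protons removed from the fully
    protonated form. Uses term_k = 10 ** (k * pH - sum of the first k pKa values).
    """
    log_terms = [0.0]
    for pka in sorted(pkas):
        log_terms.append(log_terms[-1] + ph - pka)
    top = max(log_terms)  # rescale so the largest term is 1 and nothing overflows
    terms = [10.0 ** (t - top) for t in log_terms]
    total = math.fsum(terms)
    return [t / total for t in terms]

def blend(values, fractions):
    """Population-weighted property over the microspecies fan."""
    return math.fsum(v * f for v, f in zip(values, fractions))

# Monoprotic acid with pKa ~3.5 at physiological pH: almost entirely anionic.
neutral, anion = microspecies_fractions([3.5], 7.4)
print(f"neutral={neutral:.5f} anion={anion:.5f}")
```

For a molecule with one acidic and one basic group, passing both pKa values yields three populations (cation, net-neutral, anion) from the same function.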
Range-separated meta-GGA wB97M-V with native VV10 dispersion + density fitting on both J and K (def2-svp-jkfit) + C-PCM ε=80, gpu4pyscf 1.7.0. Sidecar recipe; the SEAL recipes (v0.29-directqm-svp, v0.28.3) remain canonical. Currently sweeping an isolated 5,950-mol shadow atlas (4,533 universal cohort + 74 CompSol + 1,343 Tm polymer fusion); deep audit at n=1,013 reports 1,013 / 1,013 pristine (zero NaN / null / dim-mismatch / v0.29 contamination). Kernel refit + True-OOD re-verification gated behind cohort completion. NOT CERTIFIED until then. ⚠ Slug overlap with the planned v0.31 — relativistic physics entry below; final numbering to be locked before public release.
Some drugs (porphyrins, polycyclic aromatics with low-lying excitations) are poorly handled by single-reference B3LYP. v0.30 adds an automatic NEVPT2 / CASSCF fallback triggered by the diagnostic D1 / T1 indices.
Iodine (e.g. iodofenphos, levothyroxine), bromine (vemurafenib), and beyond. X2C scalar-relativistic Hamiltonian + appropriate basis sets. Drops the dipole / polarisability error on Z ≥ 35 molecules to <5%. ⚠ Slug overlap with the in-flight v0.31-gold-svp sidecar above; renumber one of the two before public release.
Harmonic ZPE + entropy is the standard approximation; for floppy molecules it under-counts entropy by ~1–2 kcal/mol. Vibrational SCF or quasi-harmonic correction tightens ΔG MAE another 30%.
Three new recipes: v0.30-d4 (B3LYP + D4 instead of D3BJ, ~5 % cost premium), v0.30-wb97mv (ωB97M-V + native VV10, ~12 % cost premium), v0.30-wb97mv-tz (MF4 reference, def2-TZVPD). v0.29 stays default; v0.30 ships alongside in physics_variants for customers who need the absolute SOTA claim.
Format & trust
Audit + chemistry + quantum + AI + legacy_vault. SHA-256 audit on every file. Rosetta Stone IO for 7 vendor formats. The contract that everything else hangs off.
Every .mfsig now persists the full molecular graph: RDKit-derived bond table, stereocenters with CIP labels, formal charges, ring systems, a separate connectivity-hash sub-SHA. No other format combines this with σ-profile.
Standard production tier + Compatibility tier (drop-in for legacy COSMO-RS pipelines) + Premium tier (pure-physics, zero empirical calibration on the benchmark). Each carries its own methodology hash and validation context. Same audit grade across all three.
v2.5 introduces physics_variants under quantum_and_thermodynamics — one molecule can carry results for multiple recipes in the same file, each independently signed. Chem-aware σ-moments (hb_donor_mass / hb_acceptor_mass / pi_negative_mass / polar_area_aa2 / halogen_mass) use bond-table classification (segment_chemistry v2.2). Backward compatible with v2.4.
audit_and_trust.lineage block: parent_sha256, genesis_sha256, history[]. Every derivation event (tier_upgrade, add_variant, add_pair_score) appends a signed entry. Replaces the old fallback semantics. Enables incremental upgrades without rebuild.
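A hash-chained history of this kind can be sketched as follows. The field names `parent_sha256`, `genesis_sha256` and `history`, and the event names, come from the description above; `entry_sha256`, `payload_sha256` and all function names are assumptions for illustration. The sketch chains plain SHA-256 digests only and does not perform any cryptographic signing.

```python
import hashlib
import json

def sha256_of(obj):
    """SHA-256 of a canonical JSON encoding (sorted keys, compact separators)."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def new_lineage(payload):
    """Start a lineage block for a freshly generated file."""
    return {"genesis_sha256": sha256_of(payload), "parent_sha256": None, "history": []}

def append_event(lineage, event, new_payload):
    """Append one derivation event, chained to the previous head of the history."""
    history = lineage["history"]
    head = history[-1]["entry_sha256"] if history else lineage["genesis_sha256"]
    entry = {"event": event, "parent_sha256": head, "payload_sha256": sha256_of(new_payload)}
    entry["entry_sha256"] = sha256_of(entry)  # digest covers the three fields above
    return {
        "genesis_sha256": lineage["genesis_sha256"],
        "parent_sha256": head,
        "history": history + [entry],
    }

def verify_lineage(lineage):
    """Walk the history from genesis and check every link and every entry digest."""
    expected_parent = lineage["genesis_sha256"]
    for entry in lineage["history"]:
        body = {k: v for k, v in entry.items() if k != "entry_sha256"}
        if entry["parent_sha256"] != expected_parent:
            return False
        if sha256_of(body) != entry["entry_sha256"]:
            return False
        expected_parent = entry["entry_sha256"]
    return True
```

Because each entry commits to its predecessor, an upgrade only appends to the history, and an auditor can detect any edit to an earlier event without recomputing the molecule.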
Native multi-conformer .mfsig with per-conformer SHAs + an ensemble-level hash binding them. Replay the ensemble; verify a single member; auditor accepts both.
Schema extensions for binary/ternary/n-ary mixtures and microspecies fans. Each file is a graph of signed sub-files (each microspecies / each conformer / each partner) with a top-level ensemble Merkle root.
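An ensemble-level Merkle root over the sub-file hashes could be computed along the following lines. This is a sketch under stated assumptions: leaves are the hex SHA-256 digests of the sub-files in a fixed, agreed order, and an odd node is paired with itself (a common convention, not necessarily the one the format will adopt).

```python
import hashlib

def merkle_root(leaf_hex_digests):
    """Fold a list of hex SHA-256 digests into a single hex root digest.

    Assumption: caller supplies leaves in a canonical order; on an odd level
    the last node is duplicated before pairing.
    """
    if not leaf_hex_digests:
        raise ValueError("an ensemble needs at least one member")
    level = [bytes.fromhex(h) for h in leaf_hex_digests]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

# Three stand-in sub-files (for example, three conformers of one molecule).
members = [hashlib.sha256(f"conformer-{i}".encode()).hexdigest() for i in range(3)]
print(merkle_root(members))
```

The tree shape is what allows a single member to be verified against the root with a short path of sibling hashes rather than the whole ensemble.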
Columnar Parquet flavour of .mfsig with σ-profile fixed-width vectors + per-row SHA. Vectorised scan over hundred-million-molecule libraries on a laptop. Same audit chain, different storage layout.
Every .mfsig signed with an RFC 3161 timestamp from an accredited TSA (DigiCert, GlobalSign, free.tsa.cz). Auditor can prove the file existed at a specific second — not just "the writer claims this date". Cryptographic complement to the existing SHA-256.
A Reference-grade .mfsig can carry N independent SHA-256 signatures from N labs that each recomputed and verified the cohort. Same file, multiple signers. Replaces "vendor trust" with "distributed trust" for the highest tier.
Recomputed σ for an existing molecule? Don't re-emit the whole file — emit a diff record signed by the new compute, parented to the previous SHA. Auditor walks the chain back to the original. Same idea as Git; we built it for chemistry.
CID = ipfs://Qm... derived from the file SHA. Anyone holding the CID re-fetches the bit-identical file from any peer who keeps it. Distributed-by-default reference cohorts; no central registry hostage situation.
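To show where the `Qm` prefix comes from, the sketch below wraps a SHA-256 digest as a multihash (code 0x12, length 0x20) and base58btc-encodes it, which is the CIDv0 text form. It is only an illustration of the encoding: identifiers produced by an IPFS node are normally computed over the chunked UnixFS representation of the file, so the string produced here should not be expected to match what a node reports for the same bytes. `cidv0_style_id` is a hypothetical name.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58btc(data):
    """Encode bytes with the Bitcoin base58 alphabet, preserving leading zero bytes."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    leading_zeros = len(data) - len(data.lstrip(b"\0"))
    return "1" * leading_zeros + encoded

def cidv0_style_id(content):
    """Base58btc text of a sha2-256 multihash over the raw bytes (illustrative only)."""
    digest = hashlib.sha256(content).digest()
    multihash = bytes([0x12, 0x20]) + digest  # 0x12 = sha2-256, 0x20 = 32-byte digest
    return base58btc(multihash)

print("ipfs://" + cidv0_style_id(b'{"example": "mfsig payload"}'))
```

Because the identifier is a pure function of the content, any peer can re-hash what it fetched and confirm it received the bit-identical file.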
Layer a NIST-PQC signature (e.g. Dilithium, standardised by NIST as ML-DSA in FIPS 204) on top of the SHA-256 once practical PQC tooling lands. SHA-256 is collision-resistant against classical attackers; the added signature protects against future quantum ones. Forward-compatible spec.
Tools & UX
Babylon.js viewer with split-CPK bonds, aromatic hairlines, dipole vector arrow, live σ-tile threshold slider, high-res PNG snapshot, shareable URL state, in-canvas legend. The reference web viewer for σ-profile data.
Click two atoms = distance. Three = angle. Four = dihedral. Plus auto-pharmacophore markers (HBD/HBA/aromatic/lipophilic) derived directly from σ-surface — most viewers fake these from heuristics; ours come from the actual cavity.
One click → a .pse or .cxs file that opens in PyMOL/ChimeraX with the same atoms, σ-surface mesh, dipole vector, and saved camera. Pharma teams keep their existing pipeline; we feed it richer data.
Babylon → WebXR. Stand at the centroid of atorvastatin; see donor / acceptor patches as glowing islands around you. Hand-controller σ-threshold slider. Demo + teaching tool, not a primary workflow.
anywidget-based Jupyter cell magic: `mol = read_mfsig('x.mfsig.json'); mol` renders the full 3D viewer inline. Drop into pharma data-science workflows without leaving the notebook.
Click any .mfsig.json file in the explorer; the 3D viewer opens in a custom-editor tab. Tooltip-on-hover shows the audit pack. Same shipping standard as image previewers — no setup required.
=MFSIG("aspirin", "dipole") returns 2.55 D. Functions for every key field. Office.js sideload that pharma analysts install in 30 seconds. Drudgery-collapse for spreadsheet-bound formulators.
GraphQL endpoint over the public atlas: filter by dipole, by HB donor mass, by sigma-similarity to a SMILES, by atom count. Streaming + pagination. Same audit chain via signed response headers.
Pinch-zoom, two-finger orbit, pencil-tap to label atoms, palette button to summon the σ-threshold slider. The boardroom-and-bench demo device, not just a thumbnail-sized scaling of desktop UX.
Two chemists open the same viewer URL; each sees the other's cursor, camera state, and slider position in real time. Yjs CRDT over WebRTC. "Slack thread for the molecule."
Server-rendered /status page driven by refs/generate_status_inventory.py: live cohort counts, Zero-Corruption audit summary, 3-gate verification verdict, True-OOD validation results. Customers verify the integrity of our data lake without account or API call.
Ecosystem & standards
Pull the schema out of the code into a versioned, RFC-style document with worked examples + JSON Schema files. Other implementers (academic groups, vendor reimplementations) can target the spec rather than reverse-engineering our writer.
Pull requests adding native .mfsig read / write to the major open chemistry toolkits. We carry the maintenance burden for the first two release cycles. Lowers adoption friction for every research group already on those stacks.
Engage with the IUPAC Computational Chemistry sub-committee on σ-profile interoperability. Goal: .mfsig listed as an IUPAC-recognised exchange format alongside InChI / SMILES / .sdf for σ-profile data specifically.
Every cohort release (drugs, polymers, atlas) gets a citable Zenodo DOI with the FAIR-data metadata. Academic users cite a versioned dataset, not a moving target. ResearchGate-grade transparency.
Partner with three independent labs to re-compute the same 50-drug subset on different hardware + different DFT codes (PySCF, Q-Chem, ORCA). Publish the cross-lab variance. Empirically establishes the .mfsig audit guarantee.
Apache-2.0 CLI on PyPI + Homebrew + apt. Read/write/verify/convert. No backend dependency — runs entirely locally. The on-ramp that lets anyone produce + audit .mfsig files without a paid account.
How to read this roadmap
- ● SHIPPED — verifiable today: in the public spec, the published cohort, or this codebase.
- ● IN FLIGHT — the work runs right now on our compute; deliverable name + version is committed.
- ● NEXT — scoped, sized, dated. We start when the predecessors clear.
- ● PLANNED — architectural sketch exists; dependencies declared above; quarter-precision dates.
- ● R&D — speculative; year-precision; we publish iff it works, otherwise we say so.
Dependencies are explicit. A milestone can't slip its "depends on" predecessors silently — and when one does slip, it propagates downstream and the roadmap reflects it on the next refresh. Where we land vs. the original date is something we publish openly in the changelog.