Research Pipeline
Technical workflow map of the SELFIES-Transformer molecular discovery engine, from chemical pretraining to docking evidence and hosted visualization.
End-to-End Workflow
ZINC Pretraining Corpus
A large drug-like chemical library whose molecules teach the model general molecular syntax before any target-specific training.
SMILES Standardization
Input molecular strings are cleaned, canonicalized, and prepared for robust representation conversion.
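A standardization step like this one can be sketched with RDKit's MolStandardize utilities; the exact cleanup rules used by the pipeline are not specified here, so this is a minimal illustrative version.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles):
    """Clean and canonicalize a SMILES string; returns None if unparseable."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)          # normalize functional groups, neutralize
    mol = rdMolStandardize.FragmentParent(mol)   # keep the largest organic fragment (desalt)
    return Chem.MolToSmiles(mol)                 # canonical SMILES

print(standardize("C1=CC=CC=C1C(=O)O"))  # benzoic acid, canonicalized
```

Canonicalization ensures each molecule maps to exactly one string, so duplicates can be removed before representation conversion.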
SELFIES Conversion
Canonical SMILES are converted into SELFIES to improve molecular validity during generation.
SELFIES Tokenization
SELFIES strings are split into tokens and mapped into a model vocabulary for sequence learning.
Decoder-only Transformer
Autoregressive Transformer learns next-token prediction over molecular SELFIES sequences.
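A decoder-only model is an embedding stack plus causally masked self-attention layers and a vocabulary-sized output head. The sketch below uses PyTorch with illustrative sizes; the project's actual depth, width, and head count are not given here.

```python
import torch
import torch.nn as nn

class SelfiesGPT(nn.Module):
    """Minimal decoder-only Transformer for next-token prediction (illustrative sizes)."""
    def __init__(self, vocab_size, d_model=256, nhead=8, nlayers=4, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        B, T = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        x = self.blocks(x, mask=mask)   # causal mask: each token attends only to its past
        return self.head(x)             # (B, T, vocab) logits for next-token prediction

model = SelfiesGPT(vocab_size=64)
logits = model(torch.randint(0, 64, (2, 16)))  # batch of 2 sequences of length 16
```

Training minimizes cross-entropy between these logits and the input shifted one token left.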
HIV-1 Protease Fine-tuning
The generator is specialized using curated HIV-1 protease inhibitor records.
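Fine-tuning typically means continuing the same next-token objective on the target dataset at a reduced learning rate. A generic sketch, assuming a pretrained `model` with a `(B, T) -> (B, T, vocab)` interface and a loader over tokenized HIV-1 protease inhibitor SELFIES (both hypothetical names):

```python
import torch
import torch.nn as nn

def finetune(model, ft_loader, epochs=5, lr=1e-5):
    """Continue next-token training on the target set; low lr to avoid forgetting."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inp, tgt in ft_loader:          # tgt is inp shifted one token left
            logits = model(inp)             # (B, T, vocab)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```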
Molecule Generation
The fine-tuned model samples new candidate inhibitor molecules from learned chemical space.
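Sampling walks the model one token at a time, drawing each next token from the softmax distribution until an end token appears. The temperature knob and token ids below are illustrative:

```python
import torch

@torch.no_grad()
def sample(model, bos_id, eos_id, max_len=64, temperature=1.0):
    """Autoregressively sample one token sequence from a next-token model."""
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(ids)[:, -1, :] / temperature  # logits for the next position
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)            # stochastic draw, not argmax
        ids = torch.cat([ids, nxt], dim=1)
        if nxt.item() == eos_id:
            break
    return ids[0].tolist()

# Stand-in model with uniform logits over an 8-token vocabulary, for demonstration.
uniform = lambda ids: torch.zeros(ids.size(0), ids.size(1), 8)
seq = sample(uniform, bos_id=0, eos_id=1, max_len=10)
```

The sampled id sequence is then mapped back to SELFIES tokens via the vocabulary and decoded.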
RDKit Validation
Generated molecules are decoded, validated, canonicalized, and checked for chemical consistency.
Drug-likeness Filtering
Candidates are filtered using molecular descriptors such as QED, SA, molecular weight, and logP.
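A descriptor gate of this kind can be written directly with RDKit; the thresholds below are common rule-of-thumb values, not the project's exact cutoffs, and the SA score is omitted because it lives in RDKit's contrib `sascorer` module rather than the core API:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def passes_druglike(smiles, qed_min=0.5, mw_max=500.0, logp_max=5.0):
    """Illustrative descriptor gate; thresholds are placeholders."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (QED.qed(mol) >= qed_min
            and Descriptors.MolWt(mol) <= mw_max
            and Descriptors.MolLogP(mol) <= logp_max)
```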
PAINS/Brenk Filtering
Problematic structural alerts and undesirable substructures are removed from the candidate pool.
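RDKit ships both alert sets as built-in filter catalogs, so this stage reduces to a substructure match per candidate:

```python
from rdkit import Chem
from rdkit.Chem import FilterCatalog

params = FilterCatalog.FilterCatalogParams()
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.PAINS)
params.AddCatalog(FilterCatalog.FilterCatalogParams.FilterCatalogs.BRENK)
catalog = FilterCatalog.FilterCatalog(params)

def has_alerts(smiles):
    """True if the molecule matches any PAINS or Brenk structural alert."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is None or catalog.HasMatch(mol)
```

Molecules for which `has_alerts` returns True are dropped from the candidate pool.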
Butina Diversity Clustering
Fingerprint-based Butina clustering groups similar candidates and keeps one representative per cluster, reducing redundancy before computational docking.
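A minimal sketch with RDKit's Butina implementation, using Morgan fingerprints and Tanimoto distance; the 0.4 distance cutoff is illustrative:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

smiles = ["CCO", "CCCO", "c1ccccc1", "Cc1ccccc1"]  # toy candidate pool
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048) for m in mols]

# Butina expects a flattened lower-triangle distance matrix (1 - Tanimoto).
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend(1.0 - s for s in sims)

clusters = Butina.ClusterData(dists, len(fps), 0.4, isDistData=True)
picks = [c[0] for c in clusters]  # one centroid index per cluster -> diverse subset
```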
AutoDock Vina Docking
Selected candidates are docked against the prepared HIV-1 protease receptor active site.
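Programmatic docking can be driven through AutoDock Vina's Python bindings (the `vina` package). The file names, box center, and box size below are placeholders, not the project's actual receptor preparation; both receptor and ligand must already be in PDBQT format:

```python
from vina import Vina

v = Vina(sf_name="vina")                        # Vina scoring function
v.set_receptor("hiv1_protease.pdbqt")           # placeholder: prepared receptor file
v.set_ligand_from_file("candidate.pdbqt")       # placeholder: prepared ligand file
v.compute_vina_maps(center=[13.0, 22.0, 5.0],   # placeholder active-site box center
                    box_size=[20.0, 20.0, 20.0])
v.dock(exhaustiveness=8, n_poses=9)             # search; binding scores in kcal/mol
v.write_poses("candidate_out.pdbqt", n_poses=5)
```

The top pose score per candidate is what the pipeline ranks on in the next stage.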
ADMET Screening
Top-ranked molecules are summarized using screening-level pharmacokinetic and toxicity indicators.
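"Screening-level" here means fast descriptor proxies rather than full ADMET prediction; a minimal RDKit summary of the kind often used at this stage might look like the following (the descriptor choice is an assumption, not the project's exact panel):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def admet_summary(smiles):
    """Screening-level indicators only; a proxy, not a full ADMET prediction."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "tpsa": Descriptors.TPSA(mol),                  # <=140 loosely suggests absorption
        "hbd": Lipinski.NumHDonors(mol),                # H-bond donors
        "hba": Lipinski.NumHAcceptors(mol),             # H-bond acceptors
        "rot_bonds": Descriptors.NumRotatableBonds(mol) # flexibility proxy
    }

print(admet_summary("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```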
Web Research Demo
Precomputed results, descriptors, docked poses, and molecular scenes are served through the hosted interface.