Methodology & Documentation
Technical summary of the data strategy, molecular representation, model design, generation workflow, filtering, docking, ADMET screening, and limitations of the SELFIES-Transformer discovery pipeline.
Research Objective
The project focuses on de novo generation and computational prioritization of candidate molecules for HIV-1 protease inhibition. The aim is to demonstrate an end-to-end AI-assisted molecular discovery workflow, not to claim experimentally validated drug activity.
Dataset Strategy
The workflow uses a large ZINC chemical corpus for molecular pretraining and curated HIV-1 protease inhibitor records for target-specific fine-tuning. This two-stage strategy first learns general molecular syntax and then biases generation toward a relevant therapeutic target domain.
Molecular Representation
The pipeline uses SELFIES as the generative molecular representation. SELFIES is useful for molecular generation because it is designed to decode into valid molecular graphs, reducing the invalid-output problem often seen in SMILES-based generation.
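A SELFIES string is a sequence of bracketed symbols, so a language model can treat each symbol as one token. The sketch below shows one possible way to split a string into symbols and build an integer vocabulary using only the Python standard library. The function names and the special tokens `<pad>`, `<bos>`, and `<eos>` are illustrative assumptions, not the project's actual tokenizer.

```python
import re

# Matches one bracketed SELFIES symbol such as [C], [=O], or [Branch1].
TOKEN_PATTERN = re.compile(r"\[[^\[\]]+\]")


def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    tokens = TOKEN_PATTERN.findall(selfies)
    # Reject input with characters outside the brackets.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer id to each distinct symbol.

    The three special tokens are hypothetical placeholders.
    """
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2}
    for selfies in corpus:
        for token in tokenize_selfies(selfies):
            vocab.setdefault(token, len(vocab))
    return vocab


# The example string shown in this section.
example = "[C][C][=C][Branch1][Ring][C][=C][O][N][C][=O]"
tokens = tokenize_selfies(example)
vocab = build_vocab([example])
token_ids = [vocab[token] for token in tokens]
```

This only handles the string syntax. Converting between SELFIES and molecular graphs would be left to a dedicated cheminformatics library.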
[C][C][=C][Branch1][Ring][C][=C][O][N][C][=O]
Model Architecture
The generator follows a decoder-only Transformer setup trained with an autoregressive next-token prediction objective over SELFIES sequences. This allows the model to learn token dependencies and generate new molecular strings sequentially.
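To make the next-token objective concrete, the following sketch builds the input and target sequences for one tokenized molecule and the causal mask that stops each position from attending to later ones. The token ids and the `bos_id` and `eos_id` values are assumptions chosen for illustration; the actual model code is not shown in this document.

```python
def next_token_pairs(token_ids, bos_id=1, eos_id=2):
    """Return (inputs, targets), where targets is inputs shifted left by one."""
    sequence = [bos_id] + list(token_ids) + [eos_id]
    return sequence[:-1], sequence[1:]


def causal_mask(length):
    """Boolean matrix: row i may attend to column j only when j <= i."""
    return [[col <= row for col in range(length)] for row in range(length)]


# Hypothetical ids for a three-token molecule.
inputs, targets = next_token_pairs([3, 3, 4])
mask = causal_mask(len(inputs))
```

At position i the model sees `inputs[0..i]` and is trained to predict `targets[i]`, which is what allows it to generate a molecule one token at a time.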
Training Strategy
The model is first pretrained on a broad chemical corpus and then fine-tuned on HIV-1 protease inhibitor records. This transfer-learning setup helps combine broad chemical language learning with target-focused molecular generation.
Candidate Generation
The fine-tuned model samples candidate molecules, which are decoded, canonicalized, and evaluated using cheminformatics descriptors before downstream filtering and docking.
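The sampling step can be sketched as a loop that repeatedly draws the next token from the model's output distribution until an end token appears. The version below uses temperature sampling, which is an assumption: the document does not state which sampling scheme the pipeline uses. `toy_logits` is a hypothetical stand-in for the fine-tuned model.

```python
import math
import random


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, bos_id, eos_id, max_len=64, temperature=1.0, rng=random):
    """Sample token ids until eos_id is drawn or max_len tokens are produced."""
    sequence = [bos_id]
    for _ in range(max_len):
        token = sample_token(next_logits(sequence), temperature, rng)
        if token == eos_id:
            break
        sequence.append(token)
    return sequence[1:]  # drop the start token


def toy_logits(sequence):
    """Stand-in model: strongly prefers token 3, then the end token (id 2)."""
    logits = [-1000.0] * 5
    logits[3 if len(sequence) < 4 else 2] = 0.0
    return logits


sampled = generate(toy_logits, bos_id=1, eos_id=2, rng=random.Random(0))
```

In a real run the sampled ids would be mapped back to SELFIES symbols, decoded, and canonicalized so that duplicate molecules can be removed before scoring.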
Filtering and Ranking
Generated molecules are screened using validity checks, drug-likeness indicators, synthetic accessibility, PAINS/Brenk filtering, and diversity selection. The goal is to reduce the generated set to a smaller candidate pool that is computationally practical to dock.
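As an illustration of the diversity-selection step, the sketch below implements greedy MaxMin picking over Tanimoto distances. MaxMin is a common choice for this task, but it is an assumption here, since the document does not name the selection method. Fingerprints are represented as plain sets of on-bit indices, and the values are toy data.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def maxmin_select(fingerprints, k):
    """Greedily pick up to k indices, each maximizing its distance
    to the nearest already-selected fingerprint."""
    if not fingerprints or k <= 0:
        return []
    selected = [0]  # seed with the first candidate
    while len(selected) < min(k, len(fingerprints)):
        best_index, best_distance = None, -1.0
        for index, fingerprint in enumerate(fingerprints):
            if index in selected:
                continue
            nearest = min(
                1.0 - tanimoto(fingerprint, fingerprints[chosen])
                for chosen in selected
            )
            if nearest > best_distance:
                best_index, best_distance = index, nearest
        selected.append(best_index)
    return selected


# Toy fingerprints: candidates 0, 1, and 3 are similar; candidate 2 is not.
toy_fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
picked = maxmin_select(toy_fingerprints, k=3)
```

The structurally distinct candidate is picked second, ahead of the near-duplicates, which is the behavior a diversity filter is meant to provide.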
Docking Setup
Selected candidates are docked against a prepared HIV-1 protease receptor structure using AutoDock Vina. Docking scores serve as computational prioritization signals: Vina reports an estimated binding affinity in kcal/mol, so lower (more negative) scores indicate stronger predicted binding within this docking setup.
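A minimal sketch of the ranking step, assuming the docked poses are saved in Vina's PDBQT output, where each pose carries a `REMARK VINA RESULT:` line whose first number is the score. The ligand names and scores below are made-up placeholders, not results from this project.

```python
def parse_vina_scores(pdbqt_text):
    """Collect the score from every 'REMARK VINA RESULT:' line."""
    scores = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # The first field after the colon is the score in kcal/mol.
            scores.append(float(line.split(":", 1)[1].split()[0]))
    return scores


def rank_by_best_score(poses_by_ligand):
    """Sort ligands by their best (lowest) score, strongest predicted binder first."""
    best = {}
    for name, text in poses_by_ligand.items():
        scores = parse_vina_scores(text)
        if scores:  # skip ligands with no scored pose
            best[name] = min(scores)
    return sorted(best.items(), key=lambda item: item[1])


# Placeholder docking outputs for two hypothetical ligands.
toy_outputs = {
    "ligand_a": "MODEL 1\nREMARK VINA RESULT:    -7.2      0.000      0.000\nENDMDL\n",
    "ligand_b": (
        "MODEL 1\nREMARK VINA RESULT:    -9.1      0.000      0.000\nENDMDL\n"
        "MODEL 2\nREMARK VINA RESULT:    -8.5      1.732      2.914\nENDMDL\n"
    ),
}
ranking = rank_by_best_score(toy_outputs)
```

Sorting in ascending order puts the most negative score first, matching the convention that lower Vina scores mean stronger predicted binding.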
ADMET Screening
Top-ranked molecules are summarized using screening-level ADMET indicators such as Lipinski violations, gastrointestinal absorption, BBB permeability flags, and PAINS/Brenk status where available.
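As an example of one such indicator, the Lipinski violation count can be computed directly from four descriptors using the standard rule-of-five thresholds. The sketch assumes the descriptors have already been calculated by a cheminformatics toolkit; the function name and the input values are illustrative.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count rule-of-five violations from precomputed descriptors."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # octanol-water logP above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Illustrative descriptor values, not project results.
small_molecule = lipinski_violations(350.4, 2.1, 2, 5)
large_molecule = lipinski_violations(720.9, 6.3, 4, 12)
```

A count of zero or one is usually treated as acceptable at screening level, while higher counts flag a molecule for closer review.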
Study Limitations
This is a computational research demo. Docking and ADMET outputs are useful for prioritization, but they do not replace experimental validation. No therapeutic, clinical, or biological efficacy claim should be made from these results alone.
Reproducibility Notes
- The hosted site presents precomputed outputs from the molecular generation and docking workflow.
- The 3D viewer visualizes saved receptor and docked ligand pose files in the browser.
- Candidate rankings are based on exported computational screening results.
- Experimental validation is required before any biological activity claim.