Methodology & Documentation

Technical summary of the data strategy, molecular representation, model design, generation workflow, filtering, docking, ADMET screening, and limitations of the SELFIES-Transformer discovery pipeline.

Research Objective

The project focuses on de novo generation and computational prioritization of candidate molecules for HIV-1 protease inhibition. The aim is to demonstrate an end-to-end AI-assisted molecular discovery workflow rather than claim experimentally validated drug activity.

Dataset Strategy

The workflow uses a large ZINC chemical corpus for molecular pretraining and curated HIV-1 protease inhibitor records for target-specific fine-tuning. This two-stage strategy first learns general molecular syntax and then biases generation toward a relevant therapeutic target domain.

Molecular Representation

The pipeline uses SELFIES (SELF-referencing Embedded Strings) as the generative molecular representation. SELFIES suits molecular generation because it is designed so that any sequence of its tokens decodes into a valid molecular graph, which reduces the invalid-output problem often seen in SMILES-based generation.

Representation Example (SELFIES)

[C][C][=C][Branch1][Ring][C][=C][O][N][C][=O]
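
As a rough illustration of how such a string becomes model input, the sketch below splits the example into bracketed tokens and maps them to integer ids. It is a purely lexical split that does not check whether each token belongs to the SELFIES alphabet; the function names and the special tokens `<pad>`, `<bos>`, and `<eos>` are assumptions made for this example, not details taken from the pipeline.

```python
import re

# Each SELFIES symbol is a bracketed token such as [C], [=O], or [Branch1].
TOKEN_PATTERN = re.compile(r"\[[^\[\]]+\]")


def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens (lexical split only)."""
    tokens = TOKEN_PATTERN.findall(selfies)
    # Reject input that contains characters outside of bracketed tokens.
    if "".join(tokens) != selfies:
        raise ValueError(f"unexpected characters in SELFIES string: {selfies!r}")
    return tokens


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer id to every token seen in the corpus."""
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2}  # hypothetical special tokens
    for selfies in corpus:
        for token in tokenize_selfies(selfies):
            # len(vocab) is the next free id when the token is new.
            vocab.setdefault(token, len(vocab))
    return vocab


example = "[C][C][=C][Branch1][Ring][C][=C][O][N][C][=O]"
tokens = tokenize_selfies(example)
vocab = build_vocab([example])
token_ids = [vocab[t] for t in tokens]
```

Treating each bracketed symbol as one token keeps the vocabulary small and lets the model operate on chemically meaningful units rather than individual characters.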

Model Architecture

The generator follows a decoder-only Transformer setup trained with an autoregressive next-token prediction objective over SELFIES sequences. This allows the model to learn token dependencies and generate new molecular strings sequentially.
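
The next-token objective can be sketched without any deep-learning framework. The example below shows the one-position shift between inputs and targets, the causal mask that stops a position from attending to later tokens, and the mean negative log-likelihood used as the loss. The token ids and helper names are hypothetical, and the probabilities come from a stand-in uniform model rather than the actual Transformer.

```python
import math


def make_training_pair(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Shift a sequence by one so position t is trained to predict token t+1."""
    return token_ids[:-1], token_ids[1:]


def causal_mask(length: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def next_token_loss(probs: list[list[float]], targets: list[int]) -> float:
    """Mean negative log-likelihood of the target token at each position."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


# Hypothetical ids: <bos>=1, <eos>=2, remaining ids are SELFIES tokens.
sequence = [1, 3, 3, 4, 2]
inputs, targets = make_training_pair(sequence)
mask = causal_mask(len(inputs))

# Stand-in model: uniform probabilities over a 5-token vocabulary,
# so the loss equals log(5).
uniform = [[0.2] * 5 for _ in targets]
loss = next_token_loss(uniform, targets)
```

A trained model should drive this loss well below the uniform baseline, since it learns which SELFIES tokens tend to follow a given prefix.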

Training Strategy

The model is first pretrained on a broad chemical corpus and then fine-tuned on HIV-1 protease inhibitor records. This transfer-learning setup helps combine broad chemical language learning with target-focused molecular generation.

Candidate Generation

The fine-tuned model samples candidate molecules, which are decoded, canonicalized, and evaluated using cheminformatics descriptors before downstream filtering and docking.
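
The sampling step might look like the sketch below, which draws tokens one at a time with temperature scaling and optional top-k truncation until an end-of-sequence token appears. The source does not state which sampling strategy or settings the pipeline uses, so the strategy, the parameter values, and the `toy_logits` stand-in for the fine-tuned model are all assumptions. Decoding the sampled tokens into molecules and canonicalizing them is left to the project's cheminformatics tooling and is not shown.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token id from logits using temperature and optional top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Token ids ordered from most to least likely.
    candidates = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Subtract the peak before exponentiating for numerical stability.
    peak = scaled[candidates[0]]
    weights = [math.exp(scaled[i] - peak) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def generate(next_logits, bos_id, eos_id, max_len=64, **sampling):
    """Autoregressively sample token ids until <eos> or max_len is reached."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        token = sample_next_token(next_logits(tokens), **sampling)
        if token == eos_id:
            break
        tokens.append(token)
    return tokens[1:]  # drop the <bos> marker


def toy_logits(prefix):
    """Hypothetical stand-in for the model: ids 0-2 are <pad>, <bos>, <eos>."""
    if len(prefix) > 3:
        return [-1e9, -1e9, 5.0, 0.0, 0.0]  # strongly favor <eos>
    return [-1e9, -1e9, -1e9, 1.0, 0.5]  # only molecule tokens 3 and 4


rng = random.Random(0)  # fixed seed so the run is repeatable
sampled = generate(toy_logits, bos_id=1, eos_id=2, max_len=16,
                   temperature=0.8, top_k=3, rng=rng)
```

Lower temperatures concentrate sampling on the most likely tokens, while higher values trade some target focus for novelty.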

Filtering and Ranking

Generated molecules are screened using validity checks, drug-likeness indicators, synthetic accessibility, PAINS (pan-assay interference compounds) and Brenk structural-alert filtering, and diversity selection. The goal is to reduce the generated set to a smaller, computationally practical candidate pool.
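
The source does not say how diversity selection is performed. One common approach, shown here only as an assumption, is greedy MaxMin picking over Tanimoto similarity between molecular fingerprints. In this sketch the fingerprints are plain sets of "on" bit indices with made-up values; a real run would compute them with a cheminformatics toolkit.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_diverse(fingerprints: dict[str, frozenset], n_pick: int) -> list[str]:
    """Greedy MaxMin picking: repeatedly add the molecule least similar to the picks."""
    names = list(fingerprints)
    if not names or n_pick <= 0:
        return []
    picked = [names[0]]  # seed with the first (for example, best-ranked) molecule
    while len(picked) < min(n_pick, len(names)):
        remaining = [n for n in names if n not in picked]
        # Score each candidate by its highest similarity to anything already
        # picked, then take the candidate with the lowest such score.
        best = min(
            remaining,
            key=lambda n: max(tanimoto(fingerprints[n], fingerprints[p]) for p in picked),
        )
        picked.append(best)
    return picked


# Hypothetical fingerprints, written as sets of "on" bit indices.
fps = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3, 5}),  # near-duplicate of mol_a
    "mol_c": frozenset({10, 11, 12}),  # unrelated scaffold
    "mol_d": frozenset({3, 4, 10}),
}
diverse = select_diverse(fps, n_pick=3)
```

In this toy set the near-duplicate `mol_b` is the molecule left out, which is the behavior diversity selection is meant to provide before the more expensive docking step.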

Docking Setup

Selected candidates are docked against a prepared HIV-1 protease receptor structure using AutoDock Vina. Docking scores serve as computational prioritization signals: Vina reports an estimated binding affinity in kcal/mol, so lower (more negative) scores indicate stronger predicted binding within this docking setup.

Docking Tool: AutoDock Vina

Output: Docking scores + docked poses
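
Turning docking output into a ranking can be sketched as follows: keep the best-scoring pose for each ligand, then sort ligands from most to least negative score. The `DockedPose` record, the ligand ids, and the scores are invented for illustration and are not results from the pipeline.

```python
from dataclasses import dataclass


@dataclass
class DockedPose:
    ligand_id: str
    pose_rank: int
    score_kcal_mol: float  # Vina score; more negative = stronger predicted binding


def rank_by_best_score(poses: list[DockedPose]) -> list[DockedPose]:
    """Keep each ligand's best (lowest) score, then sort ligands ascending."""
    best: dict[str, DockedPose] = {}
    for pose in poses:
        current = best.get(pose.ligand_id)
        if current is None or pose.score_kcal_mol < current.score_kcal_mol:
            best[pose.ligand_id] = pose
    return sorted(best.values(), key=lambda p: p.score_kcal_mol)


# Made-up scores for illustration only.
poses = [
    DockedPose("cand_001", 1, -8.4),
    DockedPose("cand_001", 2, -7.9),
    DockedPose("cand_002", 1, -9.6),
    DockedPose("cand_003", 1, -6.2),
]
ranking = [p.ligand_id for p in rank_by_best_score(poses)]
```

Because scores from different docking setups are not directly comparable, a ranking like this is only meaningful among candidates docked with the same receptor preparation and search settings.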

ADMET Screening

Top-ranked molecules are summarized using screening-level ADMET indicators such as Lipinski violations, gastrointestinal absorption, blood-brain barrier (BBB) permeability flags, and PAINS/Brenk status where available.
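
Of these indicators, the Lipinski violation count is simple enough to sketch directly. The function below applies the standard rule-of-five thresholds (molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors) to descriptors that are assumed to have been computed already. The candidate names and descriptor values are made up for the example.

```python
def lipinski_violations(mol_weight: float, logp: float,
                        h_donors: int, h_acceptors: int) -> int:
    """Count how many rule-of-five thresholds a molecule exceeds."""
    checks = [
        mol_weight > 500,   # molecular weight in daltons
        logp > 5,           # octanol-water partition coefficient
        h_donors > 5,       # hydrogen-bond donors
        h_acceptors > 10,   # hydrogen-bond acceptors
    ]
    return sum(checks)


# Hypothetical descriptor values for two candidates.
candidates = {
    "cand_001": {"mol_weight": 350.4, "logp": 2.1, "h_donors": 2, "h_acceptors": 5},
    "cand_002": {"mol_weight": 720.9, "logp": 6.3, "h_donors": 4, "h_acceptors": 12},
}
summary = {name: lipinski_violations(**desc) for name, desc in candidates.items()}
```

A violation count is a coarse screening flag rather than a verdict; some approved drugs, including several HIV-1 protease inhibitors, exceed one or more of these thresholds.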

Study Limitations

This is a computational research demo. Docking and ADMET outputs are useful for prioritization, but they do not replace experimental validation. No therapeutic, clinical, or biological efficacy claim should be made from these results alone.

Reproducibility Notes

The hosted site presents precomputed outputs from the molecular generation and docking workflow.
The 3D viewer visualizes saved receptor and docked ligand pose files in the browser.
Candidate rankings are based on exported computational screening results.
Experimental validation is required before any biological activity claim.