HIV-1 Discovery

Research Pipeline

Technical workflow map of the SELFIES-Transformer molecular discovery engine, from chemical pretraining to docking evidence and hosted visualization.

End-to-End Workflow

Stage 01

ZINC Pretraining Corpus

Large chemical library used to learn general molecular syntax and drug-like chemical space.

Output Artifact
Pretraining molecule corpus
Stage 02

SMILES Standardization

Input SMILES strings are cleaned and canonicalized so that each molecule has one consistent representation before conversion.

Output Artifact
Canonical SMILES records
Stage 03

SELFIES Conversion

Canonical SMILES are converted into SELFIES, a representation in which every symbol sequence decodes to a valid molecule, which improves molecular validity during generation.

Output Artifact
SELFIES molecular strings
Stage 04

SELFIES Tokenization

SELFIES strings are split into their bracketed symbols, and each symbol is mapped to an integer id in the model vocabulary for sequence learning.

Output Artifact
SELFIES token vocabulary
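The tokenization step can be illustrated with a minimal, standard-library sketch. It relies only on the fact that SELFIES symbols are bracket-delimited; the special tokens (`<pad>`, `<bos>`, `<eos>`, `<unk>`) and the helper names are assumptions made for illustration and may differ from what the pipeline actually uses.

```python
import re

# SELFIES symbols are bracket-delimited, e.g. "[C][=O][Ring1]".
SYMBOL_PATTERN = re.compile(r"\[[^\[\]]*\]")

# Special tokens are an assumption for illustration, not taken from the pipeline.
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<unk>"]


def split_symbols(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    tokens = SYMBOL_PATTERN.findall(selfies)
    # Reject input with characters outside the bracketed symbols.
    # Dot-separated fragments are deliberately not handled in this sketch.
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Map special tokens and every observed symbol to an integer id."""
    symbols = sorted({tok for s in corpus for tok in split_symbols(s)})
    return {tok: i for i, tok in enumerate(SPECIAL_TOKENS + symbols)}


def encode(selfies: str, vocab: dict[str, int]) -> list[int]:
    """Convert a SELFIES string to ids wrapped in <bos> ... <eos>."""
    unk = vocab["<unk>"]
    body = [vocab.get(tok, unk) for tok in split_symbols(selfies)]
    return [vocab["<bos>"], *body, vocab["<eos>"]]


corpus = ["[C][C][O]", "[C][=O]"]
vocab = build_vocab(corpus)
ids = encode("[C][C][O]", vocab)
```

Sorting the symbols before assigning ids keeps the vocabulary reproducible across runs, which matters when a pretrained checkpoint is later reused for fine-tuning.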
Stage 05

Decoder-only Transformer

An autoregressive Transformer is trained with a next-token prediction objective over SELFIES token sequences.

Output Artifact
Pretrained molecular generator
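As a rough illustration of next-token prediction, the sketch below shows the two ingredients a decoder-only model typically needs from the data side: targets that are the inputs shifted by one position, and a causal mask that stops each position from attending to later ones. The token ids are hypothetical, and no model is defined here.

```python
def next_token_pairs(token_ids: list[int]) -> tuple[list[int], list[int]]:
    """Return (inputs, targets), where targets are inputs shifted left by one."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a training pair")
    return token_ids[:-1], token_ids[1:]


def causal_mask(length: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


# Hypothetical ids standing for: <bos> [C] [C] [O] <eos>
sequence = [1, 5, 5, 6, 2]
inputs, targets = next_token_pairs(sequence)
mask = causal_mask(len(inputs))
```

At position `i` the model sees `inputs[: i + 1]` and is scored on how well it predicts `targets[i]`.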
Stage 06

HIV-1 Protease Fine-tuning

The generator is specialized using curated HIV-1 protease inhibitor records.

Output Artifact
Target-focused generator
Stage 07

Molecule Generation

The fine-tuned model samples new candidate inhibitor molecules from learned chemical space.

Output Artifact
Generated candidate set
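Sampling from an autoregressive generator can be sketched as a loop that repeatedly draws the next token from a temperature-scaled softmax until an end-of-sequence token appears. The sampling strategy, the temperature, and the `toy_logits` stand-in for the model are all assumptions for illustration; the source does not say which decoding scheme is used.

```python
import math
import random
from typing import Callable


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(
    next_logits: Callable[[list[int]], list[float]],
    bos_id: int,
    eos_id: int,
    max_len: int,
    temperature: float,
    rng: random.Random,
) -> list[int]:
    """Draw tokens one at a time until <eos> is sampled or max_len is reached."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        probs = softmax(next_logits(tokens), temperature)
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens


def toy_logits(prefix: list[int]) -> list[float]:
    """Hypothetical stand-in for the model. Ids: 0=<bos>, 1=<eos>, 2=[C], 3=[O]."""
    ninf = float("-inf")
    if len(prefix) >= 4:
        return [ninf, 5.0, 0.0, 0.0]  # strongly prefer ending the sequence
    return [ninf, ninf, 1.0, 0.5]     # <eos> is not allowed yet


sampled = sample_sequence(toy_logits, bos_id=0, eos_id=1, max_len=8,
                          temperature=0.8, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps a sampling run reproducible without touching global random state.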
Stage 08

RDKit Validation

Generated molecules are decoded, validated, canonicalized, and checked for chemical consistency.

Output Artifact
Valid canonical molecules
Stage 09

Drug-likeness Filtering

Candidates are filtered on molecular descriptors such as QED (quantitative estimate of drug-likeness), SA (synthetic accessibility) score, molecular weight, and logP.

Output Artifact
Drug-like candidate subset
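A descriptor filter of this kind reduces to a set of threshold checks once the descriptors have been computed. The sketch below assumes the descriptors are already available per candidate; every cut-off and every descriptor value shown is an illustrative placeholder, not a setting or result from the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str      # hypothetical identifier
    qed: float     # quantitative estimate of drug-likeness, 0 to 1
    sa: float      # synthetic accessibility score, 1 (easy) to 10 (hard)
    mol_wt: float  # molecular weight in Da
    logp: float    # octanol-water partition coefficient (log scale)


@dataclass(frozen=True)
class Thresholds:
    # All cut-offs are illustrative assumptions, not the pipeline's settings.
    min_qed: float = 0.5
    max_sa: float = 5.0
    max_mol_wt: float = 600.0
    max_logp: float = 5.0


def passes(c: Candidate, t: Thresholds) -> bool:
    """A candidate is kept only if it satisfies every threshold."""
    return (
        c.qed >= t.min_qed
        and c.sa <= t.max_sa
        and c.mol_wt <= t.max_mol_wt
        and c.logp <= t.max_logp
    )


def filter_candidates(cands: list[Candidate],
                      t: Thresholds = Thresholds()) -> list[Candidate]:
    return [c for c in cands if passes(c, t)]


# Placeholder descriptor values, made up for this example.
candidates = [
    Candidate("cand-1", qed=0.62, sa=3.1, mol_wt=512.4, logp=3.8),
    Candidate("cand-2", qed=0.31, sa=4.0, mol_wt=480.0, logp=2.9),  # fails QED
    Candidate("cand-3", qed=0.70, sa=6.2, mol_wt=450.2, logp=4.1),  # fails SA
]
kept = filter_candidates(candidates)
```

Keeping the thresholds in one small object makes it easy to record which cut-offs produced a given candidate subset.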
Stage 10

PAINS/Brenk Filtering

Candidates that match PAINS or Brenk structural alerts, or contain other undesirable substructures, are removed from the candidate pool.

Output Artifact
Clean screening subset
Stage 11

Butina Diversity Clustering

Butina clustering groups structurally similar candidates so that a diverse subset can be selected, reducing redundancy before computational docking.

Output Artifact
Diverse docking set
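The Butina procedure itself is short enough to sketch in plain Python: build neighbor lists at a similarity cut-off, then repeatedly take the unassigned molecule with the most neighbors as a cluster centroid. Here fingerprints are represented as sets of "on" bit indices and compared with Tanimoto similarity; the fingerprints, the 0.5 cut-off, and the choice of centroids as representatives are assumptions for illustration.

```python
def tanimoto(a: frozenset[int], b: frozenset[int]) -> float:
    """Tanimoto similarity between two sets of fingerprint bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_clusters(fps: list[frozenset[int]],
                    min_similarity: float) -> list[list[int]]:
    """Return clusters as index lists; the first index of each is the centroid."""
    n = len(fps)
    neighbors: list[list[int]] = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if tanimoto(fps[i], fps[j]) >= min_similarity:
                neighbors[i].append(j)
                neighbors[j].append(i)

    # Visit molecules with the most neighbors first (ties broken by index).
    # Neighbor counts are computed once and not updated as clusters form.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n
    clusters: list[list[int]] = []
    for i in order:
        if assigned[i]:
            continue
        members = [i] + [j for j in neighbors[i] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(members)
    return clusters


# Hypothetical fingerprints given as sets of "on" bit indices.
fps = [frozenset(s) for s in (
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5},
    {10, 11, 12}, {10, 11, 13}, {20},
)]
clusters = butina_clusters(fps, min_similarity=0.5)
representatives = [members[0] for members in clusters]
```

Taking one representative per cluster is one simple way to turn the clustering into a docking set; other selection rules, such as the best-scoring member per cluster, fit the same structure.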
Stage 12

AutoDock Vina Docking

Selected candidates are docked against the prepared HIV-1 protease receptor active site.

Output Artifact
Vina docking scores and poses
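Docking scores can be collected from Vina's output by reading the `REMARK VINA RESULT:` line that precedes each pose in the output PDBQT file, which lists the predicted affinity in kcal/mol followed by two RMSD bounds relative to the best pose. The sketch below parses those lines from text; the sample text and its numbers are placeholders, not results from this pipeline.

```python
from typing import NamedTuple


class VinaPose(NamedTuple):
    affinity: float  # predicted binding affinity in kcal/mol (lower is better)
    rmsd_lb: float   # RMSD lower bound relative to the best pose
    rmsd_ub: float   # RMSD upper bound relative to the best pose


def parse_vina_results(pdbqt_text: str) -> list[VinaPose]:
    """Extract one VinaPose per 'REMARK VINA RESULT:' line, in file order."""
    poses = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            fields = line.split(":", 1)[1].split()
            affinity, rmsd_lb, rmsd_ub = (float(x) for x in fields[:3])
            poses.append(VinaPose(affinity, rmsd_lb, rmsd_ub))
    return poses


# Placeholder output text; the values are made up for this example.
sample_output = "\n".join([
    "MODEL 1",
    "REMARK VINA RESULT:    -9.1      0.000      0.000",
    "ENDMDL",
    "MODEL 2",
    "REMARK VINA RESULT:    -8.4      1.532      2.871",
    "ENDMDL",
])

poses = parse_vina_results(sample_output)
best = min(poses, key=lambda p: p.affinity)
```

Because more negative affinities indicate stronger predicted binding, ranking candidates means sorting by `affinity` in ascending order.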
Stage 13

ADMET Screening

Top-ranked molecules are summarized using screening-level pharmacokinetic and toxicity indicators.

Output Artifact
ADMET summary records
Stage 14

Web Research Demo

Precomputed results, descriptors, docked poses, and molecular scenes are served through the hosted interface.

Output Artifact
Interactive Vercel research console

Architecture Overview

The pipeline uses SELFIES to improve molecular validity during autoregressive generation, then applies descriptor filtering, diversity selection, docking, ADMET screening, and browser-based structural visualization for research presentation.