Research Pipeline

Technical workflow map of the SELFIES-Transformer molecular discovery engine, from chemical pretraining to docking evidence and hosted visualization.

End-to-End Workflow

Stage 01

ZINC Pretraining Corpus

Large chemical library used to learn general molecular syntax and drug-like chemical space.

Output Artifact

Pretraining molecule corpus

Stage 02

SMILES Standardization

Input molecular strings are cleaned, canonicalized, and prepared for robust representation conversion.

Output Artifact

Canonical SMILES records

Stage 03

SELFIES Conversion

Canonical SMILES are converted into SELFIES, a representation in which every token sequence decodes to a valid molecule, so that generated strings remain chemically valid.

Output Artifact

SELFIES molecular strings

Stage 04

SELFIES Tokenization

SELFIES strings are split into tokens and mapped into a model vocabulary for sequence learning.

Output Artifact

SELFIES token vocabulary
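
The tokenization step can be sketched as follows. This is a minimal illustration, not the project's tokenizer: it assumes single-fragment SELFIES strings made only of bracketed symbols, and the special-token names (`<pad>`, `<bos>`, `<eos>`) and function names are hypothetical. The `selfies` package also provides `split_selfies` for the splitting step.

```python
import re

# Hypothetical special tokens; the real vocabulary may use different names.
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>"]

# A SELFIES symbol is one bracketed unit such as [C], [=O] or [Ring1].
_SYMBOL_RE = re.compile(r"\[[^\[\]]+\]")


def tokenize(selfies_string):
    """Split a SELFIES string into its bracketed symbols."""
    tokens = _SYMBOL_RE.findall(selfies_string)
    # Reject input containing anything outside the bracketed symbols.
    if "".join(tokens) != selfies_string:
        raise ValueError(f"not a plain bracketed SELFIES string: {selfies_string!r}")
    return tokens


def build_vocab(corpus):
    """Map special tokens, then every distinct symbol (sorted), to integer ids."""
    symbols = sorted({tok for s in corpus for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + symbols)}


def encode(selfies_string, vocab):
    """Convert a SELFIES string to ids, wrapped in <bos> ... <eos>."""
    body = [vocab[tok] for tok in tokenize(selfies_string)]
    return [vocab["<bos>"]] + body + [vocab["<eos>"]]
```

Sorting the symbols before assigning ids keeps the vocabulary reproducible across runs, which matters when a pretrained checkpoint is reused for fine-tuning.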

Stage 05

Decoder-only Transformer

An autoregressive Transformer learns next-token prediction over molecular SELFIES sequences.

Output Artifact

Pretrained molecular generator
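
Next-token prediction can be illustrated without a deep learning framework. The sketch below shows the three ingredients a decoder-only model relies on: inputs and targets offset by one position, a causal mask that hides future tokens, and a cross-entropy loss over the predicted logits. The function names are hypothetical, and a real training loop would use a tensor library instead of Python lists.

```python
import math


def shift_for_next_token(ids):
    """Split one encoded sequence into (inputs, targets) offset by one position."""
    return ids[:-1], ids[1:]


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target ids under per-position logits."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[target]
    return total / len(targets)
```

With uniform logits over a vocabulary of size V, the loss equals log(V), which is a convenient sanity check for an untrained model.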

Stage 06

HIV-1 Protease Fine-tuning

The generator is specialized using curated HIV-1 protease inhibitor records.

Output Artifact

Target-focused generator

Stage 07

Molecule Generation

The fine-tuned model samples new candidate inhibitor molecules from learned chemical space.

Output Artifact

Generated candidate set
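
Sampling from an autoregressive generator can be sketched as a loop that repeatedly draws the next token from a temperature-scaled softmax until an end-of-sequence token appears. In this sketch `next_logits` is a hypothetical callable standing in for the fine-tuned model, and the temperature and length defaults are assumptions rather than the pipeline's settings.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(next_logits, bos_id, eos_id, max_len=64, temperature=1.0, rng=None):
    """Sample token ids one at a time until <eos> or max_len generated tokens.

    next_logits(prefix_ids) is a stand-in for the model: it returns one logit
    per vocabulary entry for the token that follows the given prefix.
    """
    rng = rng or random.Random()
    ids = [bos_id]
    for _ in range(max_len):
        probs = softmax(next_logits(ids), temperature)
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids[1:]  # drop the <bos> marker
```

Passing a seeded `random.Random` instance makes a sampling run reproducible, which helps when the generated set has to be regenerated for later filtering stages.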

Stage 08

RDKit Validation

Generated molecules are decoded, validated, canonicalized, and checked for chemical consistency.

Output Artifact

Valid canonical molecules

Stage 09

Drug-likeness Filtering

Candidates are filtered using molecular descriptors such as the quantitative estimate of drug-likeness (QED), synthetic accessibility (SA) score, molecular weight, and logP.

Output Artifact

Drug-like candidate subset
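
A descriptor-based filter of this kind can be sketched as a set of threshold checks over precomputed values. The record keys and every cutoff below are illustrative assumptions, not the pipeline's actual settings; in practice the descriptors would be computed with a cheminformatics toolkit such as RDKit before this step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # All cutoffs are illustrative assumptions, not the pipeline's settings.
    min_qed: float = 0.5       # QED ranges from 0 to 1, higher is more drug-like
    max_sa: float = 4.0        # SA score ranges from 1 to 10, lower is easier to make
    max_mol_wt: float = 500.0  # molecular weight in g/mol
    max_logp: float = 5.0      # octanol-water partition coefficient


def is_drug_like(record, limits=Thresholds()):
    """Return True when a descriptor record passes every threshold."""
    return (
        record["qed"] >= limits.min_qed
        and record["sa"] <= limits.max_sa
        and record["mol_wt"] <= limits.max_mol_wt
        and record["logp"] <= limits.max_logp
    )


def filter_candidates(records, limits=Thresholds()):
    """Keep only the records that pass the drug-likeness thresholds."""
    return [r for r in records if is_drug_like(r, limits)]
```

Keeping the thresholds in one frozen dataclass makes it easy to log the exact cutoffs alongside the filtered subset.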

Stage 10

PAINS/Brenk Filtering

Candidates that match PAINS or Brenk structural alerts, or contain other undesirable substructures, are removed from the candidate pool.

Output Artifact

Clean screening subset

Stage 11

Butina Diversity Clustering

Butina clustering groups structurally similar candidates, and selecting across clusters reduces redundancy before computational docking.

Output Artifact

Diverse docking set
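
The clustering idea can be sketched in plain Python. This is a simplified rendering of Taylor-Butina clustering over Tanimoto distances, with fingerprints represented as sets of on-bit indices. The 0.35 distance cutoff and the function names are assumptions, and a production pipeline would more likely rely on RDKit fingerprints and its built-in Butina implementation.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_clusters(fingerprints, cutoff=0.35):
    """Cluster by Tanimoto distance; each cluster lists its centroid first.

    Molecules within `cutoff` distance of each other are neighbors. Candidates
    are visited from most to fewest neighbors; each unassigned one becomes a
    centroid and claims its still-unassigned neighbors.
    """
    n = len(fingerprints)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fingerprints[i], fingerprints[j]) <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)

    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        assigned[centroid] = True
        members = [centroid]
        for nb in neighbors[centroid]:
            if not assigned[nb]:
                assigned[nb] = True
                members.append(nb)
        clusters.append(members)
    return clusters
```

Taking the first index of each cluster (the centroid) gives one representative per structural family, which is one simple way to build a diverse docking set.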

Stage 12

AutoDock Vina Docking

Selected candidates are docked against the prepared HIV-1 protease receptor active site.

Output Artifact

Vina docking scores and poses
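
Docking scores can be collected from Vina output files and ranked with a few lines of parsing. The sketch below assumes the standard Vina output PDBQT layout, in which each pose carries a `REMARK VINA RESULT:` line whose first number is the predicted binding affinity in kcal/mol (more negative is better). The function names and the dictionary-of-texts input are hypothetical conveniences.

```python
def parse_vina_affinities(pdbqt_text):
    """Return the affinity (kcal/mol) of each pose, in file order."""
    affinities = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # Fields after the colon: affinity, RMSD lower bound, RMSD upper bound.
            fields = line.split(":", 1)[1].split()
            affinities.append(float(fields[0]))
    return affinities


def rank_by_best_affinity(results):
    """Rank ligands by their best (most negative) pose affinity.

    `results` maps a ligand id to the text of its Vina output file.
    Ligands with no parsed poses are skipped.
    """
    best = {}
    for ligand_id, text in results.items():
        affinities = parse_vina_affinities(text)
        if affinities:
            best[ligand_id] = min(affinities)
    return sorted(best.items(), key=lambda item: item[1])
```

Ranking on the best pose per ligand is a common convention, though a pipeline may also keep all poses for visual inspection.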

Stage 13

ADMET Screening

Top-ranked molecules are summarized using screening-level pharmacokinetic and toxicity indicators.

Output Artifact

ADMET summary records

Stage 14

Web Research Demo

Precomputed results, descriptors, docked poses, and molecular scenes are served through the hosted interface.

Output Artifact

Interactive Vercel research console