AtomGen User Guide¶
Welcome to the AtomGen User Guide. This document provides comprehensive instructions on how to use all components of the AtomGen library for molecular modeling tasks.
Table of Contents¶
- Installation
- Quick Start
- Data Loading
- Pretraining
- Fine-tuning
- Inference
- Advanced Features
- Troubleshooting
Installation¶
The package can be installed using uv:
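A typical command looks like this (a sketch only: the package name is an assumption, and installing from a local clone of the repository is an equivalent alternative):
uv pip install atomgen  # package name assumed
# Or, from a clone of the repository:
# uv pip install -e .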
Quick Start¶
Here's a simple example to get you started with AtomGen using a pretrained model to extract features:
import torch
from transformers import AutoModel
# Load a pretrained model
model = AutoModel.from_pretrained(
    "vector-institute/atomformer-base", trust_remote_code=True
)
# Example input data
input_ids = torch.randint(0, 50, (1, 10))
coords = torch.randn(1, 10, 3)
attention_mask = torch.ones(1, 10)
# Extract features
with torch.no_grad():
    output = model(input_ids, coords=coords, attention_mask=attention_mask)

print(output.shape)  # Should be (1, 10, 768) for the base model
This example demonstrates how to load the pretrained AtomFormer model and use it to extract features from molecular data.
Data Loading¶
AtomGen leverages the HuggingFace datasets library for data loading. Here are examples of loading some of the available datasets:
from datasets import load_dataset
# Load the S2EF-15M dataset
s2ef_dataset = load_dataset("vector-institute/s2ef-15m")
# Load ATOM3D SMP dataset
smp_dataset = load_dataset("vector-institute/atom3d-smp")
# Load ATOM3D LBA dataset
lba_dataset = load_dataset("vector-institute/atom3d-lba")
Dataset structure:
- S2EF-15M: Contains 'input_ids' (atomic numbers), 'coords' (3D coordinates), 'forces', 'formation_energy', 'total_energy', and 'has_formation_energy' fields.
- ATOM3D datasets: Generally contain 'input_ids', 'coords', and task-specific labels. For example, SMP has 20 regression targets, while LBA has a single binding affinity value.
You can inspect the structure of a dataset using the standard datasets API, for example:
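# Print the available splits and the fields/types of each example
print(s2ef_dataset)
print(s2ef_dataset["train"].features)  # split name assumed; adjust to the splits listed above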
Pretraining¶
To pretrain an AtomFormer model, use the pretrain_s2ef.py script. Here's an example of how to use it:
python pretrain_s2ef.py \
--seed 42 \
--project "AtomGen" \
--name "s2ef_15m_train_base_10epochs" \
--output_dir "./checkpoint" \
--dataset_dir "./s2ef_15m" \
--model_config "atomgen/models/configs/atomformer-base.json" \
--tokenizer_json "atomgen/data/tokenizer.json" \
--micro_batch_size 8 \
--macro_batch_size 128 \
--num_train_epochs 10 \
--warmup_ratio 0.001 \
--lr_scheduler_type "cosine" \
--weight_decay 1.0e-2 \
--max_grad_norm 5.0 \
--learning_rate 3e-4 \
--gradient_checkpointing
This script handles the complexities of pretraining, including data loading, model initialization, and training loop management.
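Note that micro_batch_size and macro_batch_size are typically related through gradient accumulation; assuming that is how the script treats them (not verified here), the settings above correspond to:
micro_batch_size = 8
macro_batch_size = 128
grad_accum_steps = macro_batch_size // micro_batch_size  # 16 micro-batches accumulated per optimizer step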
Fine-tuning¶
For fine-tuning on ATOM3D tasks, use the run_atom3d.py script. Here's an example command:
python run_atom3d.py \
--model_name_or_path "vector-institute/atomformer-base" \
--dataset_name "vector-institute/atom3d-smp" \
--output_dir "./results" \
--batch_size 32 \
--learning_rate 5e-5 \
--num_train_epochs 3
Key arguments for run_atom3d.py:
- --model_name_or_path: Pretrained model to start from
- --dataset_name: ATOM3D dataset to use for fine-tuning
- --output_dir: Directory to save results
- --batch_size: Batch size per GPU/CPU for training
- --learning_rate: Initial learning rate
- --num_train_epochs: Total number of training epochs
Inference¶
To use a trained model for inference, you can load it directly from the HuggingFace Hub or from a local directory:
from transformers import AutoModelForSequenceClassification
import torch
# Load from HuggingFace Hub
model = AutoModelForSequenceClassification.from_pretrained(
    "vector-institute/atomformer-base-smp", trust_remote_code=True
)

# Or load from a local directory
# model = AutoModelForSequenceClassification.from_pretrained(
#     "path/to/your/model/directory", trust_remote_code=True
# )
# Prepare your input data
input_ids = torch.randint(0, 50, (1, 10))
coords = torch.randn(1, 10, 3)
attention_mask = torch.ones(1, 10)
# Run inference
with torch.no_grad():
    output = model(input_ids, coords=coords, attention_mask=attention_mask)

predictions = output[1]
print(predictions.shape)  # Should be (1, 20) for the SMP task
This example assumes the model has been fine-tuned on the SMP task. Adjust the model class and output processing based on the specific task you're working with.
Advanced Features¶
Data Collation¶
The DataCollatorForAtomModeling class handles batching of molecular data. Here's how to use it:
from atomgen.data import DataCollatorForAtomModeling
data_collator = DataCollatorForAtomModeling(
    mam=True,            # Enable masked atom modeling
    coords_perturb=0.1,  # Enable coordinate perturbation
    return_lap_pe=True,  # Return Laplacian positional encoding
)
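The collator can then be passed anywhere a collate function is accepted, such as a PyTorch DataLoader or the transformers Trainer. A minimal sketch, assuming a dataset loaded as in the Data Loading section (field and split names may differ):
from torch.utils.data import DataLoader

loader = DataLoader(
    smp_dataset["train"],   # any dataset with 'input_ids' and 'coords' fields
    batch_size=8,
    collate_fn=data_collator,
)
batch = next(iter(loader))  # batched tensors ready to pass to the model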
Distributed Training¶
For multi-GPU training, modify your run_atom3d.py command:
python -m torch.distributed.launch --nproc_per_node=4 run_atom3d.py \
--model_name_or_path "vector-institute/atomformer-base" \
--dataset_name "vector-institute/atom3d-smp" \
--output_dir "./results" \
--batch_size 8 \
--learning_rate 5e-5 \
--num_train_epochs 3
On recent PyTorch versions, the torchrun launcher (torchrun --nproc_per_node=4 run_atom3d.py, followed by the same arguments) is the recommended replacement for the deprecated python -m torch.distributed.launch.
Troubleshooting¶
If you encounter out-of-memory errors, try the following:
- Reduce batch size in the script arguments
- Enable gradient checkpointing (add --gradient_checkpointing to your command)
For more help, please check our GitHub Issues or open a new issue if you can't find a solution to your problem.