Multi-modal Sequence Encoding for AMPs
Benchmarking peptide sequence encoders and multi-modal transfer learning for out-of-distribution generalization. CIS 5200 (Machine Learning).
Project Overview & Motivation
This project addresses the critical global health crisis posed by antimicrobial resistance (AMR) with a deep learning approach to mining peptide proteomes, applying the novel technique of molecular de-extinction to identify potential antimicrobial peptides (AMPs) from ancient organisms.
The central challenge was addressing distribution shift in peptide activity prediction. While modern sequence-based models perform well on familiar data, they often fail when applied to novel, out-of-distribution (OOD) sequences. We investigated whether augmenting a high-capacity sequence encoder with structural features from AlphaFold could mitigate performance collapse when moving from modern to ancient peptide domains.
Project Aims:
- Train a deep learning model that learns high-dimensional latent representations of antimicrobial peptides directly from raw sequence data, and rigorously compare its predictive performance against traditional baseline models.
- Test the generalization of this model on an out-of-distribution dataset of extinct peptides.
Check out the project notebook here and the paper here.
Methods & Model Architecture
We formulated the prediction of log2-transformed Minimum Inhibitory Concentration (MIC) as a supervised regression task, using the publicly available Database of Antimicrobial Activity and Structure of Peptides (DBAASP) with 25,306 unique peptide-target interactions across 6,233 sequences and 50 bacterial targets.
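As context for this formulation, here is a minimal sketch of how the regression target can be set up. The file name and column names (`mic_um`, `target_species`) are hypothetical stand-ins for a DBAASP export, not the project's actual schema.

```python
import numpy as np
import pandas as pd

# Sketch only: column names below are hypothetical stand-ins
# for a DBAASP export, not the real database schema.
df = pd.read_csv("dbaasp_interactions.csv")

# Log2-transform MIC so the target is roughly symmetric and
# fold-changes in potency become additive differences.
df["log2_mic"] = np.log2(df["mic_um"])

# One (sequence, bacterial target) pair per row; the model regresses log2_mic.
X = df[["sequence", "target_species"]]
y = df["log2_mic"].to_numpy()
```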
Our technical workflow focused on three main ML paradigms:
- Encoder Development: We implemented an APEX-based Sequence Encoder that combines GRU layers with attention mechanisms to translate primary amino acid sequences into high-dimensional latent representations (a sketch follows this list).
- Multi-modal Transfer Learning: To improve OOD stability, we integrated five explicit structural features (pLDDT, helix fraction, disordered state, etc.) extracted from AlphaFold predictions generated via ColabFold. These were concatenated with the sequence embedding in a multi-modal layer to give the model geometric context.
- Baseline Comparison & Interpretability: We benchmarked against tree-based models (XGBoost, Random Forest) and used feature importance analysis to quantify the contribution of structural versus sequence-based features to the model's final predictions (a baseline sketch also follows this list).
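The sketch below illustrates the encoder design described above: a GRU over learned amino acid embeddings, attention pooling into a single sequence embedding, and concatenation of the five structural features before the regression head. The class name and layer sizes are illustrative assumptions, not the project's exact configuration.

```python
import torch
import torch.nn as nn

class MultiModalPeptideEncoder(nn.Module):
    """Hedged sketch of the APEX-style encoder: GRU over amino acid
    embeddings, attention pooling, and concatenation of five explicit
    structural features (e.g., pLDDT, helix fraction). Sizes are
    illustrative, not the project's exact configuration."""

    def __init__(self, vocab_size=21, embed_dim=64, hidden_dim=128, n_struct=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # per-residue attention scores
        self.head = nn.Sequential(                 # multi-modal regression head
            nn.Linear(2 * hidden_dim + n_struct, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),              # predicts log2(MIC)
        )

    def forward(self, tokens, struct_feats):
        h, _ = self.gru(self.embed(tokens))        # (B, L, 2H) residue states
        w = torch.softmax(self.attn(h), dim=1)     # attention weights over residues
        pooled = (w * h).sum(dim=1)                # (B, 2H) sequence embedding
        return self.head(torch.cat([pooled, struct_feats], dim=-1)).squeeze(-1)
```

A forward pass takes a batch of integer-encoded sequences and the matching structural feature matrix, e.g. `preds = MultiModalPeptideEncoder()(token_batch, struct_batch)`.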
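And a hedged sketch of the tree-based baseline together with the feature importance readout; `X_tab` is a synthetic placeholder feature matrix (for example, modlAMP descriptors followed by the five structural features), not the project's real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data: 20 sequence-derived columns + 5 structural columns.
rng = np.random.default_rng(0)
X_tab = rng.normal(size=(1000, 25))
y = rng.normal(size=1000)            # placeholder log2(MIC) targets

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tab, y)

# Impurity-based importances sum to 1, so the share assigned to the last
# five columns approximates the weight the model places on structure.
struct_share = rf.feature_importances_[-5:].sum()
print(f"structural feature weight: {struct_share:.1%}")
```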
Results & Key Insights
Our evaluation focused on the gap between in-distribution and out-of-distribution performance (a short metrics sketch follows this list):
- In-Distribution Performance: The APEX-based Sequence Encoder achieved robust performance with Spearman rho = 0.79 and R² = 0.62, confirming strong predictive capability for the training domain. Baseline models (XGBoost, Random Forest) reached comparable performance, showing that in-distribution interpolation is handled well across model families.
- Generalization Gap: Testing on an out-of-distribution dataset of 69 extinct peptides (41 validated AMPs) produced a significant performance collapse, with Spearman rho dropping to 0.36, confirming the challenge of evolutionary generalization.
- Feature Rejection & Scientific Insight: Our multi-modal integration revealed that the model assigned < 1% of its feature weight to structural data, a crucial scientific insight: static structural predictions (e.g., disordered state) carry little value for dynamic, membrane-targeting AMPs. The model's rejection of structural features suggests that static 3D structures correlate only weakly with antimicrobial activity in this shifted domain.
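For reference, a minimal sketch of how the two reported metrics are typically computed; the arrays below are synthetic placeholders, not the project's actual predictions.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import r2_score

# Synthetic placeholders standing in for held-out log2(MIC) values.
rng = np.random.default_rng(1)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.5, size=200)

rho, _ = spearmanr(y_true, y_pred)  # rank correlation, robust to monotone rescaling
r2 = r2_score(y_true, y_pred)       # variance explained on the log2 scale
print(f"Spearman rho = {rho:.2f}, R^2 = {r2:.2f}")
```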
This work validates a high-performance sequence baseline but establishes that the primary bottleneck in next-generation AMP discovery is the current inability to model dynamic, bioactive structural states. Future work could explore Graph Neural Networks (GNNs) to exploit the full structural information from AlphaFold rather than collapsing it into summary statistics.
Skills Used
- Programming: Python with PyTorch, scikit-learn, Pandas, NumPy, modlAMP, and ColabFold
- ML concepts: GRU networks, attention mechanisms, transfer learning, multi-modal learning, XGBoost, Random Forest, regression, out-of-distribution generalization, feature importance analysis
- Domain-specific: AlphaFold, protein structure prediction, antimicrobial peptides, DBAASP database, sequence encoding