Multi-Scale Reversible Chaos Game Representation: A Unified Framework for Sequence Classification

Sarwan Ali; Taslim Murad

✨ TL;DR

MS-RCGR is a new method that converts biological sequences (DNA/protein) into multi-resolution geometric images without losing information, enabling better classification through traditional ML, computer vision, or hybrid approaches. The method consistently improves performance across different analysis paradigms and achieves best results when combined with protein language models.

01 · Problem

Biological sequence classification must balance performance with interpretability. Traditional sequence encoding methods often lose information during transformation or fail to capture patterns at multiple scales. Existing approaches typically operate within a single analytical paradigm (traditional machine learning on hand-crafted features, deep learning on raw sequences, or computer vision on sequence representations), which limits their flexibility and can miss complementary insights. The paper argues for a unified framework that preserves complete sequence information, supports diverse analytical approaches, and yields interpretable representations.

02 · Approach

The paper introduces Multi-Scale Reversible Chaos Game Representation (MS-RCGR), which transforms biological sequences into multi-resolution geometric representations using rational arithmetic and hierarchical k-mer decomposition. The method generates scale-invariant features through Chaos Game Representation while guaranteeing complete reversibility, meaning the original sequence can be perfectly reconstructed from the encoding. MS-RCGR creates geometric features at multiple scales, capturing patterns from individual nucleotides to complex motif structures. The framework supports three distinct analytical paradigms: traditional machine learning using extracted geometric features from the CGR representation, computer vision models that treat CGR outputs as images, and hybrid approaches that combine protein language model embeddings (ESM2, ProtT5) with MS-RCGR features. This multi-paradigm design allows researchers to choose the most appropriate analytical approach for their specific task.
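To make the reversibility claim concrete, here is a minimal sketch of a classic single-scale Chaos Game Representation computed with exact rational arithmetic. It is not the paper's implementation: the corner layout, the starting point (1/2, 1/2), and the names `cgr_encode` and `cgr_decode` are illustrative assumptions. The idea it shows is that exact fractions avoid floating-point rounding, so the final point determines the whole sequence.

```python
from fractions import Fraction

# Assumed corner layout for the unit square (a common CGR convention,
# not necessarily the one used in the paper).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
START = (Fraction(1, 2), Fraction(1, 2))


def cgr_encode(seq):
    """Map a DNA string to one exact rational point in the unit square."""
    x, y = START
    for base in seq:
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x = (x + cx) / 2
        y = (y + cy) / 2
    return x, y


def cgr_decode(point):
    """Recover the sequence by undoing the halving steps one at a time."""
    x, y = point
    inverse = {corner: base for base, corner in CORNERS.items()}
    bases = []
    while (x, y) != START:
        # The quadrant containing the point identifies the last base added.
        cx = 1 if x > Fraction(1, 2) else 0
        cy = 1 if y > Fraction(1, 2) else 0
        bases.append(inverse[(cx, cy)])
        # Invert the midpoint step: previous = 2 * current - corner.
        x = 2 * x - cx
        y = 2 * y - cy
    return "".join(reversed(bases))


if __name__ == "__main__":
    point = cgr_encode("GATTACA")
    print(point, cgr_decode(point))
```

With floating-point coordinates the same round trip would start to fail once the sequence is longer than the mantissa can resolve, which is the practical motivation for rational arithmetic here. The trade-off is that the denominators grow with sequence length.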

03 · Key insights

What the paper shows.

01 Reversibility through rational arithmetic ensures zero information loss during sequence-to-geometric transformation, distinguishing MS-RCGR from traditional encoding methods.

02 Multi-scale hierarchical k-mer decomposition captures biological patterns at different resolutions, from single nucleotides to complex motifs.

03 The framework successfully bridges three distinct analytical paradigms (traditional ML, computer vision, and language models), demonstrating versatility.

04 Hybrid approaches combining pre-trained language model embeddings with MS-RCGR features outperform either method used independently, suggesting complementary information capture.
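The multi-scale idea in the second insight can be illustrated with k-mer count grids at several resolutions, where scale k gives a 2^k x 2^k grid and each k-mer lands in the cell its CGR point would fall into. This is a hedged sketch of a standard frequency-CGR construction, not the paper's hierarchical decomposition; the corner layout and the names `kmer_cell` and `multiscale_counts` are assumptions made for illustration.

```python
# Assumed corner layout (same common convention as classic CGR).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_cell(kmer):
    """Return the (col, row) cell of a k-mer in a 2^k x 2^k CGR grid."""
    col = row = 0
    # In CGR the most recent base decides the coarsest quadrant,
    # so the last base of the k-mer supplies the most significant bit.
    for base in reversed(kmer):
        cx, cy = CORNERS[base]
        col = col * 2 + cx
        row = row * 2 + cy
    return col, row


def multiscale_counts(seq, scales=(1, 2, 3)):
    """Build one k-mer count grid per scale k, keyed by k."""
    grids = {}
    for k in scales:
        size = 2 ** k
        grid = [[0] * size for _ in range(size)]
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            col, row = kmer_cell(seq[i:i + k])
            grid[row][col] += 1
        grids[k] = grid
    return grids


if __name__ == "__main__":
    for k, grid in multiscale_counts("ACGTACGTTTGA").items():
        print(f"k={k}")
        for row in grid:
            print(row)
```

Each grid can be flattened into a feature vector for a classical model or rendered as a small image for a vision model, which is how a single representation can feed more than one of the paradigms described above.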

04 · Results

Comprehensive experiments on synthetic DNA and protein datasets covering seven distinct sequence classes demonstrate that MS-RCGR features consistently enhance classification performance across all three analytical paradigms. The hybrid approach combining pre-trained protein language model embeddings (ESM2 and ProtT5) with MS-RCGR features achieves superior performance compared to using either method alone. MS-RCGR features provide improvements when used with traditional machine learning algorithms operating on extracted geometric features, when used as input to computer vision models treating CGR outputs as images, and when combined with state-of-the-art protein language models. The results validate that the multi-scale geometric representation captures complementary information to sequence-based embeddings.
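One simple way to picture the hybrid setup is feature-level fusion: a language-model embedding and a geometric feature vector are each normalized and then concatenated before being passed to a classifier. The summary does not state how the paper combines the two, so the sketch below is only an assumed baseline; `l2_normalize` and `fuse_features` are hypothetical helper names, and the input vectors stand in for real ESM2 or ProtT5 embeddings and MS-RCGR features.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_features(plm_embedding, geometric_features):
    """Concatenate two feature blocks after normalizing each separately.

    Normalizing per block keeps a high-dimensional embedding from
    dominating a much shorter geometric feature vector purely by scale.
    """
    return l2_normalize(plm_embedding) + l2_normalize(geometric_features)


if __name__ == "__main__":
    # Toy stand-ins: real embeddings would have hundreds of dimensions.
    fused = fuse_features([3.0, 4.0], [0.0, 2.0])
    print(fused)
```

Other fusion strategies, such as learned projections or late fusion of classifier outputs, are equally plausible readings of "hybrid".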

05 · Limitations

The paper does not explicitly discuss computational complexity or scalability to very long sequences, which could be a concern given the multi-scale nature of the representation. While tested on synthetic datasets covering seven sequence classes, validation on real-world biological datasets with more diverse and noisy data would strengthen the claims. The paper does not provide detailed analysis of which specific geometric features contribute most to classification performance, limiting mechanistic interpretability despite claims of interpretability. The comparison appears limited to specific protein language models (ESM2, ProtT5) and may not cover the full landscape of existing sequence encoding methods. Additionally, the practical applicability to different types of biological problems beyond classification (such as sequence generation or structure prediction) remains unexplored.

✨ Generated by Claude · Apr 21, 2026 · Read the PDF for authoritative content.
