
S2ST-Distill

Distill multilingual speech-to-speech translation models into lightweight single language-pair models for on-device real-time inference.

License: MIT · Python 3.10+

🎯 Goal

Take a large multilingual S2ST model (e.g., SeamlessM4T with 100+ languages) and distill it into a tiny single language-pair model (e.g., EN→ZH only) that can:

  • Run on mobile devices (iOS/Android)
  • Achieve < 2.5s end-to-end latency for real-time interpretation
  • Maintain natural-sounding voice with preserved speaker timbre and prosody
  • Fit in 20-50MB for on-demand download

✨ Features

  • Language-pair pruning: Remove unnecessary language embeddings and parameters
  • Knowledge distillation: Transfer knowledge from teacher to compact student model
  • Layer pruning: Iteratively remove unimportant layers based on importance scores
  • Voice preservation: Maintain speaker identity and prosody in translated speech
  • Quantization: INT8/INT4 quantization for smaller model size
  • Mobile deployment: Export to CoreML (iOS) and TFLite (Android)

📊 Target Metrics

| Metric              | Target   | Notes                                    |
|---------------------|----------|------------------------------------------|
| Model Size          | 20-50 MB | Single language-pair, quantized          |
| Inference Latency   | < 300 ms | Neural network computation only          |
| End-to-End Latency  | < 2.5 s  | Including algorithmic lookahead          |
| Translation Quality | BLEU 28+ | Compared to reference translations       |
| Voice Naturalness   | MOS 3.5+ | Mean Opinion Score (1-5 scale)           |
| Voice Similarity    | > 0.75   | Cosine similarity of speaker embeddings  |
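The voice-similarity metric can be made concrete: it is the cosine similarity between speaker embeddings extracted from the source audio and the translated audio. A minimal NumPy sketch (`voice_similarity` is an illustrative name, not a function exported by this package):

```python
import numpy as np

def voice_similarity(src_embed: np.ndarray, tgt_embed: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors (target: > 0.75)."""
    num = float(np.dot(src_embed, tgt_embed))
    den = float(np.linalg.norm(src_embed) * np.linalg.norm(tgt_embed))
    return num / den

# Toy 3-dim embeddings; real speaker embeddings are typically 192-512 dims.
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.5, 1.0])
print(round(voice_similarity(a, b), 3))  # → 0.943
```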

🎮 Try the Demo

Test the distilled models with our web interface:

# Clone and install
git clone https://github.com/Elarwei001/s2st-distill.git
cd s2st-distill
pip install -r demo/requirements.txt

# Run the demo
python demo/app.py

# Open http://localhost:7860 in your browser

Features:

  • 🎙️ Record audio from microphone
  • 📁 Upload audio files
  • 🌐 Choose translation direction (EN↔ZH, ZH↔FR)
  • 🔊 Instant playback of translated speech

See demo/README.md for more details.


🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/Elarwei001/s2st-distill.git
cd s2st-distill

# Create virtual environment
conda create -n s2st python=3.10
conda activate s2st

# Install dependencies
pip install -r requirements.txt

Basic Usage

from s2st_distill import S2STDistiller

# Initialize distiller with base model
distiller = S2STDistiller(
    base_model="facebook/seamless-m4t-unity-small",
    source_lang="eng",
    target_lang="cmn"
)

# Run distillation pipeline
student_model = distiller.distill(
    train_dataset="path/to/train.json",
    num_epochs=10,
    target_size_mb=30
)

# Export for mobile
distiller.export_coreml("model.mlpackage")  # iOS
distiller.export_tflite("model.tflite")      # Android
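To sanity-check the < 300 ms inference-latency target on a given machine, a simple wall-clock harness like the one below can help. This is an illustrative sketch, not the repository's scripts/benchmark.py; `infer_fn` stands for any callable that runs one inference:

```python
import time

def benchmark_latency(infer_fn, sample, n_warmup=3, n_runs=20):
    """Median wall-clock latency of one inference call, in milliseconds."""
    for _ in range(n_warmup):          # warm caches / JIT before timing
        infer_fn(sample)
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        infer_fn(sample)
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2]      # median is robust to outliers
```

Median (rather than mean) latency is reported because occasional OS scheduling stalls would otherwise skew the result.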

📁 Project Structure

s2st-distill/
├── demo/                    # 🎮 Web UI Demo
│   ├── app.py               # Gradio web interface
│   ├── requirements.txt     # Demo dependencies
│   └── README.md            # Demo documentation
├── docs/                    # Documentation
│   ├── TECHNICAL_SPEC.md    # Detailed technical specification
│   ├── ARCHITECTURE.md      # Model architecture overview
│   └── DEPLOYMENT.md        # Mobile deployment guide
├── s2st_distill/            # Main package
│   ├── __init__.py
│   ├── distiller.py         # Main distillation pipeline
│   ├── pruning.py           # Language and layer pruning
│   ├── voice_preserve.py    # Speaker/prosody preservation
│   ├── quantize.py          # Quantization utilities
│   └── export.py            # Mobile export (CoreML/TFLite)
├── scripts/                 # Utility scripts
│   ├── train.py             # Training script
│   ├── evaluate.py          # Evaluation script
│   └── benchmark.py         # Latency benchmark
├── models/                  # Trained models (after distillation)
│   ├── en_zh/               # English → Chinese
│   ├── zh_en/               # Chinese → English
│   ├── zh_fr/               # Chinese → French
│   └── fr_zh/               # French → Chinese
├── tests/                   # Unit tests
├── examples/                # Example notebooks
├── requirements.txt
├── setup.py
└── README.md

📖 Documentation

Detailed guides live in the docs/ directory: TECHNICAL_SPEC.md (technical specification), ARCHITECTURE.md (model architecture), and DEPLOYMENT.md (mobile deployment).

🔬 How It Works

1. Language-Pair Pruning

Remove embeddings and parameters for languages not in the target pair, reducing model size by ~60%.
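As an illustration of the idea (not the package's actual pruning.py code), language-pair pruning can be viewed as dropping embedding rows for language tags outside the kept pair and remapping the vocabulary. The `__eng__`-style tag naming here is an assumption made for the sketch:

```python
import numpy as np

def prune_language_rows(embed_matrix, vocab, keep_langs):
    """Drop embedding rows for language tags outside the kept pair.
    vocab maps token string -> row index; language tags look like '__eng__'."""
    keep_tags = {f"__{lang}__" for lang in keep_langs}
    keep_rows = sorted(
        i for tok, i in vocab.items()
        if not (tok.startswith("__") and tok.endswith("__"))  # regular tokens stay
        or tok in keep_tags                                   # kept language tags stay
    )
    # Remap surviving tokens to their new, compacted row indices.
    new_vocab = {tok: keep_rows.index(i) for tok, i in vocab.items() if i in keep_rows}
    return embed_matrix[keep_rows], new_vocab
```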

2. Knowledge Distillation

Use the original model as teacher to train a smaller student model, preserving translation quality.
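The standard distillation objective is a temperature-softened KL divergence between teacher and student output distributions. A minimal NumPy sketch of that loss (the repository's actual training loss may differ, e.g. by mixing in a hard-label term):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.
    The T*T factor keeps gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    return float((T * T) * np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```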

3. Layer Pruning

Iteratively remove least important layers based on importance scores computed from validation loss.
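One common way to compute such importance scores (a sketch of the general technique, not necessarily this project's exact method) is to measure how much validation loss rises when a layer is bypassed, then rank layers by that increase:

```python
def rank_layers_by_importance(base_loss, eval_without_layer, num_layers):
    """Rank layers from least to most important.
    eval_without_layer(i) -> validation loss with layer i bypassed."""
    scores = {i: eval_without_layer(i) - base_loss for i in range(num_layers)}
    # Smallest loss increase = least important = first candidate for removal.
    return sorted(scores, key=scores.get)
```

After removing the lowest-ranked layer, the scores are recomputed before the next removal, since layer importances shift as the network shrinks.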

4. Voice Preservation

  • Speaker Encoder: Extract speaker embeddings to preserve voice identity
  • Prosody Transfer: Transfer pitch, duration, and energy patterns from source to target
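A highly simplified view of prosody transfer is matching the translated pitch contour's statistics (mean and variance) to the source speaker's. The function below is an illustrative sketch only; real prosody transfer operates on more than global F0 statistics:

```python
import numpy as np

def transfer_prosody_stats(src_f0, tgt_f0):
    """Rescale the target pitch contour so its mean/std match the source
    speaker's, preserving the source prosodic range (a simplified view)."""
    src = np.asarray(src_f0, dtype=float)
    tgt = np.asarray(tgt_f0, dtype=float)
    normalized = (tgt - tgt.mean()) / (tgt.std() + 1e-8)  # zero-mean, unit-std
    return normalized * src.std() + src.mean()            # source statistics
```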

5. Quantization

Apply INT8/INT4 quantization to further reduce model size with minimal quality loss.
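Symmetric per-tensor INT8 quantization, the simplest variant of this step, maps weights to 8-bit integers through a single scale factor. A NumPy sketch of the idea (real deployments would typically rely on PyTorch's or the export toolchain's quantizers):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from INT8 values."""
    return q.astype(np.float32) * scale
```

Each weight is stored in 1 byte instead of 4, so this alone cuts weight storage by roughly 4x; INT4 halves it again at a higher quality cost.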

🤝 Contributing

Contributions are welcome! Please read our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Meta AI for SeamlessM4T
  • Google Research for SimulTron and real-time S2ST research
  • The open-source speech processing community
