mathysgrapotte/bio2parquet

bio2parquet

Convert your genomic data to high-performance Parquet for seamless integration and faster ML pipeline training.

🚀 Features

  • FASTA Support: Convert FASTA files (.fasta, .fa, .fna) to Parquet format
  • Compression Support: Handle both plain and gzipped FASTA files
  • Hugging Face Integration: Direct upload to Hugging Face Hub for dataset sharing
  • Python Native: Built with Python for easy integration into your workflow
  • Type Safety: Full type hint support for a better development experience
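
To make the FASTA and compression features above concrete, here is a minimal standard-library sketch of reading records from either a plain or a gzipped FASTA file. It is an illustration only, not bio2parquet's actual implementation: the helper names `open_fasta` and `read_fasta_records` and the `header`/`sequence` field names are assumptions made for this example.

```python
import gzip
from typing import Dict, Iterator, List, Optional, TextIO


def open_fasta(path: str) -> TextIO:
    """Open a FASTA file as text, transparently handling gzip (hypothetical helper)."""
    # Gzip streams start with the magic bytes 0x1f 0x8b, so the check does
    # not depend on the file extension.
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def read_fasta_records(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per record; the field names are assumed for illustration."""
    header: Optional[str] = None
    chunks: List[str] = []
    with open_fasta(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield {"header": header, "sequence": "".join(chunks)}
                header = line[1:]
                chunks = []
            elif header is not None:
                # Sequences may be wrapped over several lines; join them later.
                chunks.append(line)
    # Emit the final record once the file is exhausted.
    if header is not None:
        yield {"header": header, "sequence": "".join(chunks)}
```

Each yielded dict corresponds to one row of a tabular dataset, which is the shape a columnar format such as Parquet stores efficiently.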

📦 Installation

pip install bio2parquet

With uv:

uv tool install bio2parquet

🎯 Quick Start

Command Line Interface

# Basic conversion
bio2parquet fasta input.fasta

# Specify output file
bio2parquet fasta input.fasta -o output.parquet

# Upload to Hugging Face Hub
bio2parquet fasta input.fasta --hf-repo-id username/dataset-name --hf-token your_token
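
When several files need converting, the commands above can be scripted. The sketch below builds the same invocations from Python, using only the subcommand and flags shown in this README (`fasta`, `-o`, `--hf-repo-id`, `--hf-token`). The helper names `parquet_name`, `build_command` and `convert_directory`, and the choice to read the token from an `HF_TOKEN` environment variable, are assumptions made for this example and not part of bio2parquet.

```python
import os
import subprocess
from pathlib import Path
from typing import List, Optional

# Extensions listed under Features; a trailing ".gz" is handled separately.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def parquet_name(fasta_path: Path) -> Path:
    """Derive an output path, e.g. genome.fa.gz -> genome.parquet (hypothetical helper)."""
    name = fasta_path.name
    if name.endswith(".gz"):
        name = name[:-3]
    stem, ext = os.path.splitext(name)
    if ext.lower() in FASTA_SUFFIXES:
        name = stem
    return fasta_path.with_name(name + ".parquet")


def build_command(fasta_path: Path, repo_id: Optional[str] = None) -> List[str]:
    """Build the argument list for one bio2parquet call."""
    cmd = ["bio2parquet", "fasta", str(fasta_path)]
    if repo_id is None:
        # Local conversion with an explicit output file.
        cmd += ["-o", str(parquet_name(fasta_path))]
    else:
        # Assumption: this script reads the token from HF_TOKEN so that it
        # is not hard-coded in the script itself.
        cmd += ["--hf-repo-id", repo_id, "--hf-token", os.environ["HF_TOKEN"]]
    return cmd


def convert_directory(directory: str) -> None:
    """Convert every FASTA file in a directory, stopping at the first failure."""
    for path in sorted(Path(directory).iterdir()):
        name = path.name[:-3] if path.name.endswith(".gz") else path.name
        if os.path.splitext(name)[1].lower() in FASTA_SUFFIXES:
            subprocess.run(build_command(path), check=True)
```

Calling `convert_directory("data")` would run one `bio2parquet fasta ... -o ...` command per FASTA file found in `data`. Passing the arguments as a list avoids shell quoting problems with unusual file names.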

Python API

from bio2parquet import create_dataset_from_fasta

# Convert FASTA to Parquet
dataset = create_dataset_from_fasta("input.fasta")
dataset.to_parquet("output.parquet")

# Upload to Hugging Face Hub
dataset.push_to_hub("username/dataset-name", token="your_token")

📚 Documentation

For detailed documentation, visit our documentation site.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the license file for details.

About

End slow text-based formats in bioinformatics by converting them to modern, high-performance ones.
