Fix MetaspacePreTokenizer: prepend_scheme no longer gated by addPrefixSpace #319

Merged
mattt merged 1 commit into huggingface:main from beshkenadze:fix/metaspace-prepend-scheme-gating on Feb 24, 2026

Conversation

@beshkenadze
Contributor

Summary

  • Fixes MetaspacePreTokenizer.preTokenize(), which never prepended the replacement character (▁) when add_prefix_space was absent from the tokenizer config, even when prepend_scheme was set to "always". This broke XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace with prepend_scheme: "always".
  • Aligns with the canonical Rust implementation (huggingface/tokenizers PR #1357) where prepend_scheme is the sole runtime authority for prepending behavior.
  • Adds 9 new unit tests covering all prepend_scheme modes, backward compatibility with legacy add_prefix_space, precedence rules, and edge cases.

Fixes #318

Root Cause

In the previous code, addPrefixSpace (defaulting to false when absent) was used as an outer gate for the prependScheme logic:

// BEFORE (broken):
if addPrefixSpace, !normalized.hasPrefix(replacement) {
    if prependScheme == .always { prepend = stringReplacement }
    if prependScheme == .first, options.contains(.firstSection) { prepend = stringReplacement }
}

When add_prefix_space was absent from the config (which is normal for XLM-RoBERTa), addPrefixSpace defaulted to false, and the entire prepend block was never executed — regardless of prependScheme.
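
To make the failure mode concrete, here is a minimal, self-contained sketch of the gated control flow. The `PrependScheme` enum, the `legacyPrepend` function, and its parameters are hypothetical stand-ins for illustration, not the library's actual declarations; `isFirstSection` stands in for `options.contains(.firstSection)`.

```swift
// Hypothetical stand-in for the library's prepend scheme type.
enum PrependScheme {
    case first, never, always
}

/// Mirrors the pre-fix control flow: `addPrefixSpace` gates everything.
/// Returns the string to prepend, or nil when nothing is prepended.
func legacyPrepend(
    normalized: String,
    addPrefixSpace: Bool,
    prependScheme: PrependScheme,
    isFirstSection: Bool,
    replacement: String = "▁"
) -> String? {
    var prepend: String? = nil
    if addPrefixSpace, !normalized.hasPrefix(replacement) {
        if prependScheme == .always { prepend = replacement }
        if prependScheme == .first, isFirstSection { prepend = replacement }
    }
    return prepend
}

// add_prefix_space is absent from the config, so the flag falls back to false.
let result = legacyPrepend(
    normalized: "Hello▁world",
    addPrefixSpace: false,
    prependScheme: .always,
    isFirstSection: true
)
// `result` is nil: the scheme is never consulted because the outer gate is closed.
```

With `addPrefixSpace: true` the same call would return the replacement character, which is why configs that set the legacy flag explicitly did not show the bug.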

Changes

Sources/Tokenizers/PreTokenizer.swift

init: prependScheme now resolves with backward compatibility:

  • If prepend_scheme is explicit in config → use it directly
  • Otherwise derive from add_prefix_space (true/absent → .always, false → .never)
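
The resolution order above can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in PreTokenizer.swift: `PrependScheme`, `resolvePrependScheme`, and its parameter names are hypothetical, and an optional parameter set to nil models a key that is absent from the config. Treating an unrecognized scheme string as absent is also an assumption of this sketch.

```swift
// Hypothetical stand-in for the library's prepend scheme type.
// Raw values match the strings used in the tokenizer config.
enum PrependScheme: String {
    case first, never, always
}

/// Resolves the effective scheme from the two config keys.
/// A nil argument means the key is absent from the config.
func resolvePrependScheme(prependScheme: String?, addPrefixSpace: Bool?) -> PrependScheme {
    // 1. An explicit, recognized prepend_scheme wins outright.
    //    (Assumption: an unrecognized string is treated like an absent key.)
    if let raw = prependScheme, let scheme = PrependScheme(rawValue: raw) {
        return scheme
    }
    // 2. Otherwise fall back to the legacy flag:
    //    true or absent -> .always, false -> .never.
    return (addPrefixSpace ?? true) ? .always : .never
}

let xlmRoberta = resolvePrependScheme(prependScheme: "always", addPrefixSpace: nil)
let legacyOff = resolvePrependScheme(prependScheme: nil, addPrefixSpace: false)
let bothAbsent = resolvePrependScheme(prependScheme: nil, addPrefixSpace: nil)
let schemeWins = resolvePrependScheme(prependScheme: "never", addPrefixSpace: true)
```

Here `xlmRoberta` and `bothAbsent` resolve to `.always`, while `legacyOff` and `schemeWins` resolve to `.never`. Resolving once at init time means the runtime path only ever has to look at a single value.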

preTokenize — removed addPrefixSpace gate, replaced with switch prependScheme:

// AFTER (fixed):
if !normalized.hasPrefix(replacement) {
    switch prependScheme {
    case .always: prepend = stringReplacement
    case .first:
        if options.contains(.firstSection) { prepend = stringReplacement }
    case .never: break
    }
}
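
For readers who want to try the fixed decision in isolation, the following is a minimal sketch that applies the same switch to a plain string. `PrependScheme`, `prependedText`, and `isFirstSection` are hypothetical names introduced here (the last one stands in for `options.contains(.firstSection)`), and the sketch returns the prefixed text directly rather than performing the splitting that the real preTokenize does afterwards.

```swift
// Hypothetical stand-in for the library's prepend scheme type.
enum PrependScheme {
    case first, never, always
}

/// Applies the post-fix prepend decision to already-normalized text.
/// The scheme alone decides; there is no addPrefixSpace gate.
func prependedText(
    _ normalized: String,
    scheme: PrependScheme,
    isFirstSection: Bool,
    replacement: String = "▁"
) -> String {
    // Never double-prefix text that already starts with the replacement.
    guard !normalized.hasPrefix(replacement) else { return normalized }
    switch scheme {
    case .always:
        return replacement + normalized
    case .first:
        // Only the first section of the input receives the prefix.
        return isFirstSection ? replacement + normalized : normalized
    case .never:
        return normalized
    }
}

let always = prependedText("Hello▁world", scheme: .always, isFirstSection: false)
let firstLater = prependedText("Hello▁world", scheme: .first, isFirstSection: false)
let alreadyPrefixed = prependedText("▁Hello", scheme: .always, isFirstSection: true)
```

In this sketch `always` is "▁Hello▁world", `firstLater` is left as "Hello▁world", and `alreadyPrefixed` stays "▁Hello". Using an exhaustive switch also means the compiler flags any scheme case added later that the function does not handle.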

Tests/TokenizersTests/PreTokenizerTests.swift

9 new test functions:

| Test | Validates |
| --- | --- |
| always without add_prefix_space | The XLM-RoBERTa fix (issue #318) |
| first with section options | First-section-only prepending |
| never | No prepending |
| Legacy add_prefix_space: true | Backward compat → .always |
| Legacy add_prefix_space: false | Backward compat → .never |
| Default (both absent) | Defaults to .always |
| always supersedes add_prefix_space: false | Precedence: scheme wins |
| never supersedes add_prefix_space: true | Inverse precedence |
| Empty string | Edge case behavior |

Testing

  • All 18 PreTokenizer tests pass ✅
  • All 84 TokenizersTests pass ✅ (no regressions, including existing robertaXLMTokenizer integration test)

…xSpace

MetaspacePreTokenizer.preTokenize() never prepended the replacement
character (▁) when add_prefix_space was absent from the tokenizer
config, even when prepend_scheme was set to "always". This broke
XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace
with prepend_scheme: "always".

The fix aligns with the canonical Rust implementation (huggingface/
tokenizers PR #1357) where prepend_scheme is the sole authority:

- init: resolves prependScheme from explicit prepend_scheme first,
  falling back to add_prefix_space for backward compatibility
- preTokenize: uses switch on prependScheme directly, removing the
  addPrefixSpace gate

Fixes huggingface#318
Copilot AI review requested due to automatic review settings February 24, 2026 15:07

Copilot AI left a comment

Pull request overview

This PR fixes a critical bug in MetaspacePreTokenizer that broke tokenization for XLM-RoBERTa and other SentencePiece Unigram models. The issue was that add_prefix_space (defaulting to false when absent) was incorrectly used as a gate for the prepend_scheme logic, preventing the prepend character from being added even when prepend_scheme: "always" was explicitly set. The fix aligns the Swift implementation with the canonical Rust implementation where prepend_scheme is the sole runtime authority for prepending behavior.

Changes:

  • Modified MetaspacePreTokenizer.init() to properly resolve prepend_scheme with backward compatibility for legacy add_prefix_space configs
  • Removed the addPrefixSpace gate from preTokenize() and replaced it with a proper switch statement on prependScheme
  • Added 9 comprehensive unit tests covering all prepend schemes, backward compatibility, precedence rules, and edge cases

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| Sources/Tokenizers/PreTokenizer.swift | Fixed init to handle prepend_scheme precedence and preTokenize to use switch-based logic without addPrefixSpace gate |
| Tests/TokenizersTests/PreTokenizerTests.swift | Added 9 new tests covering all prepend_scheme modes, backward compatibility, precedence, and edge cases |

@mattt
Collaborator

mattt commented Feb 24, 2026

@beshkenadze Thank you for reporting and contributing a fix! This looks right to me. Merging this now.

@mattt mattt merged commit f3d5cbf into huggingface:main Feb 24, 2026
6 of 7 checks passed
@mattt
Collaborator

mattt commented Feb 24, 2026

This is now available in 1.1.8

@beshkenadze beshkenadze deleted the fix/metaspace-prepend-scheme-gating branch February 24, 2026 19:09
DePasqualeOrg pushed a commit to DePasqualeOrg/swift-tokenizers that referenced this pull request Mar 4, 2026
Cherry-picked from huggingface/swift-transformers#319 (f3d5cbf).
Development

Successfully merging this pull request may close these issues.

MetaspacePreTokenizer: addPrefixSpace gates prependScheme, breaking XLM-RoBERTa tokenization

3 participants