Fix MetaspacePreTokenizer: prepend_scheme no longer gated by addPrefixSpace #319
MetaspacePreTokenizer.preTokenize() never prepended the replacement character (▁) when add_prefix_space was absent from the tokenizer config, even when prepend_scheme was set to "always". This broke XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace with prepend_scheme: "always".
The fix aligns with the canonical Rust implementation (huggingface/tokenizers PR #1357), where prepend_scheme is the sole authority:
- init: resolves prependScheme from an explicit prepend_scheme first, falling back to add_prefix_space for backward compatibility
- preTokenize: switches on prependScheme directly, removing the addPrefixSpace gate
Fixes huggingface#318
Pull request overview
This PR fixes a critical bug in MetaspacePreTokenizer that broke tokenization for XLM-RoBERTa and other SentencePiece Unigram models. The issue was that add_prefix_space (defaulting to false when absent) was incorrectly used as a gate for the prepend_scheme logic, preventing the prepend character ▁ from being added even when prepend_scheme: "always" was explicitly set. The fix aligns the Swift implementation with the canonical Rust implementation where prepend_scheme is the sole runtime authority for prepending behavior.
Changes:
- Modified MetaspacePreTokenizer.init() to properly resolve prepend_scheme with backward compatibility for legacy add_prefix_space configs
- Removed the addPrefixSpace gate from preTokenize() and replaced it with a proper switch statement on prependScheme
- Added 9 comprehensive unit tests covering all prepend schemes, backward compatibility, precedence rules, and edge cases
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| Sources/Tokenizers/PreTokenizer.swift | Fixed init to handle prepend_scheme precedence and preTokenize to use switch-based logic without addPrefixSpace gate |
| Tests/TokenizersTests/PreTokenizerTests.swift | Added 9 new tests covering all prepend_scheme modes, backward compatibility, precedence, and edge cases |

@beshkenadze Thank you for reporting and contributing a fix! This looks right to me. Merging this now.

This is now available in 1.1.8
Cherry-picked from huggingface/swift-transformers#319 (f3d5cbf).
Summary
- Fixes MetaspacePreTokenizer.preTokenize(), which never prepended ▁ when add_prefix_space was absent from the tokenizer config, even when prepend_scheme was set to "always", breaking XLM-RoBERTa and any SentencePiece Unigram model relying on Metaspace with prepend_scheme: "always".
- Aligns with the canonical Rust implementation, where prepend_scheme is the sole runtime authority for prepending behavior.
- Adds 9 tests covering all prepend_scheme modes, backward compatibility with legacy add_prefix_space, precedence rules, and edge cases.
Fixes #318
Root Cause
In the previous code, addPrefixSpace (defaulting to false when absent) was used as an outer gate for the prependScheme logic.
When add_prefix_space was absent from the config (which is normal for XLM-RoBERTa), addPrefixSpace defaulted to false, and the entire prepend block was never executed, regardless of prependScheme.
Changes
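As a rough illustration of the preTokenize change, the sketch below contrasts the old gated check with a switch on prependScheme. The names prependOld and prependNew, the isFirstSection parameter, and the simplified PrependScheme enum are assumptions made for this example, not the library's actual code, and only the prepend step is modelled.

```swift
// Hypothetical, simplified stand-in for the library's scheme type.
enum PrependScheme {
    case first, never, always
}

let replacement = "▁"

/// Sketch of the old behaviour: addPrefixSpace gates the whole prepend block.
func prependOld(_ text: String, addPrefixSpace: Bool, scheme: PrependScheme,
                isFirstSection: Bool = true) -> String {
    // When add_prefix_space is absent this flag is false,
    // so the scheme below is never consulted.
    guard addPrefixSpace, !text.hasPrefix(replacement) else { return text }
    if scheme == .always || (scheme == .first && isFirstSection) {
        return replacement + text
    }
    return text
}

/// Sketch of the new behaviour: prependScheme alone decides.
func prependNew(_ text: String, scheme: PrependScheme,
                isFirstSection: Bool = true) -> String {
    let alreadyPrefixed = text.hasPrefix(replacement)
    switch scheme {
    case .always:
        return alreadyPrefixed ? text : replacement + text
    case .first:
        // Assumed: only the first section of the input receives the prefix.
        return (isFirstSection && !alreadyPrefixed) ? replacement + text : text
    case .never:
        return text
    }
}
```

With addPrefixSpace false and the scheme set to .always, prependOld("Hello", ...) returns "Hello" unchanged, while prependNew("Hello", scheme: .always) returns "▁Hello".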
Sources/Tokenizers/PreTokenizer.swift
init: prependScheme now resolves with backward compat:
- If prepend_scheme is explicit in config → use it directly
- Otherwise, fall back to add_prefix_space (true/absent → .always, false → .never)
preTokenize: removed the addPrefixSpace gate, replaced with switch prependScheme.
Tests/TokenizersTests/PreTokenizerTests.swift
9 new test functions:
- always without add_prefix_space
- first with section options
- never
- add_prefix_space: true → .always
- add_prefix_space: false → .never
- add_prefix_space absent → .always
- always supersedes add_prefix_space: false
- never supersedes add_prefix_space: true
Testing
(… robertaXLMTokenizer integration test)
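
The resolution order that init follows can be sketched as a small pure function. This is a minimal sketch under stated assumptions: resolvePrependScheme is a hypothetical helper, the config values are passed as plain optionals rather than read from the library's config type, and falling back to the legacy flag when the scheme string is unrecognized is a choice made for this example.

```swift
// Hypothetical, simplified stand-in for the library's scheme type.
enum PrependScheme: String {
    case first, never, always
}

/// Hypothetical helper mirroring the precedence described in this PR:
/// an explicit prepend_scheme wins, otherwise add_prefix_space decides.
func resolvePrependScheme(prependScheme: String?, addPrefixSpace: Bool?) -> PrependScheme {
    // 1. An explicit prepend_scheme in the config is used directly.
    if let raw = prependScheme, let scheme = PrependScheme(rawValue: raw) {
        return scheme
    }
    // 2. Legacy fallback: true or absent maps to .always, false to .never.
    //    (Assumption: an unrecognized scheme string also lands here.)
    return (addPrefixSpace ?? true) ? .always : .never
}
```

Under these assumptions, a config with prepend_scheme "always" and add_prefix_space false resolves to .always, and a config with neither key resolves to .always as well.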