feat: updates to Docling Remote and Chunker components#11684
feat: updates to Docling Remote and Chunker components#11684Adam-Aghili merged 33 commits intomainfrom
Conversation
…gDocumentComponent `pragma: allowlist secret`
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughTwo new boolean input parameters ( Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 6❌ Failed checks (1 error, 4 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (5.47%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #11684 +/- ##
==========================================
- Coverage 35.33% 35.30% -0.04%
==========================================
Files 1525 1525
Lines 73302 73365 +63
Branches 11025 11041 +16
==========================================
Hits 25898 25898
- Misses 45991 46055 +64
+ Partials 1413 1412 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@src/lfx/src/lfx/_assets/component_index.json`:
- Line 64090: Summary: remove the unsupported always_emit_headings parameter and
input. Fix: in ChunkDoclingDocumentComponent remove the Message/Bool Input
definition for "always_emit_headings" from the inputs list and remove any
build_config toggles referencing "always_emit_headings" in update_build_config;
also remove the argument always_emit_headings=bool(self.always_emit_headings)
passed into the HybridChunker() instantiation inside chunk_documents (and any
uses of self.always_emit_headings). References to change: the inputs list entry
named "always_emit_headings", the update_build_config branch that sets
build_config["always_emit_headings"][...] and the HybridChunker(...) call in
chunk_documents.
In `@src/lfx/src/lfx/components/docling/chunk_docling_document.py`:
- Around line 183-187: The instantiation of HybridChunker is passing an
unsupported parameter always_emit_headings which will raise a TypeError; remove
the always_emit_headings argument from the HybridChunker(...) call (leave
tokenizer=tokenizer and merge_peers=bool(self.merge_peers)), or if you intend to
control heading inclusion, replace it with the supported parameter
include_heading_hierarchy and pass the appropriate boolean (e.g.,
include_heading_hierarchy=bool(self.include_heading_hierarchy)) so the
HybridChunker call uses only valid kwargs.
🧹 Nitpick comments (1)
src/lfx/src/lfx/_assets/component_index.json (1)
72454-72454: Unrelated dependency version bumps included in this PR.Hunks 5–14 update
2.5.0andvlmrunto0.5.4across multiple components. These changes are unrelated to the stated PR objective (addingmerge_peersandalways_emit_headings). Consider whether these should be in a separate PR for cleaner change tracking, or confirm they were intentionally bundled (e.g., via an index regeneration script).
mpawlow
left a comment
There was a problem hiding this comment.
Code Review 1
- See PR comments: (a), (b), (c)
- Note: I did not perform a functional review
src/backend/tests/unit/components/docling/test_chunk_docling_document_component.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
This PR aims to add two new options to the ChunkDoclingDocumentComponent: merge_peers (to merge undersized chunks with shared metadata) and always_emit_headings (to emit headings for empty sections). However, the implementation is incomplete.
Changes:
- Added
merge_peersBoolInput parameter (fully implemented and working) - Added
always_emit_headingsBoolInput parameter (declared but not implemented) - Updated
update_build_configto show/hide both parameters based on chunker selection - Added else clause for unknown chunker types (defensive programming improvement)
- Updated component hash and metadata files
- Added unit tests for build config behavior
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/lfx/src/lfx/components/docling/chunk_docling_document.py | Added two BoolInput parameters, updated build config logic, passed merge_peers to HybridChunker, added else clause for unknown chunkers |
| src/lfx/src/lfx/_assets/stable_hash_history.json | Updated component hash from d84ce7ffc6cb to dfde83c23a83 |
| src/lfx/src/lfx/_assets/component_index.json | Added merge_peers to field_order and field definitions, updated code hash and embedded code value |
| src/backend/tests/unit/components/docling/test_chunk_docling_document_component.py | Added tests for build config behavior with new parameters and merge_peers functionality |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/backend/tests/unit/components/docling/test_chunk_docling_document_component.py
Show resolved
Hide resolved
There was a problem hiding this comment.
@ricofurtado @edwinjosechittilappilly
Code Review 2
- Approved / LGTM
- See PR comment (2a) for a Minor concern
| info=("Which chunker to use."), | ||
| value="HybridChunker", | ||
| real_time_refresh=True, | ||
| input_types=["Message"], |
There was a problem hiding this comment.
(2a) [Minor] Verify a piped in Message value will not cause errors
- Concern: The dropdown value for Message drives
update_build_config, and if aMessageis connected to it instead of selecting from the dropdown, thereal_time_refreshmechanism and thebuild_config["chunker"]["value"]check inupdate_build_configmay not work as expected. - This is a Minor severity comment. Please feel free to optionally address or ignore
There was a problem hiding this comment.
Cool!
Thanks Mike!
There was a problem hiding this comment.
@HimavarshaVS lets send this to QA ? and accordingly we can update?
There was a problem hiding this comment.
@mpawlow True, The idea is we would use mesasge to connect it to connect to Global variable in case if we want to switch using API call/runtime.
There was a problem hiding this comment.
This has been approved by QA. Once the CI is passed, we should be good to merge
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if isinstance(v, str): | ||
| v = v.strip() | ||
| if not v: | ||
| return 0 | ||
| try: |
There was a problem hiding this comment.
IntInput now accepts and coerces values from Message/Data and from numeric strings (e.g., "42", "3.14"). The existing input unit tests only cover native int/float values; adding test cases for these new accepted input forms (and their failure modes) would help prevent regressions.
| if isinstance(v, int): | ||
| return v | ||
| if isinstance(v, float): | ||
| v = int(v) | ||
| return v | ||
| return int(v) |
There was a problem hiding this comment.
IntInput.validate_value treats bool as an int (since bool is a subclass of int) and returns it unchanged. That can leave value as True/False instead of an actual integer (1/0) and makes the "integer" input behave unexpectedly; consider explicitly handling bool before the int check (convert or reject).
| if isinstance(v, Message): | ||
| v = v.text | ||
| elif isinstance(v, Data): | ||
| v = v.data.get(v.text_key, "") |
There was a problem hiding this comment.
For Data inputs, IntInput.validate_value uses v.data.get(v.text_key, ""), which silently maps a missing text_key to an empty string and then to 0. This differs from MessageTextInput/SecretStrInput, which raise a clear error when the key is missing; aligning behavior would prevent silent misconfiguration.
| v = v.data.get(v.text_key, "") | |
| # For Data inputs, ensure the expected text_key exists to avoid silently | |
| # mapping a missing key to an empty string (and then to 0). | |
| if v.text_key not in v.data: | |
| input_name = info.data.get("name", "unknown") | |
| msg = ( | |
| f"Missing key '{v.text_key}' in Data for input {input_name}." | |
| ) | |
| raise ValueError(msg) | |
| v = v.data[v.text_key] |
| input_name = info.data.get("name", "unknown") | ||
| msg = f"Could not convert '{v}' to integer for input {input_name}." | ||
| raise ValueError(msg) from None | ||
| if not v: |
There was a problem hiding this comment.
The final if not v: return 0 fallback will coerce any falsy non-string value (e.g., [], {}, set()) to 0 instead of rejecting invalid types. This can mask upstream payload bugs; consider restricting the defaulting behavior to v is None (and keep the separate empty-string handling) and raising for other non-numeric types.
| if not v: | |
| if v is None: |
| if isinstance(v, Message): | ||
| v = v.text | ||
| elif isinstance(v, Data): | ||
| v = v.data.get(v.text_key, "") | ||
| if isinstance(v, str): | ||
| v = v.strip() | ||
| if not v: | ||
| return 0.0 | ||
| try: | ||
| return float(v) | ||
| except ValueError: | ||
| input_name = info.data.get("name", "unknown") | ||
| msg = f"Could not convert '{v}' to float for input {input_name}." | ||
| raise ValueError(msg) from None | ||
| if not v: | ||
| return 0.0 | ||
| msg = f"Invalid value type {type(v)} for input {info.data.get('name')}." |
There was a problem hiding this comment.
FloatInput.validate_value has the same Data-key issue as IntInput: v.data.get(v.text_key, "") silently becomes 0.0 when the key is missing (or when an empty dict/list is passed and hits the if not v fallback). Consider raising a descriptive error when text_key is absent and avoiding coercion of arbitrary falsy containers to 0.0.
| field_type: SerializableFieldTypes = FieldTypes.NESTED_DICT | ||
| value: dict | None = {} | ||
|
|
||
| @field_validator("value", mode="before") | ||
| @classmethod | ||
| def validate_value(cls, v: Any, info): | ||
| if v is None or isinstance(v, dict): | ||
| return v | ||
| if isinstance(v, Message): | ||
| v = v.text | ||
| elif isinstance(v, Data): | ||
| v = v.data.get(v.text_key, "") | ||
| if isinstance(v, str): |
There was a problem hiding this comment.
NestedDictInput now parses JSON strings, but it still (1) uses value: dict | None = {} while DictInput uses Field(default_factory=dict) and (2) uses v.data.get(v.text_key, ""), which silently turns missing keys into {}. Using default_factory and raising when text_key is missing would make behavior consistent and avoid surprising shared defaults/silent drops.
This pull request adds comprehensive unit tests for the
ChunkDoclingDocumentComponentto ensure correct handling of HybridChunker parameters and updates the component configuration incomponent_index.jsonto support new flags and improve usability. The main focus is on supporting and testing the newmerge_peersandalways_emit_headingsoptions for chunking documents.Component configuration enhancements:
merge_peersandalways_emit_headingsas configurable attributes for theChunkDoclingDocumentComponent, including their default values and UI metadata. (src/lfx/src/lfx/_assets/component_index.json) [1] [2]chunkerto include new input types and options, improving flexibility for document chunking. (src/lfx/src/lfx/_assets/component_index.json)valuefields for several integer attributes to ensure proper initialization in the UI and backend. (src/lfx/src/lfx/_assets/component_index.json) [1] [2] [3] [4] [5]Testing improvements:
ChunkDoclingDocumentComponentcorrectly updates its build configuration based on chunker and provider selections, and that HybridChunker receives the appropriate flags formerge_peersandalways_emit_headings. (src/backend/tests/unit/components/docling/test_chunk_docling_document_component.py)