feat: updates to Docling Remote and Chunker components #11684

Merged
Adam-Aghili merged 33 commits into main from chunk-docling-document-component-changes
Feb 24, 2026

Conversation

Contributor

@ricofurtado ricofurtado commented Feb 9, 2026

This pull request adds comprehensive unit tests for the ChunkDoclingDocumentComponent to ensure correct handling of HybridChunker parameters and updates the component configuration in component_index.json to support new flags and improve usability. The main focus is on supporting and testing the new merge_peers and always_emit_headings options for chunking documents.

Component configuration enhancements:

  • Added merge_peers and always_emit_headings as configurable attributes for the ChunkDoclingDocumentComponent, including their default values and UI metadata. (src/lfx/src/lfx/_assets/component_index.json) [1] [2]
  • Updated the input template for chunker to include new input types and options, improving flexibility for document chunking. (src/lfx/src/lfx/_assets/component_index.json)
  • Set default value fields for several integer attributes to ensure proper initialization in the UI and backend. (src/lfx/src/lfx/_assets/component_index.json) [1] [2] [3] [4] [5]

Testing improvements:

  • Added unit tests to verify that the ChunkDoclingDocumentComponent correctly updates its build configuration based on chunker and provider selections, and that HybridChunker receives the appropriate flags for merge_peers and always_emit_headings. (src/backend/tests/unit/components/docling/test_chunk_docling_document_component.py)
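A minimal sketch of the kind of check described in the testing bullet, assuming a hypothetical helper `make_chunker` that forwards the two flags to a chunker factory. Both `make_chunker` and the mocked factory are illustrative stand-ins; the PR's actual test file and the real HybridChunker signature are not reproduced here.

```python
from unittest.mock import MagicMock


def make_chunker(factory, tokenizer, merge_peers, always_emit_headings):
    """Hypothetical helper: forward the component's flags to a chunker factory."""
    return factory(
        tokenizer=tokenizer,
        merge_peers=bool(merge_peers),
        always_emit_headings=bool(always_emit_headings),
    )


def test_flags_are_forwarded():
    # The mock stands in for the HybridChunker class, so docling is not required.
    fake_hybrid_chunker = MagicMock(name="HybridChunker")

    make_chunker(
        fake_hybrid_chunker,
        tokenizer="tok",
        merge_peers=False,
        always_emit_headings=True,
    )

    # Fails if either flag is dropped or passed with the wrong value.
    fake_hybrid_chunker.assert_called_once_with(
        tokenizer="tok", merge_peers=False, always_emit_headings=True
    )


test_flags_are_forwarded()
```

Mocking the factory keeps the test focused on argument forwarding rather than on chunking behavior.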

…gDocumentComponent `pragma: allowlist secret`
@github-actions github-actions bot added the community Pull Request from an external contributor label Feb 9, 2026
Contributor

coderabbitai bot commented Feb 9, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Two new boolean input parameters (merge_peers and always_emit_headings) are added to the ChunkDoclingDocumentComponent with visibility logic that displays them only when HybridChunker is the active chunker. These parameters are passed to HybridChunker initialization during document processing.
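The visibility logic in the walkthrough can be sketched roughly as follows. This is an illustration under assumptions, not the component's code: `build_config` is modeled as a plain dict with a `show` key per field, and `"HierarchicalChunker"` is only an assumed name for a non-hybrid option.

```python
from __future__ import annotations

HYBRID_ONLY_FIELDS = ("merge_peers", "always_emit_headings")


def update_build_config(build_config: dict, field_value: str, field_name: str | None = None) -> dict:
    """Show the HybridChunker-only inputs only when HybridChunker is selected."""
    if field_name == "chunker":
        is_hybrid = field_value == "HybridChunker"
        for name in HYBRID_ONLY_FIELDS:
            build_config[name]["show"] = is_hybrid
    return build_config


# Start with both inputs visible, then switch to a non-hybrid chunker.
config = {name: {"show": True} for name in HYBRID_ONLY_FIELDS}
update_build_config(config, "HierarchicalChunker", "chunker")
print(config["merge_peers"]["show"])  # False
```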

Changes

| Cohort / File(s) | Summary |
|---|---|
| New ChunkDoclingDocumentComponent inputs: src/lfx/src/lfx/_assets/component_index.json, src/lfx/src/lfx/components/docling/chunk_docling_document.py | Added two new boolean inputs (merge_peers, always_emit_headings) with display names, descriptions, and default values. Extended build-config logic to show/hide these inputs when HybridChunker is active. Updated HybridChunker instantiation to receive these parameters from component state. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 6

❌ Failed checks (1 error, 4 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Test Coverage For New Implementations | ❌ Error | PR introduces new functionality (merge_peers and always_emit_headings parameters) without any corresponding test coverage. | Add unit tests for parameter validation and integration tests verifying HybridChunker receives correct parameters. Address API incompatibility of always_emit_headings parameter. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Quality And Coverage | ⚠️ Warning | PR adds two new boolean parameters to ChunkDoclingDocumentComponent but includes zero test files or test modifications. | Add unit tests validating parameter storage, visibility toggle logic, and HybridChunker instantiation; add integration tests for full chunk_documents() workflow. |
| Test File Naming And Structure | ⚠️ Warning | The pull request adds two new component options (merge_peers and always_emit_headings) but includes no test files following standard naming conventions (test_*.py or *.test.ts). | Add test_chunk_docling_document.py with tests validating the new options are properly exposed, passed to HybridChunker, and handle edge cases including unsupported parameters. |
| Title check | ⚠️ Warning | The title mentions generic updates to multiple components but the actual changes focus specifically on adding two new options (merge_peers and always_emit_headings) to ChunkDoclingDocumentComponent only. | Update the title to be more specific, such as: 'feat: add merge_peers and always_emit_headings options to ChunkDoclingDocumentComponent' to accurately reflect the primary changes. |
| Excessive Mock Usage Warning | ❓ Inconclusive | No test files for ChunkDoclingDocumentComponent were found in the pull request to assess mock usage patterns. | Provide test file paths for ChunkDoclingDocumentComponent or clarify if this PR includes test coverage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Contributor

github-actions bot commented Feb 9, 2026

Frontend Unit Test Coverage Report

Coverage Summary

| Lines | Statements | Branches | Functions |
|---|---|---|---|
| Coverage: 19% | 18.8% (6095/32404) | 12.25% (3097/25275) | 12.64% (879/6952) |

Unit Test Results

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 2310 | 0 💤 | 0 ❌ | 0 🔥 | 32.267s ⏱️ |

codecov bot commented Feb 9, 2026

Codecov Report

❌ Patch coverage is 5.47945% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 35.30%. Comparing base (cb22542) to head (c315ced).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/lfx/src/lfx/inputs/inputs.py | 5.47% | 68 Missing and 1 partial ⚠️ |

❌ Your patch status has failed because the patch coverage (5.47%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (41.93%) is below the target coverage (60.00%). You can increase the head coverage or adjust the target coverage.
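As a quick sanity check on the patch figure, the arithmetic can be reproduced as below. The covered-line count of 4 is an inference (it is the value that makes the reported percentage and the 69 uncovered lines agree) and is not stated in the report itself.

```python
missing, partial = 68, 1            # from the "Files with missing lines" row
uncovered = missing + partial       # the 69 lines reported as missing coverage
covered = 4                         # inferred, not stated in the report
patch_lines = covered + uncovered   # 73 changed lines tracked by coverage

patch_coverage = 100 * covered / patch_lines
print(f"{patch_coverage:.5f}%")     # 5.47945%

target = 40.0
print(patch_coverage >= target)     # False, so the patch status fails
```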

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #11684      +/-   ##
==========================================
- Coverage   35.33%   35.30%   -0.04%     
==========================================
  Files        1525     1525              
  Lines       73302    73365      +63     
  Branches    11025    11041      +16     
==========================================
  Hits        25898    25898              
- Misses      45991    46055      +64     
+ Partials     1413     1412       -1     
| Flag | Coverage Δ |
|---|---|
| backend | 55.83% <ø> (-0.02%) ⬇️ |
| frontend | 16.98% <ø> (ø) |
| lfx | 41.93% <5.47%> (-0.11%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| src/lfx/src/lfx/inputs/inputs.py | 57.95% <5.47%> (-11.60%) ⬇️ |

... and 11 files with indirect coverage changes

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/lfx/src/lfx/_assets/component_index.json`:
- Line 64090: Summary: remove the unsupported always_emit_headings parameter and
input. Fix: in ChunkDoclingDocumentComponent remove the Message/Bool Input
definition for "always_emit_headings" from the inputs list and remove any
build_config toggles referencing "always_emit_headings" in update_build_config;
also remove the argument always_emit_headings=bool(self.always_emit_headings)
passed into the HybridChunker() instantiation inside chunk_documents (and any
uses of self.always_emit_headings). References to change: the inputs list entry
named "always_emit_headings", the update_build_config branch that sets
build_config["always_emit_headings"][...] and the HybridChunker(...) call in
chunk_documents.

In `@src/lfx/src/lfx/components/docling/chunk_docling_document.py`:
- Around line 183-187: The instantiation of HybridChunker is passing an
unsupported parameter always_emit_headings which will raise a TypeError; remove
the always_emit_headings argument from the HybridChunker(...) call (leave
tokenizer=tokenizer and merge_peers=bool(self.merge_peers)), or if you intend to
control heading inclusion, replace it with the supported parameter
include_heading_hierarchy and pass the appropriate boolean (e.g.,
include_heading_hierarchy=bool(self.include_heading_hierarchy)) so the
HybridChunker call uses only valid kwargs.
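The failure mode described here (an unexpected keyword argument raising a TypeError) can be illustrated with a stand-in class. `FakeChunker` and `supported_kwargs` are hypothetical names for this sketch; whether the real HybridChunker rejects `always_emit_headings` depends on the installed docling version and is not verified here.

```python
import inspect


class FakeChunker:
    """Stand-in for a chunker class that does not accept always_emit_headings."""

    def __init__(self, tokenizer, merge_peers=True):
        self.tokenizer = tokenizer
        self.merge_peers = merge_peers


def supported_kwargs(cls, **kwargs):
    """Keep only the keyword arguments that cls.__init__ declares."""
    params = inspect.signature(cls.__init__).parameters
    # A constructor that takes **kwargs accepts everything, so filter nothing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return kwargs
    return {k: v for k, v in kwargs.items() if k in params}


requested = {"tokenizer": "tok", "merge_peers": False, "always_emit_headings": True}

try:
    FakeChunker(**requested)
except TypeError as exc:
    print(f"rejected: {exc}")

# Passing only the declared parameters avoids the TypeError.
chunker = FakeChunker(**supported_kwargs(FakeChunker, **requested))
```

Filtering by signature is one defensive option; removing the unsupported argument outright, as the comment above suggests, is the simpler fix when the parameter is known not to exist.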
🧹 Nitpick comments (1)
src/lfx/src/lfx/_assets/component_index.json (1)

72454-72454: Unrelated dependency version bumps included in this PR.

Hunks 5–14 update google to 2.5.0 and vlmrun to 0.5.4 across multiple components. These changes are unrelated to the stated PR objective (adding merge_peers and always_emit_headings). Consider whether these should be in a separate PR for cleaner change tracking, or confirm they were intentionally bundled (e.g., via an index regeneration script).

@ricofurtado ricofurtado changed the title from "Add merge_peers and always_emit_headings options to ChunkDoclingDocumentComponent" to "feat: Added "merge_peers" and "always_emit_headings" options to ChunkDoclingDocumentComponent" Feb 12, 2026
@ricofurtado ricofurtado requested a review from mpawlow February 12, 2026 20:42
@github-actions github-actions bot added the enhancement New feature or request label Feb 12, 2026
Contributor

@mpawlow mpawlow left a comment

@ricofurtado

Code Review 1

  • See PR comments: (a), (b), (c)
  • Note: I did not perform a functional review

@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Feb 20, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to add two new options to the ChunkDoclingDocumentComponent: merge_peers (to merge undersized chunks with shared metadata) and always_emit_headings (to emit headings for empty sections). However, the implementation is incomplete.

Changes:

  • Added merge_peers BoolInput parameter (fully implemented and working)
  • Added always_emit_headings BoolInput parameter (declared but not implemented)
  • Updated update_build_config to show/hide both parameters based on chunker selection
  • Added else clause for unknown chunker types (defensive programming improvement)
  • Updated component hash and metadata files
  • Added unit tests for build config behavior

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| src/lfx/src/lfx/components/docling/chunk_docling_document.py | Added two BoolInput parameters, updated build config logic, passed merge_peers to HybridChunker, added else clause for unknown chunkers |
| src/lfx/src/lfx/_assets/stable_hash_history.json | Updated component hash from d84ce7ffc6cb to dfde83c23a83 |
| src/lfx/src/lfx/_assets/component_index.json | Added merge_peers to field_order and field definitions, updated code hash and embedded code value |
| src/backend/tests/unit/components/docling/test_chunk_docling_document_component.py | Added tests for build config behavior with new parameters and merge_peers functionality |

@github-actions github-actions bot added the enhancement New feature or request label Feb 23, 2026
Contributor

@mpawlow mpawlow left a comment

@ricofurtado @edwinjosechittilappilly

Code Review 2

  • Approved / LGTM
  • See PR comment (2a) for a Minor concern

info=("Which chunker to use."),
value="HybridChunker",
real_time_refresh=True,
input_types=["Message"],

Contributor

(2a) [Minor] Verify a piped in Message value will not cause errors

  • Concern: The chunker dropdown value drives update_build_config. If a Message is connected to the input instead of a value being selected from the dropdown, the real_time_refresh mechanism and the build_config["chunker"]["value"] check in update_build_config may not work as expected.
  • This is a Minor severity comment. Feel free to address or ignore it.
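One way to make such a check tolerant of both sources is to normalize the value before comparing it. The sketch below is an assumption-laden illustration: `FakeMessage` stands in for a Message object that carries its payload in `.text`, and `chunker_name` and `is_hybrid` are hypothetical helpers, not functions from the PR.

```python
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Stand-in for a Message object carrying its payload in .text."""

    text: str


def chunker_name(value) -> str:
    """Return the chunker name as a plain string, from a dropdown value or a Message."""
    text = getattr(value, "text", value)
    return str(text).strip()


def is_hybrid(build_config: dict) -> bool:
    """Compare the normalized chunker value against the HybridChunker option."""
    return chunker_name(build_config["chunker"]["value"]) == "HybridChunker"


print(is_hybrid({"chunker": {"value": "HybridChunker"}}))               # True
print(is_hybrid({"chunker": {"value": FakeMessage("HybridChunker")}}))  # True
```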

Collaborator

Cool!
Thanks Mike!

Collaborator

@HimavarshaVS let's send this to QA, and we can update accordingly?

Collaborator

@mpawlow True. The idea is that we would use a Message to connect it to a Global variable, in case we want to switch chunkers via an API call at runtime.

Collaborator

This has been approved by QA. Once CI passes, we should be good to merge.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 6 comments.


Comment on lines +490 to +494
if isinstance(v, str):
    v = v.strip()
    if not v:
        return 0
    try:

Copilot AI Feb 23, 2026

IntInput now accepts and coerces values from Message/Data and from numeric strings (e.g., "42", "3.14"). The existing input unit tests only cover native int/float values; adding test cases for these new accepted input forms (and their failure modes) would help prevent regressions.
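A rough sketch of the extra test cases this comment asks for, written against a hypothetical `coerce_int` helper rather than the real IntInput validator. Truncating "3.14" to 3 and mapping an empty string to 0 are assumptions about the intended behavior.

```python
def coerce_int(value) -> int:
    """Hypothetical coercion: numeric strings become ints, blank strings become 0."""
    if isinstance(value, str):
        value = value.strip()
        if not value:
            return 0
        try:
            # Going through float lets "3.14" parse; the fraction is truncated.
            return int(float(value))
        except ValueError:
            raise ValueError(f"Could not convert '{value}' to integer.") from None
    return int(value)


# Accepted string forms and the values they are assumed to produce.
cases = {"42": 42, " 7 ": 7, "3.14": 3, "": 0}
for raw, expected in cases.items():
    assert coerce_int(raw) == expected

# Failure mode: a non-numeric string should raise rather than default silently.
try:
    coerce_int("abc")
except ValueError as exc:
    print(exc)
```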

Comment on lines +482 to +485
 if isinstance(v, int):
     return v
 if isinstance(v, float):
-    v = int(v)
-return v
+    return int(v)

Copilot AI Feb 23, 2026

IntInput.validate_value treats bool as an int (since bool is a subclass of int) and returns it unchanged. That can leave value as True/False instead of an actual integer (1/0) and makes the "integer" input behave unexpectedly; consider explicitly handling bool before the int check (convert or reject).
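The subclass relationship behind this comment is easy to demonstrate. The `validate_int` function below is a hypothetical sketch of the "handle bool first" ordering; it converts booleans to 1/0, and rejecting them instead would be an equally valid policy.

```python
# bool is a subclass of int, so a plain isinstance(v, int) check lets it through.
print(isinstance(True, int))  # True


def validate_int(v):
    """Hypothetical validator that checks for bool before the generic int branch."""
    if isinstance(v, bool):
        return int(v)  # convert True/False to 1/0; raising here is the stricter option
    if isinstance(v, int):
        return v
    if isinstance(v, float):
        return int(v)
    raise TypeError(f"Invalid value type {type(v)}")


result = validate_int(True)
print(type(result).__name__, result)  # int 1
```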

if isinstance(v, Message):
    v = v.text
elif isinstance(v, Data):
    v = v.data.get(v.text_key, "")

Copilot AI Feb 23, 2026

For Data inputs, IntInput.validate_value uses v.data.get(v.text_key, ""), which silently maps a missing text_key to an empty string and then to 0. This differs from MessageTextInput/SecretStrInput, which raise a clear error when the key is missing; aligning behavior would prevent silent misconfiguration.

Suggested change
-v = v.data.get(v.text_key, "")
+# For Data inputs, ensure the expected text_key exists to avoid silently
+# mapping a missing key to an empty string (and then to 0).
+if v.text_key not in v.data:
+    input_name = info.data.get("name", "unknown")
+    msg = (
+        f"Missing key '{v.text_key}' in Data for input {input_name}."
+    )
+    raise ValueError(msg)
+v = v.data[v.text_key]

        input_name = info.data.get("name", "unknown")
        msg = f"Could not convert '{v}' to integer for input {input_name}."
        raise ValueError(msg) from None
if not v:

Copilot AI Feb 23, 2026

The final if not v: return 0 fallback will coerce any falsy non-string value (e.g., [], {}, set()) to 0 instead of rejecting invalid types. This can mask upstream payload bugs; consider restricting the defaulting behavior to v is None (and keep the separate empty-string handling) and raising for other non-numeric types.

Suggested change
-if not v:
+if v is None:
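To see why the truthiness test is broader than it looks, the snippet below lists values that all satisfy `not v`, then contrasts that with an explicit `None` check. `fallback_int` is a hypothetical helper covering only the final fallback branch, not the full validator.

```python
# Every one of these is falsy, so "if not v: return 0" would accept them all.
for candidate in (None, "", 0, [], {}, set()):
    assert not candidate


def fallback_int(v):
    """Hypothetical final branch: default only for None, reject other leftovers."""
    if v is None:
        return 0
    raise TypeError(f"Invalid value type {type(v)}")


print(fallback_int(None))  # 0

try:
    fallback_int([])
except TypeError as exc:
    print(exc)
```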

Comment on lines +542 to +558
if isinstance(v, Message):
    v = v.text
elif isinstance(v, Data):
    v = v.data.get(v.text_key, "")
if isinstance(v, str):
    v = v.strip()
    if not v:
        return 0.0
    try:
        return float(v)
    except ValueError:
        input_name = info.data.get("name", "unknown")
        msg = f"Could not convert '{v}' to float for input {input_name}."
        raise ValueError(msg) from None
if not v:
    return 0.0
msg = f"Invalid value type {type(v)} for input {info.data.get('name')}."

Copilot AI Feb 23, 2026

FloatInput.validate_value has the same Data-key issue as IntInput: v.data.get(v.text_key, "") silently becomes 0.0 when the key is missing (or when an empty dict/list is passed and hits the if not v fallback). Consider raising a descriptive error when text_key is absent and avoiding coercion of arbitrary falsy containers to 0.0.

Comment on lines 595 to +607
    field_type: SerializableFieldTypes = FieldTypes.NESTED_DICT
    value: dict | None = {}

    @field_validator("value", mode="before")
    @classmethod
    def validate_value(cls, v: Any, info):
        if v is None or isinstance(v, dict):
            return v
        if isinstance(v, Message):
            v = v.text
        elif isinstance(v, Data):
            v = v.data.get(v.text_key, "")
        if isinstance(v, str):

Copilot AI Feb 23, 2026

NestedDictInput now parses JSON strings, but it still (1) uses value: dict | None = {} while DictInput uses Field(default_factory=dict) and (2) uses v.data.get(v.text_key, ""), which silently turns missing keys into {}. Using default_factory and raising when text_key is missing would make behavior consistent and avoid surprising shared defaults/silent drops.
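The JSON-string handling mentioned here might look roughly like the sketch below. `parse_nested_dict` is a hypothetical standalone function, not the validator from the PR; treating a blank string as an empty dict and rejecting non-object JSON are assumptions about the desired behavior.

```python
import json


def parse_nested_dict(value):
    """Hypothetical parser: pass dicts through, decode JSON strings into dicts."""
    if value is None or isinstance(value, dict):
        return value
    if isinstance(value, str):
        value = value.strip()
        if not value:
            return {}  # assumption: a blank string means "no entries"
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON: {exc.msg}") from None
        if not isinstance(parsed, dict):
            raise ValueError("JSON value must be an object.")
        return parsed
    raise TypeError(f"Invalid value type {type(value)}")


print(parse_nested_dict('{"a": {"b": 1}}'))  # {'a': {'b': 1}}
```

Rejecting arrays and scalars explicitly keeps a malformed payload from being stored as something other than a dict.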

@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Feb 24, 2026
@ricofurtado ricofurtado disabled auto-merge February 24, 2026 19:12
@github-actions github-actions bot removed the enhancement New feature or request label Feb 24, 2026
Labels

  • community: Pull Request from an external contributor
  • enhancement: New feature or request
  • lgtm: This PR has been approved by a maintainer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: [LF] Update Docling Chunck component in Langflow

6 participants