feat: Docling components #8394
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Walkthrough
The changes introduce a new "Docling" integration across the backend and frontend. Backend additions include new components for chunking, inlining, exporting, loading, and remote processing of Docling documents, along with utility functions and dependency updates. Frontend changes add a Docling icon, update the icon mappings, and extend the sidebar bundles to include Docling.
Sequence Diagram(s)
sequenceDiagram
participant User
participant Frontend
participant Backend
participant DoclingLib
participant DoclingServeAPI
User->>Frontend: Selects Docling feature (chunk, inline, export, load, remote)
Frontend->>Backend: Sends request with files/data and Docling parameters
alt Local processing
Backend->>DoclingLib: Processes documents (chunking, inlining, exporting, loading)
DoclingLib-->>Backend: Returns processed DoclingDocument(s) or export result
else Remote processing
Backend->>DoclingServeAPI: Sends base64 encoded files for async conversion
DoclingServeAPI-->>Backend: Returns task IDs
Backend->>DoclingServeAPI: Polls task status with retry logic
DoclingServeAPI-->>Backend: Returns conversion results
Backend->>DoclingLib: Validates and parses DoclingDocument JSON
end
Backend-->>Frontend: Returns processed data/results
Frontend-->>User: Displays Docling results with Docling icon in sidebar
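The remote branch of the diagram (base64-encode the files, submit them, poll task status with retry logic) might look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: `encode_file`, `poll_until_done`, the status strings "success" and "failure", and all parameter names are hypothetical and are not taken from the PR or from the Docling Serve API.

```python
import base64
import time
from typing import Callable


def encode_file(content: bytes) -> str:
    """Base64-encode raw file bytes so they can travel in a JSON request body."""
    return base64.b64encode(content).decode("ascii")


def poll_until_done(
    fetch_status: Callable[[str], str],
    task_id: str,
    *,
    interval: float = 2.0,
    max_wait: float = 120.0,
    max_errors: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a task until it reports a terminal status or the deadline passes.

    The terminal status names are assumptions made for this sketch.
    """
    deadline = clock() + max_wait
    errors = 0
    while clock() < deadline:
        try:
            status = fetch_status(task_id)
            errors = 0  # a good response resets the consecutive-error counter
        except ConnectionError:
            errors += 1
            if errors > max_errors:
                raise  # give up after more than max_errors failures in a row
            sleep(interval)
            continue
        if status in ("success", "failure"):
            return status
        sleep(interval)
    msg = f"Task {task_id} did not finish within {max_wait} seconds"
    raise TimeoutError(msg)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; a caller would normally leave them at their defaults and supply a `fetch_status` function that performs the HTTP request.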
Signed-off-by: DKL <dkl@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Actionable comments posted: 4
🧹 Nitpick comments (8)
pyproject.toml (1)
222-227: Note temporary override for python-pptx.
This override in [tool.uv] forces python-pptx>=1.0.2 to address compatibility with document processing. Consider adding a TODO to remove this once upstream fixes are released.
src/backend/base/langflow/components/docling/export_docling_document.py (4)
48-48: Fix typo in info text.
There's a typo in the info text: "betweek" should be "between".
- info="Add this placeholder betweek pages in the markdown output.",
+ info="Add this placeholder between pages in the markdown output.",
66-127: Consider refactoring to reduce complexity.
The static analysis correctly identifies that this method has too many branches (16/12). The input validation logic could be extracted into a separate method to improve readability and maintainability.
Consider extracting the document extraction logic into a helper method:
+    def _extract_documents(self) -> list[DoclingDocument]:
+        from docling_core.types.doc import DoclingDocument
+
+        if isinstance(self.data_inputs, DataFrame):
+            if not len(self.data_inputs):
+                msg = "DataFrame is empty"
+                raise TypeError(msg)
+            try:
+                return self.data_inputs[self.doc_key].to_list()
+            except Exception as e:
+                msg = f"Error extracting DoclingDocument from DataFrame: {e}"
+                raise TypeError(msg) from e
+        # ... rest of extraction logic
+
     def export_document(self) -> list[Data]:
-        from docling_core.types.doc import DoclingDocument, ImageRefMode
-
-        documents: list[DoclingDocument] = []
-        # ... complex validation logic
+        from docling_core.types.doc import ImageRefMode
+
+        documents = self._extract_documents()
         # ... export logic
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 66-66: Too many branches (16/12)
(R0912)
122-122: Address the TODO comment.
The TODO indicates missing metadata functionality. This could enhance the exported data's usefulness.
Would you like me to help implement the metadata addition functionality or open a new issue to track this enhancement?
125-125: Improve error message accuracy.
The error message mentions "splitting text", but this method exports documents rather than splitting them.
- msg = f"Error splitting text: {e}"
+ msg = f"Error exporting document: {e}"
src/backend/base/langflow/components/docling/chunk_docling_document.py (2)
11-11: Fix typo in description.
There's a typo in the description: "DocumentDocument" should be "DoclingDocument".
- description: str = "Use the DocumentDocument chunkers to split the document into chunks."
+ description: str = "Use the DoclingDocument chunkers to split the document into chunks."
45-46: Remove unused helper method.
The _docs_to_data method is defined but never used in this component. Consider removing it to reduce code clutter.
-    def _docs_to_data(self, docs) -> list[Data]:
-        return [Data(text=doc.page_content, data=doc.metadata) for doc in docs]
-
src/backend/base/langflow/components/docling/load_docling_document.py (1)
27-52: Consider the commented text export line.
The implementation is correct, with proper error handling and local imports. However, there's a commented line for text export that might indicate incomplete functionality.
# "text": doc.export_to_markdown(),
Consider either removing this comment if the functionality isn't needed, or implementing it if it provides value to downstream components.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (3)
src/frontend/package-lock.json is excluded by !**/package-lock.json
src/frontend/src/icons/Docling/Docling.svg is excluded by !**/*.svg
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
pyproject.toml (2 hunks)
src/backend/base/langflow/components/docling/__init__.py (1 hunks)
src/backend/base/langflow/components/docling/chunk_docling_document.py (1 hunks)
src/backend/base/langflow/components/docling/docling_inline.py (1 hunks)
src/backend/base/langflow/components/docling/export_docling_document.py (1 hunks)
src/backend/base/langflow/components/docling/load_docling_document.py (1 hunks)
src/frontend/src/icons/Docling/Docling.jsx (1 hunks)
src/frontend/src/icons/Docling/index.tsx (1 hunks)
src/frontend/src/icons/eagerIconImports.ts (2 hunks)
src/frontend/src/icons/lazyIconImports.ts (1 hunks)
src/frontend/src/utils/styleUtils.ts (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (4)
src/frontend/src/icons/Docling/index.tsx (1)
src/frontend/src/icons/Docling/Docling.jsx (1)
SvgDocling(1-336)
src/frontend/src/icons/eagerIconImports.ts (1)
src/frontend/src/icons/Docling/index.tsx (1)
DoclingIcon(4-9)
src/backend/base/langflow/components/docling/__init__.py (4)
src/backend/base/langflow/components/docling/chunk_docling_document.py (1)
ChunkDoclingDocumentComponent(9-119)
src/backend/base/langflow/components/docling/docling_inline.py (1)
DoclingInlineComponent(6-130)
src/backend/base/langflow/components/docling/export_docling_document.py (1)
ExportDoclingDocumentComponent(6-131)
src/backend/base/langflow/components/docling/load_docling_document.py (1)
LoadDoclingDocumentComponent(7-52)
src/backend/base/langflow/components/docling/chunk_docling_document.py (4)
src/backend/base/langflow/inputs/inputs.py (3)
DropdownInput(467-491)
HandleInput(76-87)
MessageTextInput(205-256)
src/backend/base/langflow/schema/data.py (1)
Data(23-275)
src/backend/base/langflow/schema/dataframe.py (1)
DataFrame(11-206)
src/backend/base/langflow/components/docling/export_docling_document.py (1)
as_dataframe(130-131)
🪛 Biome (1.9.4)
src/frontend/src/icons/Docling/index.tsx
[error] 6-6: Don't use '{}' as a type.
Prefer explicitly define the object shape. '{}' means "any non-nullable value".
(lint/complexity/noBannedTypes)
🪛 Pylint (3.3.7)
src/backend/base/langflow/components/docling/chunk_docling_document.py
[refactor] 48-48: Too many branches (16/12)
(R0912)
src/backend/base/langflow/components/docling/export_docling_document.py
[refactor] 66-66: Too many branches (16/12)
(R0912)
src/backend/base/langflow/components/docling/docling_inline.py
[refactor] 73-73: Too many local variables (17/15)
(R0914)
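The two "too many branches" findings above both come from methods that validate their inputs and do the real work in the same body. One way to bring the branch count down is to move the input normalisation into its own helper. The sketch below illustrates the idea with stand-in types; `Document`, `Record`, `extract_documents`, and `export_names` are hypothetical and only mimic the roles of DoclingDocument, Data, and the component methods.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in for DoclingDocument (hypothetical, for illustration only)."""

    name: str


@dataclass
class Record:
    """Stand-in for a Data-like object: a bag of key/value pairs."""

    data: dict[str, Any] = field(default_factory=dict)


def extract_documents(inputs: Any, doc_key: str = "doc") -> list[Document]:
    """Normalise the accepted input shapes into a flat list of documents."""
    if isinstance(inputs, Record):
        inputs = [inputs]  # a single record is treated as a one-element list
    if not isinstance(inputs, list) or not inputs:
        msg = "Expected a non-empty Record or list of Records"
        raise TypeError(msg)
    documents = []
    for item in inputs:
        doc = item.data.get(doc_key) if isinstance(item, Record) else None
        if not isinstance(doc, Document):
            msg = f"Item does not hold a Document under key '{doc_key}'"
            raise TypeError(msg)
        documents.append(doc)
    return documents


def export_names(inputs: Any) -> list[str]:
    """The calling method now contains only the export step."""
    return [doc.name.upper() for doc in extract_documents(inputs)]
```

All validation branches live in `extract_documents`, so the exporting method stays short and each function can be tested on its own.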
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Optimize new Python code in this PR
- GitHub Check: Update Starter Projects
🔇 Additional comments (11)
pyproject.toml (1)
130-130: Ensure Docling dependency version compatibility.
Verify that docling>=2.36.1 is compatible with the new components and does not introduce conflicts with existing dependencies.
src/frontend/src/utils/styleUtils.ts (1)
259-259: Register Docling in sidebar bundles.
The new { display_name: "Docling", name: "docling", icon: "Docling" } entry correctly integrates the Docling feature set into the sidebar.
src/frontend/src/icons/eagerIconImports.ts (2)
27-27: Import DoclingIcon for eager loading.
The new import import { DoclingIcon } from "@/icons/Docling"; correctly adds Docling to the eager icon registry.
145-145: Map Docling to DoclingIcon.
Adding "Docling": DoclingIcon to eagerIconsMapping ensures the Docling icon is available for immediate rendering.
src/frontend/src/icons/lazyIconImports.ts (1)
73-74: Add lazy-loaded Docling icon entry.
The new "Docling": () => import("@/icons/Docling").then((mod) => ({ default: mod.DoclingIcon })), enables the Docling icon to be fetched on demand.
src/backend/base/langflow/components/docling/__init__.py (1)
1-11: Well-structured package initialization.
The package initialization follows Python best practices with clear imports and a properly defined __all__ list that matches the imported components. This provides a clean public API for the Docling components module.
src/frontend/src/icons/Docling/Docling.jsx (1)
1-338: Well-implemented SVG icon component.
The React component follows best practices with proper props spreading and scalable dimensions. The complex SVG graphics are well-structured with appropriate use of gradients, transformations, and embedded imagery for the Docling brand representation.
src/backend/base/langflow/components/docling/chunk_docling_document.py (1)
96-114: Excellent chunking implementation with rich metadata.
The chunking logic is well-implemented, using the contextualize method to enrich chunks and properly extracting metadata including document ID and item references. The error handling appropriately catches and re-raises exceptions with descriptive messages.
src/backend/base/langflow/components/docling/load_docling_document.py (1)
1-26: LGTM! Well-structured component definition.
The component metadata, inheritance, and input/output definitions are properly implemented. The restriction to JSON files aligns with the component's purpose of loading DoclingDocument objects.
src/backend/base/langflow/components/docling/docling_inline.py (2)
48-71: LGTM! Well-configured input options.
The input definitions provide good configurability for different Docling pipelines and OCR engines while maintaining sensible defaults.
113-130: LGTM! Solid file processing and result handling.
The conversion logic correctly handles file filtering, processes documents through the converter, and properly maps results to Data objects with appropriate error handling.
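The comment on chunk_docling_document.py (lines 96-114) describes chunks that are enriched with context and carry a document ID plus item references. The shape of that output can be illustrated with stand-ins. Everything here is an assumption made for illustration: `Chunk`, its fields, this `contextualize` function, and the row keys are hypothetical, and the real Docling chunkers may produce different text and metadata.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    """Stand-in for a Docling chunk (hypothetical fields)."""

    text: str
    headings: list[str] = field(default_factory=list)
    item_refs: list[str] = field(default_factory=list)


def contextualize(chunk: Chunk) -> str:
    """Prefix the chunk text with its headings so it reads well in isolation."""
    return "\n".join([*chunk.headings, chunk.text])


def chunks_to_rows(doc_id: str, chunks: list[Chunk]) -> list[dict[str, Any]]:
    """Turn chunks into flat rows, one per chunk, with traceable metadata."""
    rows = []
    for index, chunk in enumerate(chunks):
        rows.append(
            {
                "text": contextualize(chunk),
                "document_id": doc_id,
                "chunk_id": index,
                "doc_items": list(chunk.item_refs),  # copy, so rows stay independent
            }
        )
    return rows
```

Keeping the document ID and item references on every row lets a downstream component trace a retrieved chunk back to its place in the source document.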
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/backend/base/langflow/components/docling/docling_inline.py (1)
72-129: 🛠️ Refactor suggestion
Address the complexity issue by refactoring converter setup.
The method has too many local variables (17/15 limit) as flagged by static analysis. This is the same issue identified in previous reviews.
Extract the pipeline configuration logic into separate class methods to reduce complexity:
+    def _create_standard_pipeline_options(self) -> "PdfPipelineOptions":
+        from docling.datamodel.pipeline_options import OcrOptions, PdfPipelineOptions
+        from docling.models.factories import get_ocr_factory
+
+        pipeline_options = PdfPipelineOptions()
+        pipeline_options.do_ocr = self.ocr_engine != ""
+
+        if pipeline_options.do_ocr:
+            ocr_factory = get_ocr_factory(allow_external_plugins=False)
+            ocr_options: OcrOptions = ocr_factory.create_options(kind=self.ocr_engine)
+            pipeline_options.ocr_options = ocr_options
+
+        return pipeline_options
+
+    def _create_vlm_pipeline_options(self) -> "VlmPipelineOptions":
+        from docling.datamodel.pipeline_options import VlmPipelineOptions
+        return VlmPipelineOptions()
+
+    def _get_converter(self) -> "DocumentConverter":
+        from docling.datamodel.base_models import InputFormat
+        from docling.document_converter import DocumentConverter, FormatOption, PdfFormatOption
+        from docling.pipeline.vlm_pipeline import VlmPipeline
+
+        if self.pipeline == "standard":
+            pipeline_options = self._create_standard_pipeline_options()
+            pdf_format_option = PdfFormatOption(pipeline_options=pipeline_options)
+        elif self.pipeline == "vlm":
+            pipeline_options = self._create_vlm_pipeline_options()
+            pdf_format_option = PdfFormatOption(pipeline_cls=VlmPipeline, pipeline_options=pipeline_options)
+
+        format_options: dict[InputFormat, FormatOption] = {
+            InputFormat.PDF: pdf_format_option,
+            InputFormat.IMAGE: pdf_format_option,
+        }
+
+        return DocumentConverter(format_options=format_options)
     def process_files(self, file_list: list[BaseFileComponent.BaseFile]) -> list[BaseFileComponent.BaseFile]:
         from docling.datamodel.base_models import ConversionStatus
-
-        def _get_converter() -> DocumentConverter:
-            # Remove the nested function and complex logic
         file_paths = [file.path for file in file_list if file.path]
         if not file_paths:
             self.log("No files to process.")
             return file_list
-        converter = _get_converter()
+        converter = self._get_converter()
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 72-72: Too many local variables (17/15)
(R0914)
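One detail in the suggested `_get_converter` is worth noting: the if/elif chain has no else branch, so any pipeline value other than "standard" or "vlm" would leave `pdf_format_option` unassigned. A lookup table with an explicit error is one way to close that gap while keeping the per-pipeline setup in small functions. The sketch below uses stand-ins only; `PipelineConfig`, `BUILDERS`, and `build_options` are hypothetical names and do not touch the real Docling classes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineConfig:
    """Stand-in for Docling's pipeline option objects (hypothetical)."""

    kind: str
    do_ocr: bool = False
    ocr_engine: str = ""


def standard_options(ocr_engine: str) -> PipelineConfig:
    # OCR is enabled only when an engine name was supplied.
    return PipelineConfig(kind="standard", do_ocr=ocr_engine != "", ocr_engine=ocr_engine)


def vlm_options(ocr_engine: str) -> PipelineConfig:
    # In this sketch the VLM pipeline ignores the OCR engine argument.
    return PipelineConfig(kind="vlm")


# One builder per supported pipeline name.
BUILDERS: dict[str, Callable[[str], PipelineConfig]] = {
    "standard": standard_options,
    "vlm": vlm_options,
}


def build_options(pipeline: str, ocr_engine: str = "") -> PipelineConfig:
    """Look up the builder for a pipeline and fail loudly on unknown names."""
    try:
        builder = BUILDERS[pipeline]
    except KeyError:
        msg = f"Unknown pipeline '{pipeline}'; expected one of {sorted(BUILDERS)}"
        raise ValueError(msg) from None
    return builder(ocr_engine)
```

Adding a further pipeline then means adding one function and one dictionary entry, without growing the number of branches or local variables in the calling method.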
🧹 Nitpick comments (1)
src/backend/base/langflow/components/docling/docling_inline.py (1)
47-66: LGTM! Well-structured input configuration.
The inputs provide appropriate configuration options for Docling pipelines and OCR engines. The TODO comment indicates good planning for future extensibility.
Would you like me to help implement additional Docling options mentioned in the TODO comment?
📒 Files selected for processing (1)
src/backend/base/langflow/components/docling/docling_inline.py (1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
src/backend/base/langflow/components/docling/docling_inline.py
[refactor] 72-72: Too many local variables (17/15)
(R0914)
🔇 Additional comments (3)
src/backend/base/langflow/components/docling/docling_inline.py (3)
1-13: LGTM! Clean component definition with proper metadata.
The imports are appropriate and the class definition follows good practices with comprehensive metadata including documentation URL and proper inheritance.
15-45: LGTM! Comprehensive file format support.
The VALID_EXTENSIONS list correctly covers a wide range of document formats supported by Docling, and the duplicate "png" issue from previous reviews has been resolved.
68-70: LGTM! Appropriate output configuration.
Simple and clean output configuration that properly extends the base component outputs.
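The duplicate "png" entry mentioned in the VALID_EXTENSIONS comment is the kind of slip a small check can catch before review. A minimal sketch, assuming the extensions are plain strings; `find_duplicates` is a hypothetical helper and is not part of the PR.

```python
from collections import Counter


def find_duplicates(extensions: list[str]) -> list[str]:
    """Return extensions that occur more than once, ignoring case and a leading dot."""
    counts = Counter(ext.lower().lstrip(".") for ext in extensions)
    return sorted(ext for ext, n in counts.items() if n > 1)
```

A unit test could assert that `find_duplicates(VALID_EXTENSIONS)` returns an empty list.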
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
@ogabrielluiz @rodrigosnader the PR is updated and ready for review from our side.
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
…w into add-docling-component
* initial DoclingComponent
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Correct Docling icon style properties.
  Signed-off-by: DKL <dkl@zurich.ibm.com>
* add file_path
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add load from json and export to various formats
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add chunking component
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Update src/backend/base/langflow/components/docling/docling_inline.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* add Docling Serve component
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* apply some suggestions
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* Update src/backend/base/langflow/components/docling/_utils.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Update src/backend/base/langflow/components/docling/docling_remote.py
  Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* add check for DoclingDocument in list
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* fix import
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add maximum poll timeout and better checks for the retry logic
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* add updated starter_projects
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* refactor _get_converter
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* return only DataFrame
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* remove LoadDoclingDocument
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* more options in the chunk component
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* move docling imports
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
* [autofix.ci] apply automated fixes
* move utils to langflow.base
  Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
---------
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: DKL <dkl@zurich.ibm.com>
Co-authored-by: DKL <dkl@zurich.ibm.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>