fix: Knowledge base component refactor #9543
Walkthrough

Removes the KBIngestion/KBRetrieval re-exports from components.data and introduces a knowledge_bases package with lazy-loaded exports. Renames and relocates the ingestion/retrieval components; updates their inputs, methods, and UI metadata; and adjusts embedding/metadata retrieval behavior. Also updates the starter project JSONs, tests, and a frontend sidebar category.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Dev as Importer
    participant KBPkg as components.knowledge_bases.__init__
    participant Mod as ingestion/retrieval submodules
    Dev->>KBPkg: from ...knowledge_bases import KnowledgeIngestionComponent
    alt attribute not yet loaded
        KBPkg->>KBPkg: __getattr__("KnowledgeIngestionComponent")
        KBPkg->>Mod: import_mod("KnowledgeIngestionComponent", "ingestion", parent)
        Mod-->>KBPkg: class KnowledgeIngestionComponent
        KBPkg->>KBPkg: cache in globals
    end
    KBPkg-->>Dev: KnowledgeIngestionComponent
```
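The lazy-attribute step in this diagram follows the PEP 562 module-level `__getattr__` pattern. A self-contained sketch of the idea — `make_lazy_module` and its arguments are illustrative stand-ins, not Langflow's actual `import_mod` API:

```python
import importlib
import sys
import types


def make_lazy_module(name: str, exports: dict[str, str]) -> types.ModuleType:
    """Create a module whose exports are imported on first attribute access."""
    mod = types.ModuleType(name)
    mod.__all__ = list(exports)

    def lazy_getattr(attr: str):
        if attr not in exports:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(exports[attr]), attr)
        setattr(mod, attr, value)  # cache: later lookups bypass __getattr__
        return value

    mod.__getattr__ = lazy_getattr  # module-level __getattr__, per PEP 562
    sys.modules[name] = mod
    return mod


# Demo with a stdlib module standing in for the component submodules:
lazy = make_lazy_module("demo_lazy", {"sha256": "hashlib"})
```

The payoff is that importing the package does not import heavy submodules until a component class is actually requested, after which the cached attribute makes subsequent lookups free.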
```mermaid
sequenceDiagram
    autonumber
    participant UI as Flow/UI
    participant Ingest as KnowledgeIngestionComponent
    participant Utils as knowledge_base_utils
    participant Store as Vector Store
    UI->>Ingest: run(input_df=Data|DataFrame|List[Data], kb, config)
    Ingest->>Ingest: normalize input_df to DataFrame
    Ingest->>Utils: get_knowledge_bases(), settings, user
    Ingest->>Store: ingest DataFrame into KB collection
    Note over Ingest: On error: raise RuntimeError
    Store-->>Ingest: ack
    Ingest-->>UI: status/result DataFrame
```
```mermaid
sequenceDiagram
    autonumber
    participant UI as Flow/UI
    participant Retr as KnowledgeRetrievalComponent
    participant Utils as knowledge_base_utils
    participant VS as Vector Store
    UI->>Retr: retrieve_data(kb, search_query, include_metadata, include_embeddings)
    Retr->>Utils: load KB config, embeddings provider
    Retr->>VS: query (similarity_search[_with_score])
    alt include_metadata or include_embeddings
        Retr->>VS: collection.get(ids, include metadatas[, embeddings])
        VS-->>Retr: metadatas[, embeddings]
    end
    Retr->>Retr: build rows: content + optional metadata + optional _embeddings
    Retr-->>UI: DataFrame "Results"
```
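The row-building step in the last diagram can be sketched roughly as follows. This is a hedged illustration only: `Document` here is a minimal stand-in dataclass (the real component uses LangChain's `Document`), and `build_rows` plus its field names are assumptions drawn from the review comments below:

```python
from dataclasses import dataclass, field


@dataclass
class Document:  # stand-in for langchain's Document
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_rows(results, *, include_metadata=False, id_to_embedding=None):
    """Turn (Document, distance) pairs into flat row dicts for a DataFrame."""
    rows = []
    for doc, distance in results:
        row = {"content": doc.page_content, "_score": -1 * distance}
        if include_metadata:
            row.update(doc.metadata)  # flatten stored metadata into the row
        if id_to_embedding is not None:
            row["_embeddings"] = id_to_embedding.get(doc.metadata.get("_id"))
        rows.append(row)
    return rows
```

Keeping this step pure (pairs in, dicts out) makes the optional-metadata and optional-embeddings branches easy to unit test in isolation.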
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
src/backend/base/langflow/components/knowledge_bases/ingestion.py (4)
526-553: Potential UnboundLocalError: embedding_model/api_key used before assignment.

If embedding_metadata.json does not exist (or lacks keys), embedding_model and/or api_key are undefined when passed to _create_vector_store. This violates the PR objective of "raise on failures" and will surface as an unhelpful UnboundLocalError.
Apply this diff to initialize, validate, and fail fast with a clear error:
```diff
@@ async def build_kb_info(self) -> Data:
         # Read the embedding info from the knowledge base folder
         kb_path = await self._kb_path()
@@
         metadata_path = kb_path / "embedding_metadata.json"
+        embedding_model: str | None = None
+        api_key: str | None = None
@@
         if metadata_path.exists():
             settings_service = get_settings_service()
             metadata = json.loads(metadata_path.read_text())
             embedding_model = metadata.get("embedding_model")
             try:
                 api_key = decrypt_api_key(metadata["api_key"], settings_service)
             except (InvalidToken, TypeError, ValueError) as e:
                 logger.error(f"Could not decrypt API key. Please provide it manually. Error: {e}")
@@
         if self.api_key:
             api_key = self.api_key

         self._save_embedding_metadata(
             kb_path=kb_path,
             embedding_model=embedding_model,
             api_key=api_key,
         )
+
+        # Validate presence of embedding model (and api_key if required)
+        if not embedding_model:
+            raise ValueError(
+                "Embedding metadata not found for knowledge base. "
+                "Create the KB via the 'Create new knowledge' dialog "
+                "or ensure embedding_metadata.json exists."
+            )
```
342-381: Swallowing exceptions in _create_vector_store hides ingestion failures.

The PR states ingestion should raise on failures. _create_vector_store logs errors but does not propagate them, so build_kb_info may report success with zero documents.
Propagate exceptions after logging:
```diff
@@ async def _create_vector_store(self, df_source: pd.DataFrame, config_list: list[dict[str, Any]], embedding_model: str, api_key: str) -> None:
         try:
@@
         except (OSError, ValueError, RuntimeError) as e:
             self.log(f"Error creating vector store: {e}")
+            raise
```
416-454: Confusing/incorrect use of content vs identifier columns when building text and IDs.

The comment says "Build content text from identifier columns," but the code uses content_cols. Then page_content is reassigned from identifier_cols and reused for hashing. This conflates concerns and is error-prone. Suggested restructure:
- Build text_content from content_cols for embeddings.
- Build id_source from identifier_cols if present; else fall back to text_content.
- Hash id_source for _id.
```diff
@@ async def _convert_df_to_data_objects(self, df_source: pd.DataFrame, config_list: list[dict[str, Any]]) -> list[Data]:
-        for _, row in df_source.iterrows():
-            # Build content text from identifier columns using list comprehension
-            identifier_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])]
-
-            # Join all parts into a single string
-            page_content = " ".join(identifier_parts)
-
-            # Build metadata from NON-vectorized columns only (simple key-value pairs)
-            data_dict = {
-                "text": page_content,  # Main content for vectorization
-            }
-
-            # Add identifier columns if they exist
-            if identifier_cols:
-                identifier_parts = [str(row[col]) for col in identifier_cols if col in row and pd.notna(row[col])]
-                page_content = " ".join(identifier_parts)
+        for _, row in df_source.iterrows():
+            # Build the text content from content (vectorized) columns
+            content_parts = [str(row[col]) for col in content_cols if col in row and pd.notna(row[col])]
+            text_content = " ".join(content_parts)
+
+            # Determine identifier source (prefer identifier cols; else fall back to text_content)
+            if identifier_cols:
+                id_parts = [str(row[col]) for col in identifier_cols if col in row and pd.notna(row[col])]
+                id_source = " ".join(id_parts)
+            else:
+                id_source = text_content
+
+            # Build metadata from NON-vectorized columns only (simple key-value pairs)
+            data_dict = {
+                "text": text_content,  # Main content to embed
+            }
@@
-            # Hash the page_content for unique ID
-            page_content_hash = hashlib.sha256(page_content.encode()).hexdigest()
+            # Hash the identifier source for unique ID
+            page_content_hash = hashlib.sha256(id_source.encode()).hexdigest()
             data_dict["_id"] = page_content_hash
```
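The dedup-key idea above, in isolation: hash a stable identifier source when one exists, otherwise fall back to the content itself. `row_id` is a hypothetical helper name used only to illustrate the suggestion:

```python
import hashlib
from typing import Optional


def row_id(text_content: str, identifier: Optional[str] = None) -> str:
    """Stable dedup key: hash the identifier source when present, else the content."""
    source = identifier if identifier else text_content
    return hashlib.sha256(source.encode()).hexdigest()
```

With this split, two rows that share an identifier (say, the same URL) collapse to one ID even when their content text differs, which is the behavior the duplicate-prevention test below relies on.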
635-642: Catching the wrong timeout exception during embedding validation.

On Python < 3.11, asyncio.wait_for raises asyncio.TimeoutError, which is distinct from the built-in TimeoutError, so the current except block won't trigger there. (Since Python 3.11 the two names are aliases.) Catching asyncio.TimeoutError is correct on every supported version.
Fix the except clause:
```diff
-        except TimeoutError as e:
+        except asyncio.TimeoutError as e:
             msg = "Embedding validation timed out. Please verify network connectivity and key."
             raise ValueError(msg) from e
```

src/backend/base/langflow/components/knowledge_bases/retrieval.py (1)
239-257: Bug: Constructing Data with arbitrary kwargs will likely fail validation.

Data expects its payload under the data field (see DataFrame.to_data_list usage across the codebase). Instantiating as Data(**kwargs) risks Pydantic validation errors and silent schema drift. Wrap the row dict in Data(data=...).
Apply this diff:
```diff
-            data_list.append(Data(**kwargs))
+            data_list.append(Data(data=kwargs))
```

src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (1)
853-869: Potential UnboundLocalError and weak defaults in build_kb_info.

Inside the embedded component code (template.code.value), build_kb_info uses embedding_model and api_key variables that may be unset if metadata_path doesn't exist or decrypt fails. That can throw NameError/UnboundLocalError and makes error handling brittle.
Patch the method to initialize variables and fail loudly when metadata is missing; also normalize SecretStr to str:
```diff
@@
     async def build_kb_info(self) -> Data:
@@
         # Read the embedding info from the knowledge base folder
         kb_path = await self._kb_path()
         if not kb_path:
             msg = "Knowledge base path is not set. Please create a new knowledge base first."
             raise ValueError(msg)

         metadata_path = kb_path / "embedding_metadata.json"

-        # If the API key is not provided, try to read it from the metadata file
-        if metadata_path.exists():
-            settings_service = get_settings_service()
-            metadata = json.loads(metadata_path.read_text())
-            embedding_model = metadata.get("embedding_model")
-            try:
-                api_key = decrypt_api_key(metadata["api_key"], settings_service)
-            except (InvalidToken, TypeError, ValueError) as e:
-                logger.error(f"Could not decrypt API key. Please provide it manually. Error: {e}")
+        # Initialize defaults
+        embedding_model: str | None = None
+        api_key: str | None = None
+
+        # If the API key is not provided, try to read it from the metadata file
+        if metadata_path.exists():
+            settings_service = get_settings_service()
+            metadata = json.loads(metadata_path.read_text())
+            embedding_model = metadata.get("embedding_model")
+            try:
+                enc = metadata.get("api_key")
+                api_key = decrypt_api_key(enc, settings_service) if enc else None
+            except (InvalidToken, TypeError, ValueError) as e:
+                logger.error(f"Could not decrypt API key. Please provide it manually. Error: {e}")
+                api_key = None
+        else:
+            msg = "Embedding metadata not found. Create the knowledge base first from the dropdown dialog."
+            raise ValueError(msg)

         # Check if a custom API key was provided, update metadata if so
-        if self.api_key:
-            api_key = self.api_key
+        if self.api_key:
+            api_key = self.api_key.get_secret_value() if hasattr(self.api_key, "get_secret_value") else str(self.api_key)

         self._save_embedding_metadata(
             kb_path=kb_path,
             embedding_model=embedding_model,
             api_key=api_key,
         )

+        if not embedding_model:
+            raise ValueError("Embedding model unavailable. Metadata is missing or invalid.")
+
         # Create vector store following Local DB component pattern
         await self._create_vector_store(df_source, config_list, embedding_model=embedding_model, api_key=api_key)
```
🧹 Nitpick comments (19)
src/backend/tests/unit/base/data/test_kb_utils.py (4)
5-5: Rename the test class to reflect the new module name.

Keeps naming consistent with the new package.
```diff
-class TestKBUtils:
+class TestKnowledgeBaseUtils:
```
45-47: Strengthen the assertion to match the comment (tokenization doesn't stem "cats" → "cat").

As written, `assert scores[2] >= 0.0` always passes and doesn't validate the stated behavior.

```diff
-        # Third document contains both "cats" and "dogs", but case-insensitive matching should work
-        # Note: "cats" != "cat" exactly, so this tests the term matching behavior
-        assert scores[2] >= 0.0
+        # Third document contains "cats" and "dogs" (plural), which should NOT match "cat"/"dog"
+        # given simple whitespace tokenization without stemming.
+        assert scores[2] == 0.0
```
178-181: Make "different scores" assertions robust to edge cases and float quirks.

Direct list inequality can be brittle if parameters incidentally yield identical vectors; check for any meaningful element-wise difference instead.

```diff
-    # Scores should be different with different parameters
-    assert scores_default != scores_k1
-    assert scores_default != scores_b
+    # Scores should be different with different parameters (tolerate float quirks)
+    assert any(abs(a - b) > 1e-12 for a, b in zip(scores_default, scores_k1))
+    assert any(abs(a - b) > 1e-12 for a, b in zip(scores_default, scores_b))
```
1-459: (Optional) Consider relocating/renaming this test to mirror the new package path.

For long-term maintainability, consider moving the file to:
- src/backend/tests/unit/base/knowledge_bases/test_knowledge_base_utils.py
This keeps tests aligned with the refactor’s package structure.
If you’d like, I can draft the minimal PR-wide file move plan (including updating any import paths) to keep everything consistent.
src/frontend/src/utils/styleUtils.ts (1)
214-218: Add icon/color mappings for the new 'knowledge_bases' category to keep the UI consistent.

You added the category to SIDEBAR_CATEGORIES, but there are no corresponding entries in categoryIcons, nodeIconToDisplayIconMap, nodeColors, or nodeColorsName. Some UI surfaces use those maps; missing keys can cause fallback colors or blank icons.
Apply this diff to add consistent mappings:
```diff
--- a/src/frontend/src/utils/styleUtils.ts
+++ b/src/frontend/src/utils/styleUtils.ts
@@ export const nodeColors: { [char: string]: string } = {
@@
   Tool: "#00fbfc",
+  knowledge_bases: "#7c3aed", // violet-ish, distinct from vectorstores/retrievers
 };
@@ export const nodeColorsName: { [char: string]: string } = {
@@
   DataFrame: "pink",
+  knowledge_bases: "violet",
 };
@@ export const categoryIcons: Record<string, string> = {
@@
   toolkits: "Package2",
   tools: "Hammer",
+  knowledge_bases: "Package2",
   custom: "Edit",
   custom_components: "GradientInfinity",
 };
@@ export const nodeIconToDisplayIconMap: Record<string, string> = {
@@
   textsplitters: "Scissors",
   toolkits: "Package2",
   tools: "Hammer",
+  knowledge_bases: "Package2",
   custom_components: "GradientInfinity",
```

src/backend/base/langflow/components/knowledge_bases/ingestion.py (1)
285-290: Prefer an atomic write for embedding_metadata.json to avoid partial files.

A crash between write_text and flush could leave truncated JSON. Use a temp file + os.replace.
I can provide a small helper to write atomically using NamedTemporaryFile if you want it integrated here.
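For reference, one possible shape for such a helper — `write_json_atomic` is a hypothetical name, and fsync-before-replace is the conservative choice; the review only commits to offering something like this:

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the target directory, then atomically replace.

    A crash mid-write leaves the previous file intact instead of truncated JSON.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(payload, f)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp_name, path)  # atomic on both POSIX and Windows
    except BaseException:
        os.unlink(tmp_name)
        raise
```

Creating the temp file in the same directory as the target matters: os.replace is only atomic within a single filesystem.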
src/backend/tests/unit/components/knowledge_bases/test_ingestion.py (3)
246-295: Good duplicate-prevention test; consider adding a variant with allow_duplicates=True.

Verifies skip on hash collision. Add a complementary test asserting both rows are returned when allow_duplicates=True.
If helpful, I can draft the test body.
103-108: Backward-compatibility mapping is missing.

Guidelines ask for file_names_mapping to keep version compatibility. Since this is a rename/move from KBIngestionComponent, provide a mapping from the old component file to the new one.
If you share the exact legacy path/class (e.g., langflow/components/data/kb_ingestion.py::KBIngestionComponent), I can propose the precise mapping snippet.
312-333: Successful-path test is solid; add failure-path coverage for missing metadata.

Consider a test where embedding_metadata.json is absent: build_kb_info should now raise ValueError (after the suggested fix), and the test should assert the error message.
Happy to provide the assertion once the error text is finalized.
src/backend/base/langflow/components/knowledge_bases/__init__.py (1)
19-31: Lazy importer is clean; consider minor introspection polish.

Return a sorted list from __dir__ for deterministic introspection order. No functional change.
```diff
-def __dir__() -> list[str]:
-    return list(__all__)
+def __dir__() -> list[str]:
+    return sorted(__all__)
```
90-95: Backward-compatibility mapping likely needed here, too.

As with the ingestion tests, add file_names_mapping to cover the rename from KBRetrievalComponent.
Share the old path/class and I’ll draft the mapping block.
369-385: Optional: Verify include_embeddings behavior end-to-end.

You can patch the underlying Chroma client to return embeddings and assert that results include an _embeddings key when include_embeddings=True and omit it when False.
I can draft a focused unit test that stubs Chroma._client.get_collection(...).get(...) to return deterministic vectors.
src/backend/base/langflow/components/knowledge_bases/retrieval.py (4)
71-77: Mismatch between "Include Embeddings" help text and behavior.

The input's info states embeddings are "Only applicable if 'Include Metadata' is enabled," but the code adds _embeddings regardless of include_metadata. Either enforce the dependency or clarify the description.
Two options:
- Update help text (simple, no behavior change):
```diff
-            info="Whether to include embeddings in the output. Only applicable if 'Include Metadata' is enabled.",
+            info="Whether to include embeddings in the output (adds an _embeddings field when available).",
```
- Or gate by include_metadata:
```diff
-        if self.include_embeddings:
+        if self.include_embeddings and self.include_metadata:
             kwargs["_embeddings"] = id_to_embedding.get(doc[0].metadata.get("_id"))
```

I recommend the first option: it keeps current flexibility while avoiding confusion.
Also applies to: 245-251
231-238: Avoid reliance on private chroma._client.

Accessing chroma._client is private API and brittle across langchain_chroma/chromadb versions. Prefer a public accessor if available, or guard with feature detection and clear error messages.
Example defensive pattern:
```python
# Try safer access first
collection_get = getattr(getattr(chroma, "_collection", None), "get", None)
if callable(collection_get):
    embeddings_result = chroma._collection.get(
        where={"_id": {"$in": doc_ids}}, include=["metadatas", "embeddings"]
    )
else:
    # Fallback to the client, with an explicit error when that is unavailable too
    if not hasattr(chroma, "_client"):
        raise RuntimeError("Underlying Chroma client not accessible to fetch embeddings safely.")
    collection = chroma._client.get_collection(name=self.knowledge_base)
    embeddings_result = collection.get(
        where={"_id": {"$in": doc_ids}}, include=["metadatas", "embeddings"]
    )
```

If the wrapper exposes a public get/collection accessor in your pinned version, prefer that instead.
206-222: Empty-query path uses similarity_search("").

When no search_query is provided, similarity_search("") may yield implementation-defined results. If the intended behavior is "top-k arbitrary/most recent rows," consider fetching via collection.get(limit=k) or a deterministic fallback, rather than embedding an empty string.
Sketch:
```python
if self.search_query:
    ...
else:
    # Deterministic fallback without an embeddings call
    collection = chroma._client.get_collection(name=self.knowledge_base)
    fetched = collection.get(limit=self.top_k, include=["documents", "metadatas"])
    results = [
        (Document(page_content=doc, metadata=meta), 0.0)
        for doc, meta in zip(fetched["documents"], fetched["metadatas"])
    ]
```

This also avoids an unnecessary embedding of "".
245-246: Score semantics: expose distance directly or a normalized similarity.

You invert the distance via -1 * doc[1], producing negative scores. Consider exposing both distance (lower is better) and a normalized similarity in [0, 1], or just the raw distance, to avoid confusion.
Example:
```diff
-            kwargs["_score"] = -1 * doc[1]
+            kwargs["distance"] = float(doc[1])
+            # Optional: bounded similarity
+            kwargs["similarity"] = 1.0 / (1.0 + float(doc[1]))
```
656-673: Clarify the "Include Embeddings" description in the template.

The description repeats the "Only applicable if 'Include Metadata' is enabled" constraint, which the backend does not enforce. To avoid UX confusion, update the info text here too.
Apply this diff:
```diff
-    "info": "Whether to include embeddings in the output. Only applicable if 'Include Metadata' is enabled.",
+    "info": "Whether to include embeddings in the output (adds an _embeddings field when available).",
```
853-869: Catching the wrong timeout exception in update_build_config.

On Python < 3.11, asyncio.wait_for raises asyncio.TimeoutError, which is not the built-in TimeoutError, so the current except can miss the timeout path there. (On 3.11+ the names are aliases, so it happens to work.) Catching asyncio.TimeoutError is correct on all versions.
Inside update_build_config, adjust:
```diff
-        except TimeoutError as e:
+        except asyncio.TimeoutError as e:
             msg = "Embedding validation timed out. Please verify network connectivity and key."
             raise ValueError(msg) from e
```
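A minimal probe illustrating the version nuance behind both timeout comments: asyncio.wait_for's timeout is caught by asyncio.TimeoutError on all versions, and on Python 3.11+ that name is the built-in TimeoutError itself:

```python
import asyncio
import sys


async def probe() -> str:
    try:
        await asyncio.wait_for(asyncio.sleep(10), timeout=0.01)
    except asyncio.TimeoutError:  # also matches built-in TimeoutError on 3.11+
        return "caught"
    return "no timeout"


print(asyncio.run(probe()))  # prints "caught"
if sys.version_info >= (3, 11):
    # Since 3.11 the two exception names are the same object.
    assert asyncio.TimeoutError is TimeoutError
```

So catching asyncio.TimeoutError is the safe spelling if the codebase still supports Python 3.10.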
933-939: Default column_config sets both "identifier" and "vectorize" to true.

This is valid (the ID is used for the dedup hash, the text for vectors); just calling it out. If users expect a distinct identifier (e.g., a URL), consider documenting an example row with a non-vectorized identifier column, but no change is required.
📜 Review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📒 Files selected for processing (10)
- src/backend/base/langflow/components/data/__init__.py (0 hunks)
- src/backend/base/langflow/components/knowledge_bases/__init__.py (1 hunks)
- src/backend/base/langflow/components/knowledge_bases/ingestion.py (5 hunks)
- src/backend/base/langflow/components/knowledge_bases/retrieval.py (7 hunks)
- src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (15 hunks)
- src/backend/base/langflow/initial_setup/starter_projects/Knowledge Retrieval.json (13 hunks)
- src/backend/tests/unit/base/data/test_kb_utils.py (1 hunks)
- src/backend/tests/unit/components/knowledge_bases/test_ingestion.py (6 hunks)
- src/backend/tests/unit/components/knowledge_bases/test_retrieval.py (8 hunks)
- src/frontend/src/utils/styleUtils.ts (1 hunks)
💤 Files with no reviewable changes (1)
- src/backend/base/langflow/components/data/__init__.py
🧬 Code graph analysis (6)
src/backend/tests/unit/base/data/test_kb_utils.py (1)
src/backend/base/langflow/base/knowledge_bases/knowledge_base_utils.py (2)
compute_bm25(53-109)compute_tfidf(10-50)
src/backend/base/langflow/components/knowledge_bases/ingestion.py (4)
  src/backend/base/langflow/base/knowledge_bases/knowledge_base_utils.py (1)
    get_knowledge_bases (112-137)
  src/backend/base/langflow/inputs/inputs.py (2)
    HandleInput (76-87)
    TableInput (38-73)
  src/backend/base/langflow/schema/data.py (1)
    Data (23-285)
  src/backend/base/langflow/schema/dataframe.py (1)
    DataFrame (11-206)
src/backend/tests/unit/components/knowledge_bases/test_retrieval.py (4)
  src/backend/base/langflow/base/knowledge_bases/knowledge_base_utils.py (1)
    get_knowledge_bases (112-137)
  src/backend/base/langflow/components/knowledge_bases/retrieval.py (2)
    KnowledgeRetrievalComponent (27-256)
    retrieve_data (173-256)
  src/backend/tests/base.py (1)
    ComponentTestBaseWithoutClient (163-164)
  src/backend/tests/unit/components/knowledge_bases/test_ingestion.py (5)
    component_class (16-18)
    mock_knowledge_base_path (21-24)
    mock_user_data (32-36)
    default_kwargs (62-101)
    mock_user_id (57-59)
src/backend/base/langflow/components/knowledge_bases/retrieval.py (2)
  src/backend/base/langflow/base/knowledge_bases/knowledge_base_utils.py (1)
    get_knowledge_bases (112-137)
  src/backend/base/langflow/schema/dataframe.py (1)
    DataFrame (11-206)
src/backend/tests/unit/components/knowledge_bases/test_ingestion.py (2)
  src/backend/base/langflow/base/knowledge_bases/knowledge_base_utils.py (1)
    get_knowledge_bases (112-137)
  src/backend/base/langflow/components/knowledge_bases/ingestion.py (1)
    KnowledgeIngestionComponent (41-668)
src/backend/base/langflow/components/knowledge_bases/__init__.py (3)
  src/backend/base/langflow/components/_importing.py (1)
    import_mod (8-37)
  src/backend/base/langflow/components/knowledge_bases/ingestion.py (1)
    KnowledgeIngestionComponent (41-668)
  src/backend/base/langflow/components/knowledge_bases/retrieval.py (1)
    KnowledgeRetrievalComponent (27-256)
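The lazy-export pattern listed above (import_mod driven by module-level __getattr__ and __dir__, per PEP 562) can be sketched generically; the names and target modules below are illustrative stand-ins, not the actual langflow mapping:

```python
import importlib
import sys
import types

# Illustrative name -> owning-module mapping; the real package maps component
# classes to its "ingestion" / "retrieval" submodules via import_mod.
_dynamic_imports = {"sqrt": "math", "dumps": "json"}

pkg = types.ModuleType("lazy_demo")


def _lazy_getattr(name: str):
    if name not in _dynamic_imports:
        raise AttributeError(f"module {pkg.__name__!r} has no attribute {name!r}")
    # Import the owning module only on first access, then cache the attribute
    # in the package namespace so later lookups bypass __getattr__ entirely.
    module = importlib.import_module(_dynamic_imports[name])
    value = getattr(module, name)
    setattr(pkg, name, value)
    return value


pkg.__getattr__ = _lazy_getattr                 # PEP 562: called for missing attributes
pkg.__dir__ = lambda: sorted(_dynamic_imports)  # keeps dir() / tab-completion honest
sys.modules[pkg.__name__] = pkg
```

First access of `pkg.sqrt` triggers the import; subsequent accesses hit the cached attribute, which is what keeps package import time low.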
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test-starter-projects
🔇 Additional comments (15)
src/backend/tests/unit/base/data/test_kb_utils.py (1)
2-2: Import path refactor verified – ready to merge

All checks passed:
- No references to the old langflow.base.data.kb_utils path were found.
- The new module knowledge_base_utils.py exists under src/backend/base/langflow/base/knowledge_bases/.
- Both compute_tfidf and compute_bm25 are correctly defined in that module.

Great work moving these utilities into the new knowledge_bases package.

src/backend/tests/unit/components/knowledge_bases/test_ingestion.py (2)

212-223: Patch targets for settings/key encryption look correct. LGTM.

Mocks are scoped to the module under test and won’t leak. Good practice.
109-120: Validate behavior for list inputs (Data and DataFrame).

Given the component now accepts HandleInput with is_list=True for Data/DataFrame, add tests for:
- input_df: list[Data]
- input_df: list[pd.DataFrame]
- mixed list [Data, DataFrame] (if you adopt the defensive handling)
I can generate these unit tests with fixtures producing the data and asserting the concatenated row count.
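A minimal sketch of what those tests could assert, using a hypothetical FakeData stand-in and a normalize helper that mirrors the proposed concatenation behavior (real tests would instead use the component fixtures from ComponentTestBaseWithoutClient):

```python
import pandas as pd


class FakeData:
    """Minimal stand-in for langflow's Data object (illustrative only)."""

    def __init__(self, **fields):
        self.fields = fields

    def to_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame([self.fields])


def normalize_input_df(value) -> pd.DataFrame:
    """Mirror of the proposed normalization: Data | DataFrame | list of either -> DataFrame."""
    if isinstance(value, FakeData):
        return value.to_dataframe()
    if isinstance(value, pd.DataFrame):
        return value
    if isinstance(value, list):
        frames = [normalize_input_df(item) for item in value]
        return pd.concat(frames, ignore_index=True)
    raise TypeError(f"Unsupported input type: {type(value).__name__}")


def test_list_of_data():
    df = normalize_input_df([FakeData(text="a"), FakeData(text="b")])
    assert len(df) == 2


def test_list_of_dataframes():
    df = normalize_input_df([pd.DataFrame({"text": ["a"]}), pd.DataFrame({"text": ["b", "c"]})])
    assert len(df) == 3


def test_mixed_list():
    df = normalize_input_df([FakeData(text="a"), pd.DataFrame({"text": ["b"]})])
    assert list(df["text"]) == ["a", "b"]
```

Each test asserts the concatenated row count, which is the failure mode the review flags for list[DataFrame] inputs.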
src/backend/tests/unit/components/knowledge_bases/test_retrieval.py (3)
96-109: KB listing tests are accurate and mirror ingestion-side expectations. LGTM.

Covers hidden dirs and multiple KBs correctly.

302-329: Great override test for user-supplied API key.

Using a real SecretStr ensures parity with runtime behavior.

330-342: Test for missing metadata is valuable; mirror the ingestion-side failure.

Once ingestion raises on missing metadata, both components will behave consistently. No action here; just noting the alignment.

src/backend/base/langflow/components/knowledge_bases/retrieval.py (2)

27-33: Renaming and output alignment look good

Component metadata (display_name/name/icon), output name/method (retrieve_data), and dynamic KB options via update_build_config align with the PR goals and frontend mapping.
Also applies to: 80-87, 89-101
1-25: Formatting/linting verification required

The automated check could not run in this environment because the make command was not found. Please verify locally that this module passes the repository’s formatting and linting standards by running:

make format_backend
make lint

src/backend/base/langflow/initial_setup/starter_projects/Knowledge Retrieval.json (4)
26-31: Edge wiring: KnowledgeRetrieval → ChatOutput is consistent

The source handle exposes DataFrame; the target ChatOutput accepts Data/DataFrame/Message. Connection metadata and ids look consistent.
603-617: Output port/method rename correctly reflected

Output port name retrieve_data and method retrieve_data match the backend component. Good alignment for lazy exports and UI.
715-737: Search query tool_mode enabled — good call

Marking search_query as tool_mode enables smoother interop with tool-calling models. No action needed.
535-561: Component identity updates are consistent

Module path, icon (“download”), display/title, and code_hash updates reflect the rename/move. No issues spotted.
src/backend/base/langflow/initial_setup/starter_projects/Knowledge Ingestion.json (3)
713-777: Rename and module relocation look correct

Node type/name/icon updated to Knowledge Ingestion with module langflow.components.knowledge_bases.ingestion.KnowledgeIngestionComponent. Starter graph metadata consistent.
940-960: Input flexibility: HandleInput for input_df (Data | DataFrame, list) is aligned with PR goals

This improves usability by accepting Data directly and lists. Matches the described auto-conversion behavior.
1-1122: No outdated references detected; please run formatting and lint manually

- A ripgrep search for KBIngestionComponent, KBRetrievalComponent, get_chroma_kb_data, chroma_kb_data, components.data.kb_retrieval, components.data.kb_ingestion, or kb_utils returned no matches.
- The make format_backend and make lint commands weren’t available in this environment; please run your project’s formatting and lint steps (e.g. via your Makefile, npm scripts, or CI commands) to ensure code style consistency.
```diff
 HandleInput(
     name="input_df",
-    display_name="Data",
-    info="Table with all original columns (already chunked / processed).",
+    display_name="Input",
+    info=(
+        "Table with all original columns (already chunked / processed). "
+        "Accepts Data or DataFrame. If Data is provided, it is converted to a DataFrame automatically."
+    ),
+    input_types=["Data", "DataFrame"],
+    is_list=True,
     required=True,
```
Input handling misses list[DataFrame] support; can crash downstream.
The HandleInput advertises input_types ["Data", "DataFrame"] with is_list=True, but build_kb_info only handles Data, list[Data], or single DataFrame. Passing list[DataFrame] leaves df_source as a Python list, breaking validation and ingestion.
Apply this diff to support list[DataFrame] (and mixed Data/DataFrame lists defensively):
```diff
@@ async def build_kb_info(self) -> Data:
-    if isinstance(self.input_df, Data):
-        df_source: pd.DataFrame = self.input_df.to_dataframe()
-    elif isinstance(self.input_df, list) and all(isinstance(item, Data) for item in self.input_df):
-        # If input_df is a list of Data objects, concatenate them into a single DataFrame
-        df_source: pd.DataFrame = pd.concat([item.to_dataframe() for item in self.input_df], ignore_index=True)
-    else:
-        df_source: pd.DataFrame = self.input_df
+    if isinstance(self.input_df, Data):
+        df_source: pd.DataFrame = self.input_df.to_dataframe()
+    elif isinstance(self.input_df, list):
+        # Accept lists of Data or DataFrame (or a mix); normalize to DataFrame
+        frames: list[pd.DataFrame] = []
+        for item in self.input_df:
+            if isinstance(item, Data):
+                frames.append(item.to_dataframe())
+            elif isinstance(item, pd.DataFrame):
+                frames.append(item)
+            else:
+                msg = f"Unsupported input type in list: {type(item).__name__}. Expected Data or DataFrame."
+                raise ValueError(msg)
+        if not frames:
+            raise ValueError("Empty input list provided for 'input_df'.")
+        df_source = pd.concat(frames, ignore_index=True)
+    elif isinstance(self.input_df, pd.DataFrame):
+        df_source = self.input_df
+    else:
+        msg = f"Unsupported input type for 'input_df': {type(self.input_df).__name__}. Expected Data or DataFrame."
+        raise ValueError(msg)
```
+ raise ValueError(msg)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
HandleInput(
    name="input_df",
    display_name="Input",
    info=(
        "Table with all original columns (already chunked / processed). "
        "Accepts Data or DataFrame. If Data is provided, it is converted to a DataFrame automatically."
    ),
    input_types=["Data", "DataFrame"],
    is_list=True,
    required=True,
)

async def build_kb_info(self) -> Data:
    # … previous logic …
    if isinstance(self.input_df, Data):
        df_source: pd.DataFrame = self.input_df.to_dataframe()
    elif isinstance(self.input_df, list):
        # Accept lists of Data or DataFrame (or a mix); normalize to a single DataFrame
        frames: list[pd.DataFrame] = []
        for item in self.input_df:
            if isinstance(item, Data):
                frames.append(item.to_dataframe())
            elif isinstance(item, pd.DataFrame):
                frames.append(item)
            else:
                msg = (
                    f"Unsupported input type in list: {type(item).__name__}. "
                    "Expected Data or DataFrame."
                )
                raise ValueError(msg)
        if not frames:
            raise ValueError("Empty input list provided for 'input_df'.")
        df_source = pd.concat(frames, ignore_index=True)
    elif isinstance(self.input_df, pd.DataFrame):
        df_source = self.input_df
    else:
        msg = (
            f"Unsupported input type for 'input_df': {type(self.input_df).__name__}. "
            "Expected Data or DataFrame."
        )
        raise ValueError(msg)
    # … subsequent logic …
```
🤖 Prompt for AI Agents
In src/backend/base/langflow/components/knowledge_bases/ingestion.py around
lines 107-116, build_kb_info currently accepts ["Data","DataFrame"] with
is_list=True but only handles list[Data] or single DataFrame; this misses
list[DataFrame] and mixed lists causing df_source to remain a raw Python list.
Fix by normalizing incoming values: if input is a list, iterate elements and
classify each element as Data (leave as-is) or DataFrame (convert to Data by
records dicts or collect DataFrames); if all elements are DataFrames, concat
them into a single DataFrame and proceed down the single-DataFrame branch;
otherwise convert DataFrame elements to Data and treat the whole list as
list[Data] so existing list-Data handling works; add defensive type checks and
clear error messages for unsupported element types.
edwinjosechittilappilly left a comment
Code LGTM
please check the Icons in template vs sidebar
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##           release-1.6.0    #9543   +/-   ##
================================================
  Coverage               ?   34.68%
================================================
  Files                  ?     1209
  Lines                  ?    57112
  Branches               ?     5419
================================================
  Hits                   ?    19812
  Misses                 ?    37156
  Partials               ?      144
```

Flags with carried forward coverage won't be shown.
(screenshots omitted)
This pull request refactors the knowledge base ingestion and retrieval components in Langflow, moving them out of the data module into a dedicated knowledge_bases module, and introduces dynamic/lazy importing for these components. It also updates the starter project configuration and improves input/output handling and error management for knowledge base operations.

Component Refactoring and Module Organization:
KBIngestionComponent and KBRetrievalComponent have been renamed to KnowledgeIngestionComponent and KnowledgeRetrievalComponent, respectively, and moved from langflow.components.data to a new langflow.components.knowledge_bases module for better code organization. [1] [2]

Dynamic Import and API Improvements:
The new knowledge_bases/__init__.py uses Python’s __getattr__ and __dir__ to lazily import components only when accessed, improving startup time and modularity.

Input/Output Handling Enhancements:
The ingestion component now uses HandleInput for the input dataframe, supports both Data and DataFrame types (including lists), and automatically converts Data objects to dataframes, making it more flexible for various input sources. [1] [2]

Error Handling Improvements:
Starter Project Updates:
Summary by CodeRabbit