feat: Add specialized handler for tool calling improvements#10857
Cristhianzl merged 21 commits into main
Conversation
… and reduce noise in logs
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.

Walkthrough

This change introduces IBM WatsonX/Granite-specific tool calling support with model detection, enhanced system prompts for multi-tool scenarios, placeholder detection in tool arguments, and a dynamic agent creator that alternates tool-choice binding based on iteration depth. The feature integrates into the existing tool-calling pipeline and is covered by comprehensive unit tests.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant ToolCalling
    participant GraniteHandler
    participant LLM
    participant Agent
    participant Tools
    Client->>ToolCalling: create_agent_runnable(llm, tools, prompt)
    rect rgb(200, 220, 240)
        Note over ToolCalling,GraniteHandler: Granite Model Detection & Prompt Enhancement
        ToolCalling->>GraniteHandler: is_granite_model(llm)
        GraniteHandler-->>ToolCalling: bool result
        alt Granite Model Detected
            ToolCalling->>GraniteHandler: get_enhanced_system_prompt(base_prompt, tools)
            GraniteHandler-->>ToolCalling: enhanced_prompt with tool instructions
        end
    end
    rect rgb(240, 220, 200)
        Note over ToolCalling,Agent: Agent Creation
        alt Granite Model with Tools
            ToolCalling->>GraniteHandler: create_granite_agent(llm, tools, prompt)
            loop For Each Iteration
                rect rgb(220, 240, 220)
                    Note over GraniteHandler,LLM: Dynamic Tool Choice Binding
                    GraniteHandler->>LLM: Select tool_choice (required/auto based on iteration)
                    LLM->>LLM: Bind tools with selected tool_choice
                end
                GraniteHandler->>LLM: Invoke with input & agent_scratchpad
                LLM-->>GraniteHandler: tool_call or response
                GraniteHandler->>GraniteHandler: detect_placeholder_in_args(tool_calls)
                alt Placeholder Detected
                    GraniteHandler->>GraniteHandler: Inject corrective SystemMessage
                end
                GraniteHandler->>Tools: Execute tool_calls
                Tools-->>GraniteHandler: tool results
            end
            GraniteHandler-->>ToolCalling: RunnableLambda agent
        else Default Model
            ToolCalling->>Agent: create_tool_calling_agent(llm, tools, prompt)
            Agent-->>ToolCalling: default agent
        end
    end
    ToolCalling-->>Client: agent_runnable
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30–45 minutes
Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
✅ Passed checks (6 passed)
Codecov Report

❌ Patch coverage is 0.00%. Your patch status has failed because the patch coverage (0.00%) is below the target coverage (40.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main   #10857      +/-   ##
==========================================
+ Coverage   33.76%   33.77%    +0.01%
==========================================
  Files        1400     1400
  Lines       66341    66342        +1
  Branches     9791     9791
==========================================
+ Hits        22401    22409        +8
+ Misses      42799    42792        -7
  Partials     1141     1141
```

Flags with carried forward coverage won't be shown.
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/lfx/src/lfx/components/langchain_utilities/tool_calling.py (1)

56-62: Mutating `self.system_prompt` in place may cause issues if called multiple times. If `create_agent_runnable()` is invoked more than once on the same component instance, the system prompt will accumulate enhancements. Consider using a local variable instead:

```diff
-        if is_granite_model(self.llm) and self.tools:
-            self.system_prompt = get_enhanced_system_prompt(
-                self.system_prompt or "",
-                self.tools
-            )
+        effective_system_prompt = self.system_prompt or ""
+        if is_granite_model(self.llm) and self.tools:
+            effective_system_prompt = get_enhanced_system_prompt(
+                effective_system_prompt,
+                self.tools
+            )
```

Then use `effective_system_prompt` when building the prompt template. If this is a non-issue in practice, feel free to disregard.

src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py (1)
733-739: Consider more explicit exception handling in tests. Using `contextlib.suppress(Exception)` hides all exceptions, which may mask unexpected failures. Consider catching specific expected exceptions or asserting on the behavior before the exception:

```python
# Alternative: verify the call was made regardless of downstream parsing
with patch(
    "lfx.components.langchain_utilities.ibm_granite_handler.format_to_tool_messages",
    return_value=[],
):
    try:
        agent.invoke(inputs)
    except Exception:  # ToolsAgentOutputParser may fail
        pass
self.mock_llm_required.invoke.assert_called()
```

This makes the test intention clearer while still handling expected exceptions.
src/lfx/src/lfx/components/langchain_utilities/ibm_granite_handler.py (2)
55-72: The try/except/else control flow is correct but could be clearer. The function works, but the `else: return ""` branch handles both the "no args_schema" case and the "args_schema exists but has no model_fields" case. Consider consolidating:

```diff
 def _get_tool_schema_description(tool) -> str:
     """Extract a brief description of the tool's expected parameters."""
+    if not hasattr(tool, "args_schema") or not tool.args_schema:
+        return ""
     try:
-        if hasattr(tool, "args_schema") and tool.args_schema:
-            schema = tool.args_schema
-            if hasattr(schema, "model_fields"):
-                fields = schema.model_fields
-                params = []
-                for name, field in fields.items():
-                    required = field.is_required() if hasattr(field, "is_required") else True
-                    req_str = "(required)" if required else "(optional)"
-                    params.append(f"{name} {req_str}")
-                return f"Parameters: {', '.join(params)}" if params else ""
-    except Exception:  # noqa: BLE001
+        schema = tool.args_schema
+        if not hasattr(schema, "model_fields"):
+            return ""
+        fields = schema.model_fields
+        params = []
+        for name, field in fields.items():
+            required = field.is_required() if hasattr(field, "is_required") else True
+            req_str = "(required)" if required else "(optional)"
+            params.append(f"{name} {req_str}")
+        return f"Parameters: {', '.join(params)}" if params else ""
+    except Exception:
         return ""
-    else:
-        return ""
```

This is optional; the current implementation works.
188-196: Inline import could be moved to module level. The `SystemMessage` import inside the function is executed on every placeholder detection. Moving it to the top of the file would be slightly more efficient:

```diff
 from langchain_core.prompts import ChatPromptTemplate
 from langchain_core.runnables import RunnableLambda
+from langchain_core.messages import SystemMessage
```

```diff
     if has_placeholder:
         logger.warning("[IBM WatsonX] Placeholder detected, forcing final answer")
-        from langchain_core.messages import SystemMessage
         corrective_msg = SystemMessage(content=(
```

This is a minor optimization.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)

- src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py (1 hunks)
- src/lfx/src/lfx/base/agents/agent.py (1 hunks)
- src/lfx/src/lfx/components/langchain_utilities/ibm_granite_handler.py (1 hunks)
- src/lfx/src/lfx/components/langchain_utilities/tool_calling.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/backend/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)
src/backend/**/*.py:
- Use FastAPI async patterns with `await` for async operations in component execution methods
- Use `asyncio.create_task()` for background tasks and implement proper cleanup with try/except for `asyncio.CancelledError`
- Use `queue.put_nowait()` for non-blocking queue operations and `asyncio.wait_for()` with timeouts for controlled get operations

Files:
- src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py
src/backend/tests/**/*.py
📄 CodeRabbit inference engine (.cursor/rules/testing.mdc)
src/backend/tests/**/*.py:
- Place backend unit tests in `src/backend/tests/`, component tests in `src/backend/tests/unit/components/` organized by component subdirectory, and integration tests accessible via `make integration_tests`
- Use the same filename as the component with the appropriate test prefix/suffix (e.g., `my_component.py` → `test_my_component.py`)
- Use the `client` fixture (FastAPI Test Client) defined in `src/backend/tests/conftest.py` for API tests; it provides an async `httpx.AsyncClient` with an automatic in-memory SQLite database and mocked environment variables. Skip client creation by marking the test with `@pytest.mark.noclient`
- Inherit from the correct `ComponentTestBase` family class located in `src/backend/tests/base.py` based on API access needs: `ComponentTestBase` (no API), `ComponentTestBaseWithClient` (needs API), or `ComponentTestBaseWithoutClient` (pure logic). Provide three required fixtures: `component_class`, `default_kwargs`, and `file_names_mapping`
- Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes
- Test both sync and async code paths, mock external dependencies appropriately, test error handling and edge cases, validate input/output behavior, and test component initialization and configuration
- Use the `@pytest.mark.asyncio` decorator for async component tests and ensure async methods are properly awaited
- Test background tasks using `asyncio.create_task()` and verify completion with `asyncio.wait_for()` with appropriate timeout constraints
- Test queue operations using non-blocking `queue.put_nowait()` and `asyncio.wait_for(queue.get(), timeout=...)` to verify queue processing without blocking
- Use the `@pytest.mark.no_blockbuster` marker to skip the blockbuster plugin in specific tests
- For database tests that may fail in batch runs, run them sequentially using `uv run pytest src/backend/tests/unit/test_database.py` …

Files:
- src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py
**/test_*.py
📄 CodeRabbit inference engine (Custom checks)
**/test_*.py:
- Review test files for excessive use of mocks that may indicate poor test design; check whether tests have so many mock objects that they obscure what is actually being tested
- Warn when mocks are used instead of testing real behavior and interactions, and suggest using real objects or test doubles when mocks become excessive
- Ensure mocks are used appropriately for external dependencies only, not for core logic
- Backend test files should follow the naming convention `test_*.py` with proper pytest structure
- Test files should have descriptive test function names that explain what is being tested
- Tests should be organized logically with proper setup and teardown
- Consider including edge cases and error conditions for comprehensive test coverage
- Verify tests cover both positive and negative scenarios where appropriate
- For async functions in backend tests, ensure proper async testing patterns are used with pytest
- For API endpoints, verify both success and error response testing

Files:
- src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py
🧠 Learnings (3)
📚 Learning: 2025-11-24T19:46:09.104Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Backend components should be structured with clear separation of concerns: agents, data processing, embeddings, input/output, models, text processing, prompts, tools, and vector stores
Applied to files:
src/lfx/src/lfx/components/langchain_utilities/tool_calling.py
📚 Learning: 2025-08-11T16:52:26.755Z
Learnt from: edwinjosechittilappilly
Repo: langflow-ai/langflow PR: 9336
File: src/backend/base/langflow/base/models/openai_constants.py:29-33
Timestamp: 2025-08-11T16:52:26.755Z
Learning: The "gpt-5-chat-latest" model in the OpenAI models configuration does not support tool calling, so tool_calling should be set to False for this model in src/backend/base/langflow/base/models/openai_constants.py.
Applied to files:
src/lfx/src/lfx/components/langchain_utilities/tool_calling.py
📚 Learning: 2025-11-24T19:47:28.997Z
Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes
Applied to files:
src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py
🧬 Code graph analysis (3)
src/lfx/src/lfx/components/langchain_utilities/tool_calling.py (1)
- src/lfx/src/lfx/components/langchain_utilities/ibm_granite_handler.py (3): `create_granite_agent` (129-201), `get_enhanced_system_prompt` (75-107), `is_granite_model` (45-52)

src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py (1)
- src/lfx/src/lfx/components/langchain_utilities/ibm_granite_handler.py (5): `create_granite_agent` (129-201), `detect_placeholder_in_args` (110-126), `get_enhanced_system_prompt` (75-107), `is_granite_model` (45-52), `is_watsonx_model` (29-42)

src/lfx/src/lfx/components/langchain_utilities/ibm_granite_handler.py (3)
- src/lfx/src/lfx/field_typing/constants.py (1): `ChatPromptTemplate` (73-74)
- src/lfx/src/lfx/base/tools/flow_tool.py (1): `args` (32-34)
- src/backend/tests/unit/mock_language_model.py (2): `bind_tools` (69-72), `invoke` (42-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
- GitHub Check: Lint Backend / Run Mypy (3.10)
- GitHub Check: Lint Backend / Run Mypy (3.11)
- GitHub Check: Lint Backend / Run Mypy (3.12)
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 5
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 4
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 2
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 1
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 3
- GitHub Check: Run Backend Tests / LFX Tests - Python 3.10
- GitHub Check: Run Backend Tests / Integration Tests - Python 3.10
- GitHub Check: Test Docker Images / Test docker images
- GitHub Check: Test Starter Templates
- GitHub Check: Update Starter Projects
- GitHub Check: Run Ruff Check and Format
- GitHub Check: Update Component Index
🔇 Additional comments (19)
src/lfx/src/lfx/base/agents/agent.py (1)

184-184: LGTM! The comment accurately describes the purpose of the subsequent code block that adds `system_prompt` and `chat_history` to the existing `input_dict`.

src/lfx/src/lfx/components/langchain_utilities/tool_calling.py (2)

5-11: LGTM! Clean separation of IBM-specific logic into a dedicated module with appropriate imports.

78-86: Potential inconsistency: the handler claims all WatsonX models are affected, but only Granite models get special handling. The `ibm_granite_handler.py` docstring states tool calling issues affect ALL WatsonX models (Llama, Mistral, etc.), yet here only Granite models (detected by `is_granite_model`) receive the specialized handler. Non-Granite WatsonX models will fall through to the default `create_tool_calling_agent`.

If the intention is to handle all WatsonX models, consider using `is_watsonx_model()` instead:

```diff
-        if is_granite_model(self.llm) and self.tools:
+        if is_watsonx_model(self.llm) and self.tools:
             return create_granite_agent(self.llm, self.tools, prompt)
```

If the current behavior is intentional (only Granite for now), update the handler's docstring to clarify scope.
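As context for the detection functions discussed here, a minimal sketch of class-name/module-name provider detection. `looks_like_watsonx` and the stand-in classes are hypothetical; the real `is_watsonx_model()` lives in `ibm_granite_handler.py` and may use additional fallbacks:

```python
def looks_like_watsonx(llm) -> bool:
    """Heuristically detect a WatsonX-hosted LLM by class or module name."""
    cls = type(llm)
    return "watsonx" in cls.__name__.lower() or "watsonx" in cls.__module__.lower()


class ChatWatsonx:
    """Stand-in for a WatsonX chat model class (e.g. from langchain_ibm)."""


class ChatOpenAI:
    """Stand-in for a non-WatsonX provider."""


print(looks_like_watsonx(ChatWatsonx()))  # True
print(looks_like_watsonx(ChatOpenAI()))  # False
```

Switching from `is_granite_model` to a check like this is exactly the behavioral widening the review comment raises, which is why the scope question matters.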
src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py (9)

26-30: LGTM! The `create_mock_tool` helper is a clean and reusable utility for creating mock tools with proper `name` attributes throughout the tests.

33-143: LGTM! Comprehensive test coverage for `is_watsonx_model()`, including class name detection, module detection, case insensitivity, and negative cases for non-WatsonX providers.

151-254: LGTM! Thorough test coverage for the deprecated `is_granite_model()` function, including model_id/model_name detection, fallback behavior, case insensitivity, and edge cases.

261-347: LGTM! Good test coverage for `get_enhanced_system_prompt()`, including edge cases (empty, None, single tool), a boundary condition (exactly 2 tools), and content verification.

497-507: Good documentation of a current limitation. The test clearly documents that the current implementation doesn't detect placeholders in nested structures. If this is intentional, consider adding a comment in the production code as well.

565-619: LGTM! Well-structured parameterized tests for the regex pattern with comprehensive positive and negative cases.

626-698: LGTM! Good test coverage for `create_granite_agent()`, including error handling, valid inputs, tool_choice bindings, and custom parameters.

818-841: Test confirms Llama on WatsonX uses the default agent. This test validates that non-Granite WatsonX models (like Llama) use the default `create_tool_calling_agent`. This is consistent with the current `is_granite_model` check but contradicts the handler's docstring claim that all WatsonX models are affected. Ensure this is the intended behavior.

921-1002: LGTM! Comprehensive edge-case testing including type coercion, regex safety, preservation guarantees, and documented implementation limitations.
src/lfx/src/lfx/components/langchain_utilities/ibm_granite_handler.py (7)
1-20: LGTM! Well-documented module purpose with a clear explanation of WatsonX-specific tool calling issues affecting multiple model types.

22-26: Potential false positive for the HTML `<data>` tag. The regex will match `<data>` (an HTML5 element) because it contains the keyword "data". This is unlikely in practice for tool arguments but worth noting. The current implementation is reasonable for the expected use case.

29-42: LGTM! Robust WatsonX detection with multiple fallback strategies (class name, module name).

45-52: LGTM! Clean implementation with a proper deprecation notice and graceful handling of missing or non-string attributes.

75-107: LGTM! Well-structured prompt enhancement with an appropriate early return for edge cases and comprehensive instructions addressing WatsonX tool-calling quirks.

110-126: LGTM! Clean implementation with proper null checks and support for both dict and string arguments. The documented limitations (no nested dict/list checking) are reasonable for the use case.
200-205: LGTM! Good logging of agent creation and a useful alias for backward/forward compatibility.
```
IMPORTANT INSTRUCTIONS FOR TOOL USAGE:

1. ALWAYS call tools when you need information - never say "I cannot" or "I don't have access".
2. Call ONE tool at a time, wait for the result, then decide your next action.
```
would this limit the parallel tool calling?
Yes, but this is a WatsonX platform limitation, not a design choice. WatsonX models don't reliably support parallel tool calls - when they attempt multiple calls, they often fail or produce incorrect results. The code already handles this by keeping only the first tool call when multiple are returned (see _limit_to_single_tool_call()).
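The `_limit_to_single_tool_call()` behavior described here can be sketched roughly as follows; the message shape is an assumption, since the real function's signature is not shown in this thread:

```python
from types import SimpleNamespace


def limit_to_single_tool_call(message):
    """Keep only the first tool call when a model returns several at once."""
    tool_calls = getattr(message, "tool_calls", None) or []
    if len(tool_calls) > 1:
        # WatsonX models often fail on parallel calls, so drop all but the first.
        message.tool_calls = tool_calls[:1]
    return message


msg = SimpleNamespace(tool_calls=[
    {"name": "search", "args": {"q": "a"}},
    {"name": "lookup", "args": {"id": 1}},
])
print(len(limit_to_single_tool_call(msg).tool_calls))  # 1
```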
Is this intentional behavior for Granite models? The question would be whether we should be adding workarounds to support models that aren't intended to work with multi-tool agentic functions. I'm not very familiar with Granite's capabilities, but the solution looks as if it is handling many "quirks" in order to get it to work - and often making some very specific behavioral decisions.
```python
# Create LLM variants with different tool_choice settings
# Note: WatsonX doesn't support parallel_tool_calls parameter, we handle multiple
# tool calls manually by keeping only the first one
```
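The iteration-based binding this snippet alludes to could be sketched like this; the threshold name `force_tool_iterations` and the default of 2 are assumptions:

```python
def select_tool_choice(iteration: int, force_tool_iterations: int = 2) -> str:
    """Force tool calls in early iterations, then allow a free-form final answer."""
    return "required" if iteration < force_tool_iterations else "auto"


print([select_tool_choice(i) for i in range(4)])  # ['required', 'required', 'auto', 'auto']
```

The agent would then bind tools with the selected `tool_choice` before each LLM invocation, which is the dynamic binding shown in the sequence diagram.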
I don't fully understand the solution here, but it feels like it should be solved on the WatsonX/Granite side rather than adding workarounds on our end.
My main concerns would be:
a) complexity of the workarounds
b) making specific behavioral decisions like the tool_choice, the dynamic prompts, the set number of iterations, the stop message.
c) inconsistencies in behaviors in our system using these models vs. outside our system
You raise valid concerns. I've simplified the code significantly:

- Extracted helper functions (`_limit_to_single_tool_call`, `_handle_placeholder_in_response`) for better readability
- Improved docstrings to explain why this exists - these are documented WatsonX platform behaviors, not bugs we're working around
- Reduced complexity - the main `invoke()` function is now much cleaner

Regarding the strategic question: these workarounds address known WatsonX API behaviors (tool_choice handling, single tool call support). While ideally IBM would fix these upstream, this handler ensures Langflow users get a working experience today.
What if we limited the usage of watsonx/granite models in agentic workflows with a clear list of what's supported or not?
@jordanrfrazier I know Granite 4 should support multi-tool agents. I don't get why we need the extra steps, though. https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models
…ow into cz/improve-ibm-models
0ba146e to 967b9ae (Compare)
```python
def is_granite_model(llm) -> bool:
    """Check if the LLM is an IBM Granite model.

    DEPRECATED: Use is_watsonx_model() instead.
```
Classic AI bot thinking: every time it changes something, it marks the code it wrote 30 seconds ago as deprecated.
IBM WatsonX Tool Calling Improvements
Why This Was Needed
OpenAI models (GPT-4, etc.) were trained specifically for tool/function calling with a structured API. They understand when to call tools, how to format arguments as JSON, and when to stop and provide a final answer.
IBM WatsonX models (Granite, Llama, Mistral running on WatsonX) have different training and API behavior:

- The `tool_choice` parameter behaves differently than OpenAI's implementation
- Tool calls may be emitted as text markup (`<tool_call>...</tool_call>`) instead of using the structured API

Problem
IBM WatsonX models had significant issues with tool calling in the Agent component:

- With `tool_choice='required'`, models couldn't provide final answers
- Models emitted `<tool_call>` tags as text instead of using the native API
- Models produced placeholders like `<result-from-previous>` instead of actual values

Solution
Created a specialized handler (`ibm_granite_handler.py`) with:

- Dynamic `tool_choice` switching: forces `tool_choice='required'` for the first N iterations to ensure tools are called, then switches to `'auto'` to allow final answers

Files Changed
- src/lfx/src/lfx/components/langchain_utilities/ibm_granite_handler.py - New handler
- src/lfx/src/lfx/components/langchain_utilities/tool_calling.py - Integration with Agent
- src/backend/tests/unit/components/models_and_agents/test_ibm_granite_handler.py - Tests

Usage
The handler is automatically applied when using Granite models with the Tool Calling Agent component. No configuration needed.