UDF support: CREATE FUNCTION DDL with pipeline SQL integration #198

Draft
ryannedolan wants to merge 11 commits into main from udfs

Conversation


@ryannedolan ryannedolan commented Mar 19, 2026

Summary

  • Adds CREATE FUNCTION DDL support with end-to-end data plane integration
  • Demo Java UDFs (Greet, StringLength) and Python UDF (reverse_string) baked into the Flink runner image
  • Routes pipeline output through SqlJob CRD instead of FlinkSessionJob directly, enabling dynamic UDF file delivery via the files field
  • Revives the hoptimator-flink-adapter module (SqlJob → FlinkSessionJob reconciler) and wires it into the operator build
  • FlinkRunner parses --file: directives to write UDF files to disk before executing SQL
  • Dockerfile updated with Python/PyFlink for Python UDF support

Testing Done

  • Unit tests for Java UDFs (GreetTest, StringLengthTest)
  • Unit tests for FlinkRunner file directive parsing (FlinkRunnerTest)
  • Integration test k8s-ddl-udf-demo.id verifies pipeline generation with real UDF class names
  • All existing integration tests updated for SqlJob output format
  • Full build passes (checkstyle, spotbugs, all tests)
  • Verified end-to-end on a deployed cluster: CREATE FUNCTION greet AS 'com.linkedin.hoptimator.flink.runner.functions.Greet' followed by a materialized view using greet()

🤖 Generated with Claude Code

ryannedolan and others added 11 commits March 19, 2026 16:24

Add support for user-defined functions (UDFs) that can be registered via
CREATE FUNCTION and referenced in SQL queries. Registered functions are
included in pipeline SQL so Flink can execute them at runtime.

DDL syntax:
  CREATE FUNCTION name [RETURNS type] AS 'class' [LANGUAGE lang] [WITH (...)]
  DROP FUNCTION name

Phase 1 - JDBC driver + pipeline SQL:
- UserFunction API model (Deployable)
- OpaqueFunction: permissive ScalarFunction for Calcite validation with
  configurable return type (RETURNS clause) and ANY-typed parameters
- Session-scoped function registry on HoptimatorConnection
- CREATE/DROP FUNCTION handling in HoptimatorDdlExecutor
- FunctionImplementor in ScriptImplementor generates CREATE FUNCTION DDL
- PipelineRel.Implementor tracks functions and emits DDL before connectors
- Parser extended with RETURNS and LANGUAGE clauses
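The Phase 1 pieces above might be sketched roughly as follows. This is a hypothetical plain-Java illustration of a session-scoped registry that emits CREATE FUNCTION DDL; the class name, method names, and exact DDL emission are assumptions for illustration, not the actual Hoptimator API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a session-scoped UDF registry.
public class FunctionRegistry {
  // Insertion order preserved so DDL is emitted deterministically.
  private final Map<String, String> functions = new LinkedHashMap<>();

  // Calcite normalizes identifiers to uppercase, so the registry does too.
  public void register(String name, String className) {
    functions.put(name.toUpperCase(), className);
  }

  public void drop(String name) {
    functions.remove(name.toUpperCase());
  }

  // Emit CREATE FUNCTION DDL for every registered function,
  // ahead of the connector DDL in the pipeline SQL.
  public String ddl() {
    StringBuilder sb = new StringBuilder();
    functions.forEach((name, cls) ->
        sb.append("CREATE FUNCTION ").append(name)
          .append(" AS '").append(cls).append("';\n"));
    return sb.toString();
  }
}
```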

Phase 2 - Python code delivery:
- Job API gains files field for inline code (e.g., Python UDF sources)
- SqlJob CRD spec gains files field
- FlinkStreamingSqlJob and reconciler pass files through
- K8sJobDeployer exports files to template environment

Tests:
- ScriptImplementorTest: FunctionImplementor DDL generation
- Quidem unit test (create-function-ddl.id): DDL parsing, type validation
- Quidem integration test (k8s-ddl-udf.id): pipeline SQL with !specify

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…n functions

Calcite normalizes identifiers to uppercase, and all session-registered
functions are emitted in pipeline SQL (not just the one used in the query).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add Java and Python UDF implementations baked into the Flink runner
image so CREATE FUNCTION DDL resolves real functions at runtime:

- Greet: scalar VARCHAR UDF (Java)
- StringLength: scalar INTEGER UDF (Java)
- reverse_string: scalar VARCHAR UDF (Python/PyFlink)

Update Dockerfile to install Python/PyFlink and copy Python UDFs.
Configure Flink session cluster with Python executable paths.
Add k8s-ddl-udf-demo.id integration test using real UDF class names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
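A plain-Java sketch of what the two demo Java UDFs might compute. The real classes extend Flink's ScalarFunction and live under com.linkedin.hoptimator.flink.runner.functions; the exact greeting text and null handling here are assumptions for illustration.

```java
// Plain-Java sketch of the demo UDFs (the real ones extend
// org.apache.flink.table.functions.ScalarFunction).
public class DemoUdfs {
  // Greet: scalar VARCHAR UDF. Greeting text is an assumption.
  public static String greet(String name) {
    return "Hello, " + name + "!";
  }

  // StringLength: scalar INTEGER UDF. Null handling is an assumption.
  public static int stringLength(String s) {
    return s == null ? 0 : s.length();
  }
}
```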

Change flink-template.yaml to generate SqlJob instead of FlinkSessionJob
directly, so that UDF files (Python code) are bundled into the CRD and
can be dynamically delivered to the data plane.

- flink-template.yaml now generates SqlJob with sql + files fields
- FlinkStreamingSqlJob.yaml.template changed to FlinkSessionJob (session mode)
- FlinkStreamingSqlJob encodes files as --file: directives in sql args
- FlinkRunner parses --file: directives, writes to /opt/python-udfs/
- FlinkControllerProvider registers FlinkSessionJob API
- All integration test expected output updated from FlinkSessionJob to SqlJob

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
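The --file: parsing step could look roughly like this sketch, which splits SQL args into file directives and plain SQL statements. The name=contents encoding and the class name are assumptions; the actual FlinkRunner directive format may differ (and a later commit replaces this mechanism entirely).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a directive is assumed to look like
// "--file:<name>=<contents>"; everything else is a SQL statement.
public class FileDirectives {
  public final Map<String, String> files = new LinkedHashMap<>();
  public final List<String> sql = new ArrayList<>();

  public static FileDirectives parse(String[] args) {
    FileDirectives result = new FileDirectives();
    for (String arg : args) {
      if (arg.startsWith("--file:")) {
        String body = arg.substring("--file:".length());
        int eq = body.indexOf('=');
        // Collect file contents to write to the UDF directory before
        // executing any SQL.
        result.files.put(body.substring(0, eq), body.substring(eq + 1));
      } else {
        result.sql.add(arg);
      }
    }
    return result;
  }
}
```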

Resolve conflict in venice-ddl-insert-partial.id: take main's updated
SQL with multiple key fields (;-delimiter fix from #199) in SqlJob format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The flink-adapter module was orphaned from the build (not in
settings.gradle). Revive it so the SqlJob -> FlinkSessionJob
reconciler is compiled, packaged, and deployed with the operator.

- Add hoptimator-flink-adapter to settings.gradle
- Add as runtimeOnly dependency in hoptimator-operator-integration
  (discovered via SPI ControllerProvider)
- Rewrite FlinkControllerProvider and FlinkStreamingSqlJobReconciler
  to use current K8sContext/K8sApi pattern (was using old Operator API)
- Fix build.gradle dependency aliases (libs.kubernetes.client)
- Add hoptimator-util dependency for Api interface

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace hardcoded absolute path with System.getProperty fallback
to satisfy SpotBugs DMI_HARDCODED_ABSOLUTE_FILENAME check.
Configurable via -Dhoptimator.udf.dir, defaults to /opt/python-udfs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
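A minimal sketch of the property-with-fallback lookup described above; the class and method names are illustrative, but the property name and default come from the commit message.

```java
// Look up the UDF directory from a system property with a constant
// fallback, rather than a hardcoded absolute filename at the use site
// (which trips SpotBugs' DMI_HARDCODED_ABSOLUTE_FILENAME check).
public class UdfDir {
  static final String DEFAULT_UDF_DIR = "/opt/python-udfs";

  public static String udfDir() {
    return System.getProperty("hoptimator.udf.dir", DEFAULT_UDF_DIR);
  }
}
```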

The template engine renders an empty map as blank string, not {}.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SnakeYAML dumps an empty map as "{}\n", which the template engine
renders as an indented {} on a separate line after "files:".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SnakeYAML's dump() appends a trailing newline to its output (e.g.,
"{}\n" for an empty map). The template engine's multiline expansion
converts this into a spurious whitespace-only line. Trimming the
output fixes the rendering of {{files}} and other map variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
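A minimal sketch of the fix, assuming a simple string-substitution template engine; the substitute method is illustrative, not the actual engine API. The point is that the dumped YAML value is trimmed before substitution.

```java
// Substitute a template variable with a SnakeYAML-dumped value.
// dump() output ends with "\n" (an empty map dumps as "{}\n"), so the
// value is trimmed first to avoid a spurious whitespace-only line in
// the rendered YAML.
public class TemplateVar {
  public static String substitute(String template, String variable, String dumpedYaml) {
    return template.replace("{{" + variable + "}}", dumpedYaml.trim());
  }
}
```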

Replace the --file: encoding mechanism with the production pattern:
FlinkRunner receives --sqljob=namespace/name and fetches the SqlJob
CR directly from the K8s API to get SQL statements and UDF files.

- FlinkRunner uses DynamicKubernetesApi to fetch SqlJob CR
- Extracts spec.sql (statements) and spec.files (UDF code)
- Writes files to UDF directory, then executes SQL
- Falls back to SQL-from-args for backward compatibility
- Reconciler simplified: just passes SqlJob reference to template
- FlinkStreamingSqlJob reduced to namespace+name export
- RBAC added for Flink SA to read SqlJob CRs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
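Only the argument parsing is sketched here; per the commit message, the real FlinkRunner then uses DynamicKubernetesApi to fetch the SqlJob CR and read spec.sql and spec.files. The class name SqlJobRef is illustrative.

```java
// Hypothetical sketch of resolving the --sqljob=namespace/name argument
// into a CR reference that can be fetched from the K8s API.
public class SqlJobRef {
  public final String namespace;
  public final String name;

  SqlJobRef(String namespace, String name) {
    this.namespace = namespace;
    this.name = name;
  }

  public static SqlJobRef parse(String arg) {
    String ref = arg.substring("--sqljob=".length());
    int slash = ref.indexOf('/');
    return new SqlJobRef(ref.substring(0, slash), ref.substring(slash + 1));
  }
}
```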