docs: expand seed dataset docs for filesystem sources by stepwise-ai-dev · Pull Request #452 · NVIDIA-NeMo/DataDesigner

stepwise-ai-dev · 2026-03-23T16:39:58Z

Summary

- expand the seed dataset concept docs to cover the shipped filesystem-backed seed sources
- document DirectorySeedSource and FileContentsSeedSource, including file_pattern, recursive, encoding, and the seeded columns they expose
- fix the broken preview example to use data_designer.preview(...)

Closes #448.

Validation

- cross-checked the docs against packages/data-designer-config/src/data_designer/config/seed_source.py
- cross-checked the docs against packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
- cross-checked the documented behavior against existing interface and seed reader tests
- did not run make test locally because this machine has uv 0.5.11, which cannot parse the repo's newer tool.uv.required-version / uv.lock format

github-actions · 2026-03-23T16:40:09Z

All contributors have signed the DCO ✍️ ✅
_Posted by the DCO Assistant Lite bot._

greptile-apps · 2026-03-23T16:42:28Z

Greptile Summary

This PR expands docs/concepts/seed-datasets.md to document the two filesystem-backed seed sources (DirectorySeedSource and FileContentsSeedSource) that were previously undocumented on this page, and fixes a stale variable name in the preview example (designer → data_designer).

Key changes and findings:

- DirectorySeedSource section: column schema (source_kind, source_path, relative_path, file_name), default values for file_pattern and recursive, and the note about basename-only matching all match the implementation in seed_reader.py and seed_source.py.
- FileContentsSeedSource section: the source_kind value ("file_contents") and the full output_columns list (source_kind, source_path, relative_path, file_name, content) now match FileContentsSeedReader.output_columns and the runtime _build_metadata_record call; this resolves the concern raised in the previous review thread.
- The encoding="utf-8" default is correct and matches seed_source.py.
- The preview fix (data_designer.preview(...)) is correct: the variable is named data_designer in the complete example.
- One factual inaccuracy remains: the sentence "Data Designer supports five ways to provide seed data" is wrong because AgentRolloutSeedSource is also a shipped source with its own recipe page, making the actual count six.
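
To make the column schema and the basename-only matching note concrete, here is a minimal stdlib sketch of what a directory manifest with those four columns could look like. `build_manifest` here is a hypothetical stand-in written for illustration, not the code in seed_reader.py, and the real readers may order, filter, or resolve paths differently. The parameters are keyword-only with no defaults on purpose, since the actual default values are not restated in this thread.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def build_manifest(root: Path, *, file_pattern: str, recursive: bool) -> list[dict[str, str]]:
    """Hypothetical sketch of a directory manifest; not the library's implementation."""
    candidates = root.rglob("*") if recursive else root.glob("*")
    rows = []
    for path in sorted(candidates):
        # The pattern is compared against the basename only, never the full path.
        if path.is_file() and fnmatch(path.name, file_pattern):
            rows.append(
                {
                    "source_kind": "directory_file",
                    "source_path": str(path),
                    "relative_path": path.relative_to(root).as_posix(),
                    "file_name": path.name,
                }
            )
    return rows


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    (root / "a.md").write_text("alpha", encoding="utf-8")
    (root / "notes" / "b.md").write_text("beta", encoding="utf-8")
    (root / "notes" / "c.txt").write_text("gamma", encoding="utf-8")

    rows = build_manifest(root, file_pattern="*.md", recursive=True)
    relative_paths = [row["relative_path"] for row in rows]

    # A pattern containing a directory component matches nothing in this sketch,
    # because only the basename is tested.
    nested_pattern_rows = build_manifest(root, file_pattern="notes/*.md", recursive=True)
```

In this sketch a pattern such as `notes/*.md` selects no files, which is the practical consequence of basename-only matching that the docs note calls out.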

Confidence Score: 4/5

- Safe to merge after fixing the off-by-one source count; all schema details are accurate.
- The previous P0 concern (undocumented source_kind divergence for FileContentsSeedSource) has been fully addressed: the column schema is now explicitly listed and matches the runtime implementation. The only remaining issue is a factual count error ("five" instead of "six") that is a one-word fix. All other content (column names, default values, code examples, and the preview bugfix) cross-checks cleanly against the implementation.
- docs/concepts/seed-datasets.md: the seed source count on line 57 needs to be updated from "five" to "six" (or the sentence rephrased) to account for AgentRolloutSeedSource.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/concepts/seed-datasets.md | Adds `DirectorySeedSource` and `FileContentsSeedSource` sections with accurate column schemas and correct default values, and fixes the preview call; the seed-source count ("five") is off by one because `AgentRolloutSeedSource` is also a shipped source documented elsewhere. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User provides SeedSource] --> B{seed_type?}
    B --> C[local → LocalFileSeedReader]
    B --> D[hf → HuggingFaceSeedReader]
    B --> E[dataframe → DataFrameSeedReader]
    B --> F[directory → DirectorySeedReader]
    B --> G[file_contents → FileContentsSeedReader]
    B --> H[agent_rollout → AgentRolloutSeedReader]

    F --> I[build_manifest\nsource_kind=directory_file\nsource_path, relative_path, file_name]
    G --> J[build_manifest\nsource_kind=file_contents\nsource_path, relative_path, file_name]
    J --> K[hydrate_row\nadds: content]

    I --> L[Seed columns injected\ninto Jinja2 templates]
    K --> L
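
The dispatch in the flowchart can be sketched as a lookup table. The seed_type keys and reader class names below are copied from the flowchart; the `SEED_READERS` dict and the `reader_for` helper are hypothetical names used only for illustration, and the engine may resolve readers in a different way.

```python
# Mapping taken from the flowchart above; reader names are kept as strings so the
# sketch stays self-contained and does not import the real engine.
SEED_READERS = {
    "local": "LocalFileSeedReader",
    "hf": "HuggingFaceSeedReader",
    "dataframe": "DataFrameSeedReader",
    "directory": "DirectorySeedReader",
    "file_contents": "FileContentsSeedReader",
    "agent_rollout": "AgentRolloutSeedReader",
}


def reader_for(seed_type: str) -> str:
    """Hypothetical helper: return the reader name for a seed_type, or fail loudly."""
    try:
        return SEED_READERS[seed_type]
    except KeyError:
        known = ", ".join(sorted(SEED_READERS))
        raise ValueError(f"unknown seed_type {seed_type!r}; expected one of: {known}") from None


source_count = len(SEED_READERS)
```

Counting the entries of a table like this gives six, which is where the off-by-one in the docs sentence comes from.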

Prompt To Fix All With AI

This is a comment left during a code review.
Path: docs/concepts/seed-datasets.md
Line: 57

Comment:
**Source count is off by one**

The PR increments the count from three to five, but `AgentRolloutSeedSource` is also a shipped seed source — it has its own recipe at `docs/recipes/trace_ingestion/agent_rollout_distillation.md` and is featured on `docs/recipes/cards.md`. That makes six sources in total, so the sentence is factually wrong.

Either update the count to "six" and add a brief entry for `AgentRolloutSeedSource`, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").

```suggestion
Data Designer supports six ways to provide seed data:
```

How can I resolve this? If you propose a fix, please make it concise.

_Reviews (2): Last reviewed commit: "docs: clarify file contents seed source ..."_

greptile-apps · 2026-03-23T16:42:32Z

docs/concepts/seed-datasets.md

+`FileContentsSeedSource` adds one extra seeded column:
+
+- `content` - decoded text contents of the matched file


source_kind value undocumented, implicitly misleading

The section says FileContentsSeedSource exposes "the same metadata as DirectorySeedSource". The DirectorySeedSource section explicitly documents source_kind as always "directory_file", so a reader will naturally infer that FileContentsSeedSource also emits source_kind = "directory_file".

In fact, the implementation sets a different value:

    # packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py, line 544
    _build_metadata_record(
        context=context,
        relative_path=relative_path,
        source_kind="file_contents",  # ← different from "directory_file"
    )

Any user who filters or branches on source_kind (e.g. {% if source_kind == "directory_file" %}) would get silent wrong behaviour when using FileContentsSeedSource.

Please list the full column schema explicitly, matching the runtime implementation (output_columns on FileContentsSeedReader):

Suggested change

-`FileContentsSeedSource` adds one extra seeded column:
-
-- `content` - decoded text contents of the matched file
+`FileContentsSeedSource` exposes these seeded columns:
+
+- `source_kind` - always `"file_contents"`
+- `source_path` - full path to the matched file
+- `relative_path` - path relative to the configured directory
+- `file_name` - basename of the matched file
+- `content` - decoded text contents of the matched file
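
As a rough illustration of the silent failure described above, the following plain-Python sketch mirrors a template branch on `source_kind`. The row values and both helper functions are made up for this example; only the column names and the two `source_kind` values come from the review.

```python
# Example rows shaped like the documented schemas; paths and contents are invented.
directory_row = {
    "source_kind": "directory_file",
    "source_path": "/data/docs/a.md",
    "relative_path": "a.md",
    "file_name": "a.md",
}
file_contents_row = {
    "source_kind": "file_contents",
    "source_path": "/data/docs/a.md",
    "relative_path": "a.md",
    "file_name": "a.md",
    "content": "alpha",
}


def describe(row: dict[str, str]) -> str:
    """Mirrors {% if source_kind == "directory_file" %} with no else branch."""
    if row["source_kind"] == "directory_file":
        return f"file: {row['file_name']}"
    return ""  # the unmatched branch renders nothing and raises no error


def describe_explicit(row: dict[str, str]) -> str:
    """Handles both documented kinds and rejects anything unexpected."""
    kind = row["source_kind"]
    if kind == "directory_file":
        return f"file: {row['file_name']}"
    if kind == "file_contents":
        return f"file: {row['file_name']} ({len(row['content'])} chars)"
    raise ValueError(f"unexpected source_kind: {kind!r}")


silent_result = describe(file_contents_row)
explicit_result = describe_explicit(file_contents_row)
```

The first helper returns an empty string for the `file_contents` row without any error, which is why listing the exact `source_kind` value per source in the docs matters.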


Fixed in b7ecc5d.

stepwise-ai-dev · 2026-03-23T16:44:36Z

I have read the DCO document and I hereby sign the DCO.

greptile-apps · 2026-03-23T16:57:46Z

docs/concepts/seed-datasets.md

 ## Seed Sources

-Data Designer supports three ways to provide seed data:
+Data Designer supports five ways to provide seed data:


Source count is off by one

The PR increments the count from three to five, but AgentRolloutSeedSource is also a shipped seed source — it has its own recipe at docs/recipes/trace_ingestion/agent_rollout_distillation.md and is featured on docs/recipes/cards.md. That makes six sources in total, so the sentence is factually wrong.

Either update the count to "six" and add a brief entry for AgentRolloutSeedSource, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").

Suggested change

-Data Designer supports five ways to provide seed data:
+Data Designer supports six ways to provide seed data:


docs: expand seed dataset docs for filesystem sources

86234c7

stepwise-ai-dev requested a review from a team as a code owner March 23, 2026 16:39

greptile-apps bot reviewed Mar 23, 2026

stepwise-ai-dev and others added 2 commits March 23, 2026 09:44

Merge branch 'main' into stepwise-ai-dev/docs/448-seed-dataset-docs

0d18601

docs: clarify file contents seed source columns

b7ecc5d

greptile-apps bot reviewed Mar 23, 2026

stepwise-ai-dev wants to merge 3 commits into NVIDIA-NeMo:main from stepwise-ai-dev:stepwise-ai-dev/docs/448-seed-dataset-docs
