docs: expand seed dataset docs for filesystem sources #452

Open
stepwise-ai-dev wants to merge 3 commits into NVIDIA-NeMo:main from stepwise-ai-dev:stepwise-ai-dev/docs/448-seed-dataset-docs

@stepwise-ai-dev stepwise-ai-dev commented Mar 23, 2026

Summary

  • expand the seed dataset concept docs to cover the shipped filesystem-backed seed sources
  • document DirectorySeedSource and FileContentsSeedSource, including file_pattern, recursive, encoding, and the seeded columns they expose
  • fix the broken preview example to use data_designer.preview(...)

Closes #448.

Validation

  • cross-checked the docs against packages/data-designer-config/src/data_designer/config/seed_source.py
  • cross-checked the docs against packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
  • cross-checked the documented behavior against existing interface and seed reader tests
  • did not run make test locally because this machine has uv 0.5.11, which cannot parse the repo's newer tool.uv.required-version / uv.lock format

@stepwise-ai-dev stepwise-ai-dev requested a review from a team as a code owner March 23, 2026 16:39

github-actions bot commented Mar 23, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR expands docs/concepts/seed-datasets.md to document the two filesystem-backed seed sources (DirectorySeedSource and FileContentsSeedSource) that were previously undocumented on this page, and fixes a stale variable name in the preview example (designer → data_designer).

Key changes and findings:

  • DirectorySeedSource section: column schema (source_kind, source_path, relative_path, file_name), default values for file_pattern and recursive, and the note about basename-only matching all match the implementation in seed_reader.py and seed_source.py.
  • FileContentsSeedSource section: the source_kind value ("file_contents") and the full output_columns list (source_kind, source_path, relative_path, file_name, content) now match FileContentsSeedReader.output_columns and the runtime _build_metadata_record call — this resolves the concern raised in the previous review thread.
  • The encoding="utf-8" default is correct and matches seed_source.py.
  • The preview fix (data_designer.preview(...)) is correct — the variable is named data_designer in the complete example.
  • One factual inaccuracy remains: the sentence "Data Designer supports five ways to provide seed data" is wrong because AgentRolloutSeedSource is also a shipped source with its own recipe page, making the actual count six.
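
The column schemas and basename-only matching described in the findings above can be illustrated with a small stand-in. This is a sketch only, not the code in seed_reader.py: the function name `build_rows` and the `read_contents` flag are hypothetical, and the defaults `file_pattern="*"` and `recursive=True` are assumptions (the review confirms that defaults exist but only states `encoding="utf-8"` explicitly).

```python
from fnmatch import fnmatch
from pathlib import Path


def build_rows(root, file_pattern="*", recursive=True, *, read_contents=False, encoding="utf-8"):
    """Return one dict per matched file, mimicking the documented seeded columns.

    Hypothetical helper for illustration; defaults for file_pattern and
    recursive are assumed, not taken from the library.
    """
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    rows = []
    for path in sorted(p for p in candidates if p.is_file()):
        # Basename-only matching: the pattern is tested against file_name,
        # never against the relative path.
        if not fnmatch(path.name, file_pattern):
            continue
        row = {
            # The two sources emit different source_kind values.
            "source_kind": "file_contents" if read_contents else "directory_file",
            "source_path": str(path),
            "relative_path": path.relative_to(root).as_posix(),
            "file_name": path.name,
        }
        if read_contents:
            # Only the file-contents variant adds the decoded text.
            row["content"] = path.read_text(encoding=encoding)
        rows.append(row)
    return rows
```

Because the pattern is compared with the basename only, a pattern such as `sub/*.txt` matches nothing in this sketch, even when `sub/b.txt` exists.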

Confidence Score: 4/5

  • Safe to merge after fixing the off-by-one source count; all schema details are accurate.
  • The previous P0 concern (undocumented source_kind divergence for FileContentsSeedSource) has been fully addressed: the column schema is now explicitly listed and matches the runtime implementation. The only remaining issue is a factual count error ("five" instead of "six") that is a one-word fix. All other content — column names, default values, code examples, and the preview bugfix — cross-checks cleanly against the implementation.
  • docs/concepts/seed-datasets.md — the seed source count on line 57 needs to be updated from "five" to "six" (or the sentence rephrased) to account for AgentRolloutSeedSource.

Important Files Changed

Filename Overview
docs/concepts/seed-datasets.md Adds DirectorySeedSource and FileContentsSeedSource sections with accurate column schemas and correct default values, and fixes the preview call; the seed-source count ("five") is off by one because AgentRolloutSeedSource is also a shipped source documented elsewhere.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User provides SeedSource] --> B{seed_type?}
    B --> C[local → LocalFileSeedReader]
    B --> D[hf → HuggingFaceSeedReader]
    B --> E[dataframe → DataFrameSeedReader]
    B --> F[directory → DirectorySeedReader]
    B --> G[file_contents → FileContentsSeedReader]
    B --> H[agent_rollout → AgentRolloutSeedReader]

    F --> I[build_manifest\nsource_kind=directory_file\nsource_path, relative_path, file_name]
    G --> J[build_manifest\nsource_kind=file_contents\nsource_path, relative_path, file_name]
    J --> K[hydrate_row\nadds: content]

    I --> L[Seed columns injected\ninto Jinja2 templates]
    K --> L
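
The dispatch step in the flowchart can be sketched as a lookup table. The seed_type strings and reader names below are copied from the flowchart; representing readers as plain strings, and the names `READER_BY_SEED_TYPE` and `reader_name_for`, are illustrative assumptions, since the thread does not show how Data Designer actually performs this dispatch.

```python
# seed_type -> reader name, as listed in the flowchart.
# Readers are plain strings here; the real classes are not imported.
READER_BY_SEED_TYPE = {
    "local": "LocalFileSeedReader",
    "hf": "HuggingFaceSeedReader",
    "dataframe": "DataFrameSeedReader",
    "directory": "DirectorySeedReader",
    "file_contents": "FileContentsSeedReader",
    "agent_rollout": "AgentRolloutSeedReader",
}


def reader_name_for(seed_type: str) -> str:
    """Look up the reader name for a seed_type, failing loudly on unknown values."""
    try:
        return READER_BY_SEED_TYPE[seed_type]
    except KeyError:
        raise ValueError(f"unknown seed_type: {seed_type!r}") from None
```

Counting the entries in a table like this gives six, which is the figure the review asks the docs to use.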

Reviews (2): Last reviewed commit: "docs: clarify file contents seed source ..."

Comment on lines +161 to +163
`FileContentsSeedSource` adds one extra seeded column:

- `content` - decoded text contents of the matched file

P1 source_kind value undocumented, implicitly misleading

The section says FileContentsSeedSource exposes "the same metadata as DirectorySeedSource". The DirectorySeedSource section explicitly documents source_kind as always "directory_file", so a reader will naturally infer that FileContentsSeedSource also emits source_kind = "directory_file".

In fact, the implementation sets a different value:

# packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py, line 544
_build_metadata_record(
    context=context,
    relative_path=relative_path,
    source_kind="file_contents",   # ← different from "directory_file"
)

Any user who filters or branches on source_kind (e.g. {% if source_kind == "directory_file" %}) would get silent wrong behaviour when using FileContentsSeedSource.

Please list the full column schema explicitly, matching the runtime implementation (output_columns on FileContentsSeedReader):

Suggested change
- `FileContentsSeedSource` adds one extra seeded column:
- - `content` - decoded text contents of the matched file
+ `FileContentsSeedSource` exposes these seeded columns:
+ - `source_kind` - always `"file_contents"`
+ - `source_path` - full path to the matched file
+ - `relative_path` - path relative to the configured directory
+ - `file_name` - basename of the matched file
+ - `content` - decoded text contents of the matched file
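
The silent failure described in this comment can be reproduced without any templating library. The sketch below is a plain-Python analogue of a template guard such as `{% if source_kind == "directory_file" %}`; the function `label_for` and both sample rows are hypothetical and exist only to show the effect of the differing `source_kind` values.

```python
def label_for(row: dict) -> str:
    """Plain-Python analogue of a template that guards on source_kind."""
    if row["source_kind"] == "directory_file":
        return f"file listing entry: {row['file_name']}"
    # A template with no else-branch renders nothing for other values.
    return ""


# Hypothetical rows shaped like the documented column schemas.
directory_row = {"source_kind": "directory_file", "file_name": "notes.txt"}
contents_row = {"source_kind": "file_contents", "file_name": "notes.txt", "content": "hello"}

directory_label = label_for(directory_row)
contents_label = label_for(contents_row)
```

Here `contents_label` is an empty string: no error is raised, the guarded text simply disappears, which is why the explicit column list matters.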

Author

Fixed in b7ecc5d.

@stepwise-ai-dev
Author

I have read the DCO document and I hereby sign the DCO.

## Seed Sources

- Data Designer supports three ways to provide seed data:
+ Data Designer supports five ways to provide seed data:

P1 Source count is off by one

The PR increments the count from three to five, but AgentRolloutSeedSource is also a shipped seed source — it has its own recipe at docs/recipes/trace_ingestion/agent_rollout_distillation.md and is featured on docs/recipes/cards.md. That makes six sources in total, so the sentence is factually wrong.

Either update the count to "six" and add a brief entry for AgentRolloutSeedSource, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").

Suggested change
- Data Designer supports five ways to provide seed data:
+ Data Designer supports six ways to provide seed data:

Development

Successfully merging this pull request may close these issues.

Docs: expand seed dataset docs for filesystem seed sources and fix preview example typo

1 participant