docs: expand seed dataset docs for filesystem sources #452
stepwise-ai-dev wants to merge 3 commits into NVIDIA-NeMo:main from
All contributors have signed the DCO ✍️ ✅
Greptile Summary
This PR expands
Key changes and findings:
| Filename | Overview |
|---|---|
| docs/concepts/seed-datasets.md | Adds DirectorySeedSource and FileContentsSeedSource sections with accurate column schemas and correct default values, and fixes the preview call; the seed-source count ("five") is off by one because AgentRolloutSeedSource is also a shipped source documented elsewhere. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User provides SeedSource] --> B{seed_type?}
    B --> C[local → LocalFileSeedReader]
    B --> D[hf → HuggingFaceSeedReader]
    B --> E[dataframe → DataFrameSeedReader]
    B --> F[directory → DirectorySeedReader]
    B --> G[file_contents → FileContentsSeedReader]
    B --> H[agent_rollout → AgentRolloutSeedReader]
    F --> I[build_manifest\nsource_kind=directory_file\nsource_path, relative_path, file_name]
    G --> J[build_manifest\nsource_kind=file_contents\nsource_path, relative_path, file_name]
    J --> K[hydrate_row\nadds: content]
    I --> L[Seed columns injected\ninto Jinja2 templates]
    K --> L
```
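The two-phase flow in the chart (list files into a manifest, then hydrate each row with its text) can be sketched in plain Python. This is only an illustration under stated assumptions: `build_manifest` and `hydrate_row` below are hypothetical stand-ins written with `pathlib`, not the engine's actual functions, and their signatures are invented for this example. Only the column names and the `file_pattern`, `recursive`, and `encoding` options come from the PR.

```python
from pathlib import Path


def build_manifest(
    root: Path, *, file_pattern: str, recursive: bool, source_kind: str
) -> list[dict[str, str]]:
    """Hypothetical stand-in: list matching files as metadata rows without reading them."""
    matches = root.rglob(file_pattern) if recursive else root.glob(file_pattern)
    rows = []
    for path in sorted(p for p in matches if p.is_file()):
        rows.append(
            {
                "source_kind": source_kind,
                "source_path": str(path),
                "relative_path": path.relative_to(root).as_posix(),
                "file_name": path.name,
            }
        )
    return rows


def hydrate_row(row: dict[str, str], *, encoding: str) -> dict[str, str]:
    """Hypothetical stand-in: return a copy of the row with the decoded text as `content`."""
    text = Path(row["source_path"]).read_text(encoding=encoding)
    return {**row, "content": text}
```

In this sketch a directory-style source would stop after `build_manifest`, while a file-contents-style source would also call `hydrate_row` on each row, which is why only the latter carries a `content` column.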
Reviews (2): Last reviewed commit: "docs: clarify file contents seed source ..."
docs/concepts/seed-datasets.md
Outdated
`FileContentsSeedSource` adds one extra seeded column:

- `content` - decoded text contents of the matched file
source_kind value undocumented, implicitly misleading
The section says FileContentsSeedSource exposes "the same metadata as DirectorySeedSource". The DirectorySeedSource section explicitly documents source_kind as always "directory_file", so a reader will naturally infer that FileContentsSeedSource also emits source_kind = "directory_file".
In fact, the implementation sets a different value:
```python
# packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py, line 544
_build_metadata_record(
    context=context,
    relative_path=relative_path,
    source_kind="file_contents",  # ← different from "directory_file"
)
```
Any user who filters or branches on `source_kind` (e.g. `{% if source_kind == "directory_file" %}`) would silently get the wrong behaviour when using `FileContentsSeedSource`.
Please list the full column schema explicitly, matching the runtime implementation (output_columns on FileContentsSeedReader):
Suggested change, replacing:

`FileContentsSeedSource` adds one extra seeded column:
- `content` - decoded text contents of the matched file

with:

`FileContentsSeedSource` exposes these seeded columns:
- `source_kind` - always `"file_contents"`
- `source_path` - full path to the matched file
- `relative_path` - path relative to the configured directory
- `file_name` - basename of the matched file
- `content` - decoded text contents of the matched file
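To make the failure mode concrete, here is a minimal sketch in plain Python of what branching on `source_kind` looks like for rows from each source. The column names and the two `source_kind` values come from the review comment above; the sample paths and the `select_kind` helper are hypothetical and exist only for illustration.

```python
# Column names as listed in the review comment above.
DIRECTORY_COLUMNS = ("source_kind", "source_path", "relative_path", "file_name")
FILE_CONTENTS_COLUMNS = DIRECTORY_COLUMNS + ("content",)

# Illustrative rows; the paths and text are made up for this sketch.
directory_row = {
    "source_kind": "directory_file",
    "source_path": "/data/docs/guide.md",
    "relative_path": "guide.md",
    "file_name": "guide.md",
}
file_contents_row = {
    "source_kind": "file_contents",
    "source_path": "/data/docs/guide.md",
    "relative_path": "guide.md",
    "file_name": "guide.md",
    "content": "# Guide",
}


def select_kind(rows: list[dict[str, str]], kind: str) -> list[dict[str, str]]:
    """Hypothetical helper: keep only the rows whose source_kind equals `kind`."""
    return [row for row in rows if row["source_kind"] == kind]


rows = [directory_row, file_contents_row]
# A filter written against "directory_file" skips the file-contents row
# without raising any error, which is the silent mismatch the comment describes.
only_directory = select_kind(rows, "directory_file")
```

A template condition such as `{% if source_kind == "directory_file" %}` would behave like `select_kind` here, which is why documenting the exact value per source matters.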
I have read the DCO document and I hereby sign the DCO.
## Seed Sources

- Data Designer supports three ways to provide seed data:
+ Data Designer supports five ways to provide seed data:
Source count is off by one
The PR increments the count from three to five, but `AgentRolloutSeedSource` is also a shipped seed source: it has its own recipe at `docs/recipes/trace_ingestion/agent_rollout_distillation.md` and is featured on `docs/recipes/cards.md`. That makes six sources in total, so the sentence is factually wrong.
Either update the count to "six" and add a brief entry for `AgentRolloutSeedSource`, or rephrase to avoid embedding a hard count (e.g., "Data Designer supports multiple ways to provide seed data:").

Suggested change, replacing:

Data Designer supports five ways to provide seed data:

with:

Data Designer supports six ways to provide seed data:
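One way to keep such a sentence from drifting is to derive the count from a single list of sources. The sketch below is an assumption-laden illustration, not how the docs are actually built: the `seed_type` to reader pairs are copied from the flowchart in the Greptile summary, while the `SEED_TYPE_TO_READER` name and the `intro_sentence` helper are hypothetical.

```python
# seed_type -> reader name, as shown in the review flowchart.
SEED_TYPE_TO_READER = {
    "local": "LocalFileSeedReader",
    "hf": "HuggingFaceSeedReader",
    "dataframe": "DataFrameSeedReader",
    "directory": "DirectorySeedReader",
    "file_contents": "FileContentsSeedReader",
    "agent_rollout": "AgentRolloutSeedReader",
}

# Small lookup so the generated sentence spells the number out.
NUMBER_WORDS = {3: "three", 4: "four", 5: "five", 6: "six", 7: "seven"}


def intro_sentence(readers: dict[str, str]) -> str:
    """Hypothetical helper: build the intro sentence from the number of registered sources."""
    count = len(readers)
    word = NUMBER_WORDS.get(count, str(count))
    return f"Data Designer supports {word} ways to provide seed data:"
```

With the six entries above, `intro_sentence(SEED_TYPE_TO_READER)` produces the "six ways" wording from the suggestion.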
Summary
- `DirectorySeedSource` and `FileContentsSeedSource`, including `file_pattern`, `recursive`, `encoding`, and the seeded columns they expose
- `data_designer.preview(...)`

Closes #448.
Validation
- `packages/data-designer-config/src/data_designer/config/seed_source.py`
- `packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py`
- `make test` locally because this machine has `uv 0.5.11`, which cannot parse the repo's newer `tool.uv.required-version` / `uv.lock` format