Dataset Schema Design: Reference vs Embedding

January 12, 2025 · 4 min read

This article was generated by a language model and reviewed by a human.

Introduction

When designing a dataset schema, the choice between embedding and referencing affects usability, data size, and readability. This tutorial explains when to embed data and when to reference it in static, read-only datasets.

Understanding Embedded vs. Referenced Formats

Embedded Data

All related data is contained in a single structure.

{
  "author": "Alice",
  "books": [
    { "title": "Book A", "year": 2020 },
    { "title": "Book B", "year": 2021 }
  ]
}

Referenced Data

Related data is split into separate structures that are linked by identifiers.

{
  "author": "Alice",
  "bookIds": [101, 102]
}

[
  { "bookId": 101, "title": "Book A", "year": 2020 },
  { "bookId": 102, "title": "Book B", "year": 2021 }
]
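Reading referenced data means resolving each ID back to its record. The following is a minimal Python sketch of that lookup, assuming both structures above have already been loaded into memory. The names `author`, `books`, and `resolve_books` are illustrative and not part of any library.

```python
import json

# The two structures from the referenced example, as JSON text.
author = json.loads('{"author": "Alice", "bookIds": [101, 102]}')
books = json.loads("""
[
  {"bookId": 101, "title": "Book A", "year": 2020},
  {"bookId": 102, "title": "Book B", "year": 2021}
]
""")


def resolve_books(author, books):
    """Return the author's book records by looking up each ID (hypothetical helper)."""
    # Index the books once so each lookup is a dictionary access.
    books_by_id = {book["bookId"]: book for book in books}
    # A KeyError here means the author points at a book that does not exist.
    return [books_by_id[book_id] for book_id in author["bookIds"]]


titles = [book["title"] for book in resolve_books(author, books)]
print(titles)
```

This extra lookup step is the cost of referencing; the embedded form needs no such step.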

Key Considerations for Datasets

Data Readability
- Embed when you prioritize ease of understanding in a single file.
  - Example: A dataset for teaching purposes, where clarity is more important than modularity.
- Reference when the dataset is large and splitting it into separate parts makes it easier to navigate.
File Size
- Embed when the dataset is small and duplication doesn’t significantly increase file size.
- Reference when the dataset is large and repeating the same records would add significant redundancy.
Reusability
- Embed when data relationships are tightly coupled and you rarely reuse substructures.
  - Example: A dataset of recipes, where ingredients are specific to each recipe.
- Reference when shared entities are reused across records.
  - Example: A dataset of movies where actors appear in multiple films.
Processing Complexity
- Embed when you need simple, self-contained parsing.
- Reference when your processing logic can join records by following references.
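The file-size trade-off can be checked directly by serializing both layouts. The sketch below uses made-up data (one actor record shared by three movies; every name and value is a placeholder) and compares the length of the resulting JSON text. It is an illustration under those assumptions, not a benchmark.

```python
import json

# Hypothetical shared entity: one actor who appears in three movies.
actor = {
    "actorId": 1,
    "name": "Jane Doe",
    "born": 1980,
    "bio": "Placeholder biography. " * 5,
}

# Embedded layout: the full actor record is repeated inside every movie.
actor_without_id = {key: value for key, value in actor.items() if key != "actorId"}
embedded = [
    {"title": f"Movie {n}", "actors": [actor_without_id]}
    for n in range(1, 4)
]

# Referenced layout: each movie stores only the ID; the record is stored once.
referenced = {
    "movies": [{"title": f"Movie {n}", "actorIds": [1]} for n in range(1, 4)],
    "actors": [actor],
}

embedded_size = len(json.dumps(embedded))
referenced_size = len(json.dumps(referenced))
print(f"embedded: {embedded_size} chars, referenced: {referenced_size} chars")
```

With a single movie the embedded layout would be the smaller one, which is why duplication only becomes a concern once entities are actually shared.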

Examples

Recipe Dataset

Embedded: Each recipe includes its own ingredient list.

{
  "recipe": "Pasta",
  "ingredients": [
    { "name": "Tomato", "quantity": "2 cups" },
    { "name": "Garlic", "quantity": "2 cloves" }
  ]
}

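Referenced: Ingredients are stored separately for reuse. The quantity stays with the recipe, because it is specific to each recipe rather than to the ingredient.

{
  "recipe": "Pasta",
  "ingredients": [
    { "ingredientId": 1, "quantity": "2 cups" },
    { "ingredientId": 2, "quantity": "2 cloves" }
  ]
}

{ "ingredientId": 1, "name": "Tomato" }
{ "ingredientId": 2, "name": "Garlic" }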

Book Dataset

Embedded: Include books inline with their authors.

{
  "author": "Alice",
  "books": [
    { "title": "Book A", "year": 2020 },
    { "title": "Book B", "year": 2021 }
  ]
}

Referenced: Authors and books are stored separately.

{
  "author": "Alice",
  "bookIds": [101, 102]
}

{ "bookId": 101, "title": "Book A", "year": 2020 }
{ "bookId": 102, "title": "Book B", "year": 2021 }
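A referenced dataset can also be flattened back into the embedded form when a single readable file is wanted. The sketch below assumes the book records are stored one JSON object per line, as shown above; `embed_books` is a hypothetical helper name.

```python
import json

author_json = '{"author": "Alice", "bookIds": [101, 102]}'
books_jsonl = """\
{"bookId": 101, "title": "Book A", "year": 2020}
{"bookId": 102, "title": "Book B", "year": 2021}
"""


def embed_books(author_json, books_jsonl):
    """Rebuild the embedded author record from the referenced pieces."""
    author = json.loads(author_json)
    books_by_id = {}
    for line in books_jsonl.splitlines():
        if not line.strip():
            continue  # skip blank lines
        book = json.loads(line)
        # Drop the ID: the embedded form does not need it.
        books_by_id[book.pop("bookId")] = book
    return {
        "author": author["author"],
        "books": [books_by_id[book_id] for book_id in author["bookIds"]],
    }


embedded = embed_books(author_json, books_jsonl)
print(json.dumps(embedded, indent=2))
```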

Geographical Dataset

Embedded: City data is stored inside the country record.

{
  "country": "USA",
  "cities": [
    { "name": "New York", "population": 8000000 },
    { "name": "Los Angeles", "population": 4000000 }
  ]
}

Referenced: Separate cities for reuse across datasets.

{
  "country": "USA",
  "cityIds": [1, 2]
}

{ "cityId": 1, "name": "New York", "population": 8000000 }
{ "cityId": 2, "name": "Los Angeles", "population": 4000000 }
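Converting from the embedded layout to the referenced one is mostly a matter of assigning IDs and moving the nested records out. The following Python sketch does this for the country example. The function name `split_cities` and the sequential ID scheme are assumptions made for illustration.

```python
def split_cities(country, start_id=1):
    """Split an embedded country record into a referenced record plus city records."""
    cities = []
    city_ids = []
    for offset, city in enumerate(country["cities"]):
        city_id = start_id + offset  # assumption: IDs are simply sequential
        cities.append({"cityId": city_id, **city})
        city_ids.append(city_id)
    referenced = {"country": country["country"], "cityIds": city_ids}
    return referenced, cities


usa = {
    "country": "USA",
    "cities": [
        {"name": "New York", "population": 8000000},
        {"name": "Los Angeles", "population": 4000000},
    ],
}

referenced, cities = split_cities(usa)
print(referenced)
print(cities)
```

When several countries are split into one shared city list, `start_id` would need to continue from the last ID used so that IDs stay unique.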

Decision Flowchart

Is the dataset small and self-contained?

Yes → Embed
No → Reference

Are related entities reused across records?

Yes → Reference
No → Embed

Do you prioritize readability or modularity?

Readability → Embed
Modularity → Reference

Is redundancy acceptable?

Yes → Embed
No → Reference
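The four questions can point in different directions, and nothing above says how to weigh them against each other. One way to make the checklist concrete is a simple tally, sketched here in Python. The function name, the parameters, the equal weighting, and the tie rule are all assumptions for illustration.

```python
def recommend(small_and_self_contained, entities_reused,
              prefers_readability, redundancy_acceptable):
    """Tally the four answers (hypothetical helper; equal weights are an assumption)."""
    votes_for_embed = sum([
        small_and_self_contained,  # Yes -> Embed
        not entities_reused,       # No -> Embed
        prefers_readability,       # Readability -> Embed
        redundancy_acceptable,     # Yes -> Embed
    ])
    if votes_for_embed > 2:
        return "embed"
    if votes_for_embed < 2:
        return "reference"
    # A 2-2 split suggests embedding some parts and referencing others.
    return "mixed"


print(recommend(True, False, True, True))
print(recommend(False, True, False, False))
print(recommend(True, True, True, False))
```

In practice one answer may outweigh the rest; heavy reuse of shared entities, for example, may justify referencing even in a small dataset.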

Best Practices

Favor Embedding for Teaching or Readability-Focused Datasets

Example: Tutorials or educational datasets.

Favor Referencing for Modular and Large Datasets

Example: Interconnected entities like books and authors.

Mix and Match When Necessary

Embed data that belongs to a single record, where redundancy is manageable; reference entities that are shared across records.
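A mixed layout might look like the following Python sketch, which uses a made-up movie dataset. Release details belong to one movie and are embedded, while actors appear in several movies and are referenced by ID. All field names and values here are hypothetical.

```python
import json

# Hypothetical hybrid dataset: per-movie details are embedded, shared actors are referenced.
dataset = {
    "actors": {
        "1": {"name": "Actor One"},
        "2": {"name": "Actor Two"},
    },
    "movies": [
        {
            "title": "Movie A",
            "releases": [{"region": "US", "year": 2020}],  # embedded
            "actorIds": ["1", "2"],                        # referenced
        },
        {
            "title": "Movie B",
            "releases": [{"region": "US", "year": 2021}],
            "actorIds": ["1"],
        },
    ],
}


def cast_of(movie, actors):
    """Resolve a movie's actor IDs to names."""
    return [actors[actor_id]["name"] for actor_id in movie["actorIds"]]


cast = {movie["title"]: cast_of(movie, dataset["actors"]) for movie in dataset["movies"]}
print(json.dumps(cast, indent=2))
```

Only the shared entity needs a lookup; everything specific to a movie can be read straight from its record.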

Conclusion

For datasets, embedding is ideal for simplicity and clarity, while referencing is better for modularity and reusability. By evaluating your dataset’s purpose, size, and relationships, you can make the best choice for your needs.
