Dataset Schema Design: Reference vs Embedding

This article was generated by a language model and reviewed by a human.

Introduction

When designing a dataset schema, the choice between embedding and referencing affects usability, data size, and readability. This tutorial explains when to embed data and when to reference it in static, read-only datasets.


Understanding Embedded vs. Referenced Formats

Embedded Data

All related data is contained in a single structure.

{
  "author": "Alice",
  "books": [
    { "title": "Book A", "year": 2020 },
    { "title": "Book B", "year": 2021 }
  ]
}

Referenced Data

Data is split into separate structures, and one structure points to records in another by ID.

{
  "author": "Alice",
  "bookIds": [101, 102]
}

[
  { "bookId": 101, "title": "Book A", "year": 2020 },
  { "bookId": 102, "title": "Book B", "year": 2021 }
]
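
Reading the referenced form requires a join step that the embedded form does not. A minimal Python sketch of that step, using the records above (the function name `embed_books` is an illustrative assumption, not part of any library):

```python
# Referenced layout from the example above: one author record, one book table.
author = {"author": "Alice", "bookIds": [101, 102]}
books = [
    {"bookId": 101, "title": "Book A", "year": 2020},
    {"bookId": 102, "title": "Book B", "year": 2021},
]


def embed_books(author_record, book_records):
    """Return an embedded copy of author_record with its bookIds resolved."""
    # Index the books by ID once so each lookup is a dictionary access.
    by_id = {book["bookId"]: book for book in book_records}
    return {
        "author": author_record["author"],
        "books": [
            # Drop the bookId key so the result matches the embedded layout.
            {key: value for key, value in by_id[book_id].items() if key != "bookId"}
            for book_id in author_record["bookIds"]
        ],
    }


embedded = embed_books(author, books)
```

Here `embedded` has the same shape as the embedded example in the previous subsection. A book ID with no matching record raises `KeyError`, which is one reasonable way to surface a dangling reference.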

Key Considerations for Datasets

  1. Data Readability

    • Embed when you prioritize ease of understanding in a single file.
    • Example: A dataset for teaching purposes, where clarity is more important than modularity.
    • Reference when the dataset is large and modularity improves navigation.
  2. File Size

    • Embed when the dataset is small, and duplication doesn’t significantly increase file size.
    • Reference when avoiding redundancy in large datasets is critical.
  3. Reusability

    • Embed when data relationships are tightly coupled, and you rarely reuse substructures.
    • Example: A dataset of recipes, where ingredients are specific to each recipe.
    • Reference when shared entities are reused across records.
    • Example: A dataset of movies where actors appear in multiple films.
  4. Processing Complexity

    • Embed when you need simple, self-contained parsing with no lookups.
    • Reference when consumers of the dataset can afford a join step that resolves IDs to records.
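
The file-size trade-off can be checked directly by serializing both layouts. The sketch below is an illustration under assumptions: the actor profile, the 200-character `bio` field, and the movie titles are made-up data chosen only to show the effect of duplication.

```python
import json

# Hypothetical shared entity: one actor profile reused by several movies.
actor = {"actorId": 1, "name": "Actor A", "bio": "x" * 200}
titles = ["Movie 1", "Movie 2", "Movie 3"]

# Embedded: every movie carries a full copy of the actor profile.
embedded = [{"title": title, "cast": [actor]} for title in titles]

# Referenced: movies carry only the ID; the profile is stored once.
referenced = {
    "movies": [{"title": title, "castIds": [actor["actorId"]]} for title in titles],
    "actors": [actor],
}

embedded_size = len(json.dumps(embedded))
referenced_size = len(json.dumps(referenced))
print(embedded_size, referenced_size)
```

The gap grows with the size of the shared record and the number of records that reuse it. When the shared record is tiny or rarely reused, the ID fields and the extra table can cancel out the savings.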

Examples

Recipe Dataset

Embedded: Each recipe includes its own ingredient list.

{
  "recipe": "Pasta",
  "ingredients": [
    { "name": "Tomato", "quantity": "2 cups" },
    { "name": "Garlic", "quantity": "2 cloves" }
  ]
}

Referenced: Ingredients are stored separately for reuse.

{
  "recipe": "Pasta",
  "ingredientIds": [1, 2]
}

[
  { "ingredientId": 1, "name": "Tomato", "quantity": "2 cups" },
  { "ingredientId": 2, "name": "Garlic", "quantity": "2 cloves" }
]

Book Dataset

Embedded: Include books inline with their authors.

{
  "author": "Alice",
  "books": [
    { "title": "Book A", "year": 2020 },
    { "title": "Book B", "year": 2021 }
  ]
}

Referenced: Authors and books are stored separately.

{
  "author": "Alice",
  "bookIds": [101, 102]
}

[
  { "bookId": 101, "title": "Book A", "year": 2020 },
  { "bookId": 102, "title": "Book B", "year": 2021 }
]

Geographical Dataset

Embedded: Embed city data in the country.

{
  "country": "USA",
  "cities": [
    { "name": "New York", "population": 8000000 },
    { "name": "Los Angeles", "population": 4000000 }
  ]
}

Referenced: Separate cities for reuse across datasets.

{
  "country": "USA",
  "cityIds": [1, 2]
}

[
  { "cityId": 1, "name": "New York", "population": 8000000 },
  { "cityId": 2, "name": "Los Angeles", "population": 4000000 }
]
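
Each pair of examples holds the same information, so one layout can be derived from the other mechanically. The following Python sketch converts the embedded geographical records into the referenced layout. The function name `split_cities` and the sequential ID scheme starting at 1 are assumptions made for illustration.

```python
def split_cities(countries):
    """Convert embedded country records into a referenced layout.

    Returns (country_records, city_records). City IDs are assigned
    sequentially starting at 1; that numbering scheme is an assumption.
    """
    country_records, city_records = [], []
    for country in countries:
        city_ids = []
        for city in country["cities"]:
            city_id = len(city_records) + 1
            # Copy the city fields and attach the newly assigned ID.
            city_records.append({"cityId": city_id, **city})
            city_ids.append(city_id)
        country_records.append({"country": country["country"], "cityIds": city_ids})
    return country_records, city_records


embedded = [
    {
        "country": "USA",
        "cities": [
            {"name": "New York", "population": 8000000},
            {"name": "Los Angeles", "population": 4000000},
        ],
    }
]
countries, cities = split_cities(embedded)
```

This sketch gives every embedded city its own ID. A dataset in which the same entity appears under several parents would also need a deduplication key, such as the name, before IDs are assigned.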

Decision Flowchart

  1. Is the dataset small and self-contained?
    • Yes → Embed
    • No → Reference
  2. Are related entities reused across records?
    • Yes → Reference
    • No → Embed
  3. Do you prioritize readability or modularity?
    • Readability → Embed
    • Modularity → Reference
  4. Is redundancy acceptable?
    • Yes → Embed
    • No → Reference
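
The questions above can point in different directions for the same dataset, and the article does not say how to combine them. One possible way, sketched here as an assumption and not as a rule, is to count each answer as one vote and treat a tie as a hint to mix both layouts. The function name `recommend_layout` and its parameters are hypothetical.

```python
def recommend_layout(small_and_self_contained, entities_reused,
                     prefers_readability, redundancy_acceptable):
    """Tally the four answers and return 'embed', 'reference', or 'mixed'."""
    embed_votes = sum([
        small_and_self_contained,  # first question: Yes -> Embed
        not entities_reused,       # second question: No -> Embed
        prefers_readability,       # third question: Readability -> Embed
        redundancy_acceptable,     # fourth question: Yes -> Embed
    ])
    if embed_votes > 2:
        return "embed"
    if embed_votes < 2:
        return "reference"
    # Two votes each way: assumed here to suggest a hybrid layout.
    return "mixed"


# A small teaching dataset with no shared entities.
print(recommend_layout(True, False, True, True))
```

Equal weighting is a simplification. In practice one answer, for example heavy reuse of large records, may outweigh the other three.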

Best Practices

  1. Favor Embedding for Teaching or Readability-Focused Datasets
    • Example: Tutorials or educational datasets.
  2. Favor Referencing for Modular and Large Datasets
    • Example: Interconnected entities like books and authors.
  3. Mix and Match When Necessary
    • Embed when redundancy is manageable; reference for shared entities.
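
A mixed layout might look like the following sketch, which reuses the movie and actor scenario from the reusability discussion. Actors are shared, so they are referenced by ID; the release year and character names belong to one movie only, so they stay embedded. All names and values here are made-up illustration data, and `cast_names` is a hypothetical helper.

```python
# Shared entities (actors) live in one table keyed by ID.
actors = {
    1: {"name": "Actor A"},
    2: {"name": "Actor B"},
}

# Movie-specific data (year, character names) stays embedded in each movie.
movies = [
    {"title": "Movie 1", "year": 2020,
     "cast": [{"actorId": 1, "character": "Hero"},
              {"actorId": 2, "character": "Villain"}]},
    {"title": "Movie 2", "year": 2021,
     "cast": [{"actorId": 1, "character": "Detective"}]},
]


def cast_names(movie, actor_table):
    """Return (actor name, character) pairs for one movie."""
    return [(actor_table[role["actorId"]]["name"], role["character"])
            for role in movie["cast"]]


print(cast_names(movies[0], actors))
```

Only the shared part of the data needs a lookup. Everything specific to a movie can still be read straight from its record.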

Conclusion

For datasets, embedding is ideal for simplicity and clarity, while referencing is better for modularity and reusability. By evaluating your dataset’s purpose, size, and relationships, you can make the best choice for your needs.