Dataset Schema Design: Reference vs Embedding
This article was generated by a language model and reviewed by a human.
Introduction
When designing a dataset schema, the choice between embedding and referencing affects usability, data size, and readability. This tutorial explains when to embed data and when to reference it in static, read-only datasets.
Understanding Embedded vs. Referenced Formats
Embedded Data
All related data is contained in a single structure.
{
"author": "Alice",
"books": [
{ "title": "Book A", "year": 2020 },
{ "title": "Book B", "year": 2021 }
]
}
Referenced Data
Data is split into separate structures that point to each other by identifier.
{
"author": "Alice",
"bookIds": [101, 102]
}
[
{ "bookId": 101, "title": "Book A", "year": 2020 },
{ "bookId": 102, "title": "Book B", "year": 2021 }
]
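To make the relationship between the two formats concrete, here is a minimal Python sketch that rebuilds the embedded form from the referenced form shown above. The function name `embed_books` is an assumption chosen for illustration, and the sketch assumes every id in `bookIds` exists in the book list.

```python
# Referenced form from the example above: one author record plus a separate list of books.
author = {"author": "Alice", "bookIds": [101, 102]}
books = [
    {"bookId": 101, "title": "Book A", "year": 2020},
    {"bookId": 102, "title": "Book B", "year": 2021},
]


def embed_books(author_record, book_list):
    """Return the embedded form of an author record by looking up each bookId."""
    # Build the index once so each lookup is a dictionary access, not a list scan.
    by_id = {book["bookId"]: book for book in book_list}
    return {
        "author": author_record["author"],
        "books": [
            {"title": by_id[book_id]["title"], "year": by_id[book_id]["year"]}
            for book_id in author_record["bookIds"]
        ],
    }


embedded = embed_books(author, books)
```

This lookup step is the extra work a reader of the referenced format takes on. The embedded format needs no such step.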
Key Considerations for Datasets
- Data Readability
  - Embed when you prioritize ease of understanding in a single file.
    - Example: A dataset for teaching purposes, where clarity is more important than modularity.
  - Reference when the dataset is large and modularity improves navigation.
- File Size
  - Embed when the dataset is small, and duplication doesn’t significantly increase file size.
  - Reference when avoiding redundancy in large datasets is critical.
- Reusability
  - Embed when data relationships are tightly coupled, and you rarely reuse substructures.
    - Example: A dataset of recipes, where ingredients are specific to each recipe.
  - Reference when shared entities are reused across records.
    - Example: A dataset of movies where actors appear in multiple films.
- Processing Complexity
  - Embed when you need simple, self-contained parsing.
  - Reference when your processing logic can handle joining references.
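The file-size consideration can be checked directly by serializing both layouts. The sketch below uses hypothetical data (one actor with a 200-character placeholder `bio`, shared by three movies). The field names `cast`, `actorIds`, and `bio` are assumptions, not part of any dataset discussed here.

```python
import json

# Hypothetical data: one actor record shared by several movies.
actor = {"actorId": 1, "name": "Sample Actor", "bio": "x" * 200}
titles = ["Movie A", "Movie B", "Movie C"]

# Embedded: every movie carries a full copy of the actor record.
embedded = [{"title": title, "cast": [actor]} for title in titles]

# Referenced: movies carry only the id; the actor record is stored once.
referenced = {
    "movies": [{"title": title, "actorIds": [actor["actorId"]]} for title in titles],
    "actors": [actor],
}

embedded_size = len(json.dumps(embedded))
referenced_size = len(json.dumps(referenced))
print(embedded_size, referenced_size)
```

The gap grows with the size of the shared record and the number of records that reuse it. For small, rarely shared substructures the difference may be too small to matter.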
Examples
Recipe Dataset
Embedded: Each recipe includes its own ingredient list.
{
"recipe": "Pasta",
"ingredients": [
{ "name": "Tomato", "quantity": "2 cups" },
{ "name": "Garlic", "quantity": "2 cloves" }
]
}
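Referenced: Ingredients are stored separately for reuse. Quantities stay with the recipe, because they are specific to it and would not be valid for other recipes that share the ingredient.
{
"recipe": "Pasta",
"ingredients": [
{ "ingredientId": 1, "quantity": "2 cups" },
{ "ingredientId": 2, "quantity": "2 cloves" }
]
}
{ "ingredientId": 1, "name": "Tomato" }
{ "ingredientId": 2, "name": "Garlic" }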
Book Dataset
Embedded: Include books inline with their authors.
{
"author": "Alice",
"books": [
{ "title": "Book A", "year": 2020 },
{ "title": "Book B", "year": 2021 }
]
}
Referenced: Authors and books are stored separately.
{
"author": "Alice",
"bookIds": [101, 102]
}
{ "bookId": 101, "title": "Book A", "year": 2020 }
{ "bookId": 102, "title": "Book B", "year": 2021 }
Geographical Dataset
Embedded: City data is embedded in the country record.
{
"country": "USA",
"cities": [
{ "name": "New York", "population": 8000000 },
{ "name": "Los Angeles", "population": 4000000 }
]
}
Referenced: Separate cities for reuse across datasets.
{
"country": "USA",
"cityIds": [1, 2]
}
{ "cityId": 1, "name": "New York", "population": 8000000 }
{ "cityId": 2, "name": "Los Angeles", "population": 4000000 }
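Moving from the embedded layout to the referenced one is a mechanical transformation. The following Python sketch shows one way to do it for the country and city example. The function name `split_cities` and the sequential id scheme starting at 1 are assumptions of this sketch, and it does not deduplicate cities that appear under more than one country.

```python
def split_cities(countries):
    """Convert embedded country records into referenced form.

    Returns (country_records, city_records). City ids are assigned
    sequentially starting at 1, which is an assumption of this sketch.
    """
    country_records = []
    city_records = []
    for country in countries:
        city_ids = []
        for city in country["cities"]:
            city_id = len(city_records) + 1
            # Copy the city fields and attach the newly assigned id.
            city_records.append({"cityId": city_id, **city})
            city_ids.append(city_id)
        country_records.append({"country": country["country"], "cityIds": city_ids})
    return country_records, city_records


embedded = [
    {
        "country": "USA",
        "cities": [
            {"name": "New York", "population": 8000000},
            {"name": "Los Angeles", "population": 4000000},
        ],
    }
]
countries, cities = split_cities(embedded)
```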
Decision Flowchart
- Is the dataset small and self-contained?
  - Yes → Embed
  - No → Reference
- Are related entities reused across records?
  - Yes → Reference
  - No → Embed
- Do you prioritize readability or modularity?
  - Readability → Embed
  - Modularity → Reference
- Is redundancy acceptable?
  - Yes → Embed
  - No → Reference
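The four questions can point in different directions, so it may help to tally them. The Python sketch below is one possible way to combine the answers: a simple majority count, with an even split reported as "mixed". The function name `recommend`, its parameter names, and the tie-breaking rule are all assumptions. The questions themselves do not prescribe how to weigh the answers against each other.

```python
def recommend(small_and_self_contained, entities_reused,
              prefers_readability, redundancy_acceptable):
    """Tally the four checklist questions and return (choice, votes)."""
    votes = {
        "size": "embed" if small_and_self_contained else "reference",
        "reuse": "reference" if entities_reused else "embed",
        "priority": "embed" if prefers_readability else "reference",
        "redundancy": "embed" if redundancy_acceptable else "reference",
    }
    embed_votes = sum(1 for vote in votes.values() if vote == "embed")
    if embed_votes > 2:
        choice = "embed"
    elif embed_votes < 2:
        choice = "reference"
    else:
        # Assumption: a 2-2 split suggests a schema that mixes both styles.
        choice = "mixed"
    return choice, votes


choice, votes = recommend(
    small_and_self_contained=True,
    entities_reused=True,
    prefers_readability=True,
    redundancy_acceptable=False,
)
```

In this usage example the answers split evenly, so `choice` is "mixed", and the `votes` dictionary shows which questions pulled in which direction.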
Best Practices
- Favor Embedding for Teaching or Readability-Focused Datasets
  - Example: Tutorials or educational datasets.
- Favor Referencing for Modular and Large Datasets
  - Example: Interconnected entities like books and authors.
- Mix and Match When Necessary
  - Embed when redundancy is manageable; reference for shared entities.
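When a dataset mixes both styles, the referenced part is only usable if every id resolves to a stored entity. The Python sketch below checks this on a hypothetical hybrid dataset in which release details are embedded and actors are referenced. The helper name `missing_references` and all field names are assumptions made for illustration.

```python
def missing_references(records, id_field, entities, key_field):
    """Return the sorted ids that records refer to but entities do not define."""
    known = {entity[key_field] for entity in entities}
    referenced = {ref for record in records for ref in record[id_field]}
    return sorted(referenced - known)


# Hypothetical hybrid dataset: release details are embedded, actors are referenced.
movies = [
    {"title": "Movie A", "release": {"year": 2020}, "actorIds": [1, 2]},
    {"title": "Movie B", "release": {"year": 2021}, "actorIds": [2, 3]},
]
actors = [
    {"actorId": 1, "name": "Actor One"},
    {"actorId": 2, "name": "Actor Two"},
]

dangling = missing_references(movies, "actorIds", actors, "actorId")
```

Here `dangling` contains the id 3, which "Movie B" refers to but the actor list never defines. Running a check like this before publishing a static dataset is cheap, since the data will not change afterwards.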
Conclusion
For datasets, embedding is ideal for simplicity and clarity, while referencing is better for modularity and reusability. By evaluating your dataset’s purpose, size, and relationships, you can make the best choice for your needs.