What is Dataset Factory (datafact)

datafact is a dataset bundling framework that helps you build, package, and distribute datasets.

Installation

To install datafact:

pip install -U datafact
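
To confirm the install worked, one option is to ask Python for the installed version. The sketch below uses only the standard library. The helper name installed_version is made up for illustration, and it assumes the package is registered under the same name used in the pip command above (datafact).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name matches the name passed to pip.
    found = installed_version("datafact")
    if found is None:
        print("datafact is not installed; run: pip install -U datafact")
    else:
        print(f"datafact {found} is installed")
```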

Guide

Choose a template

To see the list of available templates:

datafact templates

datafact has the following templates:

  1. hello-world: suitable for basic dataset projects.
  2. synthetic: suitable for creating datasets with synthetic data (via mkb).
  3. media: suitable for dataset projects with media files (image, audio, video data).

For more complicated examples, see the tutorials listed under More examples.

Example: Build and publish the hello/world dataset

Create a new project

datafact new hello/world

Enter the project folder

cd hello/world

Build the dataset

python project.py build

Preview the build (Optional)

python project.py preview show

Publish the dataset

# publish to local dataset.sh repo
python project.py publish

View it in the dataset.sh web UI

dataset.sh gui hello/world

Upload it to remote (Optional)

dataset.sh remote -p default upload -s hello/world -t latest hello/world
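
The required steps above (create, build, publish) could also be chained from a small Python script. This is a sketch and not part of datafact: workflow_steps and run_workflow are hypothetical helper names, the commands are the ones shown above, and it assumes that datafact new hello/world creates the hello/world folder that the cd step enters. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys
from pathlib import Path


def workflow_steps(name):
    """Return (argv, working_directory) pairs for the documented steps."""
    # Assumption: `datafact new <name>` creates a folder at the path <name>,
    # which is what the `cd hello/world` step in the guide suggests.
    project_dir = Path(name)
    return [
        (["datafact", "new", name], None),
        # sys.executable stands in for `python` so the same interpreter is reused.
        ([sys.executable, "project.py", "build"], project_dir),
        ([sys.executable, "project.py", "publish"], project_dir),
    ]


def run_workflow(name, dry_run=True):
    """Print each step and, unless dry_run is set, execute it."""
    for argv, cwd in workflow_steps(name):
        location = f"(in {cwd})" if cwd else ""
        print("$", " ".join(argv), location)
        if not dry_run:
            # check=True stops the workflow at the first failing command.
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    run_workflow("hello/world", dry_run=True)
```

Passing dry_run=False runs the commands for real. The optional preview, web UI, and upload steps are left out so the script stays non-interactive.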

More examples

✅ hello/world
a dataset.sh tutorial

Tutorial: hello-world

In this guide, you will learn to create a simple dataset.

✅ media datasets
a dataset.sh tutorial

Tutorial: media datasets

In this guide, you will learn how to bundle datasets with media files.

✅ synthetic data datasets
a dataset.sh tutorial

Tutorial: synthetic data datasets

In this guide, you will learn how to create and bundle datasets with synthetic data.
