Template: synthetic

synthetic is a datafact template for creating datasets from synthetic data generated by an LLM. This template makes use of another library, mkb.

Before you proceed with this template, you need to set up mkb according to this guide.

Create a new project using this template

datafact new datafact-tutorial/synthetic -t synthetic

Now go to the project folder datafact-tutorial/synthetic.

cd datafact-tutorial/synthetic
ls
README.md         data.py           datafact.json     fn.py             project.py        type.py

What to modify

You should modify the following files:

  • data.py
  • type.py
  • fn.py
  • README.md

In data.py, implement a function called create_data_dict that produces the content of your dataset.


import sys
from dataset_sh.constants import DEFAULT_COLLECTION_NAME
from fn import generate_greeting_message


def create_main_collection():
    languages = [
        "English",
        "Spanish",
        "Mandarin Chinese",
        "Hindi",
        "Arabic",
        "Portuguese",
        "Bengali",
        "Russian",
        "Japanese",
        "Punjabi",
        "German",
        "Javanese",
        "Korean",
        "French",
        "Telugu",
        "Vietnamese",
        "Turkish",
        "Italian",
        "Tamil",
        "Urdu"
    ]

    batch = generate_greeting_message.batch('generate_greeting_message_batch_0')

    if batch.can_start_batch():
        # First run: queue one request per language, submit the batch, then stop.
        for lang in languages:
            batch.add(lang)

        batch.start_batch()
        print('Batch started, please wait until it is processed.')
        sys.exit(1)
    else:
        # Later runs: check the remote status and collect the outputs once finished.
        print('Batch already started.')
        status = batch.sync_remote()
        print(status)
        if status['status'] == 'finished':
            items = []
            for input_args, output in batch.iter_outputs():
                items.append({
                    'language': input_args['args'][0],
                    'value': output.value
                })
            return items
        else:
            print('Batch is not finished, exiting the dataset build now.')
            sys.exit(1)


def create_data_dict():
    return {
        DEFAULT_COLLECTION_NAME: create_main_collection()
    }
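
The code above follows a two-run pattern: the first run submits the batch and exits, and a later run collects the results. The sketch below illustrates that control flow without mkb installed. FakeBatch is a hypothetical stand-in whose method names mirror the calls used above; its behaviour (finishing immediately, returning plain strings instead of objects with a .value attribute) is an assumption made for illustration, not a description of how mkb works.

```python
class FakeBatch:
    """Hypothetical in-memory stand-in for the batch object used in data.py."""

    def __init__(self, fn):
        self._fn = fn          # function applied to each queued input
        self._inputs = []      # arguments queued with add()
        self._outputs = []     # results, filled in by sync_remote()
        self._started = False

    def can_start_batch(self):
        return not self._started

    def add(self, *args):
        self._inputs.append({'args': list(args)})

    def start_batch(self):
        self._started = True

    def sync_remote(self):
        # Assumption: a real service finishes asynchronously; this fake
        # finishes as soon as the status is requested.
        if self._started and not self._outputs:
            self._outputs = [self._fn(*i['args']) for i in self._inputs]
        return {'status': 'finished' if self._started else 'not_started'}

    def iter_outputs(self):
        return zip(self._inputs, self._outputs)


def build_items(batch, languages):
    """One build attempt: return the items, or None if the batch is pending."""
    if batch.can_start_batch():
        for lang in languages:
            batch.add(lang)
        batch.start_batch()
        return None  # data.py calls sys.exit(1) at this point instead
    status = batch.sync_remote()
    if status['status'] != 'finished':
        return None
    return [
        {'language': inp['args'][0], 'value': out}
        for inp, out in batch.iter_outputs()
    ]


batch = FakeBatch(lambda lang: f"Hello in {lang}")
first_run = build_items(batch, ["English", "French"])   # submits only
second_run = build_items(batch, ["English", "French"])  # collects results
print(first_run)
print(second_run)
```

Returning None in place of sys.exit(1) keeps the sketch easy to run in a single process; in the template, each attempt is a separate invocation of the build command.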

Build and publish

To build your dataset

python project.py build
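
Because the data.py shown earlier calls sys.exit(1) while the batch is still pending, the build command will likely need to be run more than once. The helper below is a minimal sketch of rerunning a command until it exits with status 0. run_until_success is a hypothetical name, and the sketch assumes that project.py propagates the non-zero exit status; a non-zero status can also mean a genuine error, so check the printed output before retrying blindly.

```python
import subprocess
import sys
import time


def run_until_success(cmd, max_attempts=5, delay_seconds=0.0):
    """Run cmd repeatedly; return the attempt number that exited 0, else None."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait before the next attempt
    return None


# Assumed usage from inside the project folder (not executed here):
# run_until_success([sys.executable, "project.py", "build"],
#                   max_attempts=10, delay_seconds=60)
```

Polling with a fixed delay is the simplest choice; a longer delay is reasonable if the batch usually takes a while to process.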

To preview your build before publishing

python project.py preview show

To publish your build

python project.py publish
