Guide For Dataset Creators

What you need to know

  • General Best Practice
  • (Optional) How to use easytype to provide type annotations.
  • How to create a dataset bundle using one of the following:
    • dataset.sh API: Tutorial
    • datafact: Tutorial
  • Creating synthetic datasets? You may also want to learn mkb.
  • How to verify dataset quality (how well the data captures your intent).
  • How to set up a dataset.sh account and upload to our public dataset repository server.

What datasets can I build?

  • Create synthetic datasets using a generative model
  • Convert existing datasets
  • Crowdsource annotations or annotate the data yourself
  • Convert from a knowledge graph (e.g. Wikidata)
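As an illustration of the "convert existing datasets" route, here is a minimal sketch that turns a small CSV table into JSON Lines records using only the Python standard library. It deliberately does not use the dataset.sh or datafact APIs (see their tutorials for bundling); the function names, column names, and sample rows are hypothetical and exist only for this example.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def records_to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical legacy data; a real conversion would read from a file.
legacy_csv = "text,label\nhello world,greeting\nsee you later,farewell\n"

records = csv_to_records(legacy_csv)
jsonl = records_to_jsonl(records)
print(jsonl)
```

Once the rows are plain dictionaries, they can be validated and then packaged with whichever bundling tool you choose.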

Want to build applications with data?

You can build

  • An open-source ML model trained on our datasets
  • A web or mobile app that displays, navigates, or visualizes data so people can learn from it
  • A video that visualizes data
  • An informative data visualization graphic

Want to contribute to the dataset.sh core software?

You will want to

  • understand the internal data structure
  • understand the communication protocol

Looking for Ideas?

If you want to contribute but don't know where to start, you can consider the following:

  1. Create synthetic datasets using a generative model
  2. Convert existing datasets
  3. Crowdsource annotations or annotate the data yourself
  4. Convert from a knowledge graph (e.g. Wikidata)

For applications:

  1. Build and open-source an ML model using our datasets
  2. Create a data visualization
  3. Use the data in your app