Making GitHub-hosted datasets discoverable by Google Dataset Search
Even though GitHub is best known for sharing and collaborating on source code, it is also used to share datasets. For small datasets (<100MB) that do not exceed free storage/download quotas, GitHub is an attractive home, especially since its social features, such as Issues and Pull Requests, make it easy to get the community involved in data curation.
However, since GitHub is not a data repository, datasets hosted there can be hard to find. For example, Google Dataset Search indexes a dataset only if the HTML of the page describing it contains special semantic annotations. This poses a problem: what if you host your datasets on GitHub and don’t want to spend time setting up GitHub Pages with custom HTML just to get them indexed? There is a way to get the job done just by adding a few lines to a Markdown file (for example, README.md) in your repository.
Google Dataset Search uses a standard called Schema.org to discover dataset metadata. The standard includes many metadata fields (you can find the most important ones here), but most of them are optional. The code you would add to a Markdown file for a basic dataset description looks like this:
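As a rough illustration of the kind of markup involved, here is a minimal Python sketch that generates such a snippet. The helper name dataset_microdata and the example dataset name, description, and URL are hypothetical, and the table-with-code-elements layout is one common way to write the annotations, not the only valid one.

```python
from html import escape


def dataset_microdata(name: str, description: str, same_as: str) -> str:
    """Return an HTML table annotated as a schema.org/Dataset record."""
    rows = [("name", name), ("description", description), ("sameAs", same_as)]
    lines = [
        # itemscope + itemtype declare that everything inside describes a Dataset.
        '<div itemscope itemtype="https://schema.org/Dataset">',
        "<table>",
        "<tr><th>property</th><th>value</th></tr>",
    ]
    for prop, value in rows:
        # itemprop marks the text of the <code> element as the property value.
        lines.append(
            f'<tr><td>{prop}</td>'
            f'<td><code itemprop="{prop}">{escape(value)}</code></td></tr>'
        )
    lines += ["</table>", "</div>"]
    return "\n".join(lines)


# Hypothetical example values; replace them with your own dataset's details.
snippet = dataset_microdata(
    name="Example Tree Survey",
    description="Trunk diameter and height measurements for street trees.",
    same_as="https://github.com/example-org/tree-survey",
)
print(snippet)
```

The script prints an HTML fragment that can be pasted into README.md. The fragment contains no blank lines, so Markdown treats it as a single HTML block.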
GitHub's Markdown parser will render it into the following form:
It looks like a normal table, but the itemtype, itemscope, and itemprop attributes are parsed into a valid schema.org/Dataset record.
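To get a feel for what a crawler reads from those attributes, here is a simplified Python sketch built on the standard library's html.parser. It is an assumption-laden toy, not the full microdata algorithm: it handles a single flat item whose property values are plain text, and the class name MicrodataExtractor and the sample values are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical README fragment with microdata annotations.
README_HTML = """
<div itemscope itemtype="https://schema.org/Dataset">
<table>
<tr><th>property</th><th>value</th></tr>
<tr><td>name</td><td><code itemprop="name">Example Tree Survey</code></td></tr>
<tr><td>description</td><td><code itemprop="description">Street tree measurements.</code></td></tr>
<tr><td>sameAs</td><td><code itemprop="sameAs">https://github.com/example-org/tree-survey</code></td></tr>
</table>
</div>
"""


class MicrodataExtractor(HTMLParser):
    """Collect the item type and text-valued properties of one flat item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current_prop = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only the first itemscope is recorded; nested items are ignored.
        if "itemscope" in attrs and self.item_type is None:
            self.item_type = attrs.get("itemtype")
        if "itemprop" in attrs:
            self._current_prop = attrs["itemprop"]
            self.properties[self._current_prop] = ""

    def handle_data(self, data):
        # Text counts as a value only while inside an itemprop element.
        if self._current_prop is not None:
            self.properties[self._current_prop] += data

    def handle_endtag(self, tag):
        self._current_prop = None


extractor = MicrodataExtractor()
extractor.feed(README_HTML)
extractor.close()
print(extractor.item_type)
print(extractor.properties)
```

The visible table cells (the property names in the first column) are ignored; only the elements carrying itemprop contribute to the record.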
You can verify this by pointing the Structured Data Testing Tool to the GitHub-hosted Markdown file you just created. The results should look something like this:
The name and description properties are self-explanatory, but you might be curious about the sameAs property. It indicates the canonical location of the dataset, which is a very helpful signal for deduplicating the multiple copies of a dataset that forking creates. Of course, the name, description, and sameAs properties are just the bare minimum. There are other fields, such as the names of measured variables, the license, the authors, and the organization or company providing the dataset. There are a few good examples of GitHub-hosted datasets with rich metadata to learn the more complex syntax from.
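For a sense of what richer metadata can look like, the following Python sketch renders nested and repeated properties. It assumes the schema.org property names license, variableMeasured, and provider, and nests the provider as a schema.org/Organization item. The helper render_item and all example values are hypothetical, and the exact layout that Dataset Search prefers may differ.

```python
from html import escape


def render_item(item_type, props, itemprop=None):
    """Render props as a microdata table for the given schema.org type.

    A value may be a string, a (type, dict) tuple for a nested item,
    or a list of either for a repeated property.
    """
    prop_attr = f' itemprop="{itemprop}"' if itemprop else ""
    lines = [
        f'<div itemscope itemtype="https://schema.org/{item_type}"{prop_attr}>',
        "<table>",
        "<tr><th>property</th><th>value</th></tr>",
    ]
    for name, value in props.items():
        cells = []
        for v in value if isinstance(value, list) else [value]:
            if isinstance(v, tuple):
                # A nested item gets its own itemscope and carries the itemprop.
                nested_type, nested_props = v
                cells.append(render_item(nested_type, nested_props, itemprop=name))
            else:
                cells.append(f'<code itemprop="{name}">{escape(str(v))}</code>')
        lines.append(f"<tr><td>{name}</td><td>{' '.join(cells)}</td></tr>")
    lines += ["</table>", "</div>"]
    return "\n".join(lines)


# Hypothetical example values.
rich_snippet = render_item("Dataset", {
    "name": "Example Tree Survey",
    "description": "Trunk diameter and height measurements for street trees.",
    "sameAs": "https://github.com/example-org/tree-survey",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "variableMeasured": ["trunk diameter", "tree height"],
    "provider": ("Organization", {
        "name": "Example Org",
        "sameAs": "https://example.org",
    }),
})
print(rich_snippet)
```

Repeating an itemprop, as with variableMeasured here, is how microdata expresses a property with several values.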
After you update the Markdown file in your repository, you can sit back, relax, and wait for Dataset Search to pick up the metadata about your dataset. If you want to step up your dataset hosting game, consider enabling integration with Zenodo.org. This will give your dataset long-term preservation guarantees and a permanent identifier (plus, all Zenodo datasets are already discoverable by Dataset Search).