Google Research Datasets

D3code Public
D3code is a large-scale cross-cultural dataset of parallel annotations for offensive language detection by over 4k annotators, balanced across gender and age, from across 21 countries, representing eight geo-cultural regions.

0 CC-BY-4.0 0 0 0 Updated May 22, 2024
visage Public
Visage is an image dataset with human annotations indicating whether certain attributes are present or depicted in each image. An attribute may be either stereotypical or non-stereotypical with respect to the identity group in the image. The dataset also contains a list of attributes in English, along with annotations about whether they are visual.

2 Apache-2.0 0 0 0 Updated May 17, 2024
sanpo_dataset Public

Python 37 Apache-2.0 0 3 1 Updated May 16, 2024
adversarial-nibbler Public
This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).

5 CC0-1.0 0 0 0 Updated May 13, 2024
scin Public
The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.

Jupyter Notebook 52 2 1 0 Updated May 8, 2024
MISeD Public
MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.

2 0 0 0 Updated May 6, 2024
cpcd Public
The Conversational Playlist Creation Dataset (CPCD) contains 917 conversations between two people, in which users express preferences over sets of songs in natural language and wizards elicit those preferences. The dataset includes per-song ratings and can be used to design and evaluate conversational recommendation systems.

Python 8 2 1 1 Updated May 3, 2024
indic-gen-bench Public
IndicGenBench is a high-quality, multilingual, multi-way parallel benchmark for evaluating Large Language Models (LLMs) on 4 user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families.

21 2 0 0 Updated May 2, 2024
thesios Public
This repository describes I/O traces of Google storage servers and disks synthesized by Thesios. Thesios synthesizes representative I/O traces by combining down-sampled I/O traces collected from multiple disks (HDDs) attached to multiple storage servers in Google's distributed storage system.

4 0 0 0 Updated Apr 29, 2024
Taskmaster Public
Please see the readme file as well as our 2019 EMNLP paper linked here -->

188 57 4 0 Updated Apr 24, 2024
