1. Written April 9th, 2021. Currently, there are many attribute-object labeled datasets, such as UT-Zap50K, MIT-States, and CUB. There are also verb-object
labeled datasets, more commonly known as Human-Object Interaction (HOI) datasets, such as HICO and V-COCO. The natural next step, then, is to have a tri-tuple label,
consisting of a person doing something to an object with some attribute, i.e. verb-noun-adjective. The labels of current datasets, however, including those of
general-purpose image captioning datasets, are too messy to aggregate into such a dataset. Specifically, datasets like COCO and Visual Genome have images that are
collected in an unregulated fashion (and are thus difficult to give sensible tri-tuple annotations), while well-collected image datasets like HICO do not have sufficient
annotation for adjectives. This problem is easily solvable through Amazon Mechanical Turk (AMT), but that's an effort for another time.