- `src_train_segments.csv` combines the unbalanced and balanced training sets: the splitting line number is 1715367.
- `src_unbalanced_train_segments.csv` is a copy of `src_full_unbalanced_train_segments.csv`.
- `src_unbalanced_train_segments_weight.csv` contains a weight of type `FLOAT` for each sample in `src_unbalanced_train_segments.csv`; note that there is a line-to-line mapping between the two files. The weights are computed as the inverse of audio class frequencies (see the sketch after the field descriptions below).
- `src_audiocaps_*` are AudioCaps in our pre-training-data format, e.g.,
```bash
head -n 1 src_audiocaps_val.csv
# outputs
{
  "dir": "all_1960000_1995000",
  "aclip": ["p0.mp3"],
  "frame": ["p0.4_01.jpg", "p0.4_02.jpg", "p0.4_03.jpg", "p0.4_04.jpg"],
  "aclip_128": "p0.npz",
  "frame_224": "p0.npz",
  "id": "vfY_TJq7n_U",
  "bos": "130.000",
  "eos": "140.000",
  "labels": ["/m/09ddx", "/m/09x0r", "/m/0jbk"],
  "captions": ["Rustling occurs, ducks quack and water splashes, followed by an adult female and adult male speaking and duck calls being blown", "Ducks quack as a man speaks and makes a duck sound", "Birds chirp and ducks squawk while a man and woman speak", "Birds chirp and ducks quack before a man speaks", "Ducks quack and a man speaks"]
}
```
where
- `dir`: the sub-directory that contains the current audio clip. To make parallel data processing easier, we divide AudioSet audio into consecutive chunks (aka dirs), e.g., `all_0_35000`, `all_35000_70000`, ..., `all_2030000_2065000`, `all_2065000_2100000`.
- `aclip`: audio clip suffix; `p0` indicates the gold clip. You may see `n[\d]`, which indicates negative clips cropped from the video before or after the gold clip.
- `frame`: video frame suffix, `p0.[\d]_[\d].jpg`, where the first number is the total number of video frames and the second number is the frame index.
- `aclip_128`: pre-encoded audio clip (discarded).
- `frame_224`: pre-encoded video frames (discarded).
- `id`: YouTube ID.
- `bos`: start time of the gold clip.
- `eos`: end time of the gold clip.
- `labels`: a list of label IDs that require `ontology.json` to parse.
- `captions`: present only in audio captioning datasets.
See `getitem` for how to parse each input sample.
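For illustration, here is a minimal stand-alone parser in the spirit of `getitem`. This is a sketch under the assumption that each line of the CSV holds one JSON record (as in the `head -n 1` example above); it is not the repository's actual implementation.

```python
import json

def parse_sample(line: str) -> dict:
    """Parse one line of a src_*.csv file (one JSON record per line)."""
    sample = json.loads(line)
    # bos/eos are stored as strings; the gold clip spans [bos, eos].
    start, end = float(sample["bos"]), float(sample["eos"])
    return {
        "id": sample["id"],                      # YouTube ID
        "dir": sample["dir"],                    # chunk sub-directory
        "aclip": sample["aclip"],                # audio clip suffixes (p0 = gold)
        "frame": sample["frame"],                # video frame suffixes
        "labels": sample["labels"],              # label IDs; resolve via ontology.json
        "captions": sample.get("captions", []),  # captioning datasets only
        "span": (start, end),
    }

# e.g., parse the first sample of the AudioCaps val split
with open("src_audiocaps_val.csv") as f:
    sample = parse_sample(next(f))
```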
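And a minimal sketch of the inverse-class-frequency weighting mentioned in the file list above. How per-class frequencies are aggregated for multi-label samples (here: the mean of inverse label frequencies) is our assumption, not necessarily what the repository does.

```python
import json
from collections import Counter

def inverse_frequency_weights(lines):
    """One FLOAT weight per input line, mapped line-to-line.

    Sketch only: we take the mean inverse label frequency per sample;
    the repository may aggregate multi-label samples differently.
    """
    samples = [json.loads(line) for line in lines]
    freq = Counter(label for s in samples for label in s["labels"])
    return [
        sum(1.0 / freq[label] for label in s["labels"]) / len(s["labels"])
        for s in samples
    ]

with open("src_unbalanced_train_segments.csv") as f:
    weights = inverse_frequency_weights(f)
# weights[i] pairs with line i of src_unbalanced_train_segments.csv.
```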
We hypothesize that augmenting the balanced training set would be more data-efficient than using the whole training set.
For each sample in the balanced training set, we find the top-k (100 / 512) most similar samples in the unbalanced training set. In other words, we use each sample in the balanced training set as the seed and run a nearest-neighbor search over the unbalanced training set.
We use a VA pre-trained model to compute embeddings and rank candidates by cosine similarity, as in the sketch below.
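A sketch of the retrieval step, assuming the seed (balanced) and pool (unbalanced) embeddings from the VA pre-trained model are already available as NumPy matrices; the function and parameter names are illustrative, not the repository's.

```python
import numpy as np

def topk_neighbors(seed_emb, pool_emb, k=100):
    """For each seed, return indices of the top-k most similar pool
    samples under cosine similarity.

    seed_emb: (num_balanced, d) embeddings of balanced-train seeds.
    pool_emb: (num_unbalanced, d) embeddings of the unbalanced set.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    seed = seed_emb / np.linalg.norm(seed_emb, axis=1, keepdims=True)
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    sim = seed @ pool.T                              # (num_seed, num_pool)
    # argpartition yields an unordered top-k; sort those k per row.
    top = np.argpartition(-sim, k, axis=1)[:, :k]
    order = np.take_along_axis(-sim, top, axis=1).argsort(axis=1)
    return np.take_along_axis(top, order, axis=1)    # (num_seed, k)
```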
- AudioCaps (train, eval, test).
- AudioSet (balanced train, unbalanced train, weight of unbalanced train, full train, eval).