This dataset contains the log-mel spectrograms for the augmented soundscapes described in our ICASSP 2022 submission "Probably Pleasant? A Neural-Probabilistic Approach to Automatic Masker Selection for Urban Soundscape Augmentation", in .npy format. The data can be accessed using the numpy package of Python, using the command numpy.load.
The dataset is available as a 5-fold cross-validation dataset, with the log-mel spectrograms for each fold having filenames of the format fold_#_features.npy and the subjective ratings for the augmented soundscapes having filenames of the format fold_#_labels.npy, where # is the number of the fold in the set {1,2,3,4,5}. The independent test set has fold index 0.
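The naming scheme above can be used as in the following minimal sketch. It assumes that all files sit in a single directory and that the first axis of every array indexes the augmented soundscapes; the array shapes are not stated in this description, so that layout is an assumption. The helper names fold_paths, load_fold, and train_val_split are hypothetical and not part of the dataset.

```python
from pathlib import Path

import numpy as np

CV_FOLDS = (1, 2, 3, 4, 5)  # cross-validation folds
TEST_FOLD = 0               # independent test set


def fold_paths(fold, data_dir="."):
    """Return (features_path, labels_path) for a fold index in 0..5."""
    if fold != TEST_FOLD and fold not in CV_FOLDS:
        raise ValueError(f"fold must be 0 (test) or 1-5, got {fold!r}")
    data_dir = Path(data_dir)
    return (
        data_dir / f"fold_{fold}_features.npy",
        data_dir / f"fold_{fold}_labels.npy",
    )


def load_fold(fold, data_dir="."):
    """Load the log-mel spectrograms and subjective ratings of one fold."""
    features_path, labels_path = fold_paths(fold, data_dir)
    return np.load(features_path), np.load(labels_path)


def train_val_split(val_fold, data_dir="."):
    """Hold out one cross-validation fold and concatenate the other four.

    Assumption: axis 0 of each array indexes the augmented soundscapes.
    """
    if val_fold not in CV_FOLDS:
        raise ValueError(f"val_fold must be one of {CV_FOLDS}")
    train = [load_fold(k, data_dir) for k in CV_FOLDS if k != val_fold]
    x_train = np.concatenate([x for x, _ in train], axis=0)
    y_train = np.concatenate([y for _, y in train], axis=0)
    return (x_train, y_train), load_fold(val_fold, data_dir)
```

For example, train_val_split(3) would return folds 1, 2, 4, and 5 as training data and fold 3 as validation data, while load_fold(0) would return the independent test set.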
Generation of augmented soundscapes
Each augmented soundscape was created by adding 30-second excerpts of recordings of sounds known as maskers to binaural recordings of urban soundscapes (element-wise addition in the time domain). Each masker recording only has one class ("construction", "traffic", "water", or "wind") active for the entire duration of the recording, whereas each binaural recording of an urban soundscape may have multiple sound sources active at any point in the recording, including sound sources outside of the four masker classes.
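The element-wise addition described above can be illustrated with a short sketch. It assumes that the soundscape and the masker are already available as arrays of identical shape (samples by channels) at the same sampling rate; any gain applied to the masker, the sampling rate, and the function name augment_soundscape are not specified here and are hypothetical.

```python
import numpy as np


def augment_soundscape(soundscape, masker):
    """Add a masker to a soundscape sample by sample in the time domain.

    Assumption: both inputs have the same shape, e.g. (n_samples, 2) for
    binaural audio, and share one sampling rate.
    """
    soundscape = np.asarray(soundscape, dtype=np.float64)
    masker = np.asarray(masker, dtype=np.float64)
    if soundscape.shape != masker.shape:
        raise ValueError(
            f"shape mismatch: soundscape {soundscape.shape}, masker {masker.shape}"
        )
    return soundscape + masker


if __name__ == "__main__":
    sample_rate = 44100   # hypothetical sampling rate in Hz
    duration_s = 30       # excerpt length stated in the text
    n_samples = sample_rate * duration_s
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for real recordings, used only for illustration.
    soundscape = 0.1 * rng.standard_normal((n_samples, 2))
    masker = 0.05 * rng.standard_normal((n_samples, 2))
    augmented = augment_soundscape(soundscape, masker)
    print(augmented.shape)
```

Requiring identical shapes keeps the sketch from silently broadcasting a mismatched masker onto the soundscape.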
Cross-validation set
The masker samples were obtained from Freesound by searching the names of the masker classes (i.e. "construction", "traffic", "water", and "wind") on Freesound, and randomly picking a selection of tracks containing 30-second sections of sound that corresponded only to that particular masker class. The soundscape samples were obtained from the Urban Soundscapes of the World (USotW) dataset, and consisted of all binaural recordings available in the public dataset, minus those with
- audible electrical noise,
- measured in-situ LA,eq values below 52 dB, and
- measured in-situ LA,eq values above 77 dB,
in order to
- reflect only the accurately-captured real-life soundscapes,
- ensure that reproduction levels were significantly above the noise floor of the location with the highest noise floor (~36 dB) where the subjective responses were obtained, and
- ensure safe listening levels for our participants.
In total, 120 out of the 127 publicly-available recordings in the USotW dataset were used for the cross-validation set.
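The three exclusion criteria listed above amount to a simple filter, sketched below. The Recording class, its field names, and the example entries are hypothetical and serve only as illustration; the actual USotW metadata format is not described here.

```python
from dataclasses import dataclass

LAEQ_MIN_DB = 52.0  # recordings measured below this level were excluded
LAEQ_MAX_DB = 77.0  # recordings measured above this level were excluded


@dataclass
class Recording:
    """Hypothetical metadata record for one binaural recording."""
    name: str
    laeq_db: float              # measured in-situ LA,eq in dB
    has_electrical_noise: bool  # audible electrical noise present


def is_eligible(rec):
    """Return True if the recording passes all three exclusion criteria."""
    if rec.has_electrical_noise:
        return False
    return LAEQ_MIN_DB <= rec.laeq_db <= LAEQ_MAX_DB


if __name__ == "__main__":
    # Made-up entries, not taken from the USotW dataset.
    candidates = [
        Recording("example_a", 65.0, False),
        Recording("example_b", 48.5, False),
        Recording("example_c", 80.2, False),
        Recording("example_d", 60.0, True),
    ]
    kept = [rec.name for rec in candidates if is_eligible(rec)]
    print(kept)
```

The bounds are inclusive because only recordings strictly below 52 dB or strictly above 77 dB were excluded.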
Test set
The masker samples were obtained from Freesound in the same manner as for the cross-validation set, while ensuring that no recordings overlapped between the test set and cross-validation set maskers. The soundscape samples were taken from binaural recordings of locations in Singapore (which was not represented in any of the soundscapes in the USotW dataset, and hence not in the cross-validation set). They were recorded under the similar Soundscape Indices Protocol, in urban contexts similar to those of the USotW dataset. Specifically, they were from
- a road facing a construction site,
- a gazebo in a park,
- a walkway facing a lake,
- a walkway facing a crowded canteen,
- a path facing a lake, and
- a path facing a lake with an aircraft flying overhead.
Participant information
The participants of the listening test were a sample of people who were able to come in person to our laboratory (at Nanyang Technological University, Singapore) to listen to the stimuli and provide their responses. Their mean age was 28.4 ± 11.8 years, and there were a total of 151 female and 149 male participants. All participants were tested to have normal hearing, defined as a mean hearing threshold at 0.5, 1, 2, 4, and 6 kHz below 20 dB for participants under 30 years of age, and below 30 dB for participants aged 30 years or above.