DATA COLLECTION IN MUSIC GENERATION TRAINING SETS:
A CRITICAL ANALYSIS
Fabio Morreale University of Auckland [email protected]
Megha Sharma University of Tokyo [email protected]
I-Chieh Wei University of Auckland [email protected]
ABSTRACT
The practices of data collection in training sets for Automatic Music Generation (AMG) tasks are opaque and overlooked. In this paper, we aimed to identify these practices and surface the values they embed. We systematically identified all datasets used to train AMG models presented at the last ten editions of ISMIR. For each dataset, we checked how it was populated and the extent to which musicians wittingly contributed to its creation. Almost half of the datasets (42.6%) were indiscriminately populated by accumulating music data available online without seeking any sort of permission. We discuss the ideologies that underlie this practice and propose a number of suggestions that AMG dataset creators might follow. Overall, this paper contributes to the emerging self-critical corpus of work of the ISMIR community, reflecting on the ethical considerations and the social responsibility of our work.
1. INTRODUCTION
The quest to generate music with AI (Automatic Music Generation, AMG) is undergoing crucial but overlooked ontological, artistic, and political transformations. Originally confined to academic labs and employed in niche music genres, this quest is gaining traction mostly among commercial companies¹ aiming at automatically generating music in all genres. These transformations are enabled by a combination of socio-technical novelties, including i) the growing influx of money in the field [2, 3]; ii) advances in Deep Learning (DL) techniques, such as Transformers [4]; and iii) the increase of cheap computational power. While, from a purely musical perspective, the quality of the music created with AI is undoubtedly rising, from a socio-political perspective, a new gold rush resulting from efforts to outperform competitors and make the best AMG model is following the typical blueprint of capitalist innovation [5–7]: corners are being cut; critical questions have not been asked; short-term gains are prioritised; permission is not being sought.

¹ A list from Water & Music [1] includes, as of July 2023, companies like Microsoft, Facebook, Google, Spotify, Deezer, and ByteDance.

© F. Morreale, M. Sharma, and I. Wei. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: F. Morreale, M. Sharma, and I. Wei, “Data Collection in Music Generation Training Sets: A Critical Analysis”, in Proc. of the 24th Int. Society for Music Information Retrieval Conf., Milan, Italy, 2023.
These ethically questionable practices are causing increased concerns. A group of artists² recently released a manifesto that identifies one of the most urgent ethical issues arising from AI-generated art: the exploitation of artists’ work in training AI generation systems. Similarly, Holly Herndon, a musician famous for popularising AI-generated music, recently criticised OpenAI for not asking living performers’ permission to use their music in their AI model, JukeBox [8]. Indeed, most systems that generate artistic content using Machine Learning (ML) indiscriminately populate their training datasets by accumulating original material that is available online [9–12].
Within the ISMIR community, occasional fiery calls have asked the community to reflect on the ethical implications [13–15] of, and demanded accountability [2] for, the work we produce. However, no specific work has investigated the potentially exploitative nature of the datasets we use, and no ethical consideration has been given to how data has been generated. We argue that such an investigation is long overdue, especially as the publication of AMG models proceeds undisturbed and, as we will show in this paper, is steadily increasing.
To fill this gap, we aimed to assess the extent to which the training sets used in ISMIR papers that propose new AMG models are affected by this issue. We first identified all papers presented at the last ten editions of the conference, from 2013 to 2022, that introduced a new music generation model or a pertinent dataset. Then we identified all datasets that were used in these papers. Finally, we surveyed information for each dataset, including how it was populated and the extent to which musicians wittingly contributed to its creation.
The contribution of this paper is threefold. First, we provide descriptive statistics about the datasets most commonly used at ISMIR in AMG applications and how they are populated. Second, we report the ideologies that are embedded in them and outline a lack of adequate engagement with musicians and a carelessness about ethical matters. Third, we offer suggestions for dataset creators interested in following responsible practices in their work.
The rest of the paper is structured as follows. We first review the literature in Critical Dataset Studies and report discussions on ethical issues within MIR. We then describe the research process, report the results, identify the values that are inscribed in the datasets, and offer suggestions for dataset creators. We conclude the paper with a summary of the study and directions for future work.

² The European Guild for Artificial Intelligence Regulation, https://www.egair.eu/
2. BACKGROUND
Deep Learning (DL), which is nowadays the most commonly adopted method to generate music automatically [16–19], relies heavily on the quality and volume of its training data. Despite this reliance, dataset development remains an underappreciated element of DL practice.
2.1 Critical Dataset Studies
A growing literature on critical dataset studies [20] aims at identifying the ethical issues and hegemonic power structures of datasets, in particular when they are used to train ML models [10, 12, 21]. One of the most urgent issues concerns the exploitation of user labour in AI systems: datasets are populated with data generated by “unwitting labourers” [9] and scraped from the Internet “without context and without consent” [11]. The question of consent is particularly convoluted: consent may have been given unwittingly, or for a specific use only, and “some people may never have been given the chance to offer their consent at all” [20]. This concern is not limited to the ivory tower of academia. Musicians are gaining awareness of this issue, and an increasing number of complaints arise from the unfair or unconsented use of original material in AI-generated art.³

³ Notable cases include Getty Images suing Stable Diffusion’s creators [22] and audiobook narrators complaining against Apple for using their voices to train AI [23].
Most critical work on dataset creation has addressed Computer Vision (CV) sets [11, 24, 25] like ImageNet and MS-Celeb, which contain tens of millions of digital images uploaded by platform users. [25] identified the values embedded in these datasets and their formation: evaluating model work is prioritised to the detriment of careful data work. Another case study that has received attention is that of reCAPTCHA [26–28]. Disguised as a human authentication tool, reCAPTCHA can be seen as a capture machine that exploits unpaid individuals’ perceptual abilities and micro-labour to train AI datasets [27, 29].
The very way in which most datasets are created embeds specific neoliberal values, like extractivism and deregulation, as exemplified by OpenAI’s argument that “IP should be free to use for AI, with training constituting fair use” [30, p. 54]. This all-you-can-scrape ideology dismisses individuals’ contributions to dataset creation, which can be met by dataset creators with a laissez-faire attitude [31] that overlooks the ethical implications and liability of scraping the whole Internet [21, 25]. In fact, when concerns are voiced, they are typically aligned with libertarian values and relate to how data privacy and data ownership are barriers to collecting data [25].
The practices and routines of data accumulation are not secret; quite the opposite. Among dataset creators, they have become widely accepted, unquestioned, and unchallenged following a process of dataset naturalisation: “the contingencies of dataset creation are eroded in a manner that ultimately renders the constitutive elements of their formation invisible” [24]. Notably, these values and ideologies are inscribed not only in how technology is used but also in how it is taught. The lack of interest in how datasets are constructed can indeed be found in the lack of guidance in typical ML textbooks and syllabi [24, 32].
While many dataset creators do not consciously attempt to hide their data accumulation practices, they do not try to fully disclose them either. Dataset naturalisation is indeed exacerbated by poor documentary practices: as reported by [33], ML communities pay little attention to documenting data creation and use. [24] proposes that the lack of information on dataset creation (e.g. how datasets have been created, and whether and how much annotators have been paid) is structural, and thus ideological, rather than accidental. Every decision and every step in dataset development that is left unaccounted for and unarticulated in documentary practices has a political meaning, as these steps and decisions are thereby regarded as “not important” [25]. We will return to this point in the discussion.
2.2 Critical turn in MIR and AIM
Several technology communities are undergoing a critical turn [34–37] that challenges existing knowledge production methods and political positions, as well as the ethical and political thought within a field. This turn is ethico-onto-epistemological [38, 39] insofar as it questions what kinds of work, knowledge, and social commitment are pursued within and by the community.
While most criticisms of MIR research come from outside the community [3, 40–42], recent academic production within MIR [2, 3, 14, 43–45] and the development of a workshop series on Human-Centric MIR [46] suggest that we might be close to a Critical MIR, i.e. MIR scholarship devoted to critically analysing the work produced in the field. However, the sort of work that is (not) published at ISMIR (less than 0.5% of ISMIR submissions engage with any sort of ethical discussion [2]) indicates that the field’s response to ethical issues is still inadequate.
With respect to AMG, the ethical issues that have been identified include copyright issues [15, 47], a narrow and Western-centred understanding of music [43, 45], the risk of musician redundancy [2, 14] or the crisis of proliferation [44, 48], diversity issues [49], colonialist and extractive practices [2], and the assumptions and biases that are embedded in AI systems [13–15]. To the best of our knowledge, the potentially exploitative nature of AMG datasets remains uncharted territory.
3. METHODOLOGY
3.1 Researcher Positionality and Motivation
Positionality statements are common in critical studies, where they help readers understand the research context and the authors’ interpretation of the results. Since the outset of the research process, we have strived to maintain objectivity and reflexivity by acknowledging our unique positions and backgrounds. All three authors are actively involved in MIR. The first author is formally trained in computer science and is an expert in critical theory and technology studies; the second author has a background in computer science; and the third author has a background in electronic engineering and specialises in machine learning algorithms.
Our motivation for undertaking this study is twofold. First, we aimed to support the growth of the ISMIR community by contributing to its corpus of self-reflective work, identifying ideologies that might be latent but, once surfaced, may be considered problematic by community members. Second, the suggestions for dataset creation derive from the personal experience of one of the authors, who was involved in dataset creation for AMG and recognised the importance of community guidance on the best ethical practices to adhere to.
3.2 Analysis of ISMIR publications
We conducted a systematic review of the last ten editions of ISMIR (2013–2022). A total of 1078 publications were sourced from the conference proceedings. Two of the authors manually filtered the papers by applying two inclusion criteria. First, we included all papers presenting a new music generation model, considering all models that generate new compositions or performances, including in-painting, style transfer, and improvisation. Second, we included all papers that introduced a new dataset that could potentially be utilised as training material for AMG models, but rejected works that did not contain symbolic or raw audio music files. For example, we did not include the NSynth Dataset [50], which contains sampled notes from different instruments, but we did include MedleyDB [51], which contains annotated multitrack audio.
The analysis proceeded in two phases. First, for each paper, we identified whether the authors employed existing datasets (i.e. datasets released or introduced before the publication of the ISMIR paper) or created new ones (i.e. datasets created or introduced as part of the original research reported in the paper). We also examined the presence of any discussion of ethics and of permission to use the data entries as training data for AMG models.
In the second phase, we examined the datasets identified in the first phase. For papers that used an existing dataset, we retrieved dataset information from the original paper in which it was introduced (whether or not it was published at ISMIR). When we could not find sufficient information in the paper, we checked the dataset release links, which were found either in the original paper or via a web search of the dataset name. The information we collected included: i) the data format (symbolic or audio); ii) how datasets were populated; iii) whether the data contained original performances, compositions, or arrangements; iv) the data type; v) the extent to which musicians were involved in the dataset creation and whether they were aware of the intended purposes of the dataset; and vi) whether ethical concerns were discussed.
Figure 1: Distribution of selected papers and datasets over the years (only papers after 2003 were considered).
Dataset Name           Format    Occurrence
POP909                 Symbolic  11
Nottingham             Symbolic  9
Lakh MIDI              Symbolic  5
HTPD3                  Symbolic  3
Yamaha e-Competition   Symbolic  3
Lakh Pianoroll         Symbolic  3
MusicNet               Both      3
URMP                   Both      3
AILABS17k              Both      3
Bach Music21           Symbolic  3
Bach Chorales          Symbolic  3
RWC                    Both      3

Table 1: The most popular datasets and their occurrences. Each dataset comprises data in symbolic form, but three of them also include audio files.
From a methodological point of view, most of these investigations involved checking the aspect under scrutiny (e.g. whether ethics was discussed) against the dataset sources. The task of identifying how datasets were populated was not as straightforward. In order to streamline the analysis and facilitate the reporting of the findings, we aimed to cluster datasets into categories that reflected different ways of populating datasets. Two of the authors performed this categorisation following an inductive approach: as they analysed more datasets, they introduced new categories and deleted or merged existing ones. The analysis spreadsheet is available at https://github.com/Sma1033/amgdatasetethics.
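For readers who wish to recompute our category tallies from the released spreadsheet, the following is a minimal sketch, written under the assumption that the spreadsheet is exported as a CSV with a column named collection_method (a hypothetical name) in which a dataset’s multiple categories are semicolon-separated:

import csv
from collections import Counter

# Tally how often each data-collection category occurs across datasets.
counts = Counter()
with open("amg_dataset_ethics.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Categories are non-orthogonal: one dataset may carry several.
        for category in row["collection_method"].split(";"):
            counts[category.strip()] += 1

for category, n in counts.most_common():
    print(f"{category}: {n}")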
4. FINDINGS
A total of 121 papers survived the filtering. Fig. 1 shows the significant rise of interest in AMG in recent years. Two-thirds of the articles (82) introduced a new model, either on its own or together with a new dataset (Fig. 2a). From this list of papers, we identified 115 datasets (Fig. 2b). When considering only the 82 papers that introduced a new model, most (62 papers, 75.6%) used, at least in part, existing datasets to train their AMG models. Tab. 1 shows the 12 most frequently used datasets in our survey, along with their data format and number of occurrences. The remaining 104 datasets were only used in one or two papers.
Figure 2: (a) What is introduced: new model (73 papers), new dataset (39), both (9). (b) Dataset originality: new datasets (58), existing datasets (53), both (12).
4.1 Dataset Creation
We clustered datasets into nine categories reflecting the different ways in which entries were collected (Tab. 2). The categories are non-orthogonal: a dataset could be associated with more than one category. In most cases, datasets were populated without their creators incurring any costs; only five datasets used paid online sources or compensated the involved musicians. With the exception of those belonging to the ‘Involved musicians’ and ‘Synthesised music’ categories, all datasets were populated with existing music data; new music data accounted for 16.5% of all datasets. We found evidence of poor documentary practices for 18 datasets (15.7%): in 10 cases, there was no information on how data was collected, and in 8 cases, we could not find any documentation reporting how the dataset was created.
4.2 Musicians’ involvement
Only 17 datasets (14.8%) involved musicians in any capacity (category ‘Involved musicians’). In 11 cases, musicians performed or arranged existing compositions. Three of these datasets (ASF-4, HP-10, and AIST) were entirely created from novel compositions. The remaining three datasets in this category did not contain compositions or performances created specifically for the dataset, or at least this was not explicitly mentioned. The Irish Traditional Dance Music dataset [61] used recordings of one of the authors’ own performances. For the remaining two datasets [51, 62], the creators mentioned that professional musicians had created the recordings; however, it is unclear whether these recordings were created specifically for the dataset. The other two datasets that included new music belonged to the ‘Synthesised music’ category and consisted of algorithmically generated monophonic melodies [59] or polyphonic MIDI sequences [63].
4.3 Musicians’ permission and awareness
We checked whether explicit permission was sought from musicians to use their creations to train an AMG model. Only three dataset creators reported having asked for such permission. The authors of the ASF-4 and HP-10 datasets [64] explicitly mentioned that the musicians involved were made aware of the purpose of the dataset for AMG. Jazz players participating in the creation of the FILOSAX dataset [54] signed a document that explained the goals of the dataset; however, it is not clear whether these goals included AMG: whereas the authors mention “music generation” in the abstract, AMG was not included in the list of potential applications of the dataset. In the Mozart Piano Music dataset [65], pianists gave permission to use their performances for the intended use of the dataset (music analysis), but they were probably not aware of, and did not consent to, having their performances used to train the AMG model introduced in [66].
Two cases were particularly problematic. The MAST dataset [67], which was introduced for automatic rhythm assessment, was sourced from student entrance exams without seeking consent from the students. Another popular dataset, the Yamaha e-Competition dataset,⁴ features MIDI files of piano performances obtained from the entries of the piano competition. Although Yamaha claims ownership over all data generated during the event, competitors are unlikely to be aware that their performances are used to train AMG models, as seen in [68]. The lack of permission sought from the musicians clashes with the comments frequently offered by dataset creators, who acknowledged the valuable contributions made by these musicians, contributions that allow the datasets to exist in the first place.

⁴ https://www.piano-e-competition.com/
4.4 Discussions on Ethical Issues
Our analysis revealed a lack of engagement with ethical issues, corroborating the findings of [25] in their analysis of CV datasets. Only four datasets included any ethical considerations, and only two of them contained an explicit ethics statement. The authors of [69], which presented a new GuitarPro dataset, listed several questions, some of which are particularly relevant to this paper: “How to acknowledge, reward and remunerate artists whose music has been used to train models?” and “What if an artist does not want to be part of a dataset?”. While their spontaneous engagement with these issues is commendable, it is not clear to what extent the authors used these questions in the development of their dataset.
In [70], the authors raised concerns about the impact of AMG on “human musicians of the future”. They also stated that “care have (sic) to be given regarding the fair use of existing musical material for model training” but did not further explain what sort of care is needed and what constitutes unfair use. [57] included an analysis of plagiarism and observed that their model demonstrated a potential tendency for plagiarism similar to the level exhibited by a human musician; this issue was also recently highlighted in [47].
5. DISCUSSIONS
Leveraging our findings, this section first reports and discusses the values embedded in the datasets used at ISMIR for AMG models. We then offer practical suggestions to AMG dataset creators.
Category            Description                                                                 Occurrence
Scraped online      Existing music data collected online from websites [52] or databases [53]      49
Existing datasets   Existing music data collected from existing datasets [16]                      26
Involved musicians  New music data was created by involving musicians in some capacity [54]        17
Private data        Existing music data was collected from private databases [55]                   5
Book collection     Existing music data collected from printed books [56]                           5
Online store        Existing music data collected from an online commercial website [57]            4
CD collection       Existing music data collected from published CD recordings [58]                 3
Synthesised music   New music synthesised using rule-based heuristics or other methods [59]         2
Not mentioned       No explicit information about how data was obtained [60]                       18

Table 2: Categorisation of how data was collected in the datasets. For each category, we include exemplary references.
5.1 The Values Embedded in Our Datasets
Our analysis extends what [24, 25, 33] have suggested for other ML application areas: data work and data collection practices are de-prioritised and de-valued in the AMG datasets used at ISMIR. Most datasets (∼60%) had been populated either by scraping songs from the Internet or by accumulating data from existing datasets. Considering original music as a terra nullius that is free for the taking means approaching dataset creation with expediency. This approach follows the hegemonic narrative that compares data to oil. The comparison is highly ideological [71–73], as it disguises the origin (and ends) of data [11], and de-penalises and justifies extractive practices through a neo-colonial rhetoric in which data is something waiting to be discovered [71, 73].
This narrative underestimates or blatantly neglects the human labour necessary for a dataset’s development, which includes writing, performing, transcribing, and recording music. The majority of datasets were created by amassing musical compositions initially intended for purposes other than AMG. In these cases, the original labour that went into the creative acts of composition or performance is simply neglected. Relatively few datasets included original material, and in only two cases was specific permission asked to use musicians’ work to train datasets for AMG purposes. This discussion point resonates with objections to the unfair and exploitative practices of capturing individuals’ labour and humanness [9] when creating data for digital platforms [6, 74–76] and training AI systems [10–12]. As proposed by [77], human labour is structurally obfuscated in ML applications to the benefit of profit and innovation. Similarly, [78] proposes that hiding the labour in this context is crucial to attracting capital investments.
Our direct knowledge and lived experience of MIR offer us a vantage point that we can employ in our reflexive inquiry. We propose that dataset creators might have prioritised safety over criticality and followed common, albeit questionable, procedures simply because these are the procedures typically employed in AMG research. This comment is not intended to absolve dataset creators from the responsibilities that come with their work; rather, it is an invitation to self-assess one’s alignment with the exploitative ideologies we surfaced in this section. Yet, we unequivocally found a lack of data work, including a limited interest in creating one’s own data, the exploitation of the labour of unwitting musicians (e.g. in the e-piano competition) and students [67] in dataset curation, and poor documentary practices regarding the source of data [60, 79]. We argue that this lack is ideological: what we leave unaccounted for or unspoken in dataset creation and documentation signals what we consider important or irrelevant [25].
Our findings indicate that the rights and demands of musicians are not prioritised by dataset creators and that the degree to which new models and datasets advance or curb a fair model for musicians is largely ignored. This comment resonates with a note from [80], who explained that streaming services overlook “the rights of musicians or users because their decisions are made based on wholly other problems”. It is thus essential that ISMIR researchers and practitioners reflect on the problems that drive their decisions and on the agendas they implicitly or explicitly follow. Answering questions like “what is the agenda we are following and who benefits from it?” [81] requires community discussions that are difficult, uncomfortable, and controversial, but nevertheless necessary. Avoiding engagement with these questions is not a political absence but rather a tacit political acceptance of the status quo [36, 37], as datasets do not exist in a political void [20, 82].
5.2 Suggestions
In this section, we offer suggestions to the broader community and to individual authors interested in creating new datasets or using existing ones to train AMG models. We developed these suggestions by integrating the results of our analysis with findings from other academic contributions, including recommendations for ethical CV datasets [25]. These suggestions are not intended to be meticulously followed like a recipe book; rather, we devised them as probes, navigation tools, and structured conversations whose development should continue in a participatory way with the rest of the community.
Develop one’s own dataset. While exploiting musicians’ labour in AI dataset creation is a questionable practice [9], expecting dataset creators to seek and obtain consent from all humans involved in AMG datasets is unrealistic [30]. Thus, we recommend creating, whenever possible, one’s own dataset and hiring musicians for as many tasks as possible (i.e. composing, performing, and arranging songs). A small but important number of datasets in our investigation followed this practice. We acknowledge that this suggestion might lead to equity issues: if it were to be enforced, only big companies and top university labs would have the economic means to develop such datasets. However, rather than dismissing this issue as unsolvable and continuing business as usual, we propose that the community interrogate itself and find strategies to tackle it. As an alternative, efforts might be made to develop models that are trainable on small or procedurally generated datasets, following recent successful examples like [30, 83].
Receive consent from musicians and remunerate them. Dataset creators should inform musicians about the specific goals of the dataset. It is possible that musicians would willingly consent to their music being used for several MIR tasks but not for training AMG. When possible, dataset creators should consider paying musicians for their labour and disclosing the amount [24], as done in [54]. Given the equity problem discussed above, when paying musicians is not feasible, that should be reported [25], and musicians should at least be acknowledged. When AMG systems are integrated into commercial products, a technical infrastructure might be implemented to distribute royalties to dataset contributors. This suggestion shares Holly Herndon’s vision for a novel IP framework that “compensates me for my likeness when (and only when) money is made from it” [8].
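As a purely illustrative sketch of what such an infrastructure could compute (the pro-rata rule, names, and amounts below are our assumptions, not an existing system), a royalty pool could be split in proportion to the number of works each musician contributed to the training set:

def split_royalties(pool_cents: int, works: dict[str, int]) -> dict[str, int]:
    """Split a royalty pool pro rata to each contributor's number of works
    used for training (integer cents; remainders from floor division stay
    in the pool)."""
    total = sum(works.values())
    if total == 0:
        return {}
    return {name: pool_cents * n // total for name, n in works.items()}

# Hypothetical contributors sharing a 100.00 pool (10000 cents).
print(split_royalties(10_000, {"alice": 5, "bob": 3, "carol": 2}))
# -> {'alice': 5000, 'bob': 3000, 'carol': 2000}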
Document the process of dataset development. Our analysis revealed a general lack of care not only in doing but also in documenting data work. For instance, the POP909 dataset’s creators did not mention the source or selection process of the “909 popular songs” used to generate piano arrangements [84], and the Lakh dataset’s creators simply mentioned that they extracted songs from “publicly-available sources on the internet”.⁵ Careless documentary practices, which we believe were mostly involuntary and caused by an undervaluing of this process in the field [24, 25], implicitly signal that how a dataset is developed and whose labour goes into it are not important. We suggest the community develop protocols, guidelines, or templates offering fair practice suggestions for dataset creators to follow; one possible starting point is sketched below.

⁵ https://colinraffel.com/projects/lmd/
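One possible starting point for such a template is a machine-readable record, loosely inspired by datasheets for datasets [33]; every field name below is an illustrative assumption rather than an established standard:

from dataclasses import dataclass, field

@dataclass
class DatasetRecord:
    """A fair-practice documentation record for one AMG dataset."""
    name: str
    data_format: str                  # "symbolic", "audio", or "both"
    collection_method: str            # e.g. "involved musicians", "scraped online"
    source: str                       # where the music came from
    intended_uses: list = field(default_factory=list)  # 'allowed' applications
    musicians_informed: bool = False  # were contributors told about possible AMG use?
    consent_obtained: bool = False    # explicit permission for the intended uses
    remuneration: str = "none"        # amount paid, or why payment was not feasible
    ethics_statement: str = ""        # text of (or link to) the ethical considerations

# A hypothetical, fully documented dataset entry.
record = DatasetRecord(
    name="ExampleJazzSolos",
    data_format="audio",
    collection_method="involved musicians",
    source="studio sessions recorded for this dataset",
    intended_uses=["transcription", "music generation"],
    musicians_informed=True,
    consent_obtained=True,
    remuneration="session fee, disclosed in the dataset paper",
)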
Report the intended use of the dataset. Our findings indicate that it is common practice among AMG dataset creators to reuse existing datasets. We suggest that dataset creators report the original intended use of their dataset and list the potential ‘allowed’ applications, following the example of [54]. This practice would prevent, or at least dissuade, future dataset creators from using that data for purposes other than the ones envisioned by the creators and agreed on by the musicians. This suggestion is grounded in the observation that technologies are often interpreted, used, and appropriated in ways that their creators cannot foresee or control (what [85] terms the designer’s fallacy). As new applications of datasets are discovered, measures should be taken to ensure that permission from the involved musicians is obtained to use their work for purposes other than the ones they agreed on.
When borrowing data, maintain the purpose of the original datasets. Connected to the above suggestion, creators should maintain the original purpose of the data when borrowing entries for new datasets and avoid misappropriation. This is particularly important when dealing with culturally relevant and sensitive music. This is, for instance, the case of the dataset on the Australian Aboriginal language used by [86]. The author reported: “These datasets were public domain and encouraged for use by the creator as a way to share the sound of the language. Even so, it is not clear that the creators of the dataset from the late nineties could predict this (AI generation) ‘future use’ case” [30].
Volunteer ethical considerations. Our analysis revealed that almost none of the papers engaged in any form of ethical consideration. Authors can show commitment to advancing more just practices in dataset creation by reflecting on the potential ethical limitations of their datasets. Preferably, they should also include documents, approved by an ethics board if applicable, that were given to and signed by the participating musicians.
6. CONCLUSIONS AND FUTURE WORK
We identified the dominant approaches to dataset creation within ISMIR and analysed them through critical lenses to understand their ideological substrate. Most authors seem to handle dataset creation with neoliberal attitudes and expediency. However, a small, yet significant, number of dataset creators showed that other attitudes and values are at play within ISMIR when creating datasets for AMG. Our analysis did not explain the motivations for dataset creators to engage, or not engage, with ethical issues in their work; this investigation is left for future work. We also aim to extend the analysis to papers beyond those published at ISMIR and to conduct an ethnographic study with AMG dataset creators to give voice to their perspectives on the topic. To conclude, ISMIR has been playing a significant role in the growth of ML models for AMG, but the lack of an ethical infrastructure may facilitate an exploitative industry. It is our responsibility, as the main academic hub of AMG, to recognise the need to engage in discussions around the matters raised in this article and to establish ISMIR as the home of this debate.
7. ETHICAL STATEMENT
In this study, we only used secondary data (desk research) that is publicly available online. We consider that analysing this existing, publicly available information does not incur ethical issues.
8. CONFLICTS OF INTEREST
We have no relevant financial or non-financial interests to disclose and no financial or proprietary interests in anything discussed in this paper.
9. DATA AVAILABILITY
The spreadsheet with our analysis is available at https://github.com/Sma1033/amgdatasetethics.
10. REFERENCES
[1] D. Edwards and D. McGlynn, “Creative AI for artists: Track 80+ tools,” https://www.waterandmusic.com/data/creative-ai-for-artists/, [Accessed: 12-Apr-2023].
[2] F. Morreale, “Where does the buck stop? Ethical and political issues with AI in music creation,” in Transactions of the International Society for Music Information Retrieval, 2021, pp. 105–114.
[3] E. Drott, “Copyright, compensation, and commons in the music AI industry,” in Creative Industries Journal, 2021, pp. 190–207.
[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
[5] D. Harvey, A Brief History of Neoliberalism. Oxford University Press, USA, 2007.
[6] S. Zuboff, The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power. PublicAffairs, 2018.
[7] E. Morozov, To Save Everything, Click Here: The Folly of Technological Solutionism. PublicAffairs, 2013.
[8] M. Clancy, “The Artist: Interview with Holly Herndon,” in Artificial Intelligence and Music Ecosystem, 2023, pp. 44–51.
[9] F. Morreale, E. Bahmanteymouri, B. Burmester, A. Chen, and M. Thorp, “The unwitting labourer: Extracting humanness in AI training,” AI & Society, pp. 1–11, 2023.
[10] P. Tubaro, “Learners in the loop: Hidden human skills in machine intelligence,” in Sociologia del Lavoro, 2022, pp. 110–129.
[11] K. Crawford, The Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. Yale University Press, 2021.
[12] N. Dyer-Witheford, A. M. Kjøsen, and J. Steinhoff, Inhuman Power: Artificial Intelligence and the Future of Capitalism. Pluto Press, 2019.
[13] A. Holzapfel, B. Sturm, and M. Coeckelbergh, “Ethical dimensions of music information retrieval technology,” in Transactions of the International Society for Music Information Retrieval, 2018, pp. 44–55.
[14] G. Born, J. Morris, F. Diaz, and A. Anderson, “Artificial Intelligence, music recommendation, and the curation of culture,” Schwartz Reisman Institute for Technology and Society White Paper, 2021.
[15] B. L. Sturm, M. Iglesias, O. Ben-Tal, M. Miron, and E. Gómez, “Artificial Intelligence and music: Open questions of copyright law and engineering praxis,” in Arts, 2019, p. 115.
[16] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “MuseGAN: Multi-track sequential Generative Adversarial Networks for symbolic music generation and accompaniment,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[17] W. Chen, J. Keast, J. Moody, C. Moriarty, F. Villalobos, V. Winter, X. Zhang, X. Lyu, E. Freeman, J. Wang, S. Cai, and K. M. Kinnaird, “Data usage in MIR: History & future recommendations,” in Proceedings of the International Society for Music Information Retrieval Conference, 2019, pp. 25–30.
[18] Y.-S. Huang and Y.-H. Yang, “Pop Music Transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proceedings of the ACM International Conference on Multimedia, 2020.
[19] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “MusicLM: Generating music from text,” arXiv:2301.11325, 2023.
[20] N. B. Thylstrup, “The ethics and politics of data sets in the age of machine learning: Deleting traces and encountering remains,” Media, Culture & Society, vol. 44, no. 4, pp. 655–671, 2022.
[21] R. Van Noorden, “The ethical questions that haunt facial-recognition research,” Nature, vol. 587, no. 7834, pp. 354–359, 2020.
[22] J. Vincent, “Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content,” https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit, [Accessed: 11-Apr-2023].
[23] S. Agarwalt, “Audiobook narrators fear Apple used their voices to train AI,” https://www.wired.com/story/apple-spotify-audiobook-narrators-ai-contract/, [Accessed: 11-Apr-2023].
[24] E. Denton, A. Hanna, R. Amironesei, A. Smart, and H. Nicole, “On the genealogy of machine learning datasets: A critical history of ImageNet,” in Big Data & Society, 2021.
[25] M. K. Scheuerman, A. Hanna, and E. Denton, “Do datasets have politics? Disciplinary values in computer vision dataset development,” in Proceedings of the ACM on Human-Computer Interaction, 2021, pp. 1–37.
[26] V. Avanesi and J. Teurlings, “I’m not a robot, or am I?: Micro-labor and the immanent subsumption of the social in the human computation of reCAPTCHAs,” in International Journal of Communication, 2022, p. 19.
[27] B. T. Pettis, “reCAPTCHA challenges and the production of the ideal web user,” in Convergence, 2022.
[28] R. Mühlhoff, “Human-aided Artificial Intelligence: Or, how to run large computations in human brains? Toward a media sociology of machine learning,” in New Media & Society, 2020, pp. 1868–1884.
[29] J. O’Malley, “Captcha if you can: How you’ve been training AI for years without realising it,” https://www.techradar.com/news/captcha-if-you-can-how-youve-been-training-ai-for-years-without-realising-it, [Accessed: 12-Apr-2023].
[30] R. Savery and G. Weinberg, “Robotics: Fast and curious: A CNN for ethical deep learning musical generation,” in Artificial Intelligence and Music Ecosystem, 2022, pp. 52–67.
[31] E. S. Jo and T. Gebru, “Lessons from archives: Strategies for collecting sociocultural data in machine learning,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, 2020, pp. 306–316.
[32] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[33] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. D. III, and K. Crawford, “Datasheets for datasets,” Commun. ACM, vol. 64, no. 12, pp. 86–92, Nov. 2021. [Online]. Available: https://doi.org/10.1145/3458723
[34] N. E. Gold, R. Masu, C. Chevalier, and F. Morreale, “Share your values! Community-driven embedding of ethics in research,” in CHI Conference on Human Factors in Computing Systems Extended Abstracts, 2022, pp. 1–7.
[35] S. Benford, C. Greenhalgh, B. Anderson, R. Jacobs, M. Golembewski, M. Jirotka, B. C. Stahl, J. Timmermans, G. Giannachi, M. Adams et al., “The ethical implications of HCI’s turn to the cultural,” in ACM Transactions on Computer-Human Interaction, 2015, pp. 1–37.
[36] F. Morreale, A. Bin, A. McPherson, P. Stapleton, and M. Wanderley, “A NIME of the times: Developing an outward-looking political agenda for this community,” in New Interfaces for Musical Expression, 2020.
[37] O. Keyes, J. Hoy, and M. Drouhard, “Human-computer insurrection: Notes on an anarchist HCI,” in Proceedings of the CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–13.
[38] K. Barad, Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Duke University Press, 2007.
[39] C. Frauenberger, “Entanglement HCI the next wave?” in ACM Transactions on Computer-Human Interaction, 2019, pp. 1–27.
[40] J. W. Morris, “Curation by code: Infomediaries and the data mining of taste,” in European Journal of Cultural Studies, 2015, pp. 446–463.
[41] N. Seaver, Computing Taste: Algorithms and the Makers of Music Recommendation. University of Chicago Press, 2022.
[42] J. Sterne and E. Razlogova, “Machine learning in context, or learning from LANDR: Artificial Intelligence and the platformization of music mastering,” in Social Media + Society, 2019.
[43] G. Born, “Diversifying MIR: Knowledge and real-world challenges, and new interdisciplinary futures,” in Transactions of the International Society for Music Information Retrieval, 2020.
[44] M. Clancy, “Reflections on the financial and ethical implications of music generated by Artificial Intelligence,” Ph.D. dissertation, Trinity College Dublin, School of Creative Arts, Discipline of Music, 2021.
[45] R. Huang, B. L. Sturm, and A. Holzapfel, “De-centering the West: East Asian philosophies and the ethics of applying Artificial Intelligence to music,” in Proceedings of the International Society for Music Information Retrieval Conference, 2021, pp. 301–309.
[46] L. Porcaro, C. Castillo, and E. Gómez Gutiérrez, “Music recommendation diversity: A tentative framework and preliminary results,” in Workshop on Designing Human-Centric Music Information Research Systems, 2019.
[47] Z. Yin, F. Reuben, S. Stepney, and T. Collins, “Deep learning’s shallow gains: A comparative evaluation of algorithms for automatic music generation,” in Machine Learning, 2023, pp. 1–38.
[48] J. Attali, Noise: The Political Economy of Music. Manchester University Press, 1985.
[49] L. Porcaro, C. Castillo, and E. Gómez Gutiérrez, “Diversity by design in music recommender systems,” in Transactions of the International Society for Music Information Retrieval, 2021.
[50] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi, “Neural audio synthesis of musical notes with WaveNet autoencoders,” in Proceedings of the International Conference on Machine Learning, 2017.
[51] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A multitrack dataset for annotation-intensive MIR research,” in Proceedings of the International Society for Music Information Retrieval Conference, 2014, pp. 155–160.
[52] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” in Proceedings of the International Society for Music Information Retrieval Conference, 2017.
760