International Journal of Information Management Data Insights
How can generative adversarial networks impact computer generated art?
Insights from poetry to melody conversion
Sakib Shahriar a,∗ and Noora Al Roken b

a College of Technological Innovation, Zayed University, Abu Dhabi, United Arab Emirates
b Department of Computer Science and Engineering, American University of Sharjah, Sharjah, United Arab Emirates
∗ Corresponding author. E-mail address: [email protected] (S. Shahriar).
ARTICLE INFO

Keywords:
Generative adversarial networks
Deep learning
Melody generation
Music generation
Artificial intelligence
Computer generated art

ABSTRACT
Recent advances in deep learning, and generative adversarial networks (GANs) in particular, have enabled interesting applications including photorealistic image generation, image translation, and automatic caption generation. This has opened up possibilities for many cross-domain applications in computer-generated arts and literature. Although there are existing software-based approaches for generating musical accompaniment for a given poem, there are no existing implementations using GANs. This work proposes a novel poetry-to-melody generation conditioned on poem emotion using GANs. A dataset containing pairs of poetry and melody based on three emotion categories is introduced. Furthermore, various GAN architectures including SpecGAN and WaveGAN were explored for automatic melody synthesis for a given class of poetry. Conditional SpecGAN produced the best melodies according to quantitative metrics. Melodies produced by SpecGAN were evaluated by volunteers, who deemed their quality to be above average.
1. Introduction
Computer art was first explored as a way to produce objective art pieces by relying on probability theories (Chamberlain et al., 2018). Beginning in the 1970s, the artist Harold Cohen experimented with computer sketch generation, manually coding the painting structure into the program (Chamberlain et al., 2018). Unlike artificial intelligence (AI), conventional computer programs could not intelligently produce a new art piece without the designer's intervention. With the rapid development of AI, computer-generated drawings, music, and poetry have improved. Advanced learning algorithms and evolutionary models have generated complex and realistic artworks (Chamberlain et al., 2018). In 2018, the portrait of Edmond de Belamy was sold for $430,000 at auction, a much higher price than the anticipated tag of $7,000-$10,000 for a computer-generated artwork (Goold, 2021). The Edmond de Belamy case was the first auction sale of an AI-generated artwork; the French creators fed 30,000 portraits to an available machine learning algorithm and selected the best generated picture (Goold, 2021). Finally, the owners added their signatures to the bottom right corner. Furthermore, recent interest in adversarial training for artwork generation has grown significantly. The network tries to generate realistic outputs, and the critic or discriminator tries to distinguish between the real and generated samples. Earlier, a human played the critic's role, and the program was a form of an evolutionary
algorithm or a neural network (Soderlund & Blair, 2018). A group of outputs (e.g., images or music) is presented to the critic, and a subgroup with the best results is selected and passed to the next stage or iteration. Such a process is time-consuming, and hence the entire procedure has been automated by replacing the critic with a neural network that can learn discriminant features to distinguish between the inputs (Soderlund & Blair, 2018). Consequently, researchers and artists have recently started utilizing adversarial learning to generate various art forms including visual, textual, and audio arts.
Generating music that expresses a particular emotion conveyed by a poem helps the reader understand the piece of literature on a deeper level. Music accompaniments for poems can also be used to enhance the mood of poetry readers. However, insufficient work has been carried out on generating musical pieces for poems while considering the emotion of the text. To the best of our knowledge, no research application has yet been proposed that relates to poem and melody generation using deep learning. The existing literature mainly focuses on generating music for other art forms such as song lyrics. Hence, in this work, we present the first Arabic poetry to melody generation utilizing generative adversarial networks (GANs). GANs can generate new content based on a min-max game between two networks, the generator and the discriminator (Ruzafa, 2020). The generator tries to generate fake samples and improve its network to trick the discriminator. Meanwhile, the discriminator tries to distinguish between the real and fake samples. More sophisticated GAN architectures can make use of labels to
generate data points for a specific category. This is useful because in the standard GAN, data points are generated based only on the input noise, and unconditional generation of data points is not suitable for many applications.

Fig. 1. Graphical illustration of the proposed application.
Although GANs have been widely utilized for various applications including 3D object generation, medicine, and traffic control (Aggarwal et al., 2021), their adoption for audio data, and more specifically cross-domain audio generation including text-to-audio and image-to-audio generation, is very limited. The specific problem of text-to-audio transformation can be highlighted further by a related research application, text-to-speech (TTS) transformation. Most TTS processes carried out by neural network models are auto-regressive (AR), which means that the speech segments are generated based on the previously synthesized segments. AR methods do not support parallelization (i.e., the use of GPUs) due to their sequential nature (Prenger et al., 2019; Li et al., 2019). Consequently, AR models are not efficient for real-time applications because of the consecutive calculations that require time (Okamoto et al., 2019; Ren et al., 2019). Ren et al. (2019) noted how the commonly used AR TTS models such as Tacotron, Tacotron 2, Deep Voice 3, and ClariNet generate Mel-spectrograms from the text inputs for audio synthesis. The use of spectrograms further slows down audio generation, since each input is converted to a Mel-spectrogram whose sequence length is in the hundreds or thousands. Mel-spectrograms are susceptible to word skipping or repeating, which causes inconsistency between the text and generated speech (Ren et al., 2019). Besides the difficulties in TTS generation, melody or music generation comes with its own set of challenges. Dieleman et al. (2018) explain why music generation has received less attention than speech due to its complexity. MIDI and other symbolic music representations have been used in the literature to represent music concisely. However, many essential features that describe the instruments used are removed when using symbolic representations. The piano can be translated into MIDI sequences, but it is difficult to represent most instruments. It is more efficient to use raw music audio signals since they retain the needed information. However, raw audio is even more difficult to model, i.e., to capture musical structure over an extended period.
Therefore, the proposed text-to-audio generation application investigates the competency of GANs with audio data. The objective of this study is to address some of the existing challenges in audio generation and further examine audio generation from a given input text, which has not received sufficient attention in existing research works.
A high-level graphical illustration of the proposed framework is presented in Fig. 1. The generator utilizes the poem from the dataset along with some noise to synthesize new data points that would ultimately resemble the melody corresponding to the input poem. The generator does so by optimizing the loss function that minimizes the difference between the generated sample and the real melody. The role of the discriminator, in this case, is to provide feedback to the generator by attempting to distinguish between the generated and the actual melody. After the training phase, the generator would ideally learn the representation between the input poetry and the corresponding melody. In Section 3, a detailed explanation of the type of GAN architecture as well as the experimental setup used in this work is presented.
There are several challenges to solving this problem of melody generation for poetry. The most natural solution of employing human musicians and poets to collaborate in solving this task is not scalable due to high costs. Moreover, there is a challenge in terms of the language of the poem and the music associated with the culture of the written text.
Therefore, for practical usage, the solution would be to utilize computers to generate melodies. The use of programming languages and the extensive human supervision involved was a major drawback in older approaches to music generation by computers. In this context, the computer's role was to suggest syntheses and leave the human expert to complete the feedback loop and make the final decision of accepting or rejecting the suggestions (Noll, 1967). Functions for amplitude, frequency, and duration of a sequence of notes were drawn on a cathode-ray tube with a light pen, and the computer provided synthesis by combining these functions using simple algorithms (Noll, 1967). Consequently, the melodies produced by such computer programs sounded uniform and artificial.
The challenging aspect of computer music generation comes from the fact that human supervision is required at some level. Besides human supervision, it is not straightforward to program the computer to transform poetry into melody in a way that captures the mood of a poem. This would require further resources to first develop a program that can process the language within the poetry. The solution to both the need for human supervision and language-to-audio representation learning can only be achieved if computers are trained like humans, i.e., to learn from experience, using lots of data and without explicit programming. This is where deep learning, and more specifically GANs, are extremely useful. GANs are good at transforming inputs from one domain to another (Ruzafa, 2020); consequently, in our work, they can help with transforming the poem, a text input, into a representative melody, an audio output. Images generated using GANs have already been shown to be so realistic that humans cannot tell whether a particular art piece was produced by a human artist or a machine (Mazzone & Elgammal, 2019). Therefore, GANs will not only provide a more flexible implementation but will also produce more realistic and diverse melodies that can capture the imagination of poetry readers.
2. Background
This section presents the related works in the literature pertaining to melody and lyrics generation as well as background information on GAN architectures.
2.1. AI and computer generated art
GANs have been effectively used in various art generation applications including visual arts, written arts, music generation, calligraphy style recognition, and melody classification (Shahriar, 2021). A review of the literature in the context of related melody and music generation applications is presented next. Scirea et al. (2015) proposed a system using two Markov chains to generate lyrics from academic research papers along with appropriate melodies to accompany the music. However, this work did not provide an evaluation case study on the quality of the generated lyrics and melodies. In Ackerman and Loker (2017), machine learning models including random forest were used to create an application capable of assisting musicians and amateurs in producing quality songs and music. Although three songs were created using the application, they were not evaluated by case studies. A system that can analyze the mood of a poem and generate a relevant musical component was presented in (Stere & Trăuşan-Matu, 2017). Firstly, a poem was analyzed by considering its rhythm, punctuation, and mood. Based on the analysis, a piano composition was generated to complement the mood of the poem. Machine learning was used to classify the mood of the poem. The authors utilized a pre-trained Naïve Bayes classifier called MusicMood (Raschka, 2016) to classify the poem text as "happy" or "sad". The paper failed to address the accuracy of the mood classification as well as the quality of the generated music. Milon-Flores et al. (2019) presented an application to generate audiovisual summaries of literary texts by producing emotion-based music composition and graph-based animation. Natural language processing algorithms were used to extract emotions and characters from literary texts, but no machine learning frameworks were used in this work. Fukayama et al. (2010) developed an automated Japanese song composer, "Orpheus." The application used the Galatea-Talk tool to convert the textual lyrics to speech and a hidden Markov model to synthesize the singing voice. A survey conducted on 1378 people concluded that 70.8% of the songs were attractive. Monteith et al. (2012) generated melodies for different genres including nursery rhymes and rock songs by extracting the rhythm and pitch from the corresponding lyrics. The results from a case study indicated that the original melodies were more familiar, but the differences were not significant. Generation of music from text was proposed in a system called TransPose (Davis and Mohammad, 2014), which exploited the relationship between various characteristics of music (such as loudness, melody, and tempo) and the emotions they invoke. For example, loud music corresponds to power or anger, while fear or sadness is linked with soft music. One drawback of this approach is that mid-piece key changes cannot be captured with this technique. For further comparison of symbolic music generation techniques, readers are encouraged to refer to the survey by Briot et al. (2020).
Mishra et al. (2019) proposed an RNN-based generative model for melody generation. The model utilized an LSTM architecture to learn the input structure from a text file and was trained to predict the next character in the sequence. A survey concluded that the human-generated and machine-generated melodies could not be distinguished. In Welikala and Fernando (2020), musical instrument digital interface (MIDI) files, which contain data specifying musical instructions such as a note's notation and pitch, were used to train a hybrid variational autoencoder and GAN to generate musical melodies for a specific genre. It was concluded that the consistency of the generated melodies was not up to the level of human composition. Similarly, Xu et al. (2020) proposed a melody generation framework using a GAN consisting of a Bi-LSTM generator and an LSTM discriminator. The proposed model obtained an average score of 3.27 out of 5 from 19 human evaluators, outperforming the baseline models. Yu et al. (2020) used conditional GANs for melody generation. The lyrics encoder tokenized the lyrics into a vector containing a series of word and syllable tokens as well as additional noise. The output of the layer was fed into the conditioned-lyrics GAN. A MIDI sequence tuner was then used to quantize the MIDI notes produced by the generator. A similar study was conducted in "Conditional LSTM-GAN for Melody Generation from Lyrics" (2021), which used conditional LSTM-GANs. Comparison with random and maximum likelihood estimation baselines showed that the proposed model outperformed the baselines in terms of BLEU score. Bao et al. (2019) proposed a melody composer, "Songwriter," to generate lyrics-conditioned melody and its alignment. The authors divided the lyrics into sentences that are processed in sequence. The set of generated melodies was aggregated for each lyric to generate the complete melody. The evaluation using F1 and BLEU scores showed that the proposed model outperformed existing models on all metrics. Dias and Fernando (2019) introduced a system with an interface that generates melody as a sequence of musical notes given English lyrics, without any human intervention. An LSTM was used to train on and predict the melody target data. The human evaluation results showed that the generated melody was pleasant to listen to because of the well-developed notes. The authors of "Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval" (ACM Transactions on Multimedia Computing, Communications, and Applications, 2021) introduced a multimodal music retrieval system that retrieves audio from lyrics data and vice versa. MFCC features were extracted from music audio signals using a CNN, and Doc2vec was used to embed lyrics documents. Based on PixelCNN (van den Oord et al., 2016), WaveNet (van den Oord et al., 2016) is a model for audio generation capable of generating raw audio waveforms. Although case study results were presented, evaluation metrics were not reported. In "Performance RNN: Generating Music with Expressive Timing and Dynamics" (2021), an LSTM models music in symbolic form using MIDI files. The Yamaha e-Piano Competition dataset was used for training, and to supplement the training data, augmentations of each record (e.g., speed and transposition) were also used. Similarly, Boulanger-Lewandowski et al. (2012) explored polyphonic music generation using the restricted Boltzmann machine.
2.2. Deep learning and adversarial networks
Deep learning is a subset of machine learning, where computer systems learn from experience through exposure to training data, without requiring explicit programming. Deep learning architectures are different from traditional machine learning ones because they utilize artificial neural networks (ANNs), a biologically inspired computing technique that replicates the way neurons function in intelligent organisms (Kar, 2016). In many applications of high complexity, other forms of bio-inspired computation may be combined with neural networks to offer a hybrid decision support system (Chakraborty & Kar, 2017). In machine learning, an extensive selection of input features is required to train the algorithms for making intelligent decisions, which makes the process time-consuming. Meanwhile, deep learning models generally learn features from the input training data, which makes them far more effective for tasks such as image and audio classification. While simple ANNs contain a basic structure, and models with fewer than three layers are known as shallow networks, deep learning models have an increasing depth of more than three layers (Bouwmans et al., 2019). The organization in layers allows more complex representations of a problem to be expressed in terms of simpler ones. Other types of deep learning models are suitable for specific sets of problems depending on the nature of the dataset. For instance, convolutional neural networks (CNNs) utilize the convolution operation to merge two sets of information, and a subsequent pooling layer reduces the dimensionality and complexity (Dhillon & Verma, 2020). CNNs are therefore effective in image and video applications such as action recognition from video footage (Yao et al., 2019). One of the factors that enabled the recent success of deep learning models is the massive volume of available data, allowing big data analytics, which is the process of obtaining meaningful information and insights from a variety of data sources (Grover & Kar, 2017). The idea of learning from experience to classify or predict variables can be extended to generating new data points using adversarial networks.
GANs primarily consist of two deep learning architectures, namely a generator and a discriminator. The role of the discriminator can be likened to that of a deep learning classifier. In this context, the discriminator predicts whether a given sample is from the real training set or was generated by the generator. The performance of the discriminator is further used by the generator as feedback as it tries to improve its learning parameters to fool the discriminator. Once the generator successfully learns the distribution of the dataset, it can produce samples that are virtually indistinguishable from other samples in the real dataset. The ability of big data analytics to provide flexible management of information assets (Grover & Kar, 2017) can be further improved with GANs, as they can effectively perform cross-domain transformation. Moreover, in recent times text-based information retrieval and multimedia-based information retrieval have become integral research topics due to the growing amount of these data types (Kushwaha et al., 2021). Therefore, frameworks that can effectively perform cross-domain transfer offer great benefit in terms of information storage and retrieval as well as the extraction of novel insights and knowledge. The use of text-to-text transformation has resulted in unique applications such as paraphrase generation (Palivela, 2021). Consequently, a more challenging aspect, transforming text input to a different domain, i.e., audio, is undertaken in this work using GANs.
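To make the adversarial setup described above concrete, the following is a minimal training-step sketch, assuming PyTorch; the network shapes and hyperparameters are illustrative only and are not those of the models described in Section 3.

    import torch
    import torch.nn as nn

    # Illustrative networks; the actual architectures are described in Section 3.
    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 16384), nn.Tanh())
    D = nn.Sequential(nn.Linear(16384, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    def train_step(real):                           # real: (batch, 16384) audio samples
        batch = real.size(0)
        z = torch.randn(batch, 100)                 # latent noise vector
        # Discriminator step: real samples labeled 1, generated samples labeled 0.
        opt_d.zero_grad()
        loss_d = bce(D(real), torch.ones(batch, 1)) + \
                 bce(D(G(z).detach()), torch.zeros(batch, 1))
        loss_d.backward()
        opt_d.step()
        # Generator step: try to make the discriminator label fakes as real.
        opt_g.zero_grad()
        loss_g = bce(D(G(z)), torch.ones(batch, 1))
        loss_g.backward()
        opt_g.step()
        return loss_d.item(), loss_g.item()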
2.3. GAN architectures and loss functions
WaveGAN and SpecGAN are two types of architectures that are suitable for the proposed application. Most other GAN architectures are built for TTS generation and have not been tested on music or melody generation. Meanwhile, the two considered models have been tested for music generation in existing works. The output music samples are available online, and they sound clear and realistic. Furthermore, the realistic outputs of the WaveGAN and SpecGAN architectures found in Donahue et al. (2019) were due to the selected neural network model (i.e., DCGAN) and the WGAN-GP loss function. The DCGAN, proposed in Radford et al. (2016), combines the CNN and GAN architectures for unsupervised learning. CNN is a well-known model that acts as a feature extractor and a classifier. The convolutional layers extract basic and complex features, which help to identify the input class. The generator and discriminator in Radford et al. (2016) both consist of four convolutional layers with no pooling layers. However, the WaveGAN and SpecGAN models have an extra layer. The original output of the DCGAN is 64×64 pixels, which translates to 4096 samples. Therefore, Donahue et al. (2019) decided to add an extra layer to the generator and discriminator to increase the output to 16,384 samples.
In terms of loss functions, the Wasserstein distance is different from the Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) distance, since it measures the similarity between two probability distributions by comparing the distance between them (i.e., comparing the horizontal difference and not the vertical). The Wasserstein distance is helpful when comparing probability distributions that are not overlapping. KL divergence is known to take an infinite value when the two probability distributions do not overlap (i.e., the generated probability distribution is 0 where the real distribution is greater than 0) (Arjovsky et al., 2017). Meanwhile, JS divergence saturates at log 2 when the distributions do not meet (Arjovsky et al., 2017). Unlike the previous methods, the Wasserstein distance provides a stable distance metric since the output value does not increase exponentially (Arjovsky et al., 2017).
The following conditions summarize the outputs of the distances when the distributions are not overlapping (i.e., $\theta \neq 0$) and when they are entirely overlapping (i.e., $\theta = 0$):

$$W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta| \tag{1}$$

$$JS(\mathbb{P}_0, \mathbb{P}_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0,\\ 0 & \text{if } \theta = 0, \end{cases} \tag{2}$$

$$KL(\mathbb{P}_\theta \parallel \mathbb{P}_0) = KL(\mathbb{P}_0 \parallel \mathbb{P}_\theta) = \begin{cases} +\infty & \text{if } \theta \neq 0,\\ 0 & \text{if } \theta = 0. \end{cases} \tag{3}$$
We can see that the Wasserstein distance is equal to the variable $\theta$ corresponding to the distance between the distributions. Arjovsky et al. (2017) used the Wasserstein distance to calculate the loss of the GAN; Eq. (4) below presents the WGAN objective:

$$V_{\mathrm{WGAN}}(D_w, G) = \mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})] - \mathbb{E}_{\boldsymbol{z} \sim P_Z}[D_w(G(\boldsymbol{z}))] \tag{4}$$

The discriminator $D_w$ no longer detects the real or fake samples; instead, it calculates the Wasserstein distance. Hence, the discriminator is called a "critic," since it tells how close the two distributions are. The first term, $\mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})]$, calculates the expectation with respect to the data sample $\boldsymbol{x}$ taken from the real data, and the second term with respect to the generated sample $G(\boldsymbol{z})$. The network parameter weights of the critic and generator networks are updated to minimize the Wasserstein distance between the distributions.
The clipping method is used in WGAN to satisfy the Lipschitz constraint by limiting the weights of the critic. However, it has some drawbacks, including taking a long time to reach the optimal point for the critic and having vanishing gradients if the number of layers is large or batch normalization is not used (Arjovsky et al., 2017). Hence, the gradient penalty (GP) was introduced in Gulrajani et al. (2017) to better address the Lipschitz constraint by enforcing a softer limit on the critic's weights. The loss with the added GP term is as follows:
$$L = \underbrace{\mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})] - \mathbb{E}_{\boldsymbol{z} \sim P_Z}[D_w(G(\boldsymbol{z}))]}_{\text{Original critic loss}} + \underbrace{\lambda\, \mathbb{E}_{\hat{\boldsymbol{x}} \sim \mathbb{P}_{\hat{\boldsymbol{x}}}}\!\left[\left(\left\|\nabla_{\hat{\boldsymbol{x}}} D(\hat{\boldsymbol{x}})\right\|_2 - 1\right)^2\right]}_{\text{Gradient penalty}} \tag{5}$$

where $\hat{\boldsymbol{x}}$ is an interpolated value between the real and generated samples. The GP takes $\lambda$ as a regularization parameter, and the gradient norm of $\hat{\boldsymbol{x}}$ is subtracted by 1 and squared. As a result, the norm of the gradient stays close to 1, which satisfies the Lipschitz constraint.
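To illustrate Eq. (5) in code, the following is a minimal sketch of the gradient penalty term, assuming PyTorch; the critic D, the batch layout, and λ = 10 are illustrative assumptions, not a definitive implementation.

    import torch

    def gradient_penalty(D, real, fake, lam=10.0):
        # x_hat: random interpolation between real and generated samples
        eps = torch.rand(real.size(0), 1)
        x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
        grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
        # Penalize deviation of the gradient norm from 1 (soft Lipschitz constraint)
        return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

    # Critic objective per Eq. (5):
    # loss = D(real).mean() - D(fake).mean() + gradient_penalty(D, real, fake)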
3. Methodology
This section describes the overall methodology for the proposed application, including the GAN architectures used, data collection, and experimental setup.
3.1. GAN architecture selection
In this work, the audio dataset comprised melodies in raw audio format instead of MIDI representations. WaveGAN and SpecGAN are two types of GAN architectures used for music generation that are compatible with raw audio file formats. We selected WaveGAN for melody generation since it demonstrated better results on the speech commands zero through nine (SC09) dataset used in (Donahue et al., 2019). SpecGAN has a higher chance of overfitting, and its mean opinion scores (MOS) for the human quality, ease, and diversity evaluations are lower than WaveGAN's (Donahue et al., 2019). WaveGAN is specifically suitable because the melody data is in raw audio format. Other popular approaches related to melody and music generation utilized MIDI or symbolic datasets, in which case GAN architectures such as the one in Lim et al. (2017) would be suitable. Meanwhile, SpecGAN computes the STFT of raw audio and uses the iterative Griffin-Lim algorithm for inversion back to audio. In conclusion, the key difference between SpecGAN and WaveGAN is that SpecGAN converts the raw audio to spectrograms before training and performs an inverse transformation back to audio before saving the generations. The WaveGAN implementation is publicly available on GitHub (Donahue, 2021). For our experiments, we modified the following GitHub repository (Adrián, 2021), which is suited to conditional WaveGAN. For the SpecGAN implementation, we modified the implementation found in (GitHub - naotokui/SpecGAN: SpecGAN - generate audio with adversarial training, 2021).
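As a sketch of the SpecGAN-style round trip described above (magnitude spectrogram for training, Griffin-Lim inversion back to audio), assuming the librosa package; the file name and STFT parameters are illustrative, not the exact values used in the cited implementations.

    import numpy as np
    import librosa

    audio, sr = librosa.load("melody.wav", sr=16000)     # hypothetical input clip

    # Forward: magnitude spectrogram used as the training representation
    mag = np.abs(librosa.stft(audio, n_fft=512, hop_length=128))

    # Inverse: Griffin-Lim iteratively estimates the discarded phase
    reconstructed = librosa.griffinlim(mag, n_iter=32, hop_length=128)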
3.2. Data collection
The application of Arabic poetry to melody is a novel one, and consequently there do not exist any Arabic poem-emotion and melody-emotion datasets. Therefore, different sources were utilized to collect data to build novel Arabic poem and melody datasets. To express the appropriate mood of the poetry and link it to its corresponding melody, poetry was collected based on emotions. In the future, this work can be extended to experiment with other attributes such as poem era and poem geographical location. To represent the three emotion classes of poems, the Arabic musical mode or maqām was used. These are patterns of melody, and the maqām technique denotes the pitches, patterns, and construction of a piece of music, unique to Arabic art music (Touma & Touma, 2003). The use of maqām for selecting the melodies
Table 1
Dataset characteristics.

Dataset   Category   Total Samples   Attributes
Poetry    Sad        4064            Contains 9452 poems of varying length, with each row containing the poem text and its corresponding emotion label.
          Love       3443
          Joy        1945
Melody    Saba       8319            Contains 16,322 audio files of 4 seconds each. Each file is labeled with its corresponding maqam name.
          Seka       4611
          Ajam       3392
is suitable due to the ability of various maqams to exclusively convey certain emotions (Shahriar & Tariq, 2021). According to Farraj and Shumays (2019) and Shokouhi and Yusof (2013), melodies belonging to maqām Ajam represent the emotion of joy, Seka represents love, and Saba represents melancholy and sadness. Therefore, the maqām denoting a particular emotion is paired with its corresponding poem denoting the same emotion.
For the poem dataset, a web scraper was developed to collect poems with their corresponding emotions from the AlDiwan website (AlDiwan, 2021). This website was selected because it was the only accessible source where poems were categorized under several emotion categories. Other sources such as Kaggle contained Arabic poetry datasets classified by the era and locations of the poems. For simplicity, three emotion categories, sad, joy, and love, were selected. However, the proposed framework is flexible and can easily be extended to other emotions in the future. A total of 9452 poems were collected: 4064 sad, 3443 love, and 1945 joy. For the melody dataset, YouTube musical plays were used to collect three types of melodies from each of maqams Saba, Seka, and Ajam, which correspond to sad, love, and joy, respectively. The melodies were split into 4-second segments, resulting in 16,322 audio files: a total of 8319 recordings for maqam Saba, 4611 recordings for Seka, and 3392 recordings for Ajam. Since more melodies were available than poems, a poem from each class was randomly paired with a melody from its corresponding maqam class. This meant that a large portion of melodies from the dataset were left unpaired and unused for training. In future work, more poems from the emotion classes can be collected and paired with the unpaired melodies to expand the training set size. The dataset is publicly available on GitHub (https://github.com/SakibShahriar95/Arabic-Poetry-Melody). The dataset attributes are summarized in Table 1.
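A minimal sketch of the 4-second segmentation step described above, assuming librosa and soundfile; the directory layout and the 8192 Hz rate (per Table 2) are illustrative assumptions.

    import glob
    import librosa
    import soundfile as sf

    SEG_SECONDS = 4

    def split_file(path, out_prefix, sr=8192):
        audio, _ = librosa.load(path, sr=sr)          # resample on load
        seg_len = SEG_SECONDS * sr
        for i in range(len(audio) // seg_len):        # drop the trailing partial segment
            sf.write(f"{out_prefix}_{i:04d}.wav",
                     audio[i * seg_len:(i + 1) * seg_len], sr)

    # Hypothetical folder of raw maqam Saba recordings:
    for j, path in enumerate(sorted(glob.glob("raw/saba/*.wav"))):
        split_file(path, f"clips/saba/{j:03d}")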
3.3. Experimental setup
To implement the proposed application of Arabic poetry to melody generation, two different approaches were utilized. Both approaches are GAN-based and are used to generate melodies appropriate for a given input text. The key difference is that the first approach utilizes the text embeddings from the poems to directly train the GAN, whereas the second relies on a separate classifier to first determine the class that a poem belongs to before generating a melody for that class. However, both approaches realize the proposed poetry to melody generation application.
3.3.1. End-to-end (E2E) poetry to melody generation using WaveGAN

In the first approach, no intermediary steps are required. This is because the input poem text, in the form of word embeddings, is directly used to train the GAN architecture. Fig. 2 illustrates the experimental setup for the first approach.
During the training phase, the poems in the dataset are converted to corresponding word embeddings using the AraBERT (Antoun et al., 2020) model. To reduce the computational cost, two steps were performed (a sketch follows the list):

• The first 128 words of each poem were considered to generate the embedding. This was done to reduce computational cost and memory exhaustion.
• T-distributed stochastic neighbor embedding (TSNE) (van der Maaten & Hinton, 2008) was applied to the output embedding of AraBERT, because the default embedding size was (768 × number of words). The number of output components for TSNE was set to 100 to match the latent vector size of 100 for concatenation.
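A minimal sketch of these two steps, assuming the Hugging Face transformers and scikit-learn packages; the AraBERT checkpoint name and the mean-pooling of token embeddings are illustrative assumptions, one plausible reading of the procedure rather than the exact configuration used here.

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.manifold import TSNE

    # Illustrative AraBERT checkpoint name
    tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
    model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv02")

    def embed(poems):  # poems: list of poem text strings from the dataset
        # Truncate each poem to its first 128 tokens before embedding
        inputs = tokenizer(poems, truncation=True, max_length=128,
                           padding="max_length", return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state    # (batch, 128, 768)
        return hidden.mean(dim=1).numpy()                 # pool to (batch, 768)

    # n_components=100 requires method="exact" (barnes_hut supports < 4 dimensions)
    reduced = TSNE(n_components=100, method="exact").fit_transform(embed(poem_texts))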
Therefore, the input to the generator is a noise vector concatenated with the word embeddings (after applying TSNE). The discriminator receives the real melody data to train itself to distinguish between the real melodies and the synthesized ones. After training, the architecture can generate an appropriate melody for a given text embedding. WaveGAN is built on top of the DCGAN architecture. For upsampling, the generator uses transposed convolution (since audio signals are 1-D, a larger kernel of size 25 was used). The rest of the architecture is kept the same as standard DCGAN. For the generator, ReLU activation was used for the first four layers, whereas tanh activation was used for the final layer. Meanwhile, in the discriminator, no activation was used for the first four layers, whereas the last layer used LeakyReLU activation. The training strategy used was WGAN-GP.
3.3.2. Poetry to melody generation using conditional GANs
In this approach, a conditional GAN-based method was used. An intermediary text classifier is first responsible for generating the correct labels to be used by the conditional GAN. Fig. 3 displays the experimental setup for the second approach.
Two conditional GANs, namely conditional WaveGAN and conditional SpecGAN, were experimented with. The generator for the WaveGAN is trained based on the input random noise and the labels of the poem texts, which specify the emotion of the poem. The discriminator takes as input both the actual melodies and the synthesized ones to learn to distinguish between them. The only difference for the SpecGAN implementation is that the raw audio file is first converted to a spectrogram representation for the discriminator to train on. The entire process is reversed during synthesis, when the generated spectrograms are inverse transformed to raw audio representations. During the training phase, a text classifier is not required, as the labeled data is available. However, once the conditional GANs have been trained successfully, we cannot simply input the text or its embedding to generate new melodies. Rather, the label of the poem category must be obtained first by running the text through a text classifier, AraBERT in this case. The label can then be passed on to the conditional GAN to generate a new melody. The most significant drawback of this approach is that the performance of the text classifier plays a big role in the accuracy of the generated melodies. Moreover, an intermediary step is often undesirable in many applications.
Within the conditional approach, the first method utilizes the WaveGAN. The architecture of the conditional WaveGAN is the same as the one used for E2E generation. This includes the number of layers, activations, and the training strategy. The only difference is that in this approach the TSNE outputs for the word embeddings are not required. Rather, the label is concatenated to the noise vector such that the GAN architecture is trained to generate samples from the correct melody class. After the training phase, the generator can be called by passing the label for a class as well as a random noise vector to obtain new melodies for each of the three classes.
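A minimal sketch of this conditioning, assuming PyTorch, with shapes following Table 2 (a 64-dimensional noise vector joined with a 16-dimensional label embedding); the embedding layer itself is an illustrative assumption.

    import torch
    import torch.nn as nn

    NUM_CLASSES = 3                                  # sad, love, joy
    label_embed = nn.Embedding(NUM_CLASSES, 16)      # (16, 1) label embedding per Table 2

    def generator_input(batch_size, labels):
        z = torch.randn(batch_size, 64)              # (64, 1) noise vector per Table 2
        c = label_embed(labels)                      # (batch, 16)
        return torch.cat([z, c], dim=1)              # conditioned latent fed to G

    # e.g., latents for a batch of "joy" (class 2) melodies:
    z_cond = generator_input(8, torch.full((8,), 2, dtype=torch.long))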
Fig. 2. End-to-end poetry to melody generation.

Fig. 3. Conditional poetry to melody generation.

The second conditional approach utilizes SpecGAN. As mentioned previously and illustrated in Fig. 3, this approach makes use of a spectrogram representation. Therefore, a preprocessing step is employed before the training phase. During this stage, the audio files for each category are loaded. We then convert each audio clip into a 128×128 Mel-spectrogram representation. We then save the spectrograms, the class labels, and the Mel feature mean and standard deviation into .npz (NumPy) format for storage of array data. Thereafter, this file can directly be used for the training phase without running the preprocessing steps multiple times. During the training phase, the training ratio is set to 5 as per the approach proposed in (Donahue et al., 2019). This ensures that five discriminator updates happen per generator update. The generator consists of 11 layers as follows: a dense layer of 16,384 units is followed by 5 blocks of UpSampling2D and Conv2D layers. All the activations are ReLU except for the last Conv2D layer, which has tanh activation. The discriminator, on the other hand, consists of six layers. First, there are five Conv2D layers followed by LeakyReLU activations. Finally, after flattening, there are two dense layers. The first dense layer is responsible for determining whether the generated audio is real or fake, so it has a single output. However, no activation is used due to the use of the Wasserstein loss function. The other dense layer is responsible for generating the probabilities that the audio belongs to one of the categories. Therefore, it has three outputs with a softmax activation function. The generator learning rate is set to 0.0001 with the Adam optimizer, and the model is optimized on the Wasserstein loss. For the discriminator, the learning rate and optimizer are the same as those of the generator. In addition to the Wasserstein loss, the discriminator also has a gradient penalty loss and a categorical cross-entropy loss for predicting the classes. During synthesis, we first call the generator to predict the spectrogram by giving it the input noise and the category label. Next, we apply the inverse Mel-spectrogram transform to obtain the raw audio files.
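A minimal sketch of this preprocessing stage, assuming librosa; the folder layout, STFT parameters, and global normalization are illustrative assumptions consistent with the description above.

    import glob
    import numpy as np
    import librosa

    MAQAM_DIRS = {"saba": 0, "seka": 1, "ajam": 2}   # hypothetical folder layout

    def to_melspec(path, sr=16000):
        audio, _ = librosa.load(path, sr=sr, duration=1.0)    # 1-second SpecGAN clips
        mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128,
                                             n_fft=512, hop_length=125)
        return librosa.power_to_db(mel)[:, :128]              # 128 x 128 representation

    paths, labels = [], []
    for name, label in MAQAM_DIRS.items():
        for p in sorted(glob.glob(f"clips/{name}/*.wav")):
            paths.append(p)
            labels.append(label)

    specs = np.stack([to_melspec(p) for p in paths])
    mean, std = specs.mean(), specs.std()
    np.savez("dataset.npz", specs=(specs - mean) / std,       # normalized spectrograms
             labels=np.array(labels), mel_mean=mean, mel_std=std)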
4. Evaluation and results

In this section, results from the different approaches will be compared using qualitative evaluation and common quantitative evaluation metrics for GANs. The following sections present an explanation of these metrics and the corresponding results.
4.1. Quantitative validation
The leave-one-out (LOO) and GAN-train/GAN-test metrics will be used to validate the model performance. LOO uses n−1 sample points for training, where n is the number of samples, and the remaining sample for testing. The test point is assigned the class (i.e., fake or real) of its nearest sample in terms of distance. The objective is to train a k-nearest neighbor (K-NN) algorithm with a K value of 1 (1-NN) and apply the LOO approach for training the K-NN. For a GAN, an LOO score of 0.5 is ideal, where the discriminator has a 50% chance of correctly labeling the point. If the LOO value is less than 0.5, it likely indicates overfitting, since the generated and real points are very close. On the other hand, an LOO value greater than 0.5 means that the discriminator can easily distinguish between real and fake samples. Furthermore, to calculate the distance between the sampled point and the nearest neighbor, the Euclidean distance will be used. Given two points $x$ and $y$, the Euclidean distance between them is given by Eq. (6):

$$\mathrm{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2} \tag{6}$$
Moreover, the dynamic time warping (DTW) algorithm (Salvador and Chan, 2007) will be used to calculate the distance between the distributions of the real examples and the generated ones. The LOO method using the DTW distance will also be utilized in the K-NN algorithm.
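A minimal sketch of the 1-NN LOO evaluation, assuming scikit-learn; the sample arrays are placeholders, and the DTW variant is obtained by swapping a pairwise DTW distance in for the default Euclidean metric.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    # real and fake: arrays of flattened samples (placeholders for the actual data)
    X = np.vstack([real, fake])
    y = np.array([1] * len(real) + [0] * len(fake))   # 1 = real, 0 = generated

    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        knn = KNeighborsClassifier(n_neighbors=1)     # 1-NN, Euclidean by default;
        knn.fit(X[train_idx], y[train_idx])           # pass metric=<DTW callable>
        correct += knn.predict(X[test_idx])[0] == y[test_idx][0]   # for the DTW variant

    loo = correct / len(X)   # 0.5 is ideal: real and fake are indistinguishable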
GAN-test is when the classifier is trained on the actual samples and tested on the generated samples, and GAN-train is the opposite (Shmelkov et al., 2018). From the GAN-test, it can be determined whether the model is overfitting, does not capture the actual distribution, or generates realistic samples (Shmelkov et al., 2018). If the testing accuracy is greater than the validation accuracy, the generated samples capture the fine details of the training set and hence resemble the real samples. Meanwhile, the diversity of the generated samples can be assessed using the GAN-train accuracy: the closer the testing accuracy is to the validation accuracy, the more similar the generated results are to the real ones, and the more diverse they are. For building both networks, the ResNet34 (ResNet-34, 2021) architecture was used. The same architecture was used because the goal of this evaluation was to estimate the similarity between the generated and real samples by assessing the GAN-train and GAN-test accuracies. Optimizing the architectures further would potentially lead to higher scores for both. However, in this case, the objective was to obtain a general idea of the accuracies and not necessarily to optimize them.

Table 2
Architecture parameters and conditions for each experiment.

Approach              Batch Size   Input Layers (G, D)   Target Length (Sec.)   Sample Rate (Hz)
E2E-WaveGAN           64           (6, 8)                4                      8192
Conditional-WaveGAN   32           (6, 8)                4                      8192
Conditional-SpecGAN   64           (11, 6)               1                      16000

Approach              Noise Vector   Cond. Input Format                    Batch Norm.   Epochs
E2E-WaveGAN           (100, 1)       AraBERT + TSNE (100, 1)               True          9200
Conditional-WaveGAN   (64, 1)        Label joined with embedding (16, 1)   True          15000
Conditional-SpecGAN   (97, 1)        One-hot encoded (3, 1)                False         3000
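A schematic sketch of the two metrics; a simple scikit-learn classifier stands in for the ResNet34 used in this work, and the spectrogram/label arrays are placeholders.

    from sklearn.linear_model import LogisticRegression

    # gen_X, gen_y, real_X, real_y: spectrogram and label arrays (placeholders)
    def fit(X, y):   # stand-in classifier; this work trains a ResNet34 instead
        return LogisticRegression(max_iter=1000).fit(X.reshape(len(X), -1), y)

    # GAN-train: train on generated samples, evaluate on real ones (diversity)
    gan_train = fit(gen_X, gen_y).score(real_X.reshape(len(real_X), -1), real_y)

    # GAN-test: train on real samples, evaluate on generated ones (quality)
    gan_test = fit(real_X, real_y).score(gen_X.reshape(len(gen_X), -1), gen_y)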
4.2. Qualitative validation
To validate the quality of the generated melodies, a qualitative case study was conducted. We invited twelve volunteers, all university students, to evaluate and score both the generated and the original melodies. The participants were presented with ten audio samples, of which five belonged to the real melodies and five were GAN-synthesized, and then answered a total of four questions. The generated samples were taken from the SpecGAN, as the other models produced poor-quality melodies. For each audio sample, the evaluators were asked the following questions and asked to rate each between 1 and 5, indicating very poor (1) to very good (5).
• The overall quality of the melody played (1 poor quality, 5 great quality).
• The melody played is enjoyable (1 not very enjoyable, 5 very enjoyable).
• The melody played is interesting (1 not very interesting, 5 very interesting).
• The melody played sounds artificial (1 sounds very artificial, 5 sounds very realistic).
Following the survey, the scores of each question for both the real and the generated melodies will be aggregated and compared. Although the small sample size limits how conclusive the results can be, we can still obtain a general perception of the quality of the generated melodies. We are particularly interested in whether the generated melodies were of good quality and did not sound artificial. Moreover, if the evaluators found them interesting and enjoyable, it would further add to the success of the generated melodies.
4.3. Results
Three different experiments were conducted for the poetry to melody generation application. Table 2 outlines the parameters and conditions under which each of the models was experimented. The architectures were already presented in greater detail in the previous section.
4.3.1. Quantitative results
The first experiment was the end-to-end approach using WaveGAN conditioned on text embeddings, and the model was trained for 9200 epochs with a batch size of 64. With a sampling rate of 8192 Hz, an output of 4 seconds was obtained, which matched the size of the files in the audio dataset. However, the melodies generated from the embeddings in this approach were of poor quality, as they sounded like random noisy instruments. We believe this is due to the transformation of the word embedding vector, which for computational reasons could not be directly used as an input and had to be downsampled. Using AraBERT, we obtained embeddings of 254 by 768 for each input text. In contrast, CIFAR-10 images are of size 32 by 32. The extremely large size of the AraBERT model meant that sufficient memory could not be allocated for training. As a result, the input size was significantly reduced. Moreover, to match the dimensions, a TSNE transformation was performed, which meant that further information was lost. Therefore, we believe that in the future smaller text embedding models and a more sophisticated downscaling approach are required. The LOO result using the Euclidean distance indicates that the model is leaning towards the optimal result, with a value of 0.58. However, the discriminator has a higher chance of distinguishing between the real and fake samples than the generator has of tricking it. Meanwhile, the value of the LOO using the DTW distance is 0.91, which shows an entirely different outcome where the model is highly capable of distinguishing between the samples. Finally, the DTW distance between the actual and generated distributions is 361,382, which is high. Consequently, this means that the generator is not creating realistic samples that would fool the discriminator.
In the second experiment, the same WaveGAN architecture was utilized but with a different input representation. In this case, the class labels were used instead of the word embeddings. The model was trained for 15,000 epochs with a batch size of 32. The other configurations are the same as in the first experiment. The results obtained using this approach sounded more coherent, although a post-processing noise filter was required to denoise the generated melodies. It is possible to further improve the performance of this architecture by using additional data and hyperparameter tuning. Because only about 10 epochs could be run per minute, the amount of tuning that could be done was restricted. The LOO with the Euclidean distance shows a near-optimal value of 0.51, which means that the discriminator has a 0.5 probability of correctly classifying a sample. The LOO with the DTW behaves similarly to the previous architecture, in that the value is higher (i.e., worse) than with the Euclidean distance; for this model, the value is 0.82. The DTW distance is 150,063, which is lower than that of the E2E-WaveGAN; hence the LOO-DTW value is slightly better. The GAN-test accuracy of 0.36 is higher than the GAN-train accuracy of 0.31, which reveals that the generated samples are less diverse. Overall, the GAN-test and GAN-train accuracies are very low, which indicates that the produced samples have poor quality and diversity.
In the final experiment, a conditional SpecGAN architecture was used. Unlike the previous two experiments, a spectrogram representation is utilized here. Because the original implementation of SpecGAN was fixed to a one-second representation (Donahue et al., 2019), audio synthesis of one-second length was produced using this approach. As per the recommendation in (Donahue et al., 2019), batch normalization was not implemented for this architecture. The model was trained for a total of 3000 epochs with a batch size of 64. The results obtained using this architecture sounded aesthetically more pleasant than the others. The only downside is that the length of the audio was one second, as opposed to the input audio dataset files, which are four seconds long. A LOO score of 0.49 was obtained with the Euclidean distance, which is close to perfect, indicating that the discriminator has a 0.5 probability of correctly classifying a sample. However, using the DTW distance with the LOO, the score was 0.78. This indicates that the generated samples are likely to be distinguished. Overall, the score is better compared to the previous two experiments. The GAN-train accuracy of 0.4 is higher than the GAN-test accuracy of 0.31. This indicates that the target distribution was not captured well and there is a lack of audio quality. Finally, the DTW distance of 92,327 indicates that the generated distribution is closer to the actual one compared with the first two experiments. As a result, the best results were obtained using the conditional SpecGAN.

Fig. 4. Spectrogram comparison for six generated and real samples.
For further evaluation, six spectrograms were selected that were generated by the three approaches, along with the spectrograms of real samples. Fig. 4 displays six randomly selected Mel-spectrograms of original music and of the music generated by conditional WaveGAN, conditional SpecGAN, and E2E WaveGAN. The original files contain a wide range of signals spanning a large range of frequencies. This contrasts with the signals generated through E2E WaveGAN and, to some extent, conditional WaveGAN. The former generated more monotonic signals with white noise, while the latter shows signals modulated possibly through the provided label or the latent space variation. The closest waveforms are generated by SpecGAN, which shows that its generated signals are more diverse and closer to the original ones. This argument is also supported by the scores obtained through the qualitative assessment.
The quantitative results obtained from the three experiments using the aforementioned metrics are summarized in Table 3. Fig. 5 also illustrates the comparison of the DTW distance between the real and generated data distributions for each algorithm. A lower DTW distance indicates that the generated and real data points are closer, and therefore a lower value in this case indicates better performance. As can be seen from the table, in terms of LOO evaluation, both conditional WaveGAN and SpecGAN outperformed E2E WaveGAN. However, both of these models obtained low scores in terms of GAN-train and GAN-test accuracies. Since the E2E-WaveGAN generated data points based on text embeddings, there were no labels on which to perform the GAN-train and GAN-test evaluations. In terms of DTW distance, as displayed in Fig. 5, it can be observed that SpecGAN outperformed the other models. After assessing the samples generated from all three experiments, it was clear that the conditional SpecGAN ones were the closest to real samples. In contrast, conditional WaveGAN produced noisy melodies and the E2E WaveGAN produced melodies that were not meaningful. Consequently, only the audio generated by conditional SpecGAN was selected for the qualitative case study.

Fig. 5. DTW distance between real and generated distributions.
4.3.2. Qualitative results
To assess the quality of the melodies generated by the conditional SpecGAN, twelve volunteers were asked to answer four quality-related questions. The evaluation criteria were how enjoyable, interesting, and realistic the melodies were, as well as their overall quality, each scored between 1 and 5. The average scores along with their standard deviations are summarized in Table 4. Sample melodies generated using conditional SpecGAN are available online (https://github.com/SakibShahriar95/Arabic-Poetry-Melody).

Table 3
Quantitative results comparison.

Approach              LOO (Euclidean)   LOO (DTW)     GAN-Train   GAN-Test   DTW Distance
E2E-WaveGAN           0.58 ± 0.49       0.91 ± 0.29   N/A         N/A        361382
Conditional-WaveGAN   0.51 ± 0.50       0.82 ± 0.39   0.31        0.36       150063
Conditional-SpecGAN   0.49 ± 0.50       0.78 ± 0.41   0.40        0.31       92327

Table 4
Qualitative results for conditional SpecGAN.

Evaluation        Real Melodies   SpecGAN Generated
Enjoyable         3.73 ± 0.61     2.60 ± 0.28
Interesting       3.75 ± 0.45     2.53 ± 0.31
Realistic         3.83 ± 0.29     2.75 ± 0.38
Overall Quality   3.72 ± 0.32     2.53 ± 0.32

Table 4 indicates that real melodies are rated higher under all four criteria compared to the generated ones. However, for all the measures, the generated melodies scored above average (2.5). More interestingly, the highest of the four measures for the generated melodies was realism, with participants on average rating them 2.75 out of 5 for being realistic. This means the participants considered the SpecGAN-generated audio more likely to be realistic than fake. This indicates to us that the overall results for the conditional SpecGAN are quite promising and can be improved upon in the future. Fig. 6 illustrates the comparison from the qualitative case study.

Fig. 6. Survey result comparison between real and generated melodies.
5. Discussion
Despite the comparatively limited research in GAN-based cross-domain audio generation, the results obtained for the novel poetry to melody generation were promising. In this section, some of the challenges, implications, and future work of this application are discussed.
5.1. Challenges
Several notable challenges were overcome to implement the proposed GAN-based poetry to melody generation framework. Firstly, there were no existing datasets of poetry and melodies paired based on suitable mood and emotion. As such, a significant contribution was the introduction of an Arabic poetry and melody dataset. Related works focused on generating melodies from given lyrics by aligning the lyrics with the musical notes, synthesizing the lyrics, and combining them with the generated melody to create suitable songs. Meanwhile, in this work, the objective is to generate a novel melody for each poem representing the emotion conveyed by the poem. The lack of cross-domain audio generation research in the literature made this further challenging.
Moreover, the representation of audio signals is quite complex and spatially costly in digital form. Music signals can form the melody at various time scales, which cannot be captured algorithmically. The spatial complexity, i.e., the high fidelity rate of audio signals, is a problem because it requires large computational power. These two problems, varying time-scale consistency and high fidelity, make it very difficult to model the melodies. Various algorithms prioritize the global structure at the cost of high computational power, whereas others focus on using proxy representations like spectrograms to conserve computational power. Therefore, solutions to this tradeoff cannot be easily generalized and require significant domain knowledge and experimentation.
5.2. Contributions to literature
As discussed in the previous sections, this work presented the first poetry to melody generation using GANs. Consequently, a novel dataset was introduced containing paired poetry and melody based on emotions. The dataset can be used by researchers to further improve melody generation. They can also experiment with new tasks such as melody to poetry generation as well as text and audio classification in the Arabic language. Moreover, this work demonstrated reasonable performance for melody generation using the SpecGAN architecture, and the implementation can be utilized for related applications. However, the results obtained in this work also highlight the immense challenge of dealing with raw audio, especially for text-to-audio transformation in an end-to-end manner. Therefore, the necessary computational power and training time are required to further enhance the generation quality.
5.3. Implications for practice
The practical implications of this research are manifold. Firstly, the user experience of reading poetry will be greatly enhanced with the addition of melody. This can be achieved by integrating the proposed poetry to melody generation framework into a mobile or web application. Based on the poetry being read by the user, a melody matching the mood of the poem will be played in real time to enhance the reading experience. Moreover, the proposed framework can be of great benefit to the media production industry. For instance, a video scene where a poem is being recited may have a background melody to accompany it. Instead