
How can generative adversarial networks impact computer generated art? Insights from poetry to melody conversion



Zayed University

ZU Scholars

All Works

4-1-2022

How can generative adversarial networks impact computer generated art? Insights from poetry to melody conversion

Sakib Shahriar, Zayed University

Noora Al Roken, American University of Sharjah

Follow this and additional works at: https://zuscholars.zu.ac.ae/works

Part of the Computer Sciences Commons

Recommended Citation

Shahriar, Sakib and Roken, Noora Al, "How can generative adversarial networks impact computer generated art? Insights from poetry to melody conversion" (2022). All Works. 4933.

https://zuscholars.zu.ac.ae/works/4933

This Article is brought to you for free and open access by ZU Scholars. It has been accepted for inclusion in All Works by an authorized administrator of ZU Scholars. For more information, please contact [email protected].


Contents lists available at ScienceDirect

International Journal of Information Management Data Insights

journal homepage: www.elsevier.com/locate/jjimei

How can generative adversarial networks impact computer generated art?

Insights from poetry to melody conversion

Sakib Shahriar a,∗, Noora Al Roken b

a College of Technological Innovation, Zayed University, Abu Dhabi, United Arab Emirates
b Department of Computer Science and Engineering, American University of Sharjah, Sharjah, United Arab Emirates

A R T I C L E I N F O

Keywords:
Generative adversarial networks
Deep learning
Melody generation
Music generation
Artificial intelligence
Computer generated art

A B S T R A C T

Recent advances in deep learning and generative adversarial networks (GANs), in particular, have enabled interesting applications including photorealistic image generation, image translation, and automatic caption generation. This has opened up possibilities for many cross-domain applications in computer generated arts and literature. Although there are existing software-based approaches for generating musical accompaniment of a given poetry, there are no existing implementations using GANs. This work proposes a novel poetry to melody generation conditioned on poem emotion using GANs. A dataset containing pairs of poetry and melody based on three emotion categories is introduced. Furthermore, various GAN architectures including SpecGAN and WaveGAN were explored for automatic melody synthesis for a given class of poetry. Conditional SpecGAN produced the best melodies according to quantitative metrics. Melodies produced by SpecGAN were evaluated by volunteers who deemed the quality to be above average.

1. Introduction

Computer art was first discovered as a way to produce objective art pieces by relying on probability theories (Chamberlain et al., 2018). Since the 1970s, the artist Harold Cohen experimented with computer sketch generation, where he manually coded the painting structure into the program (Chamberlain et al., 2018). Unlike artificial intelligence (AI), regular computer programs could not intelligently produce a new art piece without the designer's intervention. With the rapid developments of AI, computer-generated drawings, music, and poetry have improved. Advanced learning algorithms and evolutionary models have generated complex and realistic artworks (Chamberlain et al., 2018). In 2018, the portrait of Edmond de Belamy was sold for $430,000 in an auction, at a much higher price compared to the anticipated price tag of $7,000-$10,000 for a computer-generated art (Goold, 2021).

The Edmond de Belamy case was the first case of an AI-generated artwork in an auction, where the French creators fed 30,000 portraits to an available machine learning algorithm and selected the best-generated picture (Goold, 2021). Finally, the owners added their signatures to the bottom right corner. Furthermore, recent interest in adversarial training for artwork generation has grown significantly. The network tries to generate realistic outputs, and the critic or discriminator tries to distinguish between the real and generated samples. Earlier, a human played the critic's role, and the program was a form of an evolutionary

∗ Corresponding author.

E-mail address: [email protected] (S. Shahriar).

algorithm or a neural network (Soderlund & Blair, 2018). A group of outputs (e.g., images or music) is presented to the critic, and a subgroup with the best results is selected and passed to the next stage or iteration. Such a process is time-consuming, and hence, the entire procedure has been automated by replacing the critic with a neural network that can learn discriminant features to distinguish between the inputs (Soderlund & Blair, 2018). Consequently, researchers and artists in recent times have started utilizing adversarial learning to generate various art forms including visual, textual, and audio arts.

Generating music that expresses a particular emotion conveyed by a poem helps the reader understand the piece of literature on a deeper level. Music accompaniments for poems can also be used to enhance the mood of the poetry readers. However, insufficient work has been carried out for generating musical pieces for poems while considering the emotion of the text. To the best of our knowledge, no research application has yet been proposed that relates to poem and melody generation using deep learning. The existing literature mainly focuses on generating music for other art forms such as song lyrics. Hence, in this work, we present the first Arabic poetry to melody generation utilizing generative adversarial networks (GANs). GANs can generate new content based on a min-max game between two networks, the generator and the discriminator (Ruzafa, 2020). The generator tries to generate fake samples and improve its network to trick the discriminator. Meanwhile, the discriminator tries to distinguish between the real and fake samples. More sophisticated GAN architectures can make use of labels to

https://doi.org/10.1016/j.jjimei.2022.100066

Received 8 November 2021; Received in revised form 26 February 2022; Accepted 26 February 2022

2667-0968/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).


Fig. 1. Graphical illustration of the proposed application.

generate data points for a specific category. This is useful because in the standard GAN, data points are generated only based on the input noise, and unconditional generation of data points is not suitable for many applications.

Although GANs have been widely utilized for various applications including 3D object generation, medicine, and traffic control (Aggarwal et al., 2021), their adoption in audio data, and more specifically cross-domain audio generation including text to audio and image to audio generation, is very limited. The specific problem of text to audio transformation can be highlighted further by a related research application, text-to-speech (TTS) transformation. Most TTS processes carried out by neural network models are auto-regressive (AR), which means that the speech segments are generated based on the previously synthesized segments. AR methods do not support parallelization (i.e., the use of GPU) due to their sequential nature (Prenger et al., 2019; Li et al., 2019). Consequently, AR models are not efficient for real-time applications because of the consecutive calculations that require time (Okamoto et al., 2019). Ren et al. (2019) added how the commonly used AR TTS models such as Tacotron, Tacotron 2, Deep Voice 3, and ClariNet generate Mel-spectrograms from the text inputs for audio synthesis. The use of spectrograms further slows down the audio generation since each input is converted to a Mel-spectrogram, and the sequence length is hundreds or thousands. Mel-spectrograms are susceptible to word-skipping or repeating, which causes inconsistency between the text and generated speech (Ren et al., 2019). Besides the difficulties in TTS generation, melody or music generation comes with its own set of challenges. Dieleman et al. (2018) explain why music generation has received less attention than speech due to its complexity. MIDI and other symbolic music representations have been used in the literature to represent music concisely. However, many essential features that describe the instruments used are removed when using symbolic representations. The piano can be translated into MIDI sequences, but it is difficult to represent most instruments. It is more efficient to use raw music audio signals since they retain the needed information. However, it is even more difficult to model them, i.e., to capture musical structure over an extended period.
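The sequence-length issue noted above can be made concrete: even a short clip yields hundreds of spectrogram frames. The sketch below is a minimal NumPy illustration; the 16 kHz sample rate, 1024-sample window, and 256-sample hop are illustrative assumptions, not the settings of the TTS models cited.

```python
import numpy as np

def stft_frames(n_samples: int, n_fft: int, hop: int) -> int:
    """Number of STFT frames for a signal, without padding."""
    return 1 + (n_samples - n_fft) // hop

def magnitude_spectrogram(x, n_fft=1024, hop=256):
    """Naive magnitude STFT: frame the signal, apply a Hann window, take |FFT|."""
    n_frames = stft_frames(len(x), n_fft, hop)
    window = np.hanning(n_fft)
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft//2 + 1, n_frames)

sr = 16_000                                       # assumed sample rate
x = np.sin(2 * np.pi * 440 * np.arange(4 * sr) / sr)  # 4 s of a 440 Hz tone
S = magnitude_spectrogram(x)
print(S.shape)  # (513, 247): hundreds of frames for just 4 s of audio
```

Each of those 247 frames must be produced in order by an auto-regressive model, which is exactly why the sequential generation discussed above is slow.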

Therefore, the proposed text to audio generation application investigates the competency of GANs with audio data. The objective of this study is to address some of the existing challenges in audio generation and further examine audio generation from a given input text, which has not received sufficient attention in existing research works.

A high-level graphical illustration of the proposed framework is presented in Fig. 1. The generator utilizes the poem from the dataset along with some noise to synthesize new data points that would ultimately resemble the melody corresponding to the input poem. The generator does so by optimizing the loss function that minimizes the difference between the generated sample and the real melody. The role of the discriminator, in this case, is to provide feedback to the generator by attempting to distinguish between the generated and the actual melody. After the training phase, the generator would ideally learn the representation between the input poetry and the corresponding melody. In Section 3, a detailed explanation of the type of GAN architecture as well as the experimental setup used in this work is presented.

There are several challenges to solving this problem of melody generation for poetry. The most natural solution of employing human musicians and poets to collaborate in solving this task is not scalable due to high costs. Moreover, there is a challenge in terms of the language of the poem and the music associated with the culture of the written text. Therefore, for practical usage, the solution would be to utilize computers to generate melodies. The use of programming languages and the extensive human supervision involved was a major drawback in older approaches to music generation by computers. In this context, the computer's role was to suggest syntheses and leave the human expert to complete the feedback loop and make the final decision of accepting or rejecting the suggestions (Noll, 1967). Functions for amplitude, frequency, and duration of a sequence of notes were drawn on a cathode-ray tube with a light pen, and the computer provided synthesis by combining these functions using simple algorithms (Noll, 1967). Consequently, the melodies produced by such computer programs sounded uniform and artificial.

The challenging aspect of computer music generation comes from the fact that human supervision is required at some level. Besides the human supervision, it is not straightforward to program the computer to transform the poetry into melody to capture the mood of a poem. This would require further resources to first develop a program that can process the language within the poetry. The solution to both the need for human supervision as well as language to audio representation learning can only be achieved if computers are trained like humans, i.e., to learn from experience, using lots of data and without explicit programming. This is where deep learning, and more specifically GANs, are extremely useful. GANs are good at transforming inputs from one domain to another (Ruzafa, 2020), and consequently, in our work, they can help with transforming the poem, which is a text input, into a representative melody, an audio output. Images generated using GANs have already been shown to be realistic to the point where humans cannot tell whether a particular art piece was produced by a human artist or a machine (Mazzone & Elgammal, 2019). Therefore, GANs will not only provide a more flexible implementation but will also produce more realistic and diverse melodies that can capture the imagination of the poetry readers.

2. Background

This section presents the related works in the literature pertaining to melody and lyrics generation as well as background information on GAN architectures.

2.1. AI and computer generated art

GANs have been effectively used in various art generation applications including visual arts, written arts, music generation, calligraphy style recognition, and melody classification (Shahriar, 2021). A review of the literature in the context of related melody and music generation applications is presented next. Scirea et al. (2015) proposed a system using two Markov chains to generate lyrics from academic research papers along with appropriate melodies to accompany the music. However, this work did not provide an evaluation case study on the quality of the lyrics and melodies generated. In Ackerman and Loker (2017), machine learning models including random forest were used to create an application that is capable of assisting musicians and amateurs in producing quality songs and music. Although three songs were created using the application, they were not evaluated by case studies. A system that can analyze the mood of a poem and generate a relevant musical component was presented in (Stere & Trăuşan-Matu, 2017). Firstly, a poem was analyzed by considering its rhythm, punctuation, and mood. Based on the analysis, a piano composition was generated to complement the mood of the poem. Machine learning was used to classify the mood of the poem.

The author utilized a pre-trained Naïve Bayes classifier called MusicMood (Raschka, 2016) to classify the poem text as "happy" or "sad". The paper failed to address the accuracy of the mood classification as well as the quality of the generated music. Milon-Flores et al. (2019) presented an application to generate audiovisual summaries of literary texts by producing emotion-based music composition and graph-based animation. Natural language processing algorithms were used to extract emotions and characters from literary texts, but no machine learning frameworks were used in this work. Fukayama et al. (2010) developed an automated Japanese song composer, "Orpheus." The application used the Galatea-Talk tool to convert the textual lyrics to speech and the hidden Markov model to synthesize singing voice. A survey conducted on 1378 people concluded that 70.8% of the songs were attractive. Monteith et al. (2012) generated melody for different genres including nursery rhymes and rock songs by extracting the rhythm and pitch from the corresponding lyrics. The results from a case study indicated that the original melodies were more familiar, but the differences were not significant. Generation of music from text was proposed in a system called TransPose (Davis and Mohammad, 2014), which exploited the relationship between various characteristics of music (such as loudness, melody, and tempo) and the emotions they invoke. For example, loud music corresponds to power or anger, while fear or sadness is linked with soft music. One drawback of this approach is that mid-piece key changes cannot be captured with this technique. For further comparison of symbolic music generation techniques, readers are encouraged to refer to the following survey (Briot et al., 2020).

Mishra et al. (2019) proposed an RNN-based generative model for melody generation. The model utilized LSTM architecture to learn the input structure from a text file and was trained to predict the next character in the sequence. It was concluded from a survey that the human-generated and machine-generated melodies could not be distinguished. In Welikala and Fernando (2020), musical instrument digital interface (MIDI) files, which contain data to specify musical instructions such as a note's notation and pitch, were used to train a hybrid variational autoencoder and GAN to generate musical melody for a specific genre. It was concluded that the consistency of generated melodies was not up to the same level as human composition. Similarly, Xu et al. (2020) proposed a melody generation framework using GAN, consisting of a Bi-LSTM generator and an LSTM discriminator. The proposed model on average obtained a score of 3.27 out of 5 from 19 human evaluators, outperforming the baseline models. Yu et al. (2020) used conditional GANs for melody generation. The lyrics encoder tokenized the lyrics into a vector containing a series of word and syllable tokens as well as additional noises. The output of the layer was fed into the conditioned-lyrics GAN. A MIDI sequence tuner was then used to quantize the MIDI notes produced by the generator. A similar study was conducted in (Conditional LSTM-GAN for Melody Generation from Lyrics, 2021), which used conditional-LSTM GANs. Comparison with random and maximum likelihood estimation baselines showed that the proposed model outperformed the baselines in terms of BLEU score. Bao et al. (2019) proposed a melody composer, "Songwriter", to generate lyrics-conditioned melody and its alignment. The authors divided the lyrics into sentences that are processed in sequence. The set of generated melodies was aggregated for each lyric to generate the complete melody. The evaluation using F1 and BLEU scores showed that the proposed model outperformed existing models on all metrics. Dias and Fernando (2019) introduced a system with an interface that generates melody using a sequence of musical notes given English lyrics without any human intervention. An LSTM was used to train and predict the melody target data. The human evaluation result showed that the generated melody was pleasant to listen to because of the well-developed notes. The authors in Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval (ACM Transactions on Multimedia Computing, Communications, and Applications, 2021) introduced a multimodal music retrieval system that retrieves audio from lyrics data and vice versa. MFCC features were extracted from music audio signals using a CNN, and Doc2vec was used to embed lyrics documents. Based on PixelCNN (van den Oord et al., 2016), WaveNet (van den Oord et al., 2016) is a model for audio generation. It is possible to generate raw audio waveforms with this model. Although case study results were presented, evaluation metrics were not reported. In Performance RNN: Generating Music with Expressive Timing and Dynamics (2021), an LSTM models the music in symbolic form using MIDI files. The Yamaha e-Piano Competition dataset was used for training, and to supplement the training data, augmentations of each record (e.g., speed and transposition) were also used. Similarly, Boulanger-Lewandowski et al. (2012) explored the application in polyphonic music generation using the restricted Boltzmann machine.

2.2. Deep learning and adversarial networks

Deep learning is a subset of machine learning, where computer systems learn from experience by getting exposure to training data and without requiring explicit programming. Deep learning architectures are different from traditional machine learning ones because they utilize artificial neural networks (ANNs), a biologically inspired computing technique that replicates the way neurons function in intelligent organisms (Kar, 2016). In many applications that are of high complexity, other forms of bio-inspired computation may be combined with neural networks to offer a hybrid decision support system (Chakraborty & Kar, 2017). In machine learning, an extensive selection of input features to train the algorithms for making intelligent decisions is required, which makes the process time-consuming. Meanwhile, deep learning models generally learn features from the input training data, which makes them far more effective for tasks such as image and audio classification. While simple ANNs contain a basic structure, and models with less than three layers are known as shallow networks, deep learning models have an increasing depth of more than three layers (Bouwmans et al., 2019). The organization in layers allows more complex representations of a problem to be expressed in terms of simpler representations. Other types of deep learning models are suitable for specific sets of problems depending on the nature of the dataset. For instance, convolutional neural networks (CNNs) utilize the convolution operation to merge two sets of information, and a subsequent pooling layer reduces the dimensionality and complexity (Dhillon & Verma, 2020). CNNs are therefore effective in image and video applications such as action recognition from video footage (Yao et al., 2019). One of the factors that enabled the recent success of deep learning models is the massive volume of available data allowing big data analytics, which is the process of obtaining meaningful information and insights from a variety of data sources (Grover & Kar, 2017). The idea of learning from experience to classify or predict variables can be extended to generating new data points using adversarial networks.

GANs primarily consist of two deep learning architectures, namely a generator and a discriminator. The role of a discriminator can be likened to that of a deep learning classifier. In this context, the discriminator is predicting whether a given sample is from the real training set or it is generated by the generator. The performance of the discriminator is further used by the generator as feedback as it tries to improve its learning parameters to fool the discriminator. Once the generator successfully learns the distribution of the dataset, it can produce images that are virtually indistinguishable from other samples in the real dataset. The ability of big data analytics to provide flexible management of information assets (Grover & Kar, 2017) can be further improved with GANs as they can effectively perform cross-domain transformation. Moreover, in recent times text-based information retrieval and multimedia-based information retrieval have become an integral research topic due to the


growing amount of these data types (Kushwaha et al., 2021). Therefore, frameworks that can effectively perform cross-domain transfer offer great benefit in terms of information storage and retrieval as well as the extraction of novel insights and knowledge. The use of text-to-text transformation has resulted in unique applications such as paraphrase generation (Palivela, 2021). Consequently, a more challenging aspect of transforming the text input to a different domain, i.e., audio, is undertaken in this work using GANs.
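The generator-discriminator feedback described in this section can be made concrete with a toy NumPy sketch of the two loss terms in a standard (cross-entropy) GAN. The discriminator scores below are hand-picked numbers standing in for real network outputs, not values from any trained model.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy between predicted probabilities p and labels y."""
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Toy discriminator outputs: probability that a sample is real.
d_real = np.array([0.9, 0.8, 0.95])  # scores on real training samples
d_fake = np.array([0.1, 0.3, 0.2])   # scores on generator outputs

# The discriminator wants real -> 1 and fake -> 0.
d_loss = bce(d_real, np.ones(3)) + bce(d_fake, np.zeros(3))
# The generator wants its fakes to be scored as real (labels of 1).
g_loss = bce(d_fake, np.ones(3))

print(d_loss, g_loss)
```

With these scores the discriminator is confident and its loss is small, while the generator's loss is large; training alternates gradient steps on the two losses until neither side can easily improve.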

2.3. GAN architectures and loss functions

WaveGAN and SpecGAN are two types of architectures that are suitable for the proposed application. Most of the other GAN architectures in use are built for TTS generation and have not been tested on music or melody generation. Meanwhile, the two considered models have been tested for music generation according to the existing works. The output music samples are available online, and they sound clear and realistic. Furthermore, the realistic outputs of the WaveGAN and SpecGAN architectures found in Donahue et al. (2019) were due to the selected neural network model (i.e., DCGAN) and the WGAN-GP loss function. The DCGAN, proposed in Radford et al. (2016), combines the CNN and GAN architectures for unsupervised learning. CNN is a well-known model that acts as a feature extractor and a classifier. The convolutional layers extract basic and complex features, which helps to identify the input class. The generator and discriminator in Radford et al. (2016) both consist of four convolutional layers with no pooling layers. However, the WaveGAN and SpecGAN models have an extra layer. The original output of the DCGAN is 64 × 64 pixels, which translates to 4096 samples. Therefore, Donahue et al. (2019) decided to add an extra layer to the generator and discriminator to increase the number of samples to 16,384.
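The 4096-to-16,384 step can be verified with a little arithmetic: each stride-2 transposed convolution in 2-D and each stride-4 transposed convolution in 1-D both quadruple the number of output values, so one extra layer of the same kind quadruples the total. The 4 × 4 (16-value) seed below is the standard DCGAN starting size, assumed here for illustration; the paper itself only states the 4096 and 16,384 totals.

```python
def upsampled_size(start: int, factor: int, n_layers: int) -> int:
    """Output value count after n_layers, each multiplying the count by
    `factor` (stride 2 in 2-D and stride 4 in 1-D both give factor 4)."""
    size = start
    for _ in range(n_layers):
        size *= factor
    return size

# DCGAN: a 4x4 seed (16 values) through four quadrupling layers -> 64*64.
print(upsampled_size(16, 4, 4))  # 4096
# WaveGAN: one extra layer of the same kind -> 16,384 audio samples.
print(upsampled_size(16, 4, 5))  # 16384
```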

In terms of the loss functions, the Wasserstein distance is different from the Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) distances since it measures the similarity between two probability distributions by comparing the distance between them (i.e., comparing the horizontal difference and not the vertical). The Wasserstein distance is helpful when comparing probability distributions that are not overlapping. The KL divergence is known to have an infinite value when the two probability distributions are not overlapping (i.e., the generated probability distribution is 0 where the real distribution is greater than 0) (Arjovsky et al., 2017). Meanwhile, the JS divergence stabilizes at log 2 when the distributions do not meet (Arjovsky et al., 2017). Unlike the previous methods, the Wasserstein distance provides a stable distance metric since the output value does not increase exponentially (Arjovsky et al., 2017).

The following conditions summarize the outputs of the distances in case the distributions are not overlapping (i.e., $\theta \neq 0$) and when they are entirely overlapping (i.e., $\theta = 0$):

$$W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta| \tag{1}$$

$$JS(\mathbb{P}_0, \mathbb{P}_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0, \\ 0 & \text{if } \theta = 0, \end{cases} \tag{2}$$

$$KL(\mathbb{P}_\theta \parallel \mathbb{P}_0) = KL(\mathbb{P}_0 \parallel \mathbb{P}_\theta) = \begin{cases} +\infty & \text{if } \theta \neq 0, \\ 0 & \text{if } \theta = 0. \end{cases} \tag{3}$$

We can see that the Wasserstein distance is equal to the $\theta$ variable corresponding to the distance between the distributions. Arjovsky et al. (2017) used the Wasserstein distance to calculate the loss of the GAN. Eq. (4) below presents WGAN:

$$V_{\mathrm{WGAN}}(D_w, G) = \mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})] - \mathbb{E}_{\boldsymbol{z} \sim P_Z}[D_w(G(\boldsymbol{z}))] \tag{4}$$

The discriminator $D_w$ no longer detects the real or fake samples; instead, it calculates the Wasserstein distance. Hence, the discriminator is called a "critic" since it tells how close the two distributions are. The first term, $\mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})]$, calculates the Wasserstein distance with respect to the data sample $\boldsymbol{x}$ taken from the real data, and the second term with respect to the generated sample $\boldsymbol{z}$. The network parameter weights for the critic and generator networks are updated to minimize the Wasserstein distance between the distributions.
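The three behaviours in Eqs. (1)-(3) can be checked numerically on two non-overlapping point masses, a discrete version of the example in Arjovsky et al. (2017). In the NumPy sketch below, the grid and the choice $\theta = 3$ are arbitrary.

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence; +inf where p > 0 but q == 0."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    """Jensen-Shannon divergence via the mixture distribution m."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(xs, p, q):
    """1-D Wasserstein-1 distance as the area between the two CDFs."""
    cdf_diff = np.abs(np.cumsum(p) - np.cumsum(q))
    dx = np.diff(xs, append=xs[-1])
    return np.sum(cdf_diff * dx)

theta = 3.0
xs = np.arange(0.0, 10.0, 1.0)        # support grid with unit spacing
p = (xs == 0.0).astype(float)         # point mass at 0
q = (xs == theta).astype(float)       # point mass at theta

print(w1(xs, p, q))  # 3.0: grows linearly with |theta|      (Eq. 1)
print(js(p, q))      # log 2 ~ 0.693 whenever theta != 0     (Eq. 2)
print(kl(p, q))      # inf for non-overlapping supports      (Eq. 3)
```

Only the Wasserstein distance reflects *how far apart* the distributions are, which is why it provides a useful training gradient where KL and JS saturate.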

The clipping method is used in the WGAN to satisfy the Lipschitz constraint by limiting the weights of the critic. However, it has some drawbacks, including taking a long time to reach the optimal point for the critic and having vanishing gradients if the number of layers is large or batch normalization is not used (Arjovsky et al., 2017). Hence, the gradient penalty (GP) was introduced in Gulrajani et al. (2017) to better address the Lipschitz constraint by enforcing a softer limit on the critic's weights. The loss with the added GP term is as follows:

$$L = \underbrace{\mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})] - \mathbb{E}_{\boldsymbol{z} \sim P_Z}[D_w(G(\boldsymbol{z}))]}_{\text{Original critic loss}} + \lambda\, \underbrace{\mathbb{E}_{\hat{\boldsymbol{x}} \sim \mathbb{P}_{\hat{\boldsymbol{x}}}}\left[\left(\left\|\nabla_{\hat{\boldsymbol{x}}} D(\hat{\boldsymbol{x}})\right\|_2 - 1\right)^2\right]}_{\text{Gradient penalty}} \tag{5}$$

where $\hat{\boldsymbol{x}}$ is an interpolated value between the real and generated samples. The GP takes $\lambda$ as a regularization parameter, and the gradient norm of $D$ at $\hat{\boldsymbol{x}}$ is subtracted by 1 and squared. The result is a gradient norm close to 1, which satisfies the Lipschitz constraint.
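A sketch of how the gradient-penalty term in Eq. (5) is evaluated in practice: interpolate between real and fake batches, take the critic's input gradient at the interpolates, and penalize deviation of its norm from 1. A toy linear critic is used here so the gradient is available in closed form; real implementations obtain it by automatic differentiation, and the weights and $\lambda = 10$ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.6, 0.8, 1.2])  # weights of a toy linear critic D(x) = w . x

def critic_grad(x_hat):
    """For a linear critic, the gradient w.r.t. the input is just w."""
    return np.broadcast_to(w, x_hat.shape)

def gradient_penalty(x_real, x_fake, lam=10.0):
    """WGAN-GP term: lam * E[(||grad D(x_hat)||_2 - 1)^2] on interpolates."""
    eps = rng.uniform(size=(x_real.shape[0], 1))     # per-sample mixing weights
    x_hat = eps * x_real + (1 - eps) * x_fake        # interpolated samples
    grad_norm = np.linalg.norm(critic_grad(x_hat), axis=1)
    return lam * np.mean((grad_norm - 1.0) ** 2)

x_real = rng.normal(size=(4, 3))
x_fake = rng.normal(size=(4, 3))
gp = gradient_penalty(x_real, x_fake)
print(gp)  # positive whenever ||grad D|| deviates from 1
```

Since this critic's gradient norm is the constant $\|w\| \approx 1.56$, the penalty pushes the critic toward weights of unit norm, a soft version of the clipping constraint.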

3. Methodology

This section describes the overall methodology for the proposed application including the GAN architectures used, data collection, and experimental setup.

3.1. GAN architecture selection

In this work, the audio dataset comprised melodies in raw audio format instead of MIDI representations. WaveGAN and SpecGAN are two types of GAN architectures used for music generation that are compatible with raw audio file formats. We selected WaveGAN for the melody generation since it demonstrated better results on the speech commands zero through nine (SC09) dataset used in (Donahue et al., 2019). The SpecGAN has a higher chance of overfitting, and its mean opinion scores (MOS) for the human quality, ease, and diversity evaluations are lower than WaveGAN's (Donahue et al., 2019). WaveGAN is specifically suitable because of the melody data being in raw audio format. Other popular approaches related to melody and music generation utilized MIDI or symbolic datasets, in which case GAN architectures such as the one in Lim et al. (2017) would be suitable. Meanwhile, SpecGAN makes use of the STFT of the raw audio and uses the iterative Griffin-Lim algorithm for inversion back to audio. So, in conclusion, the key difference of SpecGAN from WaveGAN is that the raw audio is first converted to spectrograms before training, and an inverse transformation to audio is performed before saving the generations. The WaveGAN implementation is publicly available on GitHub (Donahue, 2021). For our experiments, we modified the following GitHub repository (Adrián, 2021) that is suited to conditional WaveGAN. For the SpecGAN implementation, we modified the implementation found in (GitHub - naotokui/SpecGAN: SpecGAN - generate audio with adversarial training, 2021).

3.2. Data collection

The application of Arabic poetry to melody is a novel one, and consequently there do not exist any Arabic poem-emotion and melody-emotion datasets. Therefore, different sources were utilized to collect data to build novel Arabic poem and melody datasets. To express the appropriate mood of the poetry and link it to its corresponding melody, poetry was collected based on emotions. In the future, this work can be extended to experiment with other attributes such as poem era and poem geographical location. To represent the three emotion classes of poems, the Arabic musical mode or maqām was used. These are patterns of melody, and the maqām technique denotes the pitches, patterns, and construction of a piece of music, unique to Arabic art music (Touma & Touma, 2003). The use of maqām for selecting the melodies


Table 1
Dataset characteristics.

Dataset   Category   Total samples   Attributes
Poetry    Sad        4064            Contains 9452 poems of varying length, with each row containing poem text and corresponding emotion label.
          Love       3443
          Joy        1945
Melody    Saba       8319            Contains 16,322 audio files of 4 seconds. Each file is labeled with its corresponding maqam name.
          Seka       4611
          Ajam       3392

is suitable due to the ability of various maqams to exclusively convey certain emotions (Shahriar & Tariq, 2021). According to Farraj and Shumays (2019) and Shokouhi and Yusof (2013), melodies belonging to maqām Ajam represent the emotion of joy, Seka represents love, and Saba represents melancholy and sadness. Therefore, the maqām denoting a particular emotion will be paired with its corresponding poem denoting the same emotion.

For the poem dataset, a web scraper was developed to collect poems with their corresponding emotions from the AlDiwan (AlDiwan, 2021) website. This website was selected because it was the only accessible source where poems were categorized under several emotion categories. Other sources such as Kaggle contained Arabic poetry datasets classified by the era and locations of the poems. For simplicity, three emotion categories including sad, joy, and love were selected; however, the proposed framework is flexible and can easily be extended to other emotions in the future. A total of 9452 poems were collected: 4064 sad, 3443 love, and 1945 joy. For the melody dataset, YouTube musical plays were used to collect three types of melodies from each of maqams Saba, Seka, and Ajam, which correspond to sad, love, and joy, respectively. The melodies were split into 4-second segments, resulting in 16,322 audio files. This left us with a total of 8319 recordings for maqam Saba, 4611 recordings for Seka, and 3392 recordings for Ajam. Since there were more melodies available compared to poems, a poem from each class was randomly paired with a melody from its corresponding maqam class. This meant that a big portion of melodies from the dataset were not used for training as they were left unpaired. In future work, more poems from the emotion classes can be collected and paired with the unpaired melodies to expand the training set size. The dataset is publicly available on GitHub (https://github.com/SakibShahriar95/Arabic-Poetry-Melody).
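The random per-class pairing described above can be sketched as follows. The file names are placeholders, while the class counts are taken from the dataset description; sampling without replacement leaves the surplus melodies unpaired, as in the paper.

```python
import random

# Placeholder identifiers; counts match the dataset description
# (poem emotion maps to its maqam: sad->Saba, love->Seka, joy->Ajam).
poems = {"sad": [f"poem_{i}" for i in range(4064)],
         "love": [f"poem_{i}" for i in range(3443)],
         "joy": [f"poem_{i}" for i in range(1945)]}
melodies = {"sad": [f"saba_{i}.wav" for i in range(8319)],
            "love": [f"seka_{i}.wav" for i in range(4611)],
            "joy": [f"ajam_{i}.wav" for i in range(3392)]}

def pair_by_class(poems, melodies, seed=42):
    """Randomly pair each poem with a distinct melody of the same class.

    Melodies outnumber poems in every class, so the surplus stays unpaired."""
    rng = random.Random(seed)
    pairs = []
    for label, poem_list in poems.items():
        chosen = rng.sample(melodies[label], len(poem_list))
        pairs.extend((p, m, label) for p, m in zip(poem_list, chosen))
    return pairs

pairs = pair_by_class(poems, melodies)
print(len(pairs))  # 9452 poem-melody training pairs
```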

The dataset attributes are summarized in Table 1.

3.3. Experimental setup

To implement the proposed application of Arabic poetry to melody generation, two different approaches were utilized. Both approaches are GAN-based and are used to generate melodies appropriate for a given input text. The key difference is that the first approach utilizes the text embedding from the poems to directly train the GAN, whereas the second relies on a separate classifier to first determine the class that the poem belongs to before generating a melody for that class. However, the results from both approaches will offer the proposed poetry to melody generation application.

3.3.1. End-to-end (E2E) poetry to melody generation using WaveGAN

In the first approach, no intermediary steps are required. This is because the input poem text, in the form of word embeddings, is directly used to train the GAN architecture. Fig. 2 illustrates the experimental setup using the first approach.

During the training phase, the poems in the dataset are converted to corresponding word embeddings using the AraBERT (Antoun et al., 2020) model. To reduce the computational cost, two steps were performed:

• The first 128 words of each poem were considered to generate the embedding. This was done to reduce computational cost and memory exhaustion.

• T-distributed stochastic neighbor embedding (t-SNE) (van der Maaten & Hinton, 2008) was applied to the output embedding of AraBERT because the default embedding size was (768 × number of words). The number of output components for t-SNE was set to 100 to balance it for concatenation with the latent vector of size 100.

Therefore, the input to the generator is a noise vector that is concatenated with the word embeddings (after applying t-SNE). The discriminator receives the real melody data to train itself to distinguish between the real samples and the synthesized ones. After training, the architecture can generate an appropriate melody for a given text embedding. WaveGAN is built on top of the DCGAN architecture. For upsampling, the generator uses transposed convolution (since audio signals are 1-D, a larger kernel of size 25 was used). The rest of the architecture is kept the same as standard DCGAN. For the generator, ReLU activation was used for the first four layers, whereas tanh activation was used for the final layer. Meanwhile, in the discriminator, no activation was used for the first four layers, whereas the last layer used LeakyReLU activation. The training strategy used was WGAN-GP.
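A rough NumPy sketch of the embedding path into the generator. Random arrays stand in for AraBERT outputs, and an SVD/PCA projection stands in for t-SNE (which is heavier to run); the mean-pooling step and the shortened word cap are also assumptions made for this sketch, since the paper does not spell out how per-word embeddings are aggregated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for AraBERT output: one (n_words x 768) matrix per poem.
# (The paper truncates to the first 128 words; a shorter cap keeps this light.)
n_poems, max_words, hidden = 150, 32, 768
embeddings = rng.normal(size=(n_poems, max_words, hidden))

# Mean-pool over words, then project each poem vector down to 100 dims
# so it can be concatenated with the 100-dim latent noise vector.
pooled = embeddings.mean(axis=1)                 # (n_poems, 768)
centered = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:100].T                  # (n_poems, 100)

z = rng.normal(size=(n_poems, 100))              # latent noise
generator_input = np.concatenate([z, reduced], axis=1)
print(generator_input.shape)  # (150, 200)
```

The key point mirrored from the text is the shape bookkeeping: the 100-dim reduced embedding and the 100-dim noise vector concatenate into one 200-dim generator input.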

3.3.2. Poetry to melody generation using conditional GANs

In this approach, a conditional GAN-based method was used. There is an intermediary text classifier that is first responsible for generating the correct labels to be used by the conditional GAN. Fig. 3 displays the experimental setup for the second approach.

Two conditional GANs, namely conditional WaveGAN and conditional SpecGAN, were experimented with. The generator for the WaveGAN is trained based on the input random noise and the labels of the poem texts, which specify the emotion of the poem. The discriminator takes as input both the actual melody and the ones that were synthesized, to learn to distinguish between them. The only difference for the SpecGAN implementation is that the raw audio file is first converted to a spectrogram representation for the discriminator to train on. The entire process is reversed during synthesis, when the generated spectrograms are inverse transformed to raw audio representations. During the training phase, a text classifier is not required as the labeled data is available. However, once the conditional GANs have been trained successfully, we cannot simply input the text or its embedding to generate new melodies. Rather, the label of the poem category must be obtained first by running the text through a text classifier, AraBERT in this case. The label can then be passed on to the conditional GAN to generate a new melody. The most significant drawback of this approach is that the performance of the text classifier plays a big role in the accuracy of the generated melodies. Moreover, an intermediary step is often undesirable in many applications.

Within the conditional approach, the first method utilizes the WaveGAN. The architecture of the conditional WaveGAN is the same as the one used for E2E generation. This includes the number of layers, activations, and the training strategy. The only difference is that in this approach the TSNE outputs for the word embeddings are not required. Rather, the label is concatenated to the noise vector such that the GAN architecture is trained to generate samples from the correct melody class. After the training phase, the generator can be called by passing the label for the classes as well as a random noise vector to obtain new melodies for all three classes.
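The conditioning step described above can be sketched as follows. The 64-dimensional noise follows Table 2; representing the class label as a three-way one-hot vector before concatenation is an assumption made here for illustration.

```python
import numpy as np

# Minimal sketch of label conditioning in the conditional WaveGAN:
# the emotion-class label is appended to the latent noise vector so
# the generator learns class-specific melodies.
rng = np.random.default_rng(0)
NUM_CLASSES = 3                       # three poem emotion categories

def conditioned_input(label, noise_dim=64):
    one_hot = np.eye(NUM_CLASSES)[label]   # assumed one-hot label encoding
    noise = rng.normal(size=noise_dim)
    return np.concatenate([noise, one_hot])

z = conditioned_input(label=2)
print(z.shape)  # (67,)
```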

Fig. 2. End-to-end poetry to melody generation.

Fig. 3. Conditional poetry to melody generation.

The second conditional approach utilizes SpecGAN. As mentioned previously and illustrated in Fig. 3, this approach makes use of a spectrogram representation. Therefore, a preprocessing step is employed before the training phase. During this stage, the audio files for each category are loaded. We then convert each audio into 128 × 128 Mel-spectrogram representations. We then save the spectrograms, the class labels, the Mel feature mean, and the standard deviation into .npz (NumPy) format for storage of array data. Thereafter, this file can directly be used for the training phase without running the preprocessing steps multiple times. During the training phase, the training ratio is set to 5 as per the approach proposed in (Donahue et al., 2019). This ensures that the discriminator updates will happen 5 times per generator update. The generator consists of 11 layers as follows: a dense layer of 16384 units is followed by 5 blocks of UpSampling2D and Conv2D layers. All the activations are ReLU except for the last Conv2D layer, which has tanh activation. The discriminator, on the other hand, consists of six layers.

First, there are five Conv2D layers followed by LeakyReLU activations. Finally, after flattening, we have two dense layers. The first dense layer is responsible for determining whether the generated audio is real or fake, so it has a single output. However, no activation is used due to the use of the Wasserstein loss function. The other dense layer is responsible for generating the probabilities that the audio belongs to one of the categories. Therefore, it has three outputs with a softmax activation function. The generator learning rate is set to 0.0001 with the Adam optimizer, and the model is optimized on Wasserstein loss. For the discriminator, the learning rate and optimizer are the same as those of the generator. In addition to the Wasserstein loss, the discriminator also has a gradient penalty loss and a categorical cross-entropy loss for predicting the classes. First, we call the generator to predict the spectrogram by giving the input noise and the category label. Next, we call on the inverse Mel-spectrogram to obtain the raw audio files.
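The discriminator's combined objective described above can be sketched as follows. The score arrays and gradient norms are placeholder numbers (in a real implementation they come from the network via automatic differentiation), and the gradient-penalty weight of 10 is the common WGAN-GP default, assumed here rather than taken from the paper.

```python
import numpy as np

# Sketch of the three discriminator loss terms: Wasserstein critic
# loss, WGAN-GP gradient penalty, and an auxiliary class loss.
def wasserstein_critic_loss(real_scores, fake_scores):
    # The critic minimizes E[fake] - E[real].
    return np.mean(fake_scores) - np.mean(real_scores)

def gradient_penalty(grad_norms, weight=10.0):
    # WGAN-GP: penalize interpolate-gradient norms away from 1.
    return weight * np.mean((np.asarray(grad_norms) - 1.0) ** 2)

def categorical_cross_entropy(probs, label):
    # Auxiliary class-prediction loss for the second dense head.
    return -np.log(probs[label])

real_scores = np.array([0.8, 1.1])   # placeholder critic outputs
fake_scores = np.array([-0.2, 0.1])
total = (wasserstein_critic_loss(real_scores, fake_scores)
         + gradient_penalty([1.0, 1.0])  # norms of 1 give zero penalty
         + categorical_cross_entropy(np.array([0.2, 0.7, 0.1]), label=1))
print(round(total, 4))  # -0.6433
```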

4. Evaluation and results

In this section, results from the different approaches will be compared using qualitative evaluation and common quantitative evaluation metrics for GANs. The following sections present an explanation of these metrics and the corresponding results.

4.1. Quantitative validation

The leave-one-out (LOO) and GAN-train/test metrics will be used to validate the model performance. LOO uses n−1 sample points for training, where n is the number of samples, and the remaining sample for testing. The testing point is assigned the class (i.e., fake or real) of its nearest sample in terms of distance. The objective is to train a k-nearest neighbor (K-NN) algorithm with a K value of 1 (1-NN) and apply the LOO approach for training the K-NN. For a GAN, an LOO of 0.5 is ideal, where the discriminator has a 50% chance of correctly labeling the point. An LOO value of less than 0.5 likely indicates overfitting, since the generated and real points are very close. On the other hand, an LOO value greater than 0.5 means that the discriminator can easily distinguish between real and fake samples. Furthermore, to calculate the distance between the sampled point and the nearest neighbor, the Euclidean distance will be used. Given two points x and y, we can compute the Euclidean distance between them using Eq. (6):

Euclidean(x, y) = √( Σᵢ₌₁ⁿ (yᵢ − xᵢ)² )    (6)
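The LOO 1-NN protocol with the Euclidean distance of Eq. (6) can be sketched as follows; the toy clusters are illustrative stand-ins for real and generated melody features.

```python
import numpy as np

# Leave-one-out 1-NN evaluation: each sample is labeled with the class
# (real = 1, fake = 0) of its nearest neighbour among all other
# samples; a score near 0.5 is ideal for a GAN.
def loo_1nn_score(real, fake):
    X = np.vstack([real, fake]).astype(float)
    y = np.array([1] * len(real) + [0] * len(fake))
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)        # leave the point itself out
    pred = y[np.argmin(d, axis=1)]     # label of the nearest neighbour
    return float(np.mean(pred == y))

# Well-separated clusters: the 1-NN classifier is never fooled.
real = np.array([[0.0], [0.1], [0.2]])
fake = np.array([[9.0], [9.1], [9.2]])
print(loo_1nn_score(real, fake))  # 1.0
```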

Moreover, the dynamic time warping (DTW) (Salvador and Chan, 2007) algorithm will be used to calculate the distance between the distributions of the real examples and the generated ones. The LOO method using the DTW distance will also be utilized in the K-NN algorithm.
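The DTW distance can be sketched with the classic dynamic-programming recurrence below (a minimal O(n·m) version; the cited Salvador and Chan method is a faster approximation of this computation).

```python
import numpy as np

# Straightforward dynamic time warping distance between 1-D sequences.
def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Warping absorbs the repeated sample, so the distance stays 0.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```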

GAN-test is when the classifier is trained on the actual samples and tested on the generated samples, and GAN-train is the opposite (Shmelkov et al., 2018). From the GAN-test, it can be determined whether the model is overfitting, does not capture the actual distribution, or generates realistic samples (Shmelkov et al., 2018). If the testing accuracy is greater than the validation accuracy, the generated samples capture the fine details of the training set and hence produce similar samples. Meanwhile, the diversity of the generated samples can be assessed using GAN-train accuracy. The closer the testing accuracy is to the validation accuracy, the more similar the generated results are to the real ones, and the more diverse they are. For building both networks, the ResNet34 (ResNet-34, 2021) architecture was used. The same architecture was used because the goal of this evaluation was to estimate the similarity between the generated and real samples by assessing the GAN-train and GAN-test accuracies. Optimizing the architectures further would potentially lead to higher scores for both. However, in this case, the objective was to obtain a general idea of the accuracies and not necessarily to optimize them.

Table 2
Architecture parameters and conditions for each experiment.

Approach              Batch Size   Input Layers (G, D)   Target Length (Sec.)   Sample Rate (Hz)
E2E-WaveGAN           64           (6, 8)                4                      8192
Conditional-WaveGAN   32           (6, 8)                4                      8192
Conditional-SpecGAN   64           (11, 6)               1                      16000

Approach              Noise Vector   Cond. Input Format                     Batch Norm.   Epochs
E2E-WaveGAN           (100, 1)       AraBERT + tSNE (100, 1)                True          9200
Conditional-WaveGAN   (64, 1)        Label joined with embedding (16, 1)    True          15000
Conditional-SpecGAN   (97, 1)        One-hot encoded (3, 1)                 False         3000
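The GAN-train/GAN-test protocol can be illustrated with the toy sketch below, where a nearest-centroid classifier stands in for the ResNet34 used in the paper and all data is synthetic placeholder material.

```python
import numpy as np

# Toy illustration of GAN-train vs GAN-test. A nearest-centroid
# classifier substitutes for a real network; three Gaussian clusters
# substitute for the three melody classes.
def nearest_centroid_acc(train_X, train_y, test_X, test_y):
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(test_X[:, None, :] - centroids[None, :, :], axis=-1)
    return float(np.mean(classes[np.argmin(d, axis=1)] == test_y))

rng = np.random.default_rng(0)
real_X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 1.0, 2.0)])
fake_X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 20)

gan_train = nearest_centroid_acc(fake_X, y, real_X, y)  # train on generated, test on real
gan_test = nearest_centroid_acc(real_X, y, fake_X, y)   # train on real, test on generated
print(gan_train, gan_test)
```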

4.2. Qualitative validation

To validate the quality of the generated melodies, a qualitative case study was conducted. We invited twelve volunteers, who were university students, to evaluate and score both the generated and the original melodies. The participants were presented with ten audio samples, five of which belonged to the real melodies and five of which were GAN-synthesized, and were then asked a total of four questions. The generated samples were taken from the SpecGAN, as the other models produced poor-quality melodies. For each audio sample, the evaluators were asked the following questions and asked to rate each between 1 and 5, indicating very poor (1) to very good (5).

• The overall quality of the melody played (1 poor quality, 5 great quality).
• The melody played is enjoyable (1 not very enjoyable, 5 very enjoyable).
• The melody played is interesting (1 not very interesting, 5 very interesting).
• The melody played sounds artificial (1 sounds very artificial, 5 sounds very realistic).

Following the survey, the scores of each question for both the real and the generated melodies will be aggregated and compared. Although the small sample size limits how conclusive the results can be, we can still obtain a general perception of the quality of the generated melodies. We are particularly interested in whether the generated melodies were of good quality and did not sound artificial. Moreover, if the evaluators found them interesting and enjoyable, it would further add to the success of the generated melodies.
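The aggregation step can be sketched as below; the ratings are made-up placeholders, not the study's actual responses, and the mean ± standard deviation format mirrors how the results are later tabulated.

```python
import numpy as np

# Sketch of the survey aggregation: per-question ratings (1-5) from
# twelve evaluators are averaged per melody group.
ratings = {
    "real":      np.array([4, 4, 3, 5, 4, 3, 4, 4, 3, 4, 4, 3]),  # placeholder data
    "generated": np.array([3, 2, 3, 2, 3, 3, 2, 2, 3, 3, 2, 3]),  # placeholder data
}
for group, scores in ratings.items():
    # Sample standard deviation (ddof=1), as is usual for survey data.
    print(group, round(scores.mean(), 2), round(scores.std(ddof=1), 2))
```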

4.3. Results

Three different experiments for the poetry to melody generation application were conducted. Table 2 outlines the parameters and conditions under which each of the models was experimented. The architectures were already presented in greater detail in the previous section.

4.3.1. Quantitative results

The first experiment was the end-to-end WaveGAN, and the model was trained for 9200 epochs with a batch size of 64. With a sampling rate of 8192 Hz, an output of 4 seconds was obtained, which was the length of the files in the audio dataset. However, the melodies generated from the embeddings in this approach were of poor quality, as they sounded like random noisy instruments. We believe this is due to the transformation of the word embedding vector, which for computational reasons could not be directly used as an input and had to be downsampled. Using AraBERT, we obtained embeddings of 254 by 768 for each input text. In contrast, CIFAR-10 images are 32 by 32 in size. The extremely large size of the AraBERT model meant that sufficient memory could not be allocated for training. As a result, the input size was significantly reduced. Moreover, to match the dimensions, a TSNE transformation was performed, which meant that further information was lost. Therefore, we believe that in the future smaller text embedding models and a more sophisticated downscaling approach are required. The LOO result using the Euclidean distance indicates that the model leans toward the optimal result with a value of 0.58. However, the discriminator has a higher chance of distinguishing between the real and fake samples than the generator, which tries to trick the classifier. Meanwhile, the value of the LOO using the DTW distance is 0.91, which shows an entirely different outcome, where the model is highly capable of distinguishing between the samples. Finally, the distance between the actual and generated distributions is 361382, which is high. Consequently, this means that the generator is not creating realistic samples that would fool the discriminator.

In the second experiment, the same WaveGAN architecture was utilized but with a different input representation. In this case, the class labels were used instead of the word embeddings. The model was trained for 15000 epochs with a batch size of 32. The other configurations are the same as in the first experiment. The results obtained using this approach sounded more coherent, although a post-processing noise filter was required to denoise the generated melodies. It is possible to further improve the performance of this architecture by using additional data and hyperparameter tuning. Due to only being able to run about 10 epochs every minute, the amount of tuning that could be done was restricted. The LOO with the Euclidean distance shows an optimal value of 0.51, which means that the discriminator has a 0.5 probability of correctly classifying the sample. The LOO with the DTW has similar behavior to the previous architecture, where the value is higher (i.e., worse) than with the Euclidean distance; in this model, the value is 0.82. The DTW distance is 150063, which is lower than that of the E2E-WaveGAN, hence the LOO-DTW value is slightly better. The GAN-test accuracy of 0.36 is higher than the GAN-train accuracy of 0.31. Subsequently, this reveals that the generated samples are less diverse. Overall, the GAN-test and GAN-train accuracies are very low, which indicates that the produced samples have poor quality and diversity.

In the final experiment, a conditional SpecGAN architecture was used. Unlike the previous two experiments, a spectrogram representation is utilized here. Because the original implementation of SpecGAN was fixed to a one-second representation (Donahue et al., 2019), audio synthesis of one-second length was produced using this approach. As per the recommendation in (Donahue et al., 2019), batch normalization was not implemented for this architecture. The model was trained for a total of 3000 epochs in batch sizes of 64. The results obtained using this architecture sounded aesthetically more pleasant than the others. The only downside is that the length of the audio was one second, as opposed to the input audio dataset files being four seconds long. An LOO score of 0.49 was obtained with the Euclidean distance, which is close to perfect, indicating that the discriminator has a 0.5 probability of correctly classifying the sample. However, using the DTW distance with the LOO, the score was 0.78. This indicates that the generated samples are likely to be distinguished. Overall, the score is better compared to the previous two experiments. The GAN-train accuracy of 0.4 is higher than the GAN-test accuracy of 0.31. This indicates that the target distribution was not captured well and there is a lack of audio quality. Finally, the DTW distance of 92327 indicates that the generated distribution is closer to the actual one when compared with the first two experiments. As a result, the best results were obtained using the conditional SpecGAN.

Fig. 4. Spectrogram comparison for six generated and real samples.

For further evaluation, a total of six spectrograms generated by the three approaches were selected, along with the spectrograms for the real samples. Fig. 4 displays six randomly selected Mel-spectrograms of original music and of the music generated from conditional WaveGAN, conditional SpecGAN, and E2E WaveGAN. The original files contain a wide range of signals spanning a large range of frequencies. This contrasts with the signals generated through E2E WaveGAN and, to some extent, conditional WaveGAN. The former generated more monotonic signals with white noise, while the latter shows signals modulated possibly through the provided label or the latent space variation. The closest waveforms are generated by SpecGAN, which shows that its generated signals are more diverse and closer to the original ones. This argument is also supported by the scores obtained through the qualitative assessment.

The quantitative results obtained from the three experiments using the aforementioned metrics are summarized in Table 3. Fig. 5 also illustrates the comparison of the DTW distance between the real and generated data distributions for each algorithm. A lower DTW distance indicates that the generated and real data points are closer; a lower value therefore indicates better performance. As can be seen from the table, in terms of LOO evaluation, both conditional WaveGAN and SpecGAN outperformed E2E WaveGAN. However, both these models obtained lower scores in terms of GAN-train and GAN-test accuracies. Since the E2E-WaveGAN generated data points based on text embeddings, there was no label to perform the GAN-train and GAN-test evaluation on it. In terms of DTW distance, as displayed in Fig. 5, it can be observed that SpecGAN outperformed the other models. It was clear after assessing the samples generated from all three experiments that the conditional SpecGAN ones were the closest to real samples. In contrast, conditional WaveGAN produced noisy melodies and the E2E WaveGAN produced melodies that were not meaningful. Consequently, only the audios generated by conditional SpecGAN were selected for the qualitative case study.

Fig. 5. DTW distance between real and generated distributions.

4.3.2. Qualitative results

To assess the quality of the melodies generated by the conditional SpecGAN, twelve volunteers were asked to answer four quality-related questions. The evaluation criteria were based on how enjoyable, interesting, and realistic the melodies were, along with their overall quality, each to be scored between 1 and 5. The average score along with the standard deviation is summarized in Table 4. Sample melodies generated using conditional SpecGAN are available online (https://github.com/SakibShahriar95/Arabic-Poetry-Melody).

The table indicates that real melodies score better under all four criteria compared to the generated ones. However, for all the measures the generated melodies scored above average (2.5). More interestingly, the highest of the four measures for the generated melodies was realism, with the participants on average rating it 2.75 out of 5 for being realistic. This means the participants think that the SpecGAN-generated audios are more likely to be realistic than fake. This indicates to us that the overall results for the conditional SpecGAN are quite promising and can be improved upon in the future. Fig. 6 illustrates the comparison of the qualitative case study.

Table 3
Quantitative results comparison.

Approach              LOO (Euclidean)   LOO (DTW)     GAN-Train   GAN-Test   DTW Distance
E2E-WaveGAN           0.58 ± 0.49       0.91 ± 0.29   N/A         N/A        361382
Conditional-WaveGAN   0.51 ± 0.50       0.82 ± 0.39   0.31        0.36       150063
Conditional-SpecGAN   0.49 ± 0.50       0.78 ± 0.41   0.40        0.31       92327

Fig. 6. Survey result comparison between real and generated melodies.

Table 4
Qualitative results for conditional SpecGAN.

Evaluation        Real Melodies   SpecGAN Generated
Enjoyable         3.73 ± 0.61     2.60 ± 0.28
Interesting       3.75 ± 0.45     2.53 ± 0.31
Realistic         3.83 ± 0.29     2.75 ± 0.38
Overall Quality   3.72 ± 0.32     2.53 ± 0.32

5. Discussion

Despite the comparatively limited research in GAN-based cross-domain audio generation, the results obtained for the novel poetry to melody generation task were promising. In this section, some of the challenges, implications, and future work of this application are discussed.

5.1. Challenges

There were several notable challenges that were overcome to implement the proposed GAN-based poetry to melody generation framework. Firstly, there were no existing datasets of poetry and melodies paired based on suitable mood and emotion. As such, a significant contribution included the introduction of an Arabic poetry and melody dataset. Related works focused on generating melodies from given lyrics by aligning the lyrics with the musical notes, synthesizing the lyrics, and combining them with the generated melody to create suitable songs. Meanwhile, in this work, the objective is to generate a novel melody for each poem, representing the emotion conveyed by the poem. The lack of cross-domain audio generation research in the literature made it further challenging.

Moreover, the representation of audio signals is quite complex and spatially costly in digital form. The music signals can form the melody at various time scales, which cannot be captured algorithmically. The spatial complexity, i.e., the high-fidelity rate of audio signals, is a problem because it requires large computational power. These two problems of varying time-scale consistency and high fidelity make it very difficult to model the melodies. Various algorithms prioritize the global structure at the cost of high computational power, whereas others focus on using proxy representations like spectrograms to conserve computational power. Therefore, the solutions to this tradeoff cannot be easily generalized and require significant domain knowledge and experimentation.

5.2. Contributions to literature

As discussed in the previous sections, this work presented the first poetry to melody generation using GANs. Consequently, a novel dataset was introduced containing paired poetry and melodies based on emotions. The dataset can be used by researchers to further improve melody generation. They can also experiment with new tasks such as melody to poetry generation as well as text and audio classification in the Arabic language. Moreover, this work presented a reasonable performance for melody generation using the SpecGAN architecture, and the implementation can be utilized for related applications. However, the results obtained in this work also highlight the immense challenge of dealing with raw audio, especially with text to audio transformation in an end-to-end manner. Therefore, the necessary computational power and training time are required to further enhance the generation quality.

5.3. Implications for practice

The practical implications of this research are manifold. Firstly, the user experience of reading poetry will be greatly enhanced with the addition of melody. This can be achieved by integrating the proposed poetry to melody generation framework into a mobile or web application. Based on the poetry being read by the user, a melody matching the mood of the poem will be played in real time to enhance the reading experience. Moreover, the proposed framework can be of great benefit to the media production industry. For instance, a video scene where a poem is being recited may have a background melody to accompany it. Instead
