International Journal of Information Management Data Insights
How can generative adversarial networks impact computer generated art?
Insights from poetry to melody conversion
Sakib Shahriar a,∗ and Noora Al Roken b

a College of Technological Innovation, Zayed University, Abu Dhabi, United Arab Emirates
b Department of Computer Science and Engineering, American University of Sharjah, Sharjah, United Arab Emirates
∗ Corresponding author. E-mail address: [email protected] (S. Shahriar).
ARTICLE INFO

Keywords:
Generative adversarial networks
Deep learning
Melody generation
Music generation
Artificial intelligence
Computer generated art

ABSTRACT
Recent advances in deep learning, and generative adversarial networks (GANs) in particular, have enabled interesting applications including photorealistic image generation, image translation, and automatic caption generation. This has opened up possibilities for many cross-domain applications in computer-generated arts and literature. Although there are existing software-based approaches for generating musical accompaniment for a given poem, there are no existing implementations using GANs. This work proposes a novel poetry-to-melody generation conditioned on poem emotion using GANs. A dataset containing pairs of poetry and melody based on three emotion categories is introduced. Furthermore, various GAN architectures including SpecGAN and WaveGAN were explored for automatic melody synthesis for a given class of poetry. Conditional SpecGAN produced the best melodies according to quantitative metrics. Melodies produced by SpecGAN were evaluated by volunteers, who deemed their quality to be above average.
1. Introduction
Computer art was first explored as a way to produce objective art pieces by relying on probability theories (Chamberlain et al., 2018). Beginning in the 1970s, the artist Harold Cohen experimented with computer sketch generation, manually coding the painting structure into the program (Chamberlain et al., 2018). Unlike artificial intelligence (AI), conventional computer programs could not intelligently produce a new art piece without the designer's intervention. With the rapid development of AI, computer-generated drawings, music, and poetry have improved. Advanced learning algorithms and evolutionary models have generated complex and realistic artworks (Chamberlain et al., 2018). In 2018, the portrait of Edmond de Belamy was sold for $430,000 at auction, a much higher price than the anticipated tag of $7,000-$10,000 for a computer-generated artwork (Goold, 2021). The Edmond de Belamy case was the first auction sale of an AI-generated artwork; the French creators fed 30,000 portraits to an available machine learning algorithm and selected the best generated picture (Goold, 2021). Finally, the owners added their signatures to the bottom right corner. Furthermore, recent interest in adversarial training for artwork generation has grown significantly. The network tries to generate realistic outputs, and the critic or discriminator tries to distinguish between the real and generated samples. Earlier, a human played the critic's role, and the program was a form of an evolutionary
algorithm or a neural network (Soderlund & Blair, 2018). A group of outputs (e.g., images or music) is presented to the critic, and a subgroup with the best results is selected and passed to the next stage or iteration. Such a process is time-consuming, and hence the entire procedure has been automated by replacing the critic with a neural network that can learn discriminant features to distinguish between the inputs (Soderlund & Blair, 2018). Consequently, researchers and artists have recently started utilizing adversarial learning to generate various art forms including visual, textual, and audio arts.
Generating music that expresses a particular emotion conveyed by a poem helps the reader understand the piece of literature on a deeper level. Music accompaniments for poems can also be used to enhance the mood of poetry readers. However, insufficient work has been carried out on generating musical pieces for poems while considering the emotion of the text. To the best of our knowledge, no research application has yet been proposed that relates to poem and melody generation using deep learning. The existing literature mainly focuses on generating music for other art forms such as song lyrics. Hence, in this work, we present the first Arabic poetry to melody generation utilizing generative adversarial networks (GANs). GANs can generate new content based on a min-max game between two networks, the generator and the discriminator (Ruzafa, 2020). The generator tries to generate fake samples and improve its network to trick the discriminator. Meanwhile, the discriminator tries to distinguish between the real and fake samples. More sophisticated GAN architectures can make use of labels to
generate data points for a specific category. This is useful because in the standard GAN, data points are generated based only on the input noise, and unconditional generation of data points is not suitable for many applications.

Fig. 1. Graphical illustration of the proposed application.
Although GANs have been widely utilized for various applications including 3D object generation, medicine, and traffic control (Aggarwal et al., 2021), their adoption for audio data, and more specifically cross-domain audio generation including text-to-audio and image-to-audio generation, is very limited. The specific problem of text-to-audio transformation can be highlighted further by a related research application, text-to-speech (TTS) transformation. Most TTS processes carried out by neural network models are auto-regressive (AR), which means that the speech segments are generated based on the previously synthesized segments. AR methods do not support parallelization (i.e., the use of GPUs) due to their sequential nature (Prenger et al., 2019; Li et al., 2019). Consequently, AR models are not efficient for real-time applications because of the consecutive calculations that require time (Okamoto et al., 2019; Ren et al., 2019). Ren et al. (2019) noted how the commonly used AR TTS models such as Tacotron, Tacotron 2, Deep Voice 3, and ClariNet generate Mel-spectrograms from the text inputs for audio synthesis. The use of spectrograms further slows down audio generation, since each input is converted to a Mel-spectrogram whose sequence length is in the hundreds or thousands. Mel-spectrograms are susceptible to word skipping or repeating, which causes inconsistency between the text and generated speech (Ren et al., 2019). Besides the difficulties in TTS generation, melody or music generation comes with its own set of challenges. Dieleman et al. (2018) explain why music generation has received less attention than speech due to its complexity. MIDI and other symbolic music representations have been used in the literature to represent music concisely. However, many essential features that describe the instruments used are removed when using symbolic representations. The piano can be translated into MIDI sequences, but it is difficult to represent most instruments. It is more efficient to use raw music audio signals since they retain the needed information. However, raw audio is even more difficult to model, i.e., to capture musical structure over an extended period.
Therefore, the proposed text-to-audio generation application investigates the competency of GANs with audio data. The objective of this study is to address some of the existing challenges in audio generation and further examine audio generation from a given input text, which has not received sufficient attention in existing research works.
A high-level graphical illustration of the proposed framework is presented in Fig. 1. The generator utilizes the poem from the dataset along with some noise to synthesize new data points that would ultimately resemble the melody corresponding to the input poem. The generator does so by optimizing the loss function that minimizes the difference between the generated sample and the real melody. The role of the discriminator, in this case, is to provide feedback to the generator by attempting to distinguish between the generated and the actual melody. After the training phase, the generator would ideally learn the representation between the input poetry and the corresponding melody. In Section 3, a detailed explanation of the type of GAN architecture as well as the experimental setup used in this work is presented.
There are several challenges to solving this problem of melody generation for poetry. The most natural solution of employing human musicians and poets to collaborate in solving this task is not scalable due to high costs. Moreover, there is a challenge in terms of the language of the poem and the music associated with the culture of the written text.
Therefore, for practical usage, the solution would be to utilize computers to generate melodies. The use of programming languages and the extensive human supervision involved was a major drawback in older approaches to music generation by computers. In this context, the computer's role was to suggest syntheses and leave the human expert to complete the feedback loop and make the final decision of accepting or rejecting the suggestions (Noll, 1967). Functions for amplitude, frequency, and duration of a sequence of notes were drawn on a cathode-ray tube with a light pen, and the computer provided synthesis by combining these functions using simple algorithms (Noll, 1967). Consequently, the melodies produced by such computer programs sounded uniform and artificial.
The challenging aspect of computer music generation comes from the fact that human supervision is required at some level. Besides human supervision, it is not straightforward to program the computer to transform poetry into melody in a way that captures the mood of a poem. This would require further resources to first develop a program that can process the language within the poetry. The solution to both the need for human supervision and language-to-audio representation learning can only be achieved if computers are trained like humans, i.e., to learn from experience, using lots of data and without explicit programming. This is where deep learning, and more specifically GANs, are extremely useful. GANs are good at transforming inputs from one domain to another (Ruzafa, 2020); consequently, in our work, they can help with transforming the poem, a text input, into a representative melody, an audio output. Images generated using GANs have already been shown to be so realistic that humans cannot tell whether a particular art piece was produced by a human artist or a machine (Mazzone & Elgammal, 2019). Therefore, GANs will not only provide a more flexible implementation but will also produce more realistic and diverse melodies that can capture the imagination of poetry readers.
2. Background
This section presents the related works in the literature pertaining to melody and lyrics generation as well as background information on GAN architectures.
2.1. AI and computer generated art
GANs have been effectively used in various art generation applications including visual arts, written arts, music generation, calligraphy style recognition, and melody classification (Shahriar, 2021). A review of the literature in the context of related melody and music generation applications is presented next. Scirea et al. (2015) proposed a system using two Markov chains to generate lyrics from academic research papers along with appropriate melodies to accompany the music. However, this work did not provide an evaluation case study on the quality of the generated lyrics and melodies. In Ackerman and Loker (2017), machine learning models including random forest were used to create an application capable of assisting musicians and amateurs in producing quality songs and music. Although three songs were created using the application, they were not evaluated by case studies. A system that can analyze the mood of a poem and generate a relevant musical component was presented in (Stere & Trăuşan-Matu, 2017). Firstly, a poem was analyzed by considering its rhythm, punctuation, and mood. Based on the analysis, a piano composition was generated to complement the mood of the poem. Machine learning was used to classify the mood of the poem. The authors utilized a pre-trained Naïve Bayes classifier called MusicMood (Raschka, 2016) to classify the poem text as "happy" or "sad". The paper failed to address the accuracy of the mood classification as well as the quality of the generated music. Milon-Flores et al. (2019) presented an application to generate audiovisual summaries of literary texts by producing emotion-based music composition and graph-based animation. Natural language processing algorithms were used to extract emotions and characters from literary texts, but no machine learning frameworks were used in this work. Fukayama et al. (2010) developed an automated Japanese song composer, "Orpheus." The application used the Galatea-Talk tool to convert the textual lyrics to speech and a hidden Markov model to synthesize the singing voice. A survey conducted on 1378 people concluded that 70.8% of the songs were attractive. Monteith et al. (2012) generated melodies for different genres including nursery rhymes and rock songs by extracting the rhythm and pitch from the corresponding lyrics. The results from a case study indicated that the original melodies were more familiar, but the differences were not significant. Generation of music from text was proposed in a system called TransPose (Davis and Mohammad, 2014), which exploited the relationship between various characteristics of music (such as loudness, melody, and tempo) and the emotions they invoke. For example, loud music corresponds to power or anger, while fear or sadness is linked with soft music. One drawback of this approach is that mid-piece key changes cannot be captured with this technique. For further comparison of symbolic music generation techniques, readers are encouraged to refer to the survey by Briot et al. (2020).
Mishra et al. (2019) proposed an RNN-based generative model for melody generation. The model utilized an LSTM architecture to learn the input structure from a text file and was trained to predict the next character in the sequence. A survey concluded that the human-generated and machine-generated melodies could not be distinguished. In Welikala and Fernando (2020), musical instrument digital interface (MIDI) files, which contain data specifying musical instructions such as a note's notation and pitch, were used to train a hybrid variational autoencoder and GAN to generate musical melodies for a specific genre. It was concluded that the consistency of the generated melodies was not up to the level of human composition. Similarly, Xu et al. (2020) proposed a melody generation framework using a GAN consisting of a Bi-LSTM generator and an LSTM discriminator. The proposed model obtained an average score of 3.27 out of 5 from 19 human evaluators, outperforming the baseline models. Yu et al. (2020) used conditional GANs for melody generation. The lyrics encoder tokenized the lyrics into a vector containing a series of word and syllable tokens as well as additional noise. The output of the layer was fed into the conditioned-lyrics GAN. A MIDI sequence tuner was then used to quantize the MIDI notes produced by the generator. A similar study was conducted in "Conditional LSTM-GAN for Melody Generation from Lyrics" (2021), which used conditional LSTM-GANs. Comparison with random and maximum likelihood estimation baselines showed that the proposed model outperformed the baselines in terms of BLEU score. Bao et al. (2019) proposed a melody composer, "Songwriter," to generate lyrics-conditioned melody and its alignment. The authors divided the lyrics into sentences that are processed in sequence. The set of generated melodies was aggregated for each lyric to generate the complete melody. The evaluation using F1 and BLEU scores showed that the proposed model outperformed existing models on all metrics. Dias and Fernando (2019) introduced a system with an interface that generates melody as a sequence of musical notes given English lyrics, without any human intervention. An LSTM was used to train on and predict the melody target data. The human evaluation results showed that the generated melody was pleasant to listen to because of the well-developed notes. The authors of "Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval" (ACM Transactions on Multimedia Computing, Communications, and Applications, 2021) introduced a multimodal music retrieval system that retrieves audio from lyrics data and vice versa. MFCC features were extracted from music audio signals using a CNN, and Doc2vec was used to embed lyrics documents. Based on PixelCNN (van den Oord et al., 2016), WaveNet (van den Oord et al., 2016) is a model for audio generation capable of generating raw audio waveforms. Although case study results were presented, evaluation metrics were not reported. In "Performance RNN: Generating Music with Expressive Timing and Dynamics" (2021), an LSTM models music in symbolic form using MIDI files. The Yamaha e-Piano Competition dataset was used for training, and to supplement the training data, augmentations of each record (e.g., speed and transposition) were also used. Similarly, Boulanger-Lewandowski et al. (2012) explored polyphonic music generation using the restricted Boltzmann machine.
2.2. Deep learning and adversarial networks
Deep learning is a subset of machine learning, where computer systems learn from experience through exposure to training data, without requiring explicit programming. Deep learning architectures are different from traditional machine learning ones because they utilize artificial neural networks (ANNs), a biologically inspired computing technique that replicates the way neurons function in intelligent organisms (Kar, 2016). In many applications of high complexity, other forms of bio-inspired computation may be combined with neural networks to offer a hybrid decision support system (Chakraborty & Kar, 2017). In machine learning, an extensive selection of input features is required to train the algorithms for making intelligent decisions, which makes the process time-consuming. Meanwhile, deep learning models generally learn features from the input training data, which makes them far more effective for tasks such as image and audio classification. While simple ANNs contain a basic structure, and models with fewer than three layers are known as shallow networks, deep learning models have an increasing depth of more than three layers (Bouwmans et al., 2019). The organization in layers allows more complex representations of a problem to be expressed in terms of simpler ones. Other types of deep learning models are suitable for specific sets of problems depending on the nature of the dataset. For instance, convolutional neural networks (CNNs) utilize the convolution operation to merge two sets of information, and a subsequent pooling layer reduces the dimensionality and complexity (Dhillon & Verma, 2020). CNNs are therefore effective in image and video applications such as action recognition from video footage (Yao et al., 2019). One of the factors that enabled the recent success of deep learning models is the massive volume of available data, allowing big data analytics, which is the process of obtaining meaningful information and insights from a variety of data sources (Grover & Kar, 2017). The idea of learning from experience to classify or predict variables can be extended to generating new data points using adversarial networks.
GANs primarily consist of two deep learning architectures, namely a generator and a discriminator. The role of the discriminator can be likened to that of a deep learning classifier. In this context, the discriminator predicts whether a given sample is from the real training set or was generated by the generator. The performance of the discriminator is further used by the generator as feedback as it tries to improve its learning parameters to fool the discriminator. Once the generator successfully learns the distribution of the dataset, it can produce samples that are virtually indistinguishable from other samples in the real dataset. The ability of big data analytics to provide flexible management of information assets (Grover & Kar, 2017) can be further improved with GANs, as they can effectively perform cross-domain transformation. Moreover, in recent times text-based information retrieval and multimedia-based information retrieval have become integral research topics due to the growing amount of these data types (Kushwaha et al., 2021). Therefore, frameworks that can effectively perform cross-domain transfer offer great benefit in terms of information storage and retrieval as well as the extraction of novel insights and knowledge. The use of text-to-text transformation has resulted in unique applications such as paraphrase generation (Palivela, 2021). Consequently, a more challenging aspect, transforming text input to a different domain, i.e., audio, is undertaken in this work using GANs.
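To make the adversarial setup described above concrete, the following is a minimal training-step sketch, assuming PyTorch; the network shapes and hyperparameters are illustrative only and are not those of the models described in Section 3.

    import torch
    import torch.nn as nn

    # Illustrative networks; the actual architectures are described in Section 3.
    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 16384), nn.Tanh())
    D = nn.Sequential(nn.Linear(16384, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    def train_step(real):                           # real: (batch, 16384) audio samples
        batch = real.size(0)
        z = torch.randn(batch, 100)                 # latent noise vector
        # Discriminator step: real samples labeled 1, generated samples labeled 0.
        opt_d.zero_grad()
        loss_d = bce(D(real), torch.ones(batch, 1)) + \
                 bce(D(G(z).detach()), torch.zeros(batch, 1))
        loss_d.backward()
        opt_d.step()
        # Generator step: try to make the discriminator label fakes as real.
        opt_g.zero_grad()
        loss_g = bce(D(G(z)), torch.ones(batch, 1))
        loss_g.backward()
        opt_g.step()
        return loss_d.item(), loss_g.item()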
2.3. GAN architectures and loss functions
WaveGAN and SpecGAN are two types of architectures that are suitable for the proposed application. Most other GAN architectures are built for TTS generation and have not been tested on music or melody generation. Meanwhile, the two considered models have been tested for music generation in existing works. The output music samples are available online, and they sound clear and realistic. Furthermore, the realistic outputs of the WaveGAN and SpecGAN architectures found in Donahue et al. (2019) were due to the selected neural network model (i.e., DCGAN) and the WGAN-GP loss function. The DCGAN, proposed in Radford et al. (2016), combines the CNN and GAN architectures for unsupervised learning. CNN is a well-known model that acts as a feature extractor and a classifier. The convolutional layers extract basic and complex features, which help to identify the input class. The generator and discriminator in Radford et al. (2016) both consist of four convolutional layers with no pooling layers. However, the WaveGAN and SpecGAN models have an extra layer. The original output of the DCGAN is 64×64 pixels, which translates to 4096 samples. Therefore, Donahue et al. (2019) decided to add an extra layer to the generator and discriminator to increase the output to 16,384 samples.
In terms of loss functions, the Wasserstein distance is different from the Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) distance, since it measures the similarity between two probability distributions by comparing the distance between them (i.e., comparing the horizontal difference and not the vertical). The Wasserstein distance is helpful when comparing probability distributions that are not overlapping. KL divergence is known to take an infinite value when the two probability distributions do not overlap (i.e., the generated probability distribution is 0 where the real distribution is greater than 0) (Arjovsky et al., 2017). Meanwhile, JS divergence saturates at log 2 when the distributions do not meet (Arjovsky et al., 2017). Unlike the previous methods, the Wasserstein distance provides a stable distance metric since the output value does not increase exponentially (Arjovsky et al., 2017).
The following conditions summarize the outputs of the distances when the distributions are not overlapping (i.e., $\theta \neq 0$) and when they are entirely overlapping (i.e., $\theta = 0$):

$$W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta| \tag{1}$$

$$JS(\mathbb{P}_0, \mathbb{P}_\theta) = \begin{cases} \log 2 & \text{if } \theta \neq 0,\\ 0 & \text{if } \theta = 0, \end{cases} \tag{2}$$

$$KL(\mathbb{P}_\theta \parallel \mathbb{P}_0) = KL(\mathbb{P}_0 \parallel \mathbb{P}_\theta) = \begin{cases} +\infty & \text{if } \theta \neq 0,\\ 0 & \text{if } \theta = 0. \end{cases} \tag{3}$$
We can see that the Wasserstein distance is equal to the variable $\theta$ corresponding to the distance between the distributions. Arjovsky et al. (2017) used the Wasserstein distance to calculate the loss of the GAN; Eq. (4) below presents the WGAN objective:

$$V_{\mathrm{WGAN}}(D_w, G) = \mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})] - \mathbb{E}_{\boldsymbol{z} \sim P_Z}[D_w(G(\boldsymbol{z}))] \tag{4}$$

The discriminator $D_w$ no longer detects the real or fake samples; instead, it calculates the Wasserstein distance. Hence, the discriminator is called a "critic," since it tells how close the two distributions are. The first term, $\mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})]$, calculates the expectation with respect to the data sample $\boldsymbol{x}$ taken from the real data, and the second term with respect to the generated sample $G(\boldsymbol{z})$. The network parameter weights of the critic and generator networks are updated to minimize the Wasserstein distance between the distributions.
The clipping method is used in WGAN to satisfy the Lipschitz constraint by limiting the weights of the critic. However, it has some drawbacks, including taking a long time to reach the optimal point for the critic and having vanishing gradients if the number of layers is large or batch normalization is not used (Arjovsky et al., 2017). Hence, the gradient penalty (GP) was introduced in Gulrajani et al. (2017) to better address the Lipschitz constraint by enforcing a softer limit on the critic's weights. The loss with the added GP term is as follows:
$$L = \underbrace{\mathbb{E}_{\boldsymbol{x} \sim P_X}[D_w(\boldsymbol{x})] - \mathbb{E}_{\boldsymbol{z} \sim P_Z}[D_w(G(\boldsymbol{z}))]}_{\text{Original critic loss}} + \underbrace{\lambda\, \mathbb{E}_{\hat{\boldsymbol{x}} \sim \mathbb{P}_{\hat{\boldsymbol{x}}}}\!\left[\left(\left\|\nabla_{\hat{\boldsymbol{x}}} D(\hat{\boldsymbol{x}})\right\|_2 - 1\right)^2\right]}_{\text{Gradient penalty}} \tag{5}$$

where $\hat{\boldsymbol{x}}$ is an interpolated value between the real and generated samples. The GP takes $\lambda$ as a regularization parameter, and the gradient norm of $\hat{\boldsymbol{x}}$ is subtracted by 1 and squared. As a result, the norm of the gradient stays close to 1, which satisfies the Lipschitz constraint.
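To illustrate Eq. (5) in code, the following is a minimal sketch of the gradient penalty term, assuming PyTorch; the critic D, the batch layout, and λ = 10 are illustrative assumptions, not a definitive implementation.

    import torch

    def gradient_penalty(D, real, fake, lam=10.0):
        # x_hat: random interpolation between real and generated samples
        eps = torch.rand(real.size(0), 1)
        x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
        grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
        # Penalize deviation of the gradient norm from 1 (soft Lipschitz constraint)
        return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

    # Critic objective per Eq. (5):
    # loss = D(real).mean() - D(fake).mean() + gradient_penalty(D, real, fake)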
3. Methodology
This section describes the overall methodology for the proposed application, including the GAN architectures used, data collection, and experimental setup.
3.1. GAN architecture selection
In this work, the audio dataset comprised melodies in raw audio format instead of MIDI representations. WaveGAN and SpecGAN are two types of GAN architectures used for music generation that are compatible with raw audio file formats. We selected WaveGAN for melody generation since it demonstrated better results on the speech commands zero through nine (SC09) dataset used in (Donahue et al., 2019). SpecGAN has a higher chance of overfitting, and its mean opinion scores (MOS) for the human quality, ease, and diversity evaluations are lower than WaveGAN's (Donahue et al., 2019). WaveGAN is specifically suitable because the melody data is in raw audio format. Other popular approaches related to melody and music generation utilized MIDI or symbolic datasets, in which case GAN architectures such as the one in Lim et al. (2017) would be suitable. Meanwhile, SpecGAN computes the STFT of raw audio and uses the iterative Griffin-Lim algorithm for inversion back to audio. In conclusion, the key difference between SpecGAN and WaveGAN is that SpecGAN converts the raw audio to spectrograms before training and performs an inverse transformation back to audio before saving the generations. The WaveGAN implementation is publicly available on GitHub (Donahue, 2021). For our experiments, we modified the following GitHub repository (Adrián, 2021), which is suited to conditional WaveGAN. For the SpecGAN implementation, we modified the implementation found in (GitHub - naotokui/SpecGAN: SpecGAN - generate audio with adversarial training, 2021).
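As a sketch of the SpecGAN-style round trip described above (magnitude spectrogram for training, Griffin-Lim inversion back to audio), assuming the librosa package; the file name and STFT parameters are illustrative, not the exact values used in the cited implementations.

    import numpy as np
    import librosa

    audio, sr = librosa.load("melody.wav", sr=16000)     # hypothetical input clip

    # Forward: magnitude spectrogram used as the training representation
    mag = np.abs(librosa.stft(audio, n_fft=512, hop_length=128))

    # Inverse: Griffin-Lim iteratively estimates the discarded phase
    reconstructed = librosa.griffinlim(mag, n_iter=32, hop_length=128)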
3.2. Data collection
The application of Arabic poetry to melody is a novel one, and consequently there do not exist any Arabic poem-emotion and melody-emotion datasets. Therefore, different sources were utilized to collect data to build novel Arabic poem and melody datasets. To express the appropriate mood of the poetry and link it to its corresponding melody, poetry was collected based on emotions. In the future, this work can be extended to experiment with other attributes such as poem era and poem geographical location. To represent the three emotion classes of poems, the Arabic musical mode or maqām was used. These are patterns of melody, and the maqām technique denotes the pitches, patterns, and construction of a piece of music, unique to Arabic art music (Touma & Touma, 2003). The use of maqām for selecting the melodies
Table 1
Dataset characteristics.

Dataset   Category   Total Samples   Attributes
Poetry    Sad        4064            Contains 9452 poems of varying length, with each row containing the poem text and its corresponding emotion label.
          Love       3443
          Joy        1945
Melody    Saba       8319            Contains 16,322 audio files of 4 seconds each. Each file is labeled with its corresponding maqam name.
          Seka       4611
          Ajam       3392
is suitable due to the ability of various maqams to exclusively convey certain emotions (Shahriar & Tariq, 2021). According to Farraj and Shumays (2019) and Shokouhi and Yusof (2013), melodies belonging to maqām Ajam represent the emotion of joy, Seka represents love, and Saba represents melancholy and sadness. Therefore, the maqām denoting a particular emotion is paired with its corresponding poem denoting the same emotion.
For the poem dataset, a web scraper was developed to collect poems with their corresponding emotions from the AlDiwan website (AlDiwan, 2021). This website was selected because it was the only accessible source where poems were categorized under several emotion categories. Other sources such as Kaggle contained Arabic poetry datasets classified by the era and locations of the poems. For simplicity, three emotion categories, sad, joy, and love, were selected. However, the proposed framework is flexible and can easily be extended to other emotions in the future. A total of 9452 poems were collected: 4064 sad, 3443 love, and 1945 joy. For the melody dataset, YouTube musical plays were used to collect three types of melodies from each of maqams Saba, Seka, and Ajam, which correspond to sad, love, and joy, respectively. The melodies were split into 4-second segments, resulting in 16,322 audio files: a total of 8319 recordings for maqam Saba, 4611 recordings for Seka, and 3392 recordings for Ajam. Since more melodies were available than poems, a poem from each class was randomly paired with a melody from its corresponding maqam class. This meant that a large portion of melodies from the dataset were left unpaired and unused for training. In future work, more poems from the emotion classes can be collected and paired with the unpaired melodies to expand the training set size. The dataset is publicly available on GitHub (https://github.com/SakibShahriar95/Arabic-Poetry-Melody). The dataset attributes are summarized in Table 1.
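A minimal sketch of the 4-second segmentation step described above, assuming librosa and soundfile; the directory layout and the 8192 Hz rate (per Table 2) are illustrative assumptions.

    import glob
    import librosa
    import soundfile as sf

    SEG_SECONDS = 4

    def split_file(path, out_prefix, sr=8192):
        audio, _ = librosa.load(path, sr=sr)          # resample on load
        seg_len = SEG_SECONDS * sr
        for i in range(len(audio) // seg_len):        # drop the trailing partial segment
            sf.write(f"{out_prefix}_{i:04d}.wav",
                     audio[i * seg_len:(i + 1) * seg_len], sr)

    # Hypothetical folder of raw maqam Saba recordings:
    for j, path in enumerate(sorted(glob.glob("raw/saba/*.wav"))):
        split_file(path, f"clips/saba/{j:03d}")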
3.3. Experimental setup
To implement the proposed application of Arabic poetry to melody generation, two different approaches were utilized. Both approaches are GAN-based and are used to generate melodies appropriate for a given input text. The key difference is that the first approach utilizes the text embeddings from the poems to directly train the GAN, whereas the second relies on a separate classifier to first determine the class that a poem belongs to before generating a melody for that class. However, both approaches realize the proposed poetry to melody generation application.
3.3.1. End-to-end (E2E) poetry to melody generation using WaveGAN

In the first approach, no intermediary steps are required. This is because the input poem text, in the form of word embeddings, is directly used to train the GAN architecture. Fig. 2 illustrates the experimental setup for the first approach.
During the training phase, the poems in the dataset are converted to corresponding word embeddings using the AraBERT (Antoun et al., 2020) model. To reduce the computational cost, two steps were performed (a sketch follows the list):

• The first 128 words of each poem were considered to generate the embedding. This was done to reduce computational cost and memory exhaustion.
• T-distributed stochastic neighbor embedding (TSNE) (van der Maaten & Hinton, 2008) was applied to the output embedding of AraBERT, because the default embedding size was (768 × number of words). The number of output components for TSNE was set to 100 to match the latent vector size of 100 for concatenation.
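A minimal sketch of these two steps, assuming the Hugging Face transformers and scikit-learn packages; the AraBERT checkpoint name and the mean-pooling of token embeddings are illustrative assumptions, one plausible reading of the procedure rather than the exact configuration used here.

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.manifold import TSNE

    # Illustrative AraBERT checkpoint name
    tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv02")
    model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv02")

    def embed(poems):  # poems: list of poem text strings from the dataset
        # Truncate each poem to its first 128 tokens before embedding
        inputs = tokenizer(poems, truncation=True, max_length=128,
                           padding="max_length", return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state    # (batch, 128, 768)
        return hidden.mean(dim=1).numpy()                 # pool to (batch, 768)

    # n_components=100 requires method="exact" (barnes_hut supports < 4 dimensions)
    reduced = TSNE(n_components=100, method="exact").fit_transform(embed(poem_texts))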
Therefore, the input to the generator is a noise vector concatenated with the word embeddings (after applying TSNE). The discriminator receives the real melody data to train itself to distinguish between the real melodies and the synthesized ones. After training, the architecture can generate an appropriate melody for a given text embedding. WaveGAN is built on top of the DCGAN architecture. For upsampling, the generator uses transposed convolution (since audio signals are 1-D, a larger kernel of size 25 was used). The rest of the architecture is kept the same as standard DCGAN. For the generator, ReLU activation was used for the first four layers, whereas tanh activation was used for the final layer. Meanwhile, in the discriminator, no activation was used for the first four layers, whereas the last layer used LeakyReLU activation. The training strategy used was WGAN-GP.
3.3.2. Poetry to melody generation using conditional GANs
In this approach, a conditional GAN-based method was used. An intermediary text classifier is first responsible for generating the correct labels to be used by the conditional GAN. Fig. 3 displays the experimental setup for the second approach.
Two conditional GANs, namely conditional WaveGAN and conditional SpecGAN, were experimented with. The generator for the WaveGAN is trained based on the input random noise and the labels of the poem texts, which specify the emotion of the poem. The discriminator takes as input both the actual melodies and the synthesized ones to learn to distinguish between them. The only difference for the SpecGAN implementation is that the raw audio file is first converted to a spectrogram representation for the discriminator to train on. The entire process is reversed during synthesis, when the generated spectrograms are inverse transformed to raw audio representations. During the training phase, a text classifier is not required, as the labeled data is available. However, once the conditional GANs have been trained successfully, we cannot simply input the text or its embedding to generate new melodies. Rather, the label of the poem category must be obtained first by running the text through a text classifier, AraBERT in this case. The label can then be passed on to the conditional GAN to generate a new melody. The most significant drawback of this approach is that the performance of the text classifier plays a big role in the accuracy of the generated melodies. Moreover, an intermediary step is often undesirable in many applications.
Within the conditional approach, the first method utilizes the WaveGAN. The architecture of the conditional WaveGAN is the same as the one used for E2E generation. This includes the number of layers, activations, and the training strategy. The only difference is that in this approach the TSNE outputs for the word embeddings are not required. Rather, the label is concatenated to the noise vector such that the GAN architecture is trained to generate samples from the correct melody class. After the training phase, the generator can be called by passing the label for a class as well as a random noise vector to obtain new melodies for each of the three classes.
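A minimal sketch of this conditioning, assuming PyTorch, with shapes following Table 2 (a 64-dimensional noise vector joined with a 16-dimensional label embedding); the embedding layer itself is an illustrative assumption.

    import torch
    import torch.nn as nn

    NUM_CLASSES = 3                                  # sad, love, joy
    label_embed = nn.Embedding(NUM_CLASSES, 16)      # (16, 1) label embedding per Table 2

    def generator_input(batch_size, labels):
        z = torch.randn(batch_size, 64)              # (64, 1) noise vector per Table 2
        c = label_embed(labels)                      # (batch, 16)
        return torch.cat([z, c], dim=1)              # conditioned latent fed to G

    # e.g., latents for a batch of "joy" (class 2) melodies:
    z_cond = generator_input(8, torch.full((8,), 2, dtype=torch.long))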
Fig. 2. End-to-end poetry to melody generation.

Fig. 3. Conditional poetry to melody generation.

The second conditional approach utilizes SpecGAN. As mentioned previously and illustrated in Fig. 3, this approach makes use of a spectrogram representation. Therefore, a preprocessing step is employed before the training phase. During this stage, the audio files for each category are loaded. We then convert each audio clip into a 128×128 Mel-spectrogram representation. We then save the spectrograms, the class labels, and the Mel feature mean and standard deviation into .npz (NumPy) format for storage of array data. Thereafter, this file can directly be used for the training phase without running the preprocessing steps multiple times. During the training phase, the training ratio is set to 5 as per the approach proposed in (Donahue et al., 2019). This ensures that five discriminator updates happen per generator update. The generator consists of 11 layers as follows: a dense layer of 16,384 units is followed by 5 blocks of UpSampling2D and Conv2D layers. All the activations are ReLU except for the last Conv2D layer, which has tanh activation. The discriminator, on the other hand, consists of six layers. First, there are five Conv2D layers followed by LeakyReLU activations. Finally, after flattening, there are two dense layers. The first dense layer is responsible for determining whether the generated audio is real or fake, so it has a single output. However, no activation is used due to the use of the Wasserstein loss function. The other dense layer is responsible for generating the probabilities that the audio belongs to one of the categories. Therefore, it has three outputs with a softmax activation function. The generator learning rate is set to 0.0001 with the Adam optimizer, and the model is optimized on the Wasserstein loss. For the discriminator, the learning rate and optimizer are the same as those of the generator. In addition to the Wasserstein loss, the discriminator also has a gradient penalty loss and a categorical cross-entropy loss for predicting the classes. During synthesis, we first call the generator to predict the spectrogram by giving it the input noise and the category label. Next, we apply the inverse Mel-spectrogram transform to obtain the raw audio files.
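A minimal sketch of this preprocessing stage, assuming librosa; the folder layout, STFT parameters, and global normalization are illustrative assumptions consistent with the description above.

    import glob
    import numpy as np
    import librosa

    MAQAM_DIRS = {"saba": 0, "seka": 1, "ajam": 2}   # hypothetical folder layout

    def to_melspec(path, sr=16000):
        audio, _ = librosa.load(path, sr=sr, duration=1.0)    # 1-second SpecGAN clips
        mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128,
                                             n_fft=512, hop_length=125)
        return librosa.power_to_db(mel)[:, :128]              # 128 x 128 representation

    paths, labels = [], []
    for name, label in MAQAM_DIRS.items():
        for p in sorted(glob.glob(f"clips/{name}/*.wav")):
            paths.append(p)
            labels.append(label)

    specs = np.stack([to_melspec(p) for p in paths])
    mean, std = specs.mean(), specs.std()
    np.savez("dataset.npz", specs=(specs - mean) / std,       # normalized spectrograms
             labels=np.array(labels), mel_mean=mean, mel_std=std)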
4. Evaluation and results

In this section, results from the different approaches will be compared using qualitative evaluation and common quantitative evaluation metrics for GANs. The following sections present an explanation of these metrics and the corresponding results.
4.1. Quantitative validation
The leave-one-out (LOO) and GAN-train/GAN-test metrics will be used to validate the model performance. LOO uses n−1 sample points for training, where n is the number of samples, and the remaining sample for testing. The test point is assigned the class (i.e., fake or real) of its nearest sample in terms of distance. The objective is to train a k-nearest neighbor (K-NN) algorithm with a K value of 1 (1-NN) and apply the LOO approach for training the K-NN. For a GAN, an LOO score of 0.5 is ideal, where the discriminator has a 50% chance of correctly labeling the point. If the LOO value is less than 0.5, it likely indicates overfitting, since the generated and real points are very close. On the other hand, an LOO value greater than 0.5 means that the discriminator can easily distinguish between real and fake samples. Furthermore, to calculate the distance between the sampled point and the nearest neighbor, the Euclidean distance will be used. Given two points $x$ and $y$, the Euclidean distance between them is given by Eq. (6):

$$\mathrm{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2} \tag{6}$$
Moreover, the dynamic time warping (DTW) algorithm (Salvador and Chan, 2007) will be used to calculate the distance between the distributions of the real examples and the generated ones. The LOO method using the DTW distance will also be utilized in the K-NN algorithm.
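A minimal sketch of the 1-NN LOO evaluation, assuming scikit-learn; the sample arrays are placeholders, and the DTW variant is obtained by swapping a pairwise DTW distance in for the default Euclidean metric.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    # real and fake: arrays of flattened samples (placeholders for the actual data)
    X = np.vstack([real, fake])
    y = np.array([1] * len(real) + [0] * len(fake))   # 1 = real, 0 = generated

    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        knn = KNeighborsClassifier(n_neighbors=1)     # 1-NN, Euclidean by default;
        knn.fit(X[train_idx], y[train_idx])           # pass metric=<DTW callable>
        correct += knn.predict(X[test_idx])[0] == y[test_idx][0]   # for the DTW variant

    loo = correct / len(X)   # 0.5 is ideal: real and fake are indistinguishable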
GAN-test is when the classifier is trained on the actual samples and tested on the generated samples, and GAN-train is the opposite (Shmelkov et al., 2018). From the GAN-test, it can be determined whether the model is overfitting, does not capture the actual distribution, or generates realistic samples (Shmelkov et al., 2018). If the testing accuracy is greater than the validation accuracy, the generated samples capture the fine details of the training set and hence resemble the real samples. Meanwhile, the diversity of the generated samples can be assessed using the GAN-train accuracy: the closer the testing accuracy is to the validation accuracy, the more similar the generated results are to the real ones, and the more diverse they are. For building both networks, the ResNet34 (ResNet-34, 2021) architecture was used. The same architecture was used because the goal of this evaluation was to estimate the similarity between the generated and real samples by assessing the GAN-train and GAN-test accuracies. Optimizing the architectures further would potentially lead to higher scores for both. However, in this case, the objective was to obtain a general idea of the accuracies and not necessarily to optimize them.

Table 2
Architecture parameters and conditions for each experiment.

Approach              Batch Size   Input Layers (G, D)   Target Length (Sec.)   Sample Rate (Hz)
E2E-WaveGAN           64           (6, 8)                4                      8192
Conditional-WaveGAN   32           (6, 8)                4                      8192
Conditional-SpecGAN   64           (11, 6)               1                      16000

Approach              Noise Vector   Cond. Input Format                    Batch Norm.   Epochs
E2E-WaveGAN           (100, 1)       AraBERT + TSNE (100, 1)               True          9200
Conditional-WaveGAN   (64, 1)        Label joined with embedding (16, 1)   True          15000
Conditional-SpecGAN   (97, 1)        One-hot encoded (3, 1)                False         3000
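A schematic sketch of the two metrics; a simple scikit-learn classifier stands in for the ResNet34 used in this work, and the spectrogram/label arrays are placeholders.

    from sklearn.linear_model import LogisticRegression

    # gen_X, gen_y, real_X, real_y: spectrogram and label arrays (placeholders)
    def fit(X, y):   # stand-in classifier; this work trains a ResNet34 instead
        return LogisticRegression(max_iter=1000).fit(X.reshape(len(X), -1), y)

    # GAN-train: train on generated samples, evaluate on real ones (diversity)
    gan_train = fit(gen_X, gen_y).score(real_X.reshape(len(real_X), -1), real_y)

    # GAN-test: train on real samples, evaluate on generated ones (quality)
    gan_test = fit(real_X, real_y).score(gen_X.reshape(len(gen_X), -1), gen_y)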
4.2. Qualitative validation
To validate the quality of the generated melodies, a qualitative case study was conducted. We invited twelve volunteers, all university students, to evaluate and score both the generated and the original melodies. The participants were presented with ten audio samples, of which five belonged to the real melodies and five were GAN-synthesized, and then answered a total of four questions. The generated samples were taken from the SpecGAN, as the other models produced poor-quality melodies. For each audio sample, the evaluators were asked the following questions and asked to rate each between 1 and 5, indicating very poor (1) to very good (5).
• The overall quality of the melody played (1 poor quality, 5 great quality).
• The melody played is enjoyable (1 not very enjoyable, 5 very enjoyable).
• The melody played is interesting (1 not very interesting, 5 very interesting).
• The melody played sounds artificial (1 sounds very artificial, 5 sounds very realistic).
Following the survey, the scores of each question for both the real and the generated melodies will be aggregated and compared. Although the small sample size limits how conclusive the results can be, we can still obtain a general perception of the quality of the generated melodies. We are particularly interested in whether the generated melodies were of good quality and did not sound artificial. Moreover, if the evaluators found them interesting and enjoyable, it would further add to the success of the generated melodies.
4.3. Results
Three different experiments were conducted for the poetry to melody generation application. Table 2 outlines the parameters and conditions under which each of the models was experimented. The architectures were already presented in greater detail in the previous section.
4.3.1. Quantitative results
The first experiment was the end-to-end approach using WaveGAN conditioned on text embeddings, and the model was trained for 9200 epochs with a batch size of 64. With a sampling rate of 8192 Hz, an output of 4 seconds was obtained, which matched the size of the files in the audio dataset. However, the melodies generated from the embeddings in this approach were of poor quality, as they sounded like random noisy instruments. We believe this is due to the transformation of the word embedding vector, which for computational reasons could not be directly used as an input and had to be downsampled. Using AraBERT, we obtained embeddings of 254 by 768 for each input text. In contrast, CIFAR-10 images are of size 32 by 32. The extremely large size of the AraBERT model meant that sufficient memory could not be allocated for training. As a result, the input size was significantly reduced. Moreover, to match the dimensions, a TSNE transformation was performed, which meant that further information was lost. Therefore, we believe that in the future smaller text embedding models and a more sophisticated downscaling approach are required. The LOO result using the Euclidean distance indicates that the model is leaning towards the optimal result, with a value of 0.58. However, the discriminator has a higher chance of distinguishing between the real and fake samples than the generator has of tricking it. Meanwhile, the value of the LOO using the DTW distance is 0.91, which shows an entirely different outcome where the model is highly capable of distinguishing between the samples. Finally, the DTW distance between the actual and generated distributions is 361,382, which is high. Consequently, this means that the generator is not creating realistic samples that would fool the discriminator.
In the second experiment, the same WaveGAN architecture was utilized but with a different input representation. In this case, the class labels were used instead of the word embeddings. The model was trained for 15,000 epochs with a batch size of 32. The other configurations are the same as in the first experiment. The results obtained using this approach sounded more coherent, although a post-processing noise filter was required to denoise the generated melodies. It is possible to further improve the performance of this architecture by using additional data and hyperparameter tuning. Because only about 10 epochs could be run per minute, the amount of tuning that could be done was restricted. The LOO with the Euclidean distance shows a near-optimal value of 0.51, which means that the discriminator has a 0.5 probability of correctly classifying a sample. The LOO with the DTW behaves similarly to the previous architecture, in that the value is higher (i.e., worse) than with the Euclidean distance; for this model, the value is 0.82. The DTW distance is 150,063, which is lower than that of the E2E-WaveGAN; hence the LOO-DTW value is slightly better. The GAN-test accuracy of 0.36 is higher than the GAN-train accuracy of 0.31, which reveals that the generated samples are less diverse. Overall, the GAN-test and GAN-train accuracies are very low, which indicates that the produced samples have poor quality and diversity.
In the final experiment, a conditional SpecGAN architecture was used. Unlike the previous two experiments, a spectrogram representation is utilized here. Because the original implementation of SpecGAN was fixed to a one-second representation (Donahue et al., 2019), audio synthesis of one-second length was produced using this approach. As per the recommendation in (Donahue et al., 2019), batch normalization was not implemented for this architecture. The model was trained for a total of 3000 epochs with a batch size of 64. The results obtained using this architecture sounded aesthetically more pleasant than the others. The only downside is that the length of the audio was one second, as opposed to the input audio dataset files, which are four seconds long. A LOO score of 0.49 was obtained with the Euclidean distance, which is close to perfect, indicating that the discriminator has a 0.5 probability of correctly classifying a sample. However, using the DTW distance with the LOO, the score was 0.78. This indicates that the generated samples are likely to be distinguished. Overall, the score is better compared to the previous two experiments. The GAN-train accuracy of 0.4 is higher than the GAN-test accuracy of 0.31. This indicates that the target distribution was not captured well and there is a lack of audio quality. Finally, the DTW distance of 92,327 indicates that the generated distribution is closer to the actual one compared with the first two experiments. As a result, the best results were obtained using the conditional SpecGAN.

Fig. 4. Spectrogram comparison for six generated and real samples.
For further evaluation, six spectrograms were selected that were generated by the three approaches, along with the spectrograms of real samples. Fig. 4 displays six randomly selected Mel-spectrograms of original music and of the music generated by conditional WaveGAN, conditional SpecGAN, and E2E WaveGAN. The original files contain a wide range of signals spanning a large range of frequencies. This contrasts with the signals generated through E2E WaveGAN and, to some extent, conditional WaveGAN. The former generated more monotonic signals with white noise, while the latter shows signals modulated possibly through the provided label or the latent space variation. The closest waveforms are generated by SpecGAN, which shows that its generated signals are more diverse and closer to the original ones. This argument is also supported by the scores obtained through the qualitative assessment.
The quantitative results obtained from the three experiments using the aforementioned metrics are summarized in Table 3. Fig. 5 also illustrates the comparison of the DTW distance between the real and generated data distributions for each algorithm. A lower DTW distance indicates that the generated and real data points are closer, and therefore a lower value in this case indicates better performance. As can be seen from the table, in terms of LOO evaluation, both conditional WaveGAN and SpecGAN outperformed E2E WaveGAN. However, both of these models obtained low scores in terms of GAN-train and GAN-test accuracies. Since the E2E-WaveGAN generated data points based on text embeddings, there were no labels on which to perform the GAN-train and GAN-test evaluations. In terms of DTW distance, as displayed in Fig. 5, it can be observed that SpecGAN outperformed the other models. After assessing the samples generated from all three experiments, it was clear that the conditional SpecGAN ones were the closest to real samples. In contrast, conditional WaveGAN produced noisy melodies and the E2E WaveGAN produced melodies that were not meaningful. Consequently, only the audio generated by conditional SpecGAN was selected for the qualitative case study.

Fig. 5. DTW distance between real and generated distributions.
4.3.2. Qualitative results
To assess the quality of the melodies generated by the conditional SpecGAN, twelve volunteers were asked to answer four quality-related questions. The evaluation criteria were how enjoyable, interesting, and realistic the melodies were, as well as their overall quality, each scored between 1 and 5. The average scores along with their standard deviations are summarized in Table 4. Sample melodies generated using conditional SpecGAN are available online (https://github.com/SakibShahriar95/Arabic-Poetry-Melody).

Table 3
Quantitative results comparison.

Approach              LOO (Euclidean)   LOO (DTW)     GAN-Train   GAN-Test   DTW Distance
E2E-WaveGAN           0.58 ± 0.49       0.91 ± 0.29   N/A         N/A        361382
Conditional-WaveGAN   0.51 ± 0.50       0.82 ± 0.39   0.31        0.36       150063
Conditional-SpecGAN   0.49 ± 0.50       0.78 ± 0.41   0.40        0.31       92327

Table 4
Qualitative results for conditional SpecGAN.

Evaluation        Real Melodies   SpecGAN Generated
Enjoyable         3.73 ± 0.61     2.60 ± 0.28
Interesting       3.75 ± 0.45     2.53 ± 0.31
Realistic         3.83 ± 0.29     2.75 ± 0.38
Overall Quality   3.72 ± 0.32     2.53 ± 0.32

Table 4 indicates that real melodies are rated higher under all four criteria compared to the generated ones. However, for all the measures, the generated melodies scored above average (2.5). More interestingly, the highest of the four measures for the generated melodies was realism, with participants on average rating them 2.75 out of 5 for being realistic. This means the participants considered the SpecGAN-generated audio more likely to be realistic than fake. This indicates to us that the overall results for the conditional SpecGAN are quite promising and can be improved upon in the future. Fig. 6 illustrates the comparison from the qualitative case study.

Fig. 6. Survey result comparison between real and generated melodies.
5. Discussion
Despite the comparatively limited research in GAN-based cross-domain audio generation, the results obtained for the novel poetry to melody generation were promising. In this section, some of the challenges, implications, and future work of this application are discussed.
5.1. Challenges
Several notable challenges were overcome to implement the proposed GAN-based poetry to melody generation framework. Firstly, there were no existing datasets of poetry and melodies paired based on suitable mood and emotion. As such, a significant contribution was the introduction of an Arabic poetry and melody dataset. Related works focused on generating melodies from given lyrics by aligning the lyrics with the musical notes, synthesizing the lyrics, and combining them with the generated melody to create suitable songs. Meanwhile, in this work, the objective is to generate a novel melody for each poem representing the emotion conveyed by the poem. The lack of cross-domain audio generation research in the literature made this further challenging.
Moreover, the representation of audio signals is quite complex and spatially costly in digital form. Music signals can form the melody at various time scales, which cannot be captured algorithmically. The spatial complexity, i.e., the high fidelity rate of audio signals, is a problem because it requires large computational power. These two problems, varying time-scale consistency and high fidelity, make it very difficult to model the melodies. Various algorithms prioritize the global structure at the cost of high computational power, whereas others focus on using proxy representations like spectrograms to conserve computational power. Therefore, solutions to this tradeoff cannot be easily generalized and require significant domain knowledge and experimentation.
5.2. Contributions to literature
As discussed in the previous sections, this work presented the first poetry to melody generation using GANs. Consequently, a novel dataset was introduced containing paired poetry and melody based on emotions. The dataset can be used by researchers to further improve melody generation. They can also experiment with new tasks such as melody to poetry generation as well as text and audio classification in the Arabic language. Moreover, this work demonstrated reasonable performance for melody generation using the SpecGAN architecture, and the implementation can be utilized for related applications. However, the results obtained in this work also highlight the immense challenge of dealing with raw audio, especially for text-to-audio transformation in an end-to-end manner. Therefore, the necessary computational power and training time are required to further enhance the generation quality.
5.3. Implications for practice
The practical implications of this research are manifold. Firstly, the user experience of reading poetry will be greatly enhanced with the addition of melody. This can be achieved by integrating the proposed poetry to melody generation framework into a mobile or web application. Based on the poetry being read by the user, a melody matching the mood of the poem will be played in real time to enhance the reading experience. Moreover, the proposed framework can be of great benefit to the media production industry. For instance, a video scene where a poem is being recited may have a background melody to accompany it. Instead