A.2 Topic Analysis
• choose a topic $z_i \sim \theta_d$
• choose a sentiment $\ell_i \sim \pi_{d,z_i}$
• choose a word $w_i$ from the multinomial distribution over words defined by $\ell_i$ and $z_i$ (parameter $\phi^{\ell_i}_{z_i}$, which is the per-corpus joint sentiment-topic word distribution).
The hyperparameter $\alpha$ in this case is the prior for the topic distribution. That is, it can be thought of as the prior distribution of topics before any documents are seen.
Similarly, $\gamma$ can be thought of as the prior count of sentiment-topic pairs before any documents are seen.
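The generative steps above can be sketched in a few lines of Python. This is only an illustration of the sampling order (topic, then sentiment given topic, then word given both), not the estimation code used in the paper. The vocabulary size `V`, the Dirichlet concentration values, and all function names are assumptions introduced for the example.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one probability vector from a symmetric Dirichlet distribution."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]

def sample_document(n_words, theta_d, pi_d, phi, rng):
    """Generate (topic, sentiment, word) index triples for one document.

    theta_d[z]   : topic distribution of the document
    pi_d[z][l]   : sentiment distribution of the document given topic z
    phi[z][l][w] : corpus-level word distribution for the pair (z, l)
    """
    tokens = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta_d)), weights=theta_d)[0]      # z_i ~ theta_d
        l = rng.choices(range(len(pi_d[z])), weights=pi_d[z])[0]      # l_i ~ pi_{d, z_i}
        w = rng.choices(range(len(phi[z][l])), weights=phi[z][l])[0]  # w_i ~ Mult(phi for (z_i, l_i))
        tokens.append((z, l, w))
    return tokens

rng = random.Random(0)
T, S, V = 5, 3, 50  # 5 topics and 3 sentiments as in the text; V is a made-up vocabulary size
theta_d = sample_dirichlet(0.5, T, rng)                    # alpha = 0.5 is an assumed value
pi_d = [sample_dirichlet(0.5, S, rng) for _ in range(T)]   # gamma = 0.5 is an assumed value
phi = [[sample_dirichlet(0.5, V, rng) for _ in range(S)] for _ in range(T)]  # assumed word prior
doc = sample_document(20, theta_d, pi_d, phi, rng)
```

Each element of `doc` is a triple of integer indices; mapping word indices back to strings would require the actual vocabulary, which is not reproduced here.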
To estimate the model, we use the modified version of Phan’s GibbsLDA++ package written by Lin for R.¹ The model is calibrated using coherence scores, searching over 2 to 30 topics (which corresponds to 6 to 90 senTopics). The results can be seen in Figure A.1. For each metric, a higher value indicates a better fit; the precise meaning of each metric can be found in the documentation for the text2vec R package.² These values led us to choose 5 topics. We kept the number of sentiments at three, following Lin and He (2009). The most frequently used words in each senTopic can be seen in Figure A.2, where word size represents the number of tweets the word appears in.
In Table 2, we list the author-generated label for each topic-sentiment pair. For the remainder of the analysis, we focus on the 5 BLM-related senTopics, which cover: BLM General, BLM George Floyd/Breonna Taylor, BLM Civil Rights, BLM Los Angeles News, and BLM Police Violence. This choice is validated in Appendix A.2.
Topic Choices
Figure A.1 shows the coherence scores for each number of topics considered. For more information on the different metrics, see https://rdrr.io/github/dselivanov/text2vec/man/coherence.html. These results were the main driver of our decision to choose 5 topics.
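To give a sense of what one of these metrics measures, the following is a simplified sketch of a mean NPMI coherence score computed from document-level co-occurrence. It is not the text2vec implementation, which may use different smoothing and co-occurrence windows; the function name and the toy documents are assumptions made for illustration.

```python
import math
from itertools import combinations

def mean_npmi(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    Probabilities are document-level (co-)occurrence frequencies.
    Pairs that never co-occur receive the limiting value -1.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for a, b in combinations(top_words, 2):
        p_a, p_b, p_ab = prob(a), prob(b), prob(a, b)
        if p_ab == 0:
            scores.append(-1.0)
        elif p_ab == 1:
            scores.append(0.0)  # NPMI is undefined here; treated as 0 in this sketch
        else:
            scores.append(math.log(p_ab / (p_a * p_b)) / -math.log(p_ab))
    return sum(scores) / len(scores)

# Toy stemmed documents (invented for the example, not taken from the corpus).
toy_docs = [
    ["polic", "offic", "protest"],
    ["polic", "offic", "arrest"],
    ["vote", "elect", "biden"],
    ["vote", "elect", "poll"],
]
coherent = mean_npmi(["polic", "offic"], toy_docs)
incoherent = mean_npmi(["polic", "vote"], toy_docs)
```

Words that always appear together score higher than words that never do, which is why a higher value is read as a better fit when comparing candidate numbers of topics.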
¹ See http://gibbslda.sourceforge.net/ and https://github.com/linron84/JST/
² https://rdrr.io/github/dselivanov/text2vec/man/coherence.html
[Figure: six panels, one per coherence metric (mean_npmi_cosim, mean_npmi_cosim2, mean_pmi, mean_difference, mean_logratio, mean_npmi); x-axis: Topics.]
Figure A.1: Coherence metrics for various numbers of topics.
Topic Overview
The word clouds for each senTopic can be seen in Figure A.2. These word clouds, together with a detailed analysis of the tweets scoring high in each sentiment-topic pair, led to the author-generated labels presented in Table 2.
[Figure: word clouds in 15 panels, one per senTopic (2020 Pres. Election; Family; Anger/Frustration; Music; City News; Police Violence; Covid/Wear Masks; Political Confrontation; Sadness/Nostalgia; Voting; Pop Culture; Media; George Floyd/Breonna Taylor; BLM; Public Programs).]
Figure A.2: senTopic word clouds.
Topic Validation
If we look at the distribution of each senTopic over time, our five BLM senTopics share the same structure while the others appear random. This can be seen in Figure A.3. Additionally, the pattern mimics the Google Trends structure of “BLM” searches over the same time period, which is shown in Figure 1.
[Figure: 15 panels, one per senTopic (2020 Pres. Election; Family; Anger/Frustration; Music; BLM City News; BLM Police Violence; Covid Believers/Wear Masks; Political Confrontation; Sadness/Nostalgia; Vote General; Pop Culture; Media; BLM George Floyd/Breonna Taylor; BLM General; Public Programs). x-axis: Date (Jun to Oct); y-axis: Percent of Discussion; legend: Class (BLM, Not BLM).]
Figure A.3: Average distribution of senTopics over time.
We also label the tweets originally found when searching for protesters, which therefore include at least one of our BLM-relevant keywords, as protest tweets, and the rest of the tweets an individual user publishes over the summer as timeline tweets. The box-and-whisker plot of the percent BLM topic for each city for these two types of tweets can be seen in Figure A.4. In this way, we use hand-labeled BLM tweets to check their consistency with the unsupervised topic modeling technique. The clear separation between the two groups further increases our confidence in the model.
[Figure: box plots of Percent BLM (0 to 100) for Timeline Tweets and Protest Tweets, by City (Chicago, Houston, Los Angeles).]
Figure A.4: Comparison of protest tweets and other tweets by protesters in an effort to validate BLM measure.
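The comparison behind this figure can be illustrated with a small sketch. It assumes that a tweet's percent BLM is the share of its senTopic probability mass that falls on the BLM senTopics; this definition, the index positions, and the toy tweets are assumptions for illustration and are not taken from the paper's code or data.

```python
from statistics import median

def percent_blm(sentopic_probs, blm_indices):
    """Share (0 to 100) of a tweet's senTopic mass that falls on BLM senTopics."""
    total = sum(sentopic_probs)
    return 100.0 * sum(sentopic_probs[i] for i in blm_indices) / total

def group_medians(tweets, blm_indices):
    """Median percent BLM for each tweet type ('protest' or 'timeline')."""
    by_type = {}
    for tweet_type, probs in tweets:
        by_type.setdefault(tweet_type, []).append(percent_blm(probs, blm_indices))
    return {tweet_type: median(values) for tweet_type, values in by_type.items()}

BLM_INDICES = [0, 1]  # hypothetical positions of the BLM senTopics
toy_tweets = [        # invented distributions over four senTopics
    ("protest", [0.6, 0.3, 0.1, 0.0]),
    ("protest", [0.5, 0.2, 0.2, 0.1]),
    ("timeline", [0.1, 0.0, 0.5, 0.4]),
    ("timeline", [0.0, 0.2, 0.3, 0.5]),
]
medians = group_medians(toy_tweets, BLM_INDICES)
```

A large gap between the two medians, as in this toy example, corresponds to the separation between the box plots for protest tweets and timeline tweets.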
Finally, we selected the 400 tweets with the highest BLM rating, the 200 with the lowest BLM rating, and the 200 closest to 50%. These tweets were then hand coded by four individuals on a scale of 0 to 1 for the degree to which they relate to BLM. A boxplot of the mean hand coding for each tweet can be seen in Figure A.5. These responses show that the unsupervised method is in line with the hand codings done by the four individuals. In addition, Table A.5 reports the correlations among the scores of each coder and the RJST model.
[Figure: box plots of Hand Coding (0.00 to 1.00) by rJST group (0, 0.5, 1).]
Figure A.5: Boxplots of the average hand-coded score for each tweet, grouped by whether the RJST model placed the tweet closest to 0, 50, or 100 percent related to BLM.
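The selection of tweets for hand coding can be sketched as follows. This is a minimal illustration, assuming each tweet has a model score between 0 and 1 and that the three groups are not allowed to overlap; the function name and the toy scores are hypothetical.

```python
def select_for_hand_coding(scores, n_high, n_low, n_mid):
    """Pick tweet ids with the highest, lowest, and closest-to-0.5 scores.

    `scores` maps a tweet id to its model score in [0, 1].
    Tweets already chosen as high or low are excluded from the middle group.
    """
    ranked = sorted(scores, key=scores.get)  # ids in ascending score order
    low = ranked[:n_low]
    high = ranked[len(ranked) - n_high:]
    taken = set(low) | set(high)
    remaining = [t for t in ranked if t not in taken]
    mid = sorted(remaining, key=lambda t: abs(scores[t] - 0.5))[:n_mid]
    return high, low, mid

# Toy scores for eleven tweets (invented for the example).
toy_scores = {tweet_id: tweet_id / 10 for tweet_id in range(11)}
high, low, mid = select_for_hand_coding(toy_scores, n_high=2, n_low=2, n_mid=2)
```

With the sample sizes described in the text, the call would use `n_high=400`, `n_low=200`, and `n_mid=200`.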
            BLM_topic   P1     P2     P3     P4     Avg
BLM_topic   1.00        0.78   0.76   0.61   0.76   0.80
P1          0.78        1.00   0.88   0.76   0.82   0.95
P2          0.76        0.88   1.00   0.72   0.75   0.92
P3          0.61        0.76   0.72   1.00   0.70   0.88
P4          0.76        0.82   0.75   0.70   1.00   0.90
Avg         0.80        0.95   0.92   0.88   0.90   1.00
Table A.5: Correlation matrix between the hand-coded responses of the four people and the RJST model. The unsupervised model correlates with the individual coders about as closely as they correlate with each other. This helps to validate our model and lends support to the conclusions drawn using it.
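A correlation matrix of this kind can be computed as in the sketch below, which uses Pearson correlation. The text does not state which correlation coefficient was used, so Pearson is an assumption, as are the toy scores and the column names other than those shown in the table.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_matrix(columns):
    """Nested dict of pairwise Pearson correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy scores for five tweets (invented; not the paper's data).
toy = {
    "BLM_topic": [0.95, 0.90, 0.50, 0.10, 0.05],  # model score
    "P1": [1.0, 0.75, 0.5, 0.25, 0.0],            # hand coder 1
    "P2": [1.0, 1.0, 0.25, 0.0, 0.0],             # hand coder 2
}
toy["Avg"] = [(a + b) / 2 for a, b in zip(toy["P1"], toy["P2"])]  # mean of the coders
corr = correlation_matrix(toy)
```

Reading `corr["BLM_topic"]["Avg"]` against the coder-to-coder entries mirrors the comparison the table caption makes between the model and the individuals.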