LNCS 3975 - Analyzing Entities and Topics in News Articles Using Statistical Topic Models
Analyzing Entities and Topics in News Articles
Using Statistical Topic Models
David Newman (1), Chaitanya Chemudugunta (1), Padhraic Smyth (1), and Mark Steyvers (2)
(1) Department of Computer Science, UC Irvine, Irvine, CA
{newman, chandra, smyth}@uci.edu
(2) Department of Cognitive Science, UC Irvine, Irvine, CA
msteyver@uci.edu
Abstract. Statistical language models can learn relationships between
topics discussed in a document collection and persons, organizations and
places mentioned in each document. We present a novel combination
of statistical topic models and named-entity recognizers to jointly analyze
entities mentioned (persons, organizations and places) and topics
discussed in a collection of 330,000 New York Times news articles. We
demonstrate an analytic framework which automatically extracts from a
large collection: topics; topic trends; and topics that relate entities.
1 Introduction
The ability to rapidly analyze and understand large sets of text documents is a
challenge across many disciplines. Consider the problem of being given a large set
of emails, reports, technical papers, news articles, and wanting to quickly gain
an understanding of the key information contained in this set of documents. For
example, lawyers frequently need to analyze the contents of very large volumes
of evidence in the form of text documents during the discovery process in legal
cases (e.g., the 250,000 Enron emails that were made available to the US Justice
Department [1]). Similarly, intelligence analysts are faced on a daily basis with
vast databases of intelligence reports from which they would like to quickly
extract useful information.
There is increasing interest in text mining techniques to solve these types of
problems. Supervised learning techniques classify objects such as documents into
predefined classes [2]. While this is useful in certain problems, in many applica-
tions there is relatively little knowledge a priori about what the documents may
contain.
S. Mehrotra et al. (Eds.): ISI 2006, LNCS 3975, pp. 93–104, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Unsupervised learning techniques can extract information from sets of documents
without using predefined categories. Clustering is widely used to group
documents into K clusters, where the characteristics of the clusters are determined
in a data-driven fashion [3]. By representing each document as a vector
of word or term counts (the “bag of words” representation), standard vector-
based clustering techniques can be used, where the learned cluster centers rep-
resent “prototype documents.” Another set of popular unsupervised learning
techniques for document collections are based on matrix approximation meth-
ods, i.e. singular value decomposition (or principal components analysis) of the
document-word count matrix [4,5]. This approach is often referred to as latent
semantic indexing (LSI).
While both clustering and LSI have yielded useful results in terms of reducing
large document collections to lower-dimensional summaries, they each have their
limitations. For example, consider a completely artificial data set where we have
three types of documents with equal numbers of each type: in the first type
each document contains only two words wordA, wordB, in the second type each
document contains only the words wordC, wordD, and the third type contains a
mixture, namely the words wordA, wordB, wordC, wordD.
LSI applied to this toy data produces two latent topic vectors with orthogonal
“directions”: wordA + wordB + wordC + wordD and wordA + wordB - wordC
- wordD. These two vectors do not capture the fact that each of the three types
of documents are mixtures of two underlying “topics”, namely wordA + wordB
and wordC + wordD. This limitation is partly a reflection of the fact that LSI
must use negative values in its basis vectors to represent the data, which is
inappropriate given that the underlying document-word vectors (that we are
representing in a lower-dimensional space) consist of non-negative counts. In
contrast, as we will see later in the paper, probabilistic representations do not
have this problem. In this example the topic model would represent the two
topics wordA + wordB and wordC + wordD as two multinomial probability
distributions, [1/2, 1/2, 0, 0] and [0, 0, 1/2, 1/2], over the four-word vocabulary, capturing
the two underlying topics in a natural manner. Furthermore, the topic model
would correctly estimate that these two topics are used equally often in the data
set, while LSI would estimate that its first topic direction is used approximately
twice as much as its second topic direction.
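The LSI claim for the toy data can be checked by hand. The following is a minimal sketch in plain Python, assuming one document of each type and no centering or term weighting (both assumptions, not details from the text). It verifies that the two quoted directions are eigenvectors of the word-by-word matrix X^T X; the helper names `gram` and `matvec` are illustrative.

```python
# Toy corpus: one document of each type over the vocabulary
# [wordA, wordB, wordC, wordD].
X = [
    [1, 1, 0, 0],  # type 1: wordA, wordB
    [0, 0, 1, 1],  # type 2: wordC, wordD
    [1, 1, 1, 1],  # type 3: all four words
]


def gram(rows):
    """Word-by-word matrix X^T X; its eigenvectors are the LSI directions."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


def matvec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


G = gram(X)
v1 = [1, 1, 1, 1]    # wordA + wordB + wordC + wordD
v2 = [1, 1, -1, -1]  # wordA + wordB - wordC - wordD

# G v = lambda v holds for both vectors, so they are eigenvectors.
lambda1 = matvec(G, v1)[0] / v1[0]
lambda2 = matvec(G, v2)[0] / v2[0]

# Singular values of X are the square roots of these eigenvalues.
singular_ratio = (lambda1 / lambda2) ** 0.5
```

Under these assumptions the eigenvalues are 6 and 2, so the singular values differ by a factor of about 1.7. This is one way to read the statement that the first direction is used roughly twice as much as the second.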
Document clustering techniques, such as k-means and agglomerative cluster-
ing, suffer from a different problem, that of being forced to assume that each
document belongs to a single cluster. If we apply the k-means algorithm to the
toy data above with K = 2 clusters, it will typically find one cluster centered
at [1, 1, 0, 0] and the other at [1/2, 1/2, 1, 1]. It is unable to capture the fact that
there are two underlying topics, corresponding to wordA + wordB and wordC
+ wordD, and that documents of type 3 are a combination of these two
topics.
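The k-means behaviour on the toy data can be reproduced with a few lines of plain Python. This is a minimal sketch of Lloyd's algorithm, assuming one document per type and a hand-picked starting point (both assumptions). A different initialization can converge to the mirror-image solution, which is why the text says "typically".

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, iters=10):
    """Plain Lloyd's k-means from the given starting centers."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            # Hard assignment: each document goes to exactly one cluster.
            nearest = min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members; an empty
        # cluster keeps its previous center.
        centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else c
            for members, c in zip(clusters, centers)
        ]
    return centers


docs = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1]]
centers = kmeans(docs, centers=[[1, 1, 0, 0], [1, 1, 1, 1]])
```

The mixed document is forced into one cluster and pulls that center to [1/2, 1/2, 1, 1]. Neither center equals the wordC + wordD topic on its own.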
A more realistic example of this effect is shown in Table 1. We applied k-
means clustering with K = 100 and probabilistic topic models (to be described
in the next section) with T = 100 topics to a set of 1740 papers from 12 years of
the Neural Information Processing Systems (NIPS) Conference (available on-line at
http://www.cs.toronto.edu/~roweis/data.html). This data set contains
a total of N = 2,000,000 word tokens and a vocabulary size of W = 13,000
unique words. Table 1 illustrates how two different papers were interpreted by
both the cluster model and the probabilistic topic model. The first paper dis-
cussed an analog circuit model for auditory signal processing—in essence it is a
combination of a topic on circuits and a topic on auditory modeling. The paper
is assigned to a cluster which fails to capture either topic very well—shown are
the most likely words in the cluster it was assigned to. The topic model on the
other hand represents the paper as a mixture of topics. The topics are quite
distinct (the highest probability words in each topic are shown), capturing the
fact that the paper is indeed a mixture of different topics. Similarly, the sec-
ond paper was an early paper in bioinformatics, again combining topics that
are somewhat different, such as protein modeling and hidden Markov models.
Again the topic model can separate out these two underlying topics, whereas
the clustering approach assigns the paper to a cluster that is somewhat mixed
in terms of concepts and that does not summarize the semantic content of the
paper very well.
The focus of this paper is to extend our line of research in probabilistic topic
modeling to analyze persons, organizations and places. By combining named
entity recognizers with topic models we illustrate how we can analyze the re-
lationships between these entities (persons, organizations, places) and topics,
using a large collection of news articles.
Table 1. Comparison of topic modeling and clustering

Paper 1: Temporal Adaptation in a Silicon Auditory Nerve (J Lazzaro)
  Abstract: Many auditory theorists consider the temporal adaptation of the auditory nerve a key aspect of speech coding in the auditory periphery. Experiments with models of auditory localization . . . I have designed an analog integrated circuit that models many aspects of auditory nerve response, including temporal adaptation.
  Topic mix:
    [topic 80] analog circuit chip current voltage vlsi figure circuits pulse synapse silicon implementation cmos output mead hardware design
    [topic 33] auditory sound localization cochlear sounds owl cochlea song response system source channels analysis location delay
  Cluster assignment:
    [cluster 8] circuit figure time input output neural analog neuron chip system voltage current pulse signal circuits networks response systems data vlsi

Paper 2: Hidden Markov Models in Molecular Biology: New Algorithms and Applications (P Baldi, Y Chauvin, T Hunkapiller, M McClure)
  Abstract: Hidden Markov Models (HMMs) can be applied to several important problems in molecular biology. We introduce a new convergent learning algorithm for HMMs ... that are trained to represent several protein families including immunoglobulins and kinases.
  Topic mix:
    [topic 10] state hmm markov sequence models hidden states probabilities sequences parameters transition probability training hmms hybrid model likelihood modeling
    [topic 37] genetic structure chain protein population region algorithms human mouse selection fitness proteins search evolution generation function sequence sequences genes
  Cluster assignment:
    [cluster 88] model data models time neural figure state learning set parameters network probability number networks training function system algorithm hidden markov
2 A Brief Review of Statistical Topic Models
The key ideas in a statistical topic model are quite simple and are based on a
probabilistic model for each document in a collection. A topic is a multinomial
probability distribution over the V unique words in the vocabulary of the corpus,
in essence a V -sided die from which we can generate (in a memoryless fashion)
a “bag of words” or a set of word counts for a document. Thus, each topic t is
a probability vector, $p(w \mid t) = [p(w_1 \mid t), \ldots, p(w_V \mid t)]$, where $\sum_{v} p(w_v \mid t) = 1$, and
there are T topics in total, 1 ≤ t ≤ T.
A document is represented as a finite mixture of the T topics. Each document
d, 1 ≤ d ≤ N, is assumed to have its own set of mixture coefficients,
$[p(t = 1 \mid d), \ldots, p(t = T \mid d)]$, a multinomial probability vector such that $\sum_{t} p(t \mid d) = 1$.
Thus, a randomly selected word from document d has a conditional distribution
p(w|d) that is a mixture over topics, where each topic is a multinomial over
words:
$$p(w \mid d) = \sum_{t=1}^{T} p(w \mid t)\, p(t \mid d).$$
If we were to simulate W words for document d using this model we would
repeat the following pair of operations W times: first, sample a topic t according
to the distribution p(t|d), and then sample a word w according to the distribution
p(w|t).
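The generative process just described can be sketched in a few lines of Python. The function names `mixture` and `simulate_document`, the four-word vocabulary, and the topic values (taken from the toy example in the introduction) are illustrative assumptions, not part of any published implementation.

```python
import random


def mixture(p_w_given_t, p_t_given_d):
    """p(w|d) = sum_t p(w|t) p(t|d), for every word in the vocabulary."""
    vocab_size = len(p_w_given_t[0])
    return [
        sum(p_w_given_t[t][v] * p_t_given_d[t] for t in range(len(p_t_given_d)))
        for v in range(vocab_size)
    ]


def simulate_document(vocab, p_w_given_t, p_t_given_d, n_words, rng):
    """Repeat n_words times: sample a topic from p(t|d), then a word from p(w|t)."""
    words = []
    for _ in range(n_words):
        t = rng.choices(range(len(p_t_given_d)), weights=p_t_given_d)[0]
        w = rng.choices(vocab, weights=p_w_given_t[t])[0]
        words.append(w)
    return words


vocab = ["wordA", "wordB", "wordC", "wordD"]
topics = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

# A document that mixes the two topics equally puts equal mass on all words.
p_w_given_d = mixture(topics, [0.5, 0.5])

# A document that uses only the first topic can only emit wordA and wordB.
doc = simulate_document(vocab, topics, [1.0, 0.0], n_words=20, rng=random.Random(0))
```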
Given this forward or generative model for a set of documents, the next step
is to learn the topic-word and document-topic distributions given observed data.
There has been considerable progress on learning algorithms for these types of
models in recent years. Hofmann [6] proposed an EM algorithm for learning in
this context using the name “probabilistic LSI” or pLSI. Blei, Ng and Jordan [7]
addressed some of the limitations of the pLSI approach (such as the tendency to
overfit) and recast the model and learning framework in a more general Bayesian
setting. This framework is called Latent Dirichlet allocation (LDA), essentially a
Bayesian version of the model described above, and the accompanying learning
algorithm is based on an approximation technique known as variational learning.
An alternative, and efficient, estimation algorithm based on Gibbs sampling was
proposed by Griffiths and Steyvers [9], a technique that is closely related to
earlier ideas derived independently for mixture models in statistical genetics
[10]. Since the Griffiths and Steyvers paper was published in 2004, a number of
different groups have successfully applied the topic model with Gibbs sampling
to a variety of large corpora, including large collections of Web documents [11],
a collection of 250,000 Enron emails [12], 160,000 abstracts from the CiteSeer
computer science collection [13], and 80,000 news articles from the 18th-century
Pennsylvania Gazette [14]. A variety of extensions to the basic topic model have
also been developed, including author-topic models [15], author-role-topic models
[12], topic models for images and text [16,7], and hidden-Markov topic models
for separating semantic and syntactic topics [17].
In this paper all of the results reported were obtained using the topic model
outlined above with Gibbs sampling, as described originally in [9]. Our description
of the model and the learning algorithm is necessarily brief; for a more detailed
tutorial introduction, we refer the reader to [18].
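To make the learning step concrete, here is a minimal sketch of a collapsed Gibbs sampler for this kind of topic model, written in plain Python. It follows the commonly used update in which a token's topic is resampled in proportion to (word-topic count + beta) / (topic total + V beta) times (document-topic count + alpha). The symmetric smoothing values `alpha` and `beta`, the iteration count, and all function names are assumptions for illustration. This is not the code used for the experiments in this paper.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling sketch.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (z, n_wt, n_dt): per-token topic assignments and count tables.
    """
    rng = random.Random(seed)
    n_wt = [[0] * num_topics for _ in range(vocab_size)]  # word-topic counts
    n_dt = [[0] * num_topics for _ in docs]               # document-topic counts
    n_t = [0] * num_topics                                # tokens per topic
    z = []
    # Start from a uniformly random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(z_d)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[d][i]
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_wt[w][k] + beta) / (n_t[k] + vocab_size * beta)
                    * (n_dt[d][k] + alpha)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt


def topic_word_probs(n_wt, beta=0.01):
    """Smoothed estimate of p(w|t) from the word-topic counts."""
    vocab_size, num_topics = len(n_wt), len(n_wt[0])
    totals = [sum(n_wt[w][t] for w in range(vocab_size)) for t in range(num_topics)]
    return [
        [(n_wt[w][t] + beta) / (totals[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]


# Toy corpus with word ids 0..3 standing for wordA..wordD.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, n_wt, n_dt = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
p_w_given_t = topic_word_probs(n_wt)
```

The sampler keeps only integer count tables and the assignment vector z. The same z values are what the trend analysis later in the paper counts per month.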
3 Data Set
To analyze entities and topics, we required a text dataset that was rich in en-
tities including persons, organizations and locations. News articles are ideal be-
cause they have the primary purpose of conveying information about who, what,
when and where. We used a collection of New York Times news articles taken
from the Linguistic Data Consortium’s English Gigaword Second Edition corpus
(www.ldc.upenn.edu). We used all articles of type “story” from 2000 through
2002, resulting in 330,000 separate articles spanning three years. These include
articles from the NY Times daily newspaper publication as well as a sample of
news from other urban and regional US newspapers.
We automatically extracted named entities (i.e. proper nouns) from each
article using a named entity recognition tool. We evaluated two
tools: GATE’s Information Extraction system ANNIE (gate.ac.uk), and
Coburn’s Perl Tagger (search.cpan.org/~acoburn/Lingua-EN-Tagger). ANNIE is
rule-based and makes extensive use of gazetteers, while Coburn’s tagger is based
on Brill’s HMM part-of-speech tagger [19]. ANNIE tends to be more conservative
in identifying a proper noun. For this paper, entities were extracted using
Coburn’s tagger. For this 2000-2002 period, the most frequently mentioned people
were: George Bush; Al Gore; Bill Clinton; Yasser Arafat; Dick Cheney and
John McCain. In total, more than 100,000 unique persons, organizations and
locations were extracted. We filtered out 40,000 infrequently occurring entities
by requiring that an entity occur in at least ten different news articles, leaving
60,000 entities in the dataset.
After tokenization and removal of stopwords, the vocabulary of unique words
was also filtered by requiring that a word occur in at least ten different news
articles. We produced a final dataset containing 330,000 documents, a vocabulary
of 40,000 unique words, a list of 60,000 entities, and a total of 110 million word
tokens. After this processing, entities occur at the rate of 1 in 6 words (not
counting stopwords).
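The frequency filter described above (keep a word or entity only if it occurs in at least ten different articles) is a document-frequency threshold. A minimal sketch follows, assuming each article is already tokenized into a list of strings. The function name and the `min_docs` parameter are illustrative.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_docs=10):
    """Keep only terms that occur in at least min_docs different documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() ensures a term is counted once per document, however
        # often it is repeated inside that document.
        doc_freq.update(set(doc))
    keep = {term for term, n in doc_freq.items() if n >= min_docs}
    filtered = [[term for term in doc if term in keep] for doc in docs]
    return filtered, keep


# Tiny demonstration with a threshold of 2 instead of 10.
articles = [
    ["LAKERS", "game", "game", "coach"],
    ["LAKERS", "season", "game"],
    ["OSCAR", "film", "film", "film"],
]
filtered, kept = filter_by_document_frequency(articles, min_docs=2)
```

Note that "film" is dropped even though it appears three times, because all three occurrences are in the same article.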
4 Experiments
In this section we present the results from a T = 400 topic model run on the three
years of NY Times news articles. After showing some topics and topic trends,
we show how the model reveals topical information about particular entities,
and relationships between entities. Note that entities are just treated as regular
words in the learning of the topic models, and the topic-word distributions are
separated out into entity and non-entity components as a postprocessing step.
Models that treat entity and non-entity words differently are also of interest, but
are beyond the scope of this paper.
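The postprocessing step that separates a topic into entity and non-entity components can be sketched as follows. This assumes the topic is available as a mapping from token to probability and that the set of entity tokens is known. Whether the two components are renormalized afterwards is not stated here, so this sketch leaves the probabilities unchanged (an assumption). The toy values are borrowed from the Basketball topic in Fig. 1 purely for illustration.

```python
def split_topic(p_token_given_t, entities, top_k=3):
    """Return the top_k most likely non-entity words and entities of one topic."""
    ranked = sorted(p_token_given_t.items(), key=lambda kv: kv[1], reverse=True)
    words = [(tok, p) for tok, p in ranked if tok not in entities][:top_k]
    ents = [(tok, p) for tok, p in ranked if tok in entities][:top_k]
    return words, ents


# Entities and ordinary words share one distribution during learning.
basketball = {
    "LAKERS": 0.062,
    "team": 0.028,
    "PHIL-JACKSON": 0.019,
    "play": 0.015,
    "game": 0.013,
    "NBA": 0.013,
}
entity_set = {"LAKERS", "PHIL-JACKSON", "NBA"}
top_words, top_entities = split_topic(basketball, entity_set, top_k=2)
```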
4.1 Topics and Topic Trends
Upon completion of a topic model run, the model saves data to compute the
likelihood of words and entities in a topic, p(w|t) and p(e|t), the mix of topics in
each document, p(t|d), and z_i, the topic assigned to the i-th word in the corpus.
For each topic, we print out the most likely words and most likely entities. We
then review the list of words and entities to come up with a human-assigned topic
label that best summarizes or captures the nature of the topic. It is important
to point out that these topic labels are created after the model is run; they are
not a priori defined as fixed or static subject headings.
Unsurprisingly, our three years of NY Times articles include a wide range of topics:
from renting apartments in Brooklyn to diving in Hawaii; from Tiger Woods
to PETA liberating tigers; from voting irregularities to dinosaur bones. From
a total of 400 diverse topics, we selected a few to highlight. Figure 1 shows
four seasonal topics which we labeled Basketball, Tour de France, Holidays and
Oscars. Each of these topics shows a neat division within the topic of what
(the words in lowercase), and who and where (the entities in uppercase). The
Basketball topic appears to focus on the Lakers; the Tour de France topic tells
us that it’s all about Lance Armstrong; Barbie trumps the Grinch in Holidays;
and Denzel Washington most likely had a good three years in 2000-2002.
Basketball
  words: team 0.028, play 0.015, game 0.013, season 0.012, final 0.011, games 0.011, point 0.011, series 0.011, player 0.010, coach 0.009, playoff 0.009, championship 0.007, playing 0.006, win 0.006
  entities: LAKERS 0.062, SHAQUILLE-O-NEAL 0.028, KOBE-BRYANT 0.028, PHIL-JACKSON 0.019, NBA 0.013, SACRAMENTO 0.007, RICK-FOX 0.007, PORTLAND 0.006, ROBERT-HORRY 0.006, DEREK-FISHER 0.006
Tour de France
  words: tour 0.039, rider 0.029, riding 0.017, bike 0.016, team 0.016, stage 0.014, race 0.013, won 0.012, bicycle 0.010, road 0.009, hour 0.009, scooter 0.008, mountain 0.008, place 0.008
  entities: LANCE-ARMSTRONG 0.021, FRANCE 0.011, JAN-ULLRICH 0.003, LANCE 0.003, U-S-POSTAL-SERVICE 0.002, MARCO-PANTANI 0.002, PARIS 0.002, ALPS 0.002, PYRENEES 0.001, SPAIN 0.001
Holidays
  words: holiday 0.071, gift 0.050, toy 0.023, season 0.019, doll 0.014, tree 0.011, present 0.008, giving 0.008, special 0.007, shopping 0.007, family 0.007, celebration 0.007, card 0.007, tradition 0.006
  entities: CHRISTMAS 0.058, THANKSGIVING 0.018, SANTA-CLAUS 0.009, BARBIE 0.004, HANUKKAH 0.003, MATTEL 0.003, GRINCH 0.003, HALLMARK 0.002, EASTER 0.002, HASBRO 0.002
Oscars
  words: award 0.026, film 0.020, actor 0.020, nomination 0.019, movie 0.015, actress 0.011, won 0.011, director 0.010, nominated 0.010, supporting 0.010, winner 0.008, picture 0.008, performance 0.007, nominees 0.007
  entities: OSCAR 0.035, ACADEMY 0.020, HOLLYWOOD 0.009, DENZEL-WASHINGTON 0.006, JULIA-ROBERT 0.005, RUSSELL-CROWE 0.005, TOM-HANK 0.005, STEVEN-SODERBERGH 0.004, ERIN-BROCKOVICH 0.003, KEVIN-SPACEY 0.003
Fig. 1. Selected seasonal topics from a 400-topic run of the NY Times dataset. In each
topic we first list the most likely words in the topic, with their probability, and then
the most likely entities (in uppercase). The title above each list is a human-assigned
topic label.

Figure 2 shows four “event” topics which we labeled September 11 Attacks,
FBI Investigation, Harry Potter/Lord of the Rings, and DC Sniper. This Sept.
11 topic (one of several topics that discuss the terrorist attacks of Sept. 11)
is clearly about the breaking news. It discusses what and where, but not who
(i.e. no mention of Bin Laden). The FBI Investigation topic lists 9/11 hijackers
Mohamed Atta and Hani Hanjour. The Harry Potter/Lord of the Rings topic
combines these same-genre runaway successes, and the DC Sniper topic shows
specific details about John Muhammad and Lee Malvo, including that they were
in a white van.

September 11 Attacks
  words: attack 0.033, tower 0.025, firefighter 0.020, building 0.018, worker 0.013, terrorist 0.012, victim 0.012, rescue 0.012, floor 0.011, site 0.009, disaster 0.008, twin 0.008, ground 0.008, center 0.008, fire 0.007, plane 0.007
  entities: WORLD-TRADE-CTR 0.035, NEW-YORK-CITY 0.020, LOWER-MANHATTAN 0.005, PENTAGON 0.005, PORT-AUTHORITY 0.003, RED-CROSS 0.002, NEW-JERSEY 0.002, RUDOLPH-GIULIANI 0.002, PENNSYLVANIA 0.002, CANTOR-FITZGERALD 0.001
FBI Investigation
  words: agent 0.029, investigator 0.028, official 0.027, authorities 0.021, enforcement 0.018, investigation 0.017, suspect 0.015, found 0.014, police 0.014, arrested 0.012, search 0.012, law 0.011, arrest 0.011, case 0.010, evidence 0.009, suspected 0.008
  entities: FBI 0.034, MOHAMED-ATTA 0.003, FEDERAL-BUREAU 0.001, HANI-HANJOUR 0.001, ASSOCIATED-PRESS 0.001, SAN-DIEGO 0.001, U-S 0.001, FLORIDA 0.001
Harry Potter/Lord Rings
  words: ring 0.050, book 0.015, magic 0.011, series 0.007, wizard 0.007, read 0.007, friend 0.006, movie 0.006, children 0.006, part 0.005, secret 0.005, magical 0.005, kid 0.005, fantasy 0.005, fan 0.004, character 0.004
  entities: HARRY-POTTER 0.024, LORD OF THE RING 0.013, STONE 0.007, FELLOWSHIP 0.005, CHAMBER 0.005, SORCERER 0.004, PETER-JACKSON 0.004, J-K-ROWLING 0.004, TOLKIEN 0.004, HOGWART 0.002
DC Sniper
  words: sniper 0.024, shooting 0.019, area 0.010, shot 0.009, police 0.007, killer 0.006, scene 0.006, white 0.006, victim 0.006, attack 0.005, case 0.005, left 0.005, public 0.005, suspect 0.005, killed 0.005, car 0.005
  entities: WASHINGTON 0.053, VIRGINIA 0.019, MARYLAND 0.013, D-C 0.012, JOHN-MUHAMMAD 0.008, BALTIMORE 0.006, RICHMOND 0.006, MONTGOMERY-CO 0.005, MALVO 0.005, ALEXANDRIA 0.003
Fig. 2. Selected event topics from a 400-topic run of the NY Times dataset. In each
topic we first list the most likely words in the topic, with their probability, and then
the most likely entities (in uppercase). The title above each list is a human-assigned
topic label.
What year had the most discussion of the Tour de France? Is interest in football
declining? What was the lifetime of the Elian Gonzalez story? These questions
can be answered by examining the time trends in the topics. These trends are
easily computed by counting the topic assignments z_i of each word in each
time period (monthly). Figure 3 uses the topics already presented plus additional
topics to show some seasonal/periodic time trends and event time trends. We
see from the trends on the left that Basketball reaches about 30,000 words per month in May; discussions
of football are increasing; 2001 was a relatively quiet year for the Oscars; but
2001 had the most buzz over quarterly earnings. The trends on the right, on the
other hand, show some very peaked events: from Elian Gonzalez in April 2000,
through the September 11 Attacks in 2001, to the DC Sniper killing spree and the
collapse of Enron in 2002.
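Counting topic assignments per month is a simple tally. The sketch below assumes each word token is available as a (month, topic) pair, where the month comes from the article's date and the topic is that token's z_i. The function name and the input layout are illustrative assumptions.

```python
from collections import Counter


def topic_trends(tokens):
    """Count word tokens per (month, topic).

    tokens: iterable of (month, topic) pairs, one pair per word token.
    Returns {topic: {month: count}}.
    """
    counts = Counter(tokens)
    trends = {}
    for (month, topic), n in counts.items():
        trends.setdefault(topic, {})[month] = n
    return trends


# Toy input: five tokens from May 2000 and two from July 2000.
toy_tokens = [
    ("2000-05", "Basketball"),
    ("2000-05", "Basketball"),
    ("2000-05", "Basketball"),
    ("2000-05", "Oscars"),
    ("2000-05", "Tour de France"),
    ("2000-07", "Tour de France"),
    ("2000-07", "Tour de France"),
]
trends = topic_trends(toy_tokens)
```

Dividing the counts by 1,000 gives values in kwords per month, the unit used for the trend curves.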
[Figure 3: fourteen panels of monthly topic trends, Jan 2000 to Jan 2003, vertical axes in kwords. Left column: Basketball, Pro-Football, College-Football, Tour-de-France, Holidays, Oscars, Quarterly-Earnings. Right column: Elian-Gonzalez, Sept-11-Attacks, FBI-Investigation, Anthrax, Harry-Potter/Lord-Rings, DC-Sniper, Enron.]
Fig. 3. Selected topic-trends from a 400-topic run of the NY Times dataset. Seasonal/periodic
topics are shown on the left, and event topics are shown on the right.
Each curve shows the number of words (in thousands) assigned to that topic for the
month (on average there are 9,000 articles written per month containing 3 million
words, so if the 400 topics were equally likely there would be about 8 kwords per topic per
month). The topic words and entities for Basketball, Tour de France, Holidays and Oscars
are given in Figure 1, and Sept 11 Attacks, FBI Investigation, Harry Potter/Lord
of the Rings and DC Sniper are given in Figure 2.

4.2 Entity-Entity Relationships
We use the topic model to determine topic-based entity-entity relationships.
Social networks created from co-mentions cannot link two entities that were
never co-mentioned, whereas our topic-based approach can potentially link
such a pair. A link is created when
the entity-entity “affinity”, defined as (p(e1|e2) + p(e2|e1))/2, is above some
threshold. The graph in Figure 4, constructed using this entity-entity affinity,
was created in two steps. First we selected key entities (e.g. Yasser Arafat, Saddam
Hussein, Osama Bin Laden, Zacarias Moussaoui, Vladimir Putin, Ariel
Sharon, the Taliban) and determined what other entities had some level of
affinity to these. We then took this larger list of 100 entities, computed all
10,000 entity-entity affinities, and thresholded the result to produce the graph.
It is possible to annotate each link with the topics that most contribute to the
relationship, and beyond that, the original documents that most contribute to
that topic.

[Figure 4: network graph. Node labels: BARAK, ANWAR_SADAT, ARIEL_SHARON, ASSAD, DENNIS_ROSS, HEZBOLLAH, KING_ABDULLAH, KING_HUSSEIN, LEONID_KUCHMA, BORIS_YELTSIN, NORTHERN_ALLIANCE, ABDUL_HAQ, ABDULLAH_ABDULLAH, BURHANUDDIN_RABBANI, DOSTUM, GULBUDDIN_HEKMATYAR, HAMID_KARZAI, HEATHER_MERCER, ISLAM_KARIMOV, ISMAIL_KHAN, LAKHDAR_BRAHIMI, MASSOUD, MOHAMMAD_ZAHIR, MULLAH_OMAR, BIN_LADEN, ABU_ZUBAYDAH, ALI_MOHAMED, AL_QAEDA, EL_HAGE, JEMAAH_ISLAMIYAH, LOTFI_RAISSI, MOKHTAR_HAOUARI, RAED_HIJAZI, RAMZI_BINALSHIBH, SADDAM_HUSSEIN, KHIDHIR_HAMZA, RICHARD_PERLE, SALEH, SCOTT_RITTER, SHAS, SHIRZAI, TALIBAN, RED_ARMY, VLADIMIR_GUSINSKY, VLADIMIR_PUTIN, ANDREI_BABITSKY, ASLAN_MASKHADOV, BORIS_BEREZOVSKY, EDMOND_POPE, EDUARD_SHEVARDNADZE, IGOR_IVANOV, LENIN, MIKHAIL_GORBACHEV, NIKITA_KHRUSHCHEV, SERGEI_IVANOV, YASSER_ARAFAT, ANTHONY_ZINNI, BEN_ELIEZER, CROWN_PRINCE_ABDULLAH, FATAH, HAMAS, JIBRIL_RAJOUB, LIKUD_PARTY, MARWAN_BARGHOUTI, MOHAMMED_DAHLAN, NABIL_SHAATH, NETANYAHU, PERES, PLO, SAEB_EREKAT, SECRETARY_POWELL, TANZIM, YITZHAK_RABIN, ZACARIAS_MOUSSAOUI, AL_HAZMI, FRANK_LINDH, JAMES_BROSNAHAN, JOHN_WALKER_LINDH, JOSE_PADILLA, ZADRAN, ZAWAHIRI.]
Fig. 4. Social network showing topic-model-based relationships between entities. A link
is present when the entity-entity “affinity” (p(e1|e2)+p(e2|e1))/2 is above a threshold.
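The affinity computation can be sketched as below. The text defines the affinity but does not spell out how the conditional probability of one entity given another is obtained from the model. This sketch assumes it is computed through the topics, as the sum over topics of p(e1|t) p(t|e2), with p(t|e2) obtained from p(e2|t) and a topic prior p(t) by Bayes' rule. That decomposition, the function names, and the toy numbers are all assumptions for illustration.

```python
def conditional(e1, e2, p_e_given_t, p_t):
    """Assumed form: p(e1|e2) = sum_t p(e1|t) p(t|e2), with p(t|e2) by Bayes' rule."""
    joint = [p_e_given_t[t].get(e2, 0.0) * p_t[t] for t in range(len(p_t))]
    norm = sum(joint)
    if norm == 0.0:
        return 0.0  # e2 has no probability under any topic
    return sum(
        p_e_given_t[t].get(e1, 0.0) * joint[t] / norm for t in range(len(p_t))
    )


def affinity(e1, e2, p_e_given_t, p_t):
    """Symmetric affinity (p(e1|e2) + p(e2|e1)) / 2."""
    return 0.5 * (
        conditional(e1, e2, p_e_given_t, p_t) + conditional(e2, e1, p_e_given_t, p_t)
    )


def affinity_graph(entities, p_e_given_t, p_t, threshold):
    """Link every pair of entities whose affinity exceeds the threshold."""
    links = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            if affinity(a, b, p_e_given_t, p_t) > threshold:
                links.append((a, b))
    return links


# Toy model: two equally likely topics; A and B share a topic, C does not.
toy_p_e_given_t = [{"A": 0.5, "B": 0.5}, {"C": 1.0}]
toy_p_t = [0.5, 0.5]
links = affinity_graph(["A", "B", "C"], toy_p_e_given_t, toy_p_t, threshold=0.1)
```

Because the link goes through shared topics, A and B are connected here without any notion of appearing in the same document.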
A related but slightly different representation is shown in the bipartite graph
showing relationships between entities and topics in Figure 5. A link is present
when the likelihood of an entity in a particular topic, p(e|t), is above a threshold.
This graph was created by selecting 15 entities from the graph shown in Figure 4,
computing all 15 × 400 entity-given-topic probabilities, and thresholding
the result to plot links. Again, with this bipartite graph, the original documents
associated with each topic can be retrieved.

[Figure 5: bipartite graph. Topic nodes: Muslim_Militance, Mid_East_Conflict, Palestinian_Territories, Pakistan_Indian_War, FBI_Investigation, Detainees, Mid_East_Peace, US_Military, Religion, Terrorist_Attacks, Afghanistan_War. Entity nodes: AL_QAEDA, HAMID_KARZAI, MOHAMMED, MOHAMMED_ATTA, NORTHERN_ALLIANCE, BIN_LADEN, TALIBAN, ZAWAHIRI, YASSER_ARAFAT, EHUD_BARAK, ARIEL_SHARON, HAMAS, AL_HAZMI, KING_HUSSEIN, KING_ABDULLAH.]
Fig. 5. Bipartite graph showing topic-model-based relationships between entities and
topics. A link is present when the likelihood of an entity in a particular topic p(e|t) is
above a threshold.
5 Conclusions
Statistical language models, such as probabilistic topic models, can play an im-
portant role in the analysis of large sets of text documents. Representations
based on probabilistic topics go beyond clustering models because they allow
the expression of multiple topics per document. An additional advantage is that
the topics extracted by these models are invariably interpretable, facilitating the
analysis of model output, in contrast to the uninterpretable directions produced
by LSI.
We have applied standard entity recognizers to extract names of people, orga-
nizations and locations from a large collection of New York Times news articles.
Probabilistic topic models were applied to learn the latent structure behind these
named entities and other words that are part of documents, through a set of in-
terpretable topics. We showed how the relative contributions of topics changed
over time, in lockstep with major news events. We also showed how the model
was able to automatically extract social networks from documents by connecting
persons to other persons through shared topics. The social networks produced in
this way are different from social networks produced by co-reference data where
persons are connected only if they co-appear in documents. One advantage over
these co-reference methods is that a set of topics can be used as labels to explain
why two people are connected. Another advantage is that the model leverages
the latent structure between the other words present in documents to better estimate
the latent structure between entities. This research has shown the benefits
of applying simple statistical language models to understand the latent structure
between entities.
Acknowledgements
Thanks to Arthur Asuncion and Jason Sellers for their assistance. This material
is based upon work supported by the National Science Foundation under award
number IIS-0083489 (as part of the Knowledge Discovery and Dissemination
Program) and under award number ITR-0331707.
References
1. Klimt, B., and Yang, Y.: A New Dataset for Email Classification Research. 15th
European Conference on Machine Learning (2004)
2. Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Journal
of Information Retrieval, Vol. 1 (1999) 67–88
3. Chakrabarti, S: Mining the Web: Discovering Knowledge from Hypertext Data.
Morgan Kaufmann Publishers (2002)
4. Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.:
Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science,
41(6) (1990) 391–407
5. Berry, M.W., Dumais, S.T., O’Brien G.W.: Using Linear Algebra for Intelligent
Information Retrieval. SIAM Review 37 (1994) 573–595
6. Hofmann, T.: Probabilistic Latent Semantic Indexing. 22nd Int’l. Conference on
Research and Development in Information Retrieval (1999)
7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. Journal of Machine
Learning Research, 3 (2003) 993–1022
8. Minka, T., and Lafferty, J.: Expectation-Propagation for the Generative Aspect Model.
18th Conference on Uncertainty in Artificial Intelligence (2002)
9. Griffiths, T.L., and Steyvers, M.: Finding Scientific Topics. Proceedings of the National
Academy of Sciences, 101 (suppl. 1) (2004) 5228–5235
10. Pritchard, J.K., Stephens, M., Donnelly, P.: Inference of Population Structure using
Multilocus Genotype Data. Genetics 155 (2000) 945–959
11. Buntine, W., Perttu, S., Tuulos, V.: Using Discrete PCA on Web Pages. Proceedings
of the Workshop W1 on Statistical Approaches for Web Mining (SAWM).
Italy (2004) 99–110
12. McCallum, A., Corrada-Emmanuel, A., Wang, X.: Topic and Role Discovery in
Social Networks. 19th Joint Conference on Artificial Intelligence (2005)
13. Steyvers, M., Smyth, P., Rosen-Zvi, M., Griffiths, T.: Probabilistic Author-Topic
Models for Information Discovery. 10th ACM SIGKDD (2004)
14. Newman, D. J., and Block, S.: Probabilistic Topic Decomposition of an Eighteenth-
Century Newspaper. Journal American Society for Information Science and Tech-
nology (2006)
15. Rosen-Zvi, M., Griffiths, T., Steyvers, M., Smyth, P.: The Author-Topic Model for
Authors and Documents. 20th Int’l. Conference on Uncertainty in AI (2004)
16. Blei, D., and Jordan, M.: Modeling Annotated Data. 26th International ACM
SIGIR (2003) 127-134
17. Griffiths, T., Steyvers, M., Blei, D. M., Tenenbaum, J. B.: Integrating Topics and
Syntax. Advances in Neural Information Processing Systems, 17 (2004)
18. Steyvers, M., and Griffiths, T.L.: Probabilistic Topic Models. T. Landauer et al.
(eds), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum (2006)
19. Brill E.: Some Advances in Transformation-Based Part of Speech Tagging. National
Conference on Artificial Intelligence (1994)