Interim Review of TextAnalyst from MicroSystems, Ltd.
Colleen E. Crangle
for Tecomac, Inc.
April 5, 2001
Recommendation
TextAnalyst’s approach to summarization is the standard one; that is, the summary
of a document consists of sentences extracted from the document. However, at
the core of TextAnalyst is a method for building a semantic
network to represent the content of a document. This core technology can be
used for far more than summarization. It provides a form of concept-based
indexing and it allows automatic document clustering and categorization.
TextAnalyst’s core technology is remarkably similar to CognIT’s,
but it appears to have a more sound scientific basis.
The company also provides a software development kit (SDK), which permits rapid
integration of the TextAnalyst functions into
Tecomac’s technology.
Based on this interim review and the preliminary April 3, 2001 review by Drs. Kaufmann and Arkadov, I strongly recommend that Tecomac investigate the
company further and pursue a collaboration.
Further Details on the Core Technology
TextAnalyst performs the following steps in creating a semantic network to
represent the content of a document:
1. It identifies the significant words or phrases in the document. These are called text concepts. As far as I can tell, it measures significance mainly by frequency of occurrence.
2. It determines the relative importance of a text concept by analyzing its connections to other concepts in the text. As far as I can tell, it mainly uses co-occurrence in sentences to measure the strength of the link between two text concepts. That is, if two words occur in the same sentence, the strength of the link between those two words is increased. So if two high-frequency words are strongly linked, for example, their relative importance is high. Or if every occurrence of a low-frequency word is in a sentence with a high-frequency word, its relative importance is increased. And so on.
3. It adjusts the strengths of the links based on the information gained in step 2 (using neural nets, but I’m not sure of the details or the exact technique used).
4. It creates a semantic network consisting of the linked text concepts, with the adjusted link strengths computed in step 3. Each node in the network has two numbers associated with it. The first number gives the weight, or strength, of the semantic relationship of the word at that node to the word at its parent node. The second number gives the semantic weight of the word relative to the entire text. Assuming these numbers accurately represent something in the document, the network is a rich source of information for further processing.
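The steps above can be made concrete with a small sketch. It is an illustration only, not TextAnalyst's actual algorithm: the stopword list, the `min_freq` cutoff, and the 0-100 scaling are assumptions made here, and the neural-net adjustment of step 3 is replaced by a plain sum of frequency and link strength because its details are not known.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical, minimal stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "that", "it", "its", "for", "on", "by", "if"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_network(text, min_freq=2):
    # Each sentence is treated as a plain list of words, with no syntactic analysis.
    sentences = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in split_sentences(text)
    ]
    # Step 1: text concepts are the words occurring at least min_freq times.
    freq = Counter(w for words in sentences for w in words)
    concepts = {w for w, n in freq.items() if n >= min_freq}
    # Step 2: every sentence shared by two concepts strengthens their link.
    links = Counter()
    for words in sentences:
        for a, b in combinations(sorted(set(words) & concepts), 2):
            links[(a, b)] += 1
    # Steps 3-4 (simplified stand-in): a concept's weight is its frequency
    # plus the strength of all its links, scaled so the top concept gets 100.
    raw = {c: freq[c] for c in concepts}
    for (a, b), n in links.items():
        raw[a] += n
        raw[b] += n
    top = max(raw.values(), default=1)
    weights = {c: round(100 * v / top) for c, v in raw.items()}
    return weights, links
```

Here `weights` plays the role of the second number at each node (weight relative to the entire text) and `links` approximates the first (strength of the relationship between two concepts). Anything beyond raw counting, such as the link adjustment in step 3, would replace the simple sum in the last stage.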
The value of this approach to document
analysis is that it allows you to build a network that implicitly has semantic
content, without using background knowledge of the subject. It is based on the
assumption that the text itself contains a semantic model of its content, and
that you “get at” that model by analyzing:
• the choice of words in the text, and
• the words’ co-occurrences within sentences.
As far as I can tell, TextAnalyst
simply treats a sentence as a list of words. It does not do any
syntactic or other processing on the sentence that would allow a main subject,
for instance, to be identified. In this, it has less power than CognIT’s approach which, I believe, does some analysis of
semantic or functional “roles.” This
difference may account for the better summaries CognIT
may produce.
Nonetheless, even if TextAnalyst
does not now do anything more than treat a sentence as a list of words, there
is plenty of scope for doing more. The only constraint is speed. There is no
shortage of techniques to mine the further information encoded in a
natural-language sentence.
The Summarizer in TextAnalyst
Sentences are selected for the summary
based on concepts that are extracted from the text and the relationships
between those concepts. The semantic network constructed for the document
contains these concepts and relations.
I performed several test runs of the TextAnalyst summarizer.
Here are some limitations I noted in the performance of the summarizer:
• The program does not yet do an adequate job of detecting the end of a sentence. It thought the following sentence ended after “Talk City Inc.”, for instance:
``The whole Internet market crashed down, and we're rolling with it,'' says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn't boost its stock price
CognIT has similar spotty performance.
• TextAnalyst does not use a thesaurus to catch the relationships between words such as “firm” and “company” that in the context of the document are synonymous. CognIT has a similar shortcoming. This omission can reduce the robustness of the summarizer and, in fact, of all semantic processing such as topic identification. Fortunately, there is room even in the current architecture for including a thesaurus. TextAnalyst makes use of a language-dependent dictionary. It can be accessed via the VocEdit application, a dictionary program that TextAnalyst uses in certain circumstances.
• The program sometimes correctly identifies multi-word terms such as “stock price,” but not always. CognIT has similar spotty performance.
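To illustrate the sentence-boundary limitation, here is a minimal sketch of a splitter that avoids breaking after abbreviations such as “Inc.” or “Jan.”. It is not how TextAnalyst or CognIT detects sentence ends; the abbreviation list is a hypothetical, deliberately short one, and the whitespace tokenization is an assumption made for brevity.

```python
# Hypothetical abbreviation list; a production list would be far longer.
ABBREVIATIONS = {"inc.", "ltd.", "corp.", "co.", "mr.", "ms.", "dr.",
                 "jan.", "dec."}

def split_sentences(text):
    """Split on whitespace-delimited tokens that end a sentence."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        core = token.rstrip("'\"")  # look past closing quote marks
        # A token such as "Inc.," ends in a comma, so it never counts as a
        # boundary; a token such as "Inc." is skipped via the abbreviation list.
        if core.endswith((".", "!", "?")) and core.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Because it works on whole tokens, the sketch also leaves names with internal periods such as PlanetRx.com intact. Its obvious weakness is a sentence that genuinely ends in an abbreviation, which it would merge with the next sentence.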
To discuss the summarizer in more detail,
here are several test runs on a newspaper article that I have been using for
comparison testing on other summarizers.
Published Wednesday, Jan. 24, 2001, in the San Jose Mercury News
20 area firms face delisting
by Nasdaq
BY MATT MARSHALL
Mercury News
It's
the company version of the pink slip in the mail -- get your act together, or you're fired from Nasdaq.
About 20 Bay Area companies
are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that
lists most of the area's high-tech companies. Five local companies were already
bumped off last year, and a sixth -- PlanetRx.com Inc., a former South San
Francisco health care company -- was just delisted.
Nationwide, Nasdaq has either sent notices
or is close to notifying at least 200 other companies, many of whom offered
stocks to the public for the first time last year.
While the delisting doesn't
have to mean the game is over, it relegates companies to the junior and less
reputable leagues of the stock exchange world, where it's much harder to raise
money. For shareholders, a Nasdaq
delisting sounds like a chilling death knoll -- the value of their stock could
all but implode. Some delisted companies, like
Pets.com, simply close their doors.
``The whole Internet market
crashed down, and we're rolling with it,'' says Peter Friedman, CEO of Talk
City Inc., a company that could get kicked off Nasdaq if it doesn't boost its stock price soon.
``The emotion was too much. Things just snapped.''
This round of delistings is the ignominious end to a year of decadence
now coming back to haunt us.
Most of these companies had
no profits, and many had hardly any sales, when investor enthusiasm created a
wave of new stock offerings last year. If a company sold things on the Web --
cars, pet food, you name it -- it was almost guaranteed a spot on the stock
exchange.
But in less than a year, many
of the same investors have abandoned their former darlings. With stock prices
down and the economy slowing, companies are falling short of the standards Nasdaq sets for its some 3,802
companies.
While the listing standards
are arcane, the most obvious cardinal sin in the eyes of Nasdaq's regulators is simple: The fall of a
company's stock price below $1 for 30 consecutive trading days.
When that happens, Nasdaq sends a notice giving the
company 90 calendar days to get the stock price up again. If it fails to do so
-- for 10 consecutive days -- the firm has one last resort: an appeal to Nasdaq.
That involves a trek to Washington, D.C., and a quick
hearing at a room in the St. Regis Hotel, where Nasdaq's three-person panel grills
executives. Unless there's good reason to prolong the struggle, the company's Nasdaq days are over.
Once booted, companies
usually end up in the netherworlds of the stock market, where only a few brave
investors venture.
First, it's the Over The Counter Bulletin Board, which is considerably more risky
and yields lower return to investors. However, even the OTCBB has requirements.
Failing that, the next step
down is the so-called Pink Sheets, named for the color of the paper they used
to be traded on. This exchange doesn't require firms to register with the
Securities and Exchange Commission or even file financial statements.
``They're the wild, wild West,'' says Nasdaq spokesman
Mark Gundersen.
Autoweb.com Inc., a Santa
Clara Internet company that specializes in auto consumer services, has about 40
days left under the 90-day rule, but is busy scrambling to avoid a hearing.
``We're working on strategic
partnerships that will have a major impact on the stock,'' says Nadyne Edison, chief marketing officer for the company. On
Tuesday, Edison was in Detroit, busy opening a
new office near the nation's auto capital. Edison says the firm is
considering moving its headquarters to Detroit to be nearer its
clients.
Other companies that got
delisting notices are trying layoffs. Take Mountain View-based
Network Computing Devices, which provides networking hardware and software to
large companies. Its sales have been pinched as the personal computer industry
slows down, so it has laid off people.
``We've had to downsize,
downsize, downsize,'' says Chief Financial Officer Michael Garner.
Women.com, a San Mateo-based
Internet site devoted to women, has laid off 25
percent of the workforce recently to avoid delisting. Becca
Perata-Rosati, vice president of communications, says
the site isn't being fairly rewarded by Wall Street. The company is the 29th
most heavily visited Web site in the world, she says.
One trick that doesn't seem
to work is the so-called ``reverse stock split,'' which PlanetRx.com tried on
Dec. 1. By converting every eight shares into one, PlanetRx.com hoped each
share price would be boosted eightfold. But the move was seen by investors as a
sign of desperation, and the stock plunged from $1 to 53 cents.
Out of alternatives, PlanetRx didn't even show up for its hearing with Nasdaq. It is now trading on the
OTCBB after a recent move to Memphis and faces an
uncertain future.
At least one executive says
he doesn't mind the prospect of going to the OTCBB.
Talk City's Friedman says
his company is growing, and expects its 9 million in service fee revenue to
double this year. Even if he's forced off the Nasdaq, he has hopes of returning.
``I'd like to stay on the Nasdaq,'' he says. ``If we get
off, we'll build a business. Then we'll go back on.''
Contact Matt Marshall at
mmarshall@sjmercury.com or (408)920-5920.
This is the summary TextAnalyst
produced. The number before each summary sentence is a measure of the “semantic
weight” of the sentence, that is, a measure of how important it is in the document.
That weight is determined by the weight of the concepts in the sentence. (Note
that the second sentence has an incorrectly computed sentence boundary.)
94 About 20 Bay Area companies are
performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area's
high-tech companies.
96 , a company that could get
kicked off Nasdaq if it doesn't boost its stock price
soon.
99 With stock prices down and the economy slowing, companies are falling
short of the standards Nasdaq
sets for its some 3,802 companies.
99 While the listing standards are
arcane, the most obvious cardinal sin in the eyes of Nasdaq's
regulators is simple: The fall of a company's stock price below $1 for 30
consecutive trading days.
96 When that happens, Nasdaq sends a notice giving the company 90 calendar days
to get the stock price up again.
TextAnalyst itself chose the length of the summary, based on the structure of the
semantic network it built. This is a much more principled approach than
expecting the user to determine the length, as other summarizers do. (With those
summarizers, the user is expected to set the number of concepts that
will guide sentence selection, or to give the explicit number of sentences the
summary should have.) When the summarizer itself determines the appropriate
summary length, summaries can be produced automatically; the process does not
require user intervention.
(TextAnalyst does allow the user to change the size of the
summary by changing the semantic weight threshold. Increasing the threshold
decreases the size of the summary.)
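The relationship between sentence weight, threshold, and summary length can be sketched as follows. This is an assumption-laden illustration: the review only says that a sentence's weight is determined by the weights of its concepts, so averaging those weights is a guess made here, and the function names and the default threshold of 90 are hypothetical.

```python
import re

def sentence_weight(sentence, concept_weights):
    """Average weight of the known concepts in the sentence (0 if none).

    Averaging is an assumption; the actual combination rule is not documented.
    """
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    hits = [w for concept, w in concept_weights.items() if concept in words]
    return round(sum(hits) / len(hits)) if hits else 0

def summarize(sentences, concept_weights, threshold=90):
    """Keep, in document order, the sentences whose weight meets the threshold."""
    scored = [(sentence_weight(s, concept_weights), s) for s in sentences]
    return [(w, s) for w, s in scored if w >= threshold]
```

With this shape, no summary length has to be supplied: the number of sentences falls out of the weights, and raising `threshold` can only shorten the result.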
What about the robustness of the
summarization technique? This can be judged by substituting synonyms for
several occurrences of key words in the document to be summarized. If different
summaries result, it suggests that the method lacks robustness.
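A robustness test of this kind can be scripted. The sketch below is a hypothetical harness, not part of TextAnalyst: `substitute` makes a limited number of synonym replacements, and `overlap` compares two summaries as sets of sentences after mapping synonyms back to one canonical word, using Jaccard overlap as an assumed measure of stability.

```python
import re

def substitute(text, old, new, max_count):
    """Replace the first max_count whole-word occurrences of old with new."""
    return re.sub(rf"\b{re.escape(old)}\b", new, text, count=max_count)

def overlap(summary_a, summary_b, synonyms):
    """Jaccard overlap of two sentence lists, treating synonyms as equal.

    synonyms maps each substituted word to its canonical form,
    e.g. {"firm": "company", "firms": "companies"}.
    """
    def normalize(sentence):
        for word, canonical in synonyms.items():
            sentence = re.sub(rf"\b{re.escape(word)}\b", canonical, sentence)
        return sentence

    a = {normalize(s) for s in summary_a}
    b = {normalize(s) for s in summary_b}
    return len(a & b) / len(a | b) if a | b else 1.0
```

A value of 1.0 means the same sentences were selected before and after substitution; lower values indicate that the synonym changes altered the selection.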
For the above document, I substituted the
word “firm” for “company” in several places as shown here.
About 20 Bay
Area companies are performing so badly that they are in danger of being booted
off the Nasdaq, the stock
exchange that lists most of the area's high-tech firms. Five local firms were
already bumped off last year, and a sixth -- PlanetRx.com Inc., a former South San Francisco
health care company -- was just delisted.
This is the summary that resulted.
98 24, 2001, in the San Jose Mercury News 20 area firms face delisting by Nasdaq
99 About 20 Bay Area companies are
performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area's
high-tech firms.
95 , a company that could get
kicked off Nasdaq if it doesn't boost its stock price
soon.
99 With stock prices down and the economy
slowing, companies are falling short of the standards Nasdaq sets for its some 3,802 companies.
99 While the listing standards are
arcane, the most obvious cardinal sin in the eyes of Nasdaq's
regulators is simple: The fall of a company's stock price below $1 for 30
consecutive trading days.
95 When that happens, Nasdaq sends a notice giving the company 90 calendar days
to get the stock price up again.
The summarizer simply added a first
sentence (with an incorrect sentence boundary, however). The other sentences were
the same. This stability in sentence selection shows greater robustness than
the summarizer from CognIT.
When the following further substitutions
were made, the summarizer once again picked the same sentences.
Substitutions:
``The whole Internet market
crashed down, and we're rolling with it,'' says Peter Friedman, CEO of Talk
City Inc., a firm that could get kicked off Nasdaq if it doesn't boost its stock price soon.
…With stock prices down and the economy slowing, firms are falling short of the
standards Nasdaq sets for
its some 3,802 firms.
98 24, 2001, in the San Jose Mercury News 20 area firms face delisting by Nasdaq
99 About 20 Bay Area companies are
performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area's
high-tech firms.
94 , a firm that could get
kicked off Nasdaq if it doesn't boost its stock price
soon.
99 With stock prices down and the economy
slowing, firms are falling short of the standards Nasdaq sets for its some 3,802 firms.
99 While the listing standards are
arcane, the most obvious cardinal sin in the eyes of Nasdaq's
regulators is simple: The fall of a company's stock price below $1 for 30
consecutive trading days.
When that happens,
Nasdaq sends a notice giving
the company 90 calendar days to get the stock price up again.
This robustness speaks very well of the
core technology and the summarization technique TextAnalyst
is using.