Interim Review of TextAnalyst from MicroSystems, Ltd

Colleen E. Crangle

for Tecomac, Inc.

April 5, 2001

 

Recommendation

TextAnalyst's approach to summarization is the standard one; that is, the summary of a document consists of sentences extracted from the document. However, at the core of TextAnalyst is a method for building a semantic network to represent the content of a document. This core technology can be used for far more than summarization. It provides a form of concept-based indexing, and it allows automatic document clustering and categorization.

 

TextAnalyst's core technology is remarkably similar to CognIT's, but it appears to have a sounder scientific basis. The company also provides a software development kit (SDK), which permits rapid integration of the TextAnalyst functions into Tecomac's technology.

 

Based on this interim review, and the preliminary April 3, 2001 review by Drs. Kaufmann and Arkadov, I strongly recommend that Tecomac investigate the company further and pursue a collaboration.

 

Further Details on the Core Technology

TextAnalyst performs the following steps in creating a semantic network to represent the content of a document:

1. It identifies the significant words or phrases in the document. These are called text concepts. As far as I can tell, it measures significance mainly by frequency of occurrence.

2. It determines the relative importance of a text concept by analyzing its connections to other concepts in the text. As far as I can tell, it mainly uses co-occurrence in sentences to measure the strength of the link between two text concepts. That is, if two words occur in the same sentence, the strength of the link between those two words is increased. So if two high-frequency words are strongly linked, for example, their relative importance is high. Or if every occurrence of a low-frequency word is in a sentence with a high-frequency word, its relative importance is increased. And so on.

3. It adjusts the strengths of the links based on the information gained in step 2 (using neural nets, but I'm not sure of the details or the exact technique used).

4. It creates a semantic network consisting of the linked text concepts, with the adjusted link strengths computed in step 3. Each node in the network has two numbers associated with it. The first number gives the weight, or strength, of the semantic relationship of the word at that node to the word at its parent node. The second number gives the semantic weight of the word relative to the entire text. Assuming these numbers accurately represent something in the document, the network is a rich source of information for further processing.
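
As a rough illustration of steps 1 and 2 only, here is a minimal Python sketch of a frequency-and-co-occurrence network. It is not TextAnalyst's implementation: the function names, the min_freq cutoff, the naive sentence splitting, and the scaling to 100 are all assumptions made for illustration, and the neural-net adjustment of step 3 is left out because its details are not known to me.

```python
import re
from collections import Counter
from itertools import combinations

def build_network(text, min_freq=2):
    """Return (concept_freq, link_weight) for a text; purely illustrative."""
    # Naive sentence split and tokenization (assumptions, not TextAnalyst's method).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

    # Step 1: treat words that occur at least min_freq times as text concepts.
    freq = Counter(w for words in tokenized for w in words)
    concepts = {w for w, n in freq.items() if n >= min_freq}

    # Step 2: strengthen a link each time two concepts share a sentence.
    links = Counter()
    for words in tokenized:
        present = sorted(set(words) & concepts)
        for a, b in combinations(present, 2):
            links[(a, b)] += 1

    return {w: freq[w] for w in concepts}, dict(links)

def relative_importance(freq, links):
    """Frequency plus total link strength, scaled so the top concept scores 100."""
    raw = dict(freq)
    for (a, b), weight in links.items():
        raw[a] += weight
        raw[b] += weight
    top = max(raw.values(), default=1)
    return {w: round(100 * v / top) for w, v in raw.items()}
```

A real system would also need a stop-word filter and multi-word term detection; both are omitted here to keep the sketch short.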

 

The value of this approach to document analysis is that it allows you to build a network that implicitly has semantic content, without using background knowledge of the subject. It is based on the assumption that the text itself contains a semantic model of its content, and that you get at that model by analyzing:

•     the choice of words in the text, and

•     the words' co-occurrences within sentences.

 

As far as I can tell, TextAnalyst simply treats a sentence as a list of words. It does not do any syntactic or other processing on the sentence that would allow a main subject, for instance, to be identified. In this, it has less power than CognIT's approach, which, I believe, does some analysis of semantic or functional roles. This difference may account for the better summaries CognIT may produce.

 

Nonetheless, even if TextAnalyst does not now do anything more than treat a sentence as a list of words, there is plenty of scope for doing more. The only constraint is speed. There is no shortage of techniques to mine the further information encoded in a natural-language sentence.

 

The Summarizer in TextAnalyst

Sentences are selected for the summary based on concepts that are extracted from the text and the relationships between those concepts. The semantic network constructed for the document contains these concepts and relations.

 

I performed several test runs of the TextAnalyst summarizer.

 

Here are some limitations I noted in the performance of the summarizer:

•     The program does not yet do an adequate job of detecting the end of a sentence. For instance, it thought the following sentence ended after "Talk City Inc.":

``The whole Internet market crashed down, and we're rolling with it,'' says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn't boost its stock price

CognIT has similarly spotty performance.

•     TextAnalyst does not use a thesaurus to catch the relationships between words such as "firm" and "company" that are synonymous in the context of the document. CognIT has a similar shortcoming. This omission can reduce the robustness of the summarizer and, in fact, of all semantic processing, such as topic identification. Fortunately, there is room even in the current architecture for including a thesaurus. TextAnalyst makes use of a language-dependent dictionary, which can be accessed via the VocEdit application, a dictionary program that TextAnalyst uses in certain circumstances.

•     The program sometimes, but not always, correctly identifies multi-word terms such as "stock price". CognIT has similarly spotty performance.
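
To make the sentence-boundary limitation concrete, here is a minimal Python sketch of a splitter that avoids the two failures seen in my tests (a break after "Inc.," and a break after "Jan."). It is only one possible heuristic, not what TextAnalyst or CognIT does: the ABBREVIATIONS list, the BOUNDARY pattern, and the function name are assumptions for illustration.

```python
import re

# Hypothetical, deliberately short abbreviation list; a real system would need a longer one.
ABBREVIATIONS = {"inc.", "corp.", "co.", "ltd.", "dr.", "mr.", "ms.", "jan.", "dec."}

# Candidate boundary: terminal punctuation, optional closing quotes, whitespace,
# then something that looks like a sentence start (capital letter or opening quote).
BOUNDARY = re.compile(r"[.!?]['\"]*(\s+)(?=[A-Z`\"'])")

def split_sentences(text):
    """Split text into sentences, skipping boundaries that follow a known abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.start(1)  # cut just before the whitespace
        tokens = text[start:end].split()
        last_token = tokens[-1].lower() if tokens else ""
        if last_token in ABBREVIATIONS:
            continue  # e.g. "Dr. Kaufmann": abbreviation followed by a capitalized word
        sentences.append(text[start:end].strip())
        start = match.end(1)
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

The trade-off is that a sentence that genuinely ends in a listed abbreviation is merged with the sentence that follows it, so the list has to be chosen with the document type in mind.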

 

To discuss the summarizer in more detail, here are several test runs on a newspaper article that I have been using for comparison testing on other summarizers.

 

Published Wednesday, Jan. 24, 2001, in the San Jose Mercury News

20 area firms face delisting by Nasdaq

BY MATT MARSHALL

Mercury News

 

It's the company version of the pink slip in the mail -- get your act together, or you're fired from Nasdaq.

About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area's high-tech companies. Five local companies were already bumped off last year, and a sixth -- PlanetRx.com Inc., a former South San Francisco health care company -- was just delisted.

Nationwide, Nasdaq has either sent notices or is close to notifying at least 200 other companies, many of whom offered stocks to the public for the first time last year.

While the delisting doesn't have to mean the game is over, it relegates companies to the junior and less reputable leagues of the stock exchange world, where it's much harder to raise money. For shareholders, a Nasdaq delisting sounds like a chilling death knoll -- the value of their stock could all but implode. Some delisted companies, like Pets.com, simply close their doors.

``The whole Internet market crashed down, and we're rolling with it,'' says Peter Friedman, CEO of Talk City Inc., a company that could get kicked off Nasdaq if it doesn't boost its stock price soon. ``The emotion was too much. Things just snapped.''

This round of delistings is the ignominious end to a year of decadence now coming back to haunt us.

Most of these companies had no profits, and many had hardly any sales, when investor enthusiasm created a wave of new stock offerings last year. If a company sold things on the Web -- cars, pet food, you name it -- it was almost guaranteed a spot on the stock exchange.

But in less than a year, many of the same investors have abandoned their former darlings. With stock prices down and the economy slowing, companies are falling short of the standards Nasdaq sets for its some 3,802 companies.

While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq's regulators is simple: The fall of a company's stock price below $1 for 30 consecutive trading days.

When that happens, Nasdaq sends a notice giving the company 90 calendar days to get the stock price up again. If it fails to do so -- for 10 consecutive days -- the firm has one last resort: an appeal to Nasdaq.

That involves a trek to Washington, D.C., and a quick hearing at a room in the St. Regis Hotel, where Nasdaq's three-person panel grills executives. Unless there's good reason to prolong the struggle, the company's Nasdaq days are over.

Once booted, companies usually end up in the netherworlds of the stock market, where only a few brave investors venture.

First, it's the Over The Counter Bulletin Board, which is considerably more risky and yields lower return to investors. However, even the OTCBB has requirements.

Failing that, the next step down is the so-called Pink Sheets, named for the color of the paper they used to be traded on. This exchange doesn't require firms to register with the Securities and Exchange Commission or even file financial statements.

``They're the wild, wild West,'' says Nasdaq spokesman Mark Gundersen.

Autoweb.com Inc., a Santa Clara Internet company that specializes in auto consumer services, has about 40 days left under the 90-day rule, but is busy scrambling to avoid a hearing.

``We're working on strategic partnerships that will have a major impact on the stock,'' says Nadyne Edison, chief marketing officer for the company. On Tuesday, Edison was in Detroit, busy opening a new office near the nation's auto capital. Edison says the firm is considering moving its headquarters to Detroit to be nearer its clients.

Other companies that got delisting notices are trying layoffs. Take Mountain View-based Network Computing Devices, which provides networking hardware and software to large companies. Its sales have been pinched as the personal computer industry slows down, so it has laid off people.

``We've had to downsize, downsize, downsize,'' says Chief Financial Officer Michael Garner.

Women.com, a San Mateo-based Internet site devoted to women, has laid off 25 percent of the workforce recently to avoid delisting. Becca Perata-Rosati, vice president of communications, says the site isn't being fairly rewarded by Wall Street. The company is the 29th most heavily visited Web site in the world, she says.

One trick that doesn't seem to work is the so-called ``reverse stock split,'' which PlanetRx.com tried on Dec. 1. By converting every eight shares into one, PlanetRx.com hoped each share price would be boosted eightfold. But the move was seen by investors as a sign of desperation, and the stock plunged from $1 to 53 cents.

Out of alternatives, PlanetRx didn't even show up for its hearing with Nasdaq. It is now trading on the OTCBB after a recent move to Memphis and faces an uncertain future.

At least one executive says he doesn't mind the prospect of going to the OTCBB.

Talk City's Friedman says his company is growing, and expects its 9 million in service fee revenue to double this year. Even if he's forced off the Nasdaq, he has hopes of returning.

``I'd like to stay on the Nasdaq,'' he says. ``If we get off, we'll build a business. Then we'll go back on.''

 

Contact Matt Marshall at mmarshall@sjmercury.com or (408)920-5920.

 

This is the summary TextAnalyst produced. The number before each summary sentence is a measure of the semantic weight of the sentence, that is, a measure of how important it is in the document. That weight is determined by the weight of the concepts in the sentence. (Note that the second sentence has an incorrectly computed sentence boundary.)

 

94 About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area's high-tech companies.

96 , a company that could get kicked off Nasdaq if it doesn't boost its stock price soon.

99 With stock prices down and the economy slowing, companies are falling short of the standards Nasdaq sets for its some 3,802 companies.

99 While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq's regulators is simple: The fall of a company's stock price below $1 for 30 consecutive trading days.

96 When that happens, Nasdaq sends a notice giving the company 90 calendar days to get the stock price up again.

 

TextAnalyst chose the length of the summary, based on the structure of the semantic network it built. This is a much more principled approach to determining summary length than expecting the user to determine the length, as other summarizers do. (The user is expected to set the number of concepts that will guide sentence selection, or to give the explicit number of sentences the summary should have.) When the summarizer itself determines the appropriate summary length, summaries can be automatically produced; the process does not require user intervention.

 

(TextAnalyst does allow the user to change the size of the summary by changing the semantic weight threshold. By increasing the semantic threshold you can decrease the size of the summary.)
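
The effect of the threshold can be sketched in a few lines of Python. I do not know how TextAnalyst combines concept weights into a sentence weight, so the mean used here is an assumption, as are the function names, the default threshold of 90, and the concept weights passed in.

```python
import re

def score_sentences(sentences, concept_weights):
    """Score each sentence by the mean weight of the concepts it contains (0 if none)."""
    scored = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        weights = [w for concept, w in concept_weights.items() if concept in words]
        score = sum(weights) / len(weights) if weights else 0
        scored.append((round(score), sentence))
    return scored

def summarize(sentences, concept_weights, threshold=90):
    """Keep sentences, in document order, whose score meets the threshold."""
    scored = score_sentences(sentences, concept_weights)
    return [(score, s) for score, s in scored if score >= threshold]
```

Because selection is a simple cutoff on the score, no summary length has to be supplied, and raising the threshold can only shrink the summary.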

 

What about the robustness of the summarization technique? This can be judged by substituting synonyms for several occurrences of key words in the document to be summarized. If different summaries result, it suggests that the method lacks robustness.
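
This kind of check can be expressed as a small Python sketch. The summarize argument stands for whatever summarizer is under test and is assumed to return the positions of the sentences it selects; that interface, the function names, and the use of Jaccard overlap as the measure are my assumptions for illustration, and toy_summarize is a deliberately fragile stand-in, not any vendor's method.

```python
import re
from collections import Counter

def substitute(sentences, old, new, positions):
    """Replace whole-word occurrences of `old` with `new` in the given sentence positions."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return [pattern.sub(new, s) if i in positions else s for i, s in enumerate(sentences)]

def selection_overlap(summarize, original, modified):
    """Jaccard overlap between the sentence positions chosen before and after substitution."""
    before = set(summarize(original))
    after = set(summarize(modified))
    if not before and not after:
        return 1.0
    return len(before & after) / len(before | after)

def toy_summarize(sentences):
    """Stand-in summarizer: pick sentences containing the single most frequent word."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    top_word, _ = Counter(w for words in tokenized for w in words).most_common(1)[0]
    return [i for i, words in enumerate(tokenized) if top_word in words]
```

An overlap of 1.0 means the same sentences were selected after the substitution; lower values indicate that the selection shifted with the wording.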

 

For the above document, I substituted the word "firm" for "company" in several places, as shown here.

About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area's high-tech firms. Five local firms were already bumped off last year, and a sixth -- PlanetRx.com Inc., a former South San Francisco health care company -- was just delisted.

This is the summary that resulted.

98 24, 2001, in the San Jose Mercury News 20 area firms face delisting by Nasdaq

99 About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area's high-tech firms.

95 , a company that could get kicked off Nasdaq if it doesn't boost its stock price soon.

99 With stock prices down and the economy slowing, companies are falling short of the standards Nasdaq sets for its some 3,802 companies.

99 While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq's regulators is simple: The fall of a company's stock price below $1 for 30 consecutive trading days.

95 When that happens, Nasdaq sends a notice giving the company 90 calendar days to get the stock price up again.

 

The summarizer simply added a first sentence (with an incorrect sentence boundary, however). The other sentences were the same. This stability in sentence selection shows greater robustness than the summarizer from CognIT.

 

When the following further substitutions were made, the summarizer once again picked the same sentences.

Substitutions:

``The whole Internet market crashed down, and we're rolling with it,'' says Peter Friedman, CEO of Talk City Inc., a firm that could get kicked off Nasdaq if it doesn't boost its stock price soon. With stock prices down and the economy slowing, firms are falling short of the standards Nasdaq sets for its some 3,802 firms.

 

Summary:

98 24, 2001, in the San Jose Mercury News 20 area firms face delisting by Nasdaq

99 About 20 Bay Area companies are performing so badly that they are in danger of being booted off the Nasdaq, the stock exchange that lists most of the area's high-tech firms.

94 , a firm that could get kicked off Nasdaq if it doesn't boost its stock price soon.

99 With stock prices down and the economy slowing, firms are falling short of the standards Nasdaq sets for its some 3,802 firms.

99 While the listing standards are arcane, the most obvious cardinal sin in the eyes of Nasdaq's regulators is simple: The fall of a company's stock price below $1 for 30 consecutive trading days.

When that happens, Nasdaq sends a notice giving the company 90 calendar days to get the stock price up again.

 

This robustness speaks very well of the core technology and the summarization technique TextAnalyst is using.

 
