Text in a class of its own

The HCRC Language Technology Group (LTG) is a technology transfer group working in the area of natural language engineering. It works with clients to help them understand and evaluate natural language processing methods and to build language engineering solutions tailored to the clients' specific needs. The LTG does this through consultancy services, joint projects, and training provision.

We describe here an example of the group's work on text classification. This work builds on HCRC's basic research in combined models of rule- and statistics-based language processing. Text routing and text classification are just one area of language engineering in which the LTG builds on HCRC's basic research results to provide working prototypes and demonstrators for clients. Other areas include large-volume text handling, spelling and style checking, and tools and tutorial material for language engineers.

Key points

Text categorisation and text routing both involve taking a text and assigning keywords to it that reflect its content. The applications of categorisation and routing are many and varied. For example, large companies sometimes use a text routing tool to scan incoming telexes and assign a keyword to each, typically the name of the department or person the telex should go to. Messages can then be forwarded to the appropriate departments without human intervention. News agencies use message filters to collect messages from across the world, scan them, and classify them into a number of categories so that relevant messages can be forwarded immediately to the appropriate clients. Medical categorisation tools are used to assign medical keywords to texts written by clinicians, both to allow the compilation of performance statistics for hospitals and to enable the retrieval of relevant texts for research purposes.
Some individuals even have filters on their email which scan incoming mail and sort it into categories like departmental, research, and personal. Automatic and semi-automatic text categorisation is also an important service for secondary publishing enterprises. These are agencies that publish technical abstracts from a wide collection of journals in a particular field, first assigning keywords to the abstracts so that researchers can find publications that may be of interest to them.

In all these applications, the assignment of a category or set of categories is not an arbitrary naming procedure --- the department or client a message is routed to, or the technical keywords assigned to a text, should be relevant to the content or purpose of the text. Correct keyword assignment demands some form of text understanding. At the same time, in many such applications the volume of text to be processed is too high, or the time available to decide where to forward a message is too short, to attempt full text interpretation.

Many techniques have been suggested in the literature for producing text categorisation and routing solutions. At one end of the scale lie simple techniques, such as text matching --- does a fixed string actually appear in the document? At the other end, there are highly sophisticated methods --- statistical models trained on clients' data, which predict the likelihood that a certain category should be assigned to the document. In our experience of categorisation and routing problems, we have found that different information retrieval techniques work well for different problems. Consequently, a highly flexible approach to solving categorisation and routing problems is necessary.

In the Language Technology Group, we have developed a text categorisation system, SISTA. It was designed for secondary publishing enterprises.
The system can be used in a variety of different settings, and can either fit into existing production systems or be used as a stand-alone product by technical abstracters on their home PCs.

Above, we give an example screen of the system in interactive mode. In the top left corner, the system displays the technical abstract the human indexer is currently working on. The bottom left corner gives bibliographical information associated with this abstract. In the top right hand corner, the system displays the keywords it considers appropriate for this abstract. When the indexer clicks on a keyword, the evidence for this keyword is highlighted (underlined) in the text. The indexer can confirm or deselect suggested keywords. For example, `plastic' has already been confirmed by the indexer as a good hit. It disappears from the `suggested' list and moves to the `confirmed' list. Indexers can also add new keywords which they think are appropriate. The `Terms' menu contains all allowable terms for a particular application --- in some applications this can be as many as 30,000. Indexers can also use the thesaurus to find appropriate words more quickly.

The figure to the left shows SISTA in operation. First, the text of the article is processed by empirical NLP techniques to identify likely noun groups and other facts about the language contained in the document (for example, that a pair of words or noun groups appears in the document). Together, these facts, called diagnostic units, constitute the document's representation. In this example, the system has pulled out the pair of words `flow & polymer' as a diagnostic unit. The identified diagnostic units are then compared to a list of correspondences between diagnostic units and document keywords, observed in a training corpus. These correspondences are used to suggest descriptors for the human indexer.
In this case, the empirically derived rule base knows that `flow & polymer' is a diagnostic unit for the keyword `melt flow', and this will be suggested to the human indexer. When the human indexer clicks on that particular keyword, all occurrences of `flow' and `polymer' will be underlined in the text.

Obviously, training is the key to the system's success. The figure to the right indicates how SISTA's models are trained. First, a document representation paradigm, consisting of a large set of `representational units', is chosen. In SISTA the representational units include the individual words which appear in the document, together with frequently occurring pairs of words and noun groups as identified by a robust noun-group parser. All the documents in a large training corpus of pre-classified documents are represented using this scheme, and the contingency table of document representational units against descriptor assignments is collected. From this it is easy to find which representational units are most strongly associated with individual descriptors: simply divide the number of documents in which a representational unit occurs together with a certain descriptor by the number of documents in the training collection in which the representational unit occurs at all. This gives an estimate of the conditional probability that a descriptor should be assigned to a document if a certain representational unit appears in it. The strongest associations for each descriptor are taken and compiled into a set of rules for descriptor assignment, where each rule has the form ``if representational unit x appears in a document, assign descriptor y''.

Our work on text categorisation and routing has obviously been devoted to real applications.
However, the tools we have developed are rooted in HCRC's work on corpus handling and in its research into robust parsing. Most obviously, they depend on our research into hybrid models of language processing, which combine statistical methods with rule-based methods.
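
As a rough illustration of the training and suggestion steps described earlier, the following Python sketch estimates the conditional probability of a descriptor given a representational unit from document counts, keeps the strong associations as rules, and then uses those rules to suggest descriptors together with their evidence. It is not SISTA's implementation. The function names, the thresholds `min_prob` and `min_docs`, and the toy corpus are all hypothetical, and two simplifications are assumed: every pair of words in a document counts as a unit (SISTA uses frequently occurring pairs and parser-identified noun groups), and a fixed probability threshold stands in for selecting the strongest associations per descriptor.

```python
from collections import Counter, defaultdict
from itertools import combinations


def representational_units(text):
    """Words plus unordered word pairs: a simplified stand-in for SISTA's units."""
    words = sorted(set(text.lower().split()))
    # Sorting makes pair names canonical, e.g. "flow & polymer".
    pairs = {f"{a} & {b}" for a, b in combinations(words, 2)}
    return set(words) | pairs


def train_rules(corpus, min_prob=0.6, min_docs=2):
    """Estimate P(descriptor | unit) from document counts; keep strong rules.

    corpus is a list of (text, descriptors) pairs. min_prob and min_docs
    are assumed cut-offs, not values taken from the article.
    """
    unit_docs = Counter()              # documents containing each unit
    joint_docs = defaultdict(Counter)  # unit -> descriptor -> shared documents
    for text, descriptors in corpus:
        for unit in representational_units(text):
            unit_docs[unit] += 1
            for descriptor in descriptors:
                joint_docs[unit][descriptor] += 1

    rules = {}
    for unit, n_unit in unit_docs.items():
        if n_unit < min_docs:
            continue  # too rare to trust the estimate
        for descriptor, n_joint in joint_docs[unit].items():
            prob = n_joint / n_unit  # the conditional probability estimate
            if prob >= min_prob:
                rules[(unit, descriptor)] = prob
    return rules


def suggest(text, rules):
    """Map each suggested descriptor to the units that count as its evidence."""
    units = representational_units(text)
    suggestions = defaultdict(list)
    for (unit, descriptor), prob in rules.items():
        if unit in units:
            suggestions[descriptor].append((unit, prob))
    return dict(suggestions)


# Invented toy data, for illustration only.
toy_corpus = [
    ("polymer melt flow measured", ["melt flow"]),
    ("flow of polymer resin", ["melt flow"]),
    ("water flow in pipes", ["hydraulics"]),
    ("polymer coating hardness", ["coatings"]),
]
rules = train_rules(toy_corpus)
for descriptor, evidence in suggest("the polymer flow rate", rules).items():
    print(descriptor, sorted(evidence))
```

On this toy corpus the pair `flow & polymer' occurs in two documents, both of which carry the descriptor `melt flow', so its estimate is 2/2 = 1.0, whereas `flow' alone occurs in three documents and gives 2/3. The evidence list returned by `suggest` corresponds to what an interface could underline when the indexer clicks on a suggested keyword.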