LOGO Applied Informatics - Distributed Systems

Deutsch -- Homepage - Projects - Publications - Courses -- Search - Up

Structured Documents: Models and Algorithms


Contact: Prof. Dr. A. Brüggemann-Klein

Text processing is the raison d'être for many personal computers. In the eighties, the Apple Macintosh gained a reputation as the computer for desktop publishing, and text processing is often the first application the general public encounters in a course on computer literacy. A new market has opened up for specialized magazines that cater to amateurs and professionals engaged in personal publishing, and the trade magazines of the publishing world watch closely the trends and products in document processing.

Despite the economic pull of the market and despite longstanding research efforts at laboratories and universities, only very few fundamentals of document processing have yet been established. In the words of the organizers of the 1992 international workshop on Principles of Document Processing, "document processing can be fairly assessed as being in a state similar to that obtaining for programming languages prior to the development of syntax- and semantics-directed compilation techniques, and that obtaining for databases prior to the development of relational and deductive data models. (...) It is time to try to exploit current techniques and ideas from computer science to raise principles and models of document processing to the same intellectual level as principles and models for programming languages and databases."

One paradigm that has been established so far is generic coding as a method to separate the document text from any meta-information required by a specific processing application, such as formatter code. The document is considered as a structure of logical elements, such as chapters, headings, paragraphs, or emphasized phrases, that are marked up accordingly. There seems to be some agreement that hierarchical, tree-like structures are a valid model for logically marked-up documents.

The international standard~SGML (Standard Generalized Markup Language) is a language to denote hierarchically structured, logically marked-up documents. In fact, SGML does not define the markup tags of logically marked-up documents directly; rather it allows a definition of document grammars that specify which markup tags may appear in what structural relationship in a document that conforms to the grammar. Therefore, SGML is more adequately described as a syntactic metalanguage for the specification of document grammars and conforming documents.

Document grammars make generic coding more flexible and facilitate document interchange. They also invite the application of methods and techniques from compiler construction, programming languages, and formal language theory.

In this project, we have investigated the expressive power of SGML document grammars. In particular, we have developed a linear-time algorithm to decide whether a document grammar is valid SGML. Furthermore, we have given an algorithm to decide whether the language denoted by a more general document grammar can also be denoted by a more restricted SGML document grammar.

Our current research deals with query languages for structured documents. We use regular expressions and "caterpillar" automata that walk the tree structure to specify and evaluate context. We relate our method to other approaches, such as tree pattern matching and classical tree automata. We also research inductive-inference algorithms that learn caterpillar automata from example contexts.

This project is in part a joint project with Professor Wood, UST, Hongkong.