Computer Science at the University of Helsinki 1994

2.2 Projects

a) General Computer Science

Algorithms and data structures

Algorithms on strings is the main subproject in the area of algorithmics. This work on "stringology" started at the department early, in 1981. The initial impulse came from computer applications in molecular genetics, where there is great demand for efficient and sophisticated string-manipulation algorithms. The group is now one of the world's leaders in its special field and has obtained several internationally respected basic results.

The reputation of the group is based on work done in the 1980s on new algorithms for the following problems: edit distance, approximate string matching, DNA sequence assembly, and the shortest common superstring. In the 1990s, various 'sublinear' approximate string-matching algorithms and new measures of string similarity have been developed. For the important problem of how a text should be preprocessed to speed up subsequent approximate string searches, two solutions have been discovered: one based on so-called 'q-grams' and the other on applying dynamic programming over the suffix tree of the text. For constructing suffix trees (the suffix tree is the most important data structure used in string matching algorithms), a natural 'on-line' algorithm has been developed.
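To fix ideas, here is a minimal C++ sketch of the classical dynamic-programming recurrence for edit distance that underlies this line of work; it is purely illustrative and is not one of the group's 'sublinear' or suffix-tree based algorithms.

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // Classic dynamic-programming edit distance with unit costs:
    // d[i][j] = minimum number of insertions, deletions and
    // substitutions turning a[0..i) into b[0..j).
    int editDistance(const std::string& a, const std::string& b) {
        std::vector<std::vector<int>> d(a.size() + 1,
                                        std::vector<int>(b.size() + 1));
        for (size_t i = 0; i <= a.size(); ++i) d[i][0] = int(i);
        for (size_t j = 0; j <= b.size(); ++j) d[0][j] = int(j);
        for (size_t i = 1; i <= a.size(); ++i)
            for (size_t j = 1; j <= b.size(); ++j)
                d[i][j] = std::min({ d[i-1][j] + 1,        // delete a[i-1]
                                     d[i][j-1] + 1,        // insert b[j-1]
                                     d[i-1][j-1] +         // match/substitute
                                         (a[i-1] == b[j-1] ? 0 : 1) });
        return d[a.size()][b.size()];
    }

    int main() {
        std::cout << editDistance("helsinki", "helsingfors") << "\n"; // 5
    }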

Most recently, the two-dimensional string matching problem has been studied as a new branch of the group's activities. This problem has received much attention in the field. We were able to obtain the first optimal expected-time solutions using simple, practical algorithms. This is in contrast with the many worst-case optimal solutions proposed in the literature, which are very complicated and of little practical value. Another new research direction is the study of special databases for strings. We have obtained initial results on reasoning about strings in databases and on extending the relational database model to allow typical queries on strings.

The ongoing work deals with feature-based schemes for approximate string matching, two- and higher-dimensional string matching, and string databases.

Current members of the group are Prof. Esko Ukkonen (group leader), Doc. Gösta Grahne (part-time), Doc. Jorma Tarhio (currently visiting the University of California, Berkeley), Dr. Niklas Holsti (part-time), Juha Kärkkäinen, Matti Nykänen, and Erkki Sutinen. The group is supported by the Academy of Finland.

Publications: [70, 71, 79-89].

Data Structures is another subproject, in which we have studied problems arising from the use of data structures in a concurrent environment. Efficiency considerations require that the search paths of a data structure be short and that operations not block too large a part of the structure for too long a time. These two requirements conflict, since keeping the paths short after an update can, in a straightforward implementation, block too large a part of the structure. We would prefer that every operation reserve only a small, fixed-size part of the structure at any time.

We started in 1987 by proposing an elegant solution to these problems for dictionary operations. The solution was based on the notion of relaxed balance in AVL and B-trees and on decoupling updates from rebalancing. In 1991 the idea of relaxing the balance condition was applied to red-black trees, the most promising data structure for implementing a database in main memory. Our newest results in this area concern augmenting the structures with further operations, such as batch updating, and tightening the balance conditions.
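The decoupling idea can be illustrated with a deliberately simplified sketch: updates below only attach nodes and mark rebalancing as pending, and a separate maintenance operation restores balance. (For simplicity the sketch rebuilds globally; the relaxed-balance AVL and red-black algorithms instead perform local rotations.)

    #include <iostream>
    #include <memory>
    #include <vector>

    // Toy decoupling of updates from rebalancing: insert() only
    // attaches a node and marks rebalancing as pending; rebalance()
    // is a separate maintenance step. Relaxed-balance trees do the
    // maintenance with local rotations, not the global rebuild here.
    struct Node {
        int key;
        std::unique_ptr<Node> left, right;
        explicit Node(int k) : key(k) {}
    };

    class LazyTree {
        std::unique_ptr<Node> root;
        bool dirty = false;                     // balance may be violated

        static void flatten(const Node* n, std::vector<int>& out) {
            if (!n) return;
            flatten(n->left.get(), out);
            out.push_back(n->key);
            flatten(n->right.get(), out);
        }
        static std::unique_ptr<Node> build(const std::vector<int>& k,
                                           int lo, int hi) {
            if (lo > hi) return nullptr;
            int mid = (lo + hi) / 2;
            auto n = std::make_unique<Node>(k[mid]);
            n->left = build(k, lo, mid - 1);
            n->right = build(k, mid + 1, hi);
            return n;
        }

    public:
        void insert(int key) {                  // fast path, no rebalancing
            std::unique_ptr<Node>* p = &root;
            while (*p) p = (key < (*p)->key) ? &(*p)->left : &(*p)->right;
            *p = std::make_unique<Node>(key);
            dirty = true;
        }
        void rebalance() {                      // decoupled maintenance
            if (!dirty) return;
            std::vector<int> keys;
            flatten(root.get(), keys);
            root = build(keys, 0, int(keys.size()) - 1);
            dirty = false;
        }
    };

    int main() {
        LazyTree t;
        for (int k : {5, 3, 8, 1, 9}) t.insert(k);
        t.rebalance();                          // run whenever convenient
    }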

The research has been done in co-operation with Prof. D. Wood from the University of Western Ontario, London, Canada; J. Eckerle from Universität Freiburg, Germany; the group of Prof. E. Soisalon-Soininen (leader), L. Malmi, E. Nuutila, and K. Pollari-Malmi from Helsinki University of Technology; and Dr. O. Nurmi from the University of Helsinki. The research is partially supported by the Academy of Finland, the Deutsche Forschungsgemeinschaft, the Natural Sciences and Engineering Research Council of Canada, and the Cultural Foundation of Finland.

Publications: [47, 49, 51, 52, 69, 78].

Machine learning

The ability to build computer systems that can learn is one of the central problems of artificial intelligence. Inspired by theoretical advances in the field, such as Valiant's PAC model and Rissanen's MDL principle, the project aims to apply the approach of theoretical computer science and algorithmics to machine learning, in contrast to the traditional AI approach. Our theoretical work is supported by experiments with the algorithms.

Our results include generalizations and analyses of the RANK algorithm of Ehrenfeucht and Haussler, several theoretical results on the power of so-called useful and reliable learnability (a variant of the PAC model), and a new family of learning algorithms satisfying the Occam condition for learning decision lists and certain decision trees. The Occam algorithms are suitable for finding common patterns in sets of strings; we have already applied them to synthesize hyphenation rules for Finnish. A new 'geometric approach' to feature selection in decision tree learning has been developed; in some cases this method gives extremely accurate learning results. Two extensive tools for experimenting with learning algorithms have been produced: the TELA system for studying learning algorithms that use data in the traditional attribute-vector form, and the DALI system for studying learning algorithms that use strings as the basic data.
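To convey the flavour of decision-list learning, here is a minimal greedy learner for decision lists whose tests are single boolean attributes, roughly in the spirit of Rivest's algorithm; the hyphenation-style attributes are invented for the example, and the sketch is not one of the group's Occam algorithms.

    #include <iostream>
    #include <vector>

    // Greedy learner for 1-decision lists (tests on single boolean
    // attributes), in the spirit of Rivest's algorithm: repeatedly
    // pick a test that covers some remaining examples and errs on
    // none, emit the rule, and discard the covered examples.
    struct Example { std::vector<bool> x; bool label; };
    struct Rule { int attr; bool value; bool label; };

    std::vector<Rule> learnDecisionList(std::vector<Example> data) {
        std::vector<Rule> list;
        while (!data.empty()) {
            bool progress = false;
            int nAttr = int(data[0].x.size());
            for (int a = 0; a < nAttr && !progress; ++a)
                for (int v = 0; v <= 1 && !progress; ++v)
                    for (int lab = 0; lab <= 1 && !progress; ++lab) {
                        bool covers = false, consistent = true;
                        for (const Example& e : data)
                            if (e.x[a] == bool(v)) {
                                covers = true;
                                if (e.label != bool(lab)) consistent = false;
                            }
                        if (covers && consistent) {
                            list.push_back({a, bool(v), bool(lab)});
                            std::vector<Example> rest;
                            for (const Example& e : data)
                                if (e.x[a] != bool(v)) rest.push_back(e);
                            data = rest;
                            progress = true;
                        }
                    }
            if (!progress) break;  // not representable as a 1-decision list
        }
        return list;
    }

    int main() {
        // Invented attributes, e.g. x0 = "vowel precedes",
        // x1 = "consonant cluster follows"; label = hyphenate here.
        std::vector<Example> data = {
            {{true,  true }, true }, {{true,  false}, false},
            {{false, true }, false}, {{false, false}, false},
        };
        for (const Rule& r : learnDecisionList(data))
            std::cout << "if x" << r.attr << " == " << r.value
                      << " then " << r.label << "\n";
    }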

The group's current research continues along these themes.

The group has a good international reputation and is, together with nine other European groups, a member of the ESPRIT Working Group 'NeuroCOLT'. The group also works in close co-operation with Prof. Heikki Mannila's data mining group.

The members of the group are Prof. Esko Ukkonen (group leader), Dr. Jyrki Kivinen (currently visiting the University of California at Santa Cruz), Tapio Elomaa, and Jaak Vilo. The group gets funding from the Academy of Finland (Project: 'Machine learning and combinatorial pattern matching - theory, algorithms and applications', 1994-96) and from the European ESPRIT Programme.

Publications: [151-161].

Neural Computing

The neural computing activities at the department were initiated by the project NEULOG, funded from 1988 to 1991 by the Finnish Ministry of Trade and Industry (TEKES). The project developed a prototype of a hybrid neural-symbolic programming environment, consisting of a probabilistic high-level knowledge representation language, NEULA, and a compiler for translating NEULA declarations into stochastic neural networks capable of performing Bayesian reasoning with respect to the original high-level description. Much of this work is summarized in [7]. The system has later been extended with, e.g., a graphical user interface, and by embedding it as a probabilistic reasoning component in a standard Prolog system [147].

The NEULOG project has spawned two successors. Fundamental problems in the theory of neural computation have been studied since 1992 in project NETO, funded by the Academy of Finland. The results so far include, e.g., complexity analyses of various problems related to the Hamming and Hopfield neural network models [54, 57, 60], and a characterization of the computational power of the discrete Hopfield model [62]. A related development is the investigation into the foundations of genetic algorithms, initiated during Patrik Floréen's visit as a postdoc at Utrecht University.
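For readers unfamiliar with the model studied in NETO, the sketch below shows the discrete Hopfield dynamics: units with states +1/-1 are updated asynchronously to the sign of their weighted input, which with symmetric zero-diagonal weights converges to a local minimum of the network energy. The stored pattern and weights are illustrative.

    #include <iostream>
    #include <vector>

    // Discrete Hopfield model: repeatedly update one unit to the
    // sign of its weighted input until no unit changes. The weights
    // store the pattern (+1, -1, +1) via the Hebb rule
    // w[i][j] = p[i]*p[j], with w[i][i] = 0.
    int main() {
        std::vector<int> p = {1, -1, 1};
        int n = int(p.size());
        std::vector<std::vector<int>> w(n, std::vector<int>(n, 0));
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (i != j) w[i][j] = p[i] * p[j];

        std::vector<int> s = {1, 1, 1};      // corrupted version of p
        bool changed = true;
        while (changed) {                    // asynchronous updates
            changed = false;
            for (int i = 0; i < n; ++i) {
                int field = 0;
                for (int j = 0; j < n; ++j) field += w[i][j] * s[j];
                int next = (field >= 0) ? 1 : -1;
                if (next != s[i]) { s[i] = next; changed = true; }
            }
        }
        for (int v : s) std::cout << v << " ";   // recovers 1 -1 1
        std::cout << "\n";
    }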

The knowledge representation and reasoning issues raised in the NEULOG project led in 1992 to the TEKES-funded project REX, studying massively parallel implementations of case-based reasoning techniques. The concrete goal of the project is to implement a generic expert system shell (CAse-Based INferencE Tool, CABINET) for case-based reasoning. The CABINET shell will contain many different computational models for case matching and case adaptation. This allows the user to experiment with various case-based reasoning schemes. The current version of CABINET uses a theoretically justifiable Bayesian method for case matching and case adaptation [23], which can be implemented on a massively parallel feedforward neural network structure [162]. A learning scheme for such networks has also been developed [163].
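Schematically, the retrieval step of a case-based reasoner can be pictured as nearest-neighbour matching in a weighted feature space, as in the sketch below; this is only a generic stand-in, not the Bayesian matching method of [23].

    #include <iostream>
    #include <string>
    #include <vector>

    // Generic case retrieval: return the stored case nearest to the
    // query in a weighted feature space. Feature values, weights and
    // "solutions" are invented; CABINET's actual matching is Bayesian.
    struct Case { std::vector<double> features; std::string solution; };

    const Case* retrieve(const std::vector<Case>& base,
                         const std::vector<double>& query,
                         const std::vector<double>& weight) {
        const Case* best = nullptr;
        double bestDist = 1e300;
        for (const Case& c : base) {
            double d = 0;                       // weighted squared distance
            for (size_t i = 0; i < query.size(); ++i) {
                double diff = c.features[i] - query[i];
                d += weight[i] * diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = &c; }
        }
        return best;
    }

    int main() {
        std::vector<Case> base = {
            {{0.9, 0.1}, "solution A"},
            {{0.2, 0.8}, "solution B"},
        };
        const Case* c = retrieve(base, {0.8, 0.3}, {1.0, 1.0});
        std::cout << c->solution << "\n";       // prints "solution A"
    }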

Current members of the neural computing research group are Ph.D. Patrik Floréen, Mr. Kari Laasonen, Ph.Lic. Petri Myllymäki, Ph.D. Pekka Orponen (project manager for the NETO project), and M.Sc. Henry Tirri (project manager for the REX project). The group has also hosted 1-2 visiting foreign graduate students per year, and it participates in two EC-funded networks of neural computing research (the NeuroCOLT working group and the Network of Excellence in Neural Computing, NEURONET). Besides neural computing, members of the group have also worked on various other topics; pointers to this work are given in the full list of publications.

Publications: [2, 7, 8, 23, 54, 57, 60-62, 143, 147, 162, 163].

Individual research

Scientific visualization (Assoc. Prof. M. Mäkelä): Advanced computer graphics provides means of visualizing huge amounts of scientific data and complicated formal structures and interactions that have earlier been difficult for humans to manage and comprehend. Scientific visualization offers a method for making pictures from the abstract material produced by scientific models and measurements. Thus the natural human ability to see and think visually is utilized alongside traditional scientific reasoning. The most essential problem is to select a proper type of picture for representing the abstract data and to formalize the transformation from data to pictures. The aim of this research is to develop practical recommendations for attacking the problem. Topics under consideration have included particle systems, models of flock and herd behavior, simulation of solid objects, visualization of large 3-D nets, visualization of different types of medical data, and the concept and architecture of the visualization system. As the problem area is highly interdisciplinary, there are contacts with the Faculty of Medicine, Helsinki University of Technology, the Center for Scientific Computing, and the University of Industrial Arts (Helsinki), among others.

b) Computer Software

Open Distributed Processing

The rapid advances in data communication networks have greatly increased the opportunities to make use of the computational and information services available within a computer network. These services may have different origins, be implemented on different types of computational platforms, and be located at sites far apart. As different services become reachable, the interoperability of services must be considered. Distributed computation requires enhanced functionality from the infrastructure. Furthermore, emerging wireless communication technologies open up new kinds of flexibility.

The department started research activity in the area of open distributed processing at the end of the 1980s. In our first project (AHTO) we designed a distributed software environment ("middleware") offering the user a homogeneous interface to the computing services available in a heterogeneous computer network. Since then, the department has participated in the development of the Reference Model for Open Distributed Processing, under the auspices of ISO and ITU-T (formerly CCITT).

At present, two projects are active in this area: DRYAD, started in 1992, and MOWGLI, started in 1993.

The DRYAD project studies the distributed infrastructure needed for the interoperability of services. In this project we have developed an architecture that tolerates heterogeneity. The key concept of the architecture is the abstract service. Clients specify, implicitly or explicitly, the service behaviour they demand, and the infrastructure tries to fulfill these demands with the concrete services available in the network. It activates a service in three steps: 1) it maps the service request to an abstract service description (type management), 2) it selects service providers (trading), and 3) it invokes the service providers to accomplish the task and supports the needed data communication between the client and the service providers (type and object management). To support this architecture, we have implemented a prototype trading service. The trader finds the most suitable service provider available, based on the dynamic state of the clients, the service providers, and the network, and on static knowledge about the clients' view of abstract services.
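A schematic reading of the trading step, with invented service names and a simple load measure standing in for the trader's real, richer selection criteria:

    #include <iostream>
    #include <string>
    #include <vector>

    // Schematic trader: clients ask for an abstract service type;
    // the trader picks, among the registered providers of that type,
    // the one with the best current state (here: lowest load).
    struct Provider { std::string serviceType; std::string address; double load; };

    class Trader {
        std::vector<Provider> registry;
    public:
        void registerProvider(const Provider& p) { registry.push_back(p); }

        // Trading step: select the most suitable provider for a type.
        const Provider* select(const std::string& serviceType) const {
            const Provider* best = nullptr;
            for (const Provider& p : registry)
                if (p.serviceType == serviceType &&
                    (!best || p.load < best->load))
                    best = &p;
            return best;
        }
    };

    int main() {
        Trader t;
        t.registerProvider({"print", "host-a:515", 0.7});
        t.registerProvider({"print", "host-b:515", 0.2});
        if (const Provider* p = t.select("print"))
            std::cout << "invoke " << p->address << "\n";  // host-b:515
    }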

The members of the DRYAD project group are Prof. Martti Tienari, Dr. Timo Alanko, Lea Kutvonen, Petri Kutvonen, and Liisa Marttinen. The work is partially funded by the Technology Development Centre TEKES, ICL Personal Systems, Telecom Finland, Helsinki Telephone Company, and the Finnish State Computing Centre.

The goal of the MOWGLI project is to study, design, and test a data communication architecture for a pan-European GSM-based mobile data service, and to develop a prototype based on that architecture. The environment of an application consists of mobile PCs, which can be connected, through a mobile phone, to any part of a fixed data communication network.

The work in the project concentrates on architectural aspects that support the mobility of the client, allow parts of the application to operate in disconnected mode, and hide the problems of the wireless connection.

The members of the MOWGLI project group are Prof. Martti Tienari, Dr. Timo Alanko, Heimo Laamanen, and Markku Kojo. The industrial partners and financial supporters of the project are Digital Equipment Corporation, Nokia Mobile Phones, and Telecom Finland. The project is also partially financed by the Ministry of Education.

Publications: [13, 20-22, 40, 41].

Modelling of concurrency

Distributed systems and parallel programs are notoriously prone to errors caused by subtle differences in the relative execution orderings of the components of the system. To avoid such errors, systems should be carefully specified and analyzed. International research in this area, often called simply "concurrency", was active throughout the 1980s and has resulted in numerous specification methods and models for this purpose, e.g. state transition systems, Petri nets, process algebras, and temporal logics. Elegant and rich theories supporting these models have made them even more useful. Results from these theories can be employed in computer-based analysis tools for checking and verifying real-life concurrent software.

An important class of distributed algorithms consists of computer communication protocols. We have worked on problems of protocol analysis since 1984. We began by constructing a reachability analysis tool, PROTAN88 [19], designed for protocols specified with an extended state transition model (the ESTELLE specification language). In the 1980s we used this tool to analyze several well-known real-life protocols such as X.25, FTAM, and X.411/P1. Some of our improvement suggestions were included in the FTAM state tables published by CCITT and ISO.
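At the core of such a tool is an exhaustive search of the protocol's state space. The sketch below explores a small invented transition system breadth-first and flags states without outgoing transitions as potential deadlocks; real protocol states are structured records rather than plain integers.

    #include <iostream>
    #include <map>
    #include <queue>
    #include <set>
    #include <vector>

    // Core of reachability analysis: enumerate all states reachable
    // from the initial state of a finite transition system and report
    // states with no outgoing transitions (potential deadlocks).
    int main() {
        std::map<int, std::vector<int>> next = {
            {0, {1, 2}}, {1, {0}}, {2, {3}}, {3, {}},  // state 3 deadlocks
        };
        std::set<int> seen = {0};
        std::queue<int> work;
        work.push(0);
        while (!work.empty()) {
            int s = work.front(); work.pop();
            if (next[s].empty())
                std::cout << "deadlock in state " << s << "\n";
            for (int t : next[s])
                if (seen.insert(t).second) work.push(t);
        }
        std::cout << seen.size() << " states explored\n";  // prints 4
    }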

Lately our work has been more conceptual and theoretical. Various equivalence concepts have been studied in process algebra settings (Milner's CCS and ISO LOTOS), and the equivalence-preserving minimization of labeled state transition systems has been studied [90-92, 97, 98]. We have been especially interested in divergence-preserving behavioural equivalences and preorders. These equivalences allow neat comparisons between bisimulation-based and failures-based equivalences and congruences. If an equivalence does not preserve divergences, it is difficult to analyze liveness properties after equivalence-preserving transformations of a transition system. Many aspects of temporal logic are closely related to equivalences and preorders [93, 94].

A recurrent theme in our research is making analysis and verification feasible for concurrent systems of realistic size. That is why the dominant goal in our research is to understand and support compositional (or modular) specification and verification of concurrent systems. We have, for example, analyzed and applied the weakest equivalence (CFFD equivalence) that preserves deadlocks and nexttime-less linear temporal formulae. Our research employs case analyses (often of computer communication protocols) to ensure the practical usefulness of the theoretical concepts.

At the University of Helsinki we have been developing a computer-based verification tool, BIDMIN, that minimizes labeled transition systems with respect to divergence-preserving bisimilarity and branching bisimilarity, as well as the related congruences. We also often use a LOTOS-oriented verification tool, ARA TOOLS, developed at the Technical Research Centre of Finland; ARA TOOLS is a prototype aiming at industrial quality.
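The engine of such minimizers is partition refinement. The naive sketch below computes strong bisimilarity classes by repeatedly splitting blocks according to the states' (action, target block) signatures; the divergence-preserving and branching variants handled by BIDMIN refine the same idea. The example transition system is invented.

    #include <iostream>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    // Naive partition refinement for strong bisimilarity: start with
    // all states in one block and keep splitting blocks until every
    // state in a block has the same set of (action, target block) moves.
    struct Tr { int from; char act; int to; };

    int main() {
        std::vector<Tr> trans = {
            {0,'a',1}, {1,'b',1}, {2,'a',3}, {3,'b',3}, {4,'a',4},
        };
        const int n = 5;
        std::vector<int> block(n, 0);               // one initial block
        for (bool changed = true; changed; ) {
            changed = false;
            std::vector<std::set<std::pair<char,int>>> sig(n);
            for (const Tr& t : trans)
                sig[t.from].insert({t.act, block[t.to]});
            std::map<std::pair<int, std::set<std::pair<char,int>>>, int> renum;
            std::vector<int> nb(n);
            for (int s = 0; s < n; ++s) {
                auto key = std::make_pair(block[s], sig[s]);
                auto it = renum.find(key);
                if (it == renum.end())
                    it = renum.insert({key, int(renum.size())}).first;
                nb[s] = it->second;
            }
            if (nb != block) { block = nb; changed = true; }
        }
        for (int s = 0; s < n; ++s)                 // 0 and 2 end up together
            std::cout << "state " << s << " in block " << block[s] << "\n";
    }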

The concurrency research team is led by Prof. Martti Tienari. The team comprises a postdoctoral research fellow, Dr. Jaana Eloranta, and three graduate research students: Ph.Lic. Roope Kaivola (currently at the University of Edinburgh), Ph.Lic. Timo Karvi, and M.Sc. Päivi Kuuppelomäki. As an associate member of the team we have Prof. Antti Valmari (Tampere University of Technology, Software Systems Laboratory), who has an external docent appointment in our department. Prof. Valmari is the chief designer of ARA TOOLS. The concurrency group participates (1994-96) in the European COST 247 project "Verification and Validation Methods for Formal Descriptions".

Publications: [18, 19, 90-98].

Performance analysis

The general goal of the research group is to develop tools and techniques for the modelling, measurement, and analysis of computing systems, especially using empirically oriented methods. The main research areas have been simulation methods and interactive solution environments for queueing networks. New areas of concern are the performance of data communication systems and distributed systems; of particular interest are the techniques used in telecommunications and the effects of workstation mobility in a network.
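As an example from the simulation side, the sketch below simulates an M/M/1 queue, the basic building block of queueing-network models, and compares the observed mean sojourn time with the analytic value 1/(mu - lambda); the rates chosen are arbitrary.

    #include <algorithm>
    #include <iostream>
    #include <random>

    // Tiny simulation of an M/M/1 queue (Poisson arrivals, exponential
    // service, one FIFO server) via the Lindley-style recursion below.
    int main() {
        const double lambda = 0.5, mu = 1.0;   // arrival and service rates
        std::mt19937 rng(42);
        std::exponential_distribution<double> interarrival(lambda), service(mu);

        double clock = 0, serverFreeAt = 0, totalSojourn = 0;
        const int n = 200000;
        for (int i = 0; i < n; ++i) {
            clock += interarrival(rng);              // next arrival time
            double start = std::max(clock, serverFreeAt);
            serverFreeAt = start + service(rng);     // departure time
            totalSojourn += serverFreeAt - clock;    // time in system
        }
        std::cout << "simulated mean sojourn " << totalSojourn / n
                  << ", analytic " << 1.0 / (mu - lambda) << "\n";
    }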

Researchers: Dr. Kimmo Raatikainen, Dr. Teemu Kerola, Dr. Timo Alanko, Dr. Inkeri Verkamo.

Publications: [42-46].

Individual research

Language design and implementation for multiple paradigms (Ph.Lic. Juha Vihavainen): Multiparadigm programming integrates different programming perspectives so that concepts and mechanisms from alternative paradigms can be used together. The main paradigms (procedural, object-oriented, functional, and logic) are seen as complementary rather than conflicting computational models. The constructive goals are to develop a general, uniform implementation model for multiple paradigms and to produce software (a class library) implementing this model. The implementation approach is based on a meta-programming methodology in which a reflective object-oriented architecture is extended by redefining and specializing meta-objects. The software addresses traditional translation issues (such as scanning, parsing, semantic analysis, and code generation) as well as special problems related to class hierarchies (inheritance), dynamic and static polymorphism (generics), and higher-order values. The implementation is done in C++.

Publications: [24].

Correctable computing (Dr. Niklas Holsti): A correctable computation reacts to changed inputs by updating its outputs to match. As an example, we have developed user interface tools that work in a correctable way: an interactive transcript editor lets the user edit any input and cooperates with the application program to undo and redo the computation and update the output. Reversible parsing and interpretation methods have been developed and a number of applications implemented. Future plans include the design of protocols for propagating corrections between programs.
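A minimal sketch of the idea, with a toy 'computation' (a running sum) standing in for a real application: editing an earlier input triggers recomputation of the output, and an edit history makes the correction undoable. A genuine transcript editor would re-run only the affected part of the computation.

    #include <iostream>
    #include <vector>

    // Toy correctable computation: outputs are recomputed when an
    // input is edited, and an edit history allows undo.
    struct Session {
        std::vector<double> inputs;
        std::vector<std::vector<double>> history;   // snapshots for undo

        double output() const {                     // the "computation"
            double sum = 0;
            for (double x : inputs) sum += x;
            return sum;
        }
        void correct(size_t i, double value) {      // edit an earlier input
            history.push_back(inputs);
            inputs[i] = value;
        }
        void undo() {
            if (!history.empty()) { inputs = history.back(); history.pop_back(); }
        }
    };

    int main() {
        Session s;
        s.inputs = {1, 2, 3};
        std::cout << s.output() << "\n";   // 6
        s.correct(1, 10);                  // user edits the second input
        std::cout << s.output() << "\n";   // 14
        s.undo();
        std::cout << s.output() << "\n";   // 6 again
    }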

Publications: [139, 140].

c) Information Systems

Design-By-Example: A tool for database design (DBE)

Relational database design has been studied thoroughly. Unfortunately, using the design theory in practical database design requires a good knowledge of the theory. Besides being hard to learn, the design methods suffer from their batch-oriented nature: the design data (typically integrity constraints) is given to a design algorithm that produces a schema with the required properties, and if the resulting schema is not satisfactory (a sign of omissions or errors in the design data), the process must be repeated in its entirety. The standard design theory does not lend itself easily to incremental database design.
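The kind of machinery such a tool must hide can be illustrated with the standard attribute-closure algorithm of design theory, which computes the attributes determined by a given set under a set of functional dependencies and thereby decides, e.g., whether the set is a key; the attribute names are invented.

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Closure X+ of an attribute set X under functional dependencies:
    // repeatedly apply every FD whose left-hand side is contained in
    // the set until nothing new can be added.
    struct FD { std::set<std::string> lhs, rhs; };

    std::set<std::string> closure(std::set<std::string> x,
                                  const std::vector<FD>& fds) {
        bool grew = true;
        while (grew) {
            grew = false;
            for (const FD& fd : fds) {
                bool applies = true;
                for (const std::string& a : fd.lhs)
                    if (!x.count(a)) { applies = false; break; }
                if (applies)
                    for (const std::string& a : fd.rhs)
                        if (x.insert(a).second) grew = true;
            }
        }
        return x;
    }

    int main() {
        std::vector<FD> fds = { {{"emp"}, {"dept"}}, {{"dept"}, {"manager"}} };
        for (const std::string& a : closure({"emp"}, fds))
            std::cout << a << " ";        // dept emp manager
        std::cout << "\n";
    }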

This research project contains strongly interacting theoretical and systems work. The project is constructing a database design tool called Design-By-Example (DBE). The tool will make it easy to design good schemas; the design theory will be hidden from the users of the tool. Theoretical research on database design has been carried out at the Universities of Helsinki and Tampere during the last five years, and the work is still continuing. Many of the theoretical research problems have arisen from the practical work. Several properties set DBE apart from existing tools.

A prototype implementation of the relational level was completed in 1988. Currently the full system is being implemented for the Macintosh, in C, at the University of Tampere. It will support the design of SQL schemas and provide a data dictionary in the form of an SQL database. Besides the Macintosh, the system will be ported to the OS/2 environment. The implementation work continues to provide new problems for theoretical study.

The principal researchers of the project are Prof. Heikki Mannila and Prof. Kari-Jouko Räihä (University of Tampere). The theoretical work has been funded by the Academy of Finland. The implementation work is supported by the Technology Development Centre (TEKES). The industrial partners of the project include Nokia Data Systems, Oracle Finland, and United Paper Mills.

Publications: [108, 110, 112-115].

Structured text databases (RATI)

A structured text database system is a tool for managing text documents which have some kind of internal structure and in which dependencies between different parts of the document occur. Examples of such documents are manuals for software and equipment, encyclopedias, and dictionaries. Structured text databases attempt to combine the advantages of text processing systems and database systems.

The RATI project studies methods for manipulation of structured documents. The project has devised a data model and a query language for such documents. The data model is based on the use of context-free grammars for defining the structures of documents.

The project has implemented a prototype database system in which this data model and query language are in use. The user manipulates documents in the familiar way, with a text editor. The user interface supports multiple representations: for each document type, the user or document designer can define several appearances, and the document can be edited in any of them. A representation, called a view, is defined by annotating the grammar that defines the structure of the document. One possible representation is, e.g., an SGML document.

The prototype is implemented in C using the X Window System. It uses either a relational database system or the Unix file system for the storage of the documents.

The prototype is being extended with hypertext features and with support for very large volumes of static text. Research topics related to static texts are computer-aided tools for extracting grammars from existing texts, and the incrementalization of queries and updates of structured text.

Other ongoing research includes developing high-level query languages for structured text, increasing the power of the view definition mechanism, and studying the applicability of object-oriented data models to the manipulation of documents.

The principal researcher of the project is Prof. Heikki Mannila. M.Sc. Helena Ahonen, Dr. Pekka Kilpeläinen, Ph.Lic. Greger Lindén, and Ph.Lic. Erja Nikunen (Research Center of Domestic Languages) are working in the project. The project has been funded by TEKES and by the Academy of Finland.

Publications: [72-74, 111, 117, 123, 150, 164].

Data mining

Data mining (or knowledge discovery in databases) aims at semiautomatic tools for extracting interesting information from knowledge bases. Typical application areas are the analysis of customer databases and error logs.

We have developed and applied data mining methods in several application areas. The research on data mining began in connection with the DBE project, with the development of tools for finding integrity constraints that hold in a database instance. The group has also given a complete analysis of the complexity of this problem. In connection with the RATI project, we have also considered finding document structures from marked-up documents.

The group has also studied the power of sampling in knowledge discovery and given the first general bounds on the sample sizes needed for the reliable identification of useful rules from samples. Since late 1993, the focus of the group has been on the discovery and analysis of association rules. We are looking for efficient methods for finding interconnected elements in ordinary and time-dependent data.
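The flavour of association-rule discovery can be conveyed by an exhaustive toy sketch over invented market-basket data: find item pairs with sufficient support and report the rules among them with high confidence. Practical miners prune the candidate space level by level rather than counting all pairs.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Toy association-rule miner: count supports of items and item
    // pairs exhaustively, then print rules A => B with enough support
    // and confidence. Thresholds and baskets are invented.
    int main() {
        std::vector<std::set<std::string>> baskets = {
            {"beer", "chips"}, {"beer", "chips", "milk"},
            {"milk", "bread"}, {"beer", "chips", "bread"},
        };
        const double minSupport = 0.5, minConfidence = 0.8;

        std::map<std::string, int> single;
        std::map<std::pair<std::string, std::string>, int> pairs;
        for (const auto& b : baskets) {
            for (const auto& x : b) ++single[x];
            for (const auto& x : b)
                for (const auto& y : b)
                    if (x < y) ++pairs[{x, y}];
        }
        int n = int(baskets.size());
        for (const auto& [items, cnt] : pairs) {
            if (double(cnt) / n < minSupport) continue;       // support check
            for (const auto& [a, b] : {std::pair(items.first, items.second),
                                       std::pair(items.second, items.first)}) {
                double conf = double(cnt) / single[a];        // conf(A => B)
                if (conf >= minConfidence)
                    std::cout << a << " => " << b
                              << " (confidence " << conf << ")\n";
            }
        }
    }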

Data mining has strong connections to machine learning, and the project works in close cooperation with Prof. Ukkonen's machine learning group. The group has good international connections.

The members of the group are Prof. Heikki Mannila (group leader), Prof. Kari-Jouko Räihä (University of Tampere), M.Sc. Helena Ahonen, M.Sc. Pirjo Ronkainen, M.Sc. Hannu Toivonen, and Dr. Inkeri Verkamo.

Publications: [112, 114, 136-138, 150].

Software engineering for knowledge-based systems (VITAL)

There is a growing need to make the building of knowledge-based systems (KBS) more robust, structured, and rigorous if the technology is to fulfill its promise. VITAL is an ESPRIT II (P5365) research and development project with nine participating organizations in five European countries. The project aims at building an industrial-strength software engineering methodology, together with a software workbench (the VITAL Tower), for building large-scale KBS. The VITAL approach is based on the KADS methodology, which advocates handling the complexity of KBS development by using multiple models. Unlike regular software engineering projects, KBS projects emphasize explicit knowledge representation and manipulation during application execution.

VITAL has centered its main development around the notion of process products. Besides these process products, the VITAL methodology has three further components.

The members of the VITAL group at our department have focused on knowledge transformation issues. In this context a general-purpose transformation toolkit, ALCHEMIST, has been developed. The toolkit allows rapid development of transformation software by providing graphical tools for specifying transformations and facilities for generating code (C++). The software is being used to build transformations needed within the VITAL workbench as well as transformations to external software. The tool is not restricted to the VITAL context; it can be used to build any well-defined transformation between persistent representations. In addition to the transformation work, the project group has also participated in the development of the software engineering life-cycle model.

The principal investigators of the group are Prof. Heikki Mannila (group leader), Ph.Lic. Greger Lindén, M.Sc. Henry Tirri, and Dr. Inkeri Verkamo. The Finnish participation in the project is funded by TEKES.

Publications: [39].

Computer Supported Cooperative Work (CSCW)

Work in this area is still in its early stages. During recent years we have participated in two COST 14 working groups. One group studied how group knowledge building could be supported by various forms of information technology [141]; knowledge is here interpreted widely, meaning practically any representation that helps a cooperative group perform better than its members would alone. The other group has been developing an architectural framework supporting CSCW systems.

Other activities in the CSCW area include various student projects in which experiments with semantically advanced communication based on e-mail have been carried out. A newer point of interest is workflow management, where some experiments have been performed using Lotus Notes. We plan to continue by also studying conceptually how organizational workflow computing should be supported. One part of this work concerns the required transaction mechanisms.

Researchers: Dr. Hannu Erkiö, M.Sc. Henry Tirri.

Publications: [141].

Transaction management

Concepts such as the transaction and the serializable history were introduced when only centralized database systems existed. As distributed systems developed, these concepts were extended to them. However, distributed systems differ from centralized ones in two important respects: the relatively large communication delays between sites and the autonomy of the local systems. As a result, the traditional concepts of transaction and serializable history, though well suited to traditional centralized systems, may not be suitable in distributed systems. Moreover, as database concepts have been applied to new areas such as computer-aided design, office information systems, and cooperative work, refined and generalized transaction models are needed.
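For reference, the classical test for conflict serializability builds the precedence graph of a history (an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj on the same item) and checks it for cycles; the three-operation history in the sketch is a standard non-serializable example.

    #include <functional>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Conflict-serializability test: build the precedence graph of a
    // history (edge Ti -> Tj for conflicting operations on the same
    // item, Ti's first) and look for a cycle. Acyclic = serializable.
    struct Op { int txn; char kind; char item; };   // kind 'r' or 'w'

    bool hasCycle(int n, const std::vector<std::pair<int,int>>& edges) {
        std::vector<std::vector<int>> adj(n);
        for (auto [u, v] : edges) adj[u].push_back(v);
        std::vector<int> state(n, 0);               // 0 new, 1 active, 2 done
        std::function<bool(int)> dfs = [&](int u) {
            state[u] = 1;
            for (int v : adj[u]) {
                if (state[v] == 1) return true;     // back edge: cycle
                if (state[v] == 0 && dfs(v)) return true;
            }
            state[u] = 2;
            return false;
        };
        for (int u = 0; u < n; ++u)
            if (state[u] == 0 && dfs(u)) return true;
        return false;
    }

    int main() {
        // History r1(x) w2(x) w1(x): T1 -> T2 and T2 -> T1, a cycle.
        std::vector<Op> h = { {0,'r','x'}, {1,'w','x'}, {0,'w','x'} };
        std::vector<std::pair<int,int>> edges;
        for (size_t i = 0; i < h.size(); ++i)
            for (size_t j = i + 1; j < h.size(); ++j)
                if (h[i].item == h[j].item && h[i].txn != h[j].txn &&
                    (h[i].kind == 'w' || h[j].kind == 'w'))
                    edges.push_back({h[i].txn, h[j].txn});
        std::cout << (hasCycle(2, edges) ? "not serializable\n"
                                         : "serializable\n");
    }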

In this research area our department has developed new transaction models, together with protocols for transaction management in heterogeneous distributed database systems. One approach exploits information about distributed consistency constraints; it has resulted in the introduction of the notions of "partial serializability" and "combined transactions". The transaction research at our department is partly performed in cooperation with INRIA (France) and Purdue University in the United States; this cooperation has resulted in the notions of "value dates" and "fragmented composite objects". The principal researchers in the transaction field are Ph.Lic. Juha Puustjärvi, M.Sc. Henry Tirri, and Dr. Heikki Tuuri.

From the beginning of 1994, our department anticipates involvement in a related ESPRIT III Basic Research Action (TRANSCOOP). The other participants in this project are the Technical Research Centre of Finland (VTT), GMD IPSI (Germany), and the University of Twente (Netherlands). The goal of the project is to develop a transaction model suitable for cooperative work, together with mechanisms for controlling cooperative transactions.

Publications: [127, 128, 133-135].

Logic databases and database algorithms

The general goal of this research is to develop efficient data structures and algorithms for database and knowledge base systems. The main concern is the design and analysis of recursive query processing methods for logic databases. The problem of recursion is inherent in knowledge base systems, and recursively defined queries are needed by many new database applications, e.g. for querying complex database objects that are composed of subobjects to an arbitrary number of levels.

New algorithms have been designed for the evaluation and optimization of recursive Datalog queries. A prototype of an experimental logic database system has been designed and implemented. The system runs in the Sun/Unix environment and has been used, e.g., for analyzing different iteration strategies for bottom-up query evaluation and for testing main-memory database structures.
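The idea of bottom-up evaluation is easiest to see on the classic transitive-closure program path(X,Y) :- edge(X,Y); path(X,Y) :- path(X,Z), edge(Z,Y). The naive fixpoint iteration below re-applies the recursive rule until no new facts appear; a semi-naive strategy would join only with the newest facts.

    #include <iostream>
    #include <set>
    #include <utility>

    // Naive bottom-up evaluation of the recursive Datalog program
    //   path(X,Y) :- edge(X,Y).
    //   path(X,Y) :- path(X,Z), edge(Z,Y).
    int main() {
        std::set<std::pair<int,int>> edge = {{1,2}, {2,3}, {3,4}};
        std::set<std::pair<int,int>> path = edge;    // first rule
        bool changed = true;
        while (changed) {                            // fixpoint iteration
            std::set<std::pair<int,int>> fresh;
            for (auto [x, z] : path)                 // path(X,Z)
                for (auto [z2, y] : edge)            // edge(Z,Y)
                    if (z == z2 && !path.count({x, y}))
                        fresh.insert({x, y});        // derive path(X,Y)
            path.insert(fresh.begin(), fresh.end());
            changed = !fresh.empty();
        }
        std::cout << path.size() << " path facts\n"; // prints 6
    }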

The members of the research group are Assoc. Prof. Seppo Sippu, Prof. Eljas Soisalon-Soininen (Helsinki University of Technology), Assoc. Prof. Otto Nurmi, and M.Sc. Juhani Kuittinen.

Publications: [49, 122, 124, 129-132].

Knowledge representation and reasoning (Doc. Gösta Grahne)

As soon as data becomes more complex than a collection of records, knowledge representation and reasoning issues must be addressed. These issues require solving semantic, algorithmic, and complexity-theoretic subproblems. The research has focused on problems stemming from incomplete information in relational databases [107, 119, 120], on recursion [122], on the relationships between updates to logical theories and counterfactual reasoning [116, 144] and between updates and belief revision [145], and on database updates and generalized Datalog [146]. Current research involves reasoning about strings (such as DNA sequences) in databases [121].

Publications: [107, 116, 119-122, 144-146].