Kurt Majcen, JOANNEUM RESEARCH, Graz
The central LEAF system comprises a repository with appropriate linking and access mechanisms, harvesting mechanisms to fill the repository, a web user interface that serves as the public entrance to the system, and facilities for converting records between formats. All of these components are hosted on a PC running Linux.
The main functions of the system available to the different user groups (public users, data providing institutions and their staff members, service providers like research projects and memory institutions not providing data to LEAF) via web user interfaces are:
- Search for person descriptions
- Search for linked person descriptions
- Annotate records
- Search the annotations
- Access to a user workspace
- User registration
- Search for users
- Login / logoff
All retrieved search results receive, behind the scenes, the status of a Central Name Authority Record, which allows the identification of records that were relevant to users.
In addition, numerous processes are necessary to keep the data in the system up to date. Data are regularly harvested from local data providers, the harvested data are converted into one common exchange format (EAC XML), and the EAC records are inserted into the LEAF database, where they are then linked. Inserting a data provider's record into the database can mean one of three things: insert, update or delete. Updating a record requires a few additional actions, for the reason mentioned above. These actions ensure consistency between the modified data and other information in the system, such as users' annotations or the IDs of records that were extracted from the system via the web user interface.
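The three import outcomes can be illustrated with a small in-memory sketch. It is not the LEAF implementation, which relies on PL/SQL stored procedures; the `Repository` class, its fields and the idea of flagging annotated records for review after an update are assumptions made only for illustration. The source does not describe what happens to annotations on delete, so the sketch leaves them untouched.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Repository:
    """Toy in-memory stand-in for the LEAF database (hypothetical)."""
    records: dict = field(default_factory=dict)      # record id -> EAC XML text
    annotations: dict = field(default_factory=dict)  # record id -> list of user notes
    needs_review: set = field(default_factory=set)   # ids of updated, annotated records

    def apply(self, action: str, record_id: str, eac_xml: Optional[str] = None) -> None:
        """Apply one harvested record as an insert, update or delete."""
        if action == "insert":
            self.records[record_id] = eac_xml
        elif action == "update":
            self.records[record_id] = eac_xml
            # Assumption: an updated record that carries annotations is flagged
            # so that the annotations can be checked against the new data.
            if self.annotations.get(record_id):
                self.needs_review.add(record_id)
        elif action == "delete":
            # Assumption: only the record itself is removed.
            self.records.pop(record_id, None)
        else:
            raise ValueError(f"unknown action: {action}")


repo = Repository()
repo.apply("insert", "p1", "<eac/>")
repo.annotations["p1"] = ["see also the manuscript collection"]
repo.apply("update", "p1", "<eac version='2'/>")
```

After the update, `repo.needs_review` contains `"p1"`, which marks the point where consistency checks of the kind described above would run.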
Several technologies are used for the production of the central LEAF system:
- The repository and its mechanisms for importing the harvested data make use of an ORACLE RDBMS and a range of PL/SQL stored procedures.
- The JRun application server acts as the web server that hosts the LEAF system web site.
- On top of the application server, the Cocoon XML publishing framework controls the flow between the various requested URIs and also provides conversions via style sheets for output of the web pages. Working with Cocoon mainly involves technologies such as Java, XML and XSL.
For communication between the modules of the LEAF system and the systems of the LEAF data providers and of potential future participants (i.e. the harvesting processes), several communication protocols (FTP, OAI, Z39.50) are used.
FTP is used by several partners in the LEAF consortium to provide their data to the central LEAF system. The data are transferred either via a local FTP server at the provider site (the central system pulls the data from there) or via the FTP server at the central site (the provider pushes its data there for further processing).
The OAI protocol for metadata harvesting allows data providers to offer their data via an HTTP interface. The central LEAF system queries an OAI server at the provider's site to request bulk transfer of data to the central server. This mechanism also makes it possible to receive only those records that are new or have been updated at the local sites.
The Z39.50 protocol for search and retrieval is used during users' searches for records. Results from such searches are added to the overall result from the central system, and the records that users visit are stored in the central database, which allows later updates to keep the central repository up to date.
An XML interface gives external systems access to query LEAF for person descriptions and their details.
The software layers and components used in the central LEAF system are illustrated in figure 2.
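As a minimal sketch of such an incremental request: OAI-PMH defines a `ListRecords` verb with a required `metadataPrefix` argument and an optional `from` date for selective harvesting. The base URL, the helper name `list_records_url` and the choice of the `oai_dc` prefix below are assumptions for illustration; the source does not state which metadata format or endpoints LEAF providers expose.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper).

    When from_date is given, the provider is asked only for records
    created or changed on or after that date.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


# Hypothetical provider endpoint; a full harvest omits from_date.
full_harvest = list_records_url("http://provider.example/oai", "oai_dc")
incremental = list_records_url("http://provider.example/oai", "oai_dc", "2003-01-01")
```

A real harvester would also have to follow the resumption tokens that OAI-PMH servers return for large result sets, which this sketch leaves out.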
Figure 2: Software layers & components of the central LEAF system
- The JRun application server accepts the incoming HTTP requests.
- The Cocoon XML publishing framework controls the flow between the various requested URIs and also provides conversions via style sheets for output of the web pages.
- Interactions with the LEAF database and external sites are done via the business logic including actions (Java classes) which are used by the Cocoon configuration.
- The configuration utilises a variety of different file types as listed in the following table.
| Configuration elements | File types |
|---|---|
| Actions | Java classes |
| Business logic | Java classes |
| Dynamic pages | XSP |
| Form validations | XML |
| Images | GIF, JPG |
| Sitemaps | XML |
| Static pages | XHTML |
| Styles | CSS |
| Transformations | XSL |
| Translations | XML |
Table 1: File types used in the Cocoon configuration
The Cocoon framework, an MVC implementation, provides several basic mechanisms for processing web requests and XML formatted inputs (shown in figure 3, taken from http://xml.apache.org/cocoon/userdocs/concepts/index.html):
- Dispatching LEAF requests based on Matchers
- Generation of XML documents (from content, logic, relational DB, objects or any combination) through Generators
- Transformation of XML documents into other XML documents through Transformers
- Aggregation of XML documents through Aggregators
- Rendering XML through Serializers
Figure 3: Major components of the Cocoon pipeline
Figure 4 (taken from http://xml.apache.org/cocoon/userdocs/concepts/index.html) shows the interaction between the client's web browser and Cocoon.
- The web browser sends an HTTP request regarding LEAF to the Cocoon servlet
- The servlet forwards the request to the sitemap
- The sitemap selects a pipeline to be used and initiates the execution of that pipeline
- The pipeline is executed as defined to produce the HTTP response
- The HTTP response from LEAF is returned to the client's web browser
Figure 4: Interaction with a Cocoon based site
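The dispatch step, in which the sitemap selects a pipeline for a requested URI, can be sketched as a first-match lookup. The patterns and pipeline names below are hypothetical and are not taken from the LEAF sitemap, and Python's `fnmatch` is used only as a stand-in for Cocoon's matchers, whose wildcard rules differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical sitemap: ordered (URI pattern, pipeline name) pairs.
SITEMAP = [
    ("search/*", "search-pipeline"),
    ("annotations/*", "annotation-pipeline"),
    ("*", "static-page-pipeline"),  # catch-all entry, checked last
]


def select_pipeline(uri, sitemap=SITEMAP):
    """Return the name of the first pipeline whose pattern matches the URI."""
    for pattern, pipeline in sitemap:
        if fnmatchcase(uri, pattern):
            return pipeline
    return None  # no entry matched


chosen = select_pipeline("search/person")
```

Because the entries are checked in order, the catch-all pattern has to come last; otherwise it would shadow the more specific ones.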
The general structure of a Cocoon pipeline can be seen in figure 5 (also taken from http://xml.apache.org/cocoon/userdocs/concepts/index.html):
- The request is handled by a Generator, which produces SAX events for an XML document (e.g. the simplest generator, the default generator, reads an XML file from the file system; a large range of other generators exists, and these can be extended if necessary)
- The SAX events from the Generator are processed by a sequence of zero or more Transformers, which transform them into other, nearly arbitrary XML structures (many transformers exist as well, such as XSLT using XSL style sheets, i18n for internationalisation, or filter for limiting the number of results returned)
- A Serializer concludes the pipeline, producing binary or character streams from the SAX events for final client consumption (e.g. the simplest serializer, the XML serializer, produces an XML document from the SAX events, and the default serializer produces HTML for direct consumption in the client browser)
Figure 5: Pipeline principles
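The generator, transformer and serializer stages can be mimicked in a few lines. This is a simplified sketch and not Cocoon code: it passes a parsed element tree between stages instead of streaming SAX events, and the function names and the `limit_results` transformer (loosely modelled on the filter transformer mentioned above) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def generate(xml_text):
    """Generator stage: turn XML input into an in-memory tree."""
    return ET.fromstring(xml_text)


def limit_results(max_items):
    """Build a transformer that keeps only the first max_items child elements."""
    def transform(root):
        for extra in list(root)[max_items:]:
            root.remove(extra)
        return root
    return transform


def serialize_xml(root):
    """Serializer stage: produce a character stream (here, a string) of XML."""
    return ET.tostring(root, encoding="unicode")


def run_pipeline(xml_text, transformers, serializer=serialize_xml):
    """Run generator -> zero or more transformers -> serializer."""
    tree = generate(xml_text)
    for transformer in transformers:
        tree = transformer(tree)
    return serializer(tree)


source = "<results><r>a</r><r>b</r><r>c</r></results>"
limited = run_pipeline(source, [limit_results(2)])
```

Passing an empty transformer list corresponds to the "zero transformers" case and returns the input document unchanged.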