Future work

Elaboration of the relational database system

The BSI database system, in its current stage of development, serves to accommodate the set of individual pieces of data recorded from each informant. This arrangement allows the preparation of certain summary statistics, but when it comes to investigating linguistic questions in a general way, the data is difficult to handle. For example, in order to find out about a given research topic, say, l-deletion, one would have to cull all the items with which this phenomenon was tested from the list of the BSI items (as published in Váradi 1998) and build a query expression listing every one of those items. In addition, one would have to bear in mind not only which items relate to which research topic but also what the informants' numerically coded responses actually stand for. In order to look up all the standard variants of a given sociolinguistic variable, one would have to know what number they were coded with for each particular item. (Given that a varying number of alternatives is listed for the different items, and that they were coded in increasing order of likelihood of occurrence, there is no guarantee that the standard forms would have the same numerical value.)

It follows from the above that, in order to facilitate linguistically relevant queries, the current database representation needs to be further elaborated. As a first step, individual items will have to be assigned to the research topic(s) which they are meant to investigate. Next, the responses will have to be classified according to their linguistic relevance. At the least, for those items where this dichotomy applies, the standard variants should be distinguished from the non-standard alternatives. These remarks are offered merely as suggestions of the work that lies ahead in this respect.
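The two steps suggested above can be illustrated with a minimal sketch. It is not the BSI database itself: the table names (item_topic, variant, response), the item numbers, the codes and the sample rows are all hypothetical, and Python's built-in sqlite3 module merely stands in for whatever database system is actually used. The point is that once items are linked to topics and codes are flagged as standard or non-standard, a query can be phrased in terms of the research topic alone.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with hypothetical tables and sample rows."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE item_topic (item_id INTEGER, topic TEXT);
        CREATE TABLE variant (item_id INTEGER, code INTEGER, is_standard INTEGER);
        CREATE TABLE response (informant_id INTEGER, item_id INTEGER, code INTEGER);
    """)
    # Step 1: assign each item to the research topic it investigates.
    con.executemany("INSERT INTO item_topic VALUES (?, ?)",
                    [(101, "l-deletion"), (102, "l-deletion"),
                     (201, "hypercorrection")])
    # Step 2: flag which numeric code is the standard variant for each item.
    # Note that the standard form has code 1 for item 101 but code 3 for 102.
    con.executemany("INSERT INTO variant VALUES (?, ?, ?)",
                    [(101, 1, 1), (101, 2, 0),
                     (102, 1, 0), (102, 2, 0), (102, 3, 1),
                     (201, 1, 1), (201, 2, 0)])
    # Invented responses: (informant, item, numeric code chosen).
    con.executemany("INSERT INTO response VALUES (?, ?, ?)",
                    [(1, 101, 1), (1, 102, 3), (2, 101, 2),
                     (2, 102, 3), (1, 201, 2)])
    return con

def standard_share(con, topic):
    """Return (standard responses, all responses) for one research topic."""
    return con.execute("""
        SELECT COALESCE(SUM(v.is_standard), 0), COUNT(*)
        FROM response r
        JOIN item_topic t ON t.item_id = r.item_id
        JOIN variant v ON v.item_id = r.item_id AND v.code = r.code
        WHERE t.topic = ?
    """, (topic,)).fetchone()

con = build_demo_db()
print(standard_share(con, "l-deletion"))
```

With such tables in place, the user no longer needs to remember which items belong to a topic or which number encodes the standard form for a given item.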

SGML coding

As described in Appendix A, especially in A.6, the transcript follows a rigid format where each line of text contains all the information necessary to identify the informant, the conversational module, the current speaker as well as the number of the line within the module. This reference information is sufficient to trace any single line back to its original context unambiguously when the passages are input to a concordance program, which was the main application envisaged for the transcripts when their format was designed.

In the meantime, the text formatting conventions have proved somewhat limiting on certain points. During the revision of the transcript it proved difficult to insert comments and corrections without disturbing the carefully laid out format. The format also acts as a hindrance when it comes to inserting grammatical annotations. Recall that the transcripts are riddled with codes that contain information about certain phenomena that the BSI project decided to monitor in the conversation modules as well. (See Appendix A for a detailed list.) The guiding principle in devising the codes was that they should identify the research topic that the particular form exemplifies. For example, the code <zuk> is meant to indicate hypercorrect use of the -szuk/-szük suffix. In this sense, the code system used in the transcript already goes some way towards assigning actual variants to the linguistic phenomenon they belong to. Nevertheless, one can easily see the need to elaborate the system of cross-references between data and research topics, which would mean implanting further annotation in the text, something that can be done at the moment only at the expense of upsetting the layout of the transcript. More importantly, the BSI transcription rules represent in-house by-laws that require some effort to understand and to employ, which prevents the data from being portable, i.e. readily interpretable elsewhere.

Since the conception of the BSI project, however, guidelines have been worked out for the standardized encoding of texts of all kinds, including spoken language. These recommendations, known as the Text Encoding Initiative (TEI) guidelines, are the result of several years of international effort by experts from various fields. They are now widely used (despite strong pockets of resistance with an obvious and respectable vested interest) and are quickly assuming the role of an international standard. The system of text annotation that the TEI guidelines use has in fact already been accepted as an international standard: this is SGML (Standard Generalized Markup Language), which is quickly spreading in use worldwide.

SGML provides a simple but very powerful means of structuring the text into its logical components. One important principle is the separation of the logical structure of the content from its layout and formatting, which is seen as a transient and replaceable surface feature. The logical structure of the text is defined in a separate file (called the DTD, Document Type Definition), much in the form of context-free grammar rules. The DTD specifies the main elements of a document and the hierarchical and sequential relationships between them. Each text file must obey the rules defined in the DTD that it belongs to, or else the document is invalid and will not be accepted by validating SGML processing tools.

The markup is inserted in the text in the form of tags, which are enclosed in angle brackets. Tags are usually applied in pairs: one marks the beginning and the other the end of the stretch of text that they refer to. For example,

<author>Aldous Huxley</author>
<title>Brave New World</title>
<genre>novel</genre>

Note the similarity between database fields as discussed in Chapter 3 and the units that occur in the above examples within tags. In fact, SGML files are actually textual database structures capable of representing highly complex hierarchical relationships in a flexible way.
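The parallel with database fields can be made concrete with a small sketch. It reads the three tagged elements shown above into a record of field names and values. Two assumptions should be noted: the fragment is treated as XML (a restricted form of SGML) so that Python's standard xml.etree.ElementTree parser can be used instead of a full SGML parser, and the enclosing <book> element is a hypothetical wrapper added only because the parser requires a single root.

```python
import xml.etree.ElementTree as ET

# The tagged example from the text, unchanged.
FRAGMENT = """<author>Aldous Huxley</author>
<title>Brave New World</title>
<genre>novel</genre>"""

def fragment_to_record(fragment):
    """Wrap the sibling elements in a hypothetical <book> root and
    map each tag name to its text content, like a database record."""
    root = ET.fromstring("<book>" + fragment + "</book>")
    return {child.tag: child.text for child in root}

record = fragment_to_record(FRAGMENT)
print(record["title"])
```

Each tag name plays the role of a field name and the enclosed text that of the field value; nesting elements inside one another would extend the same idea to hierarchical records.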

One apparent drawback of SGML notation is lack of readability: a sufficiently rich SGML file may be so densely riddled with codes as to render the text completely beyond human consumption. However, with appropriate SGML processing software the whole clutter of SGML annotation can be hidden or displayed at will. Also, because the formatting used in the source text has no bearing on the final appearance of the document, the source text can be laid out in any way that reduces the problem. Conversely, the same text can be assigned different layouts serving different purposes.

SGML provides the means and the mechanism of marking up the constituent parts of documents but leaves one free to decide how to apply the rules for a particular type of text. It is the TEI guidelines that contain recommendations as to how to structure a given type of text, what tags to use and how to relate the constituent units.

Adopting the TEI guidelines would bring obvious benefits in terms of portability of data. It would also introduce flexibility in revising and extending the transcript in that the transcript would not be subject to any rigid formatting constraint at all (except those relating to the syntax of the SGML tags themselves). At the same time, the SGML annotation would make it possible to describe the conversations in terms of their natural units, i.e. conversation turns. After all, text lines as units of transcription are arbitrary and artificial expediencies which can now be dispensed with altogether.
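As a rough illustration of what moving from lines to turns might involve, the sketch below merges consecutive lines spoken by the same speaker into a single utterance element. The element name u and its who attribute follow the TEI recommendations for transcribed speech; everything else is assumed. In particular, the record layout (informant, module, speaker, line number, text), the identifiers and the sample utterances are hypothetical and do not reproduce the actual BSI format described in Appendix A.

```python
from itertools import groupby
from xml.sax.saxutils import escape, quoteattr

# Hypothetical line records: (informant, module, speaker, line number, text).
LINES = [
    ("B7101", "M03", "A", 1, "and where did you"),
    ("B7101", "M03", "A", 2, "go to school?"),
    ("B7101", "M03", "B", 3, "in Budapest"),
    ("B7101", "M03", "A", 4, "I see"),
]

def lines_to_turns(lines):
    """Merge consecutive lines by the same speaker into one <u> element each."""
    turns = []
    # groupby only groups adjacent records, which is exactly what a turn is.
    for speaker, group in groupby(lines, key=lambda rec: rec[2]):
        text = " ".join(rec[4] for rec in group)
        turns.append("<u who=%s>%s</u>" % (quoteattr(speaker), escape(text)))
    return turns

for turn in lines_to_turns(LINES):
    print(turn)
```

The line numbers are dropped in this sketch; a real conversion would probably retain them in some form so that existing references to the line-based transcript remain traceable.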

Implementing the database in a Client/Server setting

Converting the BSI data tables and integrating them in textual form together with the transcript in a common hypertext system as described in 5.2 is obviously a great help in providing access to the data. However, it has one crucial limitation. It gives a static picture of the whole of the data collected from a single informant. True, the hypertext navigation tools allow one to zoom in to any part of the data. Notice, however, that they can only take us to pages that are ready-made, prepared beforehand. What is lacking is the facility to make online queries and receive any kind of groupings of the data involving several informants or summary statistics computed on the fly in response to a query.

Therefore, we need to develop the system to accommodate online queries. Fortunately, we can resort to the same technology described in 5.2, except that the hypertext system would function not merely to display static pages of data but also as an online interactive query tool mediating between the user and the data as well as displaying the result. The details of this process are too technical to go into here, but the technique is widely used on the Internet. Consider, for example, how popular Internet search engines (like Excite, Yahoo etc.) are used. One fills in a form and submits it to the system, and the result is displayed in the same browser window. What happens behind the scenes is that the request is forwarded to a program which typically translates it into a database query, passes it to a database server, collects the response data and compiles, on the fly, the HTML pages containing the data received from the database system. This chain of communication is regulated by the so-called Common Gateway Interface (CGI), and the CGI programs mediate between the user, who typically uses an Internet client program (i.e. a browser), and the database system, operated as the server.
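The chain of steps just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the planned BSI implementation: the response table, the form field name item and the sample rows are hypothetical, and an in-memory sqlite3 database stands in for the database server. The function receives the submitted form as a query string, translates it into a database query, collects the rows and compiles an HTML page on the fly.

```python
import sqlite3
from html import escape
from urllib.parse import parse_qs

def make_db():
    """Stand-in for the database server: an in-memory table with sample rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE response (informant_id INTEGER, item_id INTEGER, code INTEGER)")
    con.executemany("INSERT INTO response VALUES (?, ?, ?)",
                    [(1, 101, 1), (2, 101, 2), (3, 101, 1), (1, 102, 3)])
    return con

def handle_request(con, query_string):
    """Turn a submitted form (query string) into SQL and render the result as HTML."""
    form = parse_qs(query_string)
    item = form.get("item", [""])[0]
    if not item.isdecimal():
        return "<p>Please supply a numeric item.</p>"
    # The user's value is passed as a parameter, never pasted into the SQL text.
    rows = con.execute(
        "SELECT code, COUNT(*) FROM response WHERE item_id = ? "
        "GROUP BY code ORDER BY code", (int(item),)).fetchall()
    cells = "".join("<tr><td>%d</td><td>%d</td></tr>" % row for row in rows)
    return ("<h1>Item %s</h1><table><tr><th>code</th><th>informants</th></tr>"
            "%s</table>" % (escape(item), cells))

print(handle_request(make_db(), "item=101"))
```

An actual CGI program would read the query string from the QUERY_STRING environment variable set by the web server and would print a Content-Type: text/html header followed by a blank line before the page itself.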

What remains to be done, then, is to set up the BSI database as a server and to write the CGI programs and the HTML pages that would allow querying the system through an Internet browser. We may use this setup to query data that is available locally (on the same premises or on the same machine, for that matter). Using this technology even in such cases has obvious benefits. The user interface is familiar, intuitive, robust enough and, most importantly, comes free both for the developer and the user. At the same time, it also allows access to the data from any remote corner of cyberspace at no additional programming expense.

Tamás Váradi
12/26/1997