Next: A revised system Up: From Cards to Computer Previous: Transcribing the tape recorded

Subsections

Structure of output data
Data entry

Relational database systems

A relational database system consists of a set of data tables. Each table is a set of data items that share the same structure. A table is composed of an arbitrary number of records. A record is an elementary cluster of information relating to a single entity, typically containing a set of attributes called fields. Each entity (record) within a table must be characterised by the same attributes, which is just another way of saying that records must have the same structure. A relational database table is best conceptualised as a two-dimensional table in which records are the rows and fields are the columns. One can easily see that the number and sequence of columns must be exactly specified (otherwise we could not draw the table at all), whereas the number of records can vary freely (i.e. the table can be arbitrarily long[*]).

Let's see an example. Obviously, we'd like to store information about our informants. We want to record their personal details such as name, address, age, sex etc. Already here the question arises of how to group this information. With names, for example, should there be two fields, one for family names and one for Christian names? And how should addresses be broken up? What one must bear in mind in such cases is that it is harder to access information from within a field than to access the contents of the whole field. Therefore, whatever is likely to be the target of a lookup, either on its own or in combination with other items of information, is best put in a separate field. So far, we may have the following scheme.
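The principle of one lookup target per field can be sketched in a few lines of Python; the field names and sample values below are illustrative assumptions, not the project's actual scheme:

```python
# Each record is a row with the same fixed set of fields (columns).
# Surname and first name sit in separate fields, so either can be the
# target of a lookup without parsing a combined name string.
INFORMANTS = [
    {"surname": "Kovacs", "first_name": "Anna",  "age": 34, "sex": "F"},
    {"surname": "Szabo",  "first_name": "Peter", "age": 51, "sex": "M"},
    {"surname": "Kovacs", "first_name": "Bela",  "age": 28, "sex": "M"},
]

def lookup(table, **criteria):
    """Return all records whose fields match the given criteria."""
    return [rec for rec in table
            if all(rec[field] == value for field, value in criteria.items())]

print(lookup(INFORMANTS, surname="Kovacs"))           # two matches
print(lookup(INFORMANTS, surname="Kovacs", sex="M"))  # one match
```

Had surname and first name been stored in a single field, the second query would have required string parsing rather than a simple whole-field comparison.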


  
Figure 3.1: Details of informants divided into fields

The table in Figure 3.1 could be sufficient in itself, but most of the time we are dealing with several tables which are related to each other. For example, associated with the answers would be the informant. Obviously, it would be hugely redundant to store the information on the informants in the table in which we keep the answers. (There are 250 informants, but up to 160 000 individual responses.) Instead, if we just stored the name of the informant next to each response, we could use that as a pointer to look up further details of the person from the INFORMANTS table. More efficient than using the surnames, or even the combination of surname and first name, would be to set up an ID field in the INFORMANTS table and use that as a key into the ANSWERS table. Note that in the INFORMANTS table there cannot be two records with the same ID, whereas in the ANSWERS table the answers given by the same informant will be placed in separate records with the INF_ID field containing the same ID. Indeed, that is how 'relationships' are expressed between tables: through the identical content of fields whose sole function may be to act as a link between tables. This is illustrated by Figure 3.2. The nature of the relationship may or may not be explicitly indicated in the field name or its content.
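The linking mechanism can be illustrated with a small Python sketch using SQLite; the schema below is a simplified assumption for demonstration (the real system was implemented in dBASE, not SQL):

```python
import sqlite3

# Two linked tables: PRIMARY KEY enforces that no two informant
# records share the same ID, while many answer records may carry it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE informants (inf_id TEXT PRIMARY KEY, surname TEXT)")
con.execute("CREATE TABLE answers (inf_id TEXT, card_no INTEGER, response INTEGER)")
con.execute("INSERT INTO informants VALUES ('B7211', 'Kovacs')")
con.executemany("INSERT INTO answers VALUES (?, ?, ?)",
                [("B7211", 1701, 1), ("B7211", 1802, 1)])

# The 'relationship' is nothing more than identical content in the
# inf_id fields: a join matches them up to recover informant details.
rows = con.execute("""SELECT a.card_no, a.response, i.surname
                      FROM answers a JOIN informants i ON a.inf_id = i.inf_id
                      ORDER BY a.card_no""").fetchall()
print(rows)  # [(1701, 1, 'Kovacs'), (1802, 1, 'Kovacs')]
```

The surname is stored once, yet every answer record can reach it through the shared INF_ID value.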


  
Figure 3.2: Linking between two tables through INF_ID field

This much should be enough to convey the gist of how relational database systems are structured. Of course, these are merely ground rules enabling us to arrange our data in the required way, and far from adequate for designing an EFFICIENT system. In fact, these few rules are all that the relational database model imposes on the data. They leave practically limitless scope for arranging the data one is dealing with in any way that one finds suitable or relevant.

It should be emphasized that the key to success in constructing an efficient database system lies in the careful modelling of the relationships inherent in the data. This should be the first step and one that deserves meticulous analysis. It is basically pencil-and-paper work, yielding an abstract scheme independent of any software considerations.

The next step is to implement the data model on the computer. This typically means turning to a general purpose database management software package and using its programming language and facilities to develop a software system.

Structure of output data

Let's see in some detail a first attempt at arranging the output of the card-based part of the interviews[*].

Table 3.1 shows a sample of an answer file of production data.


  

  
Table 3.1: The contents of an answer file for production type of data
INF_ID Card No Counter R_Tr R_Ch1 R_Ch2 Remark
B7211VL1.DBF
B7211 1701 1a1932 1
B7211 1701   3
B7211 1802   1
B7211 1903   2     I
B7211 2004   1
B7211 2105   2
B7211 2206   2
B7211 2307   2
B7211 2408   2
B7211 2509   1

Table 3.2: The structure of answer files for production type of data
Field name Description
Inf_ID The ID of the informant
Card_No The ID number of the card used to elicit the answer
Counter The tape counter reading[*]
R_Tr Informant response noted down by the transcriber
R_Ch1 Informant response noted down by the first checker
R_Ch2 Informant response noted down by the second checker
Remark Flag to indicate any remark (stored in a separate file)


The idea behind setting up the table in this way is to ensure that any particular record will reveal the essential points of who said what in response to which card. Note that the informant ID is carried in every record, apparently a huge redundancy. After all, one could argue, if we put the answers from the same informant in a file named B7211, we would not have to set up a field for this. However, this would mean that once we removed an item from the file, we would have no way of identifying the informant.
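A minimal sketch of the point, with illustrative field names: two per-informant "files" (here simply lists) are pooled, and it is the INF_ID carried in every record that keeps the records identifiable:

```python
# Two per-informant answer "files". Because each record carries its
# informant ID, the records remain identifiable even after pooling;
# without that field, the identity would live only in the filename.
file_b7211 = [{"inf_id": "B7211", "card_no": 1701, "response": 1}]
file_b7303 = [{"inf_id": "B7303", "card_no": 1701, "response": 3}]

pooled = file_b7211 + file_b7303

# Group the pooled records back by informant using the carried ID.
by_informant = {}
for rec in pooled:
    by_informant.setdefault(rec["inf_id"], []).append(rec)

print(sorted(by_informant))  # ['B7211', 'B7303']
```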

Each informant's answers were stored in separate files. Moreover, for each informant the answers given to the 26 different modules were again each stored in different files. This soon led to an explosion in the number of files generated as transcription work gathered momentum. It meant that the total set of results would have resided in 250 x 26 = 6500 files, plus the 250 text files. With hindsight, it was unnecessary to store the data in so many files. It put a substantial burden on the file storage system but, more importantly, it would have made data access across different files extremely difficult.

This proliferation of files, however, was merely a nuisance, as the files could be collated into a single one without much difficulty.

Unfortunately, the structure of the answer files differed according to the type of modules they came from. Production data, judgement data and the answers to the staple remover test were processed differently.

One obvious difference between the production and the judgement data lies in the way the answers are coded (letters ``a'' for identical and ``k'' for different, as against numbers for the production data) as well as in the reference system of the cards. As individual items in the judgement task were elicited not on separate cards but on a sheet which contained 20-21 items, the card number referred to the whole sheet, and an auxiliary field (Item No) was devised to establish unique reference to individual items.

Accordingly, the judgement data were accommodated in the following type of tables:


 

 
Table 3.3: Sample answer file for judgement data
Inf_ID Card No Item No Counter Response Remark
B7301 9100 1 1a2610 a  
B7301 9100 2   k  
B7301 9100 3   a  
B7301 9100 4   k  
B7301 9100 5   a  
B7301 9100 6   a  
B7301 9100 7 1b0116 k I
B7301 9100 8   a  
B7301 9100 9 1b0151 k I
B7301 9100 10   k  

As it turned out later, the card numbering system was more seriously flawed.

First, card numbers were not unique. As pointed out in 2.1, the same card served to elicit a number of linguistic variables. Originally, the idea was to use consecutive numerical codes for the responses so that answer codes 1-3 covered language problem a, codes 4-5 language problem b, etc. To add to the problems, it was decided that a single-digit number would be sufficient to record alternatives. This may have proved adequate for a single language variable but not when there were several variables all consecutively numbered. So it happened that occasionally the principle of consecutive numbering had to be broken anyway.

Secondly, and this was indeed the most serious flaw, the slow and fast reading data were assigned the same card number reference within the files. Again, this stems from an attempt to map the physical data closely onto its representation in the database. As the same card was used for both the slow and the fast readings, the data was assigned the same card number. True, the readings were put in different files, and the filename did reflect what module the data came from. However, this information was only good as long as the data was used in terms of files rather than individual records. As soon as one wanted to pool the answers together, one would have been left with no means of distinguishing the fast and slow reading data.
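One way the problem could have been avoided is sketched below: when records are pooled, the file-level information is copied into an explicit field of each record (the field name "tempo" is a hypothetical choice, not the project's):

```python
# The same card number appears in both the slow- and fast-reading
# files; once pooled, only an explicit field can tell the records apart.
slow_file = [{"card_no": 10801, "response": 1}]
fast_file = [{"card_no": 10801, "response": 2}]

pooled = []
for tempo, records in (("slow", slow_file), ("fast", fast_file)):
    for rec in records:
        # Carry the file-level information into every record.
        pooled.append({**rec, "tempo": tempo})

# Without the added field the two records would be indistinguishable:
print([(r["card_no"], r["tempo"]) for r in pooled])
```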

In conclusion, we should note the following shortcomings in the data model:

- the data was fragmented into a vast number of small files;
- card numbers were not unique, since the same card elicited several linguistic variables;
- the single-digit consecutive numbering of response codes proved inadequate;
- the slow and fast reading data shared the same card numbers, distinguishable only by filename.

Data entry

So far, we have been concerned with structuring the data so as to arrive at an optimal model. Optimal in the sense that all the relevant information should be recorded and be accessible in the most economical and efficient way. Implicitly, this also requires having regard to the way the information in the database is going to be processed but at this stage this should be a secondary consideration.

Let's now look at how the data was actually handled. Corresponding to the excessively fragmented and elaborate data structure was a fairly complex program that controlled the data entry. It was bound to end up like that because most of the complexity of finding one's way among the numerous files was left to the data entry program to sort out. Fortunately, what went on behind the scenes was not apparent to the end users. They did, however, soon come to lament certain inflexibilities in the operation of the program, which were the product of deliberate design decisions.

The operation of the system was fairly simple. Transcribers' work environment included a Sony BM 88 dictation machine and a PC. What they heard on the cassette tape was entered directly into the computer system. The initial screen of the program is displayed in Figure 3.3. Having chosen the data entry function of the program, the transcribers saw the contents of the cards coming up on the screen one after another. Figure 3.4 is a screen shot of a data entry screen. Below the cards, the screen displayed the anticipated variants with their numerical codes and the transcribers were prompted to enter the variant they heard the informant say on the tape.


    
Figure 3.3: The main menu of the old user interface
Figure 3.4: The data entry screen

The smooth operation of the data entry was ensured by a database table. Recall that the whole interview was structured in terms of modules, some conversational, some card-based elicitation. The latter were listed in the table MODULES, whose contents are displayed in Tables 3.4 and 3.5.


  

  
Table 3.4: Contents of the MODULES table
Modul M_ID Program From Till No of records
KARTYAK1 VL1 9 1701 4630 47
KARTYAK2 VL2 9 4931 7860 46
JOSKA O1L 9 7901 7917 17
JOSKA O1G 9 8001 8017 17
MEGHIRDE O2L 9 8201 8218 18
MEGHIRDE O2G 9 8301 8318 18
KARTYAK3 VL3 9 8501 8826 32
V.LAP_1 AK1 A 9100 9100 21
KARTYAK4 VL4 9 9701 10216 56
HATODIK O3L 9 10301 10336 39
HATODIK O3G 9 10401 10436 39
V.LAP_2 AK2 A 10500 10500 22
PISTA O4L 9 10801 10833 28
PISTA O4G 9 10901 10933 28
FELMERÜL O5L 9 11001 11023 33
FELMERÜL O5G 9 11101 11123 33
V.LAP_3 AK3 A 11200 11200 22
EZERSZER O6L 9 11501 11527 27
EZERSZER O6G 9 11601 11627 27
KARTYAK5 VL5 9 11801 13331 34
HOL_VAN O7L 9 13401 13423 23
HOL_VAN O7G 9 13501 13523 23
KARTYAK6 VL6 9 13601 14207 17
RIPORTER RIP 9 14301 1430 46
DEMOGRAF DEM 9 14500 14500 1
KISZEDO KIS K 14600 14600 1

Table 3.5: The structure of the MODULES table
Field name Description
Modul The name of the module
M_ID Three letter ID used in the names of answer files
Program An internal flag to indicate what type of data entry program to use
From The number of the card with which the module starts
Till The number of the card with which the module ends
No of records The number of items in the module


MODULES serves as a source of data to control the procedure of the data entry as well as reflecting certain features of the data. The sequence of records here captures the chronological sequence of the modules, and this aspect was used to control the order of the data entry. The advantage of using a table to determine the procedural aspects of the program lies in the flexibility this method provides: by replacing the contents or the sequence of the records, the same control program can be used to handle a different set of data.
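The table-driven design can be sketched as follows; the three records are taken from Table 3.4, and the function name is an illustrative assumption:

```python
# A miniature MODULES table. The control program simply walks the
# records in sequence; replacing the table's contents or order changes
# the data entry procedure without touching the program itself.
MODULES = [
    {"modul": "KARTYAK1", "m_id": "VL1", "program": "9", "frm": 1701,  "till": 4630},
    {"modul": "V.LAP_1",  "m_id": "AK1", "program": "A", "frm": 9100,  "till": 9100},
    {"modul": "KARTYAK5", "m_id": "VL5", "program": "9", "frm": 11801, "till": 13331},
]

def entry_sequence(modules):
    """Yield, in table order, which entry program to run for which module."""
    for rec in modules:
        yield rec["modul"], rec["program"]

print(list(entry_sequence(MODULES)))
```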

The field Modul served to identify the source of the data that was displayed on the data entry screen. Again, these data had been arranged in tables whose names were registered in the Modul field. This way, the input to the data entry screen could be manipulated easily and at any time without having to rewrite the program. Changes between Buszi2 and Buszi3 only call for changing these tables.

The contents of a sample input table (KARTYAK1 'cards1') is shown in Table 3.6. As can be seen, fields S1 and S2 are two long lines of text, the second optional, containing the frame into which the target word was to be inserted. Fields V1 - V9 contained the slots for the anticipated variants of the target form. The field Limit was designed to record the range of the numerical codes of the possible answers in response to the particular item. This served the double purpose of disallowing any mistaken entry by the transcriber and of identifying which variable the answer referred to in case the same card served as the prompt for several variables. If none of the anticipated variants was actually used, the form had to be entered in the memo field attached to the entry. Also, the memo notes were used to record any remark, paralinguistic or prosodic feature in the informant's speech that was relevant.
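Reading a Lim value such as 12 or 36 in Table 3.6 as a pair of digits giving the lower and upper bounds of the admissible codes — an interpretation of the sample data, not something stated explicitly in the text — the double function of the field can be sketched as:

```python
def parse_limit(lim):
    """Split a two-digit Lim value into (low, high) bounds.
    Assumes the encoding '12' = codes 1-2, '36' = codes 3-6 (an
    interpretation of the Table 3.6 sample, not a documented fact)."""
    s = str(lim)
    return int(s[0]), int(s[1])

def validate(code, lim):
    """Reject any entered code outside the admissible range for this item."""
    low, high = parse_limit(lim)
    return low <= code <= high

# Card 1701 carries two variables: codes 1-2 for one (ebben/ebbe),
# codes 3-6 for the other (farmerben etc.), so the range also
# identifies which variable an answer refers to.
assert validate(1, 12) and not validate(3, 12)
assert validate(5, 36) and not validate(2, 36)
print("ok")
```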

As mentioned earlier, the answers given by an informant in response to a module were stored in separate files, the names of which were composed of the informant's ID and the module ID as stored in the M_ID field. For example, items elicited with the first batch of cards (in the KARTYAK1 module) from, say, informant B7303 were stored in the file named B7303VL1.dbf. (.dbf is the standard extension assigned by the program used, which was dBASE.)

Which module belongs to which type is recorded in the ``Program'' field. The two fields From and Till were meant to be looked up to see which module any particular card belongs to. The No of records field served to establish when the coding of a module was complete.
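The From/Till lookup amounts to a simple range search; the sketch below uses a few records from Table 3.4:

```python
# A few records from the MODULES table: module name with the first
# (From) and last (Till) card numbers it covers.
MODULES = [
    ("KARTYAK1", 1701, 4630),
    ("JOSKA",    7901, 7917),
    ("V.LAP_1",  9100, 9100),
]

def module_of(card_no):
    """Look up which module a card belongs to via its From/Till range."""
    for name, frm, till in MODULES:
        if frm <= card_no <= till:
            return name
    return None  # card number not covered by any listed module

print(module_of(1802))  # KARTYAK1
```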

In conclusion, the following seems to be a reasonable summary assessment of the data entry operation:


 

 
Table 3.6: Contents of input card table KARTYAK1
Card No 1701
S1 Ebben a ............... jól nézel ki.
S2 ebben
V1 ebbe
V2  
V3  
V4  
V5  
V6  
V7  
V8  
V9  
Lim 12
Card No 1701
S1 Ebben a ............... jól nézel ki.
S2 farmerben
V1 farmerbe
V2 farmerban
V3 farmerba
V4  
V5  
V6  
V7  
V8  
V9  
Lim 36
Card No 1903
S1 Mari ....... egy ingemet tegnap.
S2 kimosta
V1 kimosott
V2  
V3  
V4  
V5  
V6  
V7  
V8  
V9  
Lim 12


Tamás Váradi
12/26/1997