Next: A revised system Up: From Cards to Computer Previous: Transcribing the tape recorded

Subsections

Structure of output data
Data entry

Relational database systems

A relational database system consists of a set of data tables. Each table is a set of data items that share the same structure. A table is composed of an arbitrary number of records. A record is an elementary cluster of information relating to a single entity, typically containing a set of attributes called fields. Each entity (record) within a table must be characterised by the same attributes, which is just another way of saying that records must have the same structure. A relational database table is best conceptualised as a two-dimensional table in which records are the rows and fields are the columns. One can easily see that the number and sequence of columns must be exactly specified (otherwise we could not draw the table at all), whereas the number of records can vary freely (i.e. the table can be arbitrarily long[*]).

Let's see an example. Obviously, we'd like to store information about our informants. We want to record their personal details such as name, address, age, sex etc. Already here the question arises of how to group this information. With names, for example, should there be two fields, one for family names and one for Christian names? And how should addresses be broken up? What one must bear in mind in such cases is that it is harder to access information from within a field than to access the contents of the whole field. Therefore, whatever is likely to be the target of a lookup, either on its own or in combination with other items of information, is best put in a separate field. So far, we may have the following scheme.
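The principle of one lookup target per field can be sketched in a few lines of Python; the field names and sample values below are illustrative assumptions, not the project's actual scheme:

```python
# Each record is a row with the same fixed set of fields (columns).
# Surname and first name sit in separate fields, so either can be the
# target of a lookup without parsing a combined name string.
INFORMANTS = [
    {"surname": "Kovacs", "first_name": "Anna",  "age": 34, "sex": "F"},
    {"surname": "Szabo",  "first_name": "Peter", "age": 51, "sex": "M"},
    {"surname": "Kovacs", "first_name": "Bela",  "age": 28, "sex": "M"},
]

def lookup(table, **criteria):
    """Return all records whose fields match the given criteria."""
    return [rec for rec in table
            if all(rec[field] == value for field, value in criteria.items())]

print(lookup(INFORMANTS, surname="Kovacs"))           # two matches
print(lookup(INFORMANTS, surname="Kovacs", sex="M"))  # one match
```

Had surname and first name been stored in a single field, the second query would have required string parsing rather than a simple whole-field comparison.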


  
Figure 3.1: Details of informants divided into fields

The table in Figure 3.1 could be sufficient in itself, but most of the time we are dealing with several tables which are related to each other. For example, associated with the answers would be the informant. Obviously, it would be hugely redundant to store the information on the informants in the table in which we keep the answers. (There are 250 informants, but up to 160 000 individual responses.) Instead, if we just stored the name of the informant next to each response, we could use that as a pointer to look up further details of the person from the INFORMANTS table. More efficient than using the surnames, or even the combination of surname and first name, would be to set up an ID field in the INFORMANTS table and use that as a key into the ANSWERS table. Note that in the INFORMANTS table there cannot be two records with the same ID, whereas in the ANSWERS table the answers given by the same informant will be placed in separate records with the INF_ID field containing the same ID. Indeed, that is how 'relationships' are expressed between tables: through the identical content of fields whose sole function may be to act as a link between tables. This is illustrated by Figure 3.2. The nature of the relationship may or may not be explicitly indicated in the field name or its content.
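The linking mechanism can be illustrated with a small Python sketch using SQLite; the schema below is a simplified assumption for demonstration (the real system was implemented in dBASE, not SQL):

```python
import sqlite3

# Two linked tables: PRIMARY KEY enforces that no two informant
# records share the same ID, while many answer records may carry it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE informants (inf_id TEXT PRIMARY KEY, surname TEXT)")
con.execute("CREATE TABLE answers (inf_id TEXT, card_no INTEGER, response INTEGER)")
con.execute("INSERT INTO informants VALUES ('B7211', 'Kovacs')")
con.executemany("INSERT INTO answers VALUES (?, ?, ?)",
                [("B7211", 1701, 1), ("B7211", 1802, 1)])

# The 'relationship' is nothing more than identical content in the
# inf_id fields: a join matches them up to recover informant details.
rows = con.execute("""SELECT a.card_no, a.response, i.surname
                      FROM answers a JOIN informants i ON a.inf_id = i.inf_id
                      ORDER BY a.card_no""").fetchall()
print(rows)  # [(1701, 1, 'Kovacs'), (1802, 1, 'Kovacs')]
```

The surname is stored once, yet every answer record can reach it through the shared INF_ID value.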


  
Figure 3.2: Linking between two tables through INF_ID field

This much should be enough to convey the gist of how relational database systems are structured. Of course, these are merely ground rules enabling us to arrange our data in the required way, and far from adequate for designing an EFFICIENT system. In fact, these few rules are all that the relational database model imposes on the data. They leave practically limitless scope for arranging the data one is dealing with in any way that one finds suitable or relevant.

It should be emphasized that the key to success in constructing an efficient database system lies in the careful modelling of the relationships inherent in the data. This should be the first step and one that deserves meticulous analysis. It is basically pencil-and-paper work, yielding an abstract scheme independent of any software considerations.

The next step is to implement the data model on the computer. This typically means turning to a general purpose database management software package and using its programming language and facilities to develop a software system.

Structure of output data

Let's see in some detail a first attempt at arranging the output of the card-based part of the interviews[*].

Table 3.1 shows a sample of an answer file of production data.


  

  
Table 3.1: The contents of an answer file for production type of data
INF_ID Card No Counter R_Tr R_Ch1 R_Ch2 Remark
B7211VL1.DBF
B7211 1701 1a1932 1
B7211 1701   3
B7211 1802   1
B7211 1903   2     I
B7211 2004   1
B7211 2105   2
B7211 2206   2
B7211 2307   2
B7211 2408   2
B7211 2509   1

Table 3.2: The structure of answer files for production type of data
Field name Description
Inf_ID The ID of the informant
Card_No The ID number of the card used to elicit the answer
Counter The tape counter reading[*]
R_Tr Informant response noted down by the transcriber
R_Ch1 Informant response noted down by the first checker
R_Ch2 Informant response noted down by the second checker
Remark Flag to indicate any remark (stored in a separate file)


The idea behind setting up the table in this way is to ensure that any particular record will reveal the essential points of who said what in response to which card. Note that the informant ID is carried in every record, apparently a huge redundancy. After all, one could argue, if we put the answers from the same informant in a file named B7211, we would not have to set up a field for this. However, this would mean that once we removed an item from the file, we would have no way of identifying the informant.
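A minimal sketch of the point, with illustrative field names: two per-informant "files" (here simply lists) are pooled, and it is the INF_ID carried in every record that keeps the records identifiable:

```python
# Two per-informant answer "files". Because each record carries its
# informant ID, the records remain identifiable even after pooling;
# without that field, the identity would live only in the filename.
file_b7211 = [{"inf_id": "B7211", "card_no": 1701, "response": 1}]
file_b7303 = [{"inf_id": "B7303", "card_no": 1701, "response": 3}]

pooled = file_b7211 + file_b7303

# Group the pooled records back by informant using the carried ID.
by_informant = {}
for rec in pooled:
    by_informant.setdefault(rec["inf_id"], []).append(rec)

print(sorted(by_informant))  # ['B7211', 'B7303']
```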

Each informant's answers were stored in separate files. Moreover, for each informant the answers given to the 26 different modules were again each stored in different files. This soon led to an explosion in the number of files generated as transcription work gathered momentum. It meant that the total set of results would have resided in 250 x 26 = 6500 files, plus the 250 text files. With hindsight, it was unnecessary to store the data in so many files. It put a substantial burden on the file storage system but, more importantly, it would have made data access across different files extremely difficult.

This proliferation of files, however, was merely a nuisance, as the files could be collated into a single one without much difficulty.

Unfortunately, the structure of the answer files differed according to the type of modules they came from. Production data, judgement data and the answers to the staple remover test were processed differently.

One obvious difference between the production and the judgement data lies in the way the answers are coded (letters ``a'' for identical and ``k'' for different, as against numbers for the production data) as well as in the reference system of the cards. As individual items in the judgement task were elicited not on separate cards but on a sheet which contained 20-21 items, the card number referred to the whole sheet, and an auxiliary field (Item No) was devised to establish unique reference to individual items.

Accordingly, the judgement data were accommodated in the following type of tables:


 

 
Table 3.3: Sample answer file for judgement data
Inf_ID Card No Item No Counter Response Remark
B7301 9100 1 1a2610 a  
B7301 9100 2   k  
B7301 9100 3   a  
B7301 9100 4   k  
B7301 9100 5   a  
B7301 9100 6   a  
B7301 9100 7 1b0116 k I
B7301 9100 8   a  
B7301 9100 9 1b0151 k I
B7301 9100 10   k  

As it turned out later, the card numbering system was more seriously flawed.

First, card numbers were not unique. As pointed out in 2.1, the same card served to elicit a number of linguistic variables. Originally, the idea was to use consecutive numerical codes for the responses so that answer codes 1-3 covered language problem a, codes 4-5 language problem b, etc. To add to the problems, it was decided that a single-digit number would be sufficient to record alternatives. This may have proved adequate for a single language variable but not when there were several variables all consecutively numbered. So it happened that occasionally the principle of consecutive numbering had to be broken anyway.

Secondly, and this was indeed the most serious flaw, the slow and fast reading data were assigned the same card number reference within the files. Again, this stems from an attempt to map the physical data closely onto its representation in the database. As the same card was used for both the slow and the fast readings, the data was assigned the same card number. True, the readings were put in different files, and the filename did reflect what module the data came from. However, this information was only good as long as the data was used in terms of files rather than individual records. As soon as one wanted to pool the answers together, one would have been left with no means of distinguishing the fast and slow reading data.
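One way the problem could have been avoided is sketched below: when records are pooled, the file-level information is copied into an explicit field of each record (the field name "tempo" is a hypothetical choice, not the project's):

```python
# The same card number appears in both the slow- and fast-reading
# files; once pooled, only an explicit field can tell the records apart.
slow_file = [{"card_no": 10801, "response": 1}]
fast_file = [{"card_no": 10801, "response": 2}]

pooled = []
for tempo, records in (("slow", slow_file), ("fast", fast_file)):
    for rec in records:
        # Carry the file-level information into every record.
        pooled.append({**rec, "tempo": tempo})

# Without the added field the two records would be indistinguishable:
print([(r["card_no"], r["tempo"]) for r in pooled])
```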

In conclusion, we should note the following shortcomings in the data model:

- the data was fragmented into a vast number of small files;
- card numbers were not unique, since the same card elicited several linguistic variables;
- the single-digit consecutive numbering of response codes proved inadequate;
- the slow and fast reading data shared the same card numbers, distinguishable only by filename.

Data entry

So far, we have been concerned with structuring the data so as to arrive at an optimal model. Optimal in the sense that all the relevant information should be recorded and be accessible in the most economical and efficient way. Implicitly, this also requires having regard to the way the information in the database is going to be processed but at this stage this should be a secondary consideration.

Let's now look at how the data was actually handled. Corresponding to the excessively fragmented and elaborate data structure was a fairly complex program that controlled the data entry. It was bound to end up like that because most of the complexity of finding one's way among the numerous files was left to the data entry program to sort out. Fortunately, what went on behind the scenes was not apparent to the end users. They did, however, soon come to lament certain inflexibilities in the operation of the program, which were the product of deliberate design decisions.

The operation of the system was fairly simple. Transcribers' work environment included a Sony BM 88 dictation machine and a PC. What they heard on the cassette tape was entered directly into the computer system. The initial screen of the program is displayed in Figure 3.3. Having chosen the data entry function of the program, the transcribers saw the contents of the cards coming up on the screen one after another. Figure 3.4 is a screen shot of a data entry screen. Below the cards, the screen displayed the anticipated variants with their numerical codes and the transcribers were prompted to enter the variant they heard the informant say on the tape.


    
Figure 3.3: The main menu of the old user interface
Figure 3.4: The data entry screen

The smooth operation of the data entry was ensured by a database table. Recall that the whole interview was structured in terms of modules, some conversational, some card-based elicitation. The latter were listed in the table MODULES, whose contents are displayed in Tables 3.4 and 3.5.


  

  
Table 3.4: Contents of the MODULES table
Modul M_ID Program From Till No of records
KARTYAK1 VL1 9 1701 4630 47
KARTYAK2 VL2 9 4931 7860 46
JOSKA O1L 9 7901 7917 17
JOSKA O1G 9 8001 8017 17
MEGHIRDE O2L 9 8201 8218 18
MEGHIRDE O2G 9 8301 8318 18
KARTYAK3 VL3 9 8501 8826 32
V.LAP_1 AK1 A 9100 9100 21
KARTYAK4 VL4 9 9701 10216 56
HATODIK O3L 9 10301 10336 39
HATODIK O3G 9 10401 10436 39
V.LAP_2 AK2 A 10500 10500 22
PISTA O4L 9 10801 10833 28
PISTA O4G 9 10901 10933 28
FELMERÜL O5L 9 11001 11023 33
FELMERÜL O5G 9 11101 11123 33
V.LAP_3 AK3 A 11200 11200 22
EZERSZER O6L 9 11501 11527 27
EZERSZER O6G 9 11601 11627 27
KARTYAK5 VL5 9 11801 13331 34
HOL_VAN O7L 9 13401 13423 23
HOL_VAN O7G 9 13501 13523 23
KARTYAK6 VL6 9 13601 14207 17
RIPORTER RIP 9 14301 1430 46
DEMOGRAF DEM 9 14500 14500 1
KISZEDO KIS K 14600 14600 1

Table 3.5: The structure of the MODULES table
Field name Description
Modul The name of the module
M_ID Three letter ID used in the names of answer files
Program An internal flag to indicate what type of data entry program to use
From The number of the card with which the module starts
Till The number of the card with which the module ends
No of records The number of items in the module


MODULES serves as a source of data to control the procedure of the data entry as well as reflecting certain features of the data. The sequence of records here captures the chronological sequence of the modules, and this aspect was used to control the order of the data entry. The advantage of using a table to determine the procedural aspects of the program lies in the flexibility this method provides: by replacing the contents or the sequence of the records, the same control program can be used to handle a different set of data.
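The table-driven design can be sketched as follows; the three records are taken from Table 3.4, and the function name is an illustrative assumption:

```python
# A miniature MODULES table. The control program simply walks the
# records in sequence; replacing the table's contents or order changes
# the data entry procedure without touching the program itself.
MODULES = [
    {"modul": "KARTYAK1", "m_id": "VL1", "program": "9", "frm": 1701,  "till": 4630},
    {"modul": "V.LAP_1",  "m_id": "AK1", "program": "A", "frm": 9100,  "till": 9100},
    {"modul": "KARTYAK5", "m_id": "VL5", "program": "9", "frm": 11801, "till": 13331},
]

def entry_sequence(modules):
    """Yield, in table order, which entry program to run for which module."""
    for rec in modules:
        yield rec["modul"], rec["program"]

print(list(entry_sequence(MODULES)))
```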

The field Modul served to identify the source of the data that was displayed on the data entry screen. Again, these data had been arranged in tables whose names were registered in the Modul field. This way, the input to the data entry screen could be manipulated easily and at any time without having to rewrite the program. Changes between Buszi2 and Buszi3 only call for changing these tables.

The contents of a sample input table (KARTYAK1 'cards1') is shown in Table 3.6. As can be seen, fields S1 and S2 are two long lines of text, the second optional, containing the frame into which the target word was to be inserted. Fields V1 - V9 contained the slots for the anticipated variants of the target form. The field Limit was designed to record the range of the numerical codes of the possible answers in response to the particular item. This served the double purpose of disallowing any mistaken entry by the transcriber and of identifying which variable the answer referred to in case the same card served as the prompt for several variables. If none of the anticipated variants was actually used, the form had to be entered in the memo field attached to the entry. Also, the memo notes were used to record any remark, paralinguistic or prosodic feature in the informant's speech that was relevant.
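Reading a Lim value such as 12 or 36 in Table 3.6 as a pair of digits giving the lower and upper bounds of the admissible codes — an interpretation of the sample data, not something stated explicitly in the text — the double function of the field can be sketched as:

```python
def parse_limit(lim):
    """Split a two-digit Lim value into (low, high) bounds.
    Assumes the encoding '12' = codes 1-2, '36' = codes 3-6 (an
    interpretation of the Table 3.6 sample, not a documented fact)."""
    s = str(lim)
    return int(s[0]), int(s[1])

def validate(code, lim):
    """Reject any entered code outside the admissible range for this item."""
    low, high = parse_limit(lim)
    return low <= code <= high

# Card 1701 carries two variables: codes 1-2 for one (ebben/ebbe),
# codes 3-6 for the other (farmerben etc.), so the range also
# identifies which variable an answer refers to.
assert validate(1, 12) and not validate(3, 12)
assert validate(5, 36) and not validate(2, 36)
print("ok")
```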

As mentioned earlier, the answers given by an informant in response to a module were stored in separate files, the names of which were composed of the informant's ID and the module ID as stored in the M_ID field. For example, items elicited with the first batch of cards (in the KARTYAK1 module) from, say, informant B7303 were stored in the file named B7303VL1.dbf. (.dbf is the standard extension assigned by the program used, which was dBASE.)

Which module belongs to which type is recorded in the ``Program'' field. The two fields From and Till were meant to be looked up to see which module any particular card belongs to. The No of records field served to establish when the coding of a module was complete.
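The From/Till lookup amounts to a simple range search; the sketch below uses a few records from Table 3.4:

```python
# A few records from the MODULES table: module name with the first
# (From) and last (Till) card numbers it covers.
MODULES = [
    ("KARTYAK1", 1701, 4630),
    ("JOSKA",    7901, 7917),
    ("V.LAP_1",  9100, 9100),
]

def module_of(card_no):
    """Look up which module a card belongs to via its From/Till range."""
    for name, frm, till in MODULES:
        if frm <= card_no <= till:
            return name
    return None  # card number not covered by any listed module

print(module_of(1802))  # KARTYAK1
```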

In conclusion, the following seems to be a reasonable summary assessment of the data entry operation:


 

 
Table 3.6: Contents of input card table KARTYAK1
Card No 1701
S1 Ebben a ............... jól nézel ki.
S2 ebben
V1 ebbe
V2  
V3  
V4  
V5  
V6  
V7  
V8  
V9  
Lim 12
Card No 1701
S1 Ebben a ............... jól nézel ki.
S2 farmerben
V1 farmerbe
V2 farmerban
V3 farmerba
V4  
V5  
V6  
V7  
V8  
V9  
Lim 36
Card No 1903
S1 Mari ....... egy ingemet tegnap.
S2 kimosta
V1 kimosott
V2  
V3  
V4  
V5  
V6  
V7  
V8  
V9  
Lim 12


Tamás Váradi
12/26/1997