As the previous chapter demonstrated, we are faced with a large number of apparently disparate kinds of data elicited through a battery of tasks which followed one another in a more or less set manner.
The tape recordings contained a full record of what was said throughout the interview and, apart from three tasks where answers were given by filling in questionnaire forms ("Same or different?" etc.), they contained all the information that we were concerned with.
The first question is what to take down from this body of data.
A word-by-word transcript of the full taped interview does not seem a good idea. First of all, we do not need to take down everything, because the interview contained set elements, i.e. frames. The most typical example is the sentences with missing words in the oral sentence completion task. Even the reading passages can be regarded as fulfilling the same role on the discourse level: their purpose was to provide a naturalistic context in which to elicit occurrences of the sociolinguistic variables that the BSI interview focussed on. On the other hand, part of the frame material also contained data that was elicited in other test items. For example, the very first test card
Ebben a ... nem mehetsz színházba.   FARMER

contains three words with the -bVn suffix. One of them occurred in the word that was to be inserted into the sentence. This word was presumably at the centre of the informants' attention, hence such items were termed primary variables. The other two instances, ebben and színházba, were thought to engage the informants' attention to a lesser degree, hence they were termed secondary variables.
You will recall that the whole interview was designed to elicit informants' use of certain linguistic items termed sociolinguistic variables. To a casual listener the tapes may sound like just a stream of words; for us, however, they contain nuggets of information scattered throughout the interview. Some of them are in predictable places (reading tasks), some are unpredictable (guided conversations). Among the data that we focus on, there will be recurrent patterns, i.e. variants of the same variable. Again, it would be redundant and cumbersome to make a verbatim textual record of such items. Instead, it makes much more sense to code the different variants of the same variable with a number and enter only the number into the records.
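The idea of replacing verbatim forms with numeric codes can be sketched as follows. The variable chosen and the variant forms are illustrative examples only, not the actual BSI coding tables:

```python
# Instead of recording variant forms verbatim, each variant of a
# variable is coded with a number. The variable and forms below are
# illustrative, not the actual BSI coding tables.

# Variants of the (bVn) variable, coded by number
BVN_VARIANTS = {
    1: "farmerban",  # standard inessive -ban
    2: "farmerba",   # non-standard -ba for -ban
}

# Reverse lookup: from an observed form to its numeric code
CODE_OF = {form: code for code, form in BVN_VARIANTS.items()}

def encode(observed_form):
    """Return the numeric code to be entered into the records."""
    return CODE_OF[observed_form]

print(encode("farmerba"))  # prints 2
```

Only the single digit, not the word, enters the records; the code table is what later makes the records interpretable.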
In short, the data should be extracted in a form that is appropriate to its content.
Let's now take a closer look at the data that we have to deal with. Immediately, we see a sharp division between the more or less free-form conversation part and the more closely structured card-based elicitation tasks. Accordingly, these two parts of the interview are processed in different ways. The conversations will be transcribed in the form of a more or less conventional transcript with some auxiliary information on the margins and some annotation interspersed in the text as will be described in the next section. The data from card-based elicitations will be entered into database tables in numerically coded form.
The conversations were transcribed according to a set of guidelines concerning both the form and the content of the transcript. Appendix A contains the full text of the annotation rules that were used in the transcription of the guided conversations. Following the text annotation methods employed by leading corpus projects of the time, such as the LOB Corpus (Garside, Leech & Sampson 1987) and the London-Lund Corpus (Svartvik 1990), the BSI transcripts used a fixed-format, line-based approach. This means that each line is self-contained in the sense that it carries all the information necessary to uniquely identify it within the whole corpus. This information is encoded at the beginning of each line, in character positions 1-16, which act as a virtual margin. This arrangement ensures that even if the corpus is subjected to a concordance search, in the course of which the text lines may be jumbled up and sorted in alphabetical order of the query word, each line can be identified and its original context looked up. See Figure A.1 in Appendix A for a sample page of transcription.
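The fixed-format scheme can be sketched in a few lines. The internal layout of the margin shown below (informant identifier, task code, line number) is our own illustrative assumption; the actual conventions are specified in Appendix A:

```python
# Each transcript line carries a 16-character virtual margin that
# uniquely identifies it in the corpus. The margin's internal layout
# here (informant id, task code, line number) is an illustrative
# assumption, not the actual Appendix A specification.

def parse_transcript_line(line):
    """Split a fixed-format line into its margin and text parts."""
    margin = line[:16].rstrip()  # character positions 1-16
    text = line[16:].rstrip()    # the transcribed speech itself
    return margin, text

# Even if a concordance search sorts lines alphabetically by the query
# word, the margin lets us look up the original context:
margin, text = parse_transcript_line(
    "B401 GC 0042    ezzel a farmerral mentem moziba")
print(margin)  # prints B401 GC 0042
```

Because every line carries its own identifier, no surrounding context is needed to trace a concordance hit back to its place in the interview.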
The first task in processing the card-based data is to itemise them, i.e. break them up into separate linguistic variables. At first blush, one might think that one card represents one item. For various reasons, however, this is not necessarily the case. Recall that in a frequently used task the informant is asked to produce the form of a word fitting the given context. The form produced may display a number of features which belong to different variables. Consider again the example cited in 2.1:
Ebben a ... nem mehetsz színházba.   FARMER

In order to insert the prompt word into the sentence, the informant has to make choices along two different sociolinguistic variables monitored by BSI: (1) vowel harmony (farmerban vs. farmerben) and (2) the -bVn variable (e.g. farmerban vs. farmerba). Therefore, just to encode the form of the prompt word in all aspects relevant to the BSI investigation, we need to enter it into two different records. In addition, as discussed in 2.1, the above test sentence frame contains a number of secondary variables as well.
The first step, then, is to assign the maximum number of variables to each card. This yields a sequence of variables, some recurring, in the order that the informant is asked to produce them.
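The result of itemising a card might be represented as an ordered list like the one below. The breakdown is an illustrative sketch based on the example card in 2.1, not the actual BSI card inventory:

```python
# Itemising a card: a single card may yield several variables, some
# recurring, in the order the informant produces them. This breakdown
# is an illustrative sketch, not the actual BSI card inventory.

# Card: "Ebben a ... nem mehetsz színházba."  FARMER
card_items = [
    ("bVn", "secondary"),          # ebben, in the frame
    ("vowel_harmony", "primary"),  # farmerbAn, in the prompt word
    ("bVn", "primary"),            # farmerbVn, in the prompt word
    ("bVn", "secondary"),          # színházba, in the frame
]

# The prompt word alone contributes two records, one per variable:
primary = [name for name, status in card_items if status == "primary"]
print(primary)  # prints ['vowel_harmony', 'bVn']
```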
The next task is to establish the range of variants of each variable. These are based on a more or less educated guess about what the informant is likely to, or can possibly, produce in the given context. Strictly speaking, this amounts to anticipating, and thereby facilitating, the coding of the data. There is no theoretical reason why a finite number of variants should be established beforehand. However, it so happens that the spread of variant usage can be captured in a finite number of alternatives. In fact, we adopted the position that the number of variants would be a single-digit figure, and it very rarely proved insufficient. (Though a single such case was bad news enough!)
In practical terms, what happened was that before the data entry program was compiled, each card was carefully examined for all the potential forms that the given context might invoke. These were then arranged in decreasing order of likelihood of occurrence in that particular context and assigned a number. This ordering proved less felicitous than it might have been: as likelihood of occurrence varied with context, adherence to this order meant that the same variant was not necessarily assigned the same numeric value on every card. This problem only emerges when we are concerned with the retrieval of data, a topic we thought we could face once we had sorted out all the preceding stages. With hindsight, we can now conclude that it would have been wiser if the whole process of data collection, encoding, data entry and retrieval had been considered in its entirety from the beginning.
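The retrieval pitfall can be made concrete with a small sketch. The code tables below are invented for illustration, not the actual assignments:

```python
# Variants were numbered per card, in decreasing order of expected
# likelihood in that context. As a result the same variant could carry
# different codes on different cards, which complicates retrieval.
# These code tables are illustrative, not the actual assignments.

codes_card_a = {1: "-ban", 2: "-ba"}  # -ban expected in this frame
codes_card_b = {1: "-ba", 2: "-ban"}  # -ba expected in this frame

def decode(card_codes, value):
    """Retrieval must always consult the per-card code table."""
    return card_codes[value]

# Code 1 does not name the same variant on the two cards:
print(decode(codes_card_a, 1))  # prints -ban
print(decode(codes_card_b, 1))  # prints -ba
```

A context-independent numbering, fixed once for each variable, would have let queries compare codes across cards directly.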
Having considered the structure of the data that served as input for our records, let's tackle the question of how to store them. We have briefly mentioned that they are stored in a database. The term database is often used fairly liberally to refer to any collection of data; however, it also has a more restricted technical sense.
We'll be introducing the essential ideas of database systems as we go along in the discussion, but let's start by considering how the most popular type of database system, the so-called relational database, works. Then we'll see how the BUSZI data could be arranged in terms of this scheme.
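As a preview, here is one way the card-based data might be laid out relationally: one row per informant, card and variable, holding the numeric variant code. The table and column names are our own illustration, not the actual BUSZI schema:

```python
# A minimal relational sketch: card-based data stored as one row per
# (informant, card, variable) with the numeric variant code. The table
# and column names are our own illustration, not the actual BUSZI schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE responses (
        informant TEXT,     -- informant identifier
        card      INTEGER,  -- test card number
        variable  TEXT,     -- sociolinguistic variable, e.g. 'bVn'
        variant   INTEGER   -- numeric code of the variant produced
    )
""")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?, ?, ?)",
    [("B401", 1, "bVn", 2), ("B402", 1, "bVn", 1)],
)

# A typical retrieval question: how often was each variant of (bVn)
# produced across informants?
for variant, count in conn.execute(
    "SELECT variant, COUNT(*) FROM responses "
    "WHERE variable = 'bVn' GROUP BY variant"
):
    print(variant, count)
```

The point of the relational arrangement is precisely that such questions become simple queries rather than passes over the raw transcripts.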
Garside, R. G., G. Leech & G. Sampson (eds.). 1987. The computational analysis of English: A corpus-based approach. London: Longman.
Svartvik, Jan (ed.). 1990. The London-Lund Corpus of Spoken English: Description and research. Lund Studies in English 82. Lund: Lund University Press.
Váradi, Tamás. 1995/1996. Stylistic variation and the (bVn) variable in the Budapest Sociolinguistic Interview. Acta Linguistica Hungarica 43:295-309.