The processing of data

Transcription is carried out using SONY BM-80 desktop transcribers and IBM PC/XT compatible personal computers. The entire interview material is transcribed and/or coded directly onto an electronic medium.

Computationally, the data collected through the interview fall into two broad categories: (1) test-like tasks and (2) continuous speech. These two kinds of data require different treatment:

CONTINUOUS SPEECH is transcribed as a standard ASCII text file with the help of a specially programmed word processing program; TESTS are processed by means of a database management program, so that only the relevant parts of the informants' responses are recorded and coded. Both parts are integrated in the same system (dBASE III Plus), which means that the transcriber can switch fairly easily between transcribing continuous speech and coding test items.

Although all key aspects of the interview are carefully controlled, the length and structure of each interview will inevitably differ. For our purposes, the structure of an individual interview can be seen as a sequence of conversation and test modules, where the number and ordering of the modules are not rigidly controlled. It is essential, therefore, that the system be flexible enough to accommodate such varied material, while every single item remains uniquely identifiable and amenable to further processing.

Transcription of guided conversations

The basic philosophy

Format conventions

The whole of the spontaneous speech from a single informant is entered in a single file. Each conversation module (CM) is recorded in a single paragraph, that is, an empty line is used to set off one conversation module from another. Each CM is headed by an identifier line consisting of the following information:

**Figure 6.4:** Test items in the "How do YOU say it?" section

**Figure 6.5:** The list of elicited lexical items

| Columns | Content |
|---------|---------|
| 1-5 | identifier of the informant |
| 6-8 | identifier of the conversational unit |
| 10-15 | location of CM on tape |
| 17-18 | identifier of transcriber |
| 20-30 | date |
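
The column layout of the identifier line can be illustrated with a minimal sketch in Python. This is not the original dBASE III Plus implementation; the names `ModuleHeader`, `field` and `parse_header`, as well as the sample line and its date notation, are invented for illustration.

```python
from typing import NamedTuple


class ModuleHeader(NamedTuple):
    informant: str      # columns 1-5
    unit: str           # columns 6-8
    tape_location: str  # columns 10-15
    transcriber: str    # columns 17-18
    date: str           # columns 20-30


def field(line: str, first: int, last: int) -> str:
    """Return columns first..last (1-based, inclusive) without padding."""
    return line[first - 1:last].strip()


def parse_header(line: str) -> ModuleHeader:
    """Split an identifier line into its fixed-width fields."""
    return ModuleHeader(
        informant=field(line, 1, 5),
        unit=field(line, 6, 8),
        tape_location=field(line, 10, 15),
        transcriber=field(line, 17, 18),
        date=field(line, 20, 30),
    )


# Invented sample; real identifiers and date formats may differ.
sample = "B0101bio 001234 TV 1988-03-15"
header = parse_header(sample)
```

Because every field sits at a fixed position, no delimiter is needed, and a short date simply leaves the rest of its field blank.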

Each line of transcribed speech is 80 characters long and consists of the following parts:

| Columns | Content |
|---------|---------|
| 1-5 | identifier of the informant |
| 6-8 | identifier of the conversational unit |
| 10-13 | line number within CM |
| 15 | identifier of current speaker |
| 16 | continuity marker |
| 17-72 | text |
| 74-79 | location on tape |
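
A line of transcribed speech can be taken apart in the same way. The following Python sketch only illustrates the layout given in the table; the speaker code `A`, the continuity marker `+` and the sample text are assumptions, since the actual code values are not listed here.

```python
from typing import NamedTuple


class TranscriptLine(NamedTuple):
    informant: str      # columns 1-5
    unit: str           # columns 6-8
    line_no: int        # columns 10-13
    speaker: str        # column 15
    continuity: str     # column 16
    text: str           # columns 17-72
    tape_location: str  # columns 74-79


def cols(line: str, first: int, last: int) -> str:
    """Return columns first..last (1-based, inclusive), padding included."""
    return line[first - 1:last]


def parse_line(raw: str) -> TranscriptLine:
    """Split one 80-character transcript line into its fields."""
    padded = raw.rstrip("\n").ljust(80)  # tolerate stripped trailing blanks
    return TranscriptLine(
        informant=cols(padded, 1, 5).strip(),
        unit=cols(padded, 6, 8).strip(),
        line_no=int(cols(padded, 10, 13)),
        speaker=cols(padded, 15, 15),
        continuity=cols(padded, 16, 16),
        text=cols(padded, 17, 72).rstrip(),
        tape_location=cols(padded, 74, 79).strip(),
    )


# Invented sample, assembled field by field so the columns are easy to check.
sample = (
    "B0101"                                 # informant
    + "bio"                                 # conversational unit
    + " "
    + "0001"                                # line number within CM
    + " "
    + "A"                                   # speaker (hypothetical code)
    + "+"                                   # continuity marker (hypothetical)
    + "I was born in Budapest".ljust(56)    # text field, 56 columns wide
    + " "
    + "001234"                              # location on tape
)
parsed = parse_line(sample)
```

Since the informant and unit identifiers are repeated on every line, each line can be traced back to its source even when it is taken out of its file.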

The last line of a CM gives information about the unit (whether a conversational or a test unit) that follows the current one, in exactly the same format as the header line described above.
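
Taken together, these conventions mean that an informant's file can be cut into modules at the empty lines, with the first line of each block serving as the header and the last line pointing to the following unit. A minimal Python sketch of this step follows; the names `Module` and `split_modules` and the placeholder lines are invented, and the original system did not necessarily proceed in this way.

```python
from typing import List, NamedTuple


class Module(NamedTuple):
    header: str        # identifier line of this module
    speech: List[str]  # transcribed lines between header and closing line
    next_unit: str     # closing line: header-format pointer to the next unit


def split_modules(text: str) -> List[Module]:
    """Cut a transcript file into modules separated by empty lines."""
    modules: List[Module] = []
    block: List[str] = []
    # The extra empty string flushes the final block of the file.
    for raw in text.splitlines() + [""]:
        if raw.strip():
            block.append(raw)
        elif block:
            if len(block) < 2:
                raise ValueError("a module needs a header and a closing line")
            modules.append(Module(block[0], block[1:-1], block[-1]))
            block = []
    return modules


# Placeholder content only; real lines follow the fixed-width layouts above.
sample = "\n".join([
    "HEADER bio",
    "speech line 1",
    "speech line 2",
    "NEXT vl1",
    "",
    "HEADER cmö",
    "speech line 3",
    "NEXT vl2",
])
modules = split_modules(sample)
```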

Sample transcription

**Figure 7.1:** Transcription of conversation modules

Figure 7.1 shows a sample page of transcript with excerpts from two conversation modules. The first, coded bio (the informant's personal data), is followed by a test coded vl1 and by other modules omitted here for lack of space. The second, coded cmö (the Gipsy question), is followed by test unit vl2.

The coding of test-like material

The majority of the test materials involve the informant reading out or saying what s/he thinks is the correct response. Only the relevant parts of the informant's responses are coded by the transcriber. Coding is done through screen masks that contain the original stimulus sentence and the anticipated responses, together with their numerical codes.

**Figure 7.2:** Screen print of old data entry program in 1988
[Image: ..../images/image3.eps]

The assignment of items to primary or secondary data status is done automatically, as is the assignment of the individual cards to the research questions they are designed to survey. Owing to the intricate nature of the testing involved, a single test sentence may examine several different research questions, and the same research question may be involved in several different test sentences (as well as in the guided conversations, of course).
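
The many-to-many relation between cards and research questions can be pictured as a list of (card, question, status) triples indexed in both directions. The Python sketch below is only an illustration under that assumption; the card numbers, the label `other_question` and the function name `index_assignments` are invented, and the dBASE tables actually used may be organised differently.

```python
from collections import defaultdict

# Invented assignments: (card, research question, data status).
ASSIGNMENTS = [
    ("card_017", "suk_suk", "primary"),
    ("card_017", "other_question", "secondary"),
    ("card_042", "suk_suk", "secondary"),
]


def index_assignments(assignments):
    """Index the triples by card and by research question."""
    by_card = defaultdict(list)
    by_question = defaultdict(list)
    for card, question, status in assignments:
        by_card[card].append((question, status))
        by_question[question].append((card, status))
    return dict(by_card), dict(by_question)


by_card, by_question = index_assignments(ASSIGNMENTS)
```

With both indexes available, one lookup lists every question a card examines, and another lists every card that bears on a given question.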

Since the original compilation of this document the data processing software has been thoroughly revised. See Váradi 1998 for more details. Figure 7.3 shows a screen print of the revised data entry system.

**Figure 7.3:** Screen print of revised data entry program in 1994
[Image: ..../images/image4.eps]

Some envisaged applications of the system

With the help of the present system it will be possible to tell exactly which cards examine the same research question (e.g. -suk/-sük conjugation) as a primary or as a secondary feature, and how the informants' responses are distributed over the total number of contexts in which the question was analysed, or over any subset of them.
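
A distribution of this kind amounts to counting response codes over the selected contexts. The following Python sketch assumes that each coded response is available as a record with `card`, `question`, `status` and `code` fields; these field names, the card numbers and the code values 1 and 2 are invented for illustration.

```python
from collections import Counter

# Invented records; code values 1 and 2 stand for two anticipated responses.
RESPONSES = [
    {"card": "card_017", "question": "suk_suk", "status": "primary", "code": 1},
    {"card": "card_017", "question": "other_question", "status": "secondary", "code": 2},
    {"card": "card_042", "question": "suk_suk", "status": "secondary", "code": 2},
    {"card": "card_051", "question": "suk_suk", "status": "primary", "code": 1},
    {"card": "card_063", "question": "suk_suk", "status": "primary", "code": 1},
]


def response_distribution(responses, question, status=None):
    """Relative frequency of each response code for one research question.

    If status is given, only contexts with that data status are counted.
    """
    counts = Counter(
        r["code"]
        for r in responses
        if r["question"] == question and (status is None or r["status"] == status)
    )
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


overall = response_distribution(RESPONSES, "suk_suk")
primary_only = response_distribution(RESPONSES, "suk_suk", status="primary")
```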

Furthermore, because this issue is manually coded in the transcription of guided conversations as well, one can also collect accurate information (through concordance searches) about the incidence of the same question in the entire set of one informant's utterances. That is, test data and conversational data can be collected for any selected variable (e.g. -suk/-sük).

As each line carries a reference to its locus, it will be possible to examine the distribution of certain features within a given conversation module. For example, it will be easy to say whether a particular lexeme or grammatical variable is spread evenly across all conversation modules or is frequent in one module but infrequent in another.
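
Because the module code travels with every line, such a comparison reduces to a tally per module. The Python sketch below assumes the lines are available as (module, text) pairs and that the feature is marked by a searchable string; the tag `<suk>`, the function name `per_module_counts` and the placeholder text are invented.

```python
from collections import Counter


def per_module_counts(lines, feature):
    """Map each module to (occurrences of feature, number of lines)."""
    hits = Counter()
    totals = Counter()
    for module, text in lines:
        totals[module] += 1
        hits[module] += text.count(feature)
    return {module: (hits[module], totals[module]) for module in totals}


# Placeholder lines tagged with the module codes bio and cmö.
LINES = [
    ("bio", "x <suk> y"),
    ("bio", "z"),
    ("cmö", "<suk> <suk>"),
    ("cmö", "w"),
    ("cmö", "v"),
]
counts = per_module_counts(LINES, "<suk>")
```

Reporting the number of lines alongside the number of hits keeps a long module from looking richer in the feature merely because it contains more speech.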

References

Váradi, Tamás. 1998. From Cards to Computer Files: Processing the Data of The Budapest Sociolinguistic Interview. Working Papers in Hungarian Sociolinguistics No. 3, January 1998. Linguistics Institute, Hungarian Academy of Sciences, Budapest.