Indexing Old Literary Finnish text

Kimmo Koskenniemi
University of Helsinki
kimmo.koskenniemi@helsinki.fi

Pirkko Kuutti
Institute for the Languages of Finland
pirkko.kuutti@kotus.fi

Abstract

One purpose of this study was to test the Helsinki Finite-State Transducer (HFST) tools, including the hfst-twolc compiler, the use of weighted finite-state transducers, the use of the HFST tools from Python scripts, and their combined use for comparing two related language forms. A strict procedure was followed in constructing, testing and revising two-level rules which relate written Modern Standard Finnish to the Old Literary Finnish used in the 17th-century Bible. In particular, the advantages of the strict independence of the two-level rules were utilised. No practical production system was planned, but the results could be quite useful for indexing and concordancing similar Old Literary Finnish texts.

1 Corpus

A corpus of readily available old Finnish texts was needed for the study, more specifically texts whose language was sufficiently different from Modern Standard Finnish (MSF) but whose internal variation was reasonable. The Finnish written between the years 1540 and 1810, called Old Literary Finnish (OLF, in Finnish “vanha kirjasuomi”), is sufficiently distinct from MSF for the purposes of this study. Morphological analyzers for MSF cannot be used as such for any OLF texts, and the differences grow the further one goes back in time.

The Finnish translation of the Bible from the year 1642 (often called Biblia) seemed suitable for the purposes of this project. Its language is homogeneous enough, and the text of Biblia is available in digital form from the Kaino1 service of the Institute for the Languages of Finland (Kotus). The whole translation consists of some 900,000 word tokens. For the present study, the fourth part of the Old Testament (VT-4, some 20,000 word tokens) and the first part of the New Testament (UT-1, some 12,000 word tokens) were selected and used together as our corpus2. A smaller corpus could have been sufficient for the design of the rules, but a fair amount of text was needed in order to extract a list of common word forms.

The material chosen was fairly but not fully homogeneous. Orthographic conventions used in the corpus were reasonably consistent although they represent significantly more variation than what one finds in MSF texts. Some older materials might have been harder to handle, and some more recent materials might have been easier but less interesting to process.

An extract of the 1642 translation of the Bible (B1-Jes-3:11-3:12), along with its modern translation3 (1992), is given in Figure 1 with some notes on the structural differences between them.


OLF                                          | MSF                                     | Note
Mutta woi jumalattomita:                     | Voi jumalatonta!                        | missing word
sillä he owat pahat                          | Hänen käy huonosti,                     | PL vs. SG, different verb and construction
ja heille maxetan                            | hänelle tehdään                         | different verb
nijncuin he ansaidzewat.                     | niin kuin hän itse teki.                | different construction
Lapset owat minun Canssani waiwajat          | Kansani valtiaat ovat lapsia,           | different
ja waimot wallidzewat heitä.                 | ja naiset hallitsevat sitä.             | different verb
Minun Canssani                               | Kansani,                                |
sinun lohduttajas häiridzewät sinun          | sinun opastajasi vievät sinut harhaan,  | different verb and case
ja turmelewat tien jota sinun käymän pidäis. | he ovat hämmentäneet askeltesi suunnan. | different verb construction


Figure 1: A small extract of Biblia 1642 and the same passage in modern translation.


There are many kinds of differences between the translations. Some of them reflect orthographic conventions which have since changed, such as using w instead of v and sometimes a single letter a instead of a double aa for a long vowel. The OLF of those days also had more features from the western dialects than MSF has. The language itself has changed in the meantime and continues to change. The changes are both phonological and morphological: the OLF texts often omit word-final letters. The use of ending allomorphs was then quite different and has changed significantly even during the last fifty years, as one can see by comparing Nykysuomen sanakirja (Sadeniemi, 1951-1961) and Kielitoimiston sanakirja (Grönros and Kotimaisten kielten tutkimuskeskus, 2006).

One can also find words in the corpus that are not used in MSF, and some familiar words are used in another sense. The present study did not try to solve such problems, which concern the vocabulary and the lexicon. Only phonological, morphophonological and, to some extent, allomorphic differences are addressed here.

2 Representative example words

It is obvious from the above examples that one cannot align the Biblia 1642 directly with the modern translation word by word, because the translations are too far apart from each other. Instead of statistical word alignment over large sets of words, we use a fairly small set of carefully chosen, good-quality examples.

We started from the list of all word forms which occurred at least six times in the corpus. The list was browsed and some 180 example words were picked. The words were chosen so that there were a few examples of each type of systematic difference between the OLF and MSF written forms of Finnish. Figure 2 shows a fragment of the word forms chosen by the second author.


    caupungihin  
    caupungijn  
    corwes  
    cuckoi  
    cuolemaan  
    cuoleman  
    cuolluitten  
    cuulitta  
    cuulcan  
    kärsimän


Figure 2: Some of the selected example words out of the corpus.


It was important that the list of example words would cover all common systematic differences between the MSF and the OLF forms, including orthographic and morphophonological ones.

3 Example word pairs

The next step was to associate each of these OLF word forms with its likely MSF counterparts. The possible MSF forms corresponding to each OLF form were added, see Figure 3. If the OLF word form could correspond to several MSF word forms, the OLF form was repeated, see cuoleman and kärsimän below. The relation between the OLF forms and the MSF forms is inherently many-to-many, i.e. one modern form may correspond to several different old forms, and an old form may correspond to several distinct modern forms. The rules must permit some variation but still constrain the possibilities to a minimum.


    kaupunkiin:caupungihin  
    kaupunkiin:caupungijn  
    korvessa:corwes  
    kukko:cuckoi  
    kuolemaan:cuolemaan  
    kuoleman:cuoleman  
    kuolemaan:cuoleman  
    kuolleitten:cuolluitten  
    kuulitte:cuulitta  
    kuulkoon:cuulcan  
    kärsimän:kärsimän  
    kärsimään:kärsimän


Figure 3: The selected OLF example word forms with their corresponding forms in MSF. The MSF form is to the left of the colon and the OLF form to the right.


4 Character by character alignment

The above example word pairs are not usable for our purposes as such, because the OLF and the MSF word forms are sometimes of different length. The OLF form often omits a final vowel, reduces long vowels to short ones and shortens geminate consonants, but sometimes it geminates a consonant or adds a vowel, etc.

Therefore, we must add zero symbols as necessary so that similar letters correspond to each other, the first letter of the MSF word to the first letter of the OLF word, and so on. If the MSF word is longer than the OLF word, one must add one or more zeros to the old word in order to make the letters correspond to each other. Correspondingly, if the modern word is shorter, the zeros are added to it. The goal is that the letters in all corresponding positions are similar. Zeros are added as necessary, but sparingly, e.g. as in kärsimään:kärsimØän (‘to suffer’).

It must be stressed that a real character is used as the zero instead of an epsilon, an empty string or its representation 0 in the XFST regular expression language. For practical reasons, the Danish Ø was chosen as the zero symbol, and it is used consistently in the rules and examples of this article.4

The exact positions of the inserted zeros are as important as the selection of the examples. The positions of the zeros determine what kinds of character correspondences we have. Each correspondence must be described with a rule, so the grammar may change a lot if the positions of the zeros are changed a little. In particular, any poorly positioned zero would force us to write more rules, and possibly very inadequate ones. The proper alignment also affects how well the grammar applies to the rest of the corpus.

Letters representing similar (or identical) sounds ought to be matched with each other. Matching very different ones, e.g. consonants with vowels, must be avoided.

The initial insertion of the zeros was made manually, using linguistic intuition as a guideline. Once the zeros were in place, we converted the pairs of words into sequences of letter pairs5, as shown in Figure 4, where pairs of identical letters are printed as a single letter and pairs of corresponding non-identical letters are separated by a colon.


    k:c a u p u n k:g i Ø:h i n  
    k:c a u p u n k:g i i:j n  
    k:c o r v:w e s:Ø s a:Ø  
    k:c u k:c k o Ø:i  
    k:c u o l e m a a n  
    k:c u o l e m a:Ø a n  
    k:c u o l e m a n  
    k:c u o l l e:u i t t e n  
    k:c u u l k:c o:Ø o:a n  
    k:c u u l i t t e:a  
    k ä r s i m ä n  
    k ä r s i m ä:Ø ä n


Figure 4: Some example words with zeros added and aligned letter by letter and shown as a sequence of letter pairs.


Once we have the aligned pairs, we compute a list of different pairs and their frequencies as in Figure 5. The pairs end up as the declaration of the alphabet in the two-level rule grammar. The frequencies guide the authoring of rules and can be directly used for weighting alternative analyses.


158 a     2 e:   5 j: 130 n    84 s    60 u     5   
 39 a:  22 e:  23 k    61 o     2 s:n   4 u:   1 :d  
  1 b     2 f:p  53 k:c   2 o:a  16 s:z   1 v     1 :e  
  9 d     3 g    12 k:g   7 o:  16 s:   1 v:f   3 :g  
 97 e    26 h     6 k:x  30 p   102 t     2 v:g   4 :h  
  3 e:a 122 i    54 l     5 p:b  29 t:d  34 v:w   1 :i  
  3 e:i   4 i:j   1 l:   1 p:w   1 t:l  17 y     1 :n  
  1 e:u  12 i:  36 m     1 p:   1 t:r   1 y:   1 :s  
  1 e:   6 j     3 m:  30 r     1 t:  44      4 :t  
                                         22 :


Figure 5: Frequencies of the letter pairs found in the aligned example words.


5 Automatic alignment

One may add further examples at later stages of the research, and one may also want to remove some examples if they turn out not to represent any general patterns. To facilitate the maintenance of the collection of examples, an automatic character-by-character alignment was constructed, see also Koskenniemi (2017). Such an automatic alignment procedure is expected to be useful for other purposes as well, including computational historical linguistics, where it can be used for relating cognate words, cf. Koskenniemi (2013a).

The International Phonetic Alphabet (IPA) presents a general taxonomy for vowels and another for consonants, both based on the articulatory features of sounds. This taxonomy and the features can be utilised in computing approximate distances between sounds. The alphabetic scripts of MSF and OLF can be characterised quite well using the articulatory features of the IPA, and for our purposes only a subset of the features covered by the IPA is needed.

A short Python script (see Appendix 13) was written for building a weighted finite-state transducer (WFST) out of the IPA features of the letters. For two-valued features, as well as for the tongue height of vowels and the place of articulation of consonants, an ad hoc numeric value was assigned to each position.6 The distances were computed by adding up the absolute values of the differences in each feature. Insertions and deletions of letters were all given a constant, fairly long distance. In addition to these systematically computed distances, some individual distances were set by hand. These were needed e.g. in order to guarantee a unique treatment of the shortening of double vowels or consonants; otherwise one could delete either of the two without any difference in the overall sum of distances. Thus, a few extra items were added to the distance calculation so that it is always the latter letter of the two that is deleted, if any (e.g. a a:Ø rather than a:Ø a). Ambiguities caused by the orthographic conventions, e.g. between k:x s:Ø and k:Ø s:x, and by gemination (the second identical consonant is added after, not before, the existing one) were resolved in a similar manner.
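
As an illustration of this distance computation, a minimal sketch with made-up feature values is given below; the letters covered and the numeric values are illustrative stand-ins, not the actual tables of the study.

    # A sketch of the feature-based letter distance; the feature values
    # below are illustrative, not the ones used in the study.
    ZERO = "Ø"                # the zero symbol used in the alignments
    DELETION_COST = 10.0      # constant, fairly long distance for insertions and deletions

    # e.g. (tongue height, backness, rounding) for a few vowels
    VOWEL_FEATURES = {
        "a": (1, 3, 0), "e": (2, 1, 0), "i": (3, 1, 0), "o": (2, 3, 1),
        "u": (3, 3, 1), "y": (3, 1, 1), "ä": (1, 1, 0), "ö": (2, 1, 1),
    }

    def letter_distance(x, y):
        """Sum of the absolute feature differences; a constant cost for
        pairing a letter with the zero symbol."""
        if x == y:
            return 0.0
        if ZERO in (x, y):
            return DELETION_COST
        return float(sum(abs(a - b)
                         for a, b in zip(VOWEL_FEATURES[x], VOWEL_FEATURES[y])))

    # letter_distance("o", "u") == 1.0 while letter_distance("i", "a") == 4.0

In the actual script, each letter pair with its distance becomes one transition of the WFST align used below, and a few hand-set extra distances break the ties discussed above.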


import sys  
import libhfst  
# the weighted FST of letter distances built from the IPA features  
algfile = libhfst.HfstInputStream("chardist.fst")  
align = algfile.read()  
for line in sys.stdin:  
    (f1, f2) = line.strip().split(sep=":")          # MSF:OLF example pair  
    # convert both words into FSTs and insert the zero symbol freely  
    w1 = libhfst.fst(f1).insert_freely(("Ø", "Ø")).minimize()  
    w2 = libhfst.fst(f2).insert_freely(("Ø", "Ø")).minimize()  
    w1.compose(align).compose(w2)                   # w1/Ø .o. align .o. w2/Ø  
    res = w1.n_best(1).minimize()                   # keep only the lightest alignment  
    paths = res.extract_paths(output='text')  
    print(paths.strip())


Figure 6: Python script for aligning words letter by letter.


A Python script, shown in Figure 6, was written to implement the actual alignment. The script uses the distance WFST that was created as discussed above. It reads one example word pair (w1, w2) at a time, converts the MSF word w1 into an FST and inserts zero symbols into it freely. The same is done for the OLF word w2. Then w1 is composed with the alignment WFST align and with w2:

        w1/Ø .o. align .o. w2/Ø

Out of the many possible string pairs that the resulting WFST represents, only the one with the smallest weight is taken and printed. When testing the alignment procedure, one can assess the relative success of each aligned pair of words: each pair gets a score which is the sum of all its character pair correspondence weights. Very high total weights indicate untypical character pairs, which may sometimes reveal an error in the example word pair.

All finite-state operations needed for the script were available in the HFST Python interface. This particular operation appears to be clumsy to perform using the standalone programs or XFST or Foma.7

6 Writing the two-level rules

For a more detailed description of two-level rules, see Beesley and Karttunen (2003) and Karttunen et al. (1987); for a method of finding contexts for rules, see Koskenniemi (2013b). The rules written in this project have a common alphabet which consists of the letter pairs shown above in Figure 5. We have to write a two-level rule for each non-identical pair (unless there is just one alternative, or we let all alternatives be allowed anywhere). The rules may be written in any order one finds convenient. Let us start with the pair e:a. Gnu Emacs was used for editing the test examples, the rules and all other files, so the Emacs command occur was available and was used for extracting the relevant information out of the examples in letter pair format, as in Figure 7.


     3 matches for "e:a" in buffer: new-old-pairs.text  
         71: k:c u u l i t t e:a  
        141: t u l e t t e:a  
        142: t u l i m m e:a


Figure 7: Occurrences of e:a in the examples.


These OLF word forms sound like dialectal forms that can still be heard today. It was deduced that the correspondence e:a is restricted to two plural personal endings of verbs. Other MSF word forms ending in e do not have OLF counterparts with a instead, and letters e inside MSF words are likewise unaffected by this alternation. The rule has no access to grammatical features; it relies on patterns consisting of letters. Thus, the rule in Figure 8 was written.


    "e:a" e:a => [t t | m m] _ .#. ;  
    !                    k:c u u l i t t e:a  
    !                    t u l i m m e:a


Figure 8: Two-level rule which restricts the positions where MSF e may correspond to a in OLF.


By convention, the examples on which a rule was designed were always included as comments to the rule. According to the conventions of two-level rules, see Karttunen et al. (1987), this rule says that the pair e:a may occur only if it is preceded by tt or mm and is at the end of a word. Only the context restriction (=>) is used, not the double arrow,8 because there are some words whose stems end similarly, e.g. lumme or amme, where the final vowel does not change. Even the best and most obvious rules are bound to be ambiguous as long as one only has the surface representations available, without any morphological or grammatical knowledge.

One can test the first rule right away after it has been written, as will be explained in the next section. Experienced two-level grammar writers often design a few rules before they test them. So let us study another letter pair, s:Ø, before we proceed to testing, see Figure 9.


16 matches for "s:Ø" in buffer: new-old-pairs.text
14: e d e s s:Ø
25: h a a:Ø h d e s s:Ø a:Ø
26: h a a k:x s:Ø i
31: h e n g e s s:Ø ä:Ø
37: h y v:w ä k:x s:Ø i
52: k:c a n s s:Ø a n s a:Ø
60: k:c o r v:w e s s:Ø a:Ø
80: m u r h e e:Ø l l i s e k:x s:Ø i
83: n i i:j s s:Ø
119: s e a s s:Ø a:Ø
126: s y d ä m e s s:Ø ä n s ä:Ø
127: s y n a g o g a s s:Ø a:Ø
129: t a p p a a:Ø k:x s:Ø e n s a:Ø
150: u n e s s:Ø a:Ø
160: v:w a p a a:Ø k:x s:Ø i
172: y k:x s:Ø i n ä n s ä:Ø

Figure 9: Occurrences of s: in the examples.


It is easy to see two patterns here. A double ss in MSF is reduced to a single s in OLF, and ks in MSF words is represented as x in OLF. Thus, we need a rule with two context parts, as in Figure 10.


    "s:" s: => s _ ;  
    !                    e d e s s:   
    !                    s e a s s: a:  
                 :x _ ;  
    !                    h a a k:x s: i


Figure 10: Two-level rule for restricting the deletion of s.


Each rule is then compiled into an FST using the two-level compiler hfst-twolc. All rules together form the two-level grammar, which is compiled into a sequence of such rule transducers. If one has forgotten or mixed up some punctuation when writing the grammar, there will be error messages with a pointer to the probable location and cause of the error. The grammar writer is expected to correct the error and recompile.

7 Validating the rules against examples

There is a facility for testing two-level grammars: a special program, hfst-pair-test, checks whether the grammar accepts all examples given as sequences of letter pairs. The same Makefile which compiles the rules does this check right away. The program reports any inconsistencies, e.g. character pairs occurring in other contexts than those allowed by the rules, or misaligned words resulting in character pairs not allowed by the grammar.

Two familiar concepts from information retrieval are used here with a specific interpretation. Recall means here the proportion of OLF words that get the correct MSF word among the results of the analysis (no matter how many wrong alternatives are produced). Precision means here the proportion of correct MSF results among all proposed results for a set of OLF words, e.g. all word tokens in the corpus. Recall and precision can equally well be used for the inverse relation, i.e. from the modern words to the old words.
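
As a concrete reading of these definitions, the sketch below computes both figures from hypothetical data structures: a dictionary of the MSF candidates proposed for each OLF word and a gold standard of correct forms.

    # A sketch of recall and precision as defined here; the input
    # dictionaries are hypothetical illustrations.
    def recall_and_precision(proposed, gold):
        """proposed: OLF word -> set of MSF candidates produced by the rules
           gold:     OLF word -> set of correct MSF forms"""
        hits = sum(1 for w in gold if proposed.get(w, set()) & gold[w])
        recall = hits / len(gold)
        all_results = sum(len(c) for c in proposed.values())
        correct = sum(len(proposed.get(w, set()) & gold[w]) for w in gold)
        precision = correct / all_results if all_results else 0.0
        return recall, precision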

One ought to remember that the testing of pair string examples only detects problems where the rules are too restrictive. Initially, before we have any rules, all examples pass the check. Using just a few rules, one could retrieve all old forms for a modern form (as long as they only participate in alternations that were present in the examples), but such an initial grammar has very poor precision: a modern word corresponds to very many (possibly infinitely many) old words and vice versa. As we write more rules into our two-level grammar, the recall can only degrade, but every new rule improves the precision.

If one finds new types of regularities during the process, one ought to add new word pairs to the examples. New letter pairs can then be introduced in the examples, aligned and tested.

8 Standalone testing of the grammar

When one has rules for all letter pairs, the two-level grammar can be tested in a new manner: one can now generate tentative OLF forms from the MSF ones, and one gets several results for each modern word. Using unweighted rules, all results of such generation are equal. There is one trivial weighting that can be used here to prioritise more likely result words: the statistics we have from the example words, as in Figure 5. A short Python script is used for computing a WFST out of the frequencies. Intersecting this weighted pair transducer with the intersection of the rule FSTs gives us a new, weighted rule WFST. This one can be safely tested by feeding MSF words to it and selecting at most N, say 20, best results. If the correct one is among the top results, the rules seem to do the right thing. See the transcript in Figure 11, which shows what the grammar generates out of a few modern words.


$ hfst-strings2fst | hfst-compose -2 intro.fst | \  
  hfst-compose -2 new2old-one-w.fst |\  
  hfst-compose -2 delete.fst | hfst-project -p output | \  
  hfst-fst2strings -w -N 20  
 
>>sija  
sija         1.86035  
sia          2.12402 +  
>>sokeat  
sokeat       3.84277  
sokiat       8.85742 +  
>>ruoskitte  
ruoskitte    4.18848 +  
ruoskitt     6.3291  
ruoskitta    9.20312 +  
ruoskite    10.8613  
ruoskit     13.002


Figure 11: Testing how the plain rules generate tentative OLF word forms out of MSF word forms. The MSF word form as input is marked with >> and the correct results are marked with a plus sign (+).
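
The frequency-to-weight script mentioned above could look roughly like the sketch below. The file name, the -log(p) weighting and the detour through AT&T text format (which hfst-txt2fst can compile into a binary WFST) are assumptions about one possible implementation, not the actual script of the study.

    # Sketch: turn the letter pair counts of Figure 5 into a one-state WFST
    # in AT&T text format; each pair becomes a looping transition weighted
    # by -log(p).  File names and the weighting scheme are illustrative.
    import math

    counts = {("a", "a"): 158, ("k", "c"): 53, ("s", "Ø"): 16, ("e", "a"): 3}  # fragment of Figure 5
    total = sum(counts.values())

    with open("pair-weights.att", "w", encoding="utf-8") as out:
        for (msf, olf), n in counts.items():
            out.write(f"0\t0\t{msf}\t{olf}\t{-math.log(n / total):.4f}\n")
        out.write("0\t0.0\n")     # state 0 is also the final state, with weight 0

Combining the resulting pair-weight WFST with the intersection of the rule transducers would then yield a weighted mapping such as the new2old-one-w.fst used in the transcript above.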


The weighted rule transducer can be inverted and thereafter tested in the same way. In the present project, the mapping from OLF to MSF words is expected to be more ambiguous than the opposite direction, so the weighting is useful in checking the production of candidate modern forms. Figure 12 shows the first 20 of the 32 results generated from the old word isäm (‘our father’).


$ hfst-strings2fst | hfst-compose -2 intro.fst | \  
 hfst-compose -2 old2new-one-w.fst | \  
 hfst-compose -2 delete.fst | hfst-project -p output | \  
 hfst-fst2strings -w -N 20  
 
>>ism  
ism       1.23633  
ism      2.94141  
isme      3.82129  
issm      4.31348  
iism      4.84863  
isme     5.52637  
issm     6.01855  
iism     6.55371  
issme     6.89844  
ismme     7.40625 +  
iisme     7.43359  
iissm     7.92578  
issme    8.60352  
ismme    9.11133 +  
iisme    9.13867  
iissm    9.63086  
issmme   10.4834  
iissme   10.5107  
iismme   11.0186  
issmme  12.1885


Figure 12: A test showing the first 20 results that the inverted rules generate out of the OLF word ‘isäm’. The correct results are marked with a plus sign (+).


For some other OLF words, there will be many more results; e.g. for cullainen (‘golden’), more than 300 results were produced. Even as such, the mapping might be useful in indexing or searching a corpus. One may easily produce a transducer oldwords which accepts exactly the word forms in the corpus. Composing the mapping new2old used in Figure 11 with oldwords could be quite useful: one could build a search facility which uses modern word forms as search keys, expands them according to new2old, and does the actual search using the existing OLF words the mapping gives.
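
A sketch of such a search expansion, reusing the libhfst calls of Figure 6, is given below. The file names are assumptions, and new2old.fst is assumed to be the full MSF-to-OLF mapping with the zero handling of Figure 11 already composed in.

    # Sketch: expand a modern search key into the OLF forms that actually
    # occur in the corpus.  The file names are illustrative assumptions.
    import libhfst

    new2old = libhfst.HfstInputStream("new2old.fst").read()     # MSF -> OLF mapping
    oldwords = libhfst.HfstInputStream("oldwords.fst").read()   # acceptor of the corpus word forms

    def olf_search_keys(msf_word):
        key = libhfst.fst(msf_word)
        key.compose(new2old)
        key.compose(oldwords)      # keep only OLF forms attested in the corpus
        return key.minimize().extract_paths(output='text')

    print(olf_search_keys("kaupunkiin"))   # e.g. caupungihin, caupungijn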

It would be impractical to use the above method in existing concordance programs, as it would require including all the alternatives, even the nonsense modern “word forms”, in the index. However, nothing would prevent using new2old in a front-end processor for traditional concordance programs such as Korp, described e.g. in Borin et al. (2012).

9 Combining the grammar with OMORFI

As we noticed above, the rules are quite ambiguous when generating tentative modern word forms out of an OLF word form. We have a lot of candidates, among which the correct one is hidden. Most of the noise words are non-words in MSF. Thus, it is a natural idea to filter the noisy output of the rules using a spell-checker for MSF.

OMORFI is a finite-state morphological analyser which is open source and freely available, cf. Pirinen (2015). It is built using the same HFST tools, so it was easy to combine with the other transducers used in this study; for further information on the HFST morphological tools, see e.g. Lindén et al. (2011). OMORFI is distributed both as source code and as binary FSTs.9 The source form consists of more than 300 files and appears fairly complicated, and more than a dozen Makefiles are needed for building the FST that recognises Finnish word forms. Therefore, it was easier to use the binary transducer which comes with the package, even though there would have been an obvious need to modify the lexicon and the rules to better suit the needs of this project.

The transducer finnish-analyze.fst takes a Finnish word form as its input and outputs its analyses as a combination of a base form and the morphosyntactic features characterising the grammatical form, e.g. as in Figure 13.


        $ hfst-strings2fst |\  
            hfst-compose -2 finnish-analysis.fst |\  
            hfst-fst2strings  
        >>kuutamoilta  
        kuutamoilta:kuutamo N Abl Pl  
        kuutamoilta:kuutamo#ilta N Nom Sg


Figure 13: Morphological analysis using plain OMORFI. Two outputs are generated from the input ”kuutamoilta” which is either ”from moonlights” or ”moonlight” + ”evening”. Note the word boundary in the second result.


The morphosyntactic features are not needed for the filtering of the noise words from the set of candidates that the rule transducer generates. Only the input side of the transducer is needed for the selection of acceptable word forms of MSF. One can simply drop the output part and keep the input side of the analysis FST.10

The mapping all the way from OLF word forms into valid MSF word forms is the composition of four transducers in sequence, see Figure 14. One may run these as a run-time pipeline using separate HFST programs or one may compose them in advance for efficiency.


OLF word form
    -> intro.fst ->
OLF word form with zeros added in all possible ways
    -> old2new.fst ->
candidates for MSF word forms with zeros
    -> delete.fst ->
candidates cleaned of all zeros
    -> finnish-analyze-surf.fst ->
valid MSF word form candidates

Figure 14: Producing MSF word form candidates out of OLF word forms.
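
A sketch of the precomposed variant, in the style of Figure 6, is given below; the file names are those of Figure 14, and composing everything in advance is one of the two options mentioned above.

    # Sketch: compose the four steps of Figure 14 into a single transducer
    # mapping OLF word forms to valid MSF word form candidates.
    import libhfst

    def load(name):
        return libhfst.HfstInputStream(name).read()

    olf2msf = load("intro.fst")                          # add zeros in all possible ways
    olf2msf.compose(load("old2new.fst"))                 # two-level rules, OLF with zeros -> MSF with zeros
    olf2msf.compose(load("delete.fst"))                  # remove the zero symbols again
    olf2msf.compose(load("finnish-analyze-surf.fst"))    # keep only valid MSF word forms
    olf2msf.minimize()

    word = libhfst.fst("cuolluitten")
    word.compose(olf2msf)
    print(word.extract_paths(output='text'))             # candidates as in Figure 15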


The combination of the steps in Figure 14 does roughly what was expected. If we feed the OLF words of Figure 2 into it, each old word is expanded into several possible MSF word candidates, and the analyser then filters away all but those candidates that it considers acceptable MSF word forms, as seen in Figure 15.


caupungihin [kaupunkihiin, kaupunkiin]  
caupungijn [kaupunkiin, kaupunkiini]  
corwes [korvessa]  
cuckoi [kukko]  
cuolemaan [kuolemaan, kuolemaani]  
cuoleman [kuolemaan, kuolemaani, kuoleman, kuolemana,  
          kuolemani, kuolleemman]  
cuolluitten [kuolleitten, kuolleitteni]  
cuulcan [kuulkoon]  
cuulitta [kuulitta, kuulitte]  
kärsimän [kärsimään, kärsimääni, kärsimän, kärsimäni]


Figure 15: Analyses of some example words using the two-level grammar and filtering with OMORFI.


It can be seen that most of the modern word forms offered by the sequence are quite acceptable. In particular, all the correct interpretations that we wanted are present. In addition to the desired results, there are some artificial words. One of them is the very first result kaupunkihiin (‘to the city’), which looks odd: it turns out to be a compound of kaupunki (‘town’) and hiki (‘sweat’), which is a nonsense word. Another extra result is kuulitta (beside the intended kuulitte, ‘you heard’), an odd compound of kuu (‘moon’) and litta (a children’s play, e.g. with a ball).

The number of compound boundaries in a word form would be useful as a criterion for excluding less likely analyses. Unfortunately, when using OMORFI, this information is only available when one reduces the word forms all the way to their base forms. With some Python scripting, the knowledge about the number of compound boundaries can nevertheless be used at the right place. One first produces a list of all pairs where the first component is the OLF word form and the second component is one of the candidates that OMORFI accepts out of the many that the rules propose. The following pairs occur in that long list:

        aitais:aitaiisi  
        aitais:aitaisi  
        aitais:aitasi  
        aitais:aittaiisi  
        aitais:aittaisi  
        aitais:aittasi

The next step is to analyse once more the right-hand parts, which were already accepted by OMORFI, and we get a list containing entries like the following:

        aitaiisi:aita#iisi A Pos Nom Sg  
        aitaisi:aidata V Cond Act ConNeg  
        aitaisi:aidata V Cond Act Sg3  
        aitaisi:aita#isi N Nom Sg  
        aitasi:aidata V Pst Act Sg3  
        aitasi:aita N Gen Sg PxSg2  
        aitasi:aita N Nom Pl PxSg2  
        aitasi:aita N Nom Sg PxSg2  
        aittaiisi:aitta#iisi A Pos Nom Sg  
        aittaisi:aitta#isi N Nom Sg  
        aittaisi:aitta N Gen Pl PxSg2

From these pairs, we only use the number of word boundaries # in the analysis on the right. For each surface form we store the least number of boundaries among its base form analyses. The first one has only a compound analysis, so it gets the count 1. The next one, aitaisi (‘of your barn(s)’ or ‘of your fence(s)’), has three analyses, two without boundaries and one with one boundary, so it gets the count 0:

        aitaiisi 1  
        aitaisi 0  
        aitasi 0  
        aittaiisi 1  
        aittaisi 0  
        aittasi 0

Now one can return to processing the result pairs where the left part is the OLF word and the right part is a word form proposed by the rules and accepted by OMORFI. For each OLF word, we now have a list of candidate MSF words. We can fairly safely drop some of these candidates by using the compound boundary counts computed above: we throw away all candidates which have more compound boundaries than the candidate with the least number of them. Thus, we start from the following list of modern forms for the OLF word form aitais:

        aitais 1 [aitaiisi, aitaisi, aitasi, aittaiisi,  
                  aittaisi, aittasi]

According to the counts we computed, the first and the fourth have a boundary count of 1 and the rest have no boundaries. Thus, we drop the first and the fourth and get the final result, which now contains only acceptable words and no artificial constructions:

        aitais 1 [aitaisi, aitasi, aittaisi, aittasi]

This processing sounds complicated,11 but it is motivated by the fact that OMORFI produces a lot of extra analyses through its liberal compounding mechanism. Anyway, the Python script which does the trick is short, fast and straightforward.
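
A sketch of that filtering step is given below; it is plain Python over the two lists described above, and the file names are illustrative assumptions.

    # Sketch: drop candidate MSF forms whose OMORFI analyses all contain
    # more compound boundaries (#) than the best candidate for the same OLF word.
    from collections import defaultdict

    # 1) least number of boundaries per accepted surface form, read from
    #    the base form analyses (lines like "aitaisi:aita#isi N Nom Sg")
    min_bounds = {}
    for line in open("analyses.txt", encoding="utf-8"):
        if ":" not in line:
            continue
        surface, analysis = line.split(":", 1)
        n = analysis.split()[0].count("#")
        min_bounds[surface] = min(min_bounds.get(surface, n), n)

    # 2) group the candidates per OLF word (lines like "aitais:aitaisi")
    candidates = defaultdict(list)
    for line in open("olf-msf-pairs.txt", encoding="utf-8"):
        if ":" not in line:
            continue
        olf, msf = line.strip().split(":")
        candidates[olf].append(msf)

    # 3) keep only the candidates with the minimal boundary count
    for olf, forms in candidates.items():
        least = min(min_bounds.get(f, 0) for f in forms)
        print(olf, [f for f in forms if min_bounds.get(f, 0) == least])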

10 Reducing to the base forms

Normally, OMORFI reduces word forms to their base forms, and base forms would often be even better for searching and indexing than the word forms themselves. Thus, in parallel with the operations of the previous section, the candidate MSF word forms were filtered and reduced to their base forms. This list had the same kinds of problems with the liberal compounding of OMORFI as we saw in the previous section. The artificial compounds could be removed in the same way, in fact more easily, as the compound boundaries are directly present in the base forms. Before the filtering, the results for the OLF form alendamisest (‘from lowering’) looked like the following:

    alen#da#miss#eesti   ’sale’+’da’+’miss’+’Estonian’  
    alen#da#miss#este    ’sale’+’da’+’miss’+’obstacle’  
    alentaa              ’to lower’  
    alentaminen          ’lowering’  
    alentamis#eesti      ’lowering’+’Estonian’  
    alentamis#este       ’lowering’+’obstacle’  
    alen#tamminen        ’sale’+’made of oak’  
    alen#tammis#eesti    ’sale’+’made of oak’+’Estonian’  
    alen#tammis#este     ’sale’+’made of oak’+’obstacle’

Again, the filtering program considers this set of candidate MSF base forms. It finds two candidates with no compound boundaries and seven with one or more boundaries. The program throws away those seven and keeps the two, so the result for alendamisest becomes:

        alentaa  
        alentaminen
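
Since the boundaries are directly visible in the base forms, the selection reduces to a few lines, as in the sketch below (run here on a subset of the example above).

    # Sketch: keep only the base form candidates with the fewest compound boundaries.
    def keep_least_compounded(base_forms):
        fewest = min(b.count("#") for b in base_forms)
        return [b for b in base_forms if b.count("#") == fewest]

    print(keep_least_compounded(
        ["alen#da#miss#eesti", "alentaa", "alentaminen", "alentamis#eesti"]))
    # -> ['alentaa', 'alentaminen']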

11 Tuning the two-level rules

At this point the rules have been tested against the examples, and they have been used separately on some manually typed words in order to assess the precision of the rules, i.e. how many unwanted analyses they produce. We have tools for reducing OLF word forms to MSF word forms and also to MSF base forms. Now one can see what the rules and OMORFI together actually do to the mass of words in the Biblia 1642 corpus.

One can expect some rules to be too permissive, which shows up as too many candidate MSF words. On the other hand, some rules might have too narrow context conditions, which shows up as OLF words left without the desired candidate words. It is also possible that some regular phenomena were not present in the example words; then we have no applicable rules, and many OLF words remain without the desired candidate MSF words. In the first two cases, we must consider revising the two-level rules we have written. In the last case, we must select further example word pairs, write yet another rule and test it.

For checking what actually exists in the Biblia 1642 corpus, three files were used: the source text itself, an alphabetical list of the distinct OLF word forms in the corpus, and a list of reversed OLF word forms (sorted starting from the last character). Using the Gnu/Linux less and egrep commands, one gets quick answers to questions such as “Are there many other words similar to this one?” or “Is this really a form of the word I think it is?”.

The tuning consumed more time than the writing of the initial two-level grammar. It was also more demanding, because one must check that changes in the rules do not have negative effects, such as dropping some desired candidate words which were previously correctly generated. For this purpose, the changes in the rules were always checked by producing a separate new list and comparing it against the previous full list of analyses12. If the differences were all for the better, the new rules were accepted, and the new lists were taken as the benchmark for the following changes. Some of the new or lost analyses required checking against the corpus or the lists of old words mentioned above.
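
The comparison against the previous benchmark can be as simple as a set difference over the two full lists, as in the sketch below; the file names are illustrative.

    # Sketch: compare the newly produced full list of analyses against the
    # previous benchmark list.
    benchmark = set(open("analyses-benchmark.txt", encoding="utf-8"))
    current = set(open("analyses-new.txt", encoding="utf-8"))

    print("analyses lost:")
    print("".join(sorted(benchmark - current)))
    print("analyses gained:")
    print("".join(sorted(current - benchmark)))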

The tuning required partial knowledge of the language of the corpus and was done by Kimmo Koskenniemi. An overall sense of present-day Finnish and some familiarity with Finnish dialects seemed to be sufficient for finding generalisations and adequate context characterisations. Just one OLF word form (käätyxi, ‘that has been turned’) could not be interpreted by looking at its Biblia 1642 occurrences; it had to be looked up in a more recent Bible translation.

All changes to the rules were automatically checked against the collection of hand-selected example word pairs. Any discrepancies were immediately detected, and the rule violating some word pair was pointed out. After correcting the failing rule, the rules were recompiled and retested, and the full lists were recomputed. A handful of new example words were included in this process, and the original and the new examples were used for testing the rules at every subsequent cycle.

There appears to be no clear limit to how long one can tune a grammar; after a certain level is reached, the return of each cycle diminishes. Many of the remaining shortcomings could be solved better with a different morphological analyser for Finnish. In particular, one would like to modify the compounding mechanism, make the derivational capacity more productive, and use a morphophonemic representation of MSF as the basis for the rules. Then one would have access to many relevant conditions for determining the forms of OLF.13 Such a re-implementation of Finnish morphological analysis would also be motivated when applying two-level methods to the historical linguistics of Finno-Ugric languages, see Koskenniemi (2013a).

A couple of cases occurred where a new letter pair and an entirely new rule had to be established. That posed no major problem as long as the main principles of letter alignment and correspondences remained unchanged. With a few new example word pairs, there were no particular problems.

A common question that arose was whether the rejection of an MSF word form was an error or a feature of OMORFI. The analyser is committed to obeying the guidelines for word inflection described in Kielitoimiston sanakirja14 (2006), which is also available as a net service.15 In most cases, a fifty-year-older norm of MSF would have better suited the needs of this project.

12 Evaluation of the mapping

The rules were developed using a set of example words, so the success and the shortcomings of the mappings cannot be estimated by testing with the same words. One can assess how the mapping covers the vocabulary of the corpus by taking a sample of the list of all distinct word forms in the corpus, i.e. some 26,500 words. This list consists mostly of infrequent words: half of them are hapax legomena, i.e. words occurring only once, and fewer than 5,000 words occur more than five times in the corpus. Two 100-word samples were selected, one out of the full list of distinct word forms and another from the list of word forms occurring at least six times in the corpus. Both samples were made from the respective total list by first skipping some entries and then proceeding at even intervals (the length of the list divided by 100). A third sample was made from the running text.
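
The systematic sampling can be sketched in a few lines; the file name and the offset are illustrative (Appendix 1 used the 42nd word of each part).

    # Sketch: a systematic 100-item sample from a word list.
    words = [w.strip() for w in open("wordlist.txt", encoding="utf-8")]
    step = len(words) // 100
    offset = 41                   # the 42nd word of each part, counted from 0
    sample = [words[i * step + offset] for i in range(100)]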

12.1 Proper nouns and abbreviations

The Biblia 1642 corpus contains plenty of proper names, biblical and other. Names of persons and places typically occur only a few times and only within a short passage of text. There are two kinds of problems concerning them. Dictionaries lack most of them, so the filtering could not work properly. Moreover, most proper names are unlike normal Finnish words, and the orthography used in writing them differs from that of normal OLF words: proper names are often written as in Swedish or German and not adjusted to Finnish.

By mistake, some material, such as references to other parts of the Bible remained in the corpus although the intention was to exclude them all. This happened probably because such markings had more variable forms than was expected. The abbreviations so included, are not valid OLF words and not a target of this study.

Thus, the proper nouns and abbreviations do occur in the samples, but could be ignored in the results. Proper nouns, and many other words were written with capital letters in Biblia 1642. In addition, capital letters are found in the corpus in unusual places, e.g. both as the first and the second letters. Precise normalisation of the corpus was not a goal of this project, so nothing was done beyond forcing all text to lower case.

12.2 Words occurring more than five times

The result of testing a sample of 100 word forms out of the list of word forms occurring at least six times in the corpus is in Appendix 1. The following is a summary of the results with this sample:

12.3 All word forms in the corpus

The full list of the other 100 word sample which was taken out of the total list of all word forms occurring in the corpus is in Appendix 2. Following is a summary of the results:

12.4 Sample of word tokens from the running text

The two tests above estimated how well the method covers the vocabulary. Another aspect is how well the method covers the text, i.e. how large a portion of the word tokens in running text get a proper base form by which the passage could be retrieved. For this purpose, a sample of 100 words was made, starting with a small offset and stepping through the text at equal intervals. A summary of the results of this test:

One may speculate that frequent words are more common in samples of running text than in samples from lists of distinct word forms. Therefore, it would be expected that the rules and OMORFI perform better with such samples.

13 Conclusion

On the whole, the authors consider the precision and recall of the combination of the two-level rules and OMORFI successful, somewhat better than was expected. Spending more time with the rules and tuning the context conditions would not have a significant effect on the performance. By making the conditions looser, one may improve the recall at the expense of precision. With some manually compiled lists and paying attention to the capital letters, one could handle the proper names much better.

The most promising line of development would be to build a different type of morphological analyser. An OLF form like jalgat (‘feet’) corresponds to the MSF surface form jalat. Adding a potential g in all possible places seems a bad idea. The morphophonemic representation of the modern form could be something like j a l k a + t, and relating g to the morphophoneme k describes the phenomenon more logically.

Acknowledgements

We are grateful to the Institute for the Languages of Finland for the free availability and use of the Old Literary Finnish materials, which was necessary for this project. We are also grateful to the FIN-CLARIN project, which has implemented the HFST software and offers technical support for it.

The division of labour between the authors was as follows: Pirkko Kuutti selected the material for the corpus, prepared the set of examples and checked the judgements in the final results; Kimmo Koskenniemi did all other tasks.

References

    Beesley, K. R. and Karttunen, L. (2003). Two-Level Rule Compiler. Xerox PARC. https://web.stanford.edu/~laurik/.book2software/.

    Biblia (1642). Biblia, Se on: Coco Pyhä Ramattu, Suomexi. Pääramattuin, Hebrean ja Grecan jälken: Esipuhetten, Marginaliain, Concordantiain, Selitsten ja Registerein cansa. Pipping 42, Henrik Keyser, Stockholmis.

    Borin, L., Forsberg, M., and Roxendal, J. (2012). Korp - the corpus infrastructure of Språkbanken. In Proceedings of LREC 2012, pages 474-478. ELRA, Istanbul.

    Grönros, E.-R. and Kotimaisten kielten tutkimuskeskus (2006). Kielitoimiston sanakirja. Kotimaisten kielten tutkimuskeskuksen julkaisuja. Kotimaisten kielten tutkimuskeskus.

    Karttunen, L., Koskenniemi, K., and Kaplan, R. M. (1987). A compiler for two-level phonological rules. In Dalrymple, M., Kaplan, R., Karttunen, L., Koskenniemi, K., Shaio, S., and Wescoat, M., editors, Tools for Morphological Analysis, volume 87-108 of CSLI Reports, pages 1–61. Center for the Study of Language and Information, Stanford University, Palo Alto, California, USA.

    Koskenniemi, K. (2013a). Finite-state relations between two historically closely related languages. In Proceedings of the Workshop on Computational Historical Linguistics at NODALIDA 2013, May 22-24, 2013, Oslo, Norway, number 87 in NEALT Proceedings Series 18, pages 53–53. Linköping University Electronic Press, Linköpings universitet.

    Koskenniemi, K. (2013b). An informal discovery procedure for two-level rules. Journal of Language Modelling, 1(1):155–188.

    Koskenniemi, K. (2017). Aligning phonemes using finite-state methods. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 56–64, Gothenburg, Sweden. Association for Computational Linguistics.

    Lindén, K., Axelson, E., Hardwick, S., Pirinen, T. A., and Silfverberg, M. (2011). HFST – framework for compiling and applying morphologies. In Mahlow, C. and Piotrowski, M., editors, Systems and Frameworks for Computational Morphology 2011 (SFCM-2011), volume 100 of Communications in Computer and Information Science, pages 67–85. Springer-Verlag.

    Pirinen, T. A. (2015). Omorfi – free and open source morphological lexical database for Finnish. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), pages 313–315, Vilnius, Lithuania. Linköping University Electronic Press, Sweden.

    Sadeniemi, M., editor (1951-1961). Nykysuomen sanakirja, volume 1–6. WSOY.

Appendix 1: Sample out of words occurring at least 6 times

This sample was made out of the word forms occurring at least six times in the corpus. The list was divided into 100 parts of equal length, and the 42nd word from each part was selected. This sample was not used for writing or tuning the rules, and the rules were not changed after this sample was processed and this article was written.

The analysis of the sample was checked manually by looking up the passages in the corpus to judge what the original word form stood for. The decisions were inserted in the list using the following markings: names or abbreviations are marked with a preceding (); they were left outside the present study because OMORFI would not cover them in any case. OLF words which were left without any correct analysis are marked with a preceding (@) sign. After an equal sign (=), a desired result is given, i.e. a result that one would wish the analysis to produce. MSF word forms which were attested to be correct are marked with a plus sign (+). Results which were considered wrong are marked with an asterisk (*). The remaining unmarked MSF word forms in the results are formally possible but not attested in the Biblia corpus.

ajasta 22 +ajasta, ajastaa  
 amoxen 6 =Amoksen  
apostoleille 8 +apostoleille  
asti 178 +asti  
autuuden 21 +autuuden, autuuteen, autuuteeni, +autuuteni  
cadzos 20 +katsos  
callis 7 kalliisi, kalliissa, +kallis  
 cap 150 =abbreviation (not part of the text)  
catumattomudens 9 kaatumattomuuteensa, kaatumattomuutensa,  
                  katumattomuuteensa, +katumattomuutensa  
 cesareaan 11 =Kesareaan  
colmas 26 +kolmas  
costaman 10 koostamaan, koostamaani, koostaman, koostamani,  
   +kostamaan, kostamaani, +kostaman, kostamani  
cuitengin 328 +kuitenkin, kuittenkin  
cunnias 9 kunniaasi, +kunniasi, +kunniassa  
cuulemma 7 kuulemma, +kuulemme, kuulleemme, kuullemme  
duomidzeman 20 +tuomitsemaan, tuomitsemaani, +tuomitseman,  
               tuomitsemani  
egyptilisten 10 +egyptilisten, egyptilisteni  
engelille 8 +enkelille  
epjumalain 22 +epjumalain, epjumalaini  
ett 2808 +ett  
 gedalia 7 =Gedalia  
harwat 15 +harvat  
heitn 9 +heitn, =heidt, heittne, *heittni, *heittn,  
   *heitni, *heitn  
hetke 14 +hetke  
huones 35 +huoneesi, +huoneessa  
hyw 160 +hyv, +hyv, =hyvt  
hwitetn 14 hvitettne, hvitettni, hvitettn,  
   +hvitetn, hvitteettni, hvitteettn  
ihmisest 12 +ihmisest  
 ioh 21 =abbreviation (not part of text)  
itkemn 17 +itkemn, itkemni, +itkemn, itkemni  
 jerusalemist 58 =Jerusalemista  
jolle 14 +jolle  
judalaisista 12 +juutalaisista  
jumalinen 9 +jumalinen, jumalineen, jumalineni  
kedolla 53 +kedolla, ketolla  
kircasta 7 kirkasta, +kirkastaa  
kyln 7 +kyln, kylni, kyylln, kyyln, kyylni  
ksiwartens 14 ksivarteensa, +ksivartensa, ksivarttensa  
@ ktyxi 21 =ktyksi (obsolete inflection pro ’knnetyksi)  
laskeman 7 +laskeman, +laskemaan, laskemaani, laskemani  
lewitat 11 +leviitat  
luetan 14 luetan, +luetaan, luettane  
lydyxi 6 +lydyksi  
@ lytty 13 =lytty (obsolete inflection pro ’lydetty’)  
mailma 57 +maailma, +maailmaa  
menewt 44 +menevt  
miehens 11 mieheens, +miehens  
@ mixette 6 =miksette (OK in MSF but OMORFI rejects)  
muu 30 +muu  
neljkymmend 16 +neljkymment, neljkymment  
nimitt 12 nimitt, +nimitt  
nhd 88 +nhd  
oikein 87 +oikein, oikeine, oikeini  
oma 34 +oma, +omaa  
opetuslastens 40 +opetuslastensa  
oxa 11 +oksa, +oksaa  
paimen 19 +paimen  
palwelusta 22 palvellusta, +palvelusta  
parembi 11 +parempi  
perkelest 6 +perkeleest  
pidetn 14 pidettne, pidettni, pidettn, +pidetn,  
           piteettni, piteettn  
pohjaisest 8 +pohjaisesta, pohjaisesti  
prophetalle 26 +profeetalle  
puolelle 19 +puolelle  
piwi 9 +pivi  
pt 7 +pt, pte  
rangaisewa 18 +rankaiseva, rankaisevaa  
ristinnaulidzit 6 ristiinnaulitsit, ristiinnaulitsitte,  
                  +ristiinnaulitsivat  
ruumis 10 +ruumis, +ruumiisi, ruumiissa, ruumissa  
@ saatit 14 =saatit (old inflection pro ’saattoivat’), *saatit  
sanani 45 +sanani, +sanaani  
 saul 8 =Saul  
seuracunda 38 +seurakunta, +seurakuntaa  
@ sijpein 10 =siipein (old inflection pro ’siipien’)  
sisld 11 +sislt, sislt  
sucucunda 25 +sukukunta, +sukukuntaa, *suukukunta, *suukukuntaa  
suus 9 +suusi, +suussa  
synnytt 13 synnytt, +synnytt  
tahto 243 tahto, +tahtoa, +tahtoo  
tapahtunut 65 +tapahtunut  
tehkt 47 +tehk  
tie 15 +tie  
toiseens 6 +toiseensa  
tulella 35 +tulella, tulleella, tuulella, tuulleella  
turmelit 6 +turmelit, turmelitte, +turmelivat  
tyttres 10 tyttreesi, +tyttresi, tyttress  
tytti 11 +tytti  
uscollinen 10 +uskollinen  
waelsi 26 +vaelsi  
waldacundain 15 +valtakuntain, valtakuntaini  
wanhast 8 +vanhasta, vanhasti  
@ wartioidzit 8 =vartioitsivat (old inflection pro ’vartioivat’)  
wertauxen 34 vertaukseen, vertaukseeni, +vertauksen, vertaukseni  
wihollisen 15 viholliseen, viholliseeni, +vihollisen, +viholliseni  
woi 223 +voi  
wuorten 22 +vuorten, vuorteni  
 xi 13 =XI (not part of the text)  
yljn 13 +yljn  
ystw 6 +ystv, ystv  
nell 27 +nell

Appendix 2: Sample out of all OLF word forms

This sample was taken out of the full list of all distinct OLF word forms of the corpus, i.e. each word form appears just once in the list no matter how many times it occurs in the corpus. The sample starts with the 86th word and proceeds with steps of equal length through the alphabetical list. The markings for proper names or abbreviations (), words with no analysis (@), manually added interpretations (=), correct analyses (+) and completely irrelevant candidates for MSF words (*) follow the same principles as in the sample of Appendix 1.

 ahabin 3 =Ahabin  
@ alaidzen 1 =alitse  
andimexi 1 +antimeksi  
arpoja 1 +arpoja  
asuwat 59 +asuvat  
 bath 3 =Bath  
cahleis 5 +kahleissa  
@ cananeri 1 =kanaaneri=Kaanaan asukas, *kanan-erie  
carsi 2 kaarsi, +karsi, karsii  
@ cauhiuttan 1 =kauheuttaan  
@ cherubim 5 =kerubim=kerubi  
colminaisuden 2 +kolminaisuuden, kolminaisuuteen,  
    kolminaisuuteeni, kolminaisuutena, kolminaisuuteni  
cotcatkin 1 kotkaatkin, +kotkatkin  
cullastans 1 +kullastansa, kultastansa  
cuolettaman 1 +kuolettamaan, kuolettamaani, kuolettaman,  
    kuolettamana, kuolettamani  
cuurnidzet 2 +kuurnitset, kuurnitsette  
edestm 8 +edestmme  
elwin 1 +elvin, elvini  
epjumalista 2 +epjumalista  
@ etts 100 =etts (OMORFI)  
 gileadis 5 =Gileadissa  
halkeisit 1 halkeisit, +halkeisivat, halkeisitte  
hedelmlisest 2 +hedelmllisesti, hedelmllisest  
@ herj 1 =herj=her, herj, herj  
hopiaxi 1 +hopeaksi  
hurscana 1 +hurskaana  
hpis 13 +hpesi, +hpess, +hpesi  
ihmisild 17 +ihmisilt  
 ismaelille 1 =Ismaelille  
jalca 3 +jalka, +jalkaa  
johdatan 3 +johdatan, johdattane  
@ julgista 1 =julkistaa  
jutteli 7 +jutteli  
kelwatcon 1 +kelvatkoon  
kijtoswirren 4 +kiitos-virren  
kitans 1 kitaansa, +kitansa  
 kyrenist 1 =Kyrenest  
ktens 1 +ktens, +kteens, kttens  
lainaxi 1 +lainaksi  
laulun 3 +laulun, lauluna, lauluni, lauluun, lauluuni  
lewollisest 1 +levollisesta, levollisesti  
lohdutuxellans 1 +lohdutuksellansa  
luotat 5 +luotat, luotaat, luotaatte, luotatte  
lhikylins 1 +lhikylins  
 magnus 3 =Magnus=Suuri, *maa-gnuusi  
@ medzficunapuulle 1 =mets-viikuna-puulle  
miehest 4 +miehest  
 moph 1 =Moph  
muucalaisilda 1 +muukalaisilta  
@ nautitcat 1 =nautitkaa=nauttikaa  
nimi 1 +nimi, nime, nime  
nurisewat 1 +nurisevat, nurissevat  
@ ohrapion 1 =ohrapivon=ohrakourallisen, *ohrapioni  
@ onnettomudexen 1 =onnettomuudeksenne, onnettomuudekseen,  
    onnettomuudekseni  
ota 102 +ota, oitta, otaa, +ottaa  
pahenetta 1 +pahenette  
paljastawat 1 +paljastavat  
paransin 2 +paransin  
peljnnet 2 peljnnet, +peljnneet, peljnnette  
 pet 6 =Pet=Petrus=abbreviation, *peet  
 pilatus 58 =Pilatus, *pilattusi, *pilattuusi, *pilatussa  
@ poismenit 1 =poismenit=pois menit  
 publiuxella 1 =Publiuksella  
purpuraan 1 purpuraan, +purppuraan, purppuraani, purpuraani  
plimmist 1 +pllimmist  
racastawanans 1 +rakastavanansa  
rascautta 2 +raskautta, +raskauttaa  
riemuhuudon 1 +riemuhuudon, riemuhuutona, riemuhuutoni,  
    riemuhuutoon, riemuhuutooni  
rucouxens 3 rukoukseensa, +rukouksensa  
saamme 8 +saamme  
saitte 4 +saitte  
@ saphir 3 =safiiri  
selitetyt 1 +selitetyt  
siel 1 +siell  
@ sislmisin 1 =sislmisiin=sisimmisiin?  
sotawke 4 +sotavke  
suremmaxi 1 +suuremmaksi  
@ syndeins 30 =synteins/syntiens (OMORFI)  
syxemn 1 +syksemn, syksemni, +syksemn,  
    syksemni  
taitons 1 +taitonsa, taitoonsa, taittonsa, taittoonsa  
taudist 3 +taudista, tautiista, tautiist, tautista  
@ terwehdimm 2 =tervehdimme  
todistaja 4 +todistaja, todistajaa  
tottunet 3 +tottunet, tottuneet, tottunette  
tunnustin 2 +tunnustin  
tyhmin 1 +tyhmin, tyhmini  
tills 1 +tillsi, tiltsi  
uscalda 4 +uskaltaa  
wacudes 1 vakuudessa, vakuuteesi, +vakuutesi  
waiwan 12 +vaivaan, vaivaani, +vaivan, vaivana,  
    +vaivani  
wallidzewat 7 +vallitsevat  
warcaudella 1 +varkaudella  
weidzet 1 +veitset  
wialliset 2 +vialliset  
wihollisillans 1 +vihollisillansa  
wircaan 6 +virkaan, virkaani  
wuoria 7 +vuoria  
wrist 4 +vrist, vrist  
@ ylllist 1 =ylllist=ilket?  
yxi 334 yksi

Appendix 3: Two-level rules

Alphabet  
 a   a: b d e e:a e:i e:u e: e: e: f:p g h i i:j i:  
 j j:i j: k k:c k:g k:x l l: m m: n o o:a o: p p:b p:w  
 r s s:n s:z s: t t:d t:l t:n t:r t: u u: v v:f v:g v:w v:  
 y y:  :  : : :d :e :g :h :i :n :s :t ;  
 
Sets  
 Vowel = a e i o u y   ;  
 Cons = b c d f g h j k l m n p r s t v w x z ;  
 
Definitions  
 Suf1 = [n i: | n s a: | s i: | m m: e:] ;  
 Suf2 = (k i n | k a : n) ;  
 aSuff =  ((a:) (n | Suf1) | i n | i Suf1 | l [l e|t a] (Suf1) |  
           n | n a (Suf1) | s [s:|t] a: (Suf1) | t |  
           [t:|:t] a (Suf1) | k:x s: e Suf1 | k:x s: i) Suf2 .#. ;  
 oSuff =  [k:x s: i | l l [a|e (e: n)] (Suf1) | n [a|e] (Suf1) |  
           s [s:|t] a: (Suf1) | t | [t:|:t] a (Suf1) |  
           t t e n | t t e Suf1] Suf2 .#. ;  
Rules  
 
"a:" a: => a: _ ;  
!                    p a l a j a a:  
!                    r a a: m a t t u  
             :Cons e _ .#. ;  
!                    k:c a i k:c k e a:  
             :Cons o a: _ .#. ;  
!                    h o l h o a: a:  
             :Cons o _ [(a:) .#. | :i :s | j [a|i] | :m |  
                        :n .#. | t (t e:) .#. | :w | :* :x] ;  
!                    k i r o a: i s i t  
!                    p u t o a: v:w a t  
!                    p u t o a: m i s i l l a  
!                    v a i n o a: a:  
!                    v:w a i n o a: j a n i  
!                    v:w a i n o a: t t e:a  
!             [n | s (s:) | s t] _ .#. ;  
             [s (s:) | s t] _ .#. ;  
!                    s e a s s: a:  
!                    a i k:c a n a:  
!                    e v a n k:g e l i u m i s t a:  
             i v: _ t .#. ;  
!                    a n t:n o i v: a: t  
!                    s o t:d i:e i v: a: t  
 
"e:" e: => e _ ;  
!                    i h m e e: t  
             _ .#. ;  
!                    i s  m m: e:  
             _ i t [t e n | a | ] .#. ;  
!                    a p o s t o l e: i t t e n  
             i s _ n [a | ] .#. ;  
!                    t o i s e: n a  
             a t _ r i [a | o] ;  
!                    a t e: r i o i t:d s:z i  
 
"e:a" e:a => [t t | m m] _ .#. ;  
!                    k:c u u l i t t e:a  
!                    t u l i m m e:a  
 
"e:u" e:u => l _ i t ;  
!                    k:c u o l l e:u i t t e n  
 
"e:" e: => n _ m ;  
!                    e n e: m p:b  :t   
 
"e:i" e:i => t a _ n .#. ;  
!                    o p e t t a e:i n  
             _ [a | ] ;  
!                    r u s k e:i a t  
 
"e:" e: => .#. y l _ n ;  
!                    y l e: n k:c a t:d s:z o  
 
"f:p" f:p => _ :h ;  
!                    p r o f:p :h e e: t t: a i n  
 
"i:j" i:j => i _ ;  
!                    n i i:j s s:   
 
"i:" i: => i _ ;  
!                    r u u m i i: n  
             o _ t ;  
!                    o s o i: t t i  
             [n | s | s t] _ .#. ;  
!                    n i m e s i:  
!                    k:c o l m a s t i:  
 
"i: .#." i: <= i _ .#. ;  
 
"j:i" j:i => .#. o r _ a ;  
!                    o r j:i a t  
 
"ij:i" j: => [k: | l | s: | t:] i _ [a aSuff | o i oSuff]  ;  
!                    k:c a m a r i p a l v:w e l i j: a :t a  
!                    k:c a u p i t:d s:z i j: a t  
!                    k:c u l k i j: o i t a  
!                    k:c u r k i s t e l i j: a t  
!                    h a k i j: a t  
!                    h a l t:d i j: a  
!                    h a l t:d i j: o i l l e  
!                    h a l l i t:d s:z i j: a  
!                    j u o k:x s: i j: a n  
!                    p a l v:w e l i j: a  
!                    p a l v:w e l i j: o i t a  
!                    r a n k:g a i s i j: a  
!                    v:w a a t i j: a n s a:  
!                    v:w a l e h t e l i j: a t  
!                    v:w a r t i j: a  
               s i _ [a oSuff | o oSuff | o i t ] ;  
!                    s i j: a  
               t e k i _  ;  
!                    t e k i j:   
 
"k:c" k:c => \:k _ [k | (:) [:a | :o | :u] | :h | l | r] ;  
!                    j a l k:c a i n s a:  
 
"kk:ck" k:c <= _ k ;  
 
"k:g" k:g => n _ ;  
!                    e n k:g   
             .#. t y _  ;  
!                    t y k:g  s i:  
 
"k:x" k:x <=> _ s: ;  
!                    h a a k:x s: i  
 
"ll:l" l: => l _ ;  
!                    e h t o o: l l: i s e n  
 
"mm:m" m: => m _ Vowel: ;  
!                    i s  m m: e:  
 
!"nn:n" n: => n _ e: .#. ;  
!                    k  s i  n n: e:  
!                    t e i t  n n: e:  
!                    a j a t u k:x s: i a n n: e:  
!                    k y m m e n e n n: e n  
 
"o:" o: => o: _ ;  
!                    e h t o o: n a  
 
"o:a" o:a => k: _ o: n .#. ;  
!                    k:c u u l k:c o:a o: n  
"~ oo:oa" o:o /<= o:a _ ;  
 
"p:b" p:b => m _ ;  
!                    s u u r e m p:b i  
!                    ?? s a p:b :b a t h:t i  
!                    ?? m u :u l p:b e: r i n  
!                    ?? t o p:b i a a: n  
 
"p:w" p:w => _ [u | y| ] :i (s i: (v: : t)) .#. ;  
!                    v:w i i:j p:W y i  
!                    l u o p:w u i  
!                    r e p:w  i s i:  
!                    l e p:w  :i s i v: : t  
 
"pp:p" p: => :Vowel (:m | :l | :r) p _ ;  
!                    k:c u m p p: a n i  
 
"s:" s: => s _ ;  
!                    e d e s s:   
!                    s e a s s: a:  
             :x _ ;  
!                    h a a k:x s: i  
 
"s:n" s:n => s _ [e | u | y] t .#. ;  
!                    n o u: s s:n u t  
!                    k:c a t k:c a i s s:n e e: t  
 
"s:z" s:z => t: _ ;  
!                    e t:d s:z i  
             .#. j o k: a i :d _ e ;  
!                    j o k:c a i :d s:z e l l e  
 
"t:d" t:d => [a|e|i|o|u (u:)|y|||h|l|n|t:] _ [a|e|i|o|u|y||] ;  
!                    p e l t:d o  
             _ s:z ;  
!                    p a i t:d s:z i  
             .#. _ u o m [a | i] ;  
!                    t:d u o m i o n  
"lt:ll" t:l => .#. :Cons* :Vowel (:Vowel) l _ ;  
!                    k:c u l t:l a i n e n  
 
"t:n" t:n => n _ [[o|u] i (v: a:) (t)] .#. ;  
!                    a n t:n o i  
!                    i l m a a: n t:n u i  
!                    a n t:n o i v: a: t  
             n _ [[|y] i (v: :) (t)] .#. ;  
!                    s y n t:n y i v: : t  
             n _ [a||y] i s i (v: [a:|:]) t .#. ;  
!                    a n t:n a i s i v: a: t  
 
"rt:rr" t:r => r _ a i s ;  
!                    k:c u m a r t:r a i s i:  
 
"t:" t: => t _ ;  
!                    p r o f:p :h e e: t t: a i n  
 
"u:" u: => u _ ;  
!                    h a l t:d u u: n  
!                    p a k:c a n a l l i s u u: d e s t a:  
             n o _ s s:n ;  
!                    n o u: s s:n u t  
 
"v:f" v:f => .#. _ [a n :g | i :c u n a] ;  
!                    v:f a n k:g i n a  
!                    v:f i k:c u n a  
 
"v:g" v:g => u _ u ;  
!                    s u v:g u n  
!                    l u v:g u n  
!                    r i u v:g u l l a  
 
"v:" v: => i _ [a:|:] t .#. ;  
!                    s a i s v: a: t  
 
"y:" y: => y _ ;  
!                    v:w   r y y: t t   
 
":" : =>  _ ;  
!                    k  : r m e e: n  
             [e | n s | s s: | s t] _ .#. ;  
!                    h e t k e :  
!                    n  k  n s :  
!                    h e n g e s s: :  
             .#. t i e t _ k [ : :t| ] ;  
!                    t i e t : k  : :t  
             i v: _ t .#. ;  
!                    k   n t:n i v: : t  
 
":" : => k : _ [n | t] .#. ;  
!                     l k  : n  
 
":" : => _ : n .#. ;  
!                     l k : : n  
 
"~ :" : /<= : _ ;  
 
":d" :d => _ s:z ;  
!                    j o k:c a i :d s:z e l l e  
!                    j o u 0:d s:z e n  
!                     k k: i :d s:z e l t   
 
":e" :e => .#. [e :d :z | k  r s | k   r | p y : h k |  
                  r u o :c k | s a l | s o t: | v: a a t:] _ i .#. ;  
!                    e t:d s:z :e i  
!                    k  r s :e i  
!                    k   r :e i  
!                    p y y: h k :e i  
!                    r u o k:c k :e i  
!                    s a l l :e i  
!                    s o t:d :e i  
!                    s o t :e i  
!                    v:w a a t:d :e i  
 
":g" :g => .#. [a i|a l|j a|p a|t e|k:c o (r)|r u o|h u o|n ]  
             _ [a|e|o|u|y|] ;  
!                    a i :g o i t  
!                    a l :g u s t a  
!                    h u o :g a t a  
!                    j a :g a t t e  
!                    k:c o :g o s s: a:  
!                    k:c o :g o l l a  
!                    k:k o r :g o t a n  
!                    n  :g y n  
!                    n  :g  n  
!                    p a :g o s t a:  
!                    r u o :g o n  
!                    t e :g o i l l a  
!                    v:w a a ’:g a l l a  
 
":h" :h => f:p _ ;  
!                    f:p :h a r i s e u s t e n  
             a _ a n .#. ;  
!                    j u h l a :h a n  
             e _ e n .#. ;  
!                    h  n e :h e n  
             i _ i n .#. ;  
!                    k:c a r i :h i n  
             o _ o n .#. ;  
!                    a r m o :h o n  
             u _ u n .#. ;  
!                    l o p p u :h u n  
              _  n .#. ;  
!                    e l  m  :h  n  
              _  n .#. ;  
!                    k i v:w i s t  :h  n  
             .#. k:c _ r i s t ;  
!                    k:c :h r i s t u s  
 
"asi:ais" :i => [a|] _ s i: ;  
!                    a v:w a :i s i:  
                 [a|] _ s i v: [a:|:] t .#. ;  
!                    l e p:w  :i s i v: : t  
             k:c u k:c k o _ .#. ;  
!                    k:c u k:c k o :i  
             o _ n u t ;  
!                    a i k:c o :i n u t  
 
":n" :n => t:d s:z e _ .#. ;  
!                    y l i t:d s:z e :n  
!                    o h i t:d s:z e :n  
!                    l  p i t:d s:z e :n  
!                    e d i t:d s:z e :n  
!                    a l a i t:d s:z e :n  
 
":s" :s => .#. :c a n s _ [:a | :o] ;  
!                    k:c a n s :s a n  
 
":t" :t => k: [a a: |  :] _ .#. ;  
!                    a n t:d a k:c a a: :t  
             Vowel: Cons:+ Vowel:+ Cons:+ [a|o|i] _ a .#. ;  
!                    a s i a :t a  
!                    p a h e m p:b i :t a  
             Vowel: Cons:+ Vowel:+ Cons:+ [|i] _  .#. ;  
!                    k y y n  r  :t   
":w" :w => l _ o i [l | s] ;  
!                    j a l :w o i l l a

Appendix 4: Distances for automatic character-by-character alignment

The following short Python program builds a WFST which relates MSF word forms to OLF word forms. The resulting WFST is used by the alignment script in Figure 6. The WFST restricts the character-by-character matching by rejecting most consonant-to-vowel and vowel-to-consonant correspondences. Furthermore, it assigns penalty weights to letter correspondences according to how many of their phonological features differ and by how much. The numerical values used in the program are more or less arbitrary, and one may tune them in order to improve the accuracy of the alignment.

The program was written for Finnish orthography, but it could be modified for other languages. In particular, it would be interesting to extend it so that it covers phonetic IPA representations of any language.

"""Produces a kind of a distance matrix between  
characters in an alphabet."""  
import sys, io  
import libhfst  
algfile = libhfst.HfstOutputStream(filename="chardist.fst")  
 
vowels = {
    'i':('Close','Front','Unrounded'),
    'y':('Close','Front','Rounded'),
    'u':('Close','Back','Rounded'),
    'e':('Mid','Front','Unrounded'),
    'ö':('Mid','Front','Rounded'),
    'o':('Mid','Back','Rounded'),
    'ä':('Open','Front','Unrounded'),
    'a':('Open','Back','Unrounded')
    }
cmo = {'Close':1, 'Mid':2, 'Open':3}
fb = {'Front':1, 'Back':2}
ur = {'Unrounded':1, 'Rounded':2}
 
consonants = {
    'm':('Bilab','Voiced','Nasal'),
    'p':('Bilab','Unvoiced','Stop'),
    'b':('Bilab','Voiced','Stop'),
    'v':('Labdent','Voiced','Fricative'),
    'w':('Labdent','Voiced','Fricative'),
    'f':('Labdent','Unvoiced','Fricative'),
    'n':('Alveolar','Voiced','Nasal'),
    't':('Alveolar','Unvoiced','Stop'),
    'd':('Alveolar','Voiced','Stop'),
    's':('Alveolar','Unvoiced','Sibilant'),
    'l':('Alveolar','Voiced','Lateral'),
    'r':('Alveolar','Voiced','Tremulant'),
    'j':('Velar','Voiced','Approximant'),
    'k':('Velar','Unvoiced','Stop'),
    'g':('Velar','Voiced','Stop'),
    'h':('Glottal','Unvoiced','Fricative')}
pos = {'Bilab':1, 'Labdent':1, 'Alveolar':2, 'Velar':3, 'Glottal':4}
voic = {'Unvoiced':1, 'Voiced':2}
def cmodist(x1, x2):  
    """Computes a distance of Close/Mid/Open and returns it"""  
    return abs(cmo[x2] - cmo[x1])  
 
def posdist(x1, x2):  
    """Computes a distance of articulation position and returns it"""  
    return abs(pos[x2] - pos[x1])  
 
def adist(x1, x2):  
    """Computes a distance between symbols"""  
    return (0 if x1 == x2 else 1)  
 
def printlset(lset):  
    """Print the set of letters and their features"""  
    ll = sorted(lset.keys())
    flist = []  
    for l in ll:  
        (x,y,z) = lset[l]  
        flist.append("{} : {},{},{}".format(l, x, y, z))  
    print('\n'.join(flist))
 
def featmetr(lset1, lset2, f1, f2, f3):  
    """Compute all metric distances between letters in d1 and d2  
 according to their features."""  
    ll1 = sorted(lset1.keys())  
    ll2 = sorted(lset2.keys())  
    ml = []  
    for l1 in ll1:  
        (x1,y1,z1) = lset1[l1]  
        for l2 in ll2:  
            (x2,y2,z2) = lset2[l2]  
            dist = f1(x1,x2) + f2(y1,y2) + f3(z1,z2)  
            ml.append("{}:{}::{}".format(l1,l2,dist))  
    return (ml)  
 
vvlist = featmetr(vowels, vowels, cmodist, adist, adist)  
cclist = featmetr(consonants, consonants, posdist, adist, adist)  
vowl = sorted(vowels.keys())  
cons = sorted(consonants.keys())  
letters = sorted(vowl + cons)  
# In HFST regular expressions, 0 stands for the epsilon (zero) symbol.
dellist = ['{}:0::{}'.format(l, 3) for l in letters]        # deleting a letter costs 3
epelist = ['0:{}::{}'.format(l, 3) for l in letters]        # inserting a letter costs 3
dbllist = ['{} 0:{}::{}'.format(l, l, 2) for l in letters]  # doubling a letter costs 2
sholist = ['{} {}:0::{}'.format(l, l, 2) for l in letters]  # shortening a double letter costs 2
 
speclist = ['k:c::0 k::0', 'k:x s:0::0', 't:d s:z::0', '0:d s:z::3',
            'i:j::1', 'j:i::1', 'i j:0::0', 'i i:j::0',
            'f:p 0:h::0', 'u:v::1', 'v:u::1', 'u:w::1', 'k:c::1',
            '[o:0 o:0?]::5']
all = (vvlist + cclist + dbllist +
       sholist + dellist + epelist + speclist)
re = '[{}]*'.format(' | '.join(all))
 
algfst = libhfst.regex(re)  
algfile.write(algfst)  
algfile.flush()  
algfile.close()
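
As a complement to the program above, the following lines sketch one way in which the resulting chardist.fst could be loaded and applied with the same libhfst bindings. The sketch is not part of the alignment script of Figure 6; the word form kansa and the number of candidates kept are merely illustrative choices. The MSF word form is composed with the distance transducer, the result is pruned to its cheapest paths, and the surviving OLF-like respellings are listed together with their total penalty weights.

"""A minimal sketch of applying chardist.fst to one MSF word form."""
import libhfst

# Read the weighted character-distance transducer back in.
instream = libhfst.HfstInputStream("chardist.fst")
chardist = instream.read()
instream.close()

# A path acceptor for one illustrative MSF word form; in HFST
# regular expressions {kansa} denotes the concatenation k a n s a.
aligned = libhfst.regex('{kansa}')

# Compose the word form with the distance transducer so that every
# output path is a candidate OLF-like spelling carrying a penalty weight.
aligned.compose(chardist)

# Keep only the five cheapest candidates and print them with their weights.
aligned.n_best(5)
for olf_form, weight in aligned.lookup("kansa"):
    print(olf_form, weight)

Composing a second path acceptor for an attested OLF form on the output side would, in the same way, restrict the result to the weighted character-by-character alignments between the two spellings.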