Research Group for Language Technology

 Chair: Tamás Váradi, Senior Research Fellow
 Secretary: Vera Arató
 E-mail: arato.vera[at]nytud.mta.hu
 Phone: (36-1) 3214-830/191

The Department of Language Technology was created in 1997, as a formal 
recognition of several years of research and development in the field of 
language technology. The department has since accumulated significant research 
experience and has made remarkable achievements, especially in the development 
of linguistic resources. It has participated in several successful international 
projects which aimed, on the one hand, to adapt for the analysis of Hungarian 
certain processes developed for western European languages and now considered 
standard (Multext-East, Gramlex) and, on the other hand, to 
develop new standards for creating linguistic resources (electronic dictionary 
databases, CONCEDE). The researchers at the department have acquired significant 
knowledge about computerized language processing systems and technologies 
developed or applied in these projects, and have played an active role in 
adapting these to the needs of Hungarian.

An extended version of the Hungarian National Corpus, a reference corpus of present-day Hungarian, has recently been completed at the department. It reflects written usage and now consists of 187 million 
words, including language variants from Slovakia, Subcarpathia, Transylvania and Vojvodina. The use of the processes 
and programs which have already been applied successfully in the course of the 
processing of the corpus (i.e., those used for tokenizing or disambiguating on 
the basis of statistical data), and of the technologies used in international 
projects for building the lexical database (e.g., SGML/XML editors, validating 
programs and descriptive grammars) has provided an opportunity for the 
researchers at the department to test and develop important language 
processing applications for Hungarian.

It can be considered a sign of great international recognition of the 
department as well as of the institute that Budapest was awarded the right to 
organize the 2003 conference of the European Chapter of the Association for 
Computational Linguistics (EACL’03). The department played a central role both in preparing 
the application (which was evaluated on the basis of very strict criteria) and 
in organizing the event itself.

Summing up, it can be said that the Department of Corpus Linguistics has 
accumulated a decade of experience in computational linguistics. As a result of 
its participation in several international projects in the field of language 
technology, and its regular and active presence at leading international 
conferences and workshops from the 1990s onwards, the department has acquired 
the status of a dominant intellectual base for Hungarian language technology.

Main research topics:
 * Natural language processing: computer-based analysis of the 
morphology and syntax of Hungarian.
 * Development of language resources, especially the Hungarian National Corpus. One of the most important tasks of the 
department is the development of the corpus, which at the moment contains 187 
million words, with morphological analysis and automatic part-of-speech tagging. 
The corpus is available through the Internet. It 
includes texts representing five varieties of written language: the language of 
the press, of fiction, of popular science, as well as official and personal 
writings.
 * The development of dictionaries and lexical databases on the basis of a large 
amount of data reflecting language use.

Current international and national projects:

 "Cross-language Access to Catalogues And On-line libraries" (CACAO)
 Duration: 2007-2009
 Funding: eContentPlus programme of the European Union
 Partners:
 * Xerox Research Centre Europe (XRCE), France -- coordinator
 * Centre Georges Pompidou, France
 * INESC-ID (Instituto de Engenharia de Sistemas e Computadores Investigação e 
Desenvolvimento em Lisboa), Portugal
 * The Portuguese National Digital Library, Portugal
 * CELI (Centro per l'Elaborazione del Linguaggio e dell'Informazione), Italy
 * Bolzano University Library, Italy
 * Freie Universität Bozen / Libera Università di Bolzano, Italy
 * Kórnik Library, Poland
 * National Széchényi Library, Hungary
 * Research Institute for Linguistics, Hungarian Academy of Sciences, Hungary
 * Göttingen University, Germany
 * The European Library
 
 CACAO offers an innovative approach for accessing, understanding and navigating 
multilingual textual content in digital libraries and online public access 
catalogues (OPACs), enabling European users to better 
exploit the available European electronic content. By coupling Natural Language 
Processing techniques with 
available information retrieval systems and tools for facilitating the 
maintenance of multilingual resources, we aim to deliver a non-intrusive 
infrastructure that can be integrated with current OPACs and digital libraries. As a 
result of this integration, users will be able to type in 
queries in their own language and retrieve volumes and documents in any 
available language.
 Website: www.cacaoproject.eu

 "Comparative Evaluation of the Hungarian and Slovene Wordnet in Machine 
Translation"
 Duration: 2009-2010
 The project aims to evaluate the Slovene and the Hungarian WordNet in 
Slovene-English and Hungarian-English machine translation. The tool we 
plan to use is the language-independent tool developed by the IXA Group that 
performs word sense disambiguation (WSD) with the help of WordNets (http://ixa2.si.ehu.es/ukb).
Besides WSD itself, we expect improved results in cases 
where no translation equivalent is found for a source word or phrase, as 
the semantic database may provide a hypernym.
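
 The hypernym fallback mentioned above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the project's implementation: the toy tables `HYPERNYMS` and `LEXICON`, their entries, and the function `translate_with_fallback` are hypothetical, and a real system would take the hypernym relation from a WordNet and rely on a WSD tool to choose the sense first.

```python
from typing import Optional

# Hypothetical toy data standing in for a WordNet hypernym relation
# and a Hungarian-English bilingual lexicon.
HYPERNYMS = {
    "puli": "kutya",   # a herding dog breed -> "dog"
    "kutya": "állat",  # "dog" -> "animal"
}
LEXICON = {
    "kutya": "dog",
    "állat": "animal",
}


def translate_with_fallback(word: str, max_steps: int = 5) -> Optional[str]:
    """Return a translation of `word`, climbing to hypernyms when needed."""
    current = word
    for _ in range(max_steps + 1):
        if current in LEXICON:
            return LEXICON[current]
        if current not in HYPERNYMS:
            return None  # no equivalent and no more general concept
        current = HYPERNYMS[current]
    return None


# "puli" has no lexicon entry, so its hypernym "kutya" is translated instead.
print(translate_with_fallback("puli"))
```

 The fallback trades precision for coverage: the output is more general than the source word, but the sentence still receives a usable translation.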

International and national projects already completed:

 Construction of the Hungarian WordNet Ontology and its Application in 
Information Extraction Systems (2005-2007)
 Project type: Economic Competitiveness Operative Program (GVOP) 2004-05-191 
project
 Funding: Agency for Research Fund Management and Research Exploitation (KPI)
 Duration: April 2005 - July 2007
 Consortium members:
 * University of Szeged, Department of Informatics, HLT Group (coordinator)
 * MorphoLogic Ltd. Budapest
 * Research Institute for Linguistics at HAS, Department of Corpus Linguistics
 
 Developing computer applications for the Hungarian language calls for the 
development of a Hungarian vocabulary database that can be handled by automated processes. 
In computational linguistics, ontology can be defined as the data structure of 
formally defined concepts and relations, by means of which semantic inferences 
can be drawn. The so-called language ontologies form an important sub-class of 
computational ontologies.
 The objective of the project was to create a semantically structured, general 
purpose Hungarian concept set on the basis of the results and formalism of 
the EuroWordNet language ontology. A further aim was to supplement the resulting 
ontology with a special sub-language already examined by the consortium and a 
domain-specific ontology including expressions of business language. Finally, we 
wished to present a potential application of the resulting concept network in 
the field of information extraction.
 The main result of the project is the development of a large, strictly 
structured natural language concept set (ontology), which helps in finding 
solutions to several important scientific and technological problems. Regarding 
scientific achievements, it is important to emphasize that the developments concern 
the semantics of Hungarian, a language which differs significantly, both typologically and 
morphologically, from the other European languages investigated.
 Further scientific and technical objectives of the project included:
 (1) research and development of machine learning algorithms to support automatic, 
heuristic-based ontology building (algorithms help reduce manual work to 
validation);
 (2) research in fields of word sense disambiguation and anaphora resolution;
 (3) development of an ontology-based information extraction software prototype 
for the domain of business news, which is capable of demonstrating the 
advantages of the application of the concept network.
 As the structure of WordNet ontologies is much more complex than that of any 
simple lexicon or thesaurus, their potential applications are far richer. As a 
mental encyclopedia of native speakers of Hungarian, a Hungarian WordNet 
ontology could - to a large extent - assist language teaching in schools. Its 
standardised interconnection with the other WordNets guarantees its 
applicability in teaching foreign languages as well. The proper acquisition of 
the lexical material of the studied foreign language, for example, may 
significantly contribute to the learner's clear understanding of the differences 
and similarities of his/her native and the target language. Apart from this, the 
concept network of WordNet may play an important role in psycholinguistic 
experiments concerning the Hungarian language.
 Beyond purely scientific applicability, electronic-based language technology 
applications of a Hungarian WordNet may also open new vistas. Search efficiency 
of different search engines is greatly increased if these tools have reliable 
access to the semantic environment of the search expression. This may lead to 
the improvement of future search engines that are capable of satisfying user 
needs to a greater extent. This may also increase the efficiency of information 
extraction and machine translation technologies by providing information about 
the semantic attributes of the analysed text. Automated processes supported by 
ontologies can handle the context of the information that has to be extracted or 
translated, and are therefore likely to produce more reliable results than mere 
pattern matching or word-by-word translation methods.
 
 Hungarian-English Machine Translation System
 Project type: National Research and Development Programme (NKFP) 2/008/2004 
project
 Duration: 01. January 2005 - 31. May 2007
 Funding: Agency for Research Fund Management and Research Exploitation (KPI)
 Consortium members:
 * MorphoLogic Ltd. Budapest (coordinator)
 * University of Szeged Department of Informatics, HLT Group
 * Research Institute for Linguistics at HAS, Department of Corpus Linguistics
 
 The aim of the project was to implement a Hungarian-English machine translation 
(MT) system. Three application prototypes can be built upon it: example sentence 
translator, software supporting the understanding of free text strings, and a 
form-filler translator. The long-term aim of the project is to enhance the 
Hungarian language infrastructure and increase the competitiveness of the 
economic entities. The system helps to fill in official forms in English, to 
translate business letters into English and to help Hungarian firms in entering 
the international markets. Interested non-Hungarians can have access to 
information on Hungarian organisations and events about which no English 
description is otherwise available. The translation service of the European 
Commission uses MT systems in translating less sensitive documents (among the 
languages of the "old" members); thus, software of this type would also be of 
great use in the EC.
 It is common practice in EU institutions to use MT for documents requiring 
fast and cheap, comprehensible but not too stylish translation (as in the case of 
proposals, comments, inter-institution mails), the output of which is then 
corrected by human translators. Its cost per page is less than one-fifth of that 
of a human translation. No automated translation system with Hungarian as the source 
language exists at present, so we suppose that the English-Hungarian MT 
system already under development and the Hungarian-English MT system to be 
developed would be used by EU institutions for quick translation of documents. 
International institutions other than the EU also use MT, so the system could 
find its place either as product or as service both in the Hungarian and the 
international markets.
 The project aimed, therefore, at developing a Hungarian-English MT system that 
places a great emphasis on facilitating the international integration of 
Hungary. Through this, it increases the competitiveness of the economic entities 
in the international market, makes EU development resources more available, thus 
it encourages the innovation activities of small- and medium-sized enterprises 
as well as state-financed organisations, which entails the visible improvement 
of the country's R&D potential. The focus areas of the development were:
 
 * translating tender forms and schematised international contact correspondence 
into English;
 * familiarising foreigners with Hungarian enterprises or, to be more precise, 
facilitating the appearance of Hungarian enterprises in international markets;
 * satisfying the requirements of the EU translation organisations, especially 
the Commission's Translation Service (Service de Traduction).
 
 The quality of MT is obviously far from that of a human translation but, since its costs 
are also considerably lower, its application can be justified in certain areas. 
Documents translated by computers are not intended for publication: their prime 
objective is to support the understanding of foreign language documents and the 
reader is left to his/her own intelligence to filter out and understand 
confusing misinterpretations that are trivial for human wit yet, at the present 
stage of artificial intelligence research, irresolvable for computers. For this 
reason, the better the reader knows the domain of the text, the more useful the 
output of the translation software is for him/her.
 In machine translation, targeted enhancement of the translation in a 
given domain, subsequent to the development of the core system, results in a 
remarkable increase in translation quality in the case of deterministic translation 
systems. It is well worth designating a few areas in which the translation system 
is to generate translations of above-average quality. Taking the above 
priorities into consideration, the translation system to be developed will be 
optimized for translating texts in the public administration and economic domains.
 From a technical point of view, the system merges the advantages of direct and transfer 
translation mechanisms and incorporates corpus linguistic methods as well - thus 
it is capable of doing pattern-based translations, too. It is based on the 
concept that the translation process is carried out not in two strictly 
separated phases of analysis and synthesis but rather in one single phase, 
practically simultaneously with analysis. The system analyses the texts and 
constructs the translation during analysis not (only) through abstract rules but 
on the basis of lexically more or less specified or under-specified patterns.
 
 Examination of National and Ethnic Identity by Means of Computerised 
Content-analysis of Narratives pertaining to Historic Events
 Project type: Ányos Jedlik Programme 6/074/2005 project
 Duration: 01. January 2006 - 12. December 2008
 Funding: Agency for Research Fund Management and Research Exploitation (KPI)
 Consortium members:
 * University of Pécs, Institute of Psychology (coordinator)
 * Hungarian Academy of Sciences, Research Institute of Psychology
 * University of Szeged, Department of Informatics, Human Language Technology 
Group
 * MorphoLogic Ltd. Budapest
 * Research Institute for Linguistics at HAS, Department of Corpus Linguistics
 
 This research explored the historically changing strategies of identity 
construction in historical narratives of traumatic events of the Hungarian 
historical past (Trianon, World War II, Holocaust, 1956) with the help of 
automated language analysis methods. The analyses explored the processes in 
which different qualities of the Hungarian national identity are shaped. They 
also enabled us to map the trends and psychological conditions of change, and 
the knowledge of the processes of coping with negative historical events.
 Another important element of the research was the analysis of parallel stories, 
e.g., the history of the Austro-Hungarian Monarchy from Austrian and Hungarian 
viewpoints in the field of group references, evaluation perspectives, and the 
comparison of group aims. We identified, on the level of the text, formal and 
informal groups separated in historical memory, language patterns of group 
agency, group aims, subjectivity, inter-group relations, struggle and emotional 
identification, and their connection with agents. To this effect, we prepared a 
language analyser for the following concrete psychological processes: group 
agency, group proximity and abduction, emotional evaluation, group struggle, 
group viewpoint, and change of viewpoint, time continuity and discontinuity.
 The project had a twofold aim. On the one hand, we created a content 
analysis software that is able to analyse content above sentence level. On the 
other hand, we deepened the present knowledge of Hungarian national identity, 
modes and components of the identity construction, and the influential factors 
of changes. Within the latter topic, we intended to compare stable and changing 
social psychological constructions, analyse the relationship between competing 
representations, and check the generational hypotheses referring to the 
representation of traumatic historical events. Our further aim was the 
description of the strategies for coping with loss, shame, and the sense of 
guilt, and the examination of the appearance of different perspectives in 
historical narratives: the distribution of responsibilities, the distribution of agency 
vs. submission, the appearance of endangerment vs. safety and of solitude vs. 
interdependence, and the appearance of the evaluation pattern of acceptability and 
unacceptability from the viewpoint of the group as well as of the narrower and wider 
social environment.
 
 Unified Hungarian Ontology
 Project type: National Research and Development Programme (NKFP) 2/042/2004 
project
 Duration: 01. October 2004 - 31. October 2006
 Funding: Agency for Research Fund Management and Research Exploitation (KPI)
 Consortium members:
 
 * Budapest University of Technology and Economics, Department of Sociology and 
Communication (coordinator)
 * Budapest University of Technology and Economics, Department of 
Telecommunication and Media Informatics
 * MorphoLogic Ltd. Budapest
 * Scriptum Informatics Corporation
 * Applied Logic Laboratory
 * University of Szeged, Department of Informatics, HLT Group
 * Research Institute for Linguistics at HAS, Department of Corpus Linguistics
 
 In the public service practice of companies, valuable knowledge arises on a 
daily basis, which is worth recording in a company knowledge-base so that 
any PR co-worker can use it on later occasions. Operating 
a continuously growing knowledge-base requires ontology-based knowledge management 
skills that can ensure the integration and systematisation of the 
practical, factual information it contains. The immediate objective of 
the project is the intelligent computational support of such public service 
activities in the field of telecommunication. A successfully 
developed ontology infrastructure can be used in any other domain, 
provided that a domain-specific knowledge-base and ontology are available.
 In order to achieve immediate objectives, the project had to carry out 
developments in a way that their results could be used in a wider circle and for 
other purposes, as well. Therefore, the indirect objective of the project was 
the creation of a unified national ontology framework that contains a freely 
available top ontology and a domain ontology of public telecommunication 
services. By this means, the consortium wishes to create an open, freely 
available ontology infrastructure containing an ontology management methodology, 
ontology handling tools, a practical guide and the necessary cooperative system 
for the maintenance of the framework.
 The term 'ontology' first appeared in the world of data modelling and artificial 
intelligence; it was only later that it was used in an increasing number of 
other fields, e.g., cognitive psychology and natural language processing. The 
current and still growing popularity of the term is due to the international 
Semantic Web initiative. For some it may seem that this category, which has 
emerged in the past few decades, is the product of informatics, but ontology has 
always been a special field of philosophy. Therefore, if we want to thoroughly 
understand the activities of information technology concerning ontology, it is 
worth separating philosophical ontologies and the so-called industrial 
ontologies from each other. No matter what we call them, primarily, the true 
sense for us in this separation can be a more precise and more unambiguous 
description of the inner structure and features of applicable ontologies.
 Ontology building requires strict methodology, adequate ontology management 
skills and the establishment of a robust infrastructure. We have to be prepared 
that, in a short while, the unified ontology framework might have the function 
to loosely connect different domain ontologies. This will require skills in 
comparing ontologies and matching them loosely. One possible way of making 
ontologies comparable is to connect them through top categories, which requires 
formal logic tools and methodologies.
 
 E-vocabulary -- Educational aid for examining contemporary Hungarian 
literature and its vocabulary from multiple angles
 (IHM-ITP-11 /106)
 Duration: 2004-2005
 In the framework of this project we have morphologically analysed and 
disambiguated a 33 million word corpus containing texts of contemporary 
Hungarian literature, forming part of the "Digital Literary Academy". We have 
developed a related intelligent query interface, tables showing all possible 
word forms of words, as well as word form-, word stem- and part-of-speech based 
frequency lists. These developments are what the electronic curriculum in 
Sulinet STD is based on.
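
 As an illustration of the three kinds of frequency list mentioned above, the sketch below counts word forms, stems and part-of-speech tags over a token sequence that has already been analysed and disambiguated. The `Token` type, the tag labels and the sample tokens are assumptions made for this example; the project's actual data format is not described here.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str  # surface word form
    stem: str  # stem chosen by disambiguation
    pos: str   # part-of-speech tag (labels here are illustrative)


def frequency_lists(tokens):
    """Build word form, stem and part-of-speech frequency lists."""
    forms = Counter(t.form.lower() for t in tokens)
    stems = Counter(t.stem for t in tokens)
    tags = Counter(t.pos for t in tokens)
    return forms, stems, tags


# Hypothetical sample: three inflected forms of "ház" (house) and one adjective.
sample = [
    Token("házak", "ház", "N"),
    Token("házban", "ház", "N"),
    Token("szép", "szép", "ADJ"),
    Token("házak", "ház", "N"),
]
forms, stems, tags = frequency_lists(sample)
print(forms.most_common(2))
print(stems.most_common(2))
print(tags.most_common(2))
```

 Counting stems separately from word forms matters for a richly inflected language such as Hungarian, where one stem is spread over many surface forms.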
 
 Intelligent multilingual document classification in EUROVOC system (ITEM 
2003/000165)
 Duration: January 2004 - December 2004
 Partners:
 * JRC-IPSC
 * MorphoLogic Ltd. Budapest
 
 The project aimed at developing a multilingual system which automatically 
classifies documents according to their content following the categories of the 
EUROVOC categorisation system (thesaurus), which is regularly used in the 
European Union. During the project the Hungarian version of the whole EUROVOC 
system was developed, along with the technology for the automatic 
content-based classification of texts, primarily in Hungarian, English, German and 
French.
 
 Intelligent electronic dictionary and lexical database (INLEX) 48/2002 
ITEM project
 Duration: 2003-2004
 The aim of the project was to develop an up-to-date, electronic, 
machine-readable dictionary and lexical database which satisfies the needs of 
the information society, and to make it searchable via the internet. The 
database was created by means of a technology which follows and applies 
international standards. It was sufficiently explicit and practical, conforming 
to the needs of computer-based applications, and thus it could provide 
up-to-date information which flexibly adapts to the needs of language technology, 
scientific research, education, or those of the general public. The project was 
essentially based on one technological and one content source. In the course of 
the CONCEDE (Consortium for Central European Dictionary Encoding) project, in 
which the Department 
of Corpus Linguistics also took part, a representation formalism was developed 
which, being based on international standards but taking into account the 
specific features of individual languages, including Hungarian, is capable of 
coding and storing lexical information in a way which satisfies the above 
requirements. The INLEX project aimed to use this technological basis and to 
develop it further. The source of the content of the electronic dictionary is 
the Concise Hungarian Explanatory Dictionary, compiled at the Research Institute 
for Linguistics, which was also the basis for the CONCEDE project, although the 
processing of the whole dictionary could not be carried out in the framework of 
the latter, and only some parts of it were used as test data.
 
 Machine learning of syntax rules (application of machine learning methods for 
the generation of Hungarian syntactic rules)
 Project type: Info-communication Technologies and Applications (IKTA) 37/2002 
RTD project
 Duration: 01. October 2002 - 31. October 2004
 Funding: Ministry of Education
 Consortium members:
 
 * University of Szeged, Department of Informatics, HLT Group (coordinator)
 * MorphoLogic Ltd. Budapest
 * Research Institute for Linguistics at HAS, Department of Corpus Linguistics
 
 Parsing, or syntactic analysis of texts plays a key role in natural language 
processing (NLP). Similarly to many other languages, Hungarian heavily relies on 
the use and interrelation of suffixes (morphemes) and elementary word structures 
(syntagmas). The recognition of syntagmas and identification of their relation 
to each other is essential in NLP systems. Lacking this, semantic analysis of 
natural language sentences would not be executable. Also, artificial 
intelligence programs could work much more efficiently by the introduction of a 
thorough syntactical analysis. Promising fields of application include machine 
translation, automatic information extraction, and text analysis for scientific 
or commercial purposes.
 Research groups studying the structure of Hungarian sentences have made a great 
effort to produce a consistent syntax rule system, yet these have not been 
adaptable to practical, computer-related purposes so far. This implies that 
there is a strong demand for the development of a technology that would be able 
to divide a Hungarian sentence into syntactic segments, recognize their 
structure and, based on this recognition, assign an annotated tree 
representation to each sentence. Such so-called treebank representations have 
already been developed for most West European languages, and some Central and 
East European languages as well.
 In relation to the above, the project's main goal was twofold. On the one hand, 
we aimed to develop a general purpose syntactic parser for Hungarian, with the 
support of machine learning algorithms. An inevitable precondition of the 
technology behind a syntactic parser that has the required efficiency is the 
existence of a syntactically annotated Hungarian language corpus of suitable 
size (a treebank), which can serve as learning database for the machine learning 
system, and also as a basic reference for future similar research. Therefore, 
another aim of the project was to develop such a treebank.
 
 Information Extraction from Short Business News
 Project type: National Research and Development Programme (NKFP) 2/17/2001 
project
 Duration: 01. July 2001 - 31. July 2003
 Funding: Ministry of Education
 Consortium members:
 
 * MorphoLogic Ltd. Budapest (coordinator)
 * University of Szeged, Department of Informatics, HLT Group
 * Research Institute for Linguistics at HAS, Department of Corpus Linguistics
 
 The central aim of the project was to develop a technology capable of 
content analysis and information retrieval, with the help of which the relevant 
information could be obtained in a structured form from texts (from short 
business news). During the IE process, first textual data (natural language text) 
had to be parsed for relevant information, then the identified information had 
to be extracted and stored in a pre-defined structure. It was important that the 
system disregards irrelevant information, and that the structured data can be 
easily managed and queried by automated means. To accomplish this goal, 
participants represented the most typical events of business life by so-called 
semantic frames. The recognition of semantic frames was supported by shallow 
syntactic parsing methods. Consortium members applied machine learning 
algorithms for determining shallow syntactic rules. The learning process was 
conducted on the Szeged Treebank 1.0 already containing hierarchic noun phrase (NP) 
annotation and the marking of clause boundaries.
 A by-product of the project was an annotated corpus of Hungarian, which serves 
as a reference for future linguistic and language technological research.
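
 The idea of storing extracted information in a pre-defined structure can be sketched with a single hypothetical frame. The `AcquisitionFrame` class, its slots and the regular expression are assumptions for illustration only. The project recognised frames with shallow syntactic parsing of Hungarian text, whereas this sketch substitutes a simple surface pattern over English sentences.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcquisitionFrame:
    """Hypothetical semantic frame for a 'company buys company' event."""
    buyer: str
    target: str


# Illustrative surface pattern: capitalised name, trigger verb, capitalised name.
# A real system would use shallow syntactic parsing instead.
PATTERN = re.compile(
    r"(?P<buyer>[A-Z]\w+(?: [A-Z]\w+)*) (?:acquired|bought) "
    r"(?P<target>[A-Z]\w+(?: [A-Z]\w+)*)"
)


def extract_acquisition(sentence: str) -> Optional[AcquisitionFrame]:
    """Fill the frame if the sentence matches; otherwise return None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None  # irrelevant sentence: nothing is stored
    return AcquisitionFrame(match.group("buyer"), match.group("target"))


print(extract_acquisition("Alpha Bank acquired Beta Invest last week."))
print(extract_acquisition("Shares were flat on Tuesday."))
```

 Because each frame is a fixed record, the results can be stored and queried by automated means, and sentences that match no frame are simply disregarded.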
 
 2000-2002: MATCHPAD (Machine Translation for the Czech, Polish and Hungarian Public Administration)
 1998-2000: CONCEDE (Consortium for Central European Dictionary Encoding) COPERNICUS project
 1997-2002: TELRI (Trans European Language Resources Infrastructure) project
 1995-1998: MULTEXT-EAST (Multilingual Text Tools and Corpora) COPERNICUS project