Research Group for Language Technology
Chair: Tamás Váradi, Senior Research Fellow
Secretary: Vera Arató
E-mail: arato.vera[at]nytud.mta.hu
Phone: (36-1) 3214-830/191
The Department of Language Technology was created in 1997, as a formal
recognition of several years of research and development in the field of
language technology. The department has since accumulated significant research
experience and has made remarkable achievements, especially in the development
of linguistic resources. It has participated in several successful international
projects that aimed, on the one hand, to adapt for the analysis of Hungarian
certain processes developed for Western European languages and now considered
standard (Multext-East, Gramlex) and, on the other hand, to develop new
standards for creating linguistic resources (electronic dictionary
databases, CONCEDE). The researchers at the department have acquired significant
knowledge about computerized language processing systems and technologies
developed or applied in these projects, and have played an active role in
adapting these to the needs of Hungarian.
An extended version of the Hungarian National Corpus, a reference corpus of
present-day written Hungarian, has recently been completed at the department. It
now consists of 187 million words and also includes language variants from
Slovakia, Subcarpathia, Transylvania and Vojvodina. The processes and programs
already applied successfully in processing the corpus (i.e., those used for
tokenizing or for disambiguating on the basis of statistical data), together
with the technologies used in international projects for building the lexical
database (e.g., SGML/XML editors, validating programs and descriptive grammars),
have given the researchers at the department an opportunity to test and develop
important language processing applications for Hungarian.
It can be considered a sign of great international recognition of the
department as well as of the institute that Budapest was awarded the right to
organize the 2003 conference of the European Chapter of the Association for
Computational Linguistics (EACL’03). The department played a central role both in preparing
the application (which was evaluated on the basis of very strict criteria) and
in organizing the event itself.
Summing up, it can be said that the Department of Corpus Linguistics has
accumulated a decade of experience in computational linguistics. As a result of
its participation in several international projects in the field of language
technology, and its regular and active presence at leading international
conferences and workshops from the 1990s onwards, the department has acquired
the status of a dominant intellectual base for Hungarian language technology.
Main research topics:
Natural language processing. Computer-based analysis of the
morphology and syntax of Hungarian. Development of language resources,
especially:
The Hungarian National Corpus. One of the most important tasks of the
department is the development of the corpus, which at the moment contains 187
million words, with morphological analysis and automatic part-of-speech tagging.
The corpus is available through the Internet. The corpus
includes texts representing five varieties of written language: the language of
the press, of fiction, of popular science, as well as official and personal
writings.
The development of dictionaries and lexical databases on the basis of a large
amount of data reflecting language use.
Current international and national projects:
"Cross-language Access to Catalogues And On-line libraries" (CACAO)
Duration: 2007-2009
Funding: eContentPlus programme of the European Union
Partners:
* Xerox Research Centre Europe (XRCE), France -- coordinator
* Centre Georges Pompidou, France
* INESC-ID (Instituto de Engenharia de Sistemas e Computadores Investigaçao e
Desenvolvimento em Lisboa) Portugal
* The Portuguese National Digital Library, Portugal
* CELI, (Centro per l'Elaborazione del Linguaggio e dell'Informazione) Italy
* Bolzano University Library, Italy
* Freie Universität Bozen / Libera Universita di Bolzano, Italy
* Kornik Library, Poland
* National Szechenyi Library, Hungary
* Research Institute for Linguistics, Hungarian Academy of Sciences, Hungary
* Göttingen University, Germany
* The European Library
CACAO offers an innovative approach for accessing, understanding and navigating
multilingual textual content in digital libraries and OPACs, enabling European
users to better exploit the available European electronic content. By coupling
Natural Language Processing techniques with available information retrieval
systems and tools for facilitating the maintenance of multilingual resources, we
aim to deliver a non-intrusive infrastructure that can be integrated with
current OPACs and digital libraries. As a result of this integration, users will
be able to type in queries in their own language and retrieve volumes and
documents in any available language.
Website: www.cacaoproject.eu
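As a rough illustration of the cross-language querying idea described above, the sketch below expands a query through a tiny hand-made bilingual lexicon and matches it against catalogue titles. The lexicon, the catalogue records and the function names are invented for illustration and are not taken from the CACAO infrastructure.

```python
# Toy cross-language catalogue search; all data and names are hypothetical.

# Hypothetical Hungarian -> {language: [translations]} lexicon.
LEXICON = {
    "nyelv": {"en": ["language"], "fr": ["langue"]},
    "történelem": {"en": ["history"], "fr": ["histoire"]},
}

# Hypothetical catalogue records: (language, title).
CATALOGUE = [
    ("en", "A History of the Hungarian Language"),
    ("fr", "Histoire de la langue hongroise"),
    ("hu", "A magyar nyelv története"),
    ("en", "Introduction to Chemistry"),
]


def expand_query(terms):
    """Return the query terms plus every translation found in the lexicon."""
    expanded = set(terms)
    for term in terms:
        for translations in LEXICON.get(term, {}).values():
            expanded.update(translations)
    return expanded


def search(terms):
    """Return titles containing at least one original or translated term."""
    expanded = expand_query(terms)
    hits = []
    for _language, title in CATALOGUE:
        words = set(title.lower().split())
        if words & expanded:
            hits.append(title)
    return hits
```

Calling search(["nyelv"]) on this toy data returns the English, French and Hungarian titles about language and skips the chemistry title. A real system would add morphological analysis and ranking, which are omitted here.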
"Comparative Evaluation of the Hungarian and Slovene Wordnet in Machine
Translation" 2009-2010
The project aims to evaluate the Slovene and the Hungarian WordNet in
Slovene-English and Hungarian-English Machine Translation. We plan to use
the language-independent tool developed by the IXA Group, which performs
word sense disambiguation (WSD) with the help of WordNets
(http://ixa2.si.ehu.es/ukb). Beyond WSD itself, we expect improved results
in cases where no translation equivalent is found for a source word or
phrase, as the semantic database may provide a hypernym.
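The hypernym fallback mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the hypernym links and the translation dictionary are invented toy data, the function name is hypothetical, and none of it reflects the actual WordNets or the IXA tool.

```python
# Toy hypernym fallback for missing translation equivalents.
# All entries are invented; a real system would consult a WordNet.

# Hypothetical hypernym links between Hungarian lemmas (child -> parent).
HYPERNYM = {
    "puli": "kutya",   # a dog breed -> dog
    "kutya": "állat",  # dog -> animal
}

# Hypothetical Hungarian-English dictionary with a gap for "puli".
TRANSLATION = {
    "kutya": "dog",
    "állat": "animal",
}


def translate_with_fallback(lemma):
    """Translate a lemma, climbing hypernym links until a translation exists.

    Returns (translation, steps), where steps counts the hypernym links
    followed, or (None, steps) if the chain ends without a translation.
    """
    steps = 0
    current = lemma
    seen = set()  # guards against accidental cycles in the toy data
    while current is not None and current not in seen:
        if current in TRANSLATION:
            return TRANSLATION[current], steps
        seen.add(current)
        current = HYPERNYM.get(current)
        steps += 1
    return None, steps
```

The step count gives a crude signal of how much meaning was lost: a translation found after zero steps is a direct equivalent, while one found after several steps is only a generalisation.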
International and national projects already completed
Construction of the Hungarian WordNet Ontology and its Application in
Information Extraction Systems (2005-2007)
Project type: Economic Competitiveness Operative Program (GVOP) 2004-05-191
project
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
Duration: April 2005 - July 2007
Consortium members:
* University of Szeged, Department of Informatics, HLT Group (coordinator)
* MorphoLogic Ltd. Budapest
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics
Developing computer applications for the Hungarian language calls for a
Hungarian vocabulary database that can be managed by automated processes.
In computational linguistics, an ontology can be defined as a data structure of
formally defined concepts and relations, by means of which semantic inferences
can be drawn. The so-called language ontologies form an important sub-class of
computational ontologies.
The objective of the project was to create a semantically structured, general
purpose Hungarian concept set on the basis of the results and formalism of
the EuroWordNet language ontology. A further aim was to supplement the resulting
ontology with a special sub-language already examined by the consortium and a
domain-specific ontology including expressions of business language. Finally, we
wished to present a potential application of the resulting concept network in
the field of information extraction.
The main result of the project is the development of a large, strictly
structured natural language concept set (ontology), which helps in finding
solutions to several important scientific and technological problems. Regarding
scientific achievements, it is important to emphasize that the developments concern
the semantics of Hungarian, a language that differs significantly, both
typologically and morphologically, from the other European languages investigated.
Further scientific and technical objectives of the project included:
(1) research and development of machine learning algorithms to support automatic,
heuristic-based ontology building (algorithms help reduce manual work to
validation);
(2) research in fields of word sense disambiguation and anaphora resolution;
(3) development of an ontology-based information extraction software prototype
for the domain of business news, which is capable of demonstrating the
advantages of the application of the concept network.
As the structure of WordNet ontologies is much more complex than that of any
simple lexicon or thesaurus, its application potentials are far richer. As a
mental encyclopedia of native speakers of Hungarian, a Hungarian WordNet
ontology could - to a large extent - assist language teaching in schools. Its
standardised interconnection with the other WordNets guarantees its
applicability in teaching foreign languages as well. The proper acquisition of
the lexical material of the studied foreign language, for example, may
significantly contribute to the learner's clear understanding of the differences
and similarities of his/her native and the target language. Apart from this, the
concept network of WordNet may play a major role in psycholinguistic
experiments on the Hungarian language.
Beyond purely scientific applicability, electronic-based language technology
applications of a Hungarian WordNet may also open new vistas. Search efficiency
of different search engines is greatly increased if these tools have reliable
access to the semantic environment of the search expression. This may lead to
the improvement of future search engines that are capable of satisfying user
needs to a greater extent. This may also increase the efficiency of information
extraction and machine translation technologies by providing information about
the semantic attributes of the analysed text. Automated processes supported by
ontologies can handle the context of the information that has to be extracted or
translated and are therefore likely to produce more reliable results than mere
pattern matching or word-by-word translation methods.
Hungarian-English Machine Translation System
Project type: National Research and Development Programme (NKFP) 2/008/2004
project
Duration: 01. January 2005 - 31. May 2007
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
Consortium members:
* MorphoLogic Ltd. Budapest (coordinator)
* University of Szeged Department of Informatics, HLT Group
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics
The aim of the project was to implement a Hungarian-English machine translation
(MT) system. Three application prototypes can be built upon it: example sentence
translator, software supporting the understanding of free text strings, and a
form-filler translator. The long-term aim of the project is to enhance the
Hungarian language infrastructure and increase the competitiveness of the
economic entities. The system helps to fill in official forms in English, to
translate business letters into English and to help Hungarian firms enter
international markets. Interested non-Hungarians can gain access to
information on Hungarian organisations and events for which no English
description is otherwise available. The translation service of the European
Commission uses MT systems for translating less sensitive documents (among the
languages of the "old" members); software of this type would therefore also be of
great use in the EC.
It is common practice in EU institutions to use MT for documents requiring
fast, cheap and comprehensible but not particularly polished translation (such as
proposals, comments and inter-institutional mail), the output of which is then
corrected by human translators. Its cost per page is less than one-fifth of that
of a human translation. No automated translation system with Hungarian as the
source language exists at present, so we expect that the English-Hungarian MT
system already under development and the Hungarian-English MT system to be
developed would be used by EU institutions for the quick translation of documents.
International institutions other than the EU also use MT, so the system could
find its place either as a product or as a service in both the Hungarian and the
international markets.
The project aimed, therefore, at developing a Hungarian-English MT system that
places a great emphasis on facilitating the international integration of
Hungary. Through this, it increases the competitiveness of the economic entities
in the international market, makes EU development resources more available, thus
it encourages the innovation activities of small- and medium-sized enterprises
as well as state-financed organisations, which entails the visible improvement
of the country's R&D potential. The focus areas of the development were:
* translating tender forms and schematised international contact correspondence
into English;
* familiarising foreigners with Hungarian enterprises or, to be more precise,
facilitating the appearance of Hungarian enterprises in international markets;
* satisfying the requirements of the EU translation organisations, especially
the Commission's Translation Service (Service de Traduction).
The quality of MT is obviously far from a human translation but, since its costs
are also considerably lower, its application can be justified in certain areas.
Documents translated by computers are not intended for publication: their prime
objective is to support the understanding of foreign language documents and the
reader is left to his/her own intelligence to filter out and understand
confusing misinterpretations that are trivial for human wit yet, at the present
stage of artificial intelligence research, irresolvable for computers. For this
reason, the better the reader knows the domain of the text, the more useful the
output of the translation software is for him/her.
In machine translation, the targeted enhancement of translation in a given
domain, subsequent to the development of the core system, results in a
remarkable increase in translation quality in the case of deterministic
translation systems. It is well worth designating a few areas in which the
translation system is to generate translations of above-average quality. Taking
the above priorities into consideration, the translation system to be developed
will be optimized for the public administration and economic domains.
From a technical standpoint, the system merges the advantages of direct and transfer
translation mechanisms and incorporates corpus linguistic methods as well; thus
it is also capable of pattern-based translation. It is based on the
concept that the translation process is carried out not in two strictly
separated phases of analysis and synthesis but rather in one single phase,
practically simultaneously with analysis. The system analyses the texts and
constructs the translation during analysis not (only) through abstract rules but
on the basis of lexically more or less specified or under-specified patterns.
Examination of National and Ethnic Identity by Means of Computerised
Content-analysis of Narratives pertaining to Historic Events
Project type: Ányos Jedlik Programme 6/074/2005 project
Duration: 01. January 2006 - 12. December 2008
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
Consortium members:
* University of Pécs, Institute of Psychology (coordinator)
* Hungarian Academy of Sciences, Research Institute of Psychology
* University of Szeged, Department of Informatics, Human Language Technology
Group
* MorphoLogic Ltd. Budapest
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics
This research explored the historically changing strategies of identity
construction in historical narratives of traumatic events of the Hungarian
historical past (Trianon, World War II, Holocaust, 1956) with the help of
automated language analysis methods. The analyses explored the processes in
which different qualities of the Hungarian national identity are shaped. They
also enabled us to map the trends and psychological conditions of change, and
the knowledge of the processes of coping with negative historical events.
Another important element of the research was the analysis of parallel stories,
e.g., the history of the Austro-Hungarian Monarchy from Austrian and Hungarian
viewpoints in the field of group references, evaluation perspectives, and the
comparison of group aims. We identified, on the level of the text, formal and
informal groups separated in historical memory, language patterns of group
agency, group aims, subjectivity, inter-group relations, struggle and emotional
identification, and their connection with agents. To this effect, we prepared a
language analyser for the following concrete psychological processes: group
agency, group proximity and abduction, emotional evaluation, group struggle,
group viewpoint, and change of viewpoint, time continuity and discontinuity.
The project had a twofold aim. On the one hand, we created content
analysis software that is able to analyse content above sentence level. On the
other hand, we deepened the present knowledge of Hungarian national identity,
modes and components of the identity construction, and the influential factors
of changes. Within the latter topic, we intended to compare stable and changing
social psychological constructions, analyse the relationship between competing
representations, and check the generational hypotheses referring to the
representation of traumatic historical events. Our further aim was the
description of the strategies for coping with loss, shame, and the sense of
guilt, and the examination of the appearance of different perspectives in
historical narratives: the distribution of responsibilities, the distribution of
agency vs. submission, the appearance of endangerment vs. safety and of solitude
vs. interdependence, and the appearance of the evaluation pattern of
acceptability and unacceptability from the viewpoint of the group as well as of
the narrower and wider social environment.
Unified Hungarian Ontology
Project type: National Research and Development Programme (NKFP) 2/042/2004
project
Duration: 01. October 2004 - 31. October 2006
Funding: Agency for Research Fund Management and Research Exploitation (KPI)
Consortium members:
* Budapest University of Technology and Economics, Department of Sociology and
Communication (coordinator)
* Budapest University of Technology and Economics, Department of
Telecommunication and Media Informatics
* MorphoLogic Ltd. Budapest
* Scriptum Informatics Corporation
* Applied Logic Laboratory
* University of Szeged, Department of Informatics, HLT Group
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics
In the public service practice of companies, valuable knowledge arises on a
daily basis, which is worth keeping track of in a company knowledge base so that
any PR co-worker can utilise it on a later occasion. Operating a continuously
growing knowledge base requires ontology-based knowledge management
skills that can ensure the integration and systematisation of the
practical, factual information in the knowledge base. The immediate objective of
the project is the intelligent and computational support of such public service
activities in the field of telecommunication. A successfully developed
ontology infrastructure can be used in any other domain, provided that a
domain-specific knowledge base and ontology are available.
In order to achieve immediate objectives, the project had to carry out
developments in a way that their results could be used in a wider circle and for
other purposes, as well. Therefore, the indirect objective of the project was
the creation of a unified national ontology framework that contains a freely
available top ontology and a domain ontology of public telecommunication
services. By this means, the consortium wishes to create an open, freely
available ontology infrastructure containing an ontology management methodology,
ontology handling tools, a practical guide and the necessary cooperative system
for the maintenance of the framework.
The term 'ontology' first appeared in the world of data modelling and artificial
intelligence; it was only later that it came to be used in an increasing number of
other fields, e.g., cognitive psychology and natural language processing. The
current and still growing popularity of the term is due to the international
Semantic Web initiative. For some it may seem that this category, which has
emerged in the past few decades, is the product of informatics, but ontology has
always been a special field of philosophy. Therefore, if we want to thoroughly
understand the activities of information technology concerning ontology, it is
worth separating philosophical ontologies and the so-called industrial
ontologies from each other. Whatever we call them, the main value of this
separation for us is a more precise and less ambiguous
description of the inner structure and features of applicable ontologies.
Ontology building requires strict methodology, adequate ontology management
skills and the establishment of a robust infrastructure. We have to be prepared
that, in a short while, the unified ontology framework might have the function
to loosely connect different domain ontologies. This will require skills in
comparing ontologies and matching them loosely. One possible way of making
ontologies comparable is to connect them through top categories, which
requires formal logic tools and methodologies.
E-vocabulary -- Educational aid for examining contemporary Hungarian
literature and its vocabulary from multiple angles
(IHM-ITP-11 /106)
Duration: 2004-2005
In the framework of this project we have morphologically analysed and
disambiguated a 33 million word corpus containing texts of contemporary
Hungarian literature, forming part of the "Digital Literary Academy". We have
developed a related intelligent query interface, tables showing all possible
word forms of words, as well as word form-, word stem- and part-of-speech based
frequency lists. These developments are what the electronic curriculum in
Sulinet STD is based on.
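The three kinds of frequency list mentioned above can be illustrated with a short sketch. The analysed tokens below are invented sample data and the tag names are assumptions; they do not reproduce the project's corpus or tagset.

```python
from collections import Counter

# Hypothetical analysed tokens: (word form, stem, part of speech).
TOKENS = [
    ("házak", "ház", "NOUN"),
    ("házban", "ház", "NOUN"),
    ("olvas", "olvas", "VERB"),
    ("olvasott", "olvas", "VERB"),
    ("házak", "ház", "NOUN"),
]


def frequency_lists(tokens):
    """Build word-form, stem and part-of-speech frequency lists."""
    forms = Counter(form for form, _stem, _pos in tokens)
    stems = Counter(stem for _form, stem, _pos in tokens)
    tags = Counter(pos for _form, _stem, pos in tokens)
    return forms, stems, tags
```

Because Hungarian is morphologically rich, the stem-based list groups together many surface forms that the word-form list keeps apart, which is why both views are useful.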
Intelligent multilingual document classification in EUROVOC system (ITEM
2003/000165)
Duration: January 2004 - December 2004
Partners:
* JRC-IPSC
* MorphoLogic Ltd. Budapest
The project aimed at developing a multilingual system which automatically
classifies documents according to their content following the categories of the
EUROVOC categorisation system (thesaurus), which is regularly used in the
European Union. During the project, the Hungarian version of the whole EUROVOC
system was developed, along with the technology for the automatic
content-based classification of texts, primarily in Hungarian, English, German and
French.
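To make the idea of content-based classification against thesaurus categories concrete, here is a deliberately simple sketch that scores a text by counting trigger words per descriptor. The descriptors, trigger words and scoring method are assumptions made for illustration; the project's actual classification technology is not described here and was certainly more sophisticated.

```python
from collections import Counter

# Hypothetical thesaurus descriptors with multilingual trigger words.
DESCRIPTORS = {
    "agriculture": {"farm", "wheat", "búza", "termés"},
    "transport": {"railway", "road", "vasút", "út"},
}


def classify(text, top_n=1):
    """Return up to top_n descriptors whose trigger words occur in the text."""
    words = text.lower().split()
    scores = Counter()
    for descriptor, triggers in DESCRIPTORS.items():
        scores[descriptor] = sum(1 for word in words if word in triggers)
    # Keep only descriptors that matched at least once.
    return [name for name, score in scores.most_common(top_n) if score > 0]
```

Sharing one descriptor set across languages is what makes the toy classifier multilingual: a Hungarian and an English text about railways both map to the same category.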
Intelligent electronic dictionary and lexical database (INLEX) 48/ 2002
ITEM project
Duration: 2003-2004
The aim of the project was to develop an up-to-date, electronic,
machine-readable dictionary and lexical database which satisfies the needs of
the information society, and to make it searchable via the internet. The
database was created by means of a technology which follows and applies
international standards. It was sufficiently explicit and practical, conforming
to the needs of computer-based applications, and thus it could provide
up-to-date information which flexibly adapts to the needs of language technology,
scientific research, education, or those of the general public. The project was
essentially based on one technological and one content source. In the course of
the CONCEDE (Consortium for Central European Dictionary Encoding) project, in
which the Department of Corpus Linguistics also took part, a representation
formalism was developed which, being based on international standards but taking
into account the specific features of individual languages, including Hungarian,
is capable of coding and storing lexical information in a way which satisfies
the above requirements. The INLEX project aimed to use this technological basis
and to develop it further. The source of the content of the electronic
dictionary is the Concise Hungarian Explanatory Dictionary, compiled at the
Research Institute for Linguistics, which was also the basis for the CONCEDE
project, although the processing of the whole dictionary could not be carried
out in the framework of the latter, and only some parts of it were used as test
data.
Machine learning of syntax rules (application of machine learning methods for
the generation of Hungarian syntactic rules)
Project type: Info-communication Technologies and Applications (IKTA) 37/2002
RTD project
Duration: 01. October 2002 - 31. October 2004
Funding: Ministry of Education
Consortium members:
* University of Szeged, Department of Informatics, HLT Group (coordinator)
* MorphoLogic Ltd. Budapest
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics
Parsing, or syntactic analysis of texts plays a key role in natural language
processing (NLP). Similarly to many other languages, Hungarian heavily relies on
the use and interrelation of suffixes (morphemes) and elementary word structures
(syntagmas). The recognition of syntagmas and the identification of their relations
to each other are essential in NLP systems. Without this, semantic analysis of
natural language sentences would not be feasible. Also, artificial
intelligence programs could work much more efficiently with the introduction of
thorough syntactic analysis. Promising fields of application include machine
translation, automatic information extraction, and text analysis for scientific
or commercial purposes.
Research groups studying the structure of Hungarian sentences have made a great
effort to produce a consistent syntax rule system, yet these have not been
adaptable to practical, computer-related purposes so far. This implies that
there is a strong demand for the development of a technology that would be able
to divide a Hungarian sentence into syntactic segments, recognize their
structure and, based on this recognition, assign an annotated tree
representation to each sentence. Such so-called treebank representations have
already been developed for most West European languages, and for some Central and
East European languages as well.
In relation to the above, the project's main goal was twofold. On the one hand,
we aimed to develop a general purpose syntactic parser for Hungarian, with the
support of machine learning algorithms. An inevitable precondition of the
technology behind a syntactic parser that has the required efficiency is the
existence of a syntactically annotated Hungarian language corpus of suitable
size (a treebank), which can serve as learning database for the machine learning
system, and also as a basic reference for future similar research. Therefore,
another aim of the project was to develop such a treebank.
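For readers unfamiliar with treebanks, the sketch below shows one simple way to hold an annotated tree for a sentence and to print it in bracketed form. The labels, the example analysis and the helper names are illustrative assumptions and do not reproduce the annotation scheme used in the project.

```python
# A toy constituency tree as nested tuples: (label, child, child, ...).
# Leaves are plain strings. Labels and the analysis are illustrative only.
TREE = (
    "S",
    ("NP", ("DET", "A"), ("N", "fiú")),
    ("VP", ("V", "olvas"), ("NP", ("DET", "egy"), ("N", "könyvet"))),
)


def to_brackets(node):
    """Render a tree in bracketed notation."""
    if isinstance(node, str):
        return node
    label, *children = node
    rendered = " ".join(to_brackets(child) for child in children)
    return "(" + label + " " + rendered + ")"


def leaves(node):
    """Return the words of the sentence in their original order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words
```

A treebank is then a collection of such trees. A machine learning system can read off the rules each tree implies (for example, that an NP may consist of DET followed by N) and generalise from them.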
Information Extraction from Short Business News
Project type: National Research and Development Programme (NKFP) 2/17/2001
project
Duration: 01. July 2001 - 31. July 2003
Funding: Ministry of Education
Consortium members:
* MorphoLogic Ltd. Budapest (coordinator)
* University of Szeged, Department of Informatics, HLT Group
* Research Institute for Linguistics at HAS, Department of Corpus Linguistics
The central aim of the project was to develop a technology which is capable of
content-analysis and information-retrieval, with the help of which the relevant
information could be obtained in a structured form from texts (from short
business news). During the IE process, first textual data (natural language text)
had to be parsed for relevant information, then the identified information had
to be extracted and stored in a pre-defined structure. It was important that the
system disregards irrelevant information, and that the structured data can be
easily managed and queried by automated means. To accomplish this goal,
participants represented the most typical events of business life by so-called
semantic frames. The recognition of semantic frames was supported by shallow
syntactic parsing methods. Consortium members applied machine learning
algorithms for determining shallow syntactic rules. The learning process was
conducted on the Szeged Treebank 1.0 already containing hierarchic noun phrase (NP)
annotation and the marking of clause boundaries.
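The idea of filling a pre-defined structure from free text can be sketched with a single hand-written pattern. Everything here is an assumption made for illustration: the frame type, its slots, the regular expression and the English example sentences. The project itself worked on Hungarian and used shallow syntactic parsing with learned rules rather than a regular expression.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AcquisitionFrame:
    """A hypothetical semantic frame for a company acquisition event."""
    buyer: str
    target: str
    amount: Optional[str] = None


# Hypothetical surface pattern; English is used here for readability.
PATTERN = re.compile(
    r"(?P<buyer>[A-Z]\w+) (?:acquired|bought) (?P<target>[A-Z]\w+)"
    r"(?: for (?P<amount>[\d.]+ (?:million|billion) \w+))?"
)


def extract_frames(text):
    """Return one filled frame per pattern match; other text is ignored."""
    return [
        AcquisitionFrame(m.group("buyer"), m.group("target"), m.group("amount"))
        for m in PATTERN.finditer(text)
    ]
```

Sentences that do not match the pattern simply produce no frame, which mirrors the requirement that the system disregard irrelevant information. The resulting frame objects can be stored and queried like ordinary records.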
A by-product of the project was an annotated corpus of Hungarian, which serves
as a reference for future linguistic and language technological research.
2000-2002: MATCHPAD (Machine Translation for the Czech, Polish and Hungarian Public
Administration)
1998-2000: CONCEDE (Consortium for Central European Dictionary Encoding) COPERNICUS project
1997-2002: TELRI (Trans European Language Resources Infrastructure) project
1995-1998: MULTEXT-EAST (Multilingual Text Tools and Corpora) COPERNICUS project