Index

Contact Point Metashare/f054b2a06aff11e284b6000423bfd61ccd381836ce664f3cae0507cbeb3e61f1#contact Person
Creator Joanna Świetlicka
Description Summarizer is a tool for creating short text summaries. It uses an extractive method, i.e. the output consists of sentences taken from the original text. The tool uses a number of machine learning algorithms, including neural networks, linear regression, Bayesian networks and decision trees. The output sentences are chosen based on several signals, such as the length of the sentence, its position in the text structure and properties of the words it contains. The system was trained specifically on newspaper articles in Polish. It can, however, be adapted to other kinds of documents and languages.
Rights CC-BY
Source META-SHARE
Title Summarizer
Type Tool Service
Contact Point Metashare/8489146281b611e2892a000c29bfc0d445c15e3917d74997925aafd9d08a24bc#contact Person
Contributor Bálint Pál Tóth
Mátyás Bartalis
Tamás Bőhm
Tamás Gábor Csapó
Klára Laczkó
Creator Csaba Zainkó
Tamás Bőhm
Description The read speech database contains sentences from weather forecast news. The sentence collection represents the four seasons. This database can be used for analysing speech characteristics in weather forecast news and also as the basic speech database of a corpus-based Concept-to-Speech system.
Rights MS-C-NoReD-FF
Source META-SHARE
Title Read speech database in Hungarian
Type Corpus
Contact Point Metashare/c366848692c211e28763000c291ecfc8720a7e22a70f48ec960d5887b7e4a007#contact Person2
Metashare/c366848692c211e28763000c291ecfc8720a7e22a70f48ec960d5887b7e4a007#contact Person
Creator Jimmy O'Reagan
Description This is the LMF version of the Apertium bilingual dictionary for the French and Catalan languages. Bilingual LMF dictionaries were generated from Apertium bilingual dix files. For each Apertium bilingual correspondence, the corresponding source and target monolingual entries (LexicalEntry) were generated in addition to the bilingual correspondence (SenseAxis) element. Apertium is a free/open-source machine translation platform, initially aimed at related-language pairs but recently expanded to deal with more divergent language pairs (such as English-Catalan). The platform provides: a language-independent machine translation engine; tools to manage the linguistic data necessary to build a machine translation system for a given language pair; and linguistic data for a growing number of language pairs.
Language Catalan
French
Rights GPL
Source META-SHARE
Title French-Catalan LMF Apertium Bilingual dictionary
Type Lexical Conceptual Resource
Contact Point Metashare/bcd5238492c211e28763000c291ecfc8ad3bef674daf4e53abfe5f0ab2061764#contact Person
Metashare/bcd5238492c211e28763000c291ecfc8ad3bef674daf4e53abfe5f0ab2061764#contact Person2
Creator Paul Breen
Jimmy O'Reagan
Description This is the LMF version of the Apertium bilingual dictionary for the English and Catalan languages. Bilingual LMF dictionaries were generated from Apertium bilingual dix files. For each Apertium bilingual correspondence, the corresponding source and target monolingual entries (LexicalEntry) were generated in addition to the bilingual correspondence (SenseAxis) element. Apertium is a free/open-source machine translation platform, initially aimed at related-language pairs but recently expanded to deal with more divergent language pairs (such as English-Catalan). The platform provides: a language-independent machine translation engine; tools to manage the linguistic data necessary to build a machine translation system for a given language pair; and linguistic data for a growing number of language pairs.
Language English
Catalan
Rights GPL
Source META-SHARE
Title English-Catalan LMF Apertium Bilingual dictionary
Type Lexical Conceptual Resource
Contact Point Metashare/e2bc95b492c211e28763000c291ecfc8492939bf6d024f99932fa452e636dce4#contact Person
Metashare/e2bc95b492c211e28763000c291ecfc8492939bf6d024f99932fa452e636dce4#contact Person2
Creator Ana Fernandez Montraveta
Irene Castellón
Glòria Vázquez
Description The original SenSem Spanish Corpus includes syntactic and semantic annotations for a number of Spanish texts from the press domain developed by the GRIAL group (Grup de recerca consolidat de la Generalitat de Catalunya). The corpus contains one million words, 300,000 of which were manually annotated at the syntactic and semantic levels with syntagmatic categories, syntactic functions and semantic roles.
Language Spanish
Rights GPL
Source META-SHARE
Title GrAF version of the SenSem Spanish Corpus
Type Corpus
Contact Point Metashare/c7ed5fe892c211e28763000c291ecfc812a45eb9516243febf0d8af2fb26d6ba#contact Person2
Metashare/c7ed5fe892c211e28763000c291ecfc812a45eb9516243febf0d8af2fb26d6ba#contact Person
Creator Daniel Vicente Quílez
Description This is the LMF version of the Asturian Freeling lexicon. FreeLing is a developer-oriented library providing language analysis services. FreeLing is designed to be used as an external library from any application requiring this kind of service. Nevertheless, a simple main program is also provided as a basic interface to the library, which enables the user to analyze text files from the command line.
Rights GPL
Source META-SHARE
Title Asturian LMF Freeling Lexicon
Type Lexical Conceptual Resource
Contact Point Metashare/49b5d6ee6b0011e284b6000423bfd61c376f72c09111449181f010b89704ee07#contact Person
Creator Anna Kibort
Description A collection of sentences from the Polish National Corpus, parsed with the LFG grammar, represented as syntactic trees and analysed for grammatical functions.
Rights GPL
Source META-SHARE
Title LFG Treebank for Polish
Type Language Description
Contact Point Metashare/e5b2c1e463f111e2bff4525400d7614781a909d5ecf14df0b0bc94037575dc47#contact Person
Contributor Maciej Buczek
Creator Maciej Buczek
Description A set of Wikipedia-derived English-Polish and Polish-English thematic dictionaries, available for download under a Creative Commons license and of potential use in NLP applications. The dictionaries are based on existing Wikipedia categories, but they have also been manually checked for inappropriately placed entries. The following subjects are covered in this batch of dictionaries: American universities, world cities and villages, Polish artists, Polish journalists, Polish scientists, Polish politicians, Polish companies, Polish catastrophes, Polish media, Polish organizations, Polish universities. The dictionaries are stored in the RDF (Resource Description Framework) format, which is a method for conceptual description or modeling of information that allows storage of additional information, in this case the Wikipedia categories to which the individual entries belong. The categories presented do not reflect the exact Wikipedia structure, but rather conceptual relations between the entries.
Language Polish
English
Rights CC-BY
Source META-SHARE
Title ECL Dictionaries
Type Lexical Conceptual Resource
Contact Point Metashare/7bf352e4a37611e3960f001dd8b71c190edea60c81af4aeda8b6ff48947eda32#contact Person
Creator Anne-Kathrin Schumann
Description Two small web corpora and lists of the knowledge-rich contexts found in these corpora.
Language Russian
German
Rights MSCommons-BY-NC-SA
Source META-SHARE
Title German and Russian gold standard for knowledge-rich context extraction
Type Corpus
Contact Point Metashare/e7a26ae063f111e2bff4525400d7614774d4073e4a4348a8addc4d12fddf7a9a#contact Person
Creator Piotr Pęzik
Description The PELCRA language detector is a Java tool, developed by the PELCRA team at the University of Łódź, for detecting the language of an arbitrary stretch of text. It is available under the GPL licence. The first version of this tool contains a model for distinguishing between Polish and English. The language detector uses a naive Bayes classifier for language detection. The API of the application makes it possible to provide training data for developing detectors for other languages.
Rights GPL
Source META-SHARE
Title PELCRA Language Detector
Type Tool Service
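The naive Bayes approach mentioned in the PELCRA Language Detector description can be sketched roughly as follows. This is a minimal illustration in Python, not the tool's Java API: the class and method names are hypothetical, and the use of character trigrams as features, add-one smoothing and a uniform prior over languages are assumptions, since the description does not say which features or smoothing the tool uses.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def char_ngrams(text: str, n: int = 3) -> List[str]:
    """Split lowercased text into overlapping character n-grams (assumed feature set)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageDetector:
    """Hypothetical detector; names are illustrative, not the PELCRA API."""

    def __init__(self) -> None:
        self.counts: Dict[str, Counter] = {}  # n-gram counts per language
        self.vocab: Set[str] = set()          # all n-grams seen in training

    def train(self, language: str, text: str) -> None:
        """Add training text for a language; may be called repeatedly."""
        grams = char_ngrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.vocab.update(grams)

    def detect(self, text: str) -> str:
        """Return the language with the highest log-likelihood for the text."""
        if not self.counts:
            raise ValueError("no training data provided")
        grams = char_ngrams(text)
        best_lang, best_score = "", -math.inf
        for lang, counter in self.counts.items():
            total = sum(counter.values())
            # Add-one smoothing with one extra slot for unseen n-grams;
            # the prior over languages is assumed uniform and omitted.
            denom = total + len(self.vocab) + 1
            score = sum(math.log((counter[g] + 1) / denom) for g in grams)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

Supporting a further language in this sketch only requires another call to train with text in that language, which mirrors the idea of supplying training data for additional detectors.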
Contact Point Metashare/3d0babc66b0011e284b6000423bfd61c0a8db49a4b1a4492b11a2e64bdc4e71f#contact Person
Creator Grzegorz Murzynowski
Description Anotatornia is a tool for the manual on-line annotation of corpora at various linguistic levels. The levels currently implemented are: word-level and sentence-level segmentation, morphosyntax and word sense disambiguation. Anotatornia implements sophisticated mechanisms for managing texts, annotators and conflicts.
Rights GPL
Source META-SHARE
Title Anotatornia
Type Tool Service
Contact Point Metashare/fcdcb6806aff11e284b6000423bfd61cba534014faef4c6d8a8adbe5a6406b8c#contact Person
Metashare/fcdcb6806aff11e284b6000423bfd61cba534014faef4c6d8a8adbe5a6406b8c#contact Person2
Creator Bartosz Zaborowski
Michał Lenart
Szymon Acedański
Description Pantera is a morphosyntactic tagger based on Brill's Algorithm adapted for morphologically rich languages, e.g. Polish.
Rights GPL
Source META-SHARE
Title PANTERA
Type Tool Service
Contact Point Metashare/e152863e92c211e28763000c291ecfc8e5ffae34b3514ec2aca79c6f2ed0a3cd#contact Person
Metashare/e152863e92c211e28763000c291ecfc8e5ffae34b3514ec2aca79c6f2ed0a3cd#contact Person2
Creator Glòria Vázquez
Irene Castellón
Ana Fernandez Montraveta
Description This is the LMF version of the SenSem database created by the Spanish Inter-University Research Group GRIAL. As part of the SenSem project, a corpus of sentences annotated at the semantic and syntactic levels was created. The source corpus is made up of around 13 million words extracted from the online versions of a Spanish newspaper. From this corpus, 25,000 sentences were randomly selected, 100 for each of the 250 most frequent verbs in current Spanish. Each sentence is labelled with the verb sense it exemplifies, the type of complements it takes (arguments or adjuncts) and their syntactic category and function; each argument is also labelled with a semantic role. Each sentence is further annotated for its semantics, covering both aspectual information and the type of construction expressed. From this annotated corpus, a lexical database of verbs was created that collects all of this information. The unit of description of the verbs is the sense. The description of each verb includes its argument structure, incorporating subcategorization patterns with their frequencies, semantic roles and information on sentence semantics. The lexicon and the corpus are linked at the sense level and together form what we call the data bank of the sentential semantics of Spanish verbs. Both resources are available via the web and, we hope, will be a useful source of linguistic information for different areas of natural language processing and for linguistic research in general. The LMF conversion was done by the Universitat Pompeu Fabra.
Language Catalan
Rights GPL
Source META-SHARE
Title LMF version of the SenSem Catalan Data Base
Type Lexical Conceptual Resource
Contact Point Metashare/f251f0ea6aff11e284b6000423bfd61c81cbf285575e47288bb726a6146cf32c#contact Person
Creator Dawid Weiss
Description Morfologik-stemming is a library featuring morphological analysis, spelling correction, and building of finite-state automata for these purposes. It is bundled with a morphological dictionary for Polish, Morfologik.
Rights BSD-style
Source META-SHARE
Title morfologik-stemming
Type Tool Service
Contact Point Metashare/ec2cbb6e6aff11e284b6000423bfd61cca437e34902c40e790b23005f5fc3c43#contact Person
Creator Leszek Manicki
Description The multilingual lexicon of toponyms (WikiTopoPl) contains a list of over 155,000 Polish geographical proper names (countries, cities, regions, hydronyms, etc.) and their equivalents in Bulgarian, German, Modern Greek, English, Croatian, Hungarian, Romanian, Slovak and Serbian. These data (whenever available) have been automatically extracted from the open encyclopedia Wikipedia. The Wikipedia categories attached to the lexicon entries have been mapped to a short list of succinct categories compliant with Prolexbase, a multilingual ontology of proper names.
Language English
Rights CC-BY-SA
Source META-SHARE
Title Multilingual lexicon of toponyms
Type Lexical Conceptual Resource
Contact Point Metashare/c6aac80092c211e28763000c291ecfc89c4076c3e0bf42daa7d59a5ca6f3a51e#contact Person
Metashare/c6aac80092c211e28763000c291ecfc89c4076c3e0bf42daa7d59a5ca6f3a51e#contact Person2
Creator Carmen Armentano Oller
Description This is the LMF version of the Apertium bilingual dictionary for the Portuguese and Catalan languages. Bilingual LMF dictionaries were generated from Apertium bilingual dix files. For each Apertium bilingual correspondence, the corresponding source and target monolingual entries (LexicalEntry) were generated in addition to the bilingual correspondence (SenseAxis) element. Apertium is a free/open-source machine translation platform, initially aimed at related-language pairs but recently expanded to deal with more divergent language pairs (such as English-Catalan). The platform provides: a language-independent machine translation engine; tools to manage the linguistic data necessary to build a machine translation system for a given language pair; and linguistic data for a growing number of language pairs.
Language Catalan
Portuguese
Rights GPL
Source META-SHARE
Title Portuguese-Catalan LMF Apertium Bilingual dictionary
Type Lexical Conceptual Resource
Contact Point Metashare/da18da4492c211e28763000c291ecfc8073df62196504b1a87288374d5635d39#contact Person2
Metashare/da18da4492c211e28763000c291ecfc8073df62196504b1a87288374d5635d39#contact Person
Creator Jimmy O'Reagan
Paul Breen
Description This is the LMF version of the Apertium English dictionary. The monolingual dictionary for English was generated from the Apertium expanded lexicon of the en-es pair system (English/Spanish). Apertium is a free/open-source machine translation platform, initially aimed at related-language pairs but recently expanded to deal with more divergent language pairs (such as English-Catalan). The platform provides: a language-independent machine translation engine; tools to manage the linguistic data necessary to build a machine translation system for a given language pair; and linguistic data for a growing number of language pairs.
Language English
Rights GPL
Source META-SHARE
Title English LMF Apertium Dictionary
Type Lexical Conceptual Resource
Contact Point Metashare/76d0e4dc6b2f11e281b65cf3fcb88b707f72aec5f99944c79f02b6e4b1b69f6e#contact Person
Creator Angel Genov
Description The Bulgarian Grammar checker is based on a language model derived from the frequency list of the annotated Bulgarian National Corpus. It checks 893,626,788 3-grams with POS tags, including punctuation. The results show the probability that an arbitrary 3-gram with part-of-speech tags is valid in the language model. The language model is implemented as finite automata. For each sentence, the model applies 3-grams consecutively, and those that fall below the threshold are flagged as potential errors.
Rights CC-BY-NC
Source META-SHARE
Title Bulgarian Grammar checker web service
Type Tool Service
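The thresholding step described for the Bulgarian Grammar checker can be sketched as follows. This is a minimal illustration under stated assumptions, not the web service's implementation: the probability table, the tag names and the threshold value are hypothetical, and a plain dictionary stands in for the finite automata mentioned in the description.

```python
from typing import Dict, List, Tuple

Trigram = Tuple[str, str, str]

# Hypothetical probabilities for POS-tag 3-grams. The real model is derived
# from the annotated Bulgarian National Corpus; these values are made up.
TRIGRAM_PROB: Dict[Trigram, float] = {
    ("DET", "ADJ", "NOUN"): 0.12,
    ("ADJ", "NOUN", "VERB"): 0.08,
    ("NOUN", "VERB", "PUNCT"): 0.05,
    ("DET", "VERB", "ADJ"): 0.0004,
}


def flag_trigrams(tags: List[str], threshold: float = 0.001) -> List[Tuple[int, Trigram]]:
    """Slide over the tag sequence and return (start index, 3-gram) for every
    3-gram whose probability is below the threshold."""
    flagged = []
    for i in range(len(tags) - 2):
        trigram = (tags[i], tags[i + 1], tags[i + 2])
        # Assumption: 3-grams missing from the table get probability 0.0
        # and are therefore always flagged.
        if TRIGRAM_PROB.get(trigram, 0.0) < threshold:
            flagged.append((i, trigram))
    return flagged
```

Calling flag_trigrams with the POS tags of one sentence returns the positions of the suspicious 3-grams, which a caller could map back to the corresponding words.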
Contact Point Metashare/1ebc2bf4703d11e28a985ef2e4e6c59e8cead57cc3314c4bb12a87eb058428bd#contact Person
Creator Nives Mikelić Preradović
Description The Croatian Valency Lexicon of Verbs, Version 2.0008 (CROVALLEX 2.0008) is an attempt at a formal description of the valency frames of Croatian verbs. CROVALLEX 2.0008 was developed as part of the PhD thesis titled Approaches to the Development of the Machine Lexicon for Croatian Language, written by Nives Mikelic Preradovic and supervised by prof.dr.sc. Damir Boras at the Department of Information Sciences, Faculty of Humanistics and Social Sciences, Zagreb University. The Functional Generative Description (FGD), developed by the Czech linguist Petr Sgall and his collaborators since the 1960s, is used as the background theory in CROVALLEX 2.0008 for the description of the valency frames of selected verbs. CROVALLEX 2.0008 contains roughly 1740 verbs. They were selected from the Croatian frequency dictionary according to their number of occurrences. The preparation of this version of CROVALLEX took around three years.
Rights CC-BY-NC-SA
Source META-SHARE
Title Croatian Valency Lexicon
Type Lexical Conceptual Resource
Contact Point Metashare/159569246b0011e284b6000423bfd61c4ec4448e92954219a8567f745b178f5b#contact Person
Creator Łukasz Kobyliński
Description CorpCor is a web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. National Corpus of Polish).
Rights GPL
Source META-SHARE
Title CorpCor
Type Tool Service