Index

Contact Point Metashare/8489146281b611e2892a000c29bfc0d445c15e3917d74997925aafd9d08a24bc#contact Person
Contributor Bálint Pál Tóth
Mátyás Bartalis
Tamás Bőhm
Tamás Gábor Csapó
Klára Laczkó
Creator Csaba Zainkó
Tamás Bőhm
Description The read speech database contains sentences from weather forecast news. The sentence collection represents the four seasons. This database can be used for analysing speech characteristics in weather forecast news and also as the basic speech database of a corpus-based Concept-to-Speech system.
Rights MS-C-NoReD-FF
Source META-SHARE
Title Read speech database in Hungarian
Type Corpus
Contact Point Metashare/e5b2c1e463f111e2bff4525400d7614781a909d5ecf14df0b0bc94037575dc47#contact Person
Contributor Maciej Buczek
Creator Maciej Buczek
Description A set of Wikipedia-derived English-Polish and Polish-English thematic dictionaries, available for download under a Creative Commons license and of potential use in NLP applications. The dictionaries are based on existing Wikipedia categories, but they have also been manually checked for inappropriately placed entries. The following subjects are covered in this batch of dictionaries: American universities, world cities and villages, Polish artists, Polish journalists, Polish scientists, Polish politicians, Polish companies, Polish catastrophes, Polish media, Polish organizations, Polish universities. The dictionaries are stored in the RDF (Resource Description Framework) format, a method for the conceptual description or modeling of information that allows additional information to be stored, in this case the Wikipedia categories to which the individual entries belong. The categories presented do not reflect the exact Wikipedia structure, but rather conceptual relations between the entries.
Language Polish
English
Rights CC-BY
Source META-SHARE
Title ECL Dictionaries
Type Lexical Conceptual Resource
Contact Point Metashare/e82eb3046ba511e2aa7c68b599c26a066e18110360194f7eaf7d09a41fd722df#contact Person2
Metashare/e82eb3046ba511e2aa7c68b599c26a066e18110360194f7eaf7d09a41fd722df#contact Person
Contributor Tamás Péter Szabó
Description The present corpus contains semi-structured research interview texts recorded and transcribed for the purposes of a PhD project entitled 'Learning, following and disseminating language rules as a topic in the metalinguistic knowledge of students and their teachers' by Tamás Péter Szabó. 133 interviewees (partly students, partly teachers) were asked to speak about their linguistic routines and their opinions on various trends in language use. They were also asked to evaluate linguistic trends and to explain linguistic phenomena. Data collection was carried out in elementary schools, training colleges and grammar schools, in years 1–4, 7 and 11, in Hungary (Budapest and ten counties). Students (years 1–4, 7 and 11) and their teachers of Hungarian grammar and literature (years 7 and 11) were interviewed. The research interviews were conducted with one, two, three or, in exceptional cases, more interviewees. All of the interviews were recorded, transcribed and annotated by Tamás Péter Szabó.
Rights MS-NC-NoReD
Source META-SHARE
Title CHSM-IC: Corpus of Hungarian School Metalanguage – Interview Corpus
Type Corpus
Contact Point Metashare/f9e499c663f111e2bff4525400d761477c36ad442d124e6892bb3c8ce1a1ecdf#contact Person
Metashare/f9e499c663f111e2bff4525400d761477c36ad442d124e6892bb3c8ce1a1ecdf#contact Person2
Contributor Piotr Pęzik
Łukasz Dróżdż
Creator Tomasz Szwelnik
Piotr Pęzik
Description SNUV (Spelling and NUmbers Voice database) is a spelling and number recognition speech database containing over 220 hours of recordings of Polish speakers reading numbers and spelling words, stored as 22050 Hz, 16-bit *.wav files. 210 different participants were paid to produce a sample of their speech through an online spoken data collection platform. A written representation of the recordings is provided with the original sound files. The envisaged application of this resource is to enable the creation of automatic speech recognition (ASR) tools that allow users to spell out words and numbers to be recognized. SNUV has been released under a CC-BY license and can be used for both academic and commercial purposes free of charge.
Language Polish
Rights CC-BY
Source META-SHARE
Title Spelling and NUmbers Voice database
Type Corpus
Contact Point Metashare/910ece2a8be311e29256001517144592fc653419adb545cbad12ada09dc9fa18#contact Person
Contributor Miljana Mladenović
Creator Miljana Mladenović
Description Rhetorical Figures is a database for Serbian that consists of 98 rhetorical figures linked to the rhetorical figures for English listed at http://rhetfig.appspot.com/list. It is downloadable in XML format. The RhetFig tool was created for maintaining the database, adding examples to it and sorting by rhetorical types, linguistic types or linguistic operations.
Rights MS-NC-NoReD
Source META-SHARE
Title Rhetorical Figures for Serbian
Type Lexical Conceptual Resource
Contact Point Metashare/e99fa4c063f111e2bff4525400d761472dc239ffeb6f47bda0553af53ddd5ef0#contact Person2
Metashare/e99fa4c063f111e2bff4525400d761472dc239ffeb6f47bda0553af53ddd5ef0#contact Person
Contributor Piotr Pęzik
Łukasz Dróżdż
Creator Łukasz Dróżdż
Piotr Pęzik
Description A subset of the PELCRA Polish parallel corpora licensed under the CC-BY license. This resource contains 17 public-domain literary works and their English-Polish/Polish-English translations. The texts have been aligned manually on the sentence level. The texts are provided as TEI P5-compliant XML files with custom PELCRA extensions to mark complex translation equivalence types, and in the XLIFF format.
Language Polish
English
Rights CC-BY
Source META-SHARE
Title PELCRA Polish-English parallel corpus of literary works (CC-BY)
Type Corpus
Contact Point Metashare/7006fcde81b611e2892a000c29bfc0d46366f03ad99043119fd454184f04fa06#contact Person
Contributor Tamás Gábor Csapó
Creator Tamás Bőhm
Csaba Zainkó
Description This sentence corpus is supplied with yes/no accent markers on each word.
Rights MS-C-NoReD
Source META-SHARE
Title Accent marker database for Hungarian written sentences
Type Corpus
Contact Point Metashare/61caeaae64e211e2aa7c68b599c26a06a02655c476264027b093b7ae4abc9779#contact Person2
Metashare/61caeaae64e211e2aa7c68b599c26a06a02655c476264027b093b7ae4abc9779#contact Person
Contributor Eszter Simon
Dávid Márk Nemeskey
Description The text of the corpus is automatically generated from Hungarian Wikipedia articles. It contains Named Entity (NE) tagging according to the CoNLL standard (Person, Organization, Location and Miscellaneous), and additional morphological annotation. It is the largest NE-tagged corpus for Hungarian to date and can be used for training and testing NE recognizer applications. Thanks to the standard tagset, the performance of systems trained on the hunNERwiki corpus is comparable with the performance of other state-of-the-art systems. Besides the obvious advantage of a fully automatic building and annotation procedure (reduced annotation cost), the novelty of the corpus is its use of collaboratively constructed resources (Wikipedia, DBpedia).
Rights CC-BY-SA
Source META-SHARE
Title HunNERwiki: Automatically generated NE tagged corpus for Hungarian
Type Corpus
Contact Point Metashare/a89c02f4663d11e28a985ef2e4e6c59e76428bf02e394229a70428f25a839f75#contact Person
Contributor Nikola Ljubešić
Description CollTerm is a language-independent tool for collocation and term extraction. It collects collocation and term candidates either by applying five different co-occurrence measures to multiword units (i.e. collocations) or by applying the TF-IDF measure to single-word units to capture their distributional differences from a large representative corpus. The language-dependent part consists of a stop-word list and a list of MWU MSD patterns, which can also be coded as regular expressions. The application is described in the paper presented at TKE2012 by Pinnis, M., Ljubešić, N., Ştefănescu, D., Skadiņa, I., Tadić, Gornostay, T.: Term Extraction, Tagging, and Mapping Tools for Under-Resourced Languages. The first version of this application is available as an integral part of the ACCURAT Toolkit, which is available under the Apache 2.0 license (http://www.accurat-project.eu/index.php?p=accurat-toolkit). In this version of the tool, a calibration of MWU MSD patterns has been provided for Croatian, thus enhancing the usability of the tool. The plan is to provide calibration for other CESAR languages as well.
Rights ApacheLicence_2.0
Source META-SHARE
Title Collocation and Term Extractor
Type Tool Service
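The TF-IDF measure mentioned in the CollTerm description can be illustrated with a minimal sketch. This is not CollTerm's implementation: the function name `tf_idf`, the toy documents and the particular weighting (relative term frequency times the natural log of inverse document frequency) are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenised document.

    Hypothetical helper: relative term frequency multiplied by
    log(number of documents / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus (invented for this example).
docs = [
    ["the", "corpus", "contains", "the", "lexicon"],
    ["the", "tagger", "reads", "the", "text"],
    ["the", "lexicon", "is", "large"],
]
scores = tf_idf(docs)
```

A term that occurs in every document (here "the") receives a score of zero, while a term confined to one document ("corpus") outranks one shared by two documents ("lexicon"), which is the distributional contrast that makes the measure useful for picking single-word term candidates.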
Contact Point Metashare/a3c297c4486111e2a2aa782bcb07413510a103f90b964b3aa62a658afe18f904#contact Person
Contributor Dan Tufis
Creator Amália Mendes
Description The LT Corpus (Literary Corpus) contains approximately 1,781,083 running words of European and Brazilian Portuguese. It includes 70 copyright-free classics (61 from Portugal and 9 from Brazil) published before 1940.
Language Portuguese
Rights ELRA_END_USER
Source META-SHARE
Title LT Corpus
Type Corpus
Contact Point Metashare/cb3f1b3c8ca511e29b2e0015171445920a508eb19d5542a4b8a3823f22925c0e#contact Person
Contributor Ivana Tanasijević
Creator Ivana Tanasijević
Description This tool is an application for collecting and presenting multimedia information. It works with multimedia documents and enables database search using different criteria. For each multimedia document, descriptive metadata and links to its location (on the web or locally) are stored in the NDX database. The metadata define the search criteria, which are available through a web interface. The demo version illustrates the tool's functionality with some data about the CESAR project and its participants.
Rights GPL
Source META-SHARE
Title Organizing digitized material
Type Tool Service
Contact Point Metashare/88bbec4e81b611e2892a000c29bfc0d4024ef1f818174c7ea2a2ad7f06452ec1#contact Person
Contributor Tamás Gábor Csapó
Klára Laczkó
Mátyás Bartalis
Creator Tamás Bőhm
Csaba Zainkó
Description Phonetically balanced sentence set read by 10 speakers.
Rights MS-C-NoReD-FF
Source META-SHARE
Title Hungarian Read Speech Precisely Labelled Parallel Speech Corpus Collection
Type Corpus
Contact Point Metashare/77ea812efccc11e18b49005056be118e6c67459edb4342268e01f4eae3e5bbf7#contact Person
Contributor Wilhelmina Dyster
Mietta Lennes
Emilia Erkama
Minna Toivola
Esko Niiranen
Eija Aho
Jenni Korvala
Kristian Järventaus
Language Chinese
French
Finnish
Russian
Arabic
Romanian
English
Source META-SHARE
Title ProoF - Pronunciation of Finnish by Immigrants in Finland
Type Corpus
Contact Point Metashare/314b93d26b0011e284b6000423bfd61c36a51e4b609742288e99ba691f07dfdb#contact Person
Contributor Łukasz Dróżdż
Maciej Ogrodniczuk
Creator Łukasz Dróżdż
Maciej Ogrodniczuk
Description A corpus of the Centre for Eastern Studies (CES) texts. This resource contains 56 Polish-English texts (6 CES reports, 28 issues of CES studies and 22 issues of the CES publication "Point of View") licensed under the CC-BY-NC license. The texts have been aligned manually on the sentence level using the MemoQ software. The resource is provided as TEI P5-compliant XML files with custom extensions and in the XLIFF and TMX formats.
Language Polish
English
Rights CC-BY-NC
Source META-SHARE
Title Manually aligned CES Polish-English parallel corpus
Type Corpus
Contact Point Metashare/c8a540a0fb6711e2a8ad00237df3e3584159e2a550584f6d9af53132b5aeebeb#contact Person
Contributor Radu Ion
Creator Radu Ion
Description This corpus is a collection of strongly comparable English, French and Romanian documents collected from the http://ec.europa.eu/ website. The documents are sentence-split, POS-tagged, lemmatized and chunked, and are also sentence-aligned using Moore's sentence aligner (http://research.microsoft.com/pubs/68886/sent-align2-amta-final.pdf).
Language Romanian
English
French
Rights MSCommons-BY-NC-ND
Source META-SHARE
Title Strongly Comparable and Aligned Legal News EN-FR-RO News Corpus
Type Corpus
Contact Point Metashare/fb0e2c8663f111e2bff4525400d76147aef2393f2ffc496c9c1ea8e250a5a2cf#contact Person
Contributor Piotr Pęzik
Creator Piotr Pęzik
Description An API for the English version of the HASK dictionary of frequent word combinations automatically generated from the British National Corpus. Developed by the PELCRA group at the University of Łódź, HASK dictionaries are essentially phraseological databases meant to be used by linguists, language teachers, lexicographers, language materials developers, translators and other language professionals and casual dictionary users.
Rights CC-BY-NC
Source META-SHARE
Title HASK collocation dictionary (English)
Type Tool Service
Contact Point Metashare/bbe8ee646aff11e284b6000423bfd61cf73819a718ef48b9a9608b54de8bba8e#contact Person
Contributor Jakub Waszczuk
Creator Michał Lenart
Jakub Waszczuk
Description Nerf is a statistical tool for Named Entity Recognition (NER) based on the Conditional Random Fields (CRF) modelling method. The tool was constructed as part of the National Corpus of Polish project. It has been adapted to recognize tree-like structures of NEs (i.e., with recursively embedded NEs) using the Joined Label Tagging (JLT) method. The JLT method is a simple way of encoding NE structures as a sequence of labels. With this method, various additional categorical information about NEs (type, subtype, type of derivation) can be encoded at the level of labels and subsequently recognized using the resulting CRF model. The tool can be configured to use various types of observations during the training and recognition process, for example lexical information from the textual level or grammatical information from the morphosyntactic level.
Rights GPL
Source META-SHARE
Title Polish Named Entity Recognition Tool
Type Tool Service
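The idea of encoding nested NE structures as a flat label sequence, as described for Nerf, can be sketched as follows. This is an illustration only and not Nerf's actual encoding: the function `joined_labels`, the `B-`/`I-` prefixes, the `+` separator, the `O` label and the example tokens and NE types are all assumptions made for this sketch.

```python
def joined_labels(n_tokens, spans):
    """Encode properly nested NE spans as one joined label per token.

    spans: (start, end_exclusive, ne_type) tuples (hypothetical format).
    Each token's label lists every NE covering it, outermost first.
    """
    # Outer spans first: earlier start, then longer span.
    ordered = sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))
    labels = []
    for i in range(n_tokens):
        parts = []
        for start, end, ne_type in ordered:
            if start <= i < end:
                parts.append(("B-" if i == start else "I-") + ne_type)
        # Tokens outside every NE get the conventional "O" label.
        labels.append("+".join(parts) if parts else "O")
    return labels


# Invented example: a person name embedded in an organisation name.
tokens = ["Jan", "Kowalski", "University", "opened"]
spans = [(0, 3, "orgName"), (0, 2, "persName")]
labels = joined_labels(len(tokens), spans)
```

Because each token carries a single joined label, an ordinary sequence labeller such as a CRF can be trained on the result, and the nested structure can be read back from the label parts.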
Contact Point Metashare/e47ff96863f111e2bff4525400d76147c34334267af0462daa1f95b542c203d9#contact Person
Metashare/e47ff96863f111e2bff4525400d76147c34334267af0462daa1f95b542c203d9#contact Person2
Contributor Łukasz Dróżdż
Piotr Pęzik
Creator Piotr Pęzik
Łukasz Dróżdż
Description A subset of the PELCRA Polish spoken corpus licensed under the CC-BY-NC license. This resource contains 347 transcriptions of recordings made in the years 2000-2010. Individual headers may override the licensing information.
Language Polish
Rights CC-BY-NC
Source META-SHARE
Title PELCRA Polish spoken corpus (CC-BY-NC)
Type Corpus
Contact Point Metashare/f17648f466ce11e281b65cf3fcb88b70ed358b478abe44ceb8994a97debb017e#contact Person
Contributor Angel Genov
Description The Bulgarian Language Processing Chain includes the following types of text processing and linguistic annotation: sentence segmentation; tokenisation; POS tagging and grammatical annotation; lemmatisation. The Bulgarian POS tagger marks up each word with the most probable part of speech and unambiguous morphosyntactic information, chosen from the set of tags associated with that word. The tagger is based on SVM (Support Vector Machines) learning. It predicts the POS tag of a word based on a set of features describing the word and its context. These features are words, word bigrams and trigrams within a window of words around the currently tagged word; POS tags, POS tag bigrams and trigrams in the current window; and information about suffixes, prefixes, capitalization, hyphenation etc. for unknown words. The tagger is trained and tested on a manually POS-disambiguated corpus. The strategy chosen for training the Bulgarian tagger is: two passes in both directions; a window of five tokens, with the currently tagged word in the second position; bigrams and trigrams of words, tags or ambiguity classes; and lexical parameters such as prefixes, suffixes, sentence borders and capital letters. The trained model is applied to disambiguate texts. The precision of the tagger is currently 96.58%. The Bulgarian lemmatizer determines the lemma and a detailed morphosyntactic annotation for a given word form. The lemmatization is based on an unambiguous association between the tagger output and the information encoded in a large grammatical dictionary of the Bulgarian language. For tagging, a reduced tagset is used (75 word classes compared to 1029 unique grammatical tags in the dictionary), compiled so that the minimum information necessary for unambiguous association with the respective lemma is retained. A small number of rules and preferences are also implemented to limit ambiguity in lemmatization.
Some additional tools for advanced processing and annotation are available, as well as tools for the annotation and alignment of parallel texts at the sentential and subsentential level. A highly scalable web-service-based infrastructure was developed to provide easy access to the tools for text processing and annotation of Bulgarian. Three different types of access are provided: online access; access via a RESTful API; asynchronous access. Online access is suitable for users who occasionally need to process relatively small amounts of data. RESTful API access is suitable for software developers who want to integrate the processing tools into high-level applications. Asynchronous access is intended for processing large corpora: the user uploads the archived corpus, it is processed on the server, a notification email is sent upon completion of the task and the annotated corpus can be downloaded. The system is highly scalable and can be distributed across different machines. The service infrastructure consists of three main components: Frontend, Backend and TaskDispatcher, each of which can be deployed on a different machine. The Frontend component is responsible for implementing the access policies of the service APIs, error handling, logging, support for different return formats (XML, JSON, plain text) and communication with the Backend. The Frontend also provides the web UI through which the user controls asynchronous tasks: starting, stopping or monitoring a task and uploading or downloading data. The Backend performs the actual processing; it combines the Bulgarian tokenizer, sentence splitter, tagger and lemmatiser in a server application that handles the requests of the Frontend over TCP/IP. Even though the Frontend is implemented efficiently and can handle many requests simultaneously, several instances of the Frontend can be distributed across different machines whenever necessary.
The TaskDispatcher is responsible for managing the processes of the asynchronous tasks. It receives the start/stop commands from the Frontend and notifies the user by e-mail when the result is ready.
Rights other
Source META-SHARE
Title Web based infrastructure for Bulgarian data processing
Type Tool Service
Contact Point Metashare/8c13600ccd0711e1a404080027e73ea2f9cfd28f51d5437b8f5827c516c348fe#contact Person
Contributor Dan Tufis
Creator Amália Mendes
Description This lexicon includes multiword expressions (MWE) of European Portuguese extracted from a balanced 50.8M-word written corpus, a subcorpus of the Reference Corpus of Contemporary Portuguese (CRPC). This corpus covers different genres: it consists mainly of journalistic texts (59%), but also includes texts from literature (21%), magazines (15%), and miscellaneous texts, supreme court verdicts, parliament sessions and leaflets (5%). The MWE lexicon covers 1,198 lemmas (composed of single words from different POS categories: nouns, adjectives, verbs and adverbs), a total of 12,753 MWE lemmas (which include inflectional variants of the MWE lemmas) and 242,233 manually verified concordances of those MWEs.
Rights underNegotiation
Source META-SHARE
Title LEX-MWE-PT: Word Combination in Portuguese Language
Type Lexical Conceptual Resource