Index

Contact Point Metashare/7b10d004a37611e3960f001dd8b71c190dc084e401f34fbe95921abf1ac2cc7f#contact Person
Description A collection of comparable sentences from Wikipedia, obtained with the Lexacc tool developed in the ACCURAT project. Each sentence pair is assigned an alignment confidence score.
Language English
Rights CC-BY
Source META-SHARE
Title English-Lithuanian cross-linked collection of comparable sentences from Wikipedia
Type Corpus
Contact Point Metashare/f054b2a06aff11e284b6000423bfd61ccd381836ce664f3cae0507cbeb3e61f1#contact Person
Creator Joanna Świetlicka
Description Summarizer is a tool for creating short text summaries. It uses an extractive method, i.e. the output consists of sentences from the original text. The tool uses a number of machine learning algorithms, including neural networks, linear regression, Bayesian networks and decision trees. The output sentences are chosen based on different signals, such as the length of the sentence, its position in the text structure and properties of the words it contains. The system was trained specifically on newspaper articles in Polish. It can, however, be adjusted for other kinds of documents and languages.
Rights CC-BY
Source META-SHARE
Title Summarizer
Type Tool Service
Contact Point Metashare/a8bd02c25b2711e2a6e4005056b40024cb25f09c5d5442eca755d617c46060c8#contact Person
Description Frequency lists based on 0.5 million words of fiction texts (from 1992-1998) and 0.5 million words of newspaper texts (from 1995-1999). There are three frequency lists, giving words and their frequencies in the sub-corpora and in the whole corpus: 10 000 lemmas (with POS); the 1000 most frequent word forms; and 100 words representing only one of the sub-corpora, i.e. words that counted as frequent in one sub-corpus but were missing from the other.
Rights CC-BY
Source META-SHARE
Title Estonian Frequency Dictionary
Type Lexical Conceptual Resource
Contact Point Metashare/c855da065a6811e29a5400504503039c72725b558f9b4f1bae4c5a3b5ac39aa8#contact Person
Description A corpus of texts from Göteborgsposten
Language Swedish
Rights CC-BY
other
Source META-SHARE
Title GP 2010
Type Corpus
Contact Point Metashare/f73b18cc5a6811e29a5400504503039ca4aa15146248480b8e13540df445473b#contact Person
Description Texts from Dramawebben, a digital archive of free Swedish drama.
Language Swedish
Rights CC-BY
other
Source META-SHARE
Title Dramawebben (demo)
Type Corpus
Contact Point Metashare/e5b2c1e463f111e2bff4525400d7614781a909d5ecf14df0b0bc94037575dc47#contact Person
Contributor Maciej Buczek
Creator Maciej Buczek
Description A set of Wikipedia-derived English-Polish and Polish-English thematic dictionaries, available for download under a Creative Commons license and of potential use in NLP applications. The dictionaries are based on existing Wikipedia categories, but they have also been manually checked for inappropriately placed entries. The following subjects are covered in this batch of dictionaries: American universities, world cities and villages, Polish artists, Polish journalists, Polish scientists, Polish politicians, Polish companies, Polish catastrophes, Polish media, Polish organizations, Polish universities. The dictionaries are stored in the RDF (Resource Description Framework) format, a method for conceptual description or modeling of information that allows additional information to be stored, in this case the Wikipedia categories to which the individual entries belong. The categories presented do not reflect the exact Wikipedia structure, but rather conceptual relations between the entries.
Language Polish
English
Rights CC-BY
Source META-SHARE
Title ECL Dictionaries
Type Lexical Conceptual Resource
Contact Point Metashare/93986504b00411e181cb080027f903f27029b47a99c5450ea96b5571ef1e1326#contact Person
Description This corpus is composed of the audio, the automatic transcriptions, the manual transcriptions and the translations into Portuguese of TED Talks by Al Gore (On averting climate crisis), Dan Dennett (On Our Consciousness) and Malcolm Gladwell (On Spaghetti Sauce).
Language Portuguese
English
Rights CC-BY
Source META-SHARE
Title TED talks
Type Corpus
Contact Point Metashare/55ad0a626b0011e284b6000423bfd61ce6794e6cc93243aca387c95ff3ccd40f#contact Person
Description A part of the PELCRA corpus, manually annotated with FrameNet semantic roles.
Rights CC-BY
Source META-SHARE
Title The Polish SRL corpus
Type Corpus
Contact Point Metashare/e3592ca45a6811e29a5400504503039c3a091f422c934b57a6cc90ef08889786#contact Person
Description Part of SOL - Spanish Online
Language Spanish
Rights CC-BY
Source META-SHARE
Title Banco de Datos de Once Novelas Españolas 1951—1971 (SOL)
Type Corpus
Contact Point Metashare/f9e499c663f111e2bff4525400d761477c36ad442d124e6892bb3c8ce1a1ecdf#contact Person
Metashare/f9e499c663f111e2bff4525400d761477c36ad442d124e6892bb3c8ce1a1ecdf#contact Person2
Contributor Piotr Pęzik
Łukasz Dróżdż
Creator Tomasz Szwelnik
Piotr Pęzik
Description SNUV (Spelling and NUmbers Voice database) is a spelling and number recognition speech database containing over 220 hours of recordings of Polish speakers reading numbers and spelling words, recorded as 22050 Hz, 16-bit *.wav files. 210 different participants were paid to produce a sample of their speech through an online spoken data collection platform. A written representation of the recordings is provided with the original sound files. The envisaged application of this resource is to enable the creation of automatic speech recognition (ASR) tools that allow users to spell out words and numbers to be recognized. SNUV has been released under a CC-BY license and can be used for both academic and commercial purposes free of charge.
Language Polish
Rights CC-BY
Source META-SHARE
Title Spelling and NUmbers Voice database
Type Corpus
Contact Point Metashare/642b58defccc11e18b49005056be118e3444ea5bb1dd46a5a4ca4829e93da406#contact Person
Description The Research Institute for the Languages of Finland (now the Institute for the Languages of Finland, Kotus) published the printed series Suomen kielen näytteitä (SKN) during the years 1978-2000. A total of 50 booklets appeared, each of which contains the transcripts of one-hour interviews with one female and one male dialect speaker, i.e. approximately two hours of dialect speech. The locations selected for the series represent the Finnish dialect regions well. The speakers were born in the late 19th century and were generally in their seventies to nineties at the time of the interview. The material comes mainly from the recordings in the Audio Archive of Finnish at Kotus. From the original SKN series, a database was created for the LAT platform, containing both the audio recordings and the text aligned with the audio. The transcripts can be browsed and the recordings listened to online, and the audio and annotation files can also be downloaded individually. The text and audio are aligned roughly, in segments the length of a sentence or a speaker turn. In addition, each word of the original transcript has been given a preliminary standard-language form; note that this normalization is only indicative and is intended solely to make searching easier. The original audio recordings have been processed by Sakari Pietarila. The text and audio have been manually aligned by My Sjöholm, Pauliina Liuska and Olli Miettinen. The file conversions for LAT were performed by Mietta Lennes. The normalized word readings have been created by Maria Vilkuna, Pauliina Liuska and Pinja Ruponen.
Language Finnish
Rights CC-BY
CLARIN_PUB
Source META-SHARE
Title Samples of Spoken Finnish
Suomen kielen näytteitä
Type Corpus
Contact Point Metashare/e99fa4c063f111e2bff4525400d761472dc239ffeb6f47bda0553af53ddd5ef0#contact Person2
Metashare/e99fa4c063f111e2bff4525400d761472dc239ffeb6f47bda0553af53ddd5ef0#contact Person
Contributor Piotr Pęzik
Łukasz Dróżdż
Creator Łukasz Dróżdż
Piotr Pęzik
Description A subset of the PELCRA Polish parallel corpora licensed under the CC-BY license. This resource contains 17 public-domain literary works and their English-Polish/Polish-English translations. The texts have been aligned manually on the sentence level. The texts are provided as TEI P5-compliant XML files with custom PELCRA extensions to mark complex translation equivalence types, and in the XLIFF format.
Language Polish
English
Rights CC-BY
Source META-SHARE
Title PELCRA Polish-English parallel corpus of literary works (CC-BY)
Type Corpus
Contact Point Metashare/829e4f1481b611e2892a000c29bfc0d490506849a6fa4f9a8e4b9f2a187a762f#contact Person
Description Database of portions of the text and audio version of a Hungarian novel. (The audio data is not stored in this database, but can be freely downloaded from librivox.org.) The recordings are segmented at speech pauses, which do not necessarily correspond to sentence boundaries. The reading is mostly, but not completely, accurate. Hence, an automatic speech recognizer was used to choose only those segments where there is a high match between the automatic recognition result and the original text. Thus the database comprises only those segments that are considered to have a reliable transcription. The database can be applied in speech technology research, in phonetic and phonological research, and for developing and testing speech recognition systems.
Rights CC-BY
Source META-SHARE
Title Hungarian Book (Egri csillagok/Eclipse of the Crescent Moon by Géza Gárdonyi) Reading Speech and Aligned Text Selection Database
Type Corpus
Contact Point Metashare/f4ff49b4a54b11e3960f001dd8b71c198b312dbb930d498c8f1e29d955aae233#contact Person
Description Probabilistic bilingual dictionaries for Slovak-English, generated from the DGT parallel corpus with the Giza++ tool in the TaaS project. The resource contains the original dictionaries as well as three different filtered versions of them. The methods applied in the filtering procedure are described in the following LREC 2014 paper: Aker, A., Pinnis, M., Paramita, M. L., & Gaizauskas, R. (2014). Bilingual dictionaries for all EU languages. In Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA).
Language English
Slovak
Rights CC-BY
Source META-SHARE
Title Probabilistic bilingual dictionaries from DGT parallel corpus for Slovak-English
Type Lexical Conceptual Resource
Contact Point Metashare/e608ef3e5a6811e29a5400504503039cb1f40387e3884f24b997699a88d3b783#contact Person
Description Part of the corpus for the technical language of health care and social care
Language Swedish
Rights CC-BY
other
Source META-SHARE
Title Läkartidningen medical journal 1996
Läkartidningen 1996
Type Corpus
Contact Point Metashare/b58d4a121df511e2a003080027f903f270b6b01056af4c3991a448d6c54aba26#contact Person
Description This corpus is composed of audio of native and non-native speakers of European Portuguese. It comprises read and spontaneous speech. The 'speakers_information.txt' file gives information about the speakers (age, gender, childhood district). However, this information is not available for some speakers.
Rights CC-BY
Source META-SHARE
Title Non-native Speech in European Portuguese for Computer-Assisted Language Learning
Type Corpus
Contact Point Metashare/d31816cc80c211e28763000c291ecfc8196367558a4d4eedbe94f19d50e0722e#contact Person
Description This lexicon was produced using an inductive SCF classifier, the tpc_subcat_inductive webservice in the PANACEA project. The lexicon was automatically produced from the PANACEA MCv2 crawled corpus, by parsing the data with the RASP parser (Third Release, Open-Source Version, February 2001, available from http://ilexir.co.uk; see also E. Briscoe, J. Carroll, and R. Watson, 2006, The Second Release of the RASP System, in Proceedings of COLING/ACL Interactive Presentation Sessions), and then processing the parsed data with tpc_subcat_inductive. Only verb lemmas with at least 200 instances in MCv2 were retained.
Language English
Rights CC-BY
Source META-SHARE
Title PANACEA English automatically acquired lexicon for ENV domain: Subcategorization Frames (V-SUBCAT)
Type Lexical Conceptual Resource
Contact Point Metashare/b61e3ef280c211e28763000c291ecfc8e6991430daa543b7b7649f4e2242f40d#contact Person
Description This is the Galician WordNet-LMF lexicon. The Galician lexicon is part of the Multilingual Central Repository (MCR, http://adimen.si.ehu.es/web/MCR) and contains 23399 lexical entries. The MCR currently integrates wordnets from five different languages in the same EuroWordNet framework: English, Spanish, Catalan, Basque and Galician. Its format was defined during the KYOTO Project (http://www.kyoto-project.eu/). The lexicon validates against kyoto_wn.dtd, which is also included in this distribution. The kyoto_wn.dtd is LMF compliant.
Language Galician
Rights CC-BY
Source META-SHARE
Title Galician WordNet-LMF
Type Lexical Conceptual Resource