Analysis of Lithuanian texts: a case of moon and femininity


Details
Presenter: Evaldas (KTU / ISM)
Tools used: gensim, fastText
Abstract: The presentation will discuss the machine learning task of text classification. The corpus consisted of ASTRA stenograms: 110905 Lithuanian parliamentary transcripts from 147 speakers, recorded between March 1990 and December 2013. Texts were categorized by the speaker's political partisanship, the speaker's gender, and whether the transcript was recorded around a full-moon date. Four types of pre-processing were considered: original text, lemmatized text, text split into morphemes, and text translated to English. Lemmas and morphemes were obtained using semantika.lt, and the English translation using Google Translate. Feature sets investigated: 6 from gensim (3 Doc2Vec variants, LSI, LDA, RP), 1 from fastText (Sent2Vec), and 3 custom-made (morfologija, stilometNER, ontologija). Random forest was used both as a base-learner and as a meta-learner (in 7 "stacking" configurations). The experiments reveal which categories, which types of pre-processing, and which feature sets are the most successful for the texts analysed.
Language: EN
Image by presenter :)
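
The abstract does not say how transcripts were matched to full-moon dates. As an illustration only, the sketch below labels a timestamp as "around a full moon" using a mean-lunation approximation. The reference new moon (2000-01-06 18:14 UTC), the function names, and the 1.5-day window are assumptions made for this example, not details from the talk. The mean-phase estimate can drift from the true full moon by several hours, so the window is deliberately generous.

```python
from datetime import datetime, timezone

# Mean length of a lunar cycle (new moon to new moon), in days.
SYNODIC_MONTH_DAYS = 29.530588853
# Assumed reference epoch: a new moon on 2000-01-06 at 18:14 UTC.
REFERENCE_NEW_MOON = datetime(2000, 1, 6, 18, 14, tzinfo=timezone.utc)


def days_from_full_moon(moment):
    """Approximate distance in days between `moment` and the mean full moon
    of the lunar cycle that `moment` falls in."""
    if moment.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    elapsed_days = (moment - REFERENCE_NEW_MOON).total_seconds() / 86400.0
    # Python's % returns a non-negative result, so dates before the
    # reference epoch (e.g. 1990) are handled as well.
    moon_age = elapsed_days % SYNODIC_MONTH_DAYS
    return abs(moon_age - SYNODIC_MONTH_DAYS / 2)


def is_near_full_moon(moment, window_days=1.5):
    """Hypothetical labelling rule: True if `moment` lies within
    `window_days` of the approximated full moon."""
    return days_from_full_moon(moment) <= window_days
```

A transcript's recording date could then be turned into a binary class label with `is_near_full_moon(recorded_at)`. For precise phases, an astronomical ephemeris would be a better source than this approximation.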
