Corpus linguistics studies language use in large, machine-readable collections of text, i.e. corpora. In my own research, I focus on the language varieties used on the Internet and on the development of extremely large corpora compiled by automatically crawling the web. The Internet is a constantly growing source of information that has opened up revolutionary possibilities for many scientific disciplines. For instance, thanks to the billions of words available online, the quality of machine translation has improved tremendously, and people’s beliefs and entire nations’ mindscapes can be explored on an unprecedented scale. Paradoxically, the Internet’s extreme size and diversity also pose a serious threat to its usefulness. We develop computational methods to tackle these challenges and to analyze large volumes of language automatically.
Since 2018, I have been Associate Professor of Digital Linguistics at the University of Turku. Our research projects include “The Finnish Internet Parsebank” (Kone Foundation) and “A piece of news, an opinion or something else? Different texts and their automatic detection from the multilingual Internet” (Emil Aaltonen Foundation).