Linguistic Resources for Natural Language Processing

Academic year 2022/2023

Course ID

STU0674

Teaching staff

Cristina Bosco (Lecturer)
Viviana Patti (Lecturer)

Degree course

Language Technologies and Digital Humanities

Year

1st year

Teaching period

First semester

Type

Distinctive

Credits/Recognition

Course disciplinary sector (SSD)

INF/01 - informatics

Delivery

Formal authority

Language

Italian

Attendance

Obligatory

Type of examination

Oral

Prerequisites

There are no prerequisites for this course.

The course provides the basic notions of computational linguistics and Natural Language Processing (NLP), focusing mainly on linguistic resources. It introduces the main NLP tasks and places resources in their historical context (from rule-based approaches, to data-driven ones, to machine learning and neural networks). Drawing on examples from the wide variety of existing resources, it examines the methodology for developing different types of resources.

Learning outcomes

Students will have the opportunity to test tools for building and validating resources in different settings. They will learn to collect and annotate a language resource, to calculate the disagreement between annotators, and to create diagrams for data representation.

Course delivery

Lectures and laboratory activities

Learning assessment methods

Oral examination and practical exercises

Program

We will introduce the resources used in the NLP pipeline for morpho-syntactic analysis (text segmentation and tokenization, morpho-syntactic processing and part-of-speech tagging, syntactic parsing), in particular treebanks, as well as those exploited in semantic analysis (distributional semantics, ontology learning, open information extraction, latent semantic analysis) and pragmatic analysis (sentiment analysis). We will focus on the steps in creating resources (collection, selection, annotation, analysis and inter-annotator agreement measures) and the challenges to be addressed (ambiguity, genre variation, multilingualism, bias, variety of formats), and also on the steps involved in evaluating resources within NLP evaluation campaigns. Finally, practical and ethical considerations in the effective use of resources (datasets, lexicons, models) will be presented and discussed, through ethics statements about tools to navigate (research choices and communicative implications) and about the analyzed data (what is a research ethics statement, and why does it matter?).
A practical component of the course will give students the opportunity to test tools for building and validating resources in different settings (collecting and annotating a linguistic resource, calculating disagreement, building diagrams to represent data). Practical exercises will be assigned to test their ability in these tasks.
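
The disagreement calculation mentioned above can be illustrated with Cohen's kappa, a common chance-corrected agreement measure for two annotators. The syllabus does not say which measure the course uses, so the choice of kappa, the function name `cohen_kappa`, and the toy sentiment labels below are assumptions made only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; returning 1.0
        # here is a convention chosen for this sketch, kappa is undefined.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical sentiment annotations of eight items by two annotators.
ann_a = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]
ann_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

observed_disagreement = sum(a != b for a, b in zip(ann_a, ann_b)) / len(ann_a)
kappa = cohen_kappa(ann_a, ann_b)
print(f"observed disagreement: {observed_disagreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

In this toy case the annotators disagree on 2 of 8 items, and because each uses the two labels equally often, half of their agreement would be expected by chance. Kappa corrects the raw agreement for that chance level, which is why it is usually reported alongside the raw figure.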

Suggested readings and bibliography

Suggested readings for the first part of the course:

- On language variation and multilingualism: Emily Bender, "High Resource Languages vs Low Resource Languages", available at https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/#fn15

- Jurafsky & Martin, "Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" (third edition, 2022) is a very good, up-to-date general reference. In particular, students should read (the introductory part of) chapter 8, which introduces part-of-speech tagging, and (the introductory parts of) chapters 12, 13 and 14, which introduce syntactic analysis (parsing). The draft is made available by the authors at https://web.stanford.edu/~jurafsky/slp3/
