Safeguarding Minors: A Study on Automatic Age-based Classification of Online Multimedia Content Using Text Transcripts
Antonietta Paone
2025-01-01
Abstract
Online multimedia content comprises elements that enrich the user experience through audio and visual material, such as videos, podcasts, live streams, and webinars. Videos whose content or language is unsuitable for children are often seen by minors, who can be attracted and influenced by what they see. Automatic content classification would make the identification of harmful content fast and applicable to large amounts of data. Furthermore, considering that all online multimedia content contains textual components, a linguistic approach is particularly useful. The aim of this thesis is to explore the automatic classification of texts related to multimedia content (such as subtitles and transcriptions) in order to assess the appropriateness of such content for different age groups, with a focus on protecting minors from exposure to unsuitable material. The study investigates machine learning-based approaches, comparing various classifiers, as well as semantic and lexical methods. In addition, resources have been developed for both Italian and English to support the classification process.

Chapter 1 outlines the background and current state of research on multimedia content classification, providing an overview of what has already been done on this problem in computational linguistics and identifying the elements that still need to be developed.

Chapter 2 illustrates the methodology. Two corpora were created: one for Italian and one for English. The initial idea was to focus exclusively on TED Talks, which cover a wide range of topics presented by experts in various fields. These talks are generally of high quality and well curated, making them well suited to linguistic and content analysis. In addition, TED Talks are available in many languages, which facilitates the creation of multilingual corpora and cross-lingual comparison. However, after some experiments, it was decided to add other texts to the corpora: texts specifically for children, extracted from YouTube Kids and other children’s channels, and texts specifically for adults, which deal with themes related to sex, drugs, violence, and fear, or are characterized by vulgar language. To enable supervised classification, the corpora were annotated following the AGCOM (Authority for Communications Guarantees) Guidelines on the Classification of Audiovisual Works Intended for the Web and Video Games (AGCOM 2017). Highly specialized lexicons were used to extract the linguistic indicators of the main features used to rate online multimedia content; the dictionaries created cover Violence, Drugs, and Bad language. In addition, a study of emotions was conducted, focusing particularly on emotions that indicate negative sentiment, such as Anger, Disgust, and Fear. The emotions were analyzed using the NRC Emotion Intensity Lexicon (Mohammad 2017), a list of words each assigned a value ranging from 0 to 1 for the eight main emotions theorized by Plutchik (1984).

The semantic information contained in the texts played a fundamental role in the classification. All texts considered for the analysis were represented in a network based on their semantic proximity. Specifically, a distributional semantic matrix was created to extract similarity values between words and to perform a semantic expansion of the most significant words in the transcripts. Once the texts were vectorized, the similarity between text vectors was calculated using cosine similarity. After all texts had been compared, generating a large network of text-to-text edges, the network was visualized with Gephi (Bastian et al. 2009), with the similarity value assigned as the edge weight. Modularity Class (Blondel et al. 2008) was then applied to the network to generate a classification.

Chapter 3 details the experimental procedure. Machine learning algorithms were evaluated to determine the best one for text categorization. The algorithms tested include Support Vector Machine, Naive Bayes, Decision Tree, Random Forest, Convolutional Neural Network, Recurrent Neural Network, BERT, DistilBERT, mBERT, RoBERTa, and, specifically for Italian, ELECTRA, AlBERTo, GilBERTo, and UmBERTo. Several large language models, such as Flan T5, GPT-2, and Minerva, were also tested. To further improve the results, a filtering rule based on a dictionary of bad words was added. This rule filters content whose language is not suitable for minors, ensuring that even if the system misclassifies a text, any text containing inappropriate language is still categorized as adult content.

As discussed in Chapter 4, the results are encouraging. In future work, the system could be applied to other types of online multimedia content, and an integrated analysis with other user-generated content, such as comments and descriptions, could be developed. Other elements, such as language complexity and sentence structure, could be integrated to enrich the analysis, and other languages could be taken into account. Furthermore, the work developed in this thesis on the textual component of multimedia content can be integrated with approaches that consider additional elements, such as metadata, images, and frames, to develop comprehensive content classification and monitoring systems.
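Two of the steps mentioned in the abstract, the cosine-similarity edges of the text network and the bad-word filtering rule, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the toy vectors, the labels `"all_ages"` and `"adult"`, and the placeholder entry `badword1` are all hypothetical and chosen only for illustration.

```python
import math
import re


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_edges(vectors, threshold=0.0):
    """Return (text_a, text_b, weight) for every pair of texts whose
    cosine similarity is strictly greater than the threshold."""
    ids = sorted(vectors)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            weight = cosine_similarity(vectors[a], vectors[b])
            if weight > threshold:
                edges.append((a, b, weight))
    return edges


def apply_badword_rule(predicted_label, text, badwords, adult_label="adult"):
    """Override the classifier's label when the text contains a listed word.
    Tokenization here is a simple assumption: lowercase word characters only."""
    tokens = set(re.findall(r"\w+", text.lower()))
    if tokens & badwords:
        return adult_label
    return predicted_label


# Toy text vectors (hypothetical); real vectors would come from the semantic matrix.
vectors = {
    "talk_a": [1.0, 0.0, 1.0],
    "talk_b": [1.0, 0.0, 1.0],
    "talk_c": [0.0, 1.0, 0.0],
}
edges = build_edges(vectors)
for source, target, weight in edges:
    print(source, target, round(weight, 3))

# A text the classifier labeled "all_ages" is relabeled because of its vocabulary.
print(apply_badword_rule("all_ages", "This is a badword1 example.", {"badword1"}))
```

The resulting edge list has the source, target, and weight columns that a graph tool such as Gephi can import as a weighted network. The override function runs after classification, so it can only move a text toward the adult category and never away from it.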
File | Access | Description | Type | License | Size | Format
---|---|---|---|---|---|---
Safeguarding Minors: A Study on Automatic Age-based Classification of Online Multimedia Content Using Text Transcripts.pdf | Open access | Safeguarding Minors: A Study on Automatic Age-based Classification of Online Multimedia Content Using Text Transcripts | Post-print document | PUBLIC - Public with copyright | 7.27 MB | Adobe PDF
Giudizio.pdf | Open access | Committee's evaluation | Other attached material | PUBLIC - Public with copyright | 96.28 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.