Basic article information

Title: H-Prop and H-Prop-News: Computational Propaganda Datasets in Hindi
Authors: Deptii Chaudhari; Ambika Vishal Pawar; Alberto Barrón-Cedeño et al.
Journal: Data
Print ISSN: 2306-5729
Publication year: 2022
Volume: 7
Issue: 3
Pages: 1-11
DOI: 10.3390/data7030029
Language: English
Publisher: MDPI Publishing
Abstract: In this digital era, people rely on the internet for their news consumption. As people are free to express their opinions on social media, much information shared on the internet is loaded with propaganda. Propagandist contents are intended to influence public opinion. In the mainstream media or prominent news agencies, the authors’ and news agencies’ own bias may impact in the news contents. Hence, it is required to detect such propaganda spread through news articles. Detection and classification of propagandist text require standard, high-quality, annotated datasets. A few datasets are available for propaganda classification. However, these datasets are mostly in English. Hindi is the most spoken language in India, and efforts are needed to detect its propagandist contents. This research work introduces two new datasets: H-Prop and H-Prop-News, which consist of news articles in Hindi annotated as propaganda or non-propaganda. The H-Prop dataset is generated by translating 28,630 news articles from the QProp dataset. The H-Prop-News dataset contains 5500 news articles collected from 32 prominent Hindi news websites. We experiment with the proposed datasets using four supervised machine learning models combined with different feature vectors and word embeddings. Our experiments achieve 87% accuracy using Logistic Regression with TF-IDF feature vectors. The datasets provide high-quality labeled news articles in Hindi and open new avenues for researchers to explore techniques for analyzing and classifying propaganda in Hindi text.
Keywords: propaganda identification; news articles analysis; Hindi text processing
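
Illustrative sketch: the abstract names Logistic Regression over TF-IDF feature vectors as the best-performing configuration, but this record gives no implementation details. The following is a minimal, standard-library-only sketch of that general technique, not the authors' code. The toy corpus, the whitespace tokenizer, the smoothed IDF variant, the learning rate, the epoch count, and all function names are assumptions made for illustration. English placeholder tokens stand in for tokenized Hindi text.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from H-Prop or H-Prop-News).
# Labels: 1 = propaganda, 0 = non-propaganda.
train_docs = [
    "traitors destroy nation shame",
    "enemy traitors betray nation",
    "council approves budget report",
    "committee council reviews budget",
]
labels = [1, 1, 0, 0]


def tokenize(text):
    """Whitespace tokenizer (an assumption; real Hindi text needs more care)."""
    return text.split()


def fit_tfidf(docs):
    """Learn a vocabulary and smoothed IDF weights from raw documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    vocab = {tok: i for i, tok in enumerate(sorted(df))}
    # Assumed variant: idf = ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    return vocab, idf


def tfidf_vector(text, vocab, idf):
    """Map one document to an L2-normalized TF-IDF vector; unseen tokens are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = count * idf[tok]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=300):
    """Fit unregularized logistic regression with full-batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        # Prediction error per example, computed before any weight update.
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
                  for xi, yi in zip(X, y)]
        for j in range(len(w)):
            w[j] -= lr * sum(e * xi[j] for e, xi in zip(errors, X))
        b -= lr * sum(errors)
    return w, b


vocab, idf = fit_tfidf(train_docs)
X = [tfidf_vector(doc, vocab, idf) for doc in train_docs]
w, b = train_logreg(X, labels)


def classify(text):
    """Return 1 (propaganda) or 0 (non-propaganda) for a raw document."""
    x = tfidf_vector(text, vocab, idf)
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


if __name__ == "__main__":
    print(classify("traitors betray nation"))
    print(classify("council reviews budget hearing"))
```

In practice a library implementation with regularization and a held-out test split would be preferable; the hand-rolled version only makes the two stages (feature weighting, then a linear classifier) explicit.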