Article Information

Title: A data-driven framework for archiving and exploring social media data
Authors: Qunying Huang; Chen Xu
Journal: Annals of GIS
Print ISSN: 1947-5683
Year: 2014
Volume: 20
Issue: 4
Pages: 265-277
DOI: 10.1080/19475683.2014.942697
Language: English
Publisher: Taylor & Francis Ltd.
Abstract: Social media data are available and accumulated at the exabyte level every day. As social media applications are widely deployed on various platforms, from personal computers to mobile devices, they are becoming a natural extension of the human sensory system. The synthesis of social media with human intelligence has the potential to form an intelligent sensor network of unprecedented scale and capacity. However, it also poses several grand challenges for archiving and retrieving information from massive social media data. One of these challenges is how to archive, retrieve and mine such massive unstructured data sets efficiently to support real-time emergency response. To explore potential solutions, this paper utilizes parallel computing methods to harvest social media data sets, using Twitter as an example, and to store, index, query and analyse them. Within this framework, a Not Only SQL (NoSQL) database (DB), MongoDB, is used to store data as document entries rather than relational tables. To retrieve information from the massive data sets efficiently, several strategies are used: (1) data are archived in MongoDB across multiple collections, with each collection containing a subset of the accumulated data, (2) parallel computing is applied to query and process data from each collection and (3) data are duplicated across multiple servers to support massive concurrent access to the data sets. This study has also tested the performance of spatiotemporal queries, concurrent user requests and sentiment analysis over multiple DB servers, and performance benchmark results showed that the proposed approach could provide a solution for processing massive social media data with more than 40% performance improvement. A proof-of-concept prototype implements the design to harvest, process and analyse tweets.
Keywords: cloud computing; big data; NoSQL; parallel computing
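
Illustrative note: strategies (1) and (2) in the abstract, splitting the archive into multiple collections and querying them in parallel, can be sketched roughly as follows. This is a minimal sketch and not the authors' implementation: plain in-memory lists stand in for MongoDB collections, and the collection names, document fields, sample tweets, and the helpers `query_collection` and `parallel_query` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

# Hypothetical stand-in for MongoDB collections: one list of tweet
# documents per day. Names, fields, and sample data are made up.
collections = {
    "tweets_2012_10_29": [
        {"id": 1, "lon": -74.0, "lat": 40.7,
         "time": datetime(2012, 10, 29, 8, 0), "text": "storm surge rising"},
        {"id": 2, "lon": -118.2, "lat": 34.1,
         "time": datetime(2012, 10, 29, 9, 0), "text": "sunny here"},
    ],
    "tweets_2012_10_30": [
        {"id": 3, "lon": -73.9, "lat": 40.8,
         "time": datetime(2012, 10, 30, 7, 30), "text": "power is out"},
        {"id": 4, "lon": -74.1, "lat": 40.6,
         "time": datetime(2012, 10, 30, 23, 0), "text": "cleanup begins"},
    ],
}


def query_collection(docs, bbox, start, end):
    """Filter one collection by bounding box and time window."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        d for d in docs
        if min_lon <= d["lon"] <= max_lon
        and min_lat <= d["lat"] <= max_lat
        and start <= d["time"] <= end
    ]


def parallel_query(collections, bbox, start, end, workers=4):
    """Query every collection concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(query_collection, docs, bbox, start, end)
            for docs in collections.values()
        ]
        merged = [doc for f in futures for doc in f.result()]
    # Order the merged hits chronologically.
    return sorted(merged, key=lambda d: d["time"])


# Spatiotemporal query: an assumed bounding box around New York City
# as (min_lon, min_lat, max_lon, max_lat), plus a time window.
nyc_bbox = (-74.3, 40.4, -73.6, 41.0)
hits = parallel_query(
    collections, nyc_bbox,
    datetime(2012, 10, 29), datetime(2012, 10, 30, 12, 0),
)
print([d["id"] for d in hits])
```

In this toy version the threads mainly show the fan-out and merge pattern. With a real database, each worker would presumably issue its own query against one collection, so the waiting time on separate collections could overlap.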