Abstract: Building-level details play an essential part in a number of real-world application models. Energy systems, telecommunications, disaster management, the internet-of-things, healthcare, and marketing are a few of the many applications that require building information. The essential variables that most of these models require are building type, house type, area of living space, and number of residents. To acquire some of this information, this paper introduces a methodology and generates the corresponding data. The study was conducted for specific applications in energy system modeling; nonetheless, the data can also be used in other applications. Building locations and some of their details are openly available in the form of map data from OpenStreetMap (OSM). However, data regarding building types (i.e., residential, industrial, office, single-family house, multi-family house, etc.) are only partially available in the OSM dataset. Therefore, a machine learning classification algorithm for predicting building types on the basis of the OSM buildings' data was introduced. Although the OSM dataset is the fundamental and most crucial one used for modeling, the machine learning algorithm was trained on a dataset prepared by combining several features from three other datasets. The generated dataset consists of approximately 29 million buildings, of which about 19 million are residential, with 72% being single-family houses and the rest multi-family ones that include two-family houses and apartment buildings. Furthermore, the results were validated through a comparison with publicly available statistical data. This comparison with official statistics reveals percentage errors of 3.64% for residential buildings, 13.14% for single-family houses, and −15.38% for multi-family houses. Nevertheless, by incorporating the building types, this dataset is able to complement existing building information in studies in which building type information is crucial.
Keywords: missing values; class imbalance; data analysis; geospatial data; feature selection; data visualization; classification; energy system analysis