A methodology for extracting quasi-random samples from World Wide Web domains.
Michael Featherstone, Stewart Adam & Patricia Borstorff
INTRODUCTION
The Web is characterized by relentless growth. VeriSign estimates
over 1 million domain names are acquired in the dot-com domain each
month. But how much of the growth of the Web may be attributed to
business? What types and proportions of businesses populate the Web? Is
the Web more amenable to large business or to small business? Does the
Web consist mostly of entrepreneurial start-ups or of companies that have
adapted their pre-existing business models to this new environment? How
'entrepreneurial' is the Web? Throughout (or because of) the frenzy of
the dot-com craze and the uproar over the bursting of the dot-com bubble
in 2001, many of these fundamental questions about business on the Web
have remained unanswered.
Barabasi (2002) states 'Our life is increasingly dominated by
the Web. Yet we devote remarkably little attention and resources to
understanding it'. Relative to the extensive literature produced on
the importance and potential of the Internet as a tool (Porter, 2001) or
as an element of the physical world's business environment,
empirical research regarding the demographics of the vast majority of
Web business entities, or their marketing and revenue strategies, is
limited and sketchy (Colecchia, 2000; Constantinides, 2004). Compounding
the problem, Drew (2002) notes that 'Many academic empirical
investigations and surveys in e-business suffer from small sample sizes,
with consequent questions as to the meaning, validity and reliability of
findings'.
Because of the extraordinary growth and the sheer size of the Web,
sampling methodologies are essential in order to make valid inferences
regarding the nature of Web businesses. This paper discusses probability
sampling methodologies which may be employed to give researchers tools
to assist in answering some of the fundamental "how much",
"how many" and "what type" questions regarding the
conduct of business on the Web. The paper describes the procedures
employed, as well as the mistakes we made that ultimately pointed to a
more productive process. The methodology does not require mastery of
esoteric Web software packages, nor familiarity with Web crawlers or the
algorithms they employ to sample pages on the Web.
WEB SAMPLING ISSUES
The original objective of the present research project required
that we draw a representative sample of Web sites across multiple top
level domains. The first attempt adapted a method based on O'Neill,
McClain and Lavoie's (1998) methodology for sampling the World Wide
Web utilizing Internet Protocol (IP) addresses. The first step taken was
to develop a program that would generate random IP addresses, test each
address for validity, and store the resulting valid IP addresses in a
file. This would enable us to resolve each address to a domain name and
then manually enter the valid domain names into a Web browser for further
evaluation and classification.
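As an illustration of this step, the sketch below (in Java, the language
used for our later extraction program) generates random IPv4 addresses and
treats an address as 'valid' if a Web server answers on port 80 within a
short timeout. The output file name, the target count and the port-80 test
itself are illustrative assumptions rather than a description of the exact
checks our original program performed.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.util.Random;

    public class RandomIpSampler {
        public static void main(String[] args) throws IOException {
            Random rng = new Random();
            try (PrintWriter out = new PrintWriter(new FileWriter("valid_ips.txt"))) {
                int found = 0;
                while (found < 126) {  // stop once the desired number of valid addresses is stored
                    String ip = (rng.nextInt(223) + 1) + "." + rng.nextInt(256) + "."
                            + rng.nextInt(256) + "." + (rng.nextInt(254) + 1);
                    if (respondsOnPort80(ip)) {
                        out.println(ip);
                        found++;
                    }
                }
            }
        }

        // Treat an address as "valid" if something accepts a TCP connection on port 80.
        private static boolean respondsOnPort80(String ip) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(ip, 80), 2000);
                return true;
            } catch (IOException e) {
                return false;
            }
        }
    }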
In the mid-1990s, nearly all Web domain names were assigned a
unique, non-ambiguous IP address, referred to as a 'static' IP
address. Around 1999, the practice of assigning 'dynamic' IP
addresses became more common due in part to the perceived diminishing
supply of static or unique IP addresses. A dynamic IP address is
assigned to a computer by a Dynamic Host Configuration Protocol (DHCP) server each
time the computer connects to the Internet. The computer retains a
dynamic IP address only for as long as a given Internet session lasts.
The same computer might be assigned a completely different address the
next time it connects to the Internet. In contrast, a static IP address
remains constant.
The result of this trend toward greater usage of dynamic IP
addresses is that an ever increasing number of IP addresses are
essentially ambiguous, in that the IP address itself does not
necessarily resolve back to the actual domain name it has been assigned,
but may resolve back to the hosting Web site. The direct impact of this
practice became apparent when we manually analyzed our initial randomly
generated sample of 126 valid IP addresses. We categorized 68% of the
sample as "business sites" (a number that was higher that we
anticipated) and even more extraordinary, 80% of the business sites were
sub-classified as information technology sites. At this point it was
apparent that the seemingly skewed results were related to the IP
addresses resolving back to the particular Internet host site which was
generating the dynamic IP address for its users, rather than the
ultimate recipient site of the dynamic IP address. Edelman (2003) and
Ricerca (2004) also discuss the impact of the increasing proliferation of
dynamic or shared IP addresses.
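The ambiguity is easy to demonstrate with a reverse lookup. The brief
sketch below uses the standard java.net.InetAddress class to resolve an
address back to a host name; the address shown is a documentation-range
placeholder, and for a shared or dynamic address the result is typically
the hosting provider's name rather than the domain of the business
actually using the address.

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class ReverseLookup {
        public static void main(String[] args) throws UnknownHostException {
            // An illustrative address; in practice this would come from the random sample.
            InetAddress address = InetAddress.getByName("203.0.113.25");

            // For a shared or dynamic address, the reverse (PTR) record usually names the
            // hosting provider or access network, not the business using the address.
            System.out.println(address.getHostAddress() + " resolves back to: "
                    + address.getCanonicalHostName());
        }
    }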
We then explored various methods of generating random domain names rather
than IP addresses. These attempts relied on text databases or existing
search engine databases, but they consistently produced severely and
obviously biased samples. Our attention then turned to sampling by domain
name zones, where we met with greater success.
SAMPLING BY TOP LEVEL DOMAIN
Choosing to sample by Top Level Domain (TLD) changed one aspect of
the project: rather than sampling the entire Web, we would sample within
a specific TLD, in this case the dot-com zone. The dot-com zone is the
single largest TLD on the Web,
accounting for about 46% of all registered domain names (VeriSign,
2005). It remains a preferred naming convention for business and other
enterprises. The balance of the paper describes the process used to
obtain a representative sample from the dot-com zone. VeriSign is the
American company charged with managing both the dot-com and dot-net
zones. VeriSign provides research access to the Zone Files through a
relatively simple application and agreement process. The requirements at
the time we registered included that the machine used to access the zone
files have a static Internet Protocol (IP) address. Most
universities employ static IP addresses on all campus machines. VeriSign
Corporation granted us access to the data for research purposes on
October 4, 2004. Along with the completed agreement, we received a File
Transfer Protocol (FTP) address and an access password.
The next step was to use an FTP program to access the VeriSign
database. Once we connected to the VeriSign FTP address, we were able to
select and download the entire dot-com database file. VeriSign provides
the databases as highly compressed zipped files. The compressed dot-com
file as downloaded on November 7, 2006 was 936 megabytes; once fully
extracted, the dot-com zone file was 4.30 gigabytes.
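For researchers who prefer to script the download rather than use an
interactive FTP program, a transfer along the following lines is possible.
The sketch uses the Apache Commons Net FTPClient; the host name,
credentials and file name are placeholders standing in for the address and
password supplied by VeriSign, not the actual values.

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.commons.net.ftp.FTP;
    import org.apache.commons.net.ftp.FTPClient;

    public class ZoneFileDownload {
        public static void main(String[] args) throws Exception {
            FTPClient ftp = new FTPClient();
            ftp.connect("ftp.example-registry.net");   // placeholder for the FTP address supplied by VeriSign
            ftp.login("zoneuser", "zonepassword");      // placeholder credentials
            ftp.setFileType(FTP.BINARY_FILE_TYPE);      // the zone file is a compressed binary archive
            ftp.enterLocalPassiveMode();

            try (OutputStream out = new BufferedOutputStream(new FileOutputStream("com.zone.gz"))) {
                ftp.retrieveFile("com.zone.gz", out);   // placeholder name for the dot-com zone file
            }

            ftp.logout();
            ftp.disconnect();
        }
    }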
This new database represented the universe of dot-com names.
However, the sheer number of records in the dot-com zone database (more
than 40 million records) and the size of the files we were dealing with
would prove difficult to manage and created data access issues. Many
text-editing applications, such as MS Word or MS Notepad, were simply
unable to load the files. Enumerating the database proved problematic as
well, since applications that might easily have been used to number the
records could not handle files of that magnitude.
CREATION OF THE SAMPLING POOL
To address the issue of sampling frame size, we developed a Java
program (available upon request) which randomly extracted 50,000 domain
names from the dot-com file. The program required four parameters from
the user: the file name of the dot-com zone file, the file name of the
output file, a seed number, and the approximate number of records to be
extracted (for example, 50,000). Using pseudo-random numbers seeded by the
user, the program would read through the zone file, compare the
pseudo-random numbers to the file record numbers, extract the matching
records, and create a sample of domain names of a more manageable size.
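The original program is available upon request; the sketch below
reconstructs its general logic under one simplifying assumption, namely
that each record is kept with probability equal to the requested sample
size divided by an assumed total of roughly 40 million records, which
yields a file of approximately the requested size.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.Random;

    public class ZoneFileSampler {
        // Usage: java ZoneFileSampler <zone file> <output file> <seed> <approx sample size>
        public static void main(String[] args) throws Exception {
            String zoneFile = args[0];
            String outputFile = args[1];
            long seed = Long.parseLong(args[2]);
            int targetSize = Integer.parseInt(args[3]);      // e.g. 50000

            // Assumed size of the dot-com zone at the time (roughly 40 million records).
            long estimatedRecords = 40_000_000L;
            double keepProbability = (double) targetSize / estimatedRecords;

            Random rng = new Random(seed);
            try (BufferedReader in = new BufferedReader(new FileReader(zoneFile));
                 PrintWriter out = new PrintWriter(new FileWriter(outputFile))) {
                String record;
                while ((record = in.readLine()) != null) {
                    // Keep each record with a small fixed probability so that the output
                    // contains approximately targetSize domain names.
                    if (rng.nextDouble() < keepProbability) {
                        out.println(record);
                    }
                }
            }
        }
    }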
The resulting file of 50,000 domain names from the dot-com zone
became our working database or 'sample pool' from which the
final domain names could be drawn. We employed an editing tool (TextPad)
to load the sample pool. This application included the capability to
instantly enumerate each record (domain name) in the sample pool.
CREATION OF FINAL SAMPLE
We then generated an additional 1300 random numbers in the range of
1 to 50,000 and pasted them into an Excel spreadsheet. These numbers were
used to select the correspondingly numbered domain names from the
enumerated sample pool in the TextPad file. This resulted in an Excel
file of 1300 names drawn from the sample pool. Each record included the
domain name and its associated number from the TextPad sample frame,
which allowed us to cross-check each record in the final Excel file
against the associated record in the random number file and the TextPad
enumerated sample pool.
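The same draw could be scripted rather than performed through Excel and
TextPad. The sketch below is a hypothetical equivalent: it assumes the
sample pool has been saved as a plain text file (here called
sample_pool.txt), draws 1300 distinct record numbers, and writes each
number together with its domain name so that the cross-check columns are
preserved.

    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;
    import java.util.TreeSet;

    public class FinalSampleDraw {
        public static void main(String[] args) throws Exception {
            // Assumed file name for the 50,000-name sample pool.
            List<String> pool = Files.readAllLines(Paths.get("sample_pool.txt"));

            // Draw 1300 distinct record numbers in the range 1 to pool.size().
            Random rng = new Random();
            Set<Integer> picks = new TreeSet<>();
            while (picks.size() < 1300) {
                picks.add(rng.nextInt(pool.size()) + 1);
            }

            // Write each record number alongside its domain name, mirroring the
            // cross-check columns kept in the Excel file.
            try (PrintWriter out = new PrintWriter("final_sample.csv")) {
                out.println("record_number,domain_name");
                for (int recordNumber : picks) {
                    out.println(recordNumber + "," + pool.get(recordNumber - 1));
                }
            }
        }
    }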
The final task for our research purposes was to 'resolve'
each of the 1300 domain names drawn, that is, to type or paste each
domain name into the browser address bar and view the resulting site.
Each of the 1300 domain names was resolved manually using a "copy
and paste" process from our final Excel database to the browser
address bar. This avoided typing errors while still allowing individual
evaluation and categorization of each site. Finally, a new database was
created which included all the domain names viewed and the results of
the categorization process: Was this an active site? Was it a business
site? Did we encounter a 'Site not found' error message?
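Although we resolved and categorized every site manually, a short program
could pre-screen the list for unreachable domains before the manual review
begins. The sketch below uses the standard HttpURLConnection class for
such a pre-screen; the status labels are illustrative only and do not
replace the manual business classification described above.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DomainResolutionCheck {
        // Return a coarse status label for a single domain name.
        static String check(String domain) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL("http://" + domain).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                conn.setInstanceFollowRedirects(true);
                int code = conn.getResponseCode();
                return code < 400 ? "active site (HTTP " + code + ")"
                                  : "error page (HTTP " + code + ")";
            } catch (Exception e) {
                return "site not found";   // DNS failure, timeout, or refused connection
            }
        }

        public static void main(String[] args) {
            for (String domain : args) {
                System.out.println(domain + ": " + check(domain));
            }
        }
    }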
CONCLUSION
This paper has suggested that, in order to better understand the complex
business systems emerging on the World Wide Web, further research from
within the Web environment itself is required. Because of the size of the
Web, sampling strategies must be employed in order to effectively model
and study the Web business environment. We suggest that sampling the
Web's Top Level Domains offers a reasonable alternative for business
researchers because it requires only familiarity with simple Web
utilities such as File Transfer Protocol (FTP) to obtain initial domain
name listings. Domain names represent the fundamental building blocks of
the Web; every enterprise or individual seeking a presence on the Web
must acquire at least one domain name. Closer examination of domain name
utilization leads to a clearer understanding of the proportions and types
of businesses using the Internet and can help shed light on fundamental
and as yet little understood questions.
REFERENCES
Barabasi, A.-L. (2002). Linked: The new science of networks.
Cambridge, Mass.: Perseus Pub.
Colecchia, A. (2000). Defining and measuring electronic commerce:
Towards the development of an OECD methodology. Statistics Singapore.
Retrieved November 12, 2006, from
http://www.singstat.gov.sg/conferences/ec/d8.pdf
Constantinides, E. (2004). Strategies for surviving the Internet
meltdown. Management Decision, 42(1), 89-107.
Drew, S. (2002). E-business research practice: Towards an agenda.
Electronic Journal of Business Research Methods. Retrieved February 18,
2007, from http://www.ejbrm.com/
Edelman, B. (2003). Web sites sharing IP addresses: Prevalence
and significance. Berkman Center for Internet & Society. Retrieved March
26, 2006 from http://cyber.law.harvard.edu/people/edelman/ip-sharing
O'Neill, E., McClain, P. D., & Lavoie, B. F. (1998). A
Methodology for Sampling the World Wide Web. Online Computer Library
Center. Retrieved September 23, 2006, from
http://digitalarchive.oclc.org/da/ViewObject.jsp?objid=0000003447
Porter, M. (2001). Strategy and the Internet. Harvard Business
Review, 79(3), 63-78.
Ricerca, J. (2004). Search Engine Optimization: Static IP vs.
dynamic IP addresses. Circle ID. Retrieved July 17, 2006 from
http://www.circleid.com/
posts/search_engine_optimization_static_ip_vs_dynamic_ip_addresses/
VeriSign. (2005). The VeriSign Domain Report, August 2005. The
Domain Name Industry Brief. Retrieved December 14, 2006 from
http://digitalarchive.oclc.org/da/ViewObject.jsp?objid=0000003447
Michael Featherstone, Jacksonville State University
Stewart Adam, Deakin University
Patricia Borstorff, Jacksonville State University