A methodology for extracting quasi-random samples from World Wide Web domains.
Michael Featherstone, Stewart Adam & Patricia Borstorff
INTRODUCTION
The Web is characterized by relentless growth. VeriSign estimates
over 1 million domain names are acquired in the dot-com domain each
month. But how much of the growth of the Web may be attributed to
business? What types and proportions of businesses populate the Web? Is
the Web more amenable to large business or to small business? Does the
Web consist mostly of entrepreneurial start-ups or of companies that have
adapted their pre-existing business models to this new environment? How
'entrepreneurial' is the Web? Throughout (or because of) the frenzy of
the dot-com craze and the uproar over the bursting of the dot-com bubble
in 2001, many of these fundamental questions about business on the Web
have remained unanswered.
Barabasi (2002) states 'Our life is increasingly dominated by
the Web. Yet we devote remarkably little attention and resources to
understanding it'. Relative to the extensive literature produced on
the importance and potential of the Internet as a tool (Porter, 2001) or
as an element of the physical world's business environment,
empirical research regarding the demographics of the vast majority of
Web business entities, or their marketing and revenue strategies, is
limited and sketchy (Colecchia, 2000; Constantinides, 2004). Compounding
the problem, Drew (2002) notes that 'Many academic empirical
investigations and surveys in e-business suffer from small sample sizes,
with consequent questions as to the meaning, validity and reliability of
findings'.
Because of the extraordinary growth and the sheer size of the Web,
sampling methodologies are essential in order to make valid inferences
regarding the nature of Web businesses. This paper discusses probability
sampling methodologies which may be employed to give researchers tools
to assist in answering some of the fundamental "how much",
"how many" and "what type" questions regarding the
conduct of business on the Web. The paper describes the procedures
employed, as well as the mistakes we made that ultimately pointed to a
more productive process. The methodology does not require mastery of
esoteric Web software packages, nor familiarity with Web crawlers or the
algorithms they employ to sample pages on the Web.
WEB SAMPLING ISSUES
The original objective of the present research project required
that we draw a representative sample of Web sites across multiple top
level domains. The first attempt adapted a method based on O'Neill,
McClain and Lavoie's (1998) methodology for sampling the World Wide
Web utilizing Internet Protocol (IP) addresses. The first step taken was
to develop a program that would generate random IP addresses, test each
address for validity, and store the resulting valid IP addresses in a
file. This would enable us to resolve each address to a domain name and
then manually enter the valid domain names into a Web browser for further
evaluation and classification.
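As an illustration of this step, the sketch below (in Java, the language
used for our later extraction program) generates random IPv4 addresses and
treats an address as 'valid' if a Web server answers on port 80 within a
short timeout. The output file name, the target count and the port-80 test
itself are illustrative assumptions rather than a description of the exact
checks our original program performed.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.net.InetSocketAddress;
    import java.net.Socket;
    import java.util.Random;

    public class RandomIpSampler {
        public static void main(String[] args) throws IOException {
            Random rng = new Random();
            try (PrintWriter out = new PrintWriter(new FileWriter("valid_ips.txt"))) {
                int found = 0;
                while (found < 126) {  // stop once the desired number of valid addresses is stored
                    String ip = (rng.nextInt(223) + 1) + "." + rng.nextInt(256) + "."
                            + rng.nextInt(256) + "." + (rng.nextInt(254) + 1);
                    if (respondsOnPort80(ip)) {
                        out.println(ip);
                        found++;
                    }
                }
            }
        }

        // Treat an address as "valid" if something accepts a TCP connection on port 80.
        private static boolean respondsOnPort80(String ip) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(ip, 80), 2000);
                return true;
            } catch (IOException e) {
                return false;
            }
        }
    }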
In the mid-1990s, nearly all Web domain names were assigned a
unique, non-ambiguous IP address, referred to as a 'static' IP
address. Around 1999, the practice of assigning 'dynamic' IP
addresses became more common due in part to the perceived diminishing
supply of static or unique IP addresses. A dynamic IP address is
assigned to a computer by a Dynamic Host Configuration Protocol (DHCP) server each
time the computer connects to the Internet. The computer retains a
dynamic IP address only for as long as a given Internet session lasts.
The same computer might be assigned a completely different address the
next time it connects to the Internet. In contrast, a static IP address
remains constant.
The result of this trend toward greater usage of dynamic IP
addresses is that an ever increasing number of IP addresses are
essentially ambiguous, in that the IP address itself does not
necessarily resolve back to the actual domain name it has been assigned,
but may resolve back to the hosting Web site. The direct impact of this
practice became apparent when we manually analyzed our initial randomly
generated sample of 126 valid IP addresses. We categorized 68% of the
sample as "business sites" (a number that was higher that we
anticipated) and even more extraordinary, 80% of the business sites were
sub-classified as information technology sites. At this point it was
apparent that the seemingly skewed results were related to the IP
addresses resolving back to the particular Internet host site which was
generating the dynamic IP address for its users, rather than the
ultimate recipient site of the dynamic IP address. Edelman (2003) and
Ricerca (2004) also discuss the impact of the increasing proliferation of
dynamic or shared IP addresses.
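The ambiguity is easy to demonstrate with a reverse lookup. The brief
sketch below uses the standard java.net.InetAddress class to resolve an
address back to a host name; the address shown is a documentation-range
placeholder, and for a shared or dynamic address the result is typically
the hosting provider's name rather than the domain of the business
actually using the address.

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class ReverseLookup {
        public static void main(String[] args) throws UnknownHostException {
            // An illustrative address; in practice this would come from the random sample.
            InetAddress address = InetAddress.getByName("203.0.113.25");

            // For a shared or dynamic address, the reverse (PTR) record usually names the
            // hosting provider or access network, not the business using the address.
            System.out.println(address.getHostAddress() + " resolves back to: "
                    + address.getCanonicalHostName());
        }
    }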
We then explored various methods of generating random domain names rather
than IP addresses. These attempts relied on text databases or existing
search engine databases, but they consistently produced severely and
obviously biased samples. Our attention then turned to sampling by domain
name zones, where we met with greater success.
SAMPLING BY TOP LEVEL DOMAIN
Choosing to sample by Top Level Domain (TLD) changed one aspect of
the project: rather than sampling the entire Web, we would sample within
a specific TLD, in this case the dot-com zone. The dot-com zone is the
single largest TLD on the Web,
accounting for about 46% of all registered domain names (VeriSign,
2005). It remains a preferred naming convention for business and other
enterprises. The balance of the paper describes the process used to
obtain a representative sample from the dot-com zone. VeriSign is the
American company charged with managing both the dot-com and dot-net
zones. VeriSign provides research access to the Zone Files through a
relatively simple application and agreement process. The requirements at
the time we registered included that the machine used to access the zone
files have a static Internet Protocol (IP) address. Most
universities employ static IP addresses on all campus machines. VeriSign
Corporation granted us access to the data for research purposes on
October 4, 2004. Along with the completed agreement, we received a File
Transfer Protocol (FTP) address and an access password.
The next step was to use an FTP program to access the VeriSign
database. Once we connected to the VeriSign FTP address, we were able to
select and download the entire dot-com database file. VeriSign provides
the databases as highly compressed zipped files. The compressed dot-com
file as downloaded on November 7, 2006 was 936 megabytes; once fully
extracted, the dot-com zone file was 4.30 gigabytes.
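For researchers who prefer to script the download rather than use an
interactive FTP program, a transfer along the following lines is possible.
The sketch uses the Apache Commons Net FTPClient; the host name,
credentials and file name are placeholders standing in for the address and
password supplied by VeriSign, not the actual values.

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.commons.net.ftp.FTP;
    import org.apache.commons.net.ftp.FTPClient;

    public class ZoneFileDownload {
        public static void main(String[] args) throws Exception {
            FTPClient ftp = new FTPClient();
            ftp.connect("ftp.example-registry.net");   // placeholder for the FTP address supplied by VeriSign
            ftp.login("zoneuser", "zonepassword");      // placeholder credentials
            ftp.setFileType(FTP.BINARY_FILE_TYPE);      // the zone file is a compressed binary archive
            ftp.enterLocalPassiveMode();

            try (OutputStream out = new BufferedOutputStream(new FileOutputStream("com.zone.gz"))) {
                ftp.retrieveFile("com.zone.gz", out);   // placeholder name for the dot-com zone file
            }

            ftp.logout();
            ftp.disconnect();
        }
    }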
This new database represented the universe of dot-com names.
However, the sheer number of records in the dot-com zone database (more
than 40 million records) and the size of the files we were dealing with
would prove difficult to manage and created data access issues. Many
text-editing applications, such as MS Word or MS Notepad, were simply
unable to load the files. Enumerating the database proved problematic as
well, since applications that might easily have been used to number the
records could not handle files of that magnitude.
CREATION OF THE SAMPLING POOL
To address the issue of sampling frame size, we developed a Java
program (available upon request) which randomly extracted 50,000 domain
names from the dot-com file. The program required four parameters from
the user: the file name of the dot-com zone file, the file name of the
output file, a seed number, and the approximate number of records to be
extracted (for example, 50,000). Using pseudo-random numbers seeded by the
user, the program would read through the zone file, compare the
pseudo-random numbers to the file record numbers, extract the matching
records, and create a sample of domain names of a more manageable size.
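The original program is available upon request; the sketch below
reconstructs its general logic under one simplifying assumption, namely
that each record is kept with probability equal to the requested sample
size divided by an assumed total of roughly 40 million records, which
yields a file of approximately the requested size.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.PrintWriter;
    import java.util.Random;

    public class ZoneFileSampler {
        // Usage: java ZoneFileSampler <zone file> <output file> <seed> <approx sample size>
        public static void main(String[] args) throws Exception {
            String zoneFile = args[0];
            String outputFile = args[1];
            long seed = Long.parseLong(args[2]);
            int targetSize = Integer.parseInt(args[3]);      // e.g. 50000

            // Assumed size of the dot-com zone at the time (roughly 40 million records).
            long estimatedRecords = 40_000_000L;
            double keepProbability = (double) targetSize / estimatedRecords;

            Random rng = new Random(seed);
            try (BufferedReader in = new BufferedReader(new FileReader(zoneFile));
                 PrintWriter out = new PrintWriter(new FileWriter(outputFile))) {
                String record;
                while ((record = in.readLine()) != null) {
                    // Keep each record with a small fixed probability so that the output
                    // contains approximately targetSize domain names.
                    if (rng.nextDouble() < keepProbability) {
                        out.println(record);
                    }
                }
            }
        }
    }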
The resulting file of 50,000 domain names from the dot-com zone
became our working database or 'sample pool' from which the
final domain names could be drawn. We employed an editing tool (TextPad)
to load the sample pool. This application included the capability to
instantly enumerate each record (domain name) in the sample pool.
CREATION OF FINAL SAMPLE
We then generated an additional 1300 random numbers in the range of
1 to 50,000 and pasted them into an Excel spreadsheet. These numbers were
used to select the correspondingly numbered domain names from the
enumerated sample pool in the TextPad file. This resulted in an Excel
file of 1300 names drawn from the sample pool. Each record included the
domain name and its associated number from the TextPad sample frame,
which allowed us to cross-check each record in the final Excel file
against the associated record in the random number file and the TextPad
enumerated sample pool.
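The same draw could be scripted rather than performed through Excel and
TextPad. The sketch below is a hypothetical equivalent: it assumes the
sample pool has been saved as a plain text file (here called
sample_pool.txt), draws 1300 distinct record numbers, and writes each
number together with its domain name so that the cross-check columns are
preserved.

    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.Random;
    import java.util.Set;
    import java.util.TreeSet;

    public class FinalSampleDraw {
        public static void main(String[] args) throws Exception {
            // Assumed file name for the 50,000-name sample pool.
            List<String> pool = Files.readAllLines(Paths.get("sample_pool.txt"));

            // Draw 1300 distinct record numbers in the range 1 to pool.size().
            Random rng = new Random();
            Set<Integer> picks = new TreeSet<>();
            while (picks.size() < 1300) {
                picks.add(rng.nextInt(pool.size()) + 1);
            }

            // Write each record number alongside its domain name, mirroring the
            // cross-check columns kept in the Excel file.
            try (PrintWriter out = new PrintWriter("final_sample.csv")) {
                out.println("record_number,domain_name");
                for (int recordNumber : picks) {
                    out.println(recordNumber + "," + pool.get(recordNumber - 1));
                }
            }
        }
    }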
The final task for our research purposes was to 'resolve'
each of the 1300 domain names drawn, that is, to type or paste each
domain name into the browser address bar and view the resulting site.
Each of the 1300 domain names was resolved manually using a "copy
and paste" process from our final Excel database to the browser
address bar. This avoided typing errors while still allowing individual
evaluation and categorization of each site. Finally, a new database was
created which included all the domain names viewed and the results of
the categorization process: Was this an active site? Was it a business
site? Did we encounter a 'Site not found' error message?
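Although we resolved and categorized every site manually, a short program
could pre-screen the list for unreachable domains before the manual review
begins. The sketch below uses the standard HttpURLConnection class for
such a pre-screen; the status labels are illustrative only and do not
replace the manual business classification described above.

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DomainResolutionCheck {
        // Return a coarse status label for a single domain name.
        static String check(String domain) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL("http://" + domain).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                conn.setInstanceFollowRedirects(true);
                int code = conn.getResponseCode();
                return code < 400 ? "active site (HTTP " + code + ")"
                                  : "error page (HTTP " + code + ")";
            } catch (Exception e) {
                return "site not found";   // DNS failure, timeout, or refused connection
            }
        }

        public static void main(String[] args) {
            for (String domain : args) {
                System.out.println(domain + ": " + check(domain));
            }
        }
    }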
CONCLUSION
This paper has suggested that, in order to better understand the complex
business systems emerging on the World Wide Web, further research from
within the Web environment itself is required. Because of the size of the
Web, sampling strategies must be employed in order to effectively model
and study the Web business environment. We suggest that sampling the
Web's Top Level Domains offers a reasonable alternative for business
researchers because it requires only familiarity with simple Web
utilities such as File Transfer Protocol (FTP) to obtain initial domain
name listings. Domain names represent the fundamental building blocks of
the Web; every enterprise or individual seeking a presence on the Web
must acquire at least one domain name. Closer examination of domain name
utilization leads to a clearer understanding of the proportions and types
of businesses using the Internet and can help shed light on fundamental
and as yet little understood questions.
REFERENCES
Barabasi, A.-L. (2002). Linked: The new science of networks.
Cambridge, Mass.: Perseus Pub.
Colecchia, A. (2000). Defining and measuring electronic commerce:
Towards the development of an OECD methodology. Statistics Singapore.
Retrieved November 12, 2006, from
http://www.singstat.gov.sg/conferences/ec/d8.pdf
Constantinides, E. (2004). Strategies for surviving the Internet
meltdown. Management Decision, 42(1), 89-107.
Drew, S. (2002). E-business research practice: Towards an agenda.
Electronic Journal of Business Research Methods. Retrieved February 18,
2007, from http://www.ejbrm.com/
Edelman, B. (2003). Web sites sharing IP addresses: Prevalence
and significance. Berkman Center for Internet & Society. Retrieved March
26, 2006 from http://cyber.law.harvard.edu/people/edelman/ip-sharing
O'Neill, E., McClain, P. D., & Lavoie, B. F. (1998). A
Methodology for Sampling the World Wide Web. Online Computer Library
Center. Retrieved September 23, 2006, from
http://digitalarchive.oclc.org/da/ViewObject.jsp?objid=0000003447
Porter, M. (2001). Strategy and the Internet. Harvard Business
Review, 79(3), 63-78.
Ricerca, J. (2004). Search Engine Optimization: Static IP vs.
dynamic IP addresses. Circle ID. Retrieved July 17, 2006 from
http://www.circleid.com/
posts/search_engine_optimization_static_ip_vs_dynamic_ip_addresses/
VeriSign. (2005). The VeriSign Domain Report, August 2005. The
Domain Name Industry Brief. Retrieved December 14, 2006 from
http://digitalarchive.oclc.org/da/ViewObject.jsp?objid=0000003447
Michael Featherstone, Jacksonville State University
Stewart Adam, Deakin University
Patricia Borstorff, Jacksonville State University