Stylometry--definition and development.
Belak, Stipe; Radman Pesa, Anita; Belak, Branko et al.
1. INTRODUCTION
Written texts of uncertain authorship have long attracted the attention
of experts from different fields, mostly linguists and literary
theorists, whose research focused on famous literary works of uncertain
authorship with the aim of establishing the authenticity of the author
and his oeuvre. Persistent study of texts for the hallmarks of an
author's linguistic style, expressions and usage of certain words or
metaphors did not fail to yield results.
However, rapid technological development over the last fifteen years
has enabled stylometrists to develop various methods of authorship
identification on the basis of a given text. Those methods include
mathematical tools, statistical methods and artificial intelligence
methods, the result of which is specialized software for text analysis
and authorship identification, but also for intentional concealment of a
document's authorship. Initially, stylometry was mostly applied in
literature and other fine arts. Most stylometrists began their research
on certain widely known texts, such as Catholic epistles, ancient Greek and Latin texts, the plays of Shakespeare and Marlowe, and the
Federalist Papers (Mosteller & Wallace, 1964), a series of 85
articles published anonymously in 1787 and 1788, advocating the
ratification of the new Constitution of the United States of America (Klarreich, 2003). The hypothesis examined in the paper is that
stylometry will become more important over time due to the frequency and
significance of communication in the realm of the economy and politics.
Analyses of anonymous texts from various sources will be crucial for
gaining any kind of advantage, be it political or economic, within the
framework of business intelligence as a modern form of protection. The paper is based
on the project entitled "031-2/2008 Research Into Matters
Warranting, Economically and Situation-wise, Adaptive Restructuring of
an Organization in a Dynamic Environment" University of Zadar,
Department of Economics, Centre for Economic Research, MER-Evrocentar
Slovenia, 2008, and is the continuation of earlier research into the
concept of terotechnology within the study of organization and business
intelligence protection (Belak & Cicin-Sain, 2005).
2. STYLOMETRY--DEFINITION
Stylometry (Greek stylos (style) + metron (measure)) is a
scientific method of studying linguistic style, which has been
successfully applied to other areas in the last several years. In
addition to art (literature, music, painting), stylometry in applied in
forensic science, security affairs, the economy, politics, but also at
universities, i.e. whenever it is necessary to identify the author of an
anonymous text or a text penned by more potential authors, whether they
all claim to be the author or the author of the text in question is
simply not known. This is usually the case with complex groups of texts
on which several authors worked together. The linguistic style of an
author can more readily be identified by analyzing the form of the
writing rather than only its content, because content can easily be
copied, while it is very difficult or almost impossible to imitate an
author's form of writing, which is strongly shaped by the
author's subconscious. A text sample used in stylometric
analysis should contain a minimum of 1000 words. It is also important to
use the pure, original text which was not changed in the course of the
years.
3. HISTORY AND DEVELOPMENT
In 1439 the Italian humanist Lorenzo Valla proved that the Donation
of Constantine was a forgery dating from the period between 750 and 850.
His conclusion was based on the analysis of the written text and the
comparison of Latin with that used in authentic 4th century documents.
Some Latin words and phrases used in the Donation did not exist in the
4th century, and came into use much later. This is one of the first
examples of stylometric analysis. In modern times stylometry came to be
widely used for the study of authorship issues in English Renaissance
drama. The development of computers and computer languages for analyzing
large quantities of data greatly improved stylometric efforts. In the
early 1960s, Andrew Q. Morton developed a computer programme for the
analysis of the fourteen Epistles of the New Testament attributed to St.
Paul. The analysis showed that the epistles had been written by six
different authors.
4. METHODS
The main aspect of authorship identification is the process of
selecting features to be analyzed in a text. The most immediate idea
that comes to mind at the mention of linguistic analysis of a text could
be misleading--that an author's identity can be revealed through
complicated or specific words since they mark the author's unique
style and set him apart from others. Stylometrists have proven that it
is the exact opposite which matters. Authors differ in their usage of
the most frequently used simple words such as with and in, because
one's subconscious uses words in daily parlance, such as these
prepositions, automatically and without reflection. Words such as these
are in fact an authorial "fingerprint" enabling experts to
identify the author. Rare and specific words have a strong impact on
readers but can easily be consciously inserted into a text to imitate a
specific author. It is much more difficult to imitate the usage of
simple prepositions and thus it is deemed a safer technique, according
to Holmes (1998). Ideally the two techniques should be combined, says
Holmes, analyzing first the frequency of simple widespread words and
then of the rare and author-specific ones.
4.1. Frequency in Text Segments
In the early 1960s Frederick Mosteller and David Wallace first
demonstrated the stylometric method for authorship attribution of the
aforementioned Federalist Papers of the United States of America.
Mosteller and Wallace found that Hamilton used the word upon ten times
as often as Madison did. They examined the frequency of thirty function
words in texts known to be by each man, establishing characteristic
differences in the rates of repetition. Using the
frequency method they concluded that the author of all twelve disputed
articles was Madison, which was confirmed by subsequent stylometric
analyses.
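The core comparison behind the frequency method can be sketched in a few lines: count how often a marker word appears per 1,000 running words in each candidate's texts. The word list and the tiny samples below are invented for illustration; Mosteller and Wallace worked with some thirty function words over much larger texts.

```python
import re

def rate_per_thousand(text, word):
    """Occurrences of `word` per 1,000 running words of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return 1000.0 * tokens.count(word) / len(tokens) if tokens else 0.0

# Tiny invented samples standing in for texts of known authorship.
sample_a = "It may be relied upon that the claim rests upon solid ground."
sample_b = "It may be relied on that the claim rests on solid ground."

for marker in ("upon", "on"):
    print(marker, rate_per_thousand(sample_a, marker),
          rate_per_thousand(sample_b, marker))
```

With full-length texts, such per-thousand rates for a set of function words form the evidence from which authorship is inferred.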
4.2. Writer Invariant
The writer invariant is a specific distinctive value of a text that
can be attributed to a certain author. An example of a writer invariant
is the frequency of function words used by the writer. In one such
method (Binongo, 2003), the text is analyzed to find the 50 most common
words. The text is then broken into 5,000 word chunks and each of the
chunks is analyzed again to find the frequency of those 50 words in that
chunk. This generates a unique 50-number identifier for each chunk of
text. The identifier is then displayed as a point in a 50-dimensional
space, flattened into a plane using principal component analysis (PCA),
a statistical technique for representing texts as points in a
multidimensional space and for reducing multidimensional data sets to
lower dimensions. A related pattern-recognition technique for
representing chunks of text in multidimensional space, support-vector
machines, was used by Fung (2003).
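The Binongo-style procedure described above, minus the final PCA flattening, can be sketched as follows. The toy text and the tiny parameter values are ours; the actual method uses the 50 most common words and 5,000-word chunks, and the resulting vectors would then be projected onto a plane with PCA (e.g. via a standard statistics library).

```python
from collections import Counter
import re

def chunk_profiles(text, n_words=50, chunk_size=5000):
    """For each chunk of `chunk_size` running words, record the relative
    frequency of the text's `n_words` most common words. Each profile is
    the chunk's identifier; PCA would then flatten these points into a
    plane for visual comparison."""
    tokens = re.findall(r"[a-z']+", text.lower())
    top = [w for w, _ in Counter(tokens).most_common(n_words)]
    profiles = []
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        counts = Counter(chunk)
        profiles.append([counts[w] / len(chunk) for w in top])
    return top, profiles

# Toy demonstration with tiny parameters instead of 50 words / 5,000-word chunks.
top, profiles = chunk_profiles("the cat sat on the mat the dog sat on the rug",
                               n_words=3, chunk_size=6)
```

Each profile is one point per chunk; chunks by the same author are expected to cluster once projected into the plane.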
4.3. Neural Networks
A neural network is an artificial intelligence method for
identifying the author of a text; it imitates the neural network
of a human brain. One such network is built with random links and works
by trial and error, learning by doing. The network is presented with a
prepared text by a known author. Whenever the network correctly guesses
the authorship of a text segment, the correct guess strengthens its
links, until the network can properly identify known texts. Once the
training period is complete, the neural network can determine the
authorship of new texts by the authors on which it was
trained. Matthews & Merriam (1993) were the first to create the
neural network method for the purpose of separating Shakespeare's
plays from the plays of his contemporary Christopher Marlowe. Once
applied to all Shakespeare's works, the programme only recognized
Part Three of Henry VI as Marlowe's.
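The training loop described above can be illustrated with a far simpler model than Matthews and Merriam's actual network: a single artificial neuron whose "links" (weights) are adjusted whenever it misattributes a training sample. The two-feature frequency vectors below are invented for the sketch.

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Minimal single-neuron 'network': learn weights so that the sign
    of w.x + b separates author A (+1) from author B (-1). Each wrong
    guess on a training sample adjusts the links."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in samples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1
            if pred != label:              # adjust the links on a miss
                w[0] += lr * label * x[0]
                w[1] += lr * label * x[1]
                b += lr * label
    return w, b

# Invented features: rates of 'upon' and 'while' per 1,000 words.
train = [([3.2, 0.2], 1), ([2.9, 0.3], 1),    # author A samples
         ([0.2, 3.1], -1), ([0.4, 2.8], -1)]  # author B samples
w, b = train_perceptron(train)

def classify(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1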
4.4. Genetic Algorithms
The genetic algorithm is another author identification method which
uses a set of rules. It is based on Darwin's principle of natural
selection. For example, a rule might be, "If the conjunction but
appears more than 1.7 times in every thousand words, then the text is by
author X". The program is presented with a text and uses the rules
to determine authorship. The rules are tested against a set of known
texts and each rule gets a score depending on correct or incorrect
results. The 50 rules with the lowest scores are thrown out. The
remaining 50 rules are slightly changed and reintroduced into the
algorithm. This is repeated until, after 256 generations, the evolved
rules correctly attribute the text. In the end the finalized rules
included only 8 words. Using this method on the Federalist Papers,
Holmes and Forsyth (1995) proved once again that the author of all 12
disputed articles was Madison.
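The selection-and-mutation cycle described above can be sketched with a minimal genetic algorithm over rules of the form "attribute to X if a word's rate exceeds a threshold". The two-text corpus, word list and population sizes below are invented for illustration and are much smaller than in Holmes and Forsyth's study.

```python
import random
import re

def rate(text, word):
    """Occurrences of `word` per 1,000 running words."""
    toks = re.findall(r"[a-z']+", text.lower())
    return 1000.0 * toks.count(word) / len(toks)

def fitness(rule, corpus):
    """+1 for every known text the rule attributes correctly: the rule
    (word, threshold) predicts author 'X' when the word's rate exceeds
    the threshold."""
    word, threshold = rule
    return sum((rate(text, word) > threshold) == (author == "X")
               for text, author in corpus)

def evolve(corpus, words, generations=30, pop=20, seed=0):
    """Keep the better half of the rule population, reintroduce slightly
    mutated copies, and repeat for a fixed number of generations."""
    rng = random.Random(seed)
    rules = [(rng.choice(words), rng.uniform(0.0, 5.0)) for _ in range(pop)]
    for _ in range(generations):
        rules.sort(key=lambda r: fitness(r, corpus), reverse=True)
        survivors = rules[:pop // 2]                 # drop the worst-scoring half
        mutated = [(w, max(0.0, t + rng.uniform(-0.5, 0.5)))
                   for w, t in survivors]            # slightly changed copies
        rules = survivors + mutated
    return max(rules, key=lambda r: fitness(r, corpus))

# Invented two-text corpus: 'but' is frequent only in author X's text.
corpus = [("but still but soon but late the end", "X"),
          ("and then and so the end", "Y")]
best = evolve(corpus, ["but", "and"])
```

Here the evolved rule ends up testing the rate of "but", the only word that separates the two toy texts.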
4.5. Rare Pairs
This method of authorship identification relies upon specific
individual habits of collocation, i.e. using certain words with one
another. The use of certain words may, for a particular author, entail
the use of other words, which that author usually combines with the
first word. By itself, one word means nothing, but combined with another
word it often says a lot about the author, according to Craig (2004),
who developed this method. Still, stylometrists caution that a thorough
analysis of an author's style and the metaphors he uses renders this
method open to manipulation and document falsification.
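The collocation habit described above can be made concrete by counting which word pairs co-occur within a short window of running text; pairs frequent in one author's work but rare elsewhere are candidate markers. The window size and sample sentence below are invented for the sketch.

```python
from collections import Counter
import re

def collocation_pairs(text, window=5):
    """Count unordered word pairs that co-occur within `window` running
    words. Candidate 'rare pairs' are those frequent in one author's
    texts but absent from a reference corpus."""
    toks = re.findall(r"[a-z']+", text.lower())
    pairs = Counter()
    for i in range(len(toks)):
        for j in range(i + 1, min(i + window, len(toks))):
            if toks[i] != toks[j]:
                pairs[tuple(sorted((toks[i], toks[j])))] += 1
    return pairs

pairs = collocation_pairs("the answer was answered in the answer book")
```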
4.6. Cryptography--Stylometry Reversed
There are many reasons to identify authorship. However there are
also many reasons for staying anonymous. A department or company manager
may wish to stay anonymous when sending out a memo with unpleasant
corporate news or some other important document. Kacmarcik and Gamon
(2006), working on the famous Federalist Papers, tested the possibility
of creating a tool for intentional anonymization of a text, believing
that there are authors who wish to stay anonymous. Their technique
presupposes that an author A wishes to preserve anonymity for a
particular document D. Using cryptographic tools, the author A creates a
modified version of D whose style moves away from A's. Combining stylometric and
cryptographic methods, they successfully showed that an author's
identity can be intentionally concealed in a text.
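A crude illustration of the idea, not Kacmarcik and Gamon's actual algorithm: lower the rate of a marker word that gives the author away by substituting a near-synonym, then re-measure the rate. The substitution table is invented; a real obfuscation tool would first identify which features a stylometric classifier relies on.

```python
import re

# Invented marker list: words whose rate is assumed to identify the author.
SUBSTITUTIONS = {"upon": "on", "whilst": "while"}

def obfuscate(text):
    """Replace marker words with near-synonyms, preserving capitalization."""
    def swap(m):
        w = m.group(0)
        repl = SUBSTITUTIONS.get(w.lower(), w)
        return repl.capitalize() if w[0].isupper() else repl
    return re.sub(r"[A-Za-z']+", swap, text)

def rate(text, word):
    toks = re.findall(r"[a-z']+", text.lower())
    return toks.count(word) / len(toks) if toks else 0.0

doc = "Upon reflection, the matter rests upon a technicality."
anon = obfuscate(doc)
```

After the pass, the rate of the marker "upon" drops to zero, which is the effect an identifying classifier would be measuring.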
5. CONCLUSION
This paper offers a definition of stylometry, describes its
development and presents basic authorship identification methods applied
in stylometry today. The initial research of the project
"031-2/2008" points to the future development of stylometry
marked by the discovery of a unique method which will, with the support
of high technology and artificial intelligence as well as mathematical
and statistical methods, combine the most important and efficient
stylometric methods. That unique method would have to be applicable to
any text, in any language, on any subject-matter and from any source,
and provide reliable authorship identification or deliberate authorship
obfuscation.
6. REFERENCES
Belak, S. & Cicin-Sain, D. (2005). The Development of the Concept of
Terotechnology, Journal of Maritime Studies, 19, ISSN 1332-0718
Binongo, J. N. G. & Smith, M. W. A. (1999). The Application of Principal
Component Analysis to Stylometry, Literary and Linguistic Computing,
14(4):445-466, ISSN 1477-4615
Halteren, H. & Oostdijk, N. H. J. (2004). Linguistic Profiling of Texts
for the Purpose of Language Verification, Annual Meeting of the ACL,
July 2004, Barcelona
Holmes, D. (1998). The Evolution of Stylometry in Humanities
Scholarship, Literary and Linguistic Computing, 13(3):111-117, ISSN
1477-4615
Kacmarcik, G. & Gamon, M. (2006). Obfuscating Document Stylometry to
Preserve Author Anonymity, Annual Meeting of the ACL, 2006, Sydney
Mosteller, F. & Wallace, D. (1964). Inference and Disputed Authorship:
The Federalist. Addison-Wesley, Reading, MA, 366 pp.