Article information

Title: Stylometry--definition and development.
Authors: Belak, Stipe ; Radman Pesa, Anita ; Belak, Branko et al.
Journal: Annals of DAAAM & Proceedings
Print ISSN: 1726-9679
Publication year: 2008
Issue: January
Language: English
Publisher: DAAAM International Vienna
Keywords: Artificial intelligence

1. INTRODUCTION

Written texts of uncertain authorship have long attracted the attention of experts from different fields, mostly linguists and literary theorists, whose research has focused on famous literary works of uncertain authorship with the aim of establishing the identity of the author and the extent of his oeuvre. Persistent searches of texts for the hallmarks of an author's linguistic style, such as characteristic expressions and the use of certain words or metaphors, have yielded results. However, it is the fast technological development of the last fifteen years that has enabled stylometrists to develop different methods of authorship identification on the basis of a given text. Those methods include mathematical tools, statistical methods and artificial intelligence methods, and they have resulted in specialized software for text analysis and authorship identification, but also for the intentional concealment of a document's authorship. Initially, stylometry was mostly applied in literature and other fine arts. Most stylometrists began their research on widely known texts, such as the Catholic epistles, ancient Greek and Latin texts, the plays of Shakespeare and Marlowe, and the Federalist Papers (Mosteller & Wallace, 1964), a series of 85 articles published anonymously in 1787 and 1788, advocating the ratification of the new Constitution of the United States of America (Klarreich, 2003). The hypothesis examined in the paper is that stylometry will become more important over time due to the frequency and significance of communication in the realm of the economy and politics. Analyses of anonymous texts from various sources will be crucial for gaining any kind of advantage, be it political or economic, all within business intelligence as a modern form of protection.
The paper is based on the project "031-2/2008 Research Into Matters Warranting, Economically and Situation-wise, Adaptive Restructuring of an Organization in a Dynamic Environment" (University of Zadar, Department of Economics, Centre for Economic Research, MER-Evrocentar Slovenia, 2008) and continues earlier research into the concept of terotechnology within the study of organization and business intelligence protection (Belak & Cicin-Sain, 2005).

2. STYLOMETRY--DEFINITION

Stylometry (Greek stylos (style) + metron (measure)) is a scientific method of studying linguistic style, which in the last several years has also been successfully applied in other areas. In addition to art (literature, music, painting), stylometry is applied in forensic science, security affairs, the economy and politics, but also at universities, i.e. wherever it is necessary to identify the author of an anonymous text or of a text with several potential authors, whether they all claim authorship or the author is simply not known. This is usually the case with complex groups of texts on which several authors worked together. An author's linguistic style can be identified more readily by analyzing the form of writing than by studying the content, because content can easily be copied, while an author's form of writing is very difficult or almost impossible to imitate, being strongly influenced by the author's subconscious. A text sample used in stylometric analysis should contain a minimum of 1000 words. It is also important to use the pure, original text, unaltered over the years.

3. HISTORY AND DEVELOPMENT

In 1439 the Italian humanist Lorenzo Valla proved that the Donation of Constantine was a forgery dating from the period between 750 and 850. His conclusion was based on an analysis of the written text and a comparison of its Latin with that used in authentic 4th century documents. Some Latin words and phrases used in the Donation did not exist in the 4th century and came into use much later. This is one of the first examples of stylometric analysis. In modern times stylometry came to be widely used for the study of authorship issues in English Renaissance drama. The development of computers and of computer languages for analyzing large quantities of data greatly improved stylometric efforts. In the early 1960s, Andrew Q. Morton developed a computer programme for the analysis of the fourteen Epistles of the New Testament attributed to St. Paul. The analysis indicated that the epistles had been written by six different authors.

4. METHODS

The main aspect of authorship identification is the selection of the features to be analyzed in a text. The first idea that comes to mind at the mention of linguistic analysis of a text can be misleading: that an author's identity is revealed through complicated or specific words, since they mark the author's unique style and set him apart from others. Stylometrists have shown that it is the exact opposite which matters. Authors differ in their usage of the most frequent simple words, such as "with" and "in", because everyday words like these prepositions are used automatically, subconsciously and without reflection. Words such as these are in fact an authorial "fingerprint" enabling experts to identify the author. Rare and specific words have a strong impact on readers but can easily be inserted into a text deliberately to imitate a specific author. It is much more difficult to imitate the usage of simple prepositions, and analyzing them is thus deemed the safer technique, according to Holmes (1998). Ideally the two techniques should be combined, says Holmes, analyzing first the frequency of simple widespread words and then that of the rare and author-specific ones.

4.1. Frequency in Text Segments

In the early 1960s Frederick Mosteller and David Wallace first demonstrated the stylometric method of authorship attribution on the aforementioned Federalist Papers. Mosteller and Wallace found that Hamilton used the word "upon" ten times as often as Madison did. They conducted analyses treating both men as possible authors and examined the frequency of thirty words, word by word, to establish differences in how often each was used. Using the frequency method they concluded that the author of all twelve disputed articles was Madison, which was confirmed by subsequent stylometric analyses.
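The frequency comparison described above can be illustrated with a minimal sketch. It is not the procedure of Mosteller and Wallace, whose analysis was statistically far more elaborate; the function name rate_per_thousand and the simple tokenization rule are assumptions made for illustration only.

```python
import re


def rate_per_thousand(text, word):
    """Occurrences of `word` per thousand running words of `text`."""
    # Assumed tokenization: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(word.lower()) / len(tokens)


def compare(text_a, text_b, words):
    """Rates of each marker word in two texts, as {word: (rate_a, rate_b)}."""
    return {w: (rate_per_thousand(text_a, w), rate_per_thousand(text_b, w))
            for w in words}
```

Computing such rates for a list of candidate marker words in texts of known authorship, and then in a disputed text, gives the raw numbers on which a frequency-based attribution rests.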

4.2. Writer Invariant

The writer invariant is a distinctive, measurable property of a text that can be attributed to a certain author. An example of a writer invariant is the frequency of the function words used by the writer. In one such method (Binongo, 2003), the text is analyzed to find its 50 most common words. The text is then broken into 5,000-word chunks and each chunk is analyzed again to find the frequency of those 50 words in that chunk. This generates a 50-number identifier for each chunk of text. The identifier is then treated as a point in a 50-dimensional space, which is flattened into a plane using principal components analysis (PCA), a technique for reducing multidimensional data sets to lower dimensions. A more recent technique for pattern recognition and for the representation of chunks of text in multidimensional space, the support-vector machine, was used by Fung (2003).
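The chunk-and-project procedure can be sketched as follows. This is a minimal illustration rather than the cited implementation: the function names are hypothetical, and the two principal directions are found by plain power iteration with deflation, an assumption chosen so that the sketch needs only the standard library.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def chunk_vectors(tokens, n_words=50, chunk_size=5000):
    """Relative frequency of the text's n_words most common words, per chunk."""
    vocab = [w for w, _ in Counter(tokens).most_common(n_words)]
    vectors = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        vectors.append([counts[w] / chunk_size for w in vocab])
    return vocab, vectors


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_direction(rows, iterations=200, seed=0):
    """Dominant direction of already-centred rows, found by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() for _ in rows[0]]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]
        w = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(len(v))]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # no variance left in the data
            return [0.0] * len(v)
        v = [x / norm for x in w]
    return v


def project_2d(vectors):
    """Project each chunk vector onto the first two principal components."""
    n = len(vectors)
    means = [sum(col) / n for col in zip(*vectors)]
    centred = [[x - m for x, m in zip(row, means)] for row in vectors]
    pc1 = leading_direction(centred)
    # Deflation: remove the pc1 part of every row, then search again.
    residual = [[x - dot(row, pc1) * p for x, p in zip(row, pc1)]
                for row in centred]
    pc2 = leading_direction(residual)
    return [(dot(row, pc1), dot(row, pc2)) for row in centred]
```

Chunks written in a similar style are expected to land close together in the resulting plane, while chunks by a different hand tend to form a separate cluster.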

4.3. Neural Networks

A neural network is an artificial intelligence method for identifying the author of a text, modelled on the network of neurons in the human brain. Such a network is built with random links and works by trial and error, learning by doing. The network is presented with prepared texts by known authors. Each time the network guesses the authorship of a text segment, a correct guess strengthens its links, until the network can properly identify the known texts. Once the training period is complete, the neural network can determine the authorship of other texts by the authors on which it was trained. Matthews & Merriam (1993) were the first to use the neural network method for the purpose of separating Shakespeare's plays from those of his contemporary Christopher Marlowe. Applied to all of Shakespeare's works, the programme recognized only Part Three of Henry VI as Marlowe's.
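The training loop can be illustrated with a single-layer perceptron, which is a much simpler model than the network of Matthews & Merriam and is used here only as a stand-in. The sketch starts from random weights and, as an assumption that differs slightly from the description above, corrects the weights after each wrong guess instead of reinforcing them after each right one. The feature values are made up for illustration.

```python
import random


def predict(weights, bias, features):
    """Return +1 (author A) or -1 (author B) for one text segment."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else -1


def train_perceptron(samples, epochs=1000, rate=0.1, seed=0):
    """samples: list of (feature_vector, label) pairs, label +1 or -1."""
    rng = random.Random(seed)
    # Random initial links, as in the description above.
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        mistakes = 0
        for features, label in samples:
            if predict(weights, bias, features) != label:
                mistakes += 1
                # Move the weights towards the correct answer.
                weights = [w + rate * label * x
                           for w, x in zip(weights, features)]
                bias += rate * label
        if mistakes == 0:  # every known segment is attributed correctly
            break
    return weights, bias


# Made-up rates per thousand words of two function words in six segments.
TRAINING = [
    ([3.0, 0.1], 1), ([2.8, 0.0], 1), ([3.4, 0.2], 1),     # author A
    ([0.2, 0.5], -1), ([0.1, 0.4], -1), ([0.3, 0.6], -1),  # author B
]
```

After train_perceptron(TRAINING) returns, predict can be applied to the feature vector of a new segment by one of the two known authors.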

4.4. Genetic Algorithms

The genetic algorithm is another author identification method which uses a set of rules. It is based on Darwin's principle of natural selection. For example, a rule might be, "If the conjunction but appears more than 1.7 times in every thousand words, then the text is by author X". The program is presented with a text and uses the rules to determine authorship. The rules are tested against a set of known texts and each rule gets a score depending on correct or incorrect results. The 50 rules with the lowest scores are thrown out. The remaining 50 rules are slightly changed and reintroduced into the algorithm. This is repeated until, after 256 generations, the evolved rules correctly attribute the text. In the end the finalized rules included only 8 words. Using this method on the Federalist Papers, Holmes and Forsyth (1995) proved once again that the author of all 12 disputed articles was Madison.
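The evolutionary loop can be sketched as follows, under stated assumptions: there are only two candidate authors, a rule that does not fire attributes the text to the other author, and mutation nudges only the threshold. The word list, the author labels and all function names are hypothetical and are not taken from Holmes and Forsyth.

```python
import random

WORDS = ["but", "upon", "while"]  # hypothetical feature words
AUTHORS = ["X", "Y"]              # hypothetical candidate authors


def random_rule(rng):
    """A rule is (word, threshold per thousand words, author)."""
    return (rng.choice(WORDS), rng.uniform(0.0, 5.0), rng.choice(AUTHORS))


def apply_rule(rule, rates):
    """Attribute a text, given its word rates per thousand words."""
    word, threshold, author = rule
    other = AUTHORS[1] if author == AUTHORS[0] else AUTHORS[0]
    return author if rates.get(word, 0.0) > threshold else other


def score(rule, samples):
    """Number of known texts that the rule attributes correctly."""
    return sum(apply_rule(rule, rates) == author for rates, author in samples)


def mutate(rule, rng):
    """Slightly change a rule by nudging its threshold."""
    word, threshold, author = rule
    return (word, max(0.0, threshold + rng.gauss(0.0, 0.2)), author)


def evolve(samples, generations=256, size=100, seed=0):
    rng = random.Random(seed)
    population = [random_rule(rng) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=lambda r: score(r, samples), reverse=True)
        survivors = population[: size // 2]  # lowest-scoring half is dropped
        population = survivors + [mutate(r, rng) for r in survivors]
    return max(population, key=lambda r: score(r, samples))
```

Because the survivors are carried over unchanged, the best score in the population can never decrease from one generation to the next.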

4.5. Rare Pairs

This method of authorship identification relies upon specific individual habits of collocation, i.e. of using certain words with one another. For a particular author, the use of a certain word may entail the use of other words which that author usually combines with it. By itself one word means nothing, but combined with another word it often says a lot about the author, according to Craig (2004), who developed this method. Still, stylometrists caution that a thorough analysis of an author's style and of the metaphors he uses leaves this method open to manipulation and document falsification.
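The idea of author-specific word pairs can be illustrated with a minimal sketch. It is not Craig's procedure: it simply treats a collocation as a pair of adjacent words, which is an assumption, and all function names are hypothetical.

```python
import re
from collections import Counter


def word_pairs(text):
    """Count every pair of adjacent words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(tokens, tokens[1:]))


def marker_pairs(known_a, known_b, min_count=2):
    """Pairs author A uses at least min_count times and author B never uses."""
    pairs_a, pairs_b = word_pairs(known_a), word_pairs(known_b)
    return {pair for pair, count in pairs_a.items()
            if count >= min_count and pair not in pairs_b}


def pair_hits(disputed, markers):
    """How often the marker pairs occur in a disputed text."""
    pairs = word_pairs(disputed)
    return sum(pairs[pair] for pair in markers)
```

Comparing pair_hits for the marker pairs of each candidate author gives a rough indication of whose habits of collocation the disputed text follows.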

4.6. Cryptography--Stylometry Reversed

There are many reasons to identify authorship. However, there are also many reasons for staying anonymous. A department or company manager may wish to stay anonymous when sending out a memo with unpleasant corporate news or some other important document. Kacmarcik and Gamon (2006), working on the famous Federalist Papers, tested the possibility of creating a tool for the intentional anonymization of a text, on the premise that there are authors who wish to stay anonymous. Their technique presupposes that an author A wishes to preserve anonymity for a particular document D. Using cryptographic tools, the author A creates a new document D' whose style moves away from that of A. Combining stylometric and cryptographic methods, they showed that an author's identity can be intentionally concealed in a text.

5. CONCLUSION

This paper offers a definition of stylometry, describes its development and presents the basic authorship identification methods applied in stylometry today. The initial research of the project "031-2/2008" points to a future development of stylometry marked by the discovery of a single unified method which will, with the support of high technology and artificial intelligence as well as mathematical and statistical methods, combine the most important and efficient stylometric methods. That unified method would have to be applicable to any text, in any language, on any subject-matter and from any source, and provide reliable authorship identification or deliberate authorship obfuscation.

6. REFERENCES

Belak, S., Cicin-Sain, D., (2005) The Development of The Concept of Terotechnology, Journal of Maritime Studies, 19, ISSN 1332-0718

Binongo JNG; Smith MWA, (1999). The Application of Principal Component Analysis to Stylometry, Literary and Linguistic Computing 1999 14(4):445-466, ISSN 1477-4615

Halteren H. Oostdijk N.H.J., (2004). Linguistic Profiling of Texts for the Purpose of Language Verification, Annual Meeting of the ACL, July, 2004, Barcelona

Holmes D., (1998). The Evolution of Stylometry in Humanities Scholarship, Literary and Linguistic Computing 13(3):111-117, ISSN 1477-4615

Kacmarcik G. & Gamon M., (2006). Obfuscating Document Stylometry to Preserve Author Anonymity, Natural Language Processing Group, ACL, 2006, Sydney.

Mosteller F. & Wallace D. (1964). Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley, 366pp.