Thursday, July 9, 2020

Computational and Computerized Methods for Authorship Attribution - 3850 Words

Computational and Computerized Methods for Authorship Attribution (Dissertation Review Sample)

Literature Review

Introduction

Determining the authorship of a given piece of writing has raised methodological questions for centuries. All kinds of groups, including journalists, academics, lawyers, and politicians, are interested in determining the actual author of a specific text. Conventionally, a combination of investigative journalism, scientific document analysis, and expert close reading of documents has been a useful approach that has continually produced good results. However, recent developments in computerization and computational techniques have produced several objective, automatic methods for inferring text authorship.

There are currently many papers on the subject of authorship attribution, some of which have appeared at various conferences covering methods that rely on computation, forensics, law, and machine-learning techniques (Patrick 235). However, this pursuit still lacks a clear indication of best practices and techniques for determining authorship attribution. This literature review therefore presents a survey and overview of the current methods used in determining authorship attribution. Before presenting the actual methods, the paper gives a brief background to authorship attribution and how it is determined.

The task of determining or verifying the authorship of an unknown text based solely on internal evidence is a very old one, going back at least to the medieval scholastics, for whom the firm attribution of a given text to a known ancient authority was crucial to judging the text's veracity. More recently, the problem of authorship attribution has gained greater prominence because of new applications in forensic analysis, humanities scholarship, and electronic commerce, and because of the development of computational methods for addressing the problem. In the simplest form of the problem, we are given examples of the writing of several candidate authors and are asked to determine which of them produced a given unidentified text (Koppel, Schler and Argamon 3). In this straightforward form, the authorship attribution problem fits the standard modern paradigm of a text categorization problem.

The components of text categorization systems are by now fairly well understood: documents are represented as numerical vectors that capture measurements of potentially relevant features of the text, and machine-learning methods are used to find classifiers that separate documents belonging to different classes (Koppel, Schler and Argamon 3). In practice, however, real authorship attribution problems are rarely as tidy as straightforward text categorization problems, in which we have a small closed set of candidate authors and essentially unlimited training text for each. There are many variants of the attribution problem that fall short of this ideal.

Authorship attribution is the process of inferring the characteristics of the author of any linguistic data. Even though this definition encompasses authorship identification on different media such as text and audio files, the more concerted efforts have remained focused on text-based materials (Patrick 238).
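To make the text categorization framing above concrete, the following is a minimal sketch, assuming scikit-learn is available: each known document is turned into a term-count feature vector and a standard classifier is trained to separate the candidate authors. The sample texts and author labels are hypothetical placeholders, not data from any study cited here.

```python
# Minimal sketch of authorship attribution framed as text categorization:
# known documents become numerical feature vectors (here, word counts),
# and a standard classifier is trained to separate the candidate authors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

known_texts = [
    "the whale moved silently beneath the waves that evening",   # hypothetical sample by author A
    "it is a truth universally acknowledged that claims need proof",  # hypothetical sample by author B
]
known_authors = ["A", "B"]

vectorizer = CountVectorizer()                  # documents -> term-count vectors
X_known = vectorizer.fit_transform(known_texts)

classifier = MultinomialNB()                    # any standard classifier could be used here
classifier.fit(X_known, known_authors)

# Attribute an anonymous document to the most likely candidate author.
X_unknown = vectorizer.transform(["beneath the waves the whale moved silently"])
print(classifier.predict(X_unknown))            # e.g. ['A']
```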
Until the advent of present-day statistics, it was difficult or impossible to answer the question of who authored a given text. Thus, for many centuries, several disputes about the authorship of texts remained unresolved. With modern statistics and advanced computers, coupled with huge corpora, it is now feasible to employ specialized algorithms to retrieve authorship details by means of information retrieval methods. The whole notion of authorship attribution became popular with the rise of corpus linguistics. It is built around the identification of an author's writing style.

Text documents have inherent characteristics or fingerprints that can be used to identify their creator, based on several assumptions (Pandian et al. 2512) such as the following:
* A document has a single author
* An author decides to make certain choices
* Each author applies his/her choices consistently across multiple text documents
* These choices remain in each piece of the author's literary works and can be detected

According to Pandian and others (2512), the problem of analyzing authorship so as to determine the credible author of a document and provide the necessary evidence for determining so is divided into three categories:
1. Authorship detection: this category deals with comparing several literary works to establish whether one author was responsible for writing them, without actually identifying that author. This is the case with plagiarism detection. Because this involves extracting a distinct writing style from different online messages, identification of lexical, content-free, structural, content-specific, and syntactic features is a key consideration here.
2. Authorship characterization: this involves obtaining a digest of various author characteristics and producing a profile of the author from his/her writings, covering aspects such as writing style, cultural background, educational background, and gender.
3. Authorship identification: this kind of authorship attribution is concerned with determining the probability that a particular author produced a piece of writing by examining related or other literary works of the same author.

One method used in determining authorship attribution is called 'unmasking'. This method attempts to decide whether any two book-length texts have a common author. While it works for texts long enough to fill a book, it fails when applied to short text documents. By comparing the core feature sets used to represent the contents of any two text documents, it becomes possible to establish a difference in authorship. For instance, if there are significant changes in the feature sets between two documents, then the authors are probably different as well (Moshel et al. 285). The concept of reasonable distance measures such as Burrows's Delta has been used widely. Its application solves various attribution problems by means of probabilistic ranking based on a multi-dimensional Laplacian distribution over word frequencies.

Automated authorship attribution methods fall into two key types. First, similarity-based methods measure the distance from one document to another and attribute an anonymous document to the author whose writing style is most similar to it. Second, machine-learning methods employ a classification method to group or classify anonymous documents, or any two documents, based on the known writing styles of authors.
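As a rough illustration of the similarity-based family just described, the sketch below follows the spirit of Burrows's Delta: each text is represented by the z-scored relative frequencies of the most frequent words, and the anonymous text is attributed to the candidate with the smallest mean absolute difference. This is a simplified reading of Delta rather than a faithful reimplementation of any published variant, and the texts are hypothetical placeholders.

```python
# Simplified Delta-style comparison: smaller score = more similar writing style.
from collections import Counter
from statistics import mean, pstdev

def relative_freqs(text, vocab):
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocab]

candidates = {
    "A": "the sea was calm and the ship moved on the quiet water",
    "B": "i think that we must consider what the evidence truly shows",
}
anonymous = "the ship moved on and the sea was quiet"

# Build a small shared vocabulary of the most frequent words across all texts.
all_words = " ".join(list(candidates.values()) + [anonymous]).lower().split()
vocab = [w for w, _ in Counter(all_words).most_common(10)]

profiles = {name: relative_freqs(text, vocab) for name, text in candidates.items()}
anon_profile = relative_freqs(anonymous, vocab)
all_profiles = list(profiles.values()) + [anon_profile]

def delta(profile_a, profile_b, all_profiles):
    # z-score each word's frequency across the corpus, then average |z_a - z_b|.
    diffs = []
    for i in range(len(vocab)):
        column = [p[i] for p in all_profiles]
        sd = pstdev(column) or 1.0
        mu = mean(column)
        diffs.append(abs((profile_a[i] - mu) / sd - (profile_b[i] - mu) / sd))
    return mean(diffs)

scores = {name: delta(anon_profile, prof, all_profiles) for name, prof in profiles.items()}
print(min(scores, key=scores.get), scores)   # candidate with the smallest Delta score
```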
For the similarity-based approach, the key concerns are the selection of document representation features, techniques such as Principal Component Analysis (PCA) for reducing the dimensionality of the feature space, and the choice of distance metric. However, past cases of determining the authorship of an anonymous document have mainly involved documents whose possible authors form a small set (Moshe et al. 285). That is, a list of possible authors is already available, producing a closed candidate set. This has been common when trying to verify an author, such as in contextual plagiarism analysis. Unfortunately, this approach has limitations in the sense that the actual author may not be among those in the candidate set.

METHODS FOR DETERMINING AUTHORSHIP ATTRIBUTION

Conventional Authorship Attribution Methods

Unitary Invariant Approach

This is a scientific method proposed in 1987-1988 as a result of studying texts attributed to the four gospel books of the Bible and texts written by Bacon (Koppel, Schler and Argamon 4). The logic behind the method is that the authorship of a given text can be determined by studying the relationship between the length of a word and the comparative frequency of its occurrence in a piece of text. The relationship results in a distinct curve that would help determine the authorship of anonymous texts. Later in the 20th century, statistical techniques were incorporated into this method with the aim of finding invariant properties of authors. Unfortunately, invariant measures such as sentence length have proven to be unstable or unreliable in accurately determining the authorship of a text (Koppel, Schler and Argamon 2-3).

Multivariate Analysis Approach

This approach finds its roots in research by Mosteller and Wallace in 1964, in which they experimented with multiple textual clues to generate information that they believed would help identify the authorship of texts, in a method called stylometric authorship attribution. In their approach, the two used a Bayesian classification method on several papers, employing the concept of word frequencies based on the grammatical function of a word. They believed, essentially, that if the Bayesian method were applied rigorously to a group of topic-independent function words, the results would help identify authorship attributes. Eventually, this approach of using certain features of text, particularly function words, led to the development of other, newer modeling techniques and textual features (Koppel, Schler and Argamon 5).

The fundamental reasoning behind these authorship attribution methods is that of relating a distance measure between an anonymous document and a set of documents viewed as points in some space (Koppel, Schler and Argamon 5). In fact, this approach i...
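To make the distance-in-a-feature-space idea above concrete, the following is a minimal sketch, assuming numpy and scikit-learn are available: hypothetical function-word frequencies are reduced with PCA and the anonymous document is attributed to the nearest candidate by Euclidean distance. The function-word list, texts, and author labels are illustrative placeholders, not taken from the literature reviewed here.

```python
# Documents as points in a reduced feature space; attribution by nearest distance.
import numpy as np
from sklearn.decomposition import PCA

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is", "was", "a"]

def feature_vector(text):
    words = text.lower().split()
    total = len(words) or 1
    return [words.count(w) / total for w in FUNCTION_WORDS]

known = {
    "A": "the ship was in the bay and it was calm in the evening",
    "B": "to think of that is to admit that the claim is not proven",
}
anonymous = "it was calm and the bay was quiet in the evening"

X = np.array([feature_vector(t) for t in known.values()] + [feature_vector(anonymous)])

pca = PCA(n_components=2)           # reduce the feature space before measuring distance
Z = pca.fit_transform(X)
known_points, anon_point = Z[:-1], Z[-1]

distances = {name: float(np.linalg.norm(anon_point - p))
             for name, p in zip(known.keys(), known_points)}
print(min(distances, key=distances.get), distances)   # nearest candidate author
```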