Detecting Plagiarised Content

July 13, 2013

Eminent scientist and prime minister’s scientific adviser CNR Rao and three other Bangalore-based researchers are embroiled in an unsavory ‘plagiarism’ row. Rejecting charges of plagiarism in a paper co-authored by him, Rao said it was an instance of ‘copying of a few sentences of text.’ Most people would have raised their eyebrows or nodded sadly. It is surprising how people with great achievements fall from grace over seemingly ‘trivial’ issues. What exactly is ‘plagiarism’? Is it important enough to make headlines? Or is it because the persons involved are men and women of great integrity and achievements?

Plagiarism takes place when we use someone else’s words or ideas and try to show them as our own. ‘Copy cat’? ‘Cheating’? Well, maybe. When does a work stop being original and qualify as being ‘plagiarized’?

People in show business and literature and, quite commonly, students are guilty of this. How many ‘guides’, ‘text books’ and ‘study materials’ are modelled on well-known ones? How many pieces of music are ‘inspired’ by similar pieces? There are many instances where a plagiarized version becomes more popular than the original!

How exactly does one determine the quantity of plagiarism? How much ‘copying’ is permissible while writing an article? The definition of plagiarism used by publishers and journal editors may vary across disciplines, but it is commonly understood as the non-acknowledgement of other people’s work or contributions. The rapid growth of the internet has steadily increased the amount of text available in electronic form on a wide variety of subjects, and this has spawned ‘cut-and-paste’ plagiarism. Because of the sheer volume of information, detecting whether a document has been plagiarised from another source is obviously time-consuming. Not many are aware that when they submit a paper to a journal it will probably be checked for plagiarism. There are many types of software available, many of them free and some of them paid. All of them rely on a combination of statistical tools. So brush up on some common statistical tools if you need to use and understand these.

The ‘term count’ in any given document is the number of times a given term appears in that document. This count is usually normalised, because longer documents have higher term counts regardless of the actual importance of the term in the document. The important techniques used for plagiarism detection in software have been classified as attribute-counting techniques or ranking measures. The former use several important statistical concepts like a normalised histogram, Bayes’ theorem for calculating posterior probabilities, conditional probability and so on. In the latter, collections of data or objects are ranked so that the most relevant or most commonly occurring objects are placed at the top of the list.
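To make the idea of a normalised term count concrete, here is a minimal sketch in Python. It assumes a very simple tokeniser (lowercase letters and apostrophes only), and the function names `normalised_term_counts` and `rank_terms` are invented for illustration. It is not how any particular plagiarism checker works.

```python
import re
from collections import Counter


def normalised_term_counts(text):
    """Return each term's count divided by the total number of terms.

    Assumption: a 'term' is any run of lowercase letters or apostrophes.
    """
    terms = re.findall(r"[a-z']+", text.lower())
    total = len(terms)
    if total == 0:
        return {}
    counts = Counter(terms)
    # Dividing by the document length keeps long documents from
    # dominating simply because they contain more words.
    return {term: n / total for term, n in counts.items()}


def rank_terms(text, top=3):
    """List the most frequent terms first (ties broken alphabetically)."""
    freqs = normalised_term_counts(text)
    return sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    sample = "the cat sat on the mat the end"
    for term, freq in rank_terms(sample):
        print(term, round(freq, 3))
```

The normalised counts of a document always add up to 1, which is what makes them comparable between a short abstract and a long thesis.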

Probability also underpins the idea of a statistically improbable phrase (SIP). A good SIP is usually between 6 and 12 words long and is unique to a particular piece of work. Amazon.com uses this idea: its computers scan the text of all books in the Search Inside! programme. If they find a phrase that occurs a large number of times in a particular book relative to all Search Inside! books, that phrase is a SIP in that book.
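The ‘relative to all books’ idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Amazon’s actual method is not described here, so the scoring rule (a phrase’s rate in one book divided by its rate in the whole collection) and the names `ngrams` and `improbable_phrases` are hypothetical. Short two-word phrases are used to keep the example small.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Split text into overlapping n-word phrases (sentence breaks are ignored)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def improbable_phrases(book, others, n=2, top=3):
    """Rank phrases that recur in `book` but are rare in the whole collection.

    Assumption: the collection is `book` plus `others`, and a phrase must
    occur at least twice in `book` to be considered.
    """
    book_counts = Counter(ngrams(book, n))
    all_counts = Counter(book_counts)
    for text in others:
        all_counts.update(ngrams(text, n))

    book_total = sum(book_counts.values())
    all_total = sum(all_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        if count < 2:
            continue
        book_rate = count / book_total
        collection_rate = all_counts[phrase] / all_total
        # A high ratio means the phrase is far more common in this book
        # than in the collection as a whole.
        scores[phrase] = book_rate / collection_rate
    return sorted(scores, key=lambda p: (-scores[p], p))[:top]


if __name__ == "__main__":
    book = ("quantum turtle soup is served. quantum turtle soup is tasty. "
            "the end of the day")
    others = ["the end of the day came. the end of the day went.",
              "soup is served at the end of the day"]
    print(improbable_phrases(book, others))
```

A real system would also need a much larger collection and longer phrases before the scores mean anything.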

The tf-idf weight (term frequency–inverse document frequency) is a numerical statistic which indicates the importance of a word in a document. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but it also takes into account that some words are generally used more than others, so words that appear in many documents receive a lower weight.

The YAP (‘Yet Another Plague’) series of tools was created by Michael Wise. He created the original version (YAP1) in 1992 and optimised it with YAP2. The result of YAP1 is a value from 0 to 100 for each comparison, 0 representing no match and 100 an exact copy. Tools like these draw on techniques from several disciplines, including statistics and computer science.
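The tf-idf weight described above can be worked through with a small Python sketch. There are several variants of the formula; this one assumes term frequency is the count divided by the document length and inverse document frequency is the natural logarithm of (number of documents / number of documents containing the term). The function name `tf_idf` and the whitespace tokeniser are illustrative choices, not part of any tool mentioned in this article.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    Assumptions: terms are whitespace-separated lowercase words,
    tf = count / document length, idf = ln(N / document frequency).
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    docs = ["the cat sat", "the dog barked", "the cat purred"]
    for doc, w in zip(docs, tf_idf(docs)):
        print(doc, {t: round(v, 3) for t, v in w.items()})
```

In this example ‘the’ appears in every document and so gets a weight of zero, while ‘sat’, which appears in only one document, outweighs ‘cat’, which appears in two.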