
A Web-Enabled Plagiarism Detection Tool
Colin J. Neill and Ganesh Shanmuganathan
Vol. 6, No. 5, September/October 2004

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

© 2004 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

WEBTOOLS

Plagiarism detection systems can use the Internet to help academic institutions and publishers identify stolen intellectual property.

1520-9202/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society. IT Pro, September/October 2004.

The Internet and the World Wide Web have revolutionized information sharing and searching; it is difficult to remember what academic research was like without it, or how we could possibly live without it again. It is an awesomely powerful resource, and therein lies the rub. Never have the phrases, “With great power comes great responsibility,” and “Power corrupts and absolute power corrupts absolutely,” been so true. We could be talking about many issues with that introduction, anything from e-retail to hacking and virus dissemination, but it turns out that we’re talking about academic dishonesty. In particular we’re talking about plagiarism, or passing off another person’s work as your own.

Penn State University considers plagiarism to occur when an individual submits a paper written by someone else, quotes or paraphrases another paper without proper citation, or presents another person’s ideas without attribution.

How serious a problem is this? A report in the Journal of Higher Education stated that 75 percent of college students admit to some form of cheating and half admit to serious cheating on written assignments (D. McCabe, L. Trevino, and K. Butterfield, “Academic Integrity in Honor Code and Non-Honor Code Environments: A Qualitative Investigation,” J. Higher Education, vol. 70, no. 2, 1999). An article in the Daily Pennsylvanian quotes McCabe as saying, “Students are growing up with technology that makes Internet plagiarism simple. It is easy to use, and almost all written sources are available on the Internet. Some students actually believe that they’re not doing anything wrong. They have this attitude that they’re doing research. They don’t think that they need to cite because everything on the Internet is public information” (M.C. Peterson, “Download. Steal. Copy. Cheating at the University,” Daily Pennsylvanian, 27 November 2001). These are astonishing numbers, particularly considering that we have identified only a handful of cases at our campus, the Great Valley Graduate School. The question is, how many cases are we missing?

So this is clearly a big problem for universities and schools, but it is not just an educational problem. The publishing industry, and periodical publications in particular, faces the same issues. Two recent editorials addressed the increasing occurrence of plagiarism, simultaneous submission (submitting the same work for review to multiple venues simultaneously), and republication (submitting previously published work without sufficient attribution), and how these practices are in breach of the copyright form signed by all authors, and the integrity at the heart of academic publishing (R.L. Haupt, “Plagiarism in Journal Articles,” IEEE Antennas and Propagation, Aug. 2003, vol. 45, no. 4, p. 102; and W.R. Stone, “Plagiarism, Duplicate Publication and Duplicate Submission: They Are All Wrong!” IEEE Antennas and Propagation, Aug. 2003, vol. 45, no. 4, pp. 47-49).

In fact, combating this worrisome trend has become a significant focus for all IEEE periodicals, which shows in changes at Manuscript Central, the IEEE’s online article submission system. Manuscript Central now requires authors to explicitly state that the submitted work is theirs, that every author contributed directly in its writing, and that the work has not appeared previously in another publication and is not already in submission elsewhere. Under a proposed new policy, the IEEE will ban authors caught plagiarizing from publishing for up to five years, and will identify the offending article in the archives as such.

Returning to higher education, universities around the world are ramping up their efforts against plagiarism. Penn State has implemented academic integrity boards to rule on contested cases and to set standard sanctions for infractions, to enforce the inclusion of the university’s policy on academic integrity on every course syllabus that students receive, and to ensure that professors read that policy at the start of each course, including how the policy works within the context of that course. We have worked hard on the education and policy side of the problem at Penn State, but the issue of detection has not received enough attention; as we said, we have discovered only a handful of cases at our campus.

The Internet is the offenders’ biggest weapon. Fortunately, the professor can also wield the plagiarizer’s weapon of choice. If students use search engines to find the material to copy, professors can use them to find those original sources. Unfortunately this can be very time-consuming. The student is only working on one paper, but the professor must grade and verify all of the students’ papers. To do this by hand can take more than an hour per paper, and this is in addition to the standard grading activities. Thankfully, services and tools are available to aid in the fight. We have also developed our own tool.

AVAILABLE TOOLS

Turnitin (http://www.turnitin.com) is the best-known plagiarism detection service. Originally plagiarism.com, it is a commercial service that iParadigms developed for registered individual educators or institutions. Professors and teachers submit papers to the site and receive results a day or so later. The site compares papers against an index of Internet content as well as large databases of “paper-mill” essays (essays available for purchase on the Web for use by students as term papers at school and university) and previously submitted papers. Recently a student at McGill University challenged using this service partly on the grounds that Turnitin subsequently adds the paper to its in-house database of material, constituting an economic benefit to Turnitin without compensation to the student. Despite this potential setback, the Joint Information Systems Committee (JISC), an organization representing all higher education institutions in the UK, recently selected Turnitin as its plagiarism detection service in the form of http://www.submit.ac.uk.

WordCHECK (http://www.wordchecksystems.com) is a stand-alone application that detects collusion between students in a course rather than plagiarism of external source material. To use the application, the professor or teacher loads all the documents into the system’s internal archive. The system compares all papers to detect copying within the class. The document comparison is based on keyword profiles (a type of linguistic fingerprint) and phrase matching. Although this system is not strictly detecting plagiarism, it could if the internal archive includes paper mill essays and similar content. Unfortunately, a 2001 review of the tool commissioned by the JISC (http://online.northumbria.ac.uk/faculties/art/information studies/Imri/Jiscpas/docs/jisc/DetectionTechnology.pdf) found its detection performance unsatisfactory.

EVE2 (http://www.canexus.com/eve) is a commercial application that downloads to a user’s desktop and determines if students have copied material from the Internet. For each paper, the application generates a report, including the percentage that contains plagiarism, the list of URLs, and an annotated copy of the paper with the copied sections highlighted in red. The tool accepts several file formats, including plain text and Microsoft Word documents, but it will only generate annotated copies of papers from the plain text files. Essentially the tool is an interface for Web searches, but this simplicity does not limit its effectiveness. The only drawback, which the 2001 JISC report identifies, is that the searches are only against HTML Web content, and much of the material on the World Wide Web is in alternate formats.

WCopyFind (…l) is a freely available collusion detection tool that Prof. Lou Bloomfield developed at the University of Virginia. At least two versions exist, but the most useful version features a simple graphical interface that lets users load the set of documents into the internal archive of the tool, just as with WordCHECK. The tool compares these documents against each other and optionally against a separate archive of files (that the professor might have collected over the years) for matching phrases. The tool presents the results as HTML files and hyperlinks common phrases between documents to indicate which students in a class were colluding. Although it cannot search against Internet content, the tool is fast, very easy to use, and the results are clear.

PLAGIARISM DETECTOR

Some might consider this variety in tools sufficient, particularly as this is not at all an exhaustive list, but each has limitations or costs that accompany its use, and none leverages the most powerful weapon in the available arsenal: Google.com. Google now references more than 4 billion Web pages. Most importantly, its index includes non-HTML content such as PDF files, which are popular with online publishers, as well as article archives such as Citeseer. Indeed, any tool that cannot search these resources has limited usefulness, at least in an educational environment. The Internet is often the only tool many students use for topic research; the Penn State Great Valley library is practically a ghost town. In light of this, only Turnitin and EVE2 would be useful and each has its limitations. Turnitin is a subscription-only, commercial service that appears to aim more toward the detection of paper-mill essays (prewritten essays that students can purchase and use as their own), which might be common in high schools or undergraduate settings, but are rarely relevant or suitable for graduate school assignments. EVE2 is less costly and doesn’t focus on prewritten essays, but its scope only covers HTML Web content. Given these restrictions, we decided to develop our own Web-based plagiarism detection tool that accessed the Google searching capabilities directly.

The detector engine is a Java application that resides on a server for shared user access. It operates in a periodic sleep/wake mode. When the engine wakes it checks for any files to analyze in the input directory. When it finds files, it starts processing them and moves each processed file to the output directory along with the resultant analysis report and associated files. This makes the tool exceptionally easy to use. Registered users need only place the documents in the specified input directory and then return later to check the output directory for the HTML reports. As an aside, processing a typical term paper of approximately 4,000 words currently takes around three minutes.

Figure 1 illustrates how the tool performs analysis in stages.
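The sleep/wake cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions, written in Python for brevity even though the engine itself is a Java application. The function names (`analyze`, `process_pending`, `run_engine`), the 60-second interval, the `.txt` filter, and the report file naming are all hypothetical, and `analyze` is a placeholder, since the actual analysis is not disclosed in this article.

```python
import shutil
import time
from pathlib import Path


def analyze(text: str) -> str:
    """Hypothetical stand-in for the undisclosed analysis; returns an HTML report."""
    word_count = len(text.split())
    return f"<html><body><p>Words analyzed: {word_count}</p></body></html>"


def process_pending(input_dir: Path, output_dir: Path) -> list[str]:
    """Run one wake cycle: analyze each waiting file, then move it to the output directory."""
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for path in sorted(input_dir.glob("*.txt")):
        report = analyze(path.read_text(encoding="utf-8", errors="replace"))
        # Write the report next to where the processed file will end up.
        (output_dir / f"{path.stem}.report.html").write_text(report, encoding="utf-8")
        # Moving the file out of the input directory keeps it from being analyzed twice.
        shutil.move(str(path), str(output_dir / path.name))
        processed.append(path.name)
    return processed


def run_engine(input_dir: Path, output_dir: Path, sleep_seconds: float = 60.0) -> None:
    """Alternate between processing and sleeping until the process is stopped."""
    while True:
        process_pending(input_dir, output_dir)
        time.sleep(sleep_seconds)
```

A deployment along these lines would call `run_engine(Path("input"), Path("output"))` once on the server; users then only interact with the two directories, which matches the drop-off and pick-up workflow the article describes.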
We apologize to interested readers, but we purposefully chose not to explain in detail the engine’s internal operation and the algorithmic aspects of the analysis and detection process. We are still wrestling with the issue of how to publish these details without revealing, to the enterprising student, a strategy to defeat the tool.

[Figure 1. Block diagram of detector. Labels recoverable from the diagram include "score" and "HTML reports".]

Grammar analyzer

The first stages of analysis relate to the grammatical content of the input file. The file is parsed using QTAG (available at http://web.bham.ac.uk/o.mason/software/tagger), a probabilistic part-of-speech tagging package that returns the part of speech for each word in the file (adjective, noun, pronoun). This tagging identifies words that the tool can eliminate from searches, and it determines changes in tense, voice and person as indicators suggesting multiple authorship or cut-and-paste writing.

Additionally, Flesch scoring measures document readability. The tool computes the Flesch Reading Ease Score and Flesch-Kincaid Grade Level for the entire document and for each paragraph. These are well-known readability measures based on the number of words per sentence and number of syllables per word. The tool uses them as indicators of potential multiple authorship, under the assumption that a student’s scores won’t vary greatly from section to section or paragraph to paragraph. The tool analyzes dispersion of the paragraph scores about the document mean to identify potential plagiarism “hot-spots” that it excerpts for Web searching.

Google API

As mentioned previously, the detector uses the Google search engine for Web searching. To do this automatically the detector accesses the search engine via the Google Web APIs service using the Simple Object Access Protocol (SOAP).

The tool decomposes into search terms the sections of text that the grammar analysis has identified as potential hot spots. The tool conducts decomposition in two ways: finding exact matches of sentence fragments (Google limits length to 10 words), or finding contextual matches of keywords (which the tool determines from the QTAG speech tagger). A user can configure the detector to do either of these directly, or leave the detector in default mode, where context-match searches are performed only if the results of exact matches are below a preset threshold.

The tool screens the list of Web links that Google returns to remove redundant and repeated URLs and to determine the most relevant matches for each section of text and the complete input file. The tool uses the number and relevance of “hits” as factors in the plagiarism score for the document, along with the standard deviation of the Flesch scores and grammatical analysis. The plagiarism score is on a scale of 1 to 5 and is indicated in the generated reports as a five-colored scale from green, for innocent, to red, for very guilty.

Some readers might be wondering why we don’t perform Web searches for the entire document sentence-by-sentence, but initial experiments demonstrated that this was impractical given the size and number of the input documents. In addition, the usage policy of the Google API service limits users to 1,000 searches per day per user account and this ration is quickly wasted when searching sections of text that are unlikely to be plagiarized.

Example reports

We included some examples of the HTML reports that the detection tool generates, to demonstrate its utility. The tool generates a top-level report (Figure 2 shows an example), showing the paper title and the date the report was generated. The report shows the Flesch scores for the entire paper followed by each paragraph of text, the list of relevant URLs for that paragraph, and the difference between the overall Flesch scores and those of that paragraph. The “View” link then opens a separate page for each paragraph, as Figure 3 shows, presenting the paragraph text followed by excerpts from the returned URLs with the words common to both highlighted so that the user can quickly identify what text has found a match and, therefore, if the author indeed plagiarized that section of text.

[Figure 2. Segment of report generated for work that does not contain plagiarism.]

[Figure 3. Sample report page for a paragraph of text that does not contain plagiarism.]

In the first example, a graduate research paper on Model Driven Development, you can see from the portion of the top-level report in Figure 2 that the tool detected no plagiarism (indicated by the highlighting of the green segment of the score bar). Figure 3 shows the detailed report for the first paragraph and you can see that only individual terms are highlighted in the URL excerpt, indicating that the text was not copied and the Google search merely found terms related to the paper topic.

Alternatively, Figure 4 is a paper on software process return on investment that included substantial plagiarism. You can see that the red segment of the score bar is highlighted, indicating that the tool detected the plagiarism. When you follow the View link for the second paragraph in Figure 5, you see from the highlighted text that the author copied, without attribution, entire sentences from a Web page, clearly violating standard academic integrity policies, including that used at Penn State.

[Figure 4. Segment of report for work that contains plagiarism.]

[Figure 5. Sample page for paragraph that contains plagiarism.]

Future enhancements

Although we are very happy with the operation of the current version of the detection tool, we plan enhancements for future releases. In particular we are interested in

• using more of the information gleaned from grammatical parsing to isolate suspicious sections of submitted documents and therefore lighten the load on Web searching.
• adding reference checking capabilities so source-attributed sections do not factor in plagiarism scoring. Currently the tool does not determine if the author has correctly referenced copied material and this becomes an issue particularly when directly quoting a source; the relevance scores returned from Google will indicate plagiarism even when the author has properly cited the source. This might seem simple, but ensuring the author has properly and accurately cited quoted text is difficult considering the number of ways of doing it.
• experimenting with alternative search term approaches to narrow search results. The exact-match and context (keyword) match approaches are effective, but we wonder whether alternative schemes of selecting keywords might be more effective, such as series of two-word terms.
• adding capabilities to process PDF, Microsoft Word documents, and HTML files. Currently we can only process ASCII text files. The only format we foresee difficulty with is the Word document, since the application is built using Java, Microsoft has not created any standard APIs, and the open source or freeware tools available struggle to keep up with Microsoft’s frequent changes in document format.

Of course detecting plagiarism, or any other form of academic dishonesty, is not the cure; we are only treating the symptoms of this social disease. But at least now we can diagnose the condition when it arises, and perhaps the threat of detection will be deterrent enough in most situations.

Colin J. Neill is an assistant professor of software engineering at Penn State. Contact him at [email protected]

Ganesh Shanmuganthan is a senior software engineer at SCT Corp. Contact him at [email protected]