State Of The Art/Peer review and review metrics

From LiquidPubWiki

Task 3.1. State of the art on analysis of reviews and modelling of reviewers’ behaviour in peer review. M1-M6, UNIFR, UNITN. This task reviews two lines of research: one comprises efforts that analyse review processes in various scientific fields, while the other looks at existing approaches to modelling reviewers’ behaviour. Besides peer reviews and reviewers, the study also touches upon how reviews are done in other related fields where creative content is produced, such as evaluation of software artefacts, pictures, or movies. The task also identifies metrics for ‘good’ reviews and review processes, and how to measure them.

Introduction

Peer review is at once one of the most entrenched and one of the most controversial aspects of research assessment. Virtually every active researcher has experienced their papers or research proposals being blocked by reviews that seemed quite overtly malicious and perhaps even mendacious. At the same time most of us have also at one time or another gained great benefit from a referee who helped correct unnoticed (and sometimes serious) errors, suggested ways to clarify or improve our results and the description of them, or brought to our attention other related work that we found of great interest. We live in fear of the first kind of reviewer and in hope of the second, yet it seems rare that there is any consistency in the service we receive.

Stated opinions about peer review range from it being ‘crude and understudied, but indispensable’ (Kassirer & Campion, 1994) to ‘a flawed process’ whose effectiveness is a matter of faith rather than evidence (Smith, 2006). Accusations (Benos et al., 2007) include systematic bias on the grounds of gender, status and other issues, inconsistency of results, inhibition of innovation, and sheer ineffectiveness, particularly when faced with fraud. Deliberate abuse is also often suspected (Birnbaum, 2002; Lawrence, 2003), with reviewers using the process to block the work of rivals or scientists they do not like, to take revenge for rejections of their own work, to censor opinions they dislike and sometimes even to steal results or ideas[1].

Overall, perhaps the most striking thing to be found when examining peer review and the various studies performed on it is the sheer lack of agreement on every single aspect, from whether or not certain problems exist to the most basic question of what peer review is for and how it should be done (Jefferson et al., 2002b; Smith, 2006). Meta-studies are often unable to pool the results of individual investigations because processes of review (and investigations of those processes) vary so greatly in design (Jefferson et al., 2002a).

It follows that in some respects this ‘state of the art’ report presents a rather uncomfortable picture: that in the present state, there seems to be very little art indeed. On the other hand, one absolutely consistent theme of the literature on peer review is the question, ‘If not peer review, then what?’ Demonstrably effective alternatives to the current system(s) are therefore highly desirable, and with the new technologies and distribution methods offered by the internet, a variety of new techniques become available. We also examine review and quality-assurance techniques from outside academia, notably from the free and open source software communities (Raymond, 1999) and new community-created reference works such as Wikipedia and Citizendium.

Analysis of peer review

[Different studies of peer review’s effectiveness, analysis techniques and so on]

History and practice

Until the 20th century there was generally little requirement on authors to justify their claims prior to publication, with the burden of proof generally being on opponents rather than proponents of ideas (Nature Editorial, 2003). Benos et al. (2007), citing Kronick (1990), note that the first scientific journal, the Journal des sçavans, considered its role to be simply to report others’ claims and findings rather than guarantee their accuracy. The early-20th-century Annalen der Physik, under Max Planck’s editorial stewardship, generally allowed authors a great deal of leeway after their first publication (Nature Editorial, 2003), and this criterion (‘published here before’) still carries weight in the selection process of many journals.

Nevertheless, peer review of one form or another dates back to at least the 18th century. The Philosophical Transactions of the Royal Society of London—founded in 1665, the same year as the Journal des sçavans—was selective in its choice of manuscripts, but this was an informal process in the hands of the editor (Spier, 2002). The Royal Society of Edinburgh’s Medical Essays and Observations, first published in 1731, was probably the first to introduce peer review as we would recognise it today, with submitted manuscripts being distributed by the editor to appropriate specialists for assessment (Spier, 2002; Benos et al., 2007); the Philosophical Transactions of the London society adopted this system in 1752 (Spier, 2002). Different forms of review were adopted by other journals over the next two centuries, with some following this procedure of reports from recognised outside experts while others employed internal review panels. A few held out for a long time: the Lancet, one of the world’s oldest and most highly-regarded medical journals, did not employ peer review until 1976 (Benos et al., 2007).

The present-day ubiquity of peer review reflects this chequered past, with different journals[2] employing quite different practices of selecting and evaluating submitted articles. Broadly speaking, these tend to involve a mixture of editorial and reviewer-based selection. Journal editors are responsible for the primary decision of whether or not to send a manuscript out for review (some have a high rate of summary rejection) and, if they do, for choosing the most appropriate independent experts. They also have the final say in whether or not to accept referee recommendations: though rare, it is not unknown for editors to publish against the advice of reviewers[3].

The main task of reviewers—selected researchers with (hopefully) some level of expertise appropriate to the claims and techniques of the submitted manuscript—is usually to ensure the technical correctness and clarity of the work, identifying methodological or empirical flaws and making recommendations for improvement where possible. More controversially, they are frequently asked to make judgements of significance and suitability for the journal—effectively, editorial decisions (Lawrence, 2003). Many journals request recommendations for action[4], typically along the lines of ‘accept’, ‘revise and accept’, ‘revise and resubmit’ or ‘reject’, and these recommendations tend to be the key deciding force behind the final editorial decision whether or not to publish (Cicchetti, 1997). Increasing reliance on these referee judgements may provide one reason why many journals now request or require three different referee reports: the need to avoid a split decision (Lawrence, 2003).

Different journals’ review procedures place different weight on these different aspects of the system, with some requiring only technical correctness[5] and topical suitability, whereas others place great emphasis on quality, innovation and significance. Some journals employ significant editorial selection, with a large proportion of manuscripts being rejected even before peer review. Some others may still editorially rule in favour of authors they know or respect without sending out for review, or at least treat them more kindly than less well known authors (Ingelfinger, 1974). Besides asking for commentary and recommendations, some journals ask reviewers to rate papers according to several criteria (quality, importance, etc.), for example on a 5-star scale. Lastly, journals differ in their policy towards review blindness or openness: most grant anonymity to reviewers, but a few prefer reviewers to openly sign their reports while others attempt to provide a ‘double blind’ procedure where neither authors nor reviewers are aware of each other’s identity. This issue of openness is one of the key foci of debate, ethics and research into peer review (Godlee, 2002).

[Make notes on post-publication review? Or leave until ‘New directions’ section?]

Study and analysis

It is widely recognised that, for a subject of such crucial importance in professional research, peer review is far too poorly understood or studied (Ingelfinger, 1974; Kassirer & Campion, 1994; Rennie, 2002; Smith, 2006). The views of many people might parallel those of the Nature editorial team that, as Churchill said of democracy, it is ‘the worst system ... except for all the others that have been tried’ (Nature Neuroscience Editorial, 2005). The studies that have been carried out are varied, sometimes contradictory and, overall, extremely equivocal about peer review’s effectiveness.

Research on review processes can, broadly speaking, be classified into a few general topics: effects on quality (does it achieve its commonly-stated aims of enhancing the quality of published articles and rejecting flawed or incorrect work?); biases and inconsistencies in its results; effects on innovation (does it inhibit the publication of novel ideas?); and differences between different review techniques and processes (open versus blind, trained versus untrained reviewers, and others). We give an overview of all these areas of investigation.

Quality enhancement and control

Two of the most commonly-stated purposes of peer review are to improve the quality of submitted manuscripts, and to identify error or deception (Smith, 2006). Indeed, editors will often encourage reviewers to offer helpful or thoughtful comments even when it is clear the manuscript will be rejected (Roberts et al., 2004). Although both goals could be questioned[6], most of us would certainly like to read good papers that are technically correct, and assessing peer review’s effectiveness in this respect should therefore be a key focus of research.

Goodman et al. (1994) carried out a study on research articles accepted for publication in the Annals of Internal Medicine between March 1992 and March 1993, using a 34-item assessment instrument to measure quality[7]. Manuscripts were assessed before and after revision in response to peer review and the editorial process. General conclusions were that such revision resulted in modest improvement of the work, notably in terms of description and interpretation of the results (limitations and generalisability of the study and the tone of conclusions) and in reporting of confidence intervals for statistics. Poor manuscripts were improved to a greater degree than those which were already good when first submitted. However, the reliability of the assessment instrument was low and the results may be skewed by the fact that only manuscripts that were eventually accepted were studied[8].

A brief study by Purcell et al. (1998) proposed a more general taxonomy for judging changes to manuscripts, with broad categories of flaws including ‘too much information’, ‘too little information’, ‘inaccurate information’, ‘misplaced information’ and ‘structural problems’. (Each, except the last, had two or three sub-categories.) Typical changes as a result of review were either the addition of missing information or the removal of extraneous material.

The broad implications of these studies are probably that peer review can provide assistance in matters of clarity, description and interpretation—ensuring that studies are well-described and that adequate information is provided for readers to interpret and understand them—as well as ensuring the reasonableness of inferred conclusions. In particular these improvements rely on the specificity of reviewer commentary rather than global assessments (Goodman et al., 1994). On the other hand the function of peer review as a mechanism of error detection is much more equivocal. Godlee et al. (1998) and Schroter et al. (2004) carried out different studies where deliberate errors (the majority major, some minor) were introduced into manuscripts already accepted by the BMJ. Only a small proportion were identified by reviewers; Godlee et al. (1998) report that 16% of reviewers failed to find any of the mistakes introduced, and 33% recommended acceptance despite the introduced weaknesses, while the mean number of major errors detected was 2 out of a total of 8. Schroter et al. (2004) demonstrated that training could improve performance, but the overall rate of error detection remained low[9], with a mean 2 out of 9 major errors for untrained reviewers compared to 3 for those receiving training. Callaham et al. (1998) reported a similarly low rate of error detection among 124 reviewers of the Annals of Emergency Medicine, who identified a mean 3.4 out of 10 major flaws in a fictitious manuscript[10].
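To make the figures above easier to compare, the mean numbers of major errors detected can be expressed as detection rates. The following is a minimal sketch using only the counts quoted above; the function name detection_rate is our own, and we assume that the trained reviewers in Schroter et al. (2004) were assessed against the same 9 major errors as the untrained ones.

```python
def detection_rate(mean_detected, total_errors):
    """Fraction of the introduced major errors that a reviewer found on average."""
    if total_errors <= 0:
        raise ValueError("total_errors must be positive")
    return mean_detected / total_errors


# (study label, mean major errors detected, major errors introduced),
# taken from the figures quoted in the text.
studies = [
    ("Godlee et al. (1998)", 2, 8),
    ("Schroter et al. (2004), untrained", 2, 9),
    ("Schroter et al. (2004), trained", 3, 9),  # assumes the same 9 errors
    ("Callaham et al. (1998)", 3.4, 10),
]

for label, detected, total in studies:
    print(f"{label}: {detection_rate(detected, total):.0%}")
```

On these figures no group of reviewers found much more than a third of the major errors on average.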

One interesting question that none of these studies reports on is whether reviewers collectively identified all the flaws in the various manuscripts. The answer could have strong implications for the potential effectiveness of an open community review process. For example, deliberate fraud generally proves difficult to identify in the conventional peer-review process, being more often uncovered post-publication by the wider scientific community (Lerner, 2002; Giles, 2006; Marris, 2006; Franzen et al., 2007).


Bias and inconsistency

[Observed biases, studies on whether reviewers can agree & effects on publication prospects]

‘The ideal reviewer,’ notes Ingelfinger (1974), ‘should be totally objective, in other words, supernatural.’ Scientists operate with limited time and knowledge and with (often strong) personal preferences about what good work consists of. They also work within institutional frameworks of hierarchy and status, and are part of a wider society which itself is far from free of prejudice. To what extent is peer review affected by these factors?

Early studies already reveal a number of interesting phenomena. Zuckerman and Merton (1971), examining the archives of the Physical Review, found distinct differences in the treatment received by higher- and lower-status authors. Dividing the studied authors into three tiers[11], they noted that authors of the ‘first rank’ typically received much faster responses[12], a phenomenon possibly related to bias in the degree of editorial selection: only 13% of papers by first-rank authors were sent to outside referees, compared to 27% by those of the second rank and 42% for the rest. Acceptance rates for the three groups were, respectively, 90%, 86% and 73%. On the other hand, Zuckerman and Merton’s study suggests that relative rank does not affect referee decisions (that is, referees return similar rates of acceptance whether they are ‘outranked’ by, outrank, or are of similar status to the author whose work they are judging), and they note an inverse correlation between age and probability of acceptance, which is stronger for lower-status authors. Thus, their broad conclusion is that while prejudices surely play some part in the system, the different treatment accorded authors primarily reflects differences in the quality of work.

Zuckerman and Merton themselves found little disagreement between referees, with only a small percentage clashing over the fundamental decision to accept or reject: two-thirds of the differences of opinion related to proposed revisions. On the other hand, they cite earlier studies by Orr and Kassab (1965) on biomedical journals, and Smigel and Ross (1970) on sociology, showing strong disagreement between referees occurring respectively some 25% and 28.5% of the time, compared to figures of 38% and 46% that would have been expected if decisions were made by chance.
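One way to read these figures is as a chance-corrected level of agreement, in the manner of Cohen's kappa. The sketch below is our own illustration, not a calculation reported in the cited studies: it assumes that the quoted percentages can be treated as observed and chance-expected disagreement rates measured on the same scale, and the function name is hypothetical.

```python
def chance_corrected_agreement(observed_disagreement, chance_disagreement):
    """Kappa-style statistic computed from disagreement rates.

    With observed agreement p_o = 1 - d_o and chance agreement p_e = 1 - d_e,
    kappa = (p_o - p_e) / (1 - p_e) simplifies to 1 - d_o / d_e.
    """
    if not 0 < chance_disagreement <= 1:
        raise ValueError("chance_disagreement must lie in (0, 1]")
    return 1 - observed_disagreement / chance_disagreement


# Figures quoted above: observed vs. chance-expected strong disagreement.
biomedical = chance_corrected_agreement(0.25, 0.38)   # Orr and Kassab (1965)
sociology = chance_corrected_agreement(0.285, 0.46)   # Smigel and Ross (1970)

print(round(biomedical, 2))  # 0.34
print(round(sociology, 2))   # 0.38
```

Under these assumptions both values indicate agreement that is better than chance but far from complete.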

These studies date, of course, from a time when peer review was employed much less frequently than today, when journals received far fewer submissions, and when science and scientists were arguably much less specialised and diverse than they are now: we cannot necessarily assume that their results still hold. However, they identify several key issues which carry over to analysis of the present day literature. First is the importance of distinguishing between bias that is the result of prejudice and bias that in fact results from underlying quality differences. Second is the degree of difference that can be observed, in practice and results, of the peer review process as employed by different disciplines and different journals. Indeed, even within a given field, studies of peer review may give contradictory results (Jefferson et al., 2002a). There may also be differences depending on what is being assessed, for example, research articles or funding applications (Benos et al., 2007).

Status or institutional bias has been investigated more recently by several authors. Benos et al. (2007) cite a study by Ceci and Peters (1982) suggesting that researchers from prominent institutions are favoured in peer review, and another by Garfunkel et al. (1994) on submissions to the Journal of Pediatrics which suggested bias in the acceptance of brief reports but not regular research articles. A study by Link (1998) indicates bias in favour of US-based researchers, strong where the referees themselves are US-based, weaker (but still present) when the referees are not themselves based in the US; however, the work does not take account of possible quality differences. Ray (2002), noting the results of Link, also refers to a study by Nylenna et al. (1994) where Scandinavian referees were sent versions of a short manuscript either in English or their own language: the latter was far more frequently rejected. Anecdotally, Ginsparg (1994) notes that researchers from developing countries have credited the electronic arXiv preprint server with improving the consideration given to their research, feeling that previously they had suffered bias due to the low print or paper quality of their hard-copy preprints.

Gender bias is another frequent concern in peer review. Motivated by the disproportionate numbers of women leaving academic careers, Wennerås and Wold (1997) investigated the peer review process of the Swedish Medical Research Council, one of the country’s major funding organisations, for awarding postdoctoral fellowships. Their results revealed strong bias against female researchers, along with favouritism for researchers who were known to members of the MRC committee[13]. The two biases were of roughly equal magnitude, so that a female researcher known to a committee member might be judged at roughly the same level as a male researcher without such connection; but a female researcher not personally affiliated with any committee member would have to have a significantly greater body of high-impact work than a male colleague to gain an equal assessment.

Gender bias in the assessment of research articles appears more subtle in nature. A study by Lloyd (1990) suggested that in fact, while male reviewers did not discriminate on grounds of gender, female reviewers strongly favoured female authors and perhaps were biased against male authors. However, this may be a result of the study being carried out in a female-dominated field, and Lloyd notes that the bias may be influenced by perceptions of authors violating sex-role stereotypes rather than (or as well as) gender per se. The apparent biases also become less significant if one considers only outright recommendations of rejection (as opposed to ‘revise and resubmit’) or acceptance (as opposed to ‘accept pending revisions’), perhaps indicating that only initial review stages, rather than final acceptance rates, are subject to bias[14]. This latter contention is supported by the research of Gilbert et al. (1994) examining back issues of JAMA, which shows different behaviours by male and female reviewers and editors but no final biases in terms of article acceptance. (The different behaviours themselves could conceivably be due to a more subtle form of bias, namely in the kind of articles assigned to male or female editors.)


[Birnbaum (2002), Wenneras & ... (1997), bias about gender, status, institution/nationality, statistical significance, topic, ...]

Inhibition of innovation

[Does peer review inhibit innovation? Note Armstrong’s (1997) proposals for improving situation...]

Open versus blind review

[Studies & practice: e.g. Many journals allow confidential comments to editors (Roberts et al., 2004)]

The typical method of peer review employed by journals involves anonymous reviewers being asked to assess known authors. This practice has been called into question on a variety of grounds: lack of accountability, biases of various kinds, hidden conflicts of interest, and other forms of abuse or quality issues — as well as the basic ethical issue of whether it is fair for one party to the exchange to enjoy anonymity while the other is known. Depending on the particular concern, it may appear better either to disguise author identity — ‘double-blind’ review — or to have an open review process where both authors and reviewers are known to each other. There also exist subtleties such as whether authors or reviewers should be allowed to make confidential comments to the editors, whether reviewer identity should be revealed during or only after the end of the review process, and so on.

Studies have focused on several aspects of these issues. To begin with, there is the question of whether these alternative techniques are in fact feasible. Blinding reviewers to author identity[15] appears to be particularly difficult: McNutt et al. (1990), in a trial carried out at the Journal of General Internal Medicine, discovered that in 27% of cases ‘blinded’ reviewers were able to identify the authors, while for Godlee et al. (1998), studying articles submitted to the BMJ, the figure was 23% (of 90 reviewers). In a more extensive study with 309 blinded reviewers, blinding was unsuccessful in 42% of cases (van Rooyen et al., 1998). Justice et al. (1998), whose study covered several different journals, reported widely divergent rates of blinding success, but this might be due to the relatively small numbers of articles from each individual journal.

Cho et al. (1998), in a follow-up study to Justice et al., examined reasons for unmasking, specifically concentrating on reviewer characteristics and aspects of journal policy. The only factor reliably predicting ability to identify authors was reviewers’ individual research experience, including the number of years of review experience, number of articles published in recent years and percentage of time devoted to research. Katz et al. (2002) examined articles themselves for features that could indicate author identity, finding that out of 880 manuscripts submitted to two radiology journals, some 300 contained information that could potentially indicate to reviewers the identity of either authors, their institutions, or both. Editors of the journals, presented with the anonymised manuscripts, correctly identified the authors or institutions of 74% of the 300 potentially-unblindable manuscripts (corresponding to 25% of the total number of articles). The giveaway traits included, in decreasing order of frequency, authors’ initials stated in the manuscript body (106 occurrences in the 300 articles), references to the authors’ work in press (66), references identified as the authors’ previous work (57), institutional identity present in one or more figures (54), institutional identity stated in the manuscript body (47), authors’ names stated in the manuscript body (7), and authors’ identity being revealed by either previously-published figures (7) or acknowledgements (4). At least some of these factors were violations (whether deliberate or accidental) of the journals’ explicit instructions for manuscript submission, suggesting that journal policy is difficult to enforce in practice.
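The proportions reported by Katz et al. (2002) can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are our own. It also shows that the listed giveaway traits sum to more than 300, which suggests (an inference on our part) that some manuscripts contained more than one such trait.

```python
total_manuscripts = 880
potentially_unblindable = 300
identified_fraction = 0.74  # share of the 300 correctly identified by editors

# Number of manuscripts whose authors or institutions were identified.
identified = round(identified_fraction * potentially_unblindable)
share_of_total = identified / total_manuscripts

# Occurrences of each giveaway trait among the 300 manuscripts.
trait_counts = {
    "authors' initials in body": 106,
    "references to work in press": 66,
    "references to previous work": 57,
    "institution in figures": 54,
    "institution in body": 47,
    "authors' names in body": 7,
    "previously published figures": 7,
    "acknowledgements": 4,
}
total_traits = sum(trait_counts.values())

print(identified)               # 222
print(f"{share_of_total:.0%}")  # 25%
print(total_traits)             # 348
```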

Open review, on the other hand, is easy (in principle) to implement, but carries with it fears of potential biases (e.g. towards high-status individuals or institutions) and the risk of increasing conflict or antagonism between authors and reviewers, and perhaps of decreasing reviewers’ willingness to be properly critical (Godlee, 2002). In practical terms, the main obstacle appears to be the unwillingness of some reviewers to participate. Referees in the study by McNutt et al. (1990) were asked, but not required, to sign their names: only 43% complied. Van Rooyen et al. (1999b) note a difference (23% compared to 35%) in the proportion of participants declining to review, depending on whether anonymity was offered or not: of the latter, several gave as their explicit reason a personal opposition to open peer review. Walsh et al. (2000) found that only 76% of referees approached were willing to sign their reviews.

Practical issues aside, the principal question all studies have addressed is that of quality — whether, and how, open or double-blind review procedures affect the review process. McNutt et al. (1990) reported an improvement (in the opinion of editors) as a result of blinding referees to author identity, and no quality differences (again, according to editors’ opinions) between reviews that were signed or unsigned by referees. Justice et al. (1998) reported no quality difference (as perceived by editors and authors) between the reports of blinded or unblinded reviewers, but noted that this could be an effect of the lack of blinding success. The extensive study by van Rooyen et al. (1998), using the validated Review Quality Instrument (van Rooyen et al., 1999a) as a measure, reported no significant differences whether or not reviewers were blinded to author identity, and whether or not reviewer identity was hidden from authors[16]; nor were there apparent differences in the recommendations made. The same authors’ subsequent study of open peer review (van Rooyen et al., 1999b), using the RQI, again found no statistically significant differences in quality or final recommendation.

Walsh et al. (2000) conducted an interesting study which compared the quality and other factors of both signed and unsigned reviews of referees who had agreed to take part in an open-review trial (that is, agreed in principle to sign their reviews, while being asked to do so on a random basis) to those of referees who had refused to participate. Using the RQI as a measure, signed reviews were slightly higher in quality than unsigned ones, and considerably better than those given by referees who had refused to participate in the open peer review trial. Reviewers signing their reports were significantly less likely (18% compared to 33%) to recommend rejection. In contrast to van Rooyen et al. (1999b), it was found that reviewers signing their reports put more time into their review. Signed reviews were also significantly more courteous and less abusive in nature, although the majority of all reports were polite.

Research performance metrics and quality assessment

[Journal Impact Factor, h-index, citation analysis, etc. etc. etc.]

With thousands of scientists working across a huge range of ever-more-diverse disciplines, analytical methods and metrics to identify productive or important researchers are highly desirable. Detailed individual assessment being impossible on such a scale, funding agencies, research assessment panels and so on must of necessity rely either on personal contacts—carrying a strong risk of bias, nepotism and other negative influences—or attempt to identify proxy indicators of quality which allow quick and simple comparison.

On the other hand, such measurement and assessment techniques have come in for considerable and sustained criticism (Lawrence, 2003, 2007, 2008; Underwood, 2004; Campbell, 2008; Cheung, 2008; Todd & Ladle, 2008), primarily on the grounds that they are responsible for some very undesirable and destructive practices within professional science. Where a metric exists, scientists may be encouraged to follow research and publication practices that maximise their scores rather than providing the best service to the scientific community. Examples include an obsession with publishing in journals with a high impact factor, the division of research findings into ‘least publishable units’ so as to maximise the number of articles produced from a given project, and a whole range of political shenanigans ranging from citation swapping to abuse of the referee process.

It follows that when considering a potential research assessment metric, several questions must be asked. To begin with, does the metric reliably correspond to scientists’ actual perception of quality (Harnad, 2008)[17]? Just as peer review may in some cases block rather than support the most interesting or innovative papers, so too metrics may overlook important research or researchers and promote uninteresting work. Second, is the metric open to abuse or manipulation? If so, then it is likely to encourage cheating and (often unethical) professional practices which damage and distort the scientific literature, favouring those who are willing to ‘play the game’ over those whose primary focus is doing good science (Lawrence, 2003, 2007, 2008; Franzen et al., 2007; Cheung, 2008; Todd & Ladle, 2008). Other factors include whether the metric is self-distorting—that is, whether having a high score makes it easier to gain one still higher—and whether it carries implicit (or explicit) bias towards certain sections of the research community.

Citation analysis

Two of the simplest and longest-standing metrics of scientific performance have been numbers of papers produced and the numbers of citations received. The reasons are relatively clear: good scientists are likely to produce high volumes of work, and their work is more likely to be useful to (and therefore referenced by) their fellows. Even without any actual measurement or statistics, it is clear that a scientist not producing papers is professionally ‘dead’—as is an article that is no longer being cited. From these raw numbers, a great variety of further measures can be obtained, and the use and interpretation of such measures is a question of considerable subtlety (see e.g. Borgman & Furner, 2002; Weingart, 2005; van Raan, 2005, 2006; Lehmann et al., 2006; Harnad, 2008). For the purposes of this section, we will focus in particular on two that have gained great attention and popularity: the Journal Impact Factor (Garfield & Sher, 1963; Garfield, 1972, 2006) and the h-index (Hirsch, 2005, 2007).

General characteristics of the citation network

For example, the work of Price (1965) not only confirmed the skewed, power-law distribution of citations first observed by Lotka (1926), but also shed light on the detailed history of citation of individual papers: the decline in citation with age (which also reflects the exponential growth in the literature with time), and the existence of ‘classic’ papers that buck this trend.

The Journal Impact Factor

Large-scale scientific citation analysis began in the 1950s (Garfield, 1955) with the introduction of what would become the Science Citation Index, now operated by Thomson Scientific. Initially the motivation appears to have been less focused on scientometrics than on providing a means for scientists to be aware of citations made to any given paper, and so facilitate the process of discovering what criticisms or extensions have been made to a particular piece of work. However, the construction of such an index, with extensive records of the citation network of papers, provided fertile ground for more quantitative analyses to assess the quality of journals or articles. The best known, and arguably the most controversial, of these is the Journal Impact Factor (JIF), first introduced in 1963 (Garfield & Sher, 1963; Garfield, 1972, 2006).

The initial motivation for the development of the JIF was simply to select which journals should be included in the Science Citation Index, and in particular, to develop a metric that would not simply favour journals with a high publication count (Garfield, 2006). The idea is simple: the impact factor is given by the total number of citations received this year by research articles (including reviews) published in the previous 2 years, divided by the number of such articles published in this time—that is, the mean citation rate per article within a given time window:

GF_{y} = \frac{C_{y}}{P_{y-1} + P_{y-2}}

(following the notation of Vinkler, 2004).
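As a minimal worked sketch of the formula above, the following Python function computes the two-year impact factor from three counts. The function name, argument names and the example figures are hypothetical and chosen purely for illustration; they do not describe any real journal.

```python
def impact_factor(citations_this_year, papers_prev_year, papers_two_years_ago):
    """Mean number of citations received in year y per article
    published in years y-1 and y-2 (C_y / (P_{y-1} + P_{y-2}))."""
    published = papers_prev_year + papers_two_years_ago
    if published == 0:
        raise ValueError("no articles published in the two-year window")
    return citations_this_year / published


# Hypothetical journal: 150 + 100 articles in the window, cited 500 times this year.
print(impact_factor(500, 150, 100))  # 2.0
```

Note that the result is a mean, so a handful of very highly cited articles can dominate it.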

From these simple beginnings, the JIF has over time become one of the major tools of research assessment, with individual scientists frequently being assessed on the basis of the impact factor of the journals where they publish rather than the content or actual impact of their work itself (Lawrence, 2003, 2007; Weingart, 2005). This particular use of the JIF has come in for considerable criticism, including from journals that themselves have a high JIF. To begin with, the impact factor is not representative of individual article impact: journals with a high JIF typically owe it to a tiny minority of very highly-cited papers (Seglen, 1997; Nature Neuroscience Editorial, 1998, 2003; Campbell, 2008). Secondly, the JIF varies considerably across fields, and the predominant factor in determining its value appears to be simply the average number of citations in reference lists (Althouse et al., 2008). Thirdly, the long-term impact of papers may take decades to become apparent: Stringer et al. (2008) note that the time scale can, for papers published in some journals, be as long as 26 years, and suggest an alternative ranking measure which takes into account each journal’s individual transient period.

The h-index

An alternative quality measure has been proposed by Hirsch (2005), this time focusing on the assessment of the impact of individual scientists. The h-index is, like the JIF, simple in design: in Hirsch’s own words, ‘a scientist has index h if h of his Np papers have at least h citations each and the other (Np − h) papers have ≤ h citations each.’ The index thus measures the broad impact of a scientist’s work, rather than just productivity (a scientist can easily produce many boring papers) or raw citation numbers (in the index’s terms, a scientist must be consistently highly cited across their output), and generally correlates well with peer rankings (van Raan, 2006).
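The definition quoted above translates directly into a short computation. The sketch below, in Python, sorts the citation counts in decreasing order and finds the largest rank h such that the paper at that rank has at least h citations. The function name and the sample citation lists are hypothetical.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here on
    return h


print(h_index([10, 8, 5, 4, 3]))     # 4
print(h_index([25, 8, 5, 3, 3, 1]))  # 3
print(h_index([100]))                # 1
```

The last example shows that a single very highly cited paper still gives h = 1, since the index can never exceed the number of papers.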

The h-index clearly has some limits. For example, one cannot have a higher h-index than one has total publications, and its value must be taken in context of the amount of time one has spent in research as well as other factors: Hirsch (2005) notes that, ‘although I argue that a high h is a reliable indicator of high accomplishment, the converse is not necessarily always true.’ Hirsch (2007) has also suggested that, rather than a measure of quality for past work, the h-index might rather be seen as an indicator of future performance. Lehmann et al. (2006) claim that in fact the mean number of citations per paper is superior in this respect, though Hirsch (2007) obtained different results. The h-index is also potentially open to manipulation by groups of scientists consistently cross-citing each other’s papers (van Raan, 2006), and clearly varies according to discipline (Hirsch, 2005), with, for example, scientists in the biomedical sciences having significantly higher values than those in the physics community. Thus, working practices with respect to output, authorship and citation may have a strong influence on h, and in particular this may imply not just discipline-specific variation in h values but also strong gender discrimination (Symonds et al., 2006).

New and alternative directions in review and quality promotion

An interesting point raised by Spier (2002) in his history of the peer review process is that in many cases its adoption was technology-driven[18]. Only in the 1890s, with the introduction of the typewriter and carbon paper, did it become easy to make multiple copies of a manuscript; the almost universal uptake of the process in the second half of the 20th century can probably be linked to the introduction, in 1958, of the Xerox photocopier. Email, the internet and electronic documents have since facilitated the process still further. Yet this last technological revolution has opened up entirely new possibilities: to not just make traditional peer review faster and easier, but to dramatically change the way in which research is disseminated and evaluated (Harnad, 1990; Odlyzko, 1995). Among the novel developments are the rise of electronic preprint (‘e-print’) servers, the open access publication movement, the possibility of community review and commentary, and collaborative creation along the lines of Wikimedia and the free/open-source software community.

[a little more about what we’ll do in this section?]

[Pr]eprints, open access and the review process

In terms of research dissemination, some fields have been making use of alternatives to journal publication for a long time. Paul Ginsparg, creator of the arXiv online preprint service, points out (Ginsparg, 1994) that his system is simply an electronic continuation of a high-energy physics tradition dating back to the 1970s, when it became standard for research groups to post printed copies of their latest research articles to large mailing lists at the same time as they were submitted to journals[19]. The community would therefore receive the latest results months in advance of their refereed publication. Ginsparg notes that the community had ‘learned to determine from the title and abstract (and occasionally the authors) whether we wish to read a paper, and to verify necessary results rather than rely on the alleged verification of overworked or otherwise careless referees.’

Electronic preprint (or ‘e-print’) servers such as arXiv have changed the situation in a number of ways, not only greatly speeding and facilitating dissemination but also permitting long-term archival of documents, while drastically cutting costs compared to hard-copy delivery and storage (Ginsparg, 1994; Odlyzko, 1995; Jackson, 2002). The impact on a number of fields, notably physics and maths[20], has been dramatic, and despite fears about the lack of quality assurance[21], the general standard seems comparable to that of the refereed journal literature (Jackson, 2002) and may even be slightly higher due to authorial self-selection (Kurtz et al., 2005; Davis & Fromerth, 2007). The latter may provide part of the reason why papers posted on the arXiv are on average cited more highly (and faster) than those not, although the phenomenon is probably due to a mixture of reasons, notably early[22] (probably more important than open) access (Davis & Fromerth, 2007). Since a paper’s presence on arXiv also leads to reduced downloads of the corresponding article from publisher websites, a further explanation may be that arXiv provides ease of access: a single portal to literature that in officially-published form is broken up across many different archives.

As several authors have noted (Ginsparg, 1994; Harnad, 1990, 2001; Odlyzko, 1995; Campbell, 2008), the possibility of self-archiving electronic manuscripts allows for some significant changes in the function of peer review. In the traditional world of print publishing, limits on storage capacity have meant that the primary function of the review process has been to assist editors with the problem of deciding, from an excess of submissions, what work deserves to be distributed (Ingelfinger, 1974; Spier, 2002; Schuhmann, 2008). With e-prints, the storage and distribution problem is solved and the function of peer review becomes a social one: individual researchers can decide for themselves whether to trust a paper, and the practice of peer review becomes an option which can be employed with multiple different purposes: giving a mark of professional approval (perhaps required by funding or assessment agencies), adding commentary, or giving a quality mark which can go up or down; it could also be used to hierarchically select for attention or priority, much as the journal system does now[23]. Notably, the process of review no longer has to stop with the publication of an article (MacCallum, 2006; Dayton, 2006; Koomin et al., 2006a,b), but can be extended into a long-term post-publication process of discourse and continuous assessment.

These alternative possibilities have led to some innovative publishing and review practices in the Open Access community. Beyond the immediate shift in the economics of publishing—charging authors (or their institutions or funding agencies) for the once-off costs of the editorial and review process, and making articles available to all without restriction—several have chosen to take further advantage of electronic distribution to develop novel ways of assessing and selecting research.

PLoS ONE, for example, employs conventional peer review but does so only to assess the technical aspects of a piece of work, and not its significance, novelty or subject area (MacCallum, 2006). Published papers are then open to ongoing comment and rating by readers of the journal website. Among the reasons given for this practice are a desire to avoid the ‘also-ran’ phenomenon (where a high-quality article is rejected from high-profile publication because it has been ‘scooped’ by an existing paper), the need to foster interdisciplinary links rather than splitting the literature into ever-smaller topical specialities, and a simple recognition that ‘importance’ or significance often only becomes clear some time after a paper’s publication. Where editorial selection is desirable, this can be provided by specialist access portals, which have the potential to be considerably more flexible and diverse than the relatively fixed selection criteria of many journals, able to serve both long-term and transitory areas of research interest.

An even more liberal review scheme is provided by Biology Direct, an open access journal which is pioneering a novel form of open review (Koomin et al., 2006a,b). Authors select their own reviewers from the editorial board (although board members may ask an external expert to provide a review on their behalf), and instead of the typical journal requirement of positive referee reports, Biology Direct’s only acceptance criterion[24] is that three members of the editorial board are interested enough by the paper to provide or solicit reviews. It is the author’s choice whether or not to revise or withdraw the paper in response to critical comments, or to challenge referees’ claims, and when a paper is published, the complete author-referee correspondence is published along with it. Thus, Biology Direct provides a publication scheme which reflects in many ways the more liberal and open discourse associated with scientific meetings: authors are provided with greater leeway in terms of the ideas they can share, but do so in the knowledge that they will be accompanied by critical commentary and discussion.

Community review and collaborative creation

The attempts by PLoS ONE, Biology Direct and others to provide an alternative to conventional peer-review methods also reflect more fundamental changes that can be made to the way scientists share ideas and information. As Dayton (2006) notes, while open access is important for all sorts of reasons, it still perpetuates significant inequality in scientific research: much key scientific knowledge and debate happens not on the pages of published articles but at scientific meetings or behind closed doors. Open discourse is an essential part of scientific advance, and whereas in the print publishing world the ‘right to reply’ tends to be in the hands of the few authors prestigious enough to get their commentary or letters published, electronic publishing makes possible continuous comment and debate on published material.

Such discourse-based, collaborative development processes have become the bread and butter of a number of non-scientific communities, with striking results. Most of these go far beyond commentary, feedback and ratings to allow full-scale community involvement in the creative process. Such practices are sustained as a result of several complementary factors including ethical philosophy, legal devices (in particular licensing) developed to support those ethics, and a variety of technical tools that assist the collaborative process. Two such communities are worth examining in detail: the wide variety of projects coming under the umbrella of the free and open source (FOSS) software movements, and the two major collaborative encyclopaedia projects, Wikipedia and Citizendium.

Free and open source software and distributed development

The Free Software movement was founded by Richard Stallman in the early 1980s as a reaction to the increasingly proprietary nature of software development—in particular, the practice of placing limits on the ways customers could use software[25]. Stallman’s response was to begin a project to create an operating system—the GNU system—which would grant the user exactly the desired freedom of use, but whose license would also constrain users to preserve those freedoms for others (Stallman, 2002).

This licensing concept, later dubbed ‘copyleft’, helped provide a legal framework via which diverse programmers, with diverse motivations and interests, could share and collaborate on software code. The power of these development practices was later highlighted by the Open Source movement for their strong practical benefits (Raymond, 2000): ‘many eyeballs’ to identify and fix bugs, and to propose and implement feature extensions.

The inevitable problem faced in such a collaborative environment is how to coordinate the diverse efforts of contributors. Even for single-developer projects it is important to be able to track changes to the code; where many developers are involved this becomes essential, with contributors needing to be able to keep track of the work of their peers, to review alterations to the source code and compare such changes to solved (or created) bugs in the program’s performance.

The practices of different free/open source development communities vary considerably[26], but broadly speaking can be divided into two general development models, which reflect to a high degree the choice of tools to track the development of the software code. The most longstanding practice—reflected in version control systems such as CVS and Subversion—is of centralised development, where code revision history is stored on a single server to which a limited number of people have commit access[27]. Thus, all proposed changes to the code must be filtered and approved by at least one of these privileged individuals.

At the other extreme is the distributed model of development (Shuttleworth, 2005; Torvalds, 2007), where developers operate essentially by peer-to-peer comparison and exchange of code. Pioneered particularly by the Linux kernel development team, this practice has become increasingly widespread as powerful distributed revision control (DRCS) tools (notably Bazaar, Git and Mercurial) have become available over the last few years. In contrast to centralised systems, these tools allow developers to create independent branches (copies) of the revision history to which they can privately add: these changes can then be made available to others to merge (that is, incorporate into their own copies) or ignore as they see fit.

In practice, most projects operate somewhere in the middle between these two extremes, and the particular advantage of DRCS is that it has made the precise nature of the development model (and thus the quality review system) a social choice rather than a technical requirement (Torvalds, 2007). Centralised development has the advantage of allowing control over a project and its contents, but carries disadvantages of scale: as the number of potential contributors grows, so does the load on those with commit access, making it necessary to either restrict the development community size or widen commit access to the point where quality control may be weakened. DRCS, on the other hand, makes it possible for developers to operate highly independently, each having their own circle of well-regarded collaborators whose quality of work they have learned to trust, while saving their scrutiny for those whose work they do not know or have faith in. Changes to the code can therefore propagate via these ‘circles of trust’, reaching core developers often only after multiple rounds of scrutiny, revision and testing.

Wikipedia and Citizendium

The ethical philosophy behind the Free Software movement has spawned a wide variety of children: the Creative Commons and Free Culture movements, the Scientific Commons, and a great many textbooks published under free documentation licenses. One of the best-known and most successful is the community-created encyclopaedia website, Wikipedia. Using the MediaWiki software for collaborative content creation, anyone may (anonymously, unless they wish otherwise) create or edit articles on any topic. Edits appear immediately in the published article and undergo no formal peer review.

Restrictions on participation are few—a small number of articles (for example, on contentious political topics) are protected to some degree, since otherwise they are too frequently vandalised, and while anyone can edit, only registered users can create new articles. Thus, in virtually all cases the only quality assurance is provided by the scrutiny of the unvetted contributing community. Despite this apparent lack of direction and control, in practice Wikipedia has been remarkably successful in generating a huge compendium of often very reliable information (Giles, 2005).

On the other hand, this same lack of direction and control means that whether or not material is accurate, it cannot be relied on as such (Sanger, 2004), and consistent problems remain with bias, vandalism and lack of expertise. An alternative direction has been taken by the Citizendium project, which requires contributors to use real names and which employs a measure of expert review: while anyone can edit draft versions of articles, final versions require expert editorial approval. The approved version then remains the default presented to the public, but the latest draft is available to view if desired.

Recommender systems and information filtering

[UNIFR's stuff & related material.]

Thanks to the Internet and other computer networks, a large volume of customer opinion is now available for both academic and commercial use. As a result, nowadays we can see which movies have high average user ratings in an online movie database, web bookstores point us to new books according to our shopping history, Google uses our browsing and e-mail history to target advertisements, and so forth (Schafer, 2001). All these real-life examples are based, in one way or another, on recommender systems and information filtering in general.

In information filtering, web search engines (Brin & Page, 1998; Kleinberg, 1999) represent the landmark application of the Internet age. In its basic form, a web search engine provides a quality ranking of web pages and an efficient database that makes it possible to quickly compare the search query entered by a user with the contents of locally stored web pages. An important drawback is that such a search is not personalized: for a given search query, all users receive the same result. On the other hand, the quality rankings of search engines can also be applied successfully in other areas: for example, PageRank computed for the citation network of scientific literature can reveal influential papers (Chen et al., 2007).
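To make the idea of a link-based quality ranking on a citation network concrete, here is a minimal PageRank sketch in Python using plain power iteration. It is an illustration under stated assumptions, not the procedure of any of the cited works: the toy citation graph, the damping value of 0.85, the fixed iteration count, and the even redistribution of rank from papers that cite nothing are all choices made here for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank. `links` maps each paper to the papers it cites."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A paper passes its rank on in equal shares to the papers it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a paper with no references spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: B cites A; C cites A and B; D cites A and C.
citations = {"B": ["A"], "C": ["A", "B"], "D": ["A", "C"]}
scores = pagerank(citations)
print(sorted(scores, key=scores.get, reverse=True))  # ['A', 'B', 'C', 'D']
```

Unlike a raw citation count, the score of a paper here depends on the scores of the papers citing it, which is what distinguishes this kind of ranking from simple popularity.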

When a record of past users' activities is available, personalized recommendation is likely to produce better results than "recommendation for general audience", and this is the very aim of recommender systems. A recommender system is a specific type of information filtering that uses a limited number of user assessments of certain objects (books, movies, restaurants, etc.) to find which objects are likely to be appreciated by a given user. Apart from recommendation performance, the issues that need to be taken into account in a recommender system include data sparsity, the large size of the data, noisy ratings, and spamming (Herlocker et al., 2004; Perugini et al., 2004).

At the heart of each recommender system there is a recommendation method which is used to process the input data. The first recommendation methods were popularity-based. In the case of explicit ratings (when users are asked to evaluate the objects on a given scale) this means that to predict the rating of user i for object α, either the average rating received by object α or the average rating given by user i can be used. While both approaches yield rather imprecise predictions, their low computational cost means that they are widely used in practice. Moreover, the prediction by object-averages can be substantially improved if users' ratings are first aligned with each other by a simple linear transformation which makes the average rating and dispersion of ratings equal for all users (Zhang et al., 2007a). Finally, in the case of implicit ratings (when users either include an object in their personal collections or not, and no ratings are given), the analogue of object-averages is the assessment of an object's popularity by the total number of users who have collected it.
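The popularity-based predictors described above can be sketched in a few lines of Python. The data set and function names below are hypothetical, and the alignment step is one plausible reading of the "simple linear transformation": each user's ratings are shifted and scaled to zero mean and unit dispersion, averaged per object, and then mapped back to the target user's own scale. It is not claimed to reproduce the exact procedure of the cited work.

```python
from statistics import mean, pstdev

# Hypothetical explicit ratings: user -> {object: rating}.
ratings = {
    "ann": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 2, "b": 1},
    "eve": {"b": 4, "c": 5},
}


def object_average(ratings, obj):
    """Average rating received by an object."""
    return mean(r[obj] for r in ratings.values() if obj in r)


def user_average(ratings, user):
    """Average rating given by a user."""
    return mean(ratings[user].values())


def aligned_prediction(ratings, user, obj):
    """Object average computed on per-user standardised ratings,
    mapped back to the target user's own mean and dispersion."""
    stats = {u: (mean(r.values()), pstdev(r.values()) or 1.0)
             for u, r in ratings.items()}
    aligned = [(r[obj] - stats[u][0]) / stats[u][1]
               for u, r in ratings.items() if obj in r and u != user]
    user_mean, user_spread = stats[user]
    if not aligned:
        return user_mean  # nobody else rated the object: fall back to the user average
    return user_mean + user_spread * mean(aligned)


print(object_average(ratings, "c"))             # 4.5
print(user_average(ratings, "bob"))             # 1.5
print(aligned_prediction(ratings, "bob", "c"))  # 1.75
```

In this toy example the plain object average (4.5) ignores the fact that bob rates everything low, while the aligned prediction (1.75) expresses "somewhat above average" on bob's own scale.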

A large number of recommendation methods are based on rating similarities between different users or different objects. That is, when recommending for a given user, one recommends either those objects liked by users who rate similarly to that user (exploiting user similarities) or those objects rated similarly to other objects the user already likes (exploiting object similarities). The latter approach has been used on a large scale by the online shop Amazon.com (Linden, 2003). In mathematical terms, denoting the similarity of users i and j as s_{ij}, the similarity of objects α and β as s_{\alpha\beta}, and the rating given by user j to object α as v_{j\alpha}, a similarity-based prediction of the rating of user i for object α has the form


p_{i\alpha}\sim\sum_{j} s_{ij}v_{j\alpha}\qquad\mbox{(user similarities),}

p_{i\alpha}\sim\sum_{\beta} s_{\alpha\beta}v_{i\beta}\qquad\mbox{(object similarities).}

When the number of users is much larger than the number of objects, the object-based approach is computationally less expensive, and vice versa. While the basic idea is clear, much freedom is left in forming the exact equation used to obtain the rating predictions and, more importantly, in computing the similarities; for various choices see Shardanand & Maes (1995), Blattner et al. (2007) and Takacs et al. (2007). The standard way to decrease the computational complexity of the method and, in some cases, improve its performance, is to consider only the k "nearest neighbors" of a user (or an object) in the computation (Goldberg, 2001); such methods are known under the abbreviation kNN.
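A minimal Python sketch of the user-similarity variant with k nearest neighbours follows. Since the text leaves the similarity measure and the normalisation open, the choices here are assumptions made for illustration: cosine similarity over co-rated objects, and a prediction normalised by the sum of the similarities used. The data and function names are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two users, restricted to objects both have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[o] * b[o] for o in common)
    norm_a = sqrt(sum(a[o] ** 2 for o in common))
    norm_b = sqrt(sum(b[o] ** 2 for o in common))
    return dot / (norm_a * norm_b)


def predict_user_based(ratings, user, obj, k=2):
    """Similarity-weighted average of the ratings given to `obj`
    by the k users most similar to `user`."""
    neighbours = [(cosine_similarity(ratings[user], r), r[obj])
                  for other, r in ratings.items()
                  if other != user and obj in r]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    nearest = neighbours[:k]
    total = sum(s for s, _ in nearest)
    if total == 0:
        return None  # no usable neighbour
    return sum(s * v for s, v in nearest) / total


# Hypothetical ratings: v rates like u, w rates the opposite way.
ratings_knn = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 4, "b": 2, "c": 5},
    "w": {"a": 2, "b": 4, "c": 1},
}
print(round(predict_user_based(ratings_knn, "u", "c", k=2), 2))  # 3.22
print(round(predict_user_based(ratings_knn, "u", "c", k=1), 2))  # 5.0
```

With k = 1 only the most similar user counts, so the prediction equals that user's rating; the object-based variant is obtained by exchanging the roles of users and objects.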

Another large class of recommender systems can be described as machine-learning techniques. It includes latent semantic analysis (Hoffmann, 2004), singular value decomposition (Berry et al., 1995), matrix factorization (Takacs et al., 2007), and so forth. For an overview of machine-learning techniques see Adomavicius & Tuzhilin (2005) and Takacs et al. (2008). In essence, they are all based on a plausible rating model with a vast number of parameters, whose values are estimated by a multivariate optimization that minimizes the prediction error on training data (this is usually referred to as the training procedure).
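As an illustration of what such a training procedure can look like, the following Python sketch fits a small matrix factorization model by stochastic gradient descent: each user and each object receives a short vector of parameters, and the predicted rating is their dot product. All names, the sample data and the hyperparameter values (number of factors, learning rate, regularisation strength, epochs, random seed) are hypothetical choices for this example and are not taken from the cited works.

```python
import random
from math import sqrt


def predict(p, q, user, obj):
    """Model rating: dot product of the user and object parameter vectors."""
    return sum(pu * qo for pu, qo in zip(p[user], q[obj]))


def factorise(ratings, n_factors=2, epochs=500, rate=0.05, reg=0.02, seed=0):
    """Fit user vectors p and object vectors q to (user, object, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    objects = sorted({o for _, o, _ in ratings})
    p = {u: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for u in users}
    q = {o: [rng.uniform(0.1, 1.0) for _ in range(n_factors)] for o in objects}
    for _ in range(epochs):
        for u, o, v in ratings:
            err = v - predict(p, q, u, o)
            for f in range(n_factors):
                pu, qo = p[u][f], q[o][f]
                # Gradient step on the squared error with L2 regularisation.
                p[u][f] += rate * (err * qo - reg * pu)
                q[o][f] += rate * (err * pu - reg * qo)
    return p, q


def rmse(ratings, p, q):
    """Root-mean-square prediction error on the given ratings."""
    return sqrt(sum((v - predict(p, q, u, o)) ** 2 for u, o, v in ratings)
                / len(ratings))


# Hypothetical training data.
sample = [("ann", "a", 5), ("ann", "b", 3), ("bob", "a", 4),
          ("bob", "c", 1), ("eve", "b", 2), ("eve", "c", 5)]
p, q = factorise(sample)
print("training RMSE:", rmse(sample, p, q))
print("predicted rating of ann for c:", predict(p, q, "ann", "c"))
```

The training error measures only how well the parameters reproduce known ratings; in practice the model would be judged on ratings held out from training.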

Finally, there are recommendation methods based on a transformation (projection) of the input data to a weighted object-object network and a diffusion-like process on the network. The idea behind the transformation is that whenever one user collects/rates two objects, there is probably some similarity between the objects and hence a link connecting them is created or reinforced (see Fig. 1 and Fig. 2). The transformation is similar for both implicit and explicit ratings but in the latter case, the loss of information (which is always a side-effect of any projection) can be reduced if, instead of directly linking two objects, the ratings given to these objects are linked. Consequently, a recommendation for a particular user is obtained by propagating the opinions expressed by the user over the given network (Zhang et al., 2007a; Zhang et al., 2007b; Zhou et al., 2007).


Image:FR-links.png
Fig. 1. Implicit ratings: When, from the five available objects, a user has collected three, three links between those objects are created/reinforced.


Image:FR-channels.png
Fig. 2. Explicit ratings: When a user has rated only objects 1 (rating 5), 2 (rating 3) and 3 (rating 4), three channels between the objects are created/reinforced.
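For implicit ratings, the projection of Fig. 1 and a very simple propagation step can be sketched in Python as follows. This is a simplified illustration and not the method of the cited papers: every pair of objects in a user's collection has its link weight raised by one, and a user's recommendation scores are obtained by a single unnormalised propagation step from the objects they already hold. The data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project(collections):
    """Weighted object-object network: one unit of weight per user
    who has collected both objects of a pair."""
    weights = defaultdict(float)
    for objects in collections.values():
        for a, b in combinations(sorted(objects), 2):
            weights[(a, b)] += 1.0
    return weights


def recommend(collections, user):
    """Rank objects the user lacks by the total link weight
    connecting them to objects the user already holds."""
    owned = collections[user]
    scores = defaultdict(float)
    for (a, b), w in project(collections).items():
        if a in owned and b not in owned:
            scores[b] += w
        elif b in owned and a not in owned:
            scores[a] += w
    return sorted(scores, key=lambda o: (-scores[o], o))


# As in Fig. 1: one user holding three objects yields three links.
print(len(project({"u": {1, 2, 3}})))  # 3

# Hypothetical collections over five objects.
collections = {
    "ann": {1, 2, 3},
    "bob": {2, 3, 4},
    "eve": {3, 4, 5},
    "dan": {1, 2},
}
print(recommend(collections, "dan"))  # [3, 4]
```

Object 5 is never recommended to dan because no user holds it together with object 1 or 2; a multi-step or degree-normalised diffusion would spread scores further across the network.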


When a user's perception of an object is determined mainly by the object's quality and the user's tastes play only a minor role, one can use the expressed opinions to deduce the qualities of the objects. For example, the number of users who collected an object or, in the case of explicit ratings, the average rating given to an object, can be considered as crude measures of the object's quality. In practice, many users are good raters but some are misled, and many users are honest but some are cheaters. To account for these influences, one can extend the system by assigning each user a reputation modelled as a real-valued variable. The qualities of all objects and the reputations of all users can then be estimated by an iterative procedure which lowers the reputation of users whose ratings diverge too much from the mainstream and, when computing average ratings, gives high weight to users with high reputation (Laureti et al., 2006; de Kerchove & van Dooren, 2007).
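The iterative procedure described above can be sketched as follows in Python. This is an illustrative variant under stated assumptions, not the exact algorithm of the cited works: a user's reputation is taken to be the inverse of their mean squared deviation from the current quality estimates, with a hypothetical smoothing constant `eps` that keeps any single user from dominating, and a fixed number of iterations is used in place of a convergence test. The rating data are invented for the example.

```python
def iterative_refinement(ratings, iterations=50, eps=0.1):
    """Alternate between reputation-weighted quality estimates and
    reputations derived from each user's deviation from those estimates."""
    users = sorted(ratings)
    objects = sorted({o for r in ratings.values() for o in r})
    reputation = {u: 1.0 for u in users}  # start from a plain average
    quality = {}
    for _ in range(iterations):
        for o in objects:
            raters = [u for u in users if o in ratings[u]]
            total = sum(reputation[u] for u in raters)
            quality[o] = sum(reputation[u] * ratings[u][o] for u in raters) / total
        for u in users:
            deviation = (sum((v - quality[o]) ** 2 for o, v in ratings[u].items())
                         / len(ratings[u]))
            # Assumption: reputation falls as the squared deviation grows.
            reputation[u] = 1.0 / (deviation + eps)
    return quality, reputation


# Hypothetical data: three broadly consistent raters and one outlier.
ratings_rep = {
    "h1": {"a": 4, "b": 2, "c": 5},
    "h2": {"a": 5, "b": 2, "c": 4},
    "h3": {"a": 4, "b": 3, "c": 5},
    "spam": {"a": 1, "b": 5, "c": 1},
}
quality, reputation = iterative_refinement(ratings_rep)
print({o: round(v, 2) for o, v in quality.items()})
print(min(reputation, key=reputation.get))  # spam
```

Compared with the plain averages (3.5 for object a and 3.75 for object c), the weighted estimates move towards the values given by the consistent raters, because the outlier's ratings carry little weight.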

Modelling reviews and reviewers

[Self-explanatory :-) Modelling of reputation, we should explore more some of the possibilities here.]

Outlook

[Where we can take all of this.]

The present-day ubiquity of peer review as the gateway to publication has arguably been driven as much by the increasing volume of research articles as by any concern for quality assurance. The sheer scale of submissions received by some journals overwhelms the editorial team’s ability to cope (Ingelfinger, 1974; Spier, 2002; Schuhmann, 2008). Given the widespread perception that peer review in practice creates a publication lottery (Smith, 2006), perhaps its use has more to do with the psychological need to have a system: authors and editors need somebody to take responsibility for the decision to accept or reject a paper, to give reasons no matter how unfair or incorrect[28].

What we arguably need to recognise is that now it is not just individual editorial teams but the journal system as a whole that is overstretched and overwhelmed. The hierarchical system of journals, from highly-selective to highly-permissive and from general to specialist, has reached its own limits of capacity: rather than fragmenting the literature still further into ever-more-diverse titles for the reader to choose between, it is surely preferable to attempt greater unification, to gather articles together into universal archives like arXiv or PLoS ONE and take advantage of good personalised search and recommendation tools rather than the crude methods of disciplinary or topical division.

On the other hand there are aspects of peer review which we would like to maintain. Despite all the flaws identified, we should bear in mind the ‘scores of scientists who have had their reputations saved by peer review’ (Drummond Rennie, quoted in Enserink, 2001)[29]. As Goodman et al. (1994) point out, the value of peer review lies in specific comments and advice rather than in general or abstract measures of ‘quality’ on which there is usually little agreement. This coincides with the frequent experience of journal editors (and authors too) that, where performed well, peer review can offer a valuable source of collegial advice and support (Rennie, 2002). Our aim should be to maintain possibilities for this kind of helpful assistance while attempting to offset the negative effects of the conventional review system such as delay, inconsistency and abuse.

[Where the Letters section of the Physical Review once published a great deal of the correspondence received, employing solely editorial selection, by the late 1950s the section had grown in size to a point where consultancy with external experts was unavoidable (Schuhmann, 2008).]

[N.B. Shift towards peer review may have been tech driven (Spier, 2002): in 19th century publishing capacity outstripped amount of scientific work & hence editors/journals would solicit contributions (rather than having to reject an excess). Reproducing manuscripts was also a lot of work. Post-photocopier, redistribution to experts easy. Now with web, perhaps moving back to capacity excess? So no need for entry conditions?]

[Ginsparg’s arXiv employs a ‘review’ procedure similar to Planck’s Annalen der Physik—once you’re in, you’re in.]

Notes

  1. Smith (2006) cites a rather shocking case reported by New England Journal of Medicine editor Drummond Rennie, where a reviewer—having produced a critical report on a submitted manuscript—then copied several paragraphs and submitted this ‘new’ work to another journal. He was found out when his own manuscript was sent for review to the original author. Nature experienced a similar situation where a referee held up the publication of a paper while using his privileged position to obtain materials to assist his own work and so scoop the original author (Nature Editorial, 2001).
  2. Peer review is also employed in the assessment of funding applications and the evaluation of individual researchers and job applicants, and these different aspects of professional scientific life cannot entirely be separated: success in obtaining funding or employment often rests on prior publication records.
  3. Such articles are sometimes published as ‘commentary’ or invited papers or otherwise explicitly identified as not having passed through an independent peer review process.
  4. Not all journals prefer such concrete advice. Nature’s peer review policy notes these options but adds that ‘The most useful reports ... provide the editors with the information on which a decision should be based. Setting out the arguments for and against publication is often more helpful to the editors than a direct recommendation one way or the other.’
  5. Most fields have at least one journal where even this appears to be optional.
  6. For example, a technically incorrect but creative and inspirational paper may be of greater value than a technically correct but trivial or uninteresting one. Perhaps the journals for whom technical correctness appears optional are actually on to something.
  7. Items included clarity of various issues (e.g. rationale, aims, study design), adequacy of procedures and precautions, appropriateness of methods, and others, all to be rated on a 1–5 point scale (with ‘not applicable’ an extra option).
  8. Several different interpretations could be placed on this point. It might be that the high selection criteria of the Annals of Internal Medicine (only 15% of submissions are published) mean that only manuscripts with very limited room for improvement are selected (‘less selective journals may have the potential to improve research reporting even more’). On the other hand perhaps the peer review process remains able to offer only limited help to all manuscripts—the Annals employs a large editorial staff, so perhaps the level of assistance it can provide is greater than many other journals.
  9. One possible reason for the low rate of error detection, suggested by Schroter et al. (2004), is that reviewers give up looking for further errors after having found enough to reject the manuscript. However, the result of Godlee et al. (1998) demonstrating that 33% of reviewers recommended acceptance despite the artificially-introduced weaknesses (compared to 12% recommending acceptance with major revision and 30% recommending rejection) may offer a counter-argument.
  10. The study was more generally an investigation of reviewer quality, demonstrating moderate editorial ability to identify good reviewers. When faced with the fictitious manuscript, highly-rated reviewers performed significantly better than poorly-rated reviewers.
  11. [Note on Zuckerman and Merton's status division]
  12. 42% in less than 2 months, compared to 35% and 29% for second- and third-tier authors respectively. Only 11% of first-rank authors had to wait more than 5 months for a response, compared to 20% and 30% for the other two groups.
  13. Swedish Medical Research Council policy prevents committee members from sitting in direct judgement on their colleagues or affiliates, but the supposedly neutral members tasked with this office seem nevertheless to be influenced, giving higher scores to researchers known to their committee peers.
  14. Cf. for example the differential initial treatment accorded authors of higher status, as reported by Zuckerman and Merton (1971).
  15. There appear to be no studies on incidences of authors identifying anonymous referees, although there exist anecdotal accounts. In at least some cases these may be paranoia: editors at Nature have commented how, often, recipients of negative reviews will incorrectly assume the paper was blocked by a rival abusing the review process (Dalton, 2001).
  16. This result takes into account the fact that some 42% of ‘blinded’ referees were at least partially successful in identifying authors or institutions: the performance of truly blinded referees was similar to that of referees for whom blinding was unsuccessful.
  17. An important consideration here is whether any one-dimensional metric can actually capture all the various facets of ‘quality’ that are important to science. For example, most psychometric tests address multiple different factors, including both general and domain-specific measures (Harnad, 2008).
  18. Some of these technology-driven review processes were distinctly unpleasant in nature. As Spier points out, some of the first large-scale ‘peer review’ was conducted after the introduction of printing, when it became possible for the first time to mass-produce and widely distribute documents: obviously, ran the line of thinking at the time, it was necessary for someone to ensure that what was distributed met some basic standards. Unfortunately the ‘peers’ doing the review tended to be political and/or religious authorities whose sanction on research deemed worthy of rejection was rather more harsh than denial of publication.
  19. Some of the larger research groups might spend $15–20,000 per year on this activity in material and personnel costs.
  20. Among other examples, Grisha Perelman published his proof of the Poincaré conjecture in a series of preprints on the arXiv and never submitted it to a conventional journal. As he later commented, ‘If anybody is interested in my way of solving the problem, it’s all there—let them go and read about it.’
  21. Quality control on arXiv is limited to an initial ‘in-the-club’ selection procedure—first-time uploaders must be sponsored by an existing arXiv author—and some moderation, mostly to ensure that papers are listed in the most appropriate subject areas.
  22. On the potential citation benefits of early access, see Newman (2008) on the first-mover advantage.
  23. Harnad (2001) notes, for example, the substantial shift in physics to using arXiv as the source of current research, without reference to peer review, while at the same time relying on publication in the peer-reviewed journal system to provide marks of professional achievement.
  24. There is also an ‘alert’ system whereby reviewers can flag papers they believe to be pseudoscientific rather than genuine research articles, and editors can reject on these grounds.
  25. A parallel can be drawn to the ethical conventions of science, that researchers should share their results openly rather than keeping them secret for private gain: see for example http://www.gnu.org/fry/
  26. See e.g. http://bazaar-vcs.org/Workflows for a discussion of some of the different development models.
  27. That is, are able to make changes to the code.
  28. ‘Peer review represents a crucial democratization of the editorial process, incorporating and educating large numbers of the scientific community, and lessening the impression that editorial decisions are arbitrary’ [our emphasis] (Rennie, 2002). See also the discussion in Ingelfinger (1974).
  29. If we wish to be cynical, we should perhaps note that we do not have statistics, even anecdotal ones, on how many scientific careers have been unnecessarily or unfairly destroyed by peer review.

Further reading

Peer review

Spier (2002) and Benos et al. (2007) provide good brief histories of the peer review process and (in the latter case) a review of the major issues and accusations surrounding it; the review by Ingelfinger (1974) is now somewhat out of date but makes good reading to see how the situation has changed in the last 30–40 years. Dalton (2001) and Lawrence (2003) offer good descriptions of the ‘social’ problems and consequences of peer review.

Several special issues of JAMA have been dedicated to study and analysis of the review process: JAMA (1998) 280(3) and JAMA (2002) 287(21). Two earlier issues from 1986 and 1990 are also dedicated to this subject but are available online only as abstracts. Nature has dedicated a Web Focus debate to the topic.

Research assessment metrics

The various articles by Garfield (1955, 1972, 2006) provide an interesting history of the conventional journal impact factor that he played the major role in developing, while Lawrence (2007, 2008) offers a strong critique. Lehmann et al. (2006) and van Raan (2006) offer some comparisons of different quality assessment methods. Ethics in Science and Environmental Politics and Marine Ecology Progress Series have published theme sections on, respectively, ‘the use and misuse of bibliometric indices in evaluating scholarly performance’ and ‘quality in science publishing’: Ethics Sci. Environ. Polit. (2008) 8(1) and Mar. Ecol. Prog. Ser. 270: 265–287.

Recommender systems

Adomavicius and Tuzhilin (2005) and Perugini et al. (2004) provide good reviews of the research literature, while Brusilovsky et al. (2007) contains several interesting articles on different aspects and types of recommender systems.

References

[N.B. these are the references actually cited in the text. Suggestions for further citation on the comments page please. :-)]
  • Adomavicius, G. & Tuzhilin, A. (2005) Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Know. Data Eng. 17(6): 734–749.
  • Althouse, B. M., West, J. D., Bergstrom, T. & Bergstrom, C. T. (2008) Differences in impact factor across fields and over time. arXiv: 0804.3116. [To appear in JASIST.]
  • Benos, D. J. et al. (2007) The ups and downs of peer review. Adv. Physiol. Educ. 31: 145–152.
  • Berry, M. W., Dumais, S. T. & O'Brien, G. W. (1995) Using Linear Algebra for Intelligent Information Retrieval. SIAM Review 37: 573–595.
  • Birnbaum, H. K. (2002) A personal reflection on university research funding. Physics Today 55(3): 49–53.
  • Blattner, M., Hunziker, A. & Laureti, P. (2007) When are recommender systems useful? arXiv: 0709.2562.
  • Borgman, C. L. & Furner, J. (2002) Scholarly communication and bibliometrics. Annu. Rev. Info. Sci. Tech. 36: 3–72.
  • Brin, S. & Page, L. (1998) The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30: 107–117.
  • Brusilovsky, P., Kobsa, A. & Nejdl, W. [eds.] (2007) The Adaptive Web (LNCS 4321). Berlin: Springer.
  • Callaham, M. L., Baxt, W. G., Waeckerle, J. F. & Wears, R. L. (1998) Reliability of editors’ subjective quality ratings of peer reviews of manuscripts. JAMA 280: 229–231.
  • Campbell, P. (2008) Escape from the impact factor. Ethics Sci. Environ. Polit. 8(1): 5–7.
  • Ceci, S. J. & Peters, D. P. (1982) Peer review—a study of reliability. Change 14(6): 44–48. [Cited in Benos et al. (2007).]
  • Chen, P., Xie, H., Maslov, S. & Redner, S. (2007) Finding scientific gems with Google’s PageRank algorithm. J. Informetrics 1: 8–15. doi:10.1016/j.joi.2006.06.001
  • Cheung, W. W. L. (2008) The economics of post-doc publishing. Ethics Sci. Environ. Polit. 8(1): 41–44. doi:10.3354/esep00083
  • Cho, M. K., Justice, A. C., Winker, M. A., Berlin, J. A., Waeckerle, J. F., Callaham, M. L., Rennie, D. & PEER Investigators (1998) Masking author identity in peer review: what factors influence masking success? JAMA 280(3): 243–245.
  • Cicchetti, D. V. (1997) Referees, editors and publication practices: improving the reliability and usefulness of the peer review system. Sci. Eng. Ethics 3: 51–62.
  • Davis, P. M. & Fromerth, M. J. (2007) Does the arXiv lead to higher citations and reduced publisher downloads for mathematics articles? Scientometrics 71: 203–215.
  • Dayton, A. I. (2006) Beyond open access: open discourse, the next great equalizer. Retrovirol. 3: 55. doi:10.1186/1742-4690-3-55
  • Franzen, M., Rödder, S. & Weingart, P. (2007) Fraud: causes and culprits as perceived by science and the media. EMBO Reports 8(1): 3–7.
  • Garfield, E. (1955) Citation indexes for science: a new dimension in documentation through association of ideas. Science 122: 108–111.
  • Garfield, E. (1972) Citation analysis as a tool in journal evaluation. Science 178: 471–479.
  • Garfield, E. (2006) The history and meaning of the journal impact factor. JAMA 295(1): 90–93.
  • Garfield, E. & Sher, I. H. (1963) New factors in the evaluation of scientific literature through citation indexing. Amer. Doc. 14(3): 195–201.
  • Garfunkel, J. M., Ulshen, M. H., Hamrick, H. J. & Lawson, E. E. (1994) Effect of institutional prestige on reviewers’ recommendations and editorial decisions. JAMA 272: 137–138.
  • Gilbert, J. R., Williams, E. S. & Lundberg, G. D. (1994) Is there gender bias in JAMA’s peer review process? JAMA 272: 139–142.
  • Giles, J. (2005) Internet encyclopaedias go head to head. Nature 438: 900–901. doi:10.1038/438900a
  • Giles, J. (2006) Journals submit to scrutiny of their peer-review process. Nature 439: 252.
  • Ginsparg, P. (1994) First steps towards electronic research communication. Comput. Phys. 8: 390–396.
  • Godlee, F., Gale, C. R. & Martyn, C. N. (1998) Effect on the quality of peer review of blinding reviewers and asking them to sign their reports: a randomized controlled trial. JAMA 280: 237–240.
  • Godlee, F. (2002) Making reviewers visible: openness, accountability, and credit. JAMA 287: 2762–2765.
  • Goldberg, K., Roeder, T., Gupta, D. & Perkins, C. (2001) Eigentaste: A Constant Time Collaborative Filtering Algorithm. Information Retrieval 4: 133–151.
  • Goodman, S. N., Berlin, J., Fletcher, S. W. & Fletcher, R. H. (1994) Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann. Intern. Med. 121: 11–21.
  • Harnad, S. (1990) Scholarly skywriting and the prepublication continuum of scientific enquiry. Psychol. Sci. 1: 342–344.
  • Harnad, S. (2001) The self-archiving initiative. Nature 410: 1024–1025.
  • Harnad, S. (2008) Validating research performance metrics against peer rankings. Ethics Sci. Environ. Polit. 8(1): 103–107. doi:10.3354/esep00088
  • Herlocker, J. L., Konstan, J. A., Terveen, L. G. & Riedl, J. T. (2004) Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems 22: 5–53.
  • Hirsch, J. E. (2005) An index to quantify an individual’s scientific research output. PNAS 102(46): 16569–16572.
  • Hirsch, J. E. (2007) Does the h-index have predictive power? PNAS 104(49): 19193–19198.
  • Hofmann, T. (2004) Latent Semantic Models for Collaborative Filtering. ACM Transactions on Information Systems 22: 89–115.
  • Ingelfinger, F. J. (1974) Peer review in biomedical publication. Amer. J. Med. 56: 686–692.
  • Jackson, A. (2002) From preprints to e-prints: the rise of electronic preprint servers in mathematics. Not. Amer. Math. Soc. 49: 23–32.
  • Jefferson, T., Alderson, P., Wager, E. & Davidoff, F. (2002a) Effects of editorial peer review: a systematic review. JAMA 287: 2784–2786.
  • Jefferson, T., Wager, E. & Davidoff, F. (2002b) Measuring the quality of editorial peer review. JAMA 287: 2786–2790.
  • Justice, A. C., Cho, M. K., Winker, M. A., Berlin, J. A., Rennie, D. & PEER Investigators (1998) Does masking author identity improve peer review quality? A randomized controlled trial. JAMA 280(3): 240–242.
  • Kassirer, J. P. & Campion, E. W. (1994) Peer review: crude and understudied, but indispensable. JAMA 272: 96–97.
  • Katz, D. S., Proto, A. V. & Olmsted, W. W. (2002) Incidence and nature of unblinding by authors: our experience at two radiology journals with double-blinded peer review policies. Amer. J. Roentgenol. 179: 1415–1417.
  • de Kerchove, C. & Van Dooren, P. (2007) Iterative filtering for a dynamical reputation system. arXiv: 0711.3964.
  • Kleinberg, J. M. (1999) Authoritative Sources in a Hyperlinked Environment. Journal of the ACM 46: 604–632.
  • Koonin, E. V., Landweber, L. F. & Lipman, D. J. (2006a) A community experiment with fully open and published peer review. Biol. Direct 1: 1. doi:10.1186/1745-6150-1-1
  • Kronick, D. A. (1990) Peer review in 18th-century scientific journalism. JAMA 263: 1321–1322. [Cited in several other references.]
  • Kurtz, M. J., Eichhorn, G., Accomazzi, A., Grant, C., Demleitner, M., Henneken, E. & Murray, S. S. (2005) Inform. Process. Manag. 41: 1395–1402.
  • Laureti, P., Moret, L., Zhang, Y.-C. & Yu, Y.-K. (2006) Information filtering via iterative refinement. Europhys. Lett. 75(6): 1006–1012. doi:10.1209/epl/i2006-10204-8
  • Lawrence, P. A. (2003) The politics of publication. Nature 422: 259–261.
  • Lawrence, P. A. (2007) The mismeasurement of science. Curr. Biol. 17(15): R583–R585.
  • Lawrence, P. A. (2008) Lost in publication: how measurement harms science. Ethics Sci. Environ. Polit. 8(1): 9–11. doi:10.3354/esep00079
  • Lehmann, S., Jackson, A. D. & Lautrup, B. E. (2006) Measures for measures. Nature 444: 1003–1004.
  • Lerner, E. J. (2002) Fraud shows peer review flaws. Industrial Physicist 8: 12–17.
  • Linden, G., Smith, B. & York, J. (2003) Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Computing 7: 76–80.
  • Link, A. M. (1998) US and non-US submissions. JAMA 280(3): 246–247.
  • Lloyd, M. E. (1990) Gender factors in reviewer recommendations for manuscript publication. J. Appl. Behav. Anal. 23(4): 539–543.
  • Marris, E. (2006) Should journals police scientific fraud? Nature 439: 520–521.
  • McNutt, R. A., Evans, A. T., Fletcher, R. H. & Fletcher, S. W. (1990) The effects of blinding on the quality of peer review. A randomized trial. JAMA 263(10): 1371–1376.
  • Nature Editorial (2003) Coping with peer rejection. Nature 425: 645.
  • Nature Neuroscience Editorial (1998) Citation data: the wrong impact? Nature Neurosci. 1(8): 641–642.
  • Nature Neuroscience Editorial (2003) Deciphering impact factors. Nature Neurosci. 6(8): 783.
  • Nature Neuroscience Editorial (2005) Revolutionizing peer review? Nature Neurosci. 8: 397.
  • Newman, M. E. J. (2008) The first-mover advantage in scientific publication. arXiv: 0809.0522.
  • Odlyzko, A. M. (1995) Tragic loss or good riddance? The impending demise of traditional scholarly journals. Int. J. Hum.-Comput. St. 42: 71–122.
  • Orr, R. H. & Kassab, J. (1965) Peer group judgements on scientific merit: editorial refereeing. Presentation to the Congress of the International Federation for Documentation, Washington DC. [Cited in Zuckerman and Merton (1971).]
  • Perugini, S., Gonçalves, M. A. & Fox, E. A. (2004) Recommender systems research: a connection-centric survey. J. Intell. Inf. Sys. 23(2): 107–143.
  • Price, D. J. de S. (1965) Networks of scientific papers. Science 149: 510–515.
  • Purcell, G. P., Donovan, S. L. & Davidoff, F. (1998) Changes to manuscripts during the editorial process: characterizing the evolution of a clinical paper. JAMA 280: 227–228.
  • Ray, J. G. (2002) Judging the judges: the role of journal editors. Q. J. Med. 95: 769–774.
  • Raymond, E. S. (1999) The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. Sebastopol, CA: O’Reilly.
  • Rennie, D. (2002) Fourth international congress on peer review in biomedical publication. JAMA 287: 2759–2760.
  • Roberts, L. W., Coverdale, J., Edenharder, K. & Louie, A. (2004) How to review a manuscript: a “down-to-earth” approach. Acad. Psychiatry 28: 81–87.
  • Schafer, J. B., Konstan, J. A. & Riedl, J. (2001) E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery 5: 115–153.
  • Schroter, S., Black, N., Evans, S., Carpenter, J., Godlee, F. & Smith, R. (2004) Effects of training on quality of peer review: randomised controlled trial. BMJ 328: 673–675.
  • Schuhmann, R. (2008) Editorial: Peer review per Physical Review. Phys. Rev. Lett. 100: 050001.
  • Seglen, P. O. (1997) Why the impact factor of journals should not be used for evaluating research. BMJ 314: 497.
  • Shardanand, U. & Maes, P. (1995) Social information filtering: algorithms for automating “word of mouth”. Proceedings of the SIGCHI conference on Human factors in computing systems, 210–217.
  • Smigel, E. O. & Ross, H. L. (1970) Factors in the editorial decision. Amer. Sociol. 5: 19–21.
  • Smith, R. (2006) Peer review: a flawed process at the heart of science and journals. J. Roy. Soc. Med. 99: 178–182.
  • Spier, R. (2002) The history of the peer-review process. Trends Biotechnol. 20: 357–358.
  • Stallman, R. M. (2002) Free Software, Free Society: Selected Essays of Richard M. Stallman. Boston, MA: GNU Press.
  • Stringer, M. J., Sales-Pardo, M. & Amaral, L. A. N. (2008) Effectiveness of journal ranking schemes as a tool for locating information. PLoS ONE 3(2): e1683. doi:10.1371/journal.pone.0001683
  • Symonds, M. R., Gemmell, N. J., Braisher, T. L., Gorringe, K. L. & Elgar, M. A. (2006) Gender differences in publication output: towards an unbiased metric of research performance. PLoS ONE 1(1): e127. doi:10.1371/journal.pone.0000127
  • Takács, G., Pilászy, I., Németh, B. & Tikk, D. (2007) On the Gravity Recommendation System. Proceedings of KDD Cup and Workshop, 22–30.
  • Takács, G., Pilászy, I., Németh, B. & Tikk, D. (2008) Investigation of Various Matrix Factorization Methods for Large Recommender Systems. Proceedings of 2nd Netflix-KDD Cup and Workshop.
  • Todd, P. A. & Ladle, R. J. (2008) Hidden dangers of a ‘citation culture’. Ethics Sci. Environ. Polit. 8(1): 13–16. doi:10.3354/esep00091
  • Torvalds, L. (2007) Tech Talk: Linus Torvalds on git. YouTube Video
  • Underwood, A. J. (2004) It would be better to create and maintain quality rather than worrying about its measurement. Mar. Ecol. Prog. Ser. 270: 283–286.
  • van Raan, A. F. J. (2005) Fatal attraction: conceptual and methodological problems in the ranking of universities by bibliometric methods. Scientometrics 62(1): 133–143.
  • van Raan, A. F. J. (2006) Comparison of the Hirsch-index with standard bibliometric indicators and with peer judgement for 147 chemistry research groups. Scientometrics 67(3): 491–502.
  • van Rooyen, S., Godlee, F., Evans, S., Smith, R. & Black, N. (1998) Effect of blinding and unmasking on the quality of peer review: a randomized trial. JAMA 280(3): 234–237.
  • van Rooyen, S., Black, N. & Godlee, F. (1999a) Development of the Review Quality Instrument (RQI) for assessing peer reviews of manuscripts. J. Clin. Epidemiol. 52(7): 625–629.
  • van Rooyen, S., Godlee, F., Evans, S., Black, N. & Smith, R. (1999b) Effect of open peer review on quality of reviews and on reviewers’ recommendations: a randomised trial. BMJ 318: 23–27.
  • Walsh, E., Rooney, M., Appleby, L. & Wilkinson, G. (2000) Open peer review: a randomised controlled trial. Brit. J. Psychiat. 176: 47–51.
  • Weingart, P. (2005) Impact of bibliometrics upon the science system: inadvertent consequences? Scientometrics 62(1): 117–131.
  • Wennerås, C. & Wold, A. (1997) Nepotism and sexism in peer review. Nature 387: 341–343.
  • Zhang, Y.-C., Blattner, M. & Yu, Y.-K. (2007) Heat conduction process on community networks as a recommendation model. Phys. Rev. Lett. 99: 154301. doi:10.1103/PhysRevLett.99.154301
  • Zhang, Y.-C., Medo, M., Ren, J., Zhou, T., Li, T. & Yang, F. (2007) Recommendation model based on opinion diffusion. Europhys. Lett. 80: 68003. doi:10.1209/0295-5075/80/68003
  • Zhou, T., Ren, J., Medo, M. & Zhang, Y.-C. (2007) Bipartite network projection and personal recommendation. Phys. Rev. E 76: 046115. doi:10.1103/PhysRevE.76.046115
  • Zuckerman, H. & Merton, R. K. (1971) Patterns of evaluation in science: institutionalisation, structure and functions of the referee system. Minerva 9(1): 66–100. doi:10.1007/BF01553188