What Is an Outside Scholar?

NOTE: The following is the first draft of a chapter that I wrote for a book I am currently working on. I decided to cut the chapter as being outside the scope of the book, but it seemed a shame to waste it, so I decided to post it here, particularly since parts of it relate to various of my Great Books posts.

So far in this book I have used the term outside scholar fairly casually to refer to Victor Sharrow and those like him. Before proceeding, I think it is time to expand on what I mean when I use this label. First, though, I think it is appropriate that we review a bit of history about scholarship in general.

A scholar is a person who creates knowledge by a process called research and transmits it to others, usually through writing. In contemporary usage the term often connotes a profession. In the truest and most historic sense, however, scholarship is a vocation; scholars are driven by intellectual curiosity, love of knowledge, and a desire to create a permanent legacy for other scholars who will come later. A person with the true scholarly vocation will usually find a way to pursue their interests regardless of what formal profession they follow to make a living. In fact, the idea of a professional scholar who is paid for their studies is largely an invention of the modern age.

In our western tradition this conception of scholarship has its roots, like much else in our society, in the Golden Age of classical Greece, when literate men began to research science, philosophy, and history and record their conclusions on scrolls which they allowed other scholars to borrow and copy, birthing the concept of scholarly publication. Typical of these men were the historians Herodotus, Thucydides, and Xenophon. The first of these was a merchant, while the other two were career military men, but all three were fascinated by recent history and the causes and effects of war. After collecting and comparing oral histories and visiting some of the locations where important events had occurred, they wrote books which not only chronicled history, but also analyzed it. Their works are still read and studied today1.

Thucydides, at least, was fully cognizant of his drive to leave a permanent intellectual legacy, writing,

“It will be enough for me…if these words of mine are judged useful by those who want to understand clearly the events which happened in the past and which (human nature being what it is) will, at some time or other and in much the same ways, be repeated in the future. My work is not a piece of writing designed to meet the taste of an immediate public, but was done to last forever.”

Thucydides knew Herodotus personally and was influenced by his book. Xenophon would have been acquainted with the work of both and seems to have written the Hellenica as a direct sequel to Thucydides’ work. Nevertheless, it never occurred to these men, or their contemporary colleagues whose work is now lost, to think of themselves as a community or school of historic scholars. They merely shared a common interest.

It was the philosophers of Greece who originated the idea of an academy. The original academy was a grove of trees outside Athens where teachers met with their students. The academy became an actual institution when Plato joined with other local philosophers to create a school, holding classes in his home or the nearby gymnasium. Aristotle, the son of the Macedonian royal physician, studied there for some twenty years; later, after a period in Macedon, he returned to Athens to found his own school, the Lyceum. Philosophers in Greece had always supplemented their incomes by teaching the sons of the local aristocracy. Formal academies were a way of persuading the students to come to them, rather than having to wander the country in search of students. Academies in the pattern of Plato’s came and went in the Hellenic world until the very end of the ancient period.

From the first century onward, Christianity began to dominate the intellectual life of the West, gradually replacing the more secular philosophy of classical antiquity. By the time the Western Roman empire collapsed, most learning was concentrated in the Church. Literacy rates dropped throughout Europe and the secular members of the upper classes found they were too busy fighting for survival to devote time to scholarship. The Eastern empire survived and was spared the worst effects of the Dark Ages, but the Byzantine mind was increasingly inclined towards mysticism and away from rational scholarship. In 529 the emperor Justinian ordered the closure of the last incarnation of the Athenian Academy, an event which some historians consider to be the official end of the ancient era and the beginning of the medieval period.

For nearly 1,000 years, the church, particularly the monasteries, had a virtual monopoly on scholarship. Nearly everyone who learned to read and write was taught by clerics, and most of the books that still survived were the property of the church. The first universities were an outgrowth of earlier monastic schools and existed mainly to train priests and church officials2. Even those rare lay scholars who did not accept ordination pursued their studies with and within the church organization, or not at all.

All of this began to change around the end of the 15th century. The invention of the printing press and the availability of paper drastically lowered the cost of books. Rising economic prosperity allowed more lay people in the upper class and the emerging bourgeoisie the luxury of an extended education. Since the 13th century, classical works which had long been lost in Europe but had survived in the Islamic world had begun to make their way back into the libraries of the West. Now they could be purchased and read by the laity. A new kind of intellectual began to emerge throughout Europe to help build the modern age.

These Renaissance men had more in common with the scholars of classical Athens than with the monks of the Middle Ages. Typical of them was Niccolò Machiavelli. Machiavelli was a Florentine politician. After finding himself on the wrong side of a coup, he lost his post and was forced to retire to the countryside. His best known work, The Prince, was an unsuccessful attempt to showcase his knowledge of political science and recent history in the hopes that a powerful noble would notice and offer him a position. Permanently shut out of politics, he consoled himself by reading the classics and writing a scholarly commentary on the works of Livy. Machiavelli might be the first successful outside scholar of the modern age. In fact, at least some historians feel that the publication of his works marks the start of the modern age3. Machiavelli was a layman and out of favor with the establishment. His major works were not published until after his death and were officially banned by the Church. Even today The Prince, while widely read, remains controversial. Despite this, Machiavelli’s eventual influence on western thought is incontestable.

Perhaps the greatest of all the Enlightenment outsiders, though, was Spinoza. Born in 1632 to a family of Portuguese Jews who had fled to Amsterdam to escape the Inquisition, he showed a scholarly turn and was initially expected to become a rabbi. His curiosity soon drove him beyond the Torah, Talmud, and orthodox Judaica into the Cabala and other esoteric studies. Then, after taking Latin lessons from a gentile freethinker, he proceeded to devour every philosophical text he could find, from Aristotle to Descartes. By then the young philosopher was beginning to harbor theories that made the elders of the synagogue extremely nervous.

Intellectual life among the Dutch Jews of the 17th century was closely circumscribed. Holland was one of the only places in Europe that was not closed to them in that period, and they remained only at the sufferance of their Protestant Christian hosts. Driven by the dual imperatives to maintain their cultural unity and to avoid giving offense to the Christians, they focused their studies on the Torah and avoided dangerous speculation. Young Spinoza, who had now begun saying things like “Angels are probably only hallucinations” and “The Bible uses figurative language and isn’t meant to be taken literally,” was not just a destabilizing influence, but was all too likely to bring down the wrath of the Christian majority on the Jewish community.

At the age of 24 Spinoza was given a choice: he could either accept an annuity of 1,000 florins in return for keeping his unorthodox theories to himself, or he could be excommunicated from the Jewish faith. He chose excommunication. Europe had recently concluded a series of brutal wars of religion between Catholics and Protestants which had raged intermittently for 126 years. Religious affiliation was still the single most important factor in the personal identity of most people, and to not belong to an organized religion was unthinkable. Yet Spinoza never converted to another faith. Changing his first name from Baruch to Benedict, he moved into an attic apartment and spent the rest of his life writing books on philosophy while he supported himself by grinding lenses. Later, when his reputation began to grow, he turned down financial support from Louis XIV of France and even a prestigious university professorship on the grounds that accepting money from the government would irrevocably compromise his freedom to philosophize.

Of his five works (one unfinished) only two could be safely published during his life: a commentary on the philosophy of Descartes and the Theologico-Political Treatise, which was immediately placed on the index of banned books and had to be sold with a false cover and only the author’s initials on the title page. Among the inflammatory ideas contained in the book is the idea that the Bible is written in figurative language. The key to understanding it is to study the historical, biographical, and cultural context in which the authors lived:

The universal rule, then, in interpreting Scripture is to accept nothing as an authoritative Scriptural statement which we do not perceive very clearly when we examine it in the light of its history.

… such a history should relate the environment of all the prophetic books extant; that is, the life, the conduct, and the studies of the author of each book, who he was, what was the occasion, and the epoch of his writing, whom did he write for, and in what language. Further, it should inquire into the fate of each book: how it was first received, into whose hands it fell, how many different versions there were of it, by whose advice was it received into the Bible, and, lastly, how all the books now universally accepted as sacred, were united into a single whole.

All such information should, as I have said, be contained in the ‘history’ of Scripture. For, in order to know what statements are set forth as laws, and what as moral precepts, it is important to be acquainted with the life, the conduct, and the pursuits of their author: moreover, it becomes easier to explain a man’s writings in proportion as we have more intimate knowledge of his genius and temperament.

Further, that we may not confound precepts which are eternal with those which served only a temporary purpose, or were only meant for a few, we should know what was the occasion, the time, the age, in which each book was written, and to what nation it was addressed. Lastly, we should have knowledge on the other points I have mentioned, in order to be sure, in addition to the authenticity of the work, that it has not been tampered with by sacrilegious hands, or whether errors can have crept in, and, if so, whether they have been corrected by men sufficiently skilled and worthy of credence. All these things should be known, that we may not be led away by blind impulse to accept whatever is thrust on our notice, instead of only that which is sure and indisputable.

Today, this viewpoint is at the core of all but the most fundamentalist Judeo-Christian Bible study, but it was revolutionary in 1670. In fact, the Theologico-Political Treatise is barely studied or quoted today, except by historians, because most of its arguments are now taken for granted in mainstream western thought.

Spinoza’s greatest work is his Ethics, which solidified his reputation, along with Descartes and Leibniz, as one of the three greatest rationalist philosophers. It would be hard to exaggerate the extent of Spinoza’s influence on the next three centuries of modern philosophy. His impact on Judaism, once his people were ready to reclaim him, was equally pervasive. He has been called the “first modern secular Jew” and credited with originating many of the core ideas of Reform Judaism.

Even as Machiavelli, Spinoza, and numerous other freethinkers were revolutionizing Western thought from outside any organized intellectual establishment, new forces were making themselves felt throughout Western Civilization4. Universities, which had first appeared in the medieval period, multiplied through the modern period, first in Europe and then in the New World5. Meanwhile scholars and learned professionals, seeing the value of communication and collaboration, began to organize themselves into societies. Typical of these was the Royal Society, founded in 1660, of which Henry Oldenburg, one of Spinoza’s best friends, was the first secretary. The often overlapping influence of the universities and societies on the growth of knowledge was overwhelmingly positive. However, as time went on, a divide began to appear between the “elite” scholars who attended and taught at universities or belonged to scholarly societies and the “amateur” scholars who did not. A new Academy was forming which had the power to give or withhold approval and legitimacy to scholarly efforts.

The implicit narrative began to be that outside scholars were undisciplined and underprivileged. By the end of the Enlightenment, efforts were made to bring the most brilliant of them into the fold, which many accepted joyfully. Spinoza was exceptional in turning down a university position when it was offered. More typical was Samuel Johnson, that brilliant titan of English letters, who was given an honorary doctorate and referred to as “Dr Johnson” by academics forever more. Benjamin Franklin, a self-educated man who spent his early career as the archetypal outside scholar, happily accepted his own honorary doctorate and membership in the Royal Society in later life, glorying in his hard-won academic legitimacy.

As time went on, it became harder even for exceptional outsiders to gain admission to the ivory tower of academia. The Academy had emerged as a new international priesthood, with a hold over scholarship almost as strong as the church had enjoyed in the previous age. Only those who had served their novitiate and displayed appropriately orthodox dogmas could be ordained.

Rise of the Modern University

While universities first appeared in the Middle Ages and can, at least in theory, be placed into the tradition of higher education which began with the Athenian academy, most of the traits which we associate with the modern university first appeared in the 19th century. It was in this period that two major schools of thought emerged which still shape thinking about the role of the university. One of these viewpoints was articulated by Cardinal John Henry Newman in a series of lectures given in Dublin in the 1850s. Newman’s view was shaped by his own experiences at Oxford, which, like the other “ancient universities” of the British Isles, was then in the process of transitioning from training aristocrats to providing a liberal education for the new class of skilled bourgeoisie. He argued that the primary role of a university was to provide a generalized education. Research was a less important mission than teaching. Indeed, research could be more efficiently conducted outside the university:

The view taken of a University in these Discourses is the following:—That it is a place of teaching universal knowledge. This implies that its object is, on the one hand, intellectual, not moral; and, on the other, that it is the diffusion and extension of knowledge rather than the advancement. If its object were scientific and philosophical discovery, I do not see why a University should have students; if religious training, I do not see how it can be the seat of literature and science. … …there are other institutions far more suited to act as instruments of stimulating philosophical inquiry, and extending the boundaries of our knowledge, than a University. Such, for instance, are the literary and scientific “Academies,”… … To discover and to teach are distinct functions; they are also distinct gifts, and are not commonly found united in the same person. He, too, who spends his day in dispensing his existing knowledge to all comers is unlikely to have either leisure or energy to acquire new .

The Newman model of the university’s mission was highly influential in the United Kingdom and, to a lesser extent, on liberal arts colleges in America.

Meanwhile, in Germany, another model was emerging based on the University of Berlin, founded by Wilhelm von Humboldt in 1810. In the Humboldt-type university, teaching and research were inseparable. The university was a sort of knowledge factory. Students learned by being involved, albeit at a very low level, in the critical investigation of truth. The overall prestige of a university was based on the quality of research it generated. The Humboldt model became wildly popular on the continent because Humboldt-type research systems were seen as a major factor in Germany’s economic growth. When the US began building its state university system with the passage of the Morrill Acts in 1862 and 1890, the Humboldt model was taken as a template for the ideal public university.

Until World War II most new universities in Europe and the Americas were based on the Humboldt paradigm. After the war, however, pressures to provide mass education to all citizens, combined with population pressures from the baby boom and the passage of the GI Bill in the US, which allowed returning soldiers to finance higher education, created demand for a third type of university. Neither Newman- nor Humboldt-type schools were physically capable of absorbing the influx of new students, which pushed student-to-faculty ratios to a historic high. Nor were the new students, primarily first generation, particularly interested either in gaining a generalized liberal education or in engaging in research. They came to school to learn technical skills and gain specialized diplomas which would increase their incomes. In response to this demand, the second half of the twentieth century saw a wave of new polytechnic schools, vocational schools that reinvented themselves as “technical universities”, and, finally, for-profit “universities”. At these new schools basic research, if conducted at all, was a distinctly secondary pursuit. The need for faculty in these institutions paved the way for a type of second-class academic whose primary job was lecturing to students who would never themselves become scholars.

Older universities, forced to compete with the new technical schools for funding, faculty, and students, began to adopt some of their traits. Student-to-faculty ratios rose, universities began doing more applied research, and an increasing number of specialized professional degree programs appeared in catalogs. Many older universities added professional schools, which allowed them to attract talented students who might otherwise go to a technical university while charging them tuition at a much higher rate than that for “research” graduate degrees. In 1908 Harvard began offering a new graduate degree, the Master of Business Administration (MBA), which was essentially a vocational diploma for corporate executives. Other major research universities rapidly followed. Today the MBA is the most awarded graduate degree worldwide. Some MBA students are involved with research and a few go on to PhD programs, but the degree is not seen as preparation for a research career. In most business schools that offer PhD programs, MBA and PhD candidates are admitted based on different criteria and are almost completely segregated from each other throughout their studies. An MBA graduate, even a talented researcher, has almost no chance of landing a tenure-track academic job after graduation. There are around 800,000 of them graduating every year and every one of them, if they choose to do research, is, by definition, an outside scholar6.

As a result of these four decades of competitive convergence, the typical state university of today has a case of institutional schizophrenia. One side of the split personality is a Humboldtian research university in which research teams, led by tenured professors assisted by a chosen few students, spend their time competing for grant money and cranking out papers. The other side is a career school in which lecturers and graduate teaching assistants cater to the legions of undergraduates and professional students who need diplomas which will allow them to take their places among the ranks of the bourgeoisie.

The same period over which the university attained its final form has seen the stratification of the scholarly community into four rigid castes, with relatively little mobility between them. The two upper castes make up the Academy, while the two lower castes are outsiders. At the top are the professional researchers. Most often they are tenured professors at a research university, or hold an analogous position at a public or private research facility. Members of this caste not only have little trouble getting their research published and accepted but, because they control the peer review process, conference agendas, and PhD committees, are also able to give or withhold the stamp of legitimacy to scholars of the lower castes. Below them are the lecturers, scholars who have either failed to reach the top caste, or whose main interest is education. Their main function is undergraduate and professional education, but if they can somehow find the time and money for research they can often get it published. Below them are the professionals who hold specialized doctoral or master’s degrees in law, business, medicine, engineering, education, or other fields. They are generally able to publish applied research in their own field, usually under the auspices of a professional association, but are discouraged from pure or theoretical research. At the lowest level are the autodidacts. These scholars, no matter what their level of interest, ability, and knowledge, have not managed to obtain the graduate degree which is the minimum requirement for scholarly legitimacy. In general, they have no access to journals, conferences, or “respectable” academic presses and are totally ignored by the academy. The avenues open to them to communicate their work–“popular” nonfiction, Internet blogs, and predatory for-profit journals–have little reach even among their own caste.

One of the most universal traits of all four castes is specialization. Despite a certain amount of lip service to multidisciplinary or interdisciplinary scholarship, 21st century scholars tend to confine their work to incredibly narrow disciplines. The typical modern scholar is thus defined by their place in a rigid system which labels and circumscribes them according to type of (or lack of) institution, rank, and specialty. There is no place in such a system for a Benjamin Franklin, a Francis Bacon, or even an Aristotle or Spinoza.

Historian John Lukacs explains this phenomenon as part of a process of bureaucratization which has continued in all aspects of Western Civilization throughout the modern age, reaching new heights in the twentieth century: “In this increasingly bureaucratized world, little more than the possession of various diplomas mattered. Since admission to certain schools–rather than the consequently almost automatic acquisition of degrees–depended on increasingly competitive examinations, the word ’meritocracy’ was coined…In reality the term ’meritocracy’ was misleading. As in so many other spheres of life the rules that governed the practices and functions of schools and universities were bureaucratic rather than meritocratic.” Securing admission to a program and earning a degree is only the first step for someone seeking an academic career. In the US it takes around ten years for the average PhD student to earn their degree, counting from the receipt of their bachelor’s. Once they take the examinations and submit to copious paperwork to gain admission to a program, they are presented with a list of required courses, further exams, and residency requirements to gain the degree. The only requirement that is designed purely to test the student’s skill as a writer and researcher is the dissertation. Even in this area, following the correct format and submitting the appropriate paperwork often becomes nearly as important as the actual scholarship. In many fields, particularly the physical sciences, the PhD program is not even seen as adequate preparation for independent research, and students are expected to spend further years in one or more “post-doc” research appointments to gain further experience.

Newly made PhDs are next subjected to yet another “meritocratic” sorting process. The lucky and well-connected are placed in “tenure track” positions as assistant professors. The second tier secure positions as lecturers–second class faculty who have no prospect of tenure and are expected to teach heavy course loads to free up the professors for research. The rest, an increasing percentage of the total, eke out a living as part time adjunct instructors, often commuting to three or more schools in a week in order to make ends meet. These “gypsies”, as they are referred to by their more fortunate colleagues, live in hope that a full time position will materialize, but the odds are stacked against them. It is hardly surprising that so many PhD students either fail to complete their degree or, having obtained it, give up and leave academia forever. Some of them have no choice: a gap in employment of more than a few months, or too much time spent as an adjunct, is often seen as a black mark in an academic’s career, permanently excluding them from consideration for full time positions7.

As for those lucky few, the small percentage of scholars who make it onto the tenure track, they are privileged to spend the next six or seven years working sixty hour weeks while they accumulate the requisite ticket punches for promotion. If all goes well they gain tenure around year seven, finally making it into full membership in the academy. If something goes wrong, or the university simply decides that it doesn’t need any more associate professors at the moment, they are thanked and excused and leave to start over from the beginning.

An assistant professor working towards tenure has no incentive to take risks. A large volume of acceptable publications is always less risky than a few brilliant ones. Research that is too controversial, or steps on the toes of a member of the tenure committee, can easily wreck their career. Some of them tell themselves that they will play it safe until they get tenure, then work on the projects that they really want to do. A few follow through on this, but it is hard to radically change the direction of one’s research after seven years of escalating commitment. Many of them, after spending two decades of their research career playing it safe, have no idea how to take risks even if they wanted to.

Everything in the career path of an academic selects for risk-avoiding individuals who know how to play the system. Successful professors have all the same character traits as a career bureaucrat. Worse, by the time they achieve tenure they have been thoroughly socialized to look down on any scholar who has not managed to survive the same process. At the same time, they have spent years acquiring narrowly specialized knowledge, working mostly with people in the same discipline, and being warned by their mentors not to have opinions or do work outside their field8.

American research universities are incredibly good at their main function, which is rigorous, deep research in narrowly defined areas. They focus on training the kind of scholars that they need for this mission. Unfortunately, these specialized professors are much less effective at some of the other functions which have traditionally been associated with scholars. Teaching, particularly at the undergraduate level, is generally fobbed off on lecturers and graduate students. Practical applications, particularly those involving interdisciplinary knowledge, tend to be the province of corporate R&D organizations, where researchers are expected to pursue projects that will make a profit for the company and share their findings with competitors only when it is in the company’s interest. The task of advising policymakers is carried out by staff intellectuals at government agencies, which are, more or less by definition, even more bureaucratic and conservative than the universities.

But what of those scholars who follow the more traditional model, more like the great thinkers of the ancient world and the Enlightenment? What about those who left the academy after earning a graduate degree–PhD, master’s, or professional–but still have an interest in doing real scholarly research and creating knowledge or affecting public policy? What about autodidacts who never had a formal education at all but, after a lifetime of reading, are now ready to write serious nonfiction works? Is it even possible for these outside scholars to make a contribution in the modern era?

So far in this book, I have deliberately avoided writing any autobiographical details because I felt it would distract from the purpose of the work. Now, however, in the interests of full disclosure, I must mention that I too am one of these outsiders, and the answers to these questions affect me personally. I attended professional school at a major research university, earning an MBA. While there I did original research and completed a thesis which was later published as my first book. Several professors strongly urged me to continue on and finish a PhD. Upon examining what would actually be required, and the personal and family sacrifices that I would need to make, I decided that it wasn’t worth it. I am still doing primary research in my specialty, but I am finding every aspect of it more difficult now that I am not affiliated with an institution: it is much harder to obtain grant funding, I have trouble getting the journals and database access I need, and I no longer have a departmental fund to pay my way to conferences. When I go to publish in journals I find that the burden of proving my credibility is on me; without the name of an institution under my byline, the assumption is that I don’t have the qualifications to publish. I am far from the only one in this situation, though. Later, I will talk about some of the changes which are making life easier for us.

  1. Read together, Books V-IX of Herodotus’ Histories, Thucydides’ History of the Peloponnesian War, and Xenophon’s Hellenica form a continuous trilogy of the history of Greece and her neighbors from just before the Greco-Persian wars up to the aftermath of the Peloponnesian War, a period of approximately 136 years.
  2. Note the modern similarity between academic regalia and monastic habits.
  3. Allan Bloom argues that Machiavelli was the philosopher who began the Enlightenment. According to Bloom, it was Machiavelli who first suggested that the philosophers of western civilization, who had formerly been dependent on the patronage of the aristocracy, should “change camps” and espouse democracy, reason, and the theory of rights–some of the most characteristic concepts of the modern age–as these would create a society that offered them greater protection and scope for their talents.
  4. My discussion has necessarily been limited in scope to the history of Western Civilization. Other societies have their own scholarly traditions and institutions, some of which predate Western civilization itself. Likewise, they have had their own outside scholars who toiled outside the scholarly establishment and gained legitimacy and influence only late in life or even centuries after their deaths. Confucius is but one example. As the modern age continued, however, the ruling and intellectual classes of the East were increasingly educated by the Academy of the West. By the 20th century the Academy was completely international, and organized on the Western Model. See Eberhard.
  5. Even the destruction and upheavals of the Wars of Religion did little to slow the spread of universities. In fact, some of the most famous universities were founded as gambits in the struggle between Protestants and Catholics. For example, Trinity College in Dublin was established on the orders of Elizabeth I to educate the sons of her Protestant subjects in Ireland without subjecting them to the corruptive influences of Catholicism.
  6. During orientation on my first day of business school I raised my hand and asked an associate dean about research opportunities for MBA students. He laughed and said “If you want to do research, what are you doing in the MBA program? You should have applied as a PhD.”
  7. For purposes of discussion I have focused on the career path of scholars at a research university. Many PhDs also work for government agencies or for-profit research organizations which have their own bureaucratic hurdles.
  8. At American universities and schools in other countries that are based on the American model, the basic unit of organization is the department, which consists of all of the university’s specialists in a particular discipline. At English universities, on the other hand, the basic unit is the college, which will typically include one professor from each discipline. English professors, and European academics in general, also tend to be more involved with teaching and administration than their American colleagues. See Eagleton for a delightful overview of some of the differences.


Anderson, Robert. “The ‘Idea of a University’ today.” History and Politics (2010).

Bloom, Allan David. The Closing of the American Mind: How Higher Education Has Failed Democracy and Impoverished the Souls of Today’s Students. New York: Simon & Schuster, 1987.

Copulsky, Jerome E. “The Last Prophet: Spinoza and the Political Theology of Moses Hess.” University of Chicago Divinity School, 2008.

Durant, Will. The Story of Philosophy: The Lives and Opinions of the World’s Greatest Philosophers. Kindle Ed. Aristeus, 2014.

Eagleton, Terry. Across the pond: an Englishman’s view of America. 2013.

Eberhard, Wolfram. A History of China. 3rd ed. [org. pub. 1969]. Project Gutenberg, 2006.

Herodotus. The Persian War. Translated by William Shepherd. Cambridge; New York: Cambridge University Press, 1982.

Hoffer, Thomas B., and Vincent Welch. Time to degree of U.S. Research doctorate Recipients. National Science Foundation Directorate for Social, Behavioral, and Economic Sciences, March 2006.

Lukacs, John. At the end of an Age. New Haven: Yale University Press, 2002.

Machiavelli, Niccoló. The Prince. Translated by George Bull. London; New York: Penguin Books, 2003.

Newman, John Henry. The Idea of a University Defined and Illustrated In Nine Discourses Delivered to the Catholics of Dublin. Project Gutenberg, 2008.

Newman, John Henry. The University: Its Rise and Progress. Edited by Kevin A. Straight. Montrose, CA: Creative Minority Productions, 2015.

O’Brien, Keith. “The Ronin Institute for wayward academics: a bold new idea to solve the PhD crises.” Boston Globe (May 27, 2012).

Spinoza, Benedictus de. The ethics of Spinoza: the road to inner freedom. Secaucus, N.J.: Citadel Press, 1976.

Spinoza, Benedictus de. Theologico-Political Treatise. Translated by R.H.M. Elwes. Project Gutenberg, 1997.

Spinoza, Benedictus de, and Joseph Ratner. “The Life of Spinoza.” in The philosophy of Spinoza, [org. pub. 1926]. Project Gutenberg, 2010.

Thucydides. Thucydides: History of the Peloponnesian War. Translated by Rex Warner. Harmondsworth, Middlesex: Penguin, 1954.

Xenophon. Hellenica. Translated by Henry Graham Dakyns. Champaign, Ill.: Project Gutenberg, 2008.


Word Frequency Analysis – The Most Common Words

There are any number of reasons why you might need a list of the most common words in the language. In my case, I was working on a piece of software to speed the process of building indexes for my print books. My program reads the book and suggests a list of words that the author might want to include in the index. It needed a list of the most common words so it would know not to bother suggesting them. I’ll post that script in a couple of days. For now, though, I thought I would give you a very simple piece of Python code that reads a directory full of text files, counts how many times each word occurs, and prints a list of those which show up most often. I set it to give me the most common 1000 words. You could generate a list of any length, though, just by changing one number in the code.

If you don’t care to look behind the curtain and just want to cut and paste my word list, feel free to scroll down to the bottom of the post.

For raw data, I used a sample of 37,358 Project Gutenberg texts. PG is kind enough to offer an interface for researchers like me to harvest books. Note that this would work nearly as well with a much smaller sample. But I had already downloaded the books for another project, so I figured I might as well use them. If you use a PG harvest for your data set, make sure to remove the Human Genome Project gene sequence files (a full dump contains at least three copies of the full human genome). Otherwise, this script will have major grief when it tries to count each gene as a word.
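
If you would rather not hunt for the gene sequence files by hand, one rough way to flag them is to check how much of a text sample consists of nucleotide letters. This is only a sketch, assuming Python 3; the function name `looks_like_gene_sequence`, the letter set, and the 0.9 threshold are illustrative guesses, not anything taken from PG or from the script below.

```python
def looks_like_gene_sequence(text, threshold=0.9):
    """Guess whether text is mostly nucleotide letters (A, C, G, T, N)."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    if not letters:
        return False
    hits = sum(1 for ch in letters if ch in "ACGTN")
    # Ordinary English prose scores far below the threshold.
    return hits / len(letters) >= threshold


print(looks_like_gene_sequence("GATTACAGATTACA NNNN CCGGTTAA"))
print(looks_like_gene_sequence("It was the best of times, it was the worst of times"))
```

Since a PG file may open with license text, it is probably safer to test a sample from the middle of the file than the first few lines.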

Note that, as currently written, this script requires GNU Aspell and a system that works correctly with pipes. This means it should run fine on nearly any Unix-like system, but you Windoze people are on your own.

The first part of the script loads a few standard modules. Then it gets a listing of the current directory and starts looping through each text file in it. With each iteration it prints a status message with the file name and percent completion. With scripts like this that take a day or two to run I like to be able to see at a glance how far along I am. As an aside, if you access your computer through a terminal like I do you will probably want to use GNU Screen or a similar utility to protect yourself from accidental disconnects while it’s still running.

#! /usr/bin/env python

'''Word frequency analyzer'''

import os, stat, string, subprocess


filelist = os.listdir('.')
wordcounts = {}    #running totals, keyed by word

counter = 0
for f in filelist:
    counter += 1
    if os.path.splitext(f)[1] == '.txt':
        #status line: file name, position in the listing, percent complete
        print(f+'\t'+str(counter)+' of '+str(len(filelist))+'\t'+
              str((float(counter)/float(len(filelist)))*100)[:6]+'%')

The next portion opens each book file and reads it in. Next, because I’m using PG books as a data set, I need to strip off all of the boilerplate license text which occurs at the beginning and end of the files. Otherwise, because similar text appears in every file, it will skew the word distributions. Luckily, PG marks the actual text of the book by bracketing it in the words “START OF THIS PROJECT GUTENBERG EBOOK” and “END OF THIS PROJECT GUTENBERG EBOOK”. The front part is easy: we just do a string find to get the location of the first line-feed character after the start text appears. The end part is a little trickier; the easiest way to get it is to reverse the whole book. This means, however, that we also need to flip the search text. Pretty neato, huh?

        with open(f, "rb") as infile:
            #decode as latin-1, which accepts any byte value, so the
            #string searches below work on text
            book = infile.read().decode('latin-1')
        #try to determine if this is a Gutenberg ebook.  If so, attempt
        #to strip off the PG boilerplate
        if "PROJECT GUTENBERG EBOOK" in book:
            b = 0; e = 0    #keep everything if a marker is missing
            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)
            c = list(book); c.reverse()
            book = ''.join(c)
            d = book.find('KOOBE GREBNETUG TCEJORP SIHT FO DNE')
            if d != -1:
                e = book.find('\n', d)
            c = list(book); c.reverse()
            book = ''.join(c)
            book = book[b:len(book)-e]
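
The reverse-and-search trick works, but Python's built-in `str.rfind`, which searches from the right-hand end of a string, can pick out the same slice without flipping the book. The following is a minimal sketch, assuming Python 3. The function name `strip_pg_boilerplate` and the choice to keep the whole text when a marker is missing are illustrative, not part of the script above.

```python
START = "START OF THIS PROJECT GUTENBERG EBOOK"
END = "END OF THIS PROJECT GUTENBERG EBOOK"


def strip_pg_boilerplate(book):
    """Return the text between the PG start and end marker lines."""
    begin, finish = 0, len(book)
    a = book.find(START)
    if a != -1:
        nl = book.find("\n", a)        # first line feed after the start marker
        if nl != -1:
            begin = nl
    d = book.rfind(END)                # last occurrence, searching from the right
    if d != -1:
        nl = book.rfind("\n", 0, d)    # last line feed before the end marker
        if nl != -1:
            finish = nl + 1
    return book[begin:finish]


sample = ("license text\n"
          "*** START OF THIS PROJECT GUTENBERG EBOOK FOO ***\n"
          "Call me Ishmael.\n"
          "*** END OF THIS PROJECT GUTENBERG EBOOK FOO ***\n"
          "more license text\n")
print(repr(strip_pg_boilerplate(sample)))
```

As in the reversed version, the slice starts at the line feed that ends the start-marker line and stops just before the end-marker line.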

The next step is to check the book text for words that aren’t in the dictionary, simply because there is no reason to count words that aren’t part of Standard English. The easiest way to do this on a Linux system like mine is to run the system’s spellcheck, Aspell, on the file. We also want to eliminate duplicate words from this list, since it will save iterations later.

        #see which words aren't in the dictionary
        oddwords = subprocess.check_output(
                    "cat "+f+" | aspell list", shell=True
                    ).decode('latin-1').split()

        #find unique words, folded to lower case (an assumption, so that
        #they compare equal to the case-folded book words)
        u_oddwords = set()
        for w in oddwords:
            u_oddwords.add(w.lower())
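
Piping through `cat` with `shell=True` breaks on file names that contain spaces or shell metacharacters. A possible alternative, sketched here for Python 3, hands the file to Aspell on standard input with no shell involved. The helper name `unknown_words` and its `command` parameter are hypothetical; `aspell list` is the same invocation the script uses.

```python
import subprocess


def unknown_words(path, command=("aspell", "list")):
    """Feed the file at `path` to a spell checker on stdin and return the
    set of words it prints back (for `aspell list`, the unrecognized ones)."""
    with open(path, "rb") as infile:
        output = subprocess.check_output(list(command), stdin=infile)
    # latin-1 maps every byte to a character, so decoding cannot fail
    return set(output.decode("latin-1").split())
```

Returning a set also takes care of the duplicate removal, and membership tests against a set stay fast however long the list of odd words gets.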

Next, we go through the book text and strip out most of the punctuation. The string containing the punctuation to be removed looks a lot like the string you get by calling string.punctuation. Note, though, that I left in the “'” and “-” characters because they are actually a part of contractions and compound words, respectively. I also split the book text, which at this point is one big string, into a list of words and convert them all to the same case.

        #strip out most of the punctuation
        temp = []
        for i in range(len(book)):
            if book[i] not in '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~':
                temp.append(book[i])
        book = ''.join(temp).lower().split()    #lower case is an assumption
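
On Python 3 the same stripping can be done in one pass with `str.translate`. This is a sketch rather than the script's own method; the names `PUNCT`, `STRIP_TABLE`, and `to_words` are made up for the example, and folding to lower case is an assumption chosen to match the word list at the end of the post.

```python
# Same punctuation set as the script: string.punctuation minus ' and -
PUNCT = '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~'
STRIP_TABLE = str.maketrans("", "", PUNCT)


def to_words(text):
    """Drop the characters in PUNCT, lower-case, and split on whitespace."""
    return text.translate(STRIP_TABLE).lower().split()


print(to_words("Well-known: don't panic!"))
# ['well-known', "don't", 'panic']
```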

In the final segment of the script we count how many times the words occur and update the counters, which are kept as a dictionary object. Then we convert the dictionary to a list, sort it, and print the 1000 most common words to a CSV data file. If you need a different number of words, just change the 1000 to another value.

        for w in book:
            if w not in u_oddwords:
                if w not in wordcounts:
                    wordcounts[w] = 1
                else:
                    wordcounts[w] += 1

final_list = []
for w in wordcounts:
    final_list.append([wordcounts[w], w])
#sort by count, most common first
final_list.sort(reverse=True)

with open('wordcounts_pg', 'w') as wc_output:
    for i in range(min(1000, len(final_list))):
        wc_output.write(final_list[i][1]+', '+str(final_list[i][0])+'\n')
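
The count, sort, and truncate steps can also be handed to `collections.Counter` from the standard library. The sketch below assumes Python 3; `top_words` and its parameters are hypothetical names, and it illustrates the idea rather than replacing the code above.

```python
from collections import Counter


def top_words(words, skip=(), n=1000):
    """Count the words not in `skip`; return the n most common as (word, count)."""
    skip = set(skip)
    counts = Counter(w for w in words if w not in skip)
    return counts.most_common(n)


sample = "the cat and the hat and the bat".split()
print(top_words(sample, skip={"hat"}, n=2))
# [('the', 3), ('and', 2)]
```

To accumulate totals over many books, you could keep a single `Counter` and call its `update` method once per file.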

That’s all there is to it. Pretty easy, huh? Now set it to run, detach the terminal, and ignore it until this time tomorrow. My machine can count words in about 1500 books per hour, so it takes about 25 hours to make it through the full sample.
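
As a quick check, the run-time estimate follows directly from the sample size and the throughput quoted above (Python 3 division assumed):

```python
books = 37358           # size of the PG sample
books_per_hour = 1500   # throughput quoted for the author's machine

hours = books / books_per_hour
print(round(hours, 1))  # 24.9
```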

And now, finally, here is the list of words. Feel free to cut and paste it to use for your own projects:

Word Occurrences
the 149164503
of 81154540
and 73797877
to 60771291
a 47925287
in 41773446
that 26590286
was 24584688
he 24462836
i 24025629
it 22795878
his 20173668
is 18378165
with 18081192
as 17645451
for 17473870
had 14408612
you 13939609
be 13252982
on 13207285
not 13181744
at 13015022
but 12718486
by 12438046
her 11878371
which 10826405
this 10263128
have 10196168
from 10088968
she 9778689
they 9715080
all 8819085
him 8771048
were 8314601
or 8143254
are 7787136
my 7572900
we 7412199
one 7373621
so 7203582
their 7018823
an 6518028
me 6419080
there 6267776
no 6185033
said 5938853
when 5899530
who 5878132
them 5808758
been 5787319
would 5689624
if 5655080
will 5166315
what 4895509
out 4556168
more 4440752
up 4416055
then 4222409
into 4129481
has 4000893
some 3929663
do 3914008
could 3749041
now 3747314
very 3630489
time 3571298
man 3559452
its 3544086
your 3522411
our 3517346
than 3494543
about 3349698
upon 3337366
other 3316391
only 3285019
any 3236410
little 3183383
like 2993385
these 2979508
two 2943507
may 2934056
did 2915540
after 2853393
see 2852408
made 2842273
great 2839852
before 2774768
can 2746279
such 2734113
should 2708032
over 2672597
us 2651042
first 2553483
well 2517899
must 2484839
mr 2465607
down 2433044
much 2428947
good 2376889
know 2372135
where 2353232
old 2291164
men 2286995
how 2261780
come 2217201
most 2188746
never 2160804
those 2135489
here 2122731
day 2071427
came 2061124
way 2042813
own 2037103
go 2009804
life 2007769
long 1992150
through 1989883
many 1982797
being 1976737
himself 1941387
even 1915129
shall 1890432
back 1865988
make 1852069
again 1848115
every 1845835
say 1817170
too 1810172
might 1807261
without 1781441
while 1759890
same 1701541
am 1696903
new 1687809
think 1665563
just 1660367
under 1649489
still 1643537
last 1616539
take 1614771
went 1595714
people 1593685
away 1582685
found 1574065
yet 1563963
thought 1556184
place 1543300
hand 1500131
though 1481938
small 1478723
eyes 1469270
also 1467931
house 1438223
years 1435529
another 1415606
don’t 1381480
young 1379348
three 1378462
once 1377940
off 1376942
work 1375035
right 1360201
get 1345597
nothing 1344419
against 1325938
left 1289397
ever 1269433
part 1261573
let 1260289
each 1258840
give 1258179
head 1254870
face 1253762
god 1249406
0 1239969
between 1225531
world 1219519
few 1213621
put 1200519
saw 1190392
things 1188437
took 1172602
letter 1167755
tell 1160034
because 1155609
far 1154860
always 1152942
night 1152416
mrs 1137055
love 1121812
both 1111644
sir 1100855
why 1097538
look 1095059
having 1069812
mind 1067461
father 1062643
called 1062190
side 1053255
looked 1051044
home 1036554
find 1036485
going 1034663
whole 1033731
seemed 1031466
however 1027701
country 1026854
got 1024945
thing 1022424
name 1020634
among 1019175
seen 1012779
heart 1011155
told 1004061
done 1000189
king 995498
water 994392
asked 993082
heard 983747
soon 982546
whom 979785
better 978434
something 957812
knew 956448
lord 956398
course 953585
end 947889
days 929530
moment 926478
enough 925144
almost 916006
general 903316
quite 902582
until 902333
thus 900738
hands 899106
nor 876106
light 869941
room 869532
since 864596
woman 864072
words 858824
gave 857475
b 853639
mother 852308
set 851757
white 850183
taken 848343
given 838078
large 835292
best 833941
brought 833270
does 826725
next 823345
whose 821731
state 820812
yes 817047
oh 815302
door 804702
turned 804433
others 800845
poor 800544
power 797133
present 792424
want 791194
perhaps 789201
death 788617
morning 786748
la 783512
rather 775384
word 774340
miss 771733
less 770410
during 763957
began 762442
themselves 762418
felt 757580
half 752587
lady 742708
full 742062
voice 740567
cannot 738450
feet 737299
order 736997
near 736832
true 735006
1 730887
it’s 727886
matter 726818
stood 725802
together 725703
year 723517
used 723293
war 720950
till 720824
use 719314
thou 714663
son 714275
high 713720
round 710093
above 709745
certain 703716
often 698006
kind 696975
indeed 696469
i’m 690646
along 688169
case 688098
fact 687334
myself 684387
children 683334
anything 682888
four 677704
dear 676320
keep 675722
nature 674055
known 671288
point 668710
p 668356
friend 666493
says 666011
passed 665792
within 665633
land 663605
sent 662540
church 659035
believe 656459
girl 652783
city 650397
times 649022
form 647388
herself 646989
therefore 644835
hundred 640059
john 639007
wife 636379
fire 632762
several 632704
body 630129
sure 629252
money 629251
means 627640
air 626921
open 626306
held 625660
second 622526
gone 614808
already 613870
least 609236
alone 606078
hope 602206
thy 599253
chapter 597339
whether 596307
boy 596048
english 594784
itself 591838
2 591413
women 589579
hear 587189
cried 586705
leave 586112
either 581618
number 576685
rest 575648
child 574531
behind 572007
read 571445
lay 571286
black 569530
government 567320
friends 567282
became 564384
around 559161
river 556286
sea 552753
ground 550622
help 549284
c 548349
i’ll 546929
short 546465
question 545629
reason 545464
become 544896
call 544576
replied 544286
town 543694
family 542309
england 542109
lost 537241
speak 537188
answered 536154
five 535088
coming 534713
possible 534639
making 530530
hour 530471
dead 529575
really 528631
looking 528622
law 528248
captain 525928
different 522269
manner 519256
business 516115
states 511757
earth 511042
st 510820
human 510666
early 508769
sometimes 507383
spirit 506297
care 505984
sat 505109
public 504862
close 503948
towards 503262
kept 502051
french 501813
party 500749
truth 500365
line 498822
strong 498492
book 496520
able 494330
later 494101
return 492237
hard 490701
mean 489853
feel 487798
story 486538
m 485841
received 483744
following 481558
fell 480591
wish 480562
person 480508
beautiful 479656
seems 477423
dark 476293
history 475744
followed 474307
subject 473058
thousand 470929
ten 469675
returned 469387
thee 467513
age 466838
turn 466674
fine 466630
across 466545
show 465685
arms 465504
character 464946
live 464642
soul 463939
met 463300
evening 463176
die 462851
common 459553
ready 457764
suddenly 456627
doubt 455415
bring 453346
ii 453190
red 450793
free 447675
that’s 445572
account 444530
cause 444403
necessary 444147
can’t 443812
need 443326
answer 442440
miles 441924
carried 438793
although 438423
fear 437796
hold 437493
interest 437382
force 436993
illustration 436577
sight 435854
act 435269
master 433105
ask 432510
idea 432424
ye 432036
sense 430693
an’ 430321
art 430226
position 429722
rose 428624
3 427441
company 427142
road 425669
further 425131
nearly 424118
table 424064
everything 423740
brother 423088
sort 422809
south 421800
reached 420190
london 418755
six 418131
didn’t 416216
cut 412716
taking 412571
continued 411607
understand 411326
appeared 409564
sun 407584
none 407168
else 406851
big 406799
o 406388
longer 406382
deep 406170
army 405897
beyond 405580
view 404378
strange 400814
natural 400483
talk 399814
north 398556
suppose 396693
court 396267
service 393925
bed 393878
past 393609
ought 393331
street 392970
cold 391836
hours 391460
toward 390231
added 389818
spoke 389420
seem 388757
neither 388355
late 388105
probably 387568
real 386926
clear 385649
chief 385350
run 385269
certainly 385179
est 384982
united 384930
stand 384385
forward 384028
front 383866
purpose 382457
sound 382443
feeling 382032
eye 380164
happy 378251
i’ve 377633
except 374853
knowledge 374155
blood 373563
low 373268
remember 373173
pretty 372548
change 372221
living 371264
american 369773
bad 369425
horse 369396
peace 369168
meet 366864
effect 365907
boys 364460
en 364172
school 362681
comes 362575
france 360771
fair 359826
forth 359249
died 359161
fall 358176
placed 357047
note 354944
led 354740
saying 354703
length 354502
pass 353234
gold 350268
entered 349397
doing 348304
latter 347844
written 347699
laid 346808
4 344382
according 343990
daughter 343682
opened 343526
dr 340867
trees 339826
distance 339817
office 339771
attention 339722
hair 337835
n 337111
prince 335635
wild 335514
wanted 335167
society 335139
husband 332251
play 331807
wind 330079
green 329633
greater 329453
tried 328784
west 328702
important 327851
ago 327793
bear 325469
various 325246
especially 324511
mine 321967
paper 320046
island 320002
glad 319989
makes 319717
instead 319188
faith 318882
lived 318731
pay 318090
heaven 316878
ran 315958
s 315761
blue 315697
minutes 315172
duty 315065
foot 314708
ship 314700
fellow 314523
letters 313624
persons 311105
action 310840
below 309831
heavy 309808
york 309749
strength 308836
pleasure 307965
immediately 307823
remained 307750
save 306991
standing 306911
whatever 306070
won’t 305381
trouble 305338
e 305293
window 305257
object 305202
try 304928
parts 304007
period 303992
desire 303985
beauty 303513
opinion 303459
arm 303347
system 302641
third 302389
chance 301890
books 301331
george 300975
doctor 300779
british 300353
silence 300238
he’s 300053
enemy 298899
hardly 298533
5 296045
greek 295622
exclaimed 294602
send 293592
food 293239
happened 293092
lips 292334
sleep 291632
influence 290698
slowly 290590
works 289252
months 288930
generally 288629
gentleman 287966
beginning 287473
tree 287341
boat 286781
mouth 285685
there’s 285569
sweet 285425
drew 284944
deal 284389
v 284339
future 284186
queen 284002
yourself 283364
condition 283335
figure 283153
single 283016
smile 282793
places 282793
besides 281838
girls 281703
rich 281130
afterwards 281017
battle 280676
thinking 280651
footnote 280245
presence 279893
stone 279829
appearance 279691
follow 279498
iii 279239
started 278072
caught 277993
ancient 277595
filled 277238
walked 276882
impossible 276720
broken 276365
former 276016
century 275990
march 275880
field 274479
horses 274255
stay 274139
twenty 273187
sister 272290
getting 271641
william 270478
knows 269506
afraid 269150
result 268749
seeing 268724
you’re 268500
hall 267020
carry 266780
arrived 266706
easy 266309
lines 265956
wrote 265929
east 265852
top 265242
wall 264942
merely 264898
giving 264484
raised 264154
appear 264015
simple 263923
thoughts 263760
struck 263694
moved 263492
mary 263463
direction 263444
christ 263262
wood 263260
born 263084
quickly 262966
paris 262393
man’s 262105
visit 261882
outside 260418
holy 260348
entirely 259045
somewhat 259020
week 258960
laughed 258562
secret 258198
village 257758
henry 257557
christian 257504
danger 257486
wait 257012
wonder 256770
learned 256420
stopped 256191
tom 256117
covered 256117
6 255876
bright 255349
walk 255090
leaving 254851
experience 254763
unto 254610
particular 254564
loved 254479
usual 254307
plain 253867
to-day 253804
seven 253567
wrong 253172
easily 252954
occasion 252780
formed 252707
ah 252144
uncle 252120
quiet 252035
write 251743
scene 251380
evil 250993
married 250965
please 250781
fresh 250507
camp 249947
german 248539
beside 248522
mere 248276
fight 247957
showed 247904
grew 247866
expression 247804
scarcely 247641
board 247578
command 247398
language 247302
considered 247260
regard 247101
hill 246854
finally 246533
national 246452
paid 246364
joy 246060
worth 245352
piece 244733
religion 244677
perfect 244671
royal 244615
tears 244448
president 244135
value 244084
dinner 243572
spring 242721
produced 242576
middle 242282
charles 242134
brown 241885
expected 241668
lower 241299
circumstances 241150
remain 241102
wide 240773
political 240686
charge 240464
success 240254
per 240083
officers 239806
hath 239618
indian 239572
observed 239548
lives 239448
respect 238787
greatest 238784
w 238776
cases 238527
tone 238005
america 237215
youth 236992
summer 236698
garden 236552
music 236354
waiting 236223
due 236178
modern 235763
jack 235557
unless 235428
study 235093
allowed 234852
leaves 234652
bit 233774
race 233156
military 232907
news 232435
meant 232274
afternoon 232063
winter 231867
picture 231735
houses 231575
goes 231281
sudden 230675
proper 230476
justice 230410
difficult 229784
changed 229658
grace 229281
chair 228931
10 228875
private 228392
eight 228222
hot 227873
reach 226608
silent 226552
‘i 226540
flowers 226379
laws 226197
noble 225931
watch 225328
floor 225326
killed 225020
built 224484
declared 224477
judge 224393
colonel 224303
members 224213
broke 224166
fast 223897
duke 223481
o’ 223293
shot 223105
sit 222222
usually 222162
step 222119
speaking 222101
attempt 221687
marriage 221054
walls 220575
stop 220466
special 220316
religious 220300
discovered 220260
beneath 219894
supposed 219260
james 219013
gives 218988
forms 218743
turning 218692
authority 218686
original 218519
straight 218414
property 218393
page 218233
plan 218185
drawn 217873
personal 217458
l 217130
cry 217022
passing 216926
class 216527
likely 216216
sitting 215841
cross 215821
spot 215719
soldiers 215683
escape 215311
complete 215288
eat 215120
bound 214985
conversation 214895
trying 214332
meeting 213898
determined 213756
simply 213506
shown 213457
bank 213261
shore 212917
running 212509
corner 212507
soft 212163
journey 212007
isn’t 211316
i’d 211132
reply 210852
author 210827
believed 210653
rate 210607
prepared 210558
lead 210548
existence 210220
enter 209851
indians 209589
troops 209398
wished 209068
glass 208986
notice 208859
higher 208770
social 208685
iron 208019
rule 207943
orders 207856
building 207813
madame 207780
mountains 207700
minute 207575
receive 207440
offered 207306
h 206821
names 206725
learn 206618
similar 206437
closed 206419
considerable 206102
lake 206017
wouldn’t 206012
8 205864
pleasant 205487
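
One way to put the list to work as a stop list, along the lines of the index-suggestion use described at the top of the post, is sketched below for Python 3. The function names `load_common_words` and `index_candidates` are hypothetical, and this is not the index script itself. The parser accepts both the space-separated lines shown above and the "word, count" lines that the script writes.

```python
def load_common_words(lines):
    """Parse lines such as 'the 149164503' or 'the, 149164503' into a set."""
    words = set()
    for line in lines:
        parts = line.rsplit(None, 1)    # split off the trailing count
        if len(parts) == 2 and parts[1].isdigit():
            words.add(parts[0].rstrip(","))
    return words


def index_candidates(words, common):
    """Keep words that are not on the common-word list, in first-seen order."""
    seen, keep = set(), []
    for w in words:
        w = w.lower()
        if w not in common and w not in seen:
            seen.add(w)
            keep.append(w)
    return keep


common = load_common_words(["Word Occurrences", "the 149164503", "of, 81154540"])
print(index_candidates("The history of the Peloponnesian war".split(), common))
# ['history', 'peloponnesian', 'war']
```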

And here is the complete script:

#! /usr/bin/env python

'''Word frequency analyzer'''

import os, stat, string, subprocess


filelist = os.listdir('.')
wordcounts = {}    #running totals, keyed by word

counter = 0
for f in filelist:
    counter += 1
    if os.path.splitext(f)[1] == '.txt':
        #status line: file name, position in the listing, percent complete
        print(f+'\t'+str(counter)+' of '+str(len(filelist))+'\t'+
              str((float(counter)/float(len(filelist)))*100)[:6]+'%')
        with open(f, "rb") as infile:
            #decode as latin-1, which accepts any byte value, so the
            #string searches below work on text
            book = infile.read().decode('latin-1')
        #try to determine if this is a Gutenberg ebook.  If so, attempt
        #to strip off the PG boilerplate
        if "PROJECT GUTENBERG EBOOK" in book:
            b = 0; e = 0    #keep everything if a marker is missing
            a = book.find("START OF THIS PROJECT GUTENBERG EBOOK")
            if a != -1:
                b = book.find('\n', a)
            c = list(book); c.reverse()
            book = ''.join(c)
            d = book.find('KOOBE GREBNETUG TCEJORP SIHT FO DNE')
            if d != -1:
                e = book.find('\n', d)
            c = list(book); c.reverse()
            book = ''.join(c)
            book = book[b:len(book)-e]
        #see which words aren't in the dictionary
        oddwords = subprocess.check_output(
                    "cat "+f+" | aspell list", shell=True
                    ).decode('latin-1').split()

        #find unique words, folded to lower case (an assumption, so that
        #they compare equal to the case-folded book words)
        u_oddwords = set()
        for w in oddwords:
            u_oddwords.add(w.lower())
        #strip out most of the punctuation
        temp = []
        for i in range(len(book)):
            if book[i] not in '!"#$%&()*+,./:;<=>?@[\\]^_`{|}~':
                temp.append(book[i])
        book = ''.join(temp).lower().split()    #lower case is an assumption
        for w in book:
            if w not in u_oddwords:
                if w not in wordcounts:
                    wordcounts[w] = 1
                else:
                    wordcounts[w] += 1

final_list = []
for w in wordcounts:
    final_list.append([wordcounts[w], w])
#sort by count, most common first
final_list.sort(reverse=True)

with open('wordcounts_pg', 'w') as wc_output:
    for i in range(min(1000, len(final_list))):
        wc_output.write(final_list[i][1]+', '+str(final_list[i][0])+'\n')

First Preview of my Upcoming Book

Last week I posted a new academic working paper that roughly parallels Chapter 11 of my upcoming book.  The version in the book will be written at a different reading level and without the math equations, but this is still a pretty good taste of what is coming.

Screenshot of paper from

Scholars like to post these preliminary drafts for several reasons.  The most important one for an independent researcher like myself is to receive feedback and suggestions prior to submission.  Another reason is to make findings available to the community sooner.  The average turn-around time to publish a journal article is two or three years and the field may have moved on by the time the paper hits the presses.

I probably don’t need to worry about obsolescence with this particular article, since the events with which it deals happened back in the 1950’s and 1960’s.  My book will be a study of the role of outside scholars in our society and, in particular, their ability to shape public policy.  Outside Scholars, in my usage, are people who engage in research and knowledge creation without being formally affiliated with the dominant academic community.  This particular article/chapter deals with an outside scholar named Victor Sharrow who devoted his life to arguing for what he saw as the “correct” interpretation of the Fourteenth Amendment.  He was ultimately unsuccessful, but I feel his career provides several intriguing insights as a characteristic outside scholar narrative.

Sharrow saw the Fourteenth Amendment as the key to dismantling the Jim Crow system in the South.  In the months prior to the 1958 election he mounted an intense one-man lobbying campaign to sway Dwight Eisenhower and other politicians to his views.  In my article I examine several of his arguments from a standpoint of modern data science.

Those of you who read my posts on data science and Python programming might be interested in the simulation models I describe in the paper.  I would be happy to send my spreadsheet and code to anyone who is interested.  Just e-mail me or message my Facebook page.

If all goes well, the book should be released in late 2016 or early 2017.


Degree of Voting Restriction by State in 1956, as Calculated by the Model Described in my Paper


Practitioner Focused Doctorates?

The other day, a post from the University of Phoenix showed up in my news feed, extolling their doctoral programs.  I couldn’t resist firing off a quick comment:

post on University of Phoenix's Facebook Page

I stand by the claim.  The primary factor that affects the perceived quality of a doctoral program is the number of graduates who get assistant professor jobs at top schools.

UOP’s community manager responded to my comment by throwing out a red herring about how the programs are “practitioner focused”, which has nothing to do with what I said. Even if you are working in the private sector, the reputation of the school matters, and the reputation is driven by tenure-track placements.  I was interested, though, to find out that the program has been going on since 2002.  Then again, it isn’t surprising I had never heard of it, since no UOP graduates were teaching at the university where I went to graduate school, or seem to be publishing in any of the journals I read.

University of Phoenix community manager's reply to my comment

Later one of their current doctoral candidates tried to turn the discussion around and make it about me.  Thank you, Jennifer, but I already have a career.  Some of us are interested in the system itself, not just punching our own tickets.

one of University of Phoenix's doctoral candidate's reply to my comment
So, as amusing as it is to troll the University of Phoenix on social media, why am I bringing this up on my blog?  Well, as I thought about it over the weekend, I realized the existence of such a thing as an online “practitioner focused” doctoral degree is symptomatic of a larger educational issue.

First, let’s consider why any practitioner, which I take to mean someone who is not interested in teaching or public sector research, would need a doctoral degree.  I can only think of two reasons:  either they think it will prepare them for some sort of private sector research, or it is purely for prestige–one more certificate on the “I love me” wall of the office.

The first possibility is dubious.  In my own field, it is hard to think of any sort of research one could do with a DBA that they couldn’t do with an MBA.  Really, once you have a handle on statistics, theory of knowledge, and the basic experimental and data gathering methods then the rest is just reviewing the literature and keeping current in your own specialty.  I did all of the above in business school.  Then again, I went to the University of California, not the University of Phoenix.

The second possibility seems more likely, but also disturbing and a bit odious.  If people are getting doctoral degrees purely to get a pay bump or impress consulting clients, and not because of a calling to academia or because they really want to create knowledge, then the product itself, the degree, becomes much harder to differentiate.  The market for doctoral degrees moves away from monopolistic competition towards a purely competitive situation; one doctorate is as good as another, so schools compete on price.  They maximize their profit by pricing a doctorate so that their marginal revenue is equal to their marginal cost, so they have every incentive to push down the marginal cost, so as to push down the price and sell more degrees.  The actual academic content provides very little of the value proposition, and is neglected.  The degree is cheapened.  In other words, the same thing happens to the DBA that is happening to the MBA.

The DBA becomes the new MBA.  The MBA (or some other master’s degree) is already the new BA.  Meanwhile, President Obama is pushing for free community college for most students, which will effectively make the AA the new high school diploma.  The entire educational process becomes stretched out, and for what?  I’m convinced that students don’t learn any more by the time they graduate than they did a generation ago.

The problems in our educational system exist at every level, from kindergarten to postdoc, and I certainly don’t know how to fix them.  But I do believe that the only reason to get a doctoral degree that makes sense is because you want to be an academic, and only if the degree itself still means something.

With apologies and all due respect, University of Phoenix, please do not expect to see an application packet from me any time soon.

This post was published simultaneously on LinkedIn.

Easy Double Exponential Smoothing in Python

I realized this morning that it has been a while since I posted any Python code. I’ve been a bit busy with Handyman Kevin and haven’t been doing much data science. Still, I decided it was time to carve out a couple hours this morning to practice my skills. The result is these functions, which perform basic double exponential smoothing using the Holt-Winters method. I deliberately avoided using NumPy, SciPy, or any other libraries. It isn’t that I dislike NumPy/SciPy (far from it), but you can’t always get sysadmins to install extra libraries on the machines you’re using, especially if you are a guerrilla data scientist like me.

There are a lot of different time series methods out there, and they all have their points. Holt-Winters is the one that I keep coming back to, though. One of the reasons is simplicity–I can always remember it and bang it into a spreadsheet without needing to Google anything or download libraries. About the 40th time I typed it into a spreadsheet, though, it occurred to me that it would be smart to implement it in Python so I could save some typing.

The first function, MAPE, simply calculates the mean absolute percentage error (MAPE) of a list of estimated values, as compared to a list of actual values.
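As a quick sanity check of what MAPE measures, here is a minimal worked example with made-up numbers. It assumes the error is reported as a fraction rather than multiplied by 100, and that none of the actual values is zero.

```python
# Made-up data, purely for illustration
actual = [100.0, 200.0, 400.0]
estimate = [110.0, 180.0, 400.0]

# Absolute error of each estimate as a fraction of the actual value
pct_errors = [abs(e - a) / abs(a) for a, e in zip(actual, estimate)]

# MAPE is just the mean of those fractions
mape = sum(pct_errors) / len(pct_errors)
print(round(mape, 4))   # 0.0667
```

The first two estimates are each off by 10% and the third is exact, so the mean works out to about 6.7%.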

The next function, holtwinters, uses Holt-Winters to predict the next three values in a time series. You need to supply two smoothing coefficients, alpha and beta, for the level and trend, respectively. Typically, you would have a pretty good idea what these were from doing similar forecasts in the past.
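For reference, the textbook form of double exponential smoothing keeps a running level and a running trend and extrapolates both. The sketch below is that generic recurrence, not necessarily line for line what the holtwinters function in this post does. The name holt_forecast, the way the level and trend are initialised, and the sample series are all assumptions made for illustration.

```python
def holt_forecast(ts, alpha=0.5, beta=0.5, horizon=3):
    """Textbook double exponential smoothing (level + trend).

    Hypothetical helper for illustration; starts the level at the first
    observation and the trend at the first difference.
    """
    if len(ts) < 2:
        raise ValueError("need at least two points")
    level = float(ts[0])
    trend = float(ts[1] - ts[0])
    for y in ts[1:]:
        prev_level = level
        # level: blend the new observation with last period's projection
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # trend: blend the latest change in level with the old trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # extrapolate the final level along the final trend
    return [level + h * trend for h in range(1, horizon + 1)]


print(holt_forecast([1, 2, 3, 4, 5]))   # a straight line continues: [6.0, 7.0, 8.0]
```

A larger alpha makes the level chase recent observations more closely, and a larger beta does the same for the trend.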

If you don’t know the coefficients then use the third function, holtwinters_auto, to automatically determine them. This function uses a grid search. Those of you who have read my monograph probably remember that I’m not usually wild about grid searches. In this case it makes sense, though, since you don’t usually need more than a few digits of precision on the coefficients.
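The idea behind an iterative grid search is easy to show on a toy problem: search a coarse grid, re-centre on the best point, shrink the step by a factor of ten, and repeat once per digit of precision. The sketch below is a generic version under stated assumptions. The function name refine_grid_search, its parameters, and the toy loss are hypothetical and stand in for the MAPE of a forecast.

```python
def refine_grid_search(loss, digits=3, start=(0.5, 0.5)):
    """Coarse-to-fine grid search over two coefficients in [0, 1].

    `loss` is any function of (a, b) to minimise.
    """
    best_a, best_b = start
    best = loss(best_a, best_b)
    for d in range(1, digits + 1):
        step = 10.0 ** -d
        # 11 x 11 grid centred on the best point from the previous pass
        cand_a = [best_a + k * step for k in range(-5, 6)]
        cand_b = [best_b + k * step for k in range(-5, 6)]
        for a in cand_a:
            for b in cand_b:
                if not (0.0 <= a <= 1.0 and 0.0 <= b <= 1.0):
                    continue   # keep both coefficients inside [0, 1]
                value = loss(a, b)
                if value < best:
                    best, best_a, best_b = value, a, b
    return best_a, best_b, best


# Toy loss with a known minimum at (0.237, 0.861)
a, b, err = refine_grid_search(lambda a, b: (a - 0.237) ** 2 + (b - 0.861) ** 2)
print(round(a, 3), round(b, 3))   # 0.237 0.861
```

Each pass costs at most 121 evaluations, so three digits of precision takes a few hundred calls instead of the million a single fine grid would need. The catch is that it can settle on a local minimum if the error surface is bumpy.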


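def MAPE(actual, estimate):
    '''Given two lists, one of actual values and one of estimated values,
        computes the Mean Absolute Percentage Error'''
    if len(actual) != len(estimate):
        raise ValueError("Lists not the same length.")
    pcterrors = []
    for i in range(len(estimate)):
        # Reconstructed loop body (assumption): absolute error as a fraction
        # of the actual value, which must not be zero.  float() avoids
        # integer division on older Pythons.
        pcterrors.append(abs(estimate[i]-actual[i])/float(abs(actual[i])))
    return sum(pcterrors)/len(pcterrors)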
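
def holtwinters(ts, *args):
    '''Uses the Holt-Winters exp. smoothing method to forecast the next
       three points in a time series.  The second two arguments are
       smoothing coefficients, alpha and beta.  If no coefficients are given,
       both are assumed to be 0.5.'''
    if len(args) >= 1:
        alpha = args[0]
    else:
        alpha = .5
    if len(args) >= 2:
        beta = args[1]
    else:
        beta = .5
    if len(ts) < 3:
        raise ValueError("At least three points are required for TS forecast.")
    ts = list(ts)   # work on a copy so the caller's list is never reordered
    est = []    # estimated value (level)
    trend = []  # estimated trend
    # NOTE: the appends and reversals below are reconstructed from the
    # comments and the surviving update formulas; treat them as assumptions.
    # For first value, assume trend and level are both 0.
    est.append(0)
    trend.append(0)
    # For second value, assume trend still 0 and level same as first
    # actual value
    est.append(ts[0])
    trend.append(0)
    # Now roll on for the rest of the values
    for i in range(len(ts)-2):
        trend.append(beta*(ts[i+1]-ts[i])+(1-beta)*trend[i+1])
        est.append(alpha*ts[i+1]+(1-alpha)*est[i+1]+trend[i+2])
    # now back-cast for the first three values that we fudged: run the same
    # update over the reversed series (the trend changes sign going backwards)
    ts.reverse()
    est.reverse()
    trend = [-t for t in reversed(trend)]
    for i in range(max(2, len(ts)-3), len(ts)):
        trend[i] = beta*(ts[i-1]-ts[i-2])+(1-beta)*(trend[i-1])
        est[i] = alpha*ts[i-1]+(1-alpha)*est[i-1]+trend[i]
    ts.reverse()
    est.reverse()
    trend = [-t for t in reversed(trend)]
    # and do one last forward pass to smooth everything out
    for i in range(2, len(ts)):
        trend[i] = beta*(ts[i-1]-ts[i-2])+(1-beta)*(trend[i-1])
        est[i] = alpha*ts[i-1]+(1-alpha)*est[i-1]+trend[i]
    # Holt-Winters method is only good for about 3 periods out
    lasttrend = beta*(ts[-1]-ts[-2])+(1-beta)*trend[-1]
    next3 = [alpha*ts[-1]+(1-alpha)*est[-1]+lasttrend]
    # Assumed: periods two and three carry the last trend forward
    next3.append(next3[-1]+lasttrend)
    next3.append(next3[-1]+lasttrend)
    return next3, MAPE(ts,est)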

def holtwinters_auto(ts, *args):
    '''Calls the holtwinters function, but automatically determines the
    alpha and beta coefficients which minimize the error.
    The optional argument is the number of digits of precision you need
    for the coefficients.  The default is 4, which is plenty for most real
    life forecasting applications.'''
    if len(args) > 0:
        digits = args[0]
    else:
        digits = 4
    # Perform an iterative grid search to find minimum MAPE
    alpha = .5
    beta = .5
    best = holtwinters(ts, alpha, beta)[-1]
    for d in range(1, digits+1):
        # freeze this pass's grid around the best point found so far
        alphas = [x * .1**d+alpha for x in range(-5,6)]
        betas = [x * .1**d+beta for x in range(-5,6)]
        for b in betas:
            for a in alphas:
                if not (0 <= a <= 1 and 0 <= b <= 1):
                    continue    # smoothing coefficients must stay in [0, 1]
                mape = holtwinters(ts, a, b)[-1]
                if mape < best:
                    best, alpha, beta = mape, a, b
    next3, mape = holtwinters(ts, alpha, beta)
    return(next3, mape, alpha, beta)

Update on Life After College

It occurred to me this morning that I haven’t posted an update since before I graduated from business school, back in June.  I wouldn’t want the Internet to get the impression that I was resting on my laurels.  In fact, between moving and my various projects, I just haven’t had time to post.  I did think that a quick summary of what I have afoot would not be amiss, however.

After graduation, LAP invited me to reformat my MBA thesis so they could publish it as a monograph.  It should come out later this week under the title Freight Forwarding Cost Estimation:  An Analogy Based Approach (ISBN: 3659588598).  I am currently working on ways to apply the same techniques I used to predict international freight prices to new applications, including predicting blood sugar levels in diabetes patients.  Some of my preliminary work looks fairly promising, and I will probably be ready to write a serious grant proposal in a few weeks.

Although I remain committed to academic research in data science, I have recently been focusing more attention on popular nonfiction.  I have formed a production company, Creative Minority Productions, to serve as an umbrella for my various nonfiction writing and video projects.  As Creative Minority, I am currently producing two long format YouTube television programs.  Handyman Kevin is a how-to program that walks viewers through common home repair and woodworking projects using simple tools.  It is currently in post production, and will air on YouTube starting September 17.  Everybody’s Data Guide is also how-to and focuses on accomplishing everyday data science and statistical analysis tasks using free and open source software.  Both Handyman Kevin and Data Guide will be augmented with companion blogs and tons of supplemental online content.  The companion blogs will eventually form the basis for companion e-books for each channel.

Creative Minority is more than just a producer of YouTube content, however.  At least two major nonfiction writing projects are in the pipeline.  I’m personally excited, and wish I could give more details, but I can’t say more until all of the rights have been negotiated. 

And finally, lest you think that I have given up on writing fiction, I currently have finished manuscripts for two short stories and a novel, which I will keep submitting to publishers. 

Exciting times!  Business school was an incredible experience, and I feel like it substantially improved my teaching and editing skills.  However, it left very little time or working capital for my own creative projects.  Now I’m free to create content in which I am personally interested, and I plan to make the most of the opportunity. 

Commercial vs GPL: Data Analysis Software Showdown

As graduation nears, I look at my computer desktop and realize that most of the academic software licenses will expire before I start my next graduate program. For a data scientist, this is serious. How am I going to cross-tab survey results without SPSS? Am I going to have to do my stepwise regressions manually now? How am I going to create presentation quality geographic displays without Tableau? What package am I going to use for linear algebra? I freak out pretty badly, until I remember that just about every application you could want for data analysis is available for free on a GPL license that never expires.

Some of the software I use is already open source. Data scientists’ two favorite programming languages, Python and R, are already open source. The same goes for Linux, the operating system that runs on five of my seven computers. Many of the applications I use, though, are commercial but either the company or my university gives me a free license. What follows is a quick survey of open source alternatives for the most commonly used software.


Spreadsheet

Commercial: Microsoft Excel
Open Source: Gnumeric

Spreadsheet applications are to the data analyst what a table saw is to a woodworker: the big tool in the middle of the shop that gets used somehow in nearly every project. For many of us, Excel is the first spreadsheet we learn, probably because it is standard equipment on most office and university computers.

Excel is the Chrysler New Yorker of spreadsheet applications–huge and comfortable but not too nimble, loaded with lots of features that are nice to have, but you don’t really need them. Then again some features, like the way Excel handles data tables, advanced filtering, and pivot tables, can save a lot of time. Even the conditional formatting is nice to have. Plus, if you need to interface with business major types, Excel will be the only spreadsheet they’ve ever heard of.

Excel has plenty of drawbacks too, though. It is a huge program. It only runs on Windows (or OS X, if you don’t need to run any add-ins). The only natively supported scripting language is VBA. Perhaps worst of all, and unforgivably, it’s slow. If you haven’t noticed this for yourself, go try to do some sensitivity analysis on a simulation with 10,000 or more trials. Expect to have time for two or three cups of coffee every time you press F9.

Gnumeric is a totally different take on the concept of spreadsheets. When I first experimented with it about 15 years ago, I concluded that it was too limited to be useful. Since then, however, the project has reinvented itself as the lightweight, stripped down spreadsheet for data analysis. If Excel is a Chrysler New Yorker, then Gnumeric is a Dodge Dart.

In recent years, an SPSS style “statistics” menu has appeared in the Gnumeric interface. Now the most-used features of a spreadsheet and a statistics package are within easy reach, which will appeal to anyone who ever spent a morning clicking back and forth between Excel and SPSS while they analyzed a data set.

By far the most appealing feature of Gnumeric is that it uses Python as one of its scripting languages. This means that not only is it painless to create user functions but, given the multitude of libraries available for Python, you probably won’t need to very often. Excel’s non-linear solver seems pretty rudimentary when you have access to scipy.optimize. Also, since Gnumeric allows Python and C plug-ins, it is useful as a graphic front end to more complicated programs written in these languages. Pretty cool, especially considering the whole application is still lightweight enough to run on a $100 garage sale computer.

Other possibilities: OpenOffice and its fork, LibreOffice, are also free and seem to be designed as more direct replacements for Excel. If you need a more general purpose spreadsheet, either might be a good choice.

Statistics Package

Commercial: SPSS, SAS, Minitab
Open Source: PSPP

When all you need to do is calculate a few confidence intervals or run a t-test, a spreadsheet application will probably be adequate. If you are creating complex statistical models, you are probably going to write them in R. Between these extremes, about 90% of statistics work gets done in a statistics package. Which one you prefer probably depends on which one your college statistics professor used. The thing they all have in common, however, is that a full license costs a small fortune. Usually, the coolest add-on packages (for simulation, predictive analytics, etc.) are even more of a buy-up. Luckily, for those of us of modest means, there is PSPP.

PSPP is intended as a direct clone of SPSS, but is GPL licensed, so it is completely free to use. One important difference is that PSPP is a GNU project written in C, with the full source code available. This means that if you are a C hacker, you can dig in and extend it yourself. It also means that PSPP is not tied to a single platform; it runs on Linux, Windows, and OS X.

Linear Algebra System

Commercial: Matlab
Open Source: Octave

When you start building serious matrix-heavy models, such as anything involving Markov chains or finite element analysis, you are going to want to seriously think about using a language that is built for linear algebra. Sure, Python has good linear algebra support through NumPy and other libraries. But Python is basically a general purpose, list based language. Linear algebra looks ugly in Python, and ugly code takes longer to write and is harder to debug. It’s better to use the right tool for the job.
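To make the "list based" point concrete, here is a minimal sketch of one Markov chain step written with nothing but plain Python lists. The two-state transition matrix is made up for illustration, and the helper name step is an assumption.

```python
# Hypothetical two-state Markov chain: rows are "from", columns are "to"
P = [[0.9, 0.1],
     [0.5, 0.5]]


def step(state, P):
    """One transition: the row vector `state` times the matrix `P`."""
    n = len(P)
    return [sum(state[i] * P[i][j] for i in range(n)) for j in range(n)]


state = [1.0, 0.0]          # start in the first state with certainty
for _ in range(2):
    state = step(state, P)
print([round(p, 4) for p in state])   # [0.86, 0.14]
```

In Matlab or Octave the whole step function collapses to the single expression state * P, which is the kind of difference that adds up over a large model.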

The first good language for linear algebra is Matlab. It is still incredibly popular, especially among the PhD Engineering crowd. For decades now, however, there has been a free open source alternative. Octave started out as a clone of Matlab and the language syntax is still very similar. However, development on Octave often moves a little faster than Matlab, and Octave often gets features and functions before Matlab. This has caused Octave to become somewhat less compatible with Matlab as time goes on. Still, anyone who can code in Matlab should have no trouble picking up Octave, and Octave is free.

Image Editing and Drawing Software

Commercial: Adobe PhotoShop, Adobe Illustrator
Open Source: GIMP, Inkscape

While few of us are visual artists, either by training or inclination, visual display of data is an important part of data analysis. It is important to be able to illustrate our findings in professional quality maps, charts, and diagrams. To do this, you need to have the ability to work with bitmap images and vector drawings. The Adobe products have been the commercial state of the art for some time now, but GIMP and Inkscape have most of the same functionality. Like the commercial programs, you tend to use them together as a team. One word of warning: if you are migrating from Adobe to the open source programs, you will find that the interfaces are very different, at least in the stock configuration. These programs are replacements, not clones.

One interesting feature of both GIMP and Inkscape is that both allow you to create scripts and plug-ins. One of the languages that is supported is (you guessed it) Python. While I haven’t experimented with it much, it seems like there is the potential here for some pretty serious data display. For instance, it seems like you could build a pretty killer mapping plug-in for Inkscape to display geographic data on different layers of a drawing.

Python Histograms from the Console

Lately, while working on my MBA thesis, there have been many times when I’ve been working in Python and wanted to plot a quick histogram of a distribution. The whole process has been far too time consuming. Either I need to remember how to write my list out as a CSV file so I can plot it in Excel/Tableau/SPSS (none of which, IMHO, has a particularly intuitive mechanism for drawing histograms) or I need to be able to plot the histogram directly from Python. The latter would be fine, except that I’m usually in an SSH shell and I always have trouble with X servers on my Windows 8 laptop.

So, anyway, I wrote this little function and it works quite nicely for plotting histograms in pure text mode. It should be fine with any terminal client of the last 50 years. It isn’t particularly sophisticated, but it works for me and may work for you too.

def crappyhist(a, bins):
    '''Draws a crappy text-mode histogram of an array'''
    import numpy as np

    h,b = np.histogram(a, bins)
    peak = max(np.amax(h), 1)   # tallest bin; at least 1 so we never divide by 0

    for i in range(bins):
        # left edge of bin i, then a bar scaled so the tallest bin is 70 wide
        print('%10.4g | %s' % (b[i], '#'*int(70*h[i]/peak)))
    # right edge of the last bin
    print('%10.4g' % b[bins])

Screen shot of the crappyhist() function in action.
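If NumPy isn't installed on the box you're logged into, the same idea works with nothing but the standard library. This is a minimal sketch rather than a drop-in replacement: the name text_hist, its width parameter, and the sample data are all assumptions for illustration.

```python
def text_hist(values, bins=10, width=50):
    """Return the lines of a text histogram (hypothetical NumPy-free variant)."""
    lo, hi = min(values), max(values)
    span = float(hi - lo) or 1.0     # avoid a zero-width range
    counts = [0] * bins
    for v in values:
        # map v to a bin index; the maximum value falls in the last bin
        idx = min(int((v - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        left_edge = lo + i * span / bins
        # left edge of the bin, then a bar scaled to at most `width` characters
        lines.append('%10.4g | %s' % (left_edge, '#' * int(width * c / peak)))
    return lines


for line in text_hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5):
    print(line)
```

Returning the lines instead of printing them directly makes it easy to write the same histogram to a log file.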