The Inherent Limits of Indexing

In the 1870s, a very young Melvil Dewey (Dewey was born in 1851) set out to create a universal cataloging system for libraries.  He wasn’t the first person to try such a thing, of course.  People have probably been titling and sorting documents ever since the first person realized that finding a requested document meant going through the whole stacks, because of course it wasn’t where it had last been seen.

For a long time fairly brief descriptions would have been sufficient:  perhaps just a title and an author’s name.  Sometimes even the title was added by a later cataloger:  thus the Babylonian creation myth is simply titled Enuma Elish, from its first two words, meaning “When on high…”.

Once these annotations were added, books could be sorted by author and title.  And even if a book was put in the wrong place, it was possible to find it, not by opening and reading the first paragraph or so, but by simply looking at the author/title annotation.

This worked for a while, but eventually this, too, became unwieldy.  It wasn’t just that some authors wrote quite a lot.  It was also that many authors wrote on widely different subjects.  The first encyclopedic work was probably Pliny the Elder’s Natural History, which was incomplete at the time of his death in 79 CE (Pliny died in the eruption of Vesuvius.  He wasn’t in Pompeii–he went into the area where ash, stones, and (it turned out) poisonous fumes were all around, and apparently suffocated.)  Because ‘Natural History’ in those days included a wide variety of subjects (including mineralogy), even the unfinished work had to be searchable.  So though Pliny himself wrote in a novelistic style, he added finding aids such as a table of contents.

It wasn’t long, however, before this simple cataloging became insufficient, as books, pamphlets, etc. multiplied.  For quite some time titles became long enough to fill the title page, and sometimes longer.  Where works were divided into chapters (which was often necessary for longer works), chapter headings became essentially abstracts, explaining what was in each chapter; and even after the development of Tables of Contents, the long chapter headings remained in use for quite a while.

The custom of indexing books probably became necessary by the late 16th century, as more and more books came into existence, and as books covering a wide variety of subjects became more common.  Although we associate indexes almost solely with nonfiction, Sir Walter Scott also provided indexes for some of his novels.  An index is not a replacement for tables of contents, abstracts, and chapter titles.  Rather, it’s a supplement, helping readers to find the subjects they’re searching for more precisely:  an index, when well made, points to the pages on which particular subjects are discussed, whereas a table of contents mostly tells only what page a certain chapter begins on.
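For the programmers in the audience, the difference between a table of contents and an index can be sketched in a few lines of Python.  This is only an illustration:  the page texts, the chapter names, and the function build_index are all made up for the example, and a real indexer exercises far more judgment than a substring match.

```python
from collections import defaultdict


def build_index(pages, terms):
    """Map each term to the sorted list of pages whose text mentions it.

    `pages` maps page number -> page text (hypothetical sample data).
    Matching is a naive case-insensitive substring test, nothing more.
    """
    index = defaultdict(list)
    for page_number in sorted(pages):
        text = pages[page_number].lower()
        for term in terms:
            if term.lower() in text:
                index[term].append(page_number)
    return dict(index)


# Hypothetical three-page "book".
pages = {
    1: "Minerals and gems",
    2: "Birds of the coast",
    3: "Gems and their uses",
}

# A table of contents only records where each chapter starts.
table_of_contents = {"Chapter 1": 1, "Chapter 2": 2}

# An index records every page on which a subject turns up.
subject_index = build_index(pages, ["gems", "birds"])
print(subject_index)
```

The table of contents would send a reader looking for gems to page 1 at best;  the index also knows about page 3.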

But all this is assuming you have the book you need in hand.  So there came a time when it was necessary to catalog books not only by title and author, but also by subject.  And since a researcher is likely to want to find books on the same subject in one area, shelving began to be done by subject, at least in the case of nonfiction.

Up until the 1870s, people had tried various methods of cataloging.  When Jefferson sold his personal library to the fledgling Library of Congress, for example, it was already cataloged, by Jefferson himself, using a system based on Francis Bacon’s theories of knowledge.  What Dewey and others of his time (Cutter, for example, and Putnam) were trying to create was more ambitious:  a uniform cataloging system to be used in ALL libraries.

This was never achieved:  but Dewey’s was the most commonly used, at least in the US, until the Library of Congress developed its own system, which eventually predominated.  But where the Library of Congress system is an inductive system (categories are not created unless at least one book on a subject exists), Dewey tried to establish a system to codify ALL possible subjects.

He did this by a top-down system, in which he first divided all knowledge into 10 overarching categories (hence ‘decimal’, although there are also decimals appended to the three-digit numbers).  Then each of the original ten categories was further subdivided into ten subcategories, and those were divided into ten more…

The thousand basic categories were then assigned three-digit numbers.  But this was not sufficient in itself for large libraries:  so the numbers after the decimal point were added to further clarify matters.  And then, in order to distinguish authors and titles (to make shelving easier), Cutter supplied letter/number combinations to follow the subject numbers (on a separate line).
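The nesting described above can be made concrete with a small Python sketch.  It assumes only what the text says (three digits, ten-way splits at each level, optional digits after the decimal point);  the function name dewey_levels and the labels in the returned dictionary are my own inventions, not official terminology from the schedules, and the Cutter line is ignored entirely.

```python
def dewey_levels(class_number):
    """Split a Dewey-style class number into its nested levels.

    Assumes the form 'DDD' or 'DDD.ddd...' as described in the text.
    The dictionary keys are illustrative labels, not official terms.
    """
    whole, _, fraction = class_number.partition(".")
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError("expected exactly three digits before the decimal point")
    if fraction and not fraction.isdigit():
        raise ValueError("expected only digits after the decimal point")
    return {
        "top_category": whole[0] + "00",   # one of the 10 overarching categories
        "subcategory": whole[:2] + "0",    # one of 100 second-level groups
        "basic_category": whole,           # one of the 1000 three-digit numbers
        "extension": fraction,             # further refinement, may be empty
    }


# Ten choices at each of three levels gives the thousand basic categories.
basic_category_count = 10 ** 3

print(dewey_levels("891.73"))
print(basic_category_count)
```

Every extra digit after the decimal point is one more ten-way split, which is why a crowded number like 891 can sprout such long tails.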

Of course, this system could not be perfect.  For one thing, it was based largely on Dewey’s own somewhat idiosyncratic ideas.  Thus, because he was planning for books mostly written in English, Dewey supplied several three-digit categories for literature in English:  but other languages were not so richly cataloged.  So, for example, there is only one three-digit number for all Slavic literature (891), and even that is shared with other Eastern European languages and Celtic.  If a library has a LOT of (say) Russian literature, the numbers after the decimal can get so long that they slop over onto another line.  Or two, or…

Then there’s the fact that Dewey had no way of knowing what technologies were yet to be developed.  He did include catchall categories (‘general knowledge’, for example), into which new fields could be shoehorned–but this resulted in some pretty odd neighbors.

The Cutter numbers, as well, had their problems.  There are whole volumes of cataloging rules to determine how an author’s name will be filed.  And some of the results are more than a little strange.  Thus, for example, there was a religious leader who wrote in India called The Mother.  She published several books under this name:  but you won’t find them filed under this name, because she was born in France:  so the books are filed under the name La Mère.  Or rather, Mère, La.  If you’re looking for poetry by Pope (now Saint) John Paul II, he’s listed as a corporate author, even if he wrote the poetry before he was elected to the papacy.  If you’re trying to find one of the publications of the Showa Emperor on marine biology, you may have to use his personal name, which it’s impolite to use once the Emperor is deceased.  And because, in translated works (especially classical works), the translator is very important, many Classics libraries have developed Cutter numbers which emphasize the translator’s name.

All of these problems can often be resolved by scope notes (those ‘see’ and ‘see also’ notes, for example).  These scope notes, in the US, are most often provided by the Library of Congress Subject Headings.  Most of them are pretty straightforward, but some of them are landmines of controversy.  When I tried to look up ‘Trail of Tears’, for example, I was redirected to ‘see: Cherokee Removal, 1838’.  My response to this was profane:  but I’m only collaterally involved.  People who lost ancestors and other relatives in that ‘removal’ will probably take it even more seriously.  Categorization is not always dispassionate.

But even with these imperfections, it would seem that Dewey managed to solve an intractable problem fairly neatly, right?  Well, not exactly.  A book of a hundred pages or more may not have only one subject, or even only one main subject.  Thus, for example, is a book entitled Poetry And Mysticism primarily a book about religion–or about literary criticism?  Or is a book about Watergate to be placed in a category about political corruption…or under History of the Nixon Administration?  I can tell you which I’d choose:  but the Library of Congress catalogers chose differently.

Then there are the errors that are obviously simple mistakes.  One book titled Storia della Literatura Italiana (‘History of Italian Literature’) for example, though clearly written in Italian, somehow ended up in a section of Spanish language books–presumably because the word ‘Storia’ is the same in Spanish and Italian. 

The essential problem is an intrinsic one.  If it were possible to summarize the contents of a book in fewer words (or a simple code), it wouldn’t be necessary to write out the whole book, now would it?  Some books may be unnecessarily verbose:  but few can be condensed THAT much.  So attempts at cataloging and indexing inevitably omit a lot of information–and not ALL of it can be unimportant, surely?

“But,” you might say, “Surely all of this is outdated?  With full-text searching and search engines, these problems can be avoided, can’t they?”

Well, no.  The search engines make it possible to search a much larger body of literature fairly quickly.  But they’re not necessarily any more comprehensive.  Search engines with high recall will bring back a lot of irrelevant stuff (and if I’d doubted that, the searches I used to get dates, titles, etc. for this book would have demonstrated it to me anew).  I shouldn’t get more than 5,000 hits searching for the etymology of the word ‘atlatl’, for example.  Once I’m informed that it’s from Nahuatl for ‘spear-thrower’, I don’t need much more–unless I’m planning to make one, which I’m not.

On the other hand, a search engine with high precision may miss quite a bit of relevant stuff.  If the keywords are just a smidgen off, a high-precision engine will not necessarily find what I want to find.  This is something spammers learned early:  that a slight misspelling or rephrasing will often throw the spam filters off the scent.
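For anyone who wants the trade-off in numbers, here is a minimal Python sketch using the usual definitions:  precision is the share of what came back that was relevant, and recall is the share of what was relevant that came back.  The document labels and the two imaginary searches are invented purely for illustration;  no real search engine is being measured.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search.

    precision = relevant hits / everything retrieved
    recall    = relevant hits / everything that was relevant
    Empty sets yield 0.0 rather than a division error.
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents actually answer the question.
relevant = {"a", "b", "c", "d"}

# A broad search finds all four, plus six that are beside the point.
broad_search = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}

# A narrow search returns one document, and it happens to be a good one.
narrow_search = {"a"}

broad_result = precision_recall(broad_search, relevant)
narrow_result = precision_recall(narrow_search, relevant)
print(broad_result)
print(narrow_result)
```

The broad search misses nothing but buries the four good documents among ten;  the narrow search wastes no time but leaves three good documents unfound.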

This is one reason I’m not terribly concerned that government and industrial spies are trying to collect ‘absolutely everything’.  I say, GIVE them absolutely everything, including your dog’s maiden name.  Just don’t give them any indexers.  They’ll still be searching until Domesday.

Talking of things that are hard to find, I note that one Frank and Ernest cartoon showed Frank sitting on a park bench.  A poster on a wall nearby warned “Big Brother is WATCHING you!”  “I hope he’s not easily bored,” Frank replies.  I have a copy of this cartoon somewhere, but I’m not sure where.  More cataloging to be done, I suspect.
