Methods

Repository

You can find all the scripts/code used in this project in the guide2kulchur repository.

Data Sources

Goodreads

The majority of the visualizations use data publicly available from Goodreads. Goodreads lists a large number of attributes for each book and author on its site. The following fields are shown in the visualizations:

Title of book / Author name
Date book first published / Author's birth date
Description of the book/author posted on Goodreads
Author's place of birth
Average Goodreads user rating
Number of Goodreads user ratings given
The "top genres" of a book/author
The works of an author
The image of an author/book located on their respective Goodreads page
Books/authors that are "similar" to a given book/author

The "Similar" field warrants further discussion. This field doesn't necessarily list the authors/books most similar to a given author/book. More specifically, the language Goodreads uses is the following:

"Readers who enjoyed book X also enjoyed book Y."
"Members who read books by author X also read books by author Y."

Rather than "similar" implying similarity in content or style, "similar" here means similarity in readership. While there'll be some correlation with content and/or style, it's important to clarify this, as you may be confused when you see in the individual author networks that Plato is "similar" to Ernest Hemingway and Rick Rubin. This definition of "similar" still provides a meaningful data source to study author relationships, specifically in studying their modern readerships and their reading behavior.
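One consequence of this readership-based definition is that the relationship is directional: author Y showing up on author X's "similar" list doesn't guarantee that X shows up on Y's. The sketch below is only an illustration of that idea in plain Python, not the code from the repository. Plato's two entries come from the example above; every other entry, along with the function names, is made up for the sketch.

```python
# Illustrative "similar author" lists, keyed by author.
# Only Plato's entries come from the text; the rest are invented for this sketch.
similar = {
    "Plato": ["Ernest Hemingway", "Rick Rubin"],
    "Ernest Hemingway": ["Plato"],
    "Rick Rubin": [],
}


def to_edges(similar_map):
    """Flatten {author: [similar authors]} into directed (source, target) edges."""
    return [(src, dst) for src, targets in similar_map.items() for dst in targets]


def reciprocal_pairs(edges):
    """Return the unordered author pairs that list each other as 'similar'."""
    edge_set = set(edges)
    return {
        tuple(sorted((a, b)))
        for a, b in edge_set
        if a != b and (b, a) in edge_set
    }


edges = to_edges(similar)
mutual = reciprocal_pairs(edges)
```

In this toy data, Plato and Hemingway point at each other, while the Plato to Rick Rubin link only runs one way, which is why treating the network as directed seems like the safer default.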

Nominatim

To make the map visualizations, Nominatim was used to find the coordinates of each author's place of birth. This process is imperfect: birthplace strings are often incomplete (e.g., "Buffalo", with no state or country), too broad to yield specific coordinates (e.g., "United States of America"), or ambiguous enough to resolve to an entirely different location (e.g., "Georgia", which could be the U.S. state or the country). Despite these limitations, the service made it possible to build author and book maps covering a broad range of time periods and geography.
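As a rough sketch of what a lookup like this involves (not the actual script from the repository), the snippet below builds a request URL for Nominatim's public search endpoint and pulls the first coordinate pair out of a JSON response. The function names are hypothetical, and the sample response is hand-written for illustration rather than copied from a real query. No request is sent here; anyone querying the public instance should check its usage policy first, which asks for an identifying User-Agent and a low request rate.

```python
import json
from urllib.parse import urlencode

# Public Nominatim search endpoint.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(place, limit=1):
    """Build a free-form search URL for a birthplace string."""
    params = {"q": place, "format": "jsonv2", "limit": limit}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def first_coordinates(response_text):
    """Return (lat, lon) of the top result, or None if nothing matched."""
    results = json.loads(response_text)
    if not results:
        return None
    top = results[0]
    # Nominatim returns latitude and longitude as strings.
    return float(top["lat"]), float(top["lon"])


# Hand-written sample in the shape of a Nominatim response (illustrative values).
sample_response = (
    '[{"lat": "42.8867166", "lon": "-78.8783922", '
    '"display_name": "Buffalo, Erie County, New York, United States"}]'
)

url = build_search_url("Buffalo")
coords = first_coordinates(sample_response)
no_match = first_coordinates("[]")
```

Taking only the top result is exactly what makes strings like "Georgia" risky: the geocoder returns its best guess, and nothing in the response says whether the author meant the state or the country.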

Wikidata

Unfortunately, a large number of older authors did not have birth dates on record in Goodreads; further, a programming error caused authors with BC-era birth dates to be coded as having unavailable birth dates. To partially alleviate these issues, Wikidata was used to find the birth dates and birth places of a number of older authors, like Hesiod and Homer. This process was also imperfect: joins were made on an author's name, so a Goodreads author could be incorrectly matched to a different person with the same name in Wikidata.
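To make the name-join risk concrete, here is a minimal sketch of a name-based match that at least flags names with more than one candidate instead of silently picking one. It is an assumption about how such a join could be guarded, not the join used in this project. The row IDs, the "John Smith" entries, and both function names are hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Lowercase, strip accents, and collapse whitespace before comparing names."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def join_on_name(goodreads_names, wikidata_rows):
    """Match names to rows; return (unique matches, names with several candidates)."""
    by_name = defaultdict(list)
    for row in wikidata_rows:
        by_name[normalize_name(row["name"])].append(row)

    matches, ambiguous = {}, []
    for name in goodreads_names:
        candidates = by_name.get(normalize_name(name), [])
        if len(candidates) == 1:
            matches[name] = candidates[0]
        elif len(candidates) > 1:
            ambiguous.append(name)  # several people share this name
    return matches, ambiguous


# Hypothetical rows standing in for Wikidata results (placeholder IDs).
wikidata_rows = [
    {"name": "Homer", "id": "row-1"},
    {"name": "Hesiod", "id": "row-2"},
    {"name": "John Smith", "id": "row-3"},
    {"name": "John Smith", "id": "row-4"},
]

matches, ambiguous = join_on_name(
    ["Homer", "hesiod ", "John Smith", "Sappho"], wikidata_rows
)
```

Even a unique match can still be the wrong person, so a check like this narrows the problem rather than removing it.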

Visualizations

kepler.gl

For all map visualizations, kepler.gl, a "data-agnostic, high-performance web-based application for visual exploration of large-scale geolocation data sets" made by Uber, was used. This open-source application was very easy to work with, and its high-performance backend allowed a large number of markers, along with their associated data, to be displayed on a single map. I'd recommend it to others, and will definitely use it again on future projects.

ipysigma

ipysigma, a Jupyter widget made by médialab Sciences Po that uses sigma.js and graphology under the hood, was used to make graphs of author and book networks. Again, I'd definitely recommend this tool and will be using it again.

Limitations

While a number of limitations have already been mentioned, I'll mention a few more here:

Because the pool of authors and books comes from Goodreads, and many visualizations filter on an author's or book's rating count, the visualizations underrepresent authors and books that see little activity on Goodreads or that lack fields we rely on, like birth place or publication date. Unfortunately (for me at least), this means that many of the older authors and books are not included in the visualizations.
Aside from Goodreads "native" data, like average ratings and number of ratings given, fields such as an author's birth place or a book's publication date may be incorrect, particularly for older authors and books. Since, per the point above, relatively few older authors and books make it into the visualizations, this isn't a huge concern, but it's still there.
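The first limitation is easier to see with a small sketch of the kind of filter involved. This is an illustration under assumptions, not the project's actual filter: the field names, the rating threshold, and the sample records are all hypothetical.

```python
# Hypothetical field names and threshold; the real values may differ.
REQUIRED_FIELDS = ("birth_place", "birth_date")
MIN_RATINGS = 1000


def is_plottable(record, min_ratings=MIN_RATINGS, required=REQUIRED_FIELDS):
    """Keep a record only if it has enough ratings and every required field."""
    if (record.get("rating_count") or 0) < min_ratings:
        return False
    return all(record.get(field) not in (None, "") for field in required)


# Made-up records covering the three cases.
authors = [
    {"name": "Author A", "rating_count": 250000,
     "birth_place": "Buffalo", "birth_date": "1900-01-01"},
    {"name": "Author B", "rating_count": 40,
     "birth_place": "Georgia", "birth_date": "1850-06-15"},
    {"name": "Author C", "rating_count": 90000,
     "birth_place": "", "birth_date": None},
]

kept = [a["name"] for a in authors if is_plottable(a)]
```

Here only the first record survives: the second has too few ratings, and the third is popular but has no usable birth place or date, which is the situation many older authors end up in.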