Methods
Sources, Transformations, Limitations
Repository
You can find all the scripts/code used in this project in the guide2kulchur repository.
Data Sources
Goodreads
The majority of the visualizations use data publicly available from Goodreads. Goodreads lists a large number of attributes for each book and author on its site. Specifically, the following fields are shown in the visualizations:
- Title of book / Author name
- Date book first published / Author's birth date
- Description of the book/author posted on Goodreads
- Author's place of birth
- Average Goodreads user rating
- Number of Goodreads user ratings given
- The "top genres" of a book/author
- The works of an author
- The image of an author/book located on their respective Goodreads page
- Books/authors that are "similar" to a given book/author
  - "Readers who enjoyed book X also enjoyed book Y."
  - "Members who read books by author X also read books by author Y."
Nominatim
To make the map visualizations, Nominatim was used to find the coordinates of an author's place of birth. This process is imperfect: birthplace strings are often underspecified (e.g., "Buffalo", with no state or country), too broad to yield specific coordinates (e.g., "United States of America"), or ambiguous enough to resolve to an entirely different location (e.g., "Georgia", which could be the country or the U.S. state). Despite these limitations, the service made it possible to build author and book maps covering a broad range of authors and books, both in time and in geography.
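To make the lookup step concrete, here is a minimal sketch of building a request URL for Nominatim's public search endpoint and reading coordinates out of a response. This is not the code used in the project: the helper names are hypothetical, the sample response is an illustrative stand-in rather than real output, and no network call is made. A real client would also need to send an identifying User-Agent and respect the service's rate limits.

```python
import json
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(place: str) -> str:
    """Build a search URL for a free-form birthplace string (hypothetical helper)."""
    params = {"q": place, "format": "jsonv2", "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def first_coordinates(response_text: str):
    """Return (lat, lon) from the first result, or None if nothing matched."""
    results = json.loads(response_text)
    if not results:
        return None
    top = results[0]
    # Nominatim returns coordinates as strings, so convert them to floats.
    return float(top["lat"]), float(top["lon"])


# Illustrative stand-in for a response body; the values are assumptions.
sample_response = (
    '[{"lat": "42.8867", "lon": "-78.8784", '
    '"display_name": "Buffalo, New York, United States"}]'
)

url = build_search_url("Buffalo")
coords = first_coordinates(sample_response)
no_match = first_coordinates("[]")
```

Note that a string like "Buffalo" only ever yields the service's top-ranked match, which is exactly why underspecified birthplaces can land on the wrong spot on the map.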
Wikidata
Unfortunately, a large number of older authors did not have birth dates on record in Goodreads; further, a programming error caused authors with BC-era birth dates to be coded as having no birth date available. To partially alleviate these issues, Wikidata was used to find the birth dates and birthplaces of a number of older authors, like Hesiod and Homer. This process was also imperfect: joins were made on an author's name, so a Goodreads author could be incorrectly matched to a different person with the same name in Wikidata.
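The risk of a name-based join can be sketched as follows. This is an illustration under assumptions, not the project's actual join: the function names, record layout, and placeholder IDs are all hypothetical. The idea is to accept a match only when exactly one Wikidata record shares the normalized name, and to set aside names with several candidates instead of silently picking one.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so trivially different spellings match."""
    return " ".join(name.casefold().split())


def join_on_name(goodreads_authors, wikidata_records):
    """Split authors into unique matches, ambiguous matches, and non-matches."""
    by_name = defaultdict(list)
    for rec in wikidata_records:
        by_name[normalize(rec["name"])].append(rec)

    matched, ambiguous, unmatched = {}, {}, []
    for author in goodreads_authors:
        candidates = by_name.get(normalize(author), [])
        if len(candidates) == 1:
            matched[author] = candidates[0]
        elif candidates:
            # Several people share this name; a blind join could pick the wrong one.
            ambiguous[author] = candidates
        else:
            unmatched.append(author)
    return matched, ambiguous, unmatched


# Hypothetical records with placeholder IDs (not real Wikidata identifiers).
records = [
    {"name": "Hesiod", "id": "record-1"},
    {"name": "Homer", "id": "record-2"},
    {"name": "homer ", "id": "record-3"},
]
matched, ambiguous, unmatched = join_on_name(
    ["Hesiod", "Homer", "Some Author"], records
)
```

Ambiguous names could then be resolved by hand or with a second field, such as an approximate birth year, where one is available.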
Visualizations
kepler.gl
For all map visualizations, kepler.gl, a "data-agnostic, high-performance web-based application for visual exploration of large-scale geolocation data sets" made by Uber, was used. This open-source application was very easy to work with, and its high-performance backend allowed a large number of markers, along with their data, to be displayed on a single map. I'd recommend it to others, and will definitely use it again in the future on other projects.
ipysigma
ipysigma, a Jupyter widget made by médialab Sciences Po that uses sigma.js and graphology under the hood, was used to make graphs of author and book networks. Again, I'd definitely recommend this tool and will be using it again.
Limitations
While a number of limitations have already been mentioned, I'll mention a few more here:
- Because the pool of authors and books is taken from Goodreads, and many visualizations filter on an author's or book's rating count, the visualizations under-represent authors and books that get little interaction on Goodreads or that lack fields we rely on, like birthplace or publication date. Unfortunately (for me at least), this means that many of the older authors and books are not included in the visualizations.
- Unlike Goodreads "native" data, such as average rating and number of ratings, fields like an author's birthplace or a book's publication date may be incorrect, particularly for older authors and books. Given the point above, relatively little of the older author/book data makes it into the visualizations, so this isn't a huge concern, but it's still there.
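The filtering effect described in the first limitation can be sketched like this. It is a minimal illustration, not the project's pipeline: the field names, the rating threshold, and the sample records are all assumptions made for the example.

```python
REQUIRED_FIELDS = ("birth_place", "birth_date")  # assumed field names
MIN_RATINGS = 1000  # hypothetical threshold, not the project's actual cutoff


def split_by_eligibility(records, min_ratings=MIN_RATINGS, required=REQUIRED_FIELDS):
    """Separate records that pass the filters from those that get dropped."""
    kept, dropped = [], []
    for rec in records:
        # A record needs every required field to be present and non-empty.
        has_fields = all(rec.get(field) for field in required)
        # A missing or null rating count is treated as zero.
        popular = (rec.get("ratings_count") or 0) >= min_ratings
        (kept if has_fields and popular else dropped).append(rec)
    return kept, dropped


# Made-up records for illustration only.
authors = [
    {"name": "Author A", "birth_place": "Somewhere", "birth_date": "1900-01-01",
     "ratings_count": 25000},
    {"name": "Author B", "birth_place": None, "birth_date": "1850-01-01",
     "ratings_count": 40000},
    {"name": "Author C", "birth_place": "Elsewhere", "birth_date": "1700-01-01",
     "ratings_count": 12},
]
kept, dropped = split_by_eligibility(authors)
```

Here only "Author A" survives: "Author B" is well rated but has no birthplace, and "Author C" has complete data but too few ratings, which mirrors how older or less popular authors fall out of the maps.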