Earlier this year, I was asked to give a 30-minute data analytics presentation for a job interview with a news tech company. My prior data work is primarily visualization and web design, so I decided to put something together from scratch. While I didn't land the job — meaning you could hire me! — this was a great exercise, and I ended up learning some interesting things about library usage in Seattle.
What's In The Data Set?
The Seattle Public Library's continually-updated "Monthly Checkouts By Title" contains checkout totals for each item checked out during the month. It covers both physical and digital platforms, and extends all the way back to 2005. In total, it contains 36.6 million rows of data, and comes as an 8gb CSV file when downloaded. (h/t to Data Is Plural for surfacing this.) In a single row of data, we get:
Bibliographic info: Title, creator, publication date, and subject tags.
The item's material type: book, ebook, videodisc, magazine, atlas (etc etc).
A simple digital/physical split, as well as a platform split for digital checkouts: The SPL currently uses the digital platforms Overdrive (ebooks and comics), Hoopla (music and movies), Freegal (music streaming), and Zinio (magazines).
...but it's kinda messy, too. Each of the digital platforms reports its data in a slightly different way. And while the formatting between physical items is fairly consistent, a movie on DVD and its streaming counterpart may list completely different information; for example, whether a film's director or star is listed in the "creator" field.
Seattle's Open Data program provides a graphical front-end for users, called Socrata, which is like an open source version of Tableau. It's a little clunky and limited... but it's free! so whatever. It also includes a graphing utility, which is fairly decent, if a little limited. Importantly, it also allows you to filter the data set and then download the results, which is great for folks like me with limited bandwidth or older computers (I'm on a 2011 Mac Mini). For SQL, I'm running Postgress in the Terminal.
So let's get into it! We'll start by examining global trends in physical vs. digital checkouts over the years. Then we'll drill down into a single year, 2017, and take a look at 2017's most-checked-out items. Lastly, we'll briefly look at the effects of COVID-19 on checkouts in 2020.
1. Global Trends
Total checkouts, Physical v. Digital
\copy checkouts_full
FROM 'Checkouts_by_Title.csv'
WITH csv header;
-- Full monthly rollup
CREATE TABLE monthly_rollup_full
AS SELECT
checkoutyear,
checkoutmonth,
usageclass,
checkouttype,
materialtype,
SUM(checkouts) AS "monthlycheckouts"
FROM checkouts_full
GROUP BY
checkoutyear,
checkoutmonth,
usageclass,
checkouttype,
materialtype
ORDER BY
checkoutyear ASC,
checkoutmonth ASC,
usageclass ASC,
checkouttype ASC,
materialtype ASC;
\copy monthly_rollup_full
TO 'monthly_rollup_full.csv'
WITH CSV HEADER;
LINK: monthly_rollup_full.csv
-- Physical vs digital
CREATE TABLE monthly_rollup_phsical_digital
AS SELECT
checkoutyear,
checkoutmonth,
usageclass,
SUM(checkouts) AS "monthlycheckouts"
FROM checkouts_full
GROUP BY
checkoutyear,
checkoutmonth,
usageclass
ORDER BY
checkoutyear ASC,
checkoutmonth ASC,
usageclass ASC;
\copy monthly_rollup_phsical_digital
TO 'monthly_rollup_physical_digital.csv'
WITH CSV HEADER;
LINK: monthly_rollup_physical_digital.csv
While the total checkouts per year have remained relatively static over the last decade at around 9 million, the share of total checkouts from digital platforms has steadily increased. Prior to COVID-19, the trends indicate that digital checkouts would have surpassed physical ones some time in the next few years.
Digital Platforms, AV Split
Note: physical checkouts are labeled "Horizon" (the physical system's software package) in these graphs.
-- Monthly rollup w/ Platform Splits
CREATE TABLE monthly_rollup_full_platforms
AS SELECT
checkoutyear,
checkoutmonth,
usageclass,
checkouttype,
SUM(monthlycheckouts) AS "monthlycheckoutsplatform"
FROM monthly_rollup_full
GROUP BY
checkoutyear,
checkoutmonth,
usageclass,
checkouttype
ORDER BY
checkoutyear ASC,
checkoutmonth ASC,
usageclass ASC,
checkouttype ASC;
\copy monthly_rollup_full_platforms
TO 'monthly_rollup_full_platforms.csv'
WITH CSV HEADER;
LINK: monthly_rollup_full_platforms.csv
-- Digital platforms only
CREATE TABLE monthly_rollup_digital
AS SELECT *
FROM monthly_rollup_full_platforms
WHERE usageclass = 'Digital';
\copy monthly_rollup_digital
TO 'monthly_rollup_digital.csv'
WITH CSV HEADER;
LINK: monthly_rollup_digital.csv
[TK Media types]
When we zoom in on the digital checkouts, we can see that Overdrive is the dominant platform. And when we split the digital checkouts by material type, we find that ebooks rather than AV (movies and music, primarily) account for the vast majority of the digital checkouts.
-- List of material types for reference
DROP TABLE material_list;
CREATE TABLE material_list
AS SELECT DISTINCT
UNNEST(STRING_TO_ARRAY(materialtype, ', '))
AS "material type list",
usageclass
FROM checkouts_full
ORDER BY "material type list" ASC;
\copy material_list
TO 'material_list.csv'
WITH CSV HEADER;
LINK: material_list.csv
CREATE TABLE av_formats_only
AS SELECT
checkoutyear,
checkoutmonth,
checkouttype,
SUM(checkouts) AS "monthlycheckouts"
FROM checkouts_full
WHERE materialtype IN
('AUDIOBOOK','MOVIE','MUSIC',
'MUSICSNDREC','SONG','SOUNDCASS',
'SOUNDDISC','SOUNDREC','TELEVISION',
'VIDEO','VIDEOCART','VIDEOCASS',
'VIDEODISC','VIDEOREC')
GROUP BY
checkoutyear,
checkoutmonth,
checkouttype
ORDER BY
checkoutyear ASC,
checkoutmonth ASC,
checkouttype ASC;
\copy av_formats_only
TO 'av_formats.csv'
WITH CSV HEADER;
LINK: av_formats.csv
...But when we add back in the physical checkouts to the AV formats, we find that physical formats like DVDs, BluRays, CDs (etc) count for the solid majority of checkouts. However, checkout data alone is unclear on why: There might be a better selection of titles available physically, the patrons might be unfamiliar with the library's streaming services (which, in my experience, is common), or the affordability of DVD/Blu-Ray players when compared with computers and streaming subscriptions could be a factor.
2. Single-Year Drilldown: 2017
Monthly checkout totals, platform split
-- 2017 Rollup
CREATE TABLE monthly_rollup_2017
AS SELECT
checkoutyear,
checkoutmonth,
usageclass,
checkouttype,
materialtype,
SUM(checkouts) AS "monthlycheckouts"
FROM checkouts_full
WHERE checkoutyear = 2017
GROUP BY
checkoutyear,
checkoutmonth,
usageclass,
checkouttype,
materialtype
ORDER BY
checkoutyear ASC,
checkoutmonth ASC,
usageclass ASC,
checkouttype ASC,
materialtype ASC;
\copy monthly_rollup_2017
TO 'monthly_rollup_2017.csv'
WITH CSV HEADER;
LINK: monthly_rollup_2017.csv
Month-to-month, checkout totals remained fairly consistent at around 770k, with a slight uptick during the summer. The platform percentage splits were also fairly consistent, with physical checkouts hovering around 70%. There's a small increase in Overdrive checkouts in December, which I'm guessing could be from people checking out digital material for their holiday travels.
How many times is an item typically checked out per month?
CREATE TABLE checkouts_2017
AS SELECT *
FROM checkouts_full
WHERE checkoutyear = 2017
ORDER BY
checkoutyear ASC,
checkoutmonth ASC,
usageclass ASC;
WITH c1 (count) AS (
SELECT COUNT(checkouts)
FROM checkouts_2017
WHERE checkouts = 1),
c2 (count) AS (
SELECT COUNT(checkouts)
FROM checkouts_2017
WHERE checkouts = 2),
c3 (count) AS (
SELECT COUNT(checkouts)
FROM checkouts_2017
WHERE checkouts = 3),
c4 (count) AS (
SELECT COUNT(checkouts)
FROM checkouts_2017
WHERE checkouts = 4),
c5 (count) AS (
SELECT COUNT(checkouts)
FROM checkouts_2017
WHERE checkouts = 5),
c6 (count) AS (
SELECT COUNT(checkouts)
FROM checkouts_2017
WHERE checkouts > 5)
SELECT * FROM
c1 AS "1 Checkout",
c2 AS "2 Checkouts",
c3 AS "3 Checkouts",
c4 AS "4 Checkouts",
c5 AS "5 Checkouts",
c6 AS ">5 Checkouts";
I was also curious to see how often an particular title was typically checked out per month. The results of this cut are interesting: About half of all the titles in 2017 were checked out only once, with each increasing checkout decreasing in percentage. One takeaway here could be the breadth of a library's collection is a significant factor in its usage; patrons are just as interested in the "deep cuts" as the "hits."
But there's also some red flags here! For example, we don't know how many copies of each item the SPL stocks: If they only have one copy, it's not going to be checked out more than once. Furthermore, when you look at this at the checkout-level rather than per-item, it's clear that single-monthly-checkouts account for only around 12% of total checkouts, and the "hits" are indeed driving most of the total checkout count.
3. 2017's Hot Content 🔥
Finally, let's take a look at 2017's 1,000 most-checked-out items for the year. We'll start by listing the Top 20 titles by total checkouts (drumroll please):
Reality check! The top 2 items are mobile HotSpots and device checkouts. For context: in 2005, Google granted $225k to the SPL to purchase 150 mobile hotspots to lend to patrons. This program has been expanded over the years and replicated at other libraries. On the device side, the SPL provides laptops and iPads. This is an important reminder that, beyond just having stuff, libraries provide critical services to citizens without access to technology.
But moving on to the top 20 titles:
Check.
Title
◼︎ Ph. / ◼︎ Dig.
Media
4711
The Economist
◼︎ Dig.
Magazine
4048
Arrival
◼︎ Phys.
Videodisc
3790
The New Yorker
◼︎ Dig.
Magazine
3525
The Underground Railroad
◼︎ Phys.
Book
3458
Hillbilly Elegy
◼︎ Phys.
Book
3238
Hidden Figures
◼︎ Phys.
Videodisc
3184
Girl on the Train
◼︎ Dig.
Ebook
3184
Moonlight
◼︎ Phys.
Videodisc
3082
The Goldfinch
◼︎ Dig.
Ebook
3065
Manchester by the Sea
◼︎ Phys.
Videodisc
2965
La La Land
◼︎ Phys.
Videodisc
2944
Girl on the Train
◼︎ Phys.
Videodisc
2853
Moana
◼︎ Phys.
Videodisc
2778
The Accountant
◼︎ Phys.
Videodisc
2766
The Underground Railroad
◼︎ Dig.
Ebook
2693
Sully
◼︎ Phys.
Videodisc
2679
Hillbilly Elegy
◼︎ Dig.
Ebook
2672
Hacksaw Ridge
◼︎ Dig.
Ebook
2654
Passengers
◼︎ Phys.
Videodisc
2631
Lion
◼︎ Phys.
Videodisc
-- 2017 Top 20
CREATE VIEW top_20_2017 AS
SELECT title, usageclass, materialtype,
SUM(checkouts) AS total_checkouts
FROM checkouts_2017
GROUP BY title, usageclass, materialtype
ORDER BY total_checkouts DESC
LIMIT 22;
\copy (SELECT * FROM top_20_2017)
TO 'top20.csv' WITH csv header;
LINK: top20.csv
There are some interesting findings here:
2 of the top 3 are digital versions of periodicals. (These checkout counts are likely aggregated across all issues in the year.)
Only 2 physical books appear in the top 20, and their ebook counterparts are also on the list, which suggests that the content, rather than the format, drives the demand. All of the books/ebooks on the top 20 were 2017 best-sellers, with the exception of "The Goldfinch," a perennial audience favorite since its 2013 publication.
14 of the top 20 checkouts are physical... and 12 of those are DVDs/Blu-Rays.
In fact, if you split 2017's top 1,000 items by format, you can see that 73% are physical.
...but about 2/3 of those physical checkouts are DVDs/Blu-Rays. Another interesting thing here is that it appears audiobook usage has transitioned primarily to digital. Digital audiobook checkouts account for 5.1% of the top 1,000, about a quarter of the digital total, compared to a measly 0.2% for the physical "Sounddisc" media type, which includes both music and audiobooks. (Note: The "Physical: MIXED" slice represents the HotSpot and device checkouts referenced earlier.)
And if we zoom in on books specifically, we find some interesting trends. Let's look at the top 10 most-checked-out authors (across all their titles) in both physical and digital formats.
Top Authors, Physical & Digital
-- 2017 top authors
CREATE VIEW authors_2017 AS
SELECT creator, usageclass,
SUM(checkouts) AS total_checkouts
FROM checkouts_2017
GROUP BY creator, usageclass
ORDER BY total_checkouts DESC;
\copy (SELECT * FROM authors_2017)
TO 'authors_2017.csv' WITH csv header;
LINK: authors_2017.csv
On the physical side, 9 of the top 10 are children's authors, with Mo Willem (author of the "Pigeon" and "Elephant & Piggie" series) with a significant lead at #1. The lone adult author on the physical list is James Patterson...
...who tops the list of digital checkouts. The authors on the digital list write primarily long-running mystery/thriller series, with Stephen King and Neil Gaimen working in horror and sci-fi. Two of the three women authors on this list, Nora Roberts and Janet Evanovitch, were both prolific romance writers before crossing over into mystery/thrillers.
Putting the top digital and physical authors side-by-side, I noticed that Patterson's digital total was higher than his physical one. Curious about the other top digital authors, I checked out their physical totals, and found this was true of all 10.
Going back to a question from the beginning, about when the percentage of digital checkouts might surpass physical ones, this finding suggests that this crossover may have already happened for particular book genres or readership demographics.
4. 2020: COVID-19 & Beyond
Any smooth trends in physical vs. digital checkouts were drastically interrupted by the effects of COVID-19; as the SPL closed in mid-March, physical checkouts dropped almost to zero. While digital checkouts have increased during the library's closure, these gains don't come close to making up for the lost physical checkouts. It's difficult to predict how much of this increase in digital traffic will continue after the SPL can safely re-open.
Anecdotally: At Green Apple, we've averaged around 2,000 web orders per year over the last decade. From March to December of 2020, we've had around 50,000, and the majority of customers who place orders on our web site are doing so for the first time. I imagine this general increase in awareness of digital offerings from outlets beyond Amazon, Apple, Netflix, (etc) is being experienced by many other organizations as well.
But the unfortunate wrinkle with libraries relates back to the Top 2 checked-out items of 2017: They're are one of the few places which provide free access to technology for individuals who otherwise lack it. With budget cuts already a common experience among American libraries, putting more resources toward digital offerings could result in decreased resources for these critical social services in the future. ♦︎