By Devin Smith, November 2021.
Earlier this year, I was asked to give a 30-minute data analytics presentation for a job interview with a news tech company. My prior data work is primarily visualization and web design, so I decided to put something together from scratch. While I didn't land the job — meaning you could hire me! — this was a great exercise, and I ended up learning some interesting things about library usage in Seattle.
The Seattle Public Library's continually-updated "Monthly Checkouts By Title" contains checkout totals for each item checked out during the month. It covers both physical and digital platforms, and extends all the way back to 2005. In total, it contains 36.6 million rows of data, and comes as an 8gb CSV file when downloaded. (h/t to Data Is Plural for surfacing this.) In a single row of data, we get:
...but it's kinda messy, too. Each of the digital platforms reports its data in a slightly different way. And while the formatting between physical items is fairly consistent, a movie on DVD and its streaming counterpart may list completely different information; for example, whether a film's director or star is listed in the "creator" field.
Seattle's Open Data program provides a graphical front-end for users, called Socrata, which is like an open source version of Tableau. It's a little clunky and limited... but it's free! so whatever. It also includes a graphing utility, which is fairly decent, if a little limited. Importantly, it also allows you to filter the data set and then download the results, which is great for folks like me with limited bandwidth or older computers (I'm on a 2011 Mac Mini). For SQL, I'm running Postgress in the Terminal.
So let's get into it! We'll start by examining global trends in physical vs. digital checkouts over the years. Then we'll drill down into a single year, 2017, and take a look at 2017's most-checked-out items. Lastly, we'll briefly look at the effects of COVID-19 on checkouts in 2020.
\copy checkouts_full FROM 'Checkouts_by_Title.csv' WITH csv header; -- Full monthly rollup CREATE TABLE monthly_rollup_full AS SELECT checkoutyear, checkoutmonth, usageclass, checkouttype, materialtype, SUM(checkouts) AS "monthlycheckouts" FROM checkouts_full GROUP BY checkoutyear, checkoutmonth, usageclass, checkouttype, materialtype ORDER BY checkoutyear ASC, checkoutmonth ASC, usageclass ASC, checkouttype ASC, materialtype ASC; \copy monthly_rollup_full TO 'monthly_rollup_full.csv' WITH CSV HEADER; LINK: monthly_rollup_full.csv -- Physical vs digital CREATE TABLE monthly_rollup_phsical_digital AS SELECT checkoutyear, checkoutmonth, usageclass, SUM(checkouts) AS "monthlycheckouts" FROM checkouts_full GROUP BY checkoutyear, checkoutmonth, usageclass ORDER BY checkoutyear ASC, checkoutmonth ASC, usageclass ASC; \copy monthly_rollup_phsical_digital TO 'monthly_rollup_physical_digital.csv' WITH CSV HEADER; LINK: monthly_rollup_physical_digital.csv
While the total checkouts per year have remained relatively static over the last decade at around 9 million, the share of total checkouts from digital platforms has steadily increased. Prior to COVID-19, the trends indicate that digital checkouts would have surpassed physical ones some time in the next few years.
Note: physical checkouts are labeled "Horizon" (the physical system's software package) in these graphs.
-- Monthly rollup w/ Platform Splits CREATE TABLE monthly_rollup_full_platforms AS SELECT checkoutyear, checkoutmonth, usageclass, checkouttype, SUM(monthlycheckouts) AS "monthlycheckoutsplatform" FROM monthly_rollup_full GROUP BY checkoutyear, checkoutmonth, usageclass, checkouttype ORDER BY checkoutyear ASC, checkoutmonth ASC, usageclass ASC, checkouttype ASC; \copy monthly_rollup_full_platforms TO 'monthly_rollup_full_platforms.csv' WITH CSV HEADER; LINK: monthly_rollup_full_platforms.csv -- Digital platforms only CREATE TABLE monthly_rollup_digital AS SELECT * FROM monthly_rollup_full_platforms WHERE usageclass = 'Digital'; \copy monthly_rollup_digital TO 'monthly_rollup_digital.csv' WITH CSV HEADER; LINK: monthly_rollup_digital.csv [TK Media types]
When we zoom in on the digital checkouts, we can see that Overdrive is the dominant platform. And when we split the digital checkouts by material type, we find that ebooks rather than AV (movies and music, primarily) account for the vast majority of the digital checkouts.
-- List of material types for reference DROP TABLE material_list; CREATE TABLE material_list AS SELECT DISTINCT UNNEST(STRING_TO_ARRAY(materialtype, ', ')) AS "material type list", usageclass FROM checkouts_full ORDER BY "material type list" ASC; \copy material_list TO 'material_list.csv' WITH CSV HEADER; LINK: material_list.csv CREATE TABLE av_formats_only AS SELECT checkoutyear, checkoutmonth, checkouttype, SUM(checkouts) AS "monthlycheckouts" FROM checkouts_full WHERE materialtype IN ('AUDIOBOOK','MOVIE','MUSIC', 'MUSICSNDREC','SONG','SOUNDCASS', 'SOUNDDISC','SOUNDREC','TELEVISION', 'VIDEO','VIDEOCART','VIDEOCASS', 'VIDEODISC','VIDEOREC') GROUP BY checkoutyear, checkoutmonth, checkouttype ORDER BY checkoutyear ASC, checkoutmonth ASC, checkouttype ASC; \copy av_formats_only TO 'av_formats.csv' WITH CSV HEADER; LINK: av_formats.csv
...But when we add back in the physical checkouts to the AV formats, we find that physical formats like DVDs, BluRays, CDs (etc) count for the solid majority of checkouts. However, checkout data alone is unclear on why: There might be a better selection of titles available physically, the patrons might be unfamiliar with the library's streaming services (which, in my experience, is common), or the affordability of DVD/Blu-Ray players when compared with computers and streaming subscriptions could be a factor.
-- 2017 Rollup CREATE TABLE monthly_rollup_2017 AS SELECT checkoutyear, checkoutmonth, usageclass, checkouttype, materialtype, SUM(checkouts) AS "monthlycheckouts" FROM checkouts_full WHERE checkoutyear = 2017 GROUP BY checkoutyear, checkoutmonth, usageclass, checkouttype, materialtype ORDER BY checkoutyear ASC, checkoutmonth ASC, usageclass ASC, checkouttype ASC, materialtype ASC; \copy monthly_rollup_2017 TO 'monthly_rollup_2017.csv' WITH CSV HEADER; LINK: monthly_rollup_2017.csv
Month-to-month, checkout totals remained fairly consistent at around 770k, with a slight uptick during the summer. The platform percentage splits were also fairly consistent, with physical checkouts hovering around 70%. There's a small increase in Overdrive checkouts in December, which I'm guessing could be from people checking out digital material for their holiday travels.
CREATE TABLE checkouts_2017 AS SELECT * FROM checkouts_full WHERE checkoutyear = 2017 ORDER BY checkoutyear ASC, checkoutmonth ASC, usageclass ASC; WITH c1 (count) AS ( SELECT COUNT(checkouts) FROM checkouts_2017 WHERE checkouts = 1), c2 (count) AS ( SELECT COUNT(checkouts) FROM checkouts_2017 WHERE checkouts = 2), c3 (count) AS ( SELECT COUNT(checkouts) FROM checkouts_2017 WHERE checkouts = 3), c4 (count) AS ( SELECT COUNT(checkouts) FROM checkouts_2017 WHERE checkouts = 4), c5 (count) AS ( SELECT COUNT(checkouts) FROM checkouts_2017 WHERE checkouts = 5), c6 (count) AS ( SELECT COUNT(checkouts) FROM checkouts_2017 WHERE checkouts > 5) SELECT * FROM c1 AS "1 Checkout", c2 AS "2 Checkouts", c3 AS "3 Checkouts", c4 AS "4 Checkouts", c5 AS "5 Checkouts", c6 AS ">5 Checkouts";
I was also curious to see how often an particular title was typically checked out per month. The results of this cut are interesting: About half of all the titles in 2017 were checked out only once, with each increasing checkout decreasing in percentage. One takeaway here could be the breadth of a library's collection is a significant factor in its usage; patrons are just as interested in the "deep cuts" as the "hits."
But there's also some red flags here! For example, we don't know how many copies of each item the SPL stocks: If they only have one copy, it's not going to be checked out more than once. Furthermore, when you look at this at the checkout-level rather than per-item, it's clear that single-monthly-checkouts account for only around 12% of total checkouts, and the "hits" are indeed driving most of the total checkout count.
Finally, let's take a look at 2017's 1,000 most-checked-out items for the year. We'll start by listing the Top 20 titles by total checkouts (drumroll please):
Reality check! The top 2 items are mobile HotSpots and device checkouts. For context: in 2005, Google granted $225k to the SPL to purchase 150 mobile hotspots to lend to patrons. This program has been expanded over the years and replicated at other libraries. On the device side, the SPL provides laptops and iPads. This is an important reminder that, beyond just having stuff, libraries provide critical services to citizens without access to technology.
But moving on to the top 20 titles:
-- 2017 Top 20 CREATE VIEW top_20_2017 AS SELECT title, usageclass, materialtype, SUM(checkouts) AS total_checkouts FROM checkouts_2017 GROUP BY title, usageclass, materialtype ORDER BY total_checkouts DESC LIMIT 22; \copy (SELECT * FROM top_20_2017) TO 'top20.csv' WITH csv header; LINK: top20.csv
There are some interesting findings here:
In fact, if you split 2017's top 1,000 items by format, you can see that 73% are physical.
...but about 2/3 of those physical checkouts are DVDs/Blu-Rays. Another interesting thing here is that it appears audiobook usage has transitioned primarily to digital. Digital audiobook checkouts account for 5.1% of the top 1,000, about a quarter of the digital total, compared to a measly 0.2% for the physical "Sounddisc" media type, which includes both music and audiobooks. (Note: The "Physical: MIXED" slice represents the HotSpot and device checkouts referenced earlier.)
And if we zoom in on books specifically, we find some interesting trends. Let's look at the top 10 most-checked-out authors (across all their titles) in both physical and digital formats.
-- 2017 top authors CREATE VIEW authors_2017 AS SELECT creator, usageclass, SUM(checkouts) AS total_checkouts FROM checkouts_2017 GROUP BY creator, usageclass ORDER BY total_checkouts DESC; \copy (SELECT * FROM authors_2017) TO 'authors_2017.csv' WITH csv header; LINK: authors_2017.csv
On the physical side, 9 of the top 10 are children's authors, with Mo Willem (author of the "Pigeon" and "Elephant & Piggie" series) with a significant lead at #1. The lone adult author on the physical list is James Patterson...
...who tops the list of digital checkouts. The authors on the digital list write primarily long-running mystery/thriller series, with Stephen King and Neil Gaimen working in horror and sci-fi. Two of the three women authors on this list, Nora Roberts and Janet Evanovitch, were both prolific romance writers before crossing over into mystery/thrillers.
Putting the top digital and physical authors side-by-side, I noticed that Patterson's digital total was higher than his physical one. Curious about the other top digital authors, I checked out their physical totals, and found this was true of all 10.
Going back to a question from the beginning, about when the percentage of digital checkouts might surpass physical ones, this finding suggests that this crossover may have already happened for particular book genres or readership demographics.
Any smooth trends in physical vs. digital checkouts were drastically interrupted by the effects of COVID-19; as the SPL closed in mid-March, physical checkouts dropped almost to zero. While digital checkouts have increased during the library's closure, these gains don't come close to making up for the lost physical checkouts. It's difficult to predict how much of this increase in digital traffic will continue after the SPL can safely re-open.
Anecdotally: At Green Apple, we've averaged around 2,000 web orders per year over the last decade. From March to December of 2020, we've had around 50,000, and the majority of customers who place orders on our web site are doing so for the first time. I imagine this general increase in awareness of digital offerings from outlets beyond Amazon, Apple, Netflix, (etc) is being experienced by many other organizations as well.
But the unfortunate wrinkle with libraries relates back to the Top 2 checked-out items of 2017: They're are one of the few places which provide free access to technology for individuals who otherwise lack it. With budget cuts already a common experience among American libraries, putting more resources toward digital offerings could result in decreased resources for these critical social services in the future. ♦︎