rinsemiddlebliss

Ink drawing of items a person might have on a chaotic desk. A no smoking sign, a mug of tea, a yorur (not yogurt) jar, a notebook with a spoon on it, a bar chart labeled only THINGIES, a report that says Itemize-Used for bird specifications and rogue animals, a bottle of bills with some lying next to it. Own work 2023.

Book scraper yak shave

I just wanted to export my book data from Goodreads

by AK Krajewska

Goodreads has over a decade of my book data and I wanted to get it out. I've wanted to get it out for a while, but I particularly wanted to get it out in December because the people who run my Mastodon instance started a Bookwyrm instance, and I wanted to get my books into it. Bookwyrm is to Goodreads what Mastodon is to Twitter. That's a series of statements that probably makes sense to few of my readers, I realize as I write it out. The important thing here is I wanted to export all the information I had put into Goodreads through years of data entry and writing and store it in some way that I could use it anywhere else I pleased.

Goodreads has an export function, but, as I quickly discovered, it doesn't export all the data you might want, and some of it is formatted in annoying ways. No problem, I thought, I'll open it in a spreadsheet (the export is a CSV) and clean it up. I'm trying to learn Python and cleaning up a bunch of data sounds exactly like an Automate the Boring Stuff with Python kind of problem. But I have so many books and the data was so messy. I might need to grab it myself to create a cleaner output.
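For a sense of what that kind of cleanup might look like, here is a minimal sketch using Python's built-in csv module. The column names and the sample row are made up for illustration, and the assumption that some cells arrive wrapped like ="0123456789" (a trick for forcing spreadsheets to treat a value as text) is just that, an assumption. The real export may be messy in other ways.

```python
import csv
import io


def clean_cell(value):
    """Trim whitespace and unwrap spreadsheet-style ="..." cells."""
    value = value.strip()
    if value.startswith('="') and value.endswith('"'):
        value = value[2:-1]
    return value


def clean_rows(csv_text):
    """Parse CSV text and return a list of dicts with cleaned values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: clean_cell(val or "") for key, val in row.items()}
        for row in reader
    ]


# Hypothetical two-column export with one messy row.
sample = 'Title,ISBN\n  Some Book  ,"=""0123456789"""\n'
rows = clean_rows(sample)
print(rows)
```

To run it against a real file, read the file's text with open(path, newline="", encoding="utf-8") and pass it to clean_rows.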

How hard can it be?

No problem, I'd use the Goodreads API. Oh no, the Goodreads API stopped issuing keys on December 8, 2020 and plans to retire the API.

No problem, I'll use a web scraper. Simon Willison always writes about scraping websites using Python so it must not only be possible but probably a good idea. And surely, scraping Goodreads to extract your book data must be a thing a lot of people want to do so I'll just find a script someone else wrote.

After some admittedly not very thorough searching, I discovered no one was exporting all the data I wanted, probably because it's relatively easy to scrape all the stuff that's on one page, and more difficult to scrape things that require traversing multiple pages and are inconsistent, like the date started reading and the annotations.

Fine, very well, no problem, I'll just follow one of the existing scraping tutorials to learn the basic principles of writing a scraper and once I get that to run, I'll build my own. Oh. The tutorial code doesn't work? Oh no. Well, I guess I'll debug it and make it work as the tutorial claimed it would.

But even though it works, it's kind of bad, isn't it? A rather clunky solution. I bet I could refactor it and make it better. Oh, oh no, now it's a whole thing. Now I'm refactoring someone else's weird half-assed tutorial just out of, I don't know, spite?

Fine, pretty hard, I guess

I'm not even done, because right after I refactored that first chunk about figuring out how many pages, I talked about it with someone who actually knows Python[1] and realized there's a better way to handle pagination. So I'm going to have to do that next and just take out that whole weird counting chunk I spent an hour on.
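The post doesn't say what the better approach is, so the following is only one common pattern, not necessarily the one that came up in that conversation: instead of working out the page count up front, keep asking for the next page until one comes back empty. Every name here is hypothetical, and fake_fetch_page stands in for whatever would actually download and parse a page.

```python
def iter_all_items(fetch_page):
    """Yield items page by page, stopping at the first empty page."""
    page_number = 1
    while True:
        items = fetch_page(page_number)
        if not items:
            break
        yield from items
        page_number += 1


# Stand-in data: a pretend shelf with two pages of books.
FAKE_SHELF = {1: ["Book A", "Book B"], 2: ["Book C"]}


def fake_fetch_page(page_number):
    """Return the books on one page, or an empty list past the end."""
    return FAKE_SHELF.get(page_number, [])


all_books = list(iter_all_items(fake_fetch_page))
print(all_books)
```

The loop never needs to know the total number of pages, so a separate counting step becomes unnecessary. A real scraper would also want a pause between requests and some protection against looping forever if a site repeats its last page instead of returning an empty one.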

And since I'm going to all this trouble, I might as well write up a tutorial that works, so that's on the to-do list, too. I mean, it's cool, I'm learning a lot about Python and when to use ChatGPT for code questions versus when to ask an actual human expert, and how you can't trust other people's code unless you run it (I kind of knew that but now I know it harder), and that maybe I can trust my intuition when other people's code seems off to me.

Lichen subscribe

Eventually, I am going to scrape my Goodreads record and get it into Bookwyrm and 2024 shall be the year of owning my book data. In the meantime, I'm working in the open and documenting the whole weird journey in a braindump file in my book scraper project on GitHub. I think by writing, so it's pretty stream of consciousness, but it's not like it takes me any extra effort to write it. When I've shared these sorts of working notes with people at work, they've tended to like them and find them kind of useful (to my surprise) so it's possible that some of you people on the internet might be interested in following my adventure.

I hope that in the end I can write a Goodreads scraper that gets out all the data, including all the started reading and intermediate reading progress data, and the notes and quotations, and I hope I can package it up so that anyone who is willing to use a command line can use it to get their Goodreads data even if they don't know how to code. Because if 2024 is the year of owning our own book data, the book nerds who aren't programmers will need some help from the book nerds who can at least pretend to be programmers until someone catches us and discovers we[2] were three racoons in a trench coat all along.


[1] That person is my spouse. A very conveniently located Python expert, except when he shoulder surfs my code and asks a seemingly innocent question like "What are you trying to accomplish with that bit?" This is probably what it feels like to be an intern to a senior dev.

[2] Me. It's me. I'm three racoons in a trench coat holding a torn piece of newspaper with a copy-pasted regex written in crayon in one of my mouths.