Recently, I was on a six-hour bus ride without internet. I thought it would be nice if I could read through the blog archives of my favorite authors. Most blogs are like magazines: the most recent posts stand out, but it’s difficult to read them from beginning to end like a book. Also, I prefer reading before sleeping, but I don’t want to be in front of a computer screen. Thus, the idea was born:
Figure Out How to Convert Blogs into a Kindle compatible ebook that could be read from start to finish.
So, I coded up a quick way to convert some of my favorite blogs into a Kindle-compatible .mobi ebook. On this page, I’ll outline the steps I took so that you can do the same. Note that this does require some programming knowledge. At the bottom of this page, I provide the mobi ebooks for the blogs SebastianMarshall.com and Tynan.com.
To do this, you basically need to save an HTML copy of each individual blog post, create a table of contents page containing a link to each post, and let the calibre ebook manager do its magic.
I’m listing generic steps below. If anyone wants the full code, feel free to contact me. I coded my program very quickly without comments, and it’s not very optimized.
Step 1: Get a Listing of All the Blog Posts
Many blogs will have an archive page. For example, Tynan.com and Tim Ferriss are easy.
http://www.tynan.com/archives
http://www.fourhourworkweek.com/blog/sitemap/
However, Sebastian does not have an archive page as far as I could tell, so I had to come up with a more creative solution.
I could access all his blog posts by iterating through each of his past pages like so:
http://www.sebastianmarshall.com/page/1 (shows the most recent material)
I found his last page (as of January 2012) was http://www.sebastianmarshall.com/page/138.
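So, a normal for loop would do. Note that the pages are numbered from 1 to 138:
for ($i = 1; $i <= 138; $i++) {
    // go through each page: http://www.sebastianmarshall.com/page/$i
    // fetch the URLs of the individual blog posts using the PHP DOM library
    // add each URL to an array
}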
Once this was completed, I had an array with each of his blog posts from most recent to oldest. However, I prefer reading from the first post onward. This shows the author’s natural progression in quality of writing, evolution of ideas, and success. Simply going through the array backwards works.
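Here’s a rough sketch of what that loop body might look like. It is not my original code. The function names are made up for illustration, and it assumes each post title on a listing page is a link inside an <h2> tag, which you would need to check against the real theme markup.

```php
<?php
// Sketch: pull post links out of one listing page's HTML.
// ASSUMPTION: each post title is an <a> inside an <h2>; adjust the XPath
// query to match the blog's actual markup.
function extractPostUrls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $urls = [];
    foreach ($xpath->query('//h2/a[@href]') as $link) {
        $urls[] = $link->getAttribute('href');
    }
    return $urls;
}

// Listing pages show the newest posts first, so gather every URL in page
// order and then reverse the whole list to get oldest-first reading order.
function collectOldestFirst(array $listingPagesHtml): array
{
    $all = [];
    foreach ($listingPagesHtml as $html) {
        $all = array_merge($all, extractPostUrls($html));
    }
    return array_reverse($all);
}
```

In a real run, each entry of $listingPagesHtml would come from something like file_get_contents("http://www.sebastianmarshall.com/page/$i") for $i from 1 to 138.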
Step 2: Extracting the Post Content and Not the Layout/Misc Material
Now, each blog post page is going to contain headers, sidebars, footers, and other content you don’t want. However, since all three of these blogs use WordPress, you can usually figure out how to extract just the body content based on the HTML structure.
For example, on Sebastian’s blog, the content is always nested inside <div id="content">CONTENT</div>.
Using PHP’s DOM library, we can extract this.
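A minimal sketch of that extraction, assuming the id="content" wrapper described above (the function name extractContentHtml is hypothetical, and other blogs will need a different XPath query):

```php
<?php
// Sketch: return only the inner HTML of <div id="content">, or null if
// the page has no such element.
function extractContentHtml(string $html): ?string
{
    $doc = new DOMDocument();
    // Suppress warnings from imperfect real-world markup.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $node = $xpath->query('//div[@id="content"]')->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize the children one by one so the wrapper <div> is dropped.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return trim($inner);
}
```

Returning null when the wrapper is missing makes it easy to spot pages whose layout differs from the rest.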
For some reason, some bizarre characters showed up in the output, so I had to run the content through this to clean it up:
$strbody = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $body);
Now that we have the content, we just have to save it as an HTML page.
Step 3: Create a Table of Contents That Calibre Can Read
This took me a while to figure out, but calibre has a built-in ebook generator that can take a collection of HTML pages and turn it into an ebook.
To do this, you need a basic HTML page that links to each of the individual HTML pages/blog posts.
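As a sketch, a TOC page like that could be generated as follows. The function name, the file names, and the exact markup are my own illustration; a plain list of links to the saved post files is the basic idea.

```php
<?php
// Sketch: build a simple TOC page from an ordered map of
// saved file name => post title (oldest post first).
function buildTocHtml(string $bookTitle, array $posts): string
{
    $items = '';
    foreach ($posts as $file => $postTitle) {
        $items .= sprintf(
            "    <li><a href=\"%s\">%s</a></li>\n",
            htmlspecialchars((string) $file, ENT_QUOTES),
            htmlspecialchars($postTitle, ENT_QUOTES)
        );
    }

    $safeTitle = htmlspecialchars($bookTitle, ENT_QUOTES);
    return "<html>\n<head><title>{$safeTitle}</title></head>\n<body>\n"
        . "  <h1>{$safeTitle}</h1>\n  <ul>\n{$items}  </ul>\n</body>\n</html>\n";
}

// Example usage with hypothetical file names:
// file_put_contents('toc.html', buildTocHtml('My Favorite Blog', [
//     'post-0001.html' => 'First Post',
//     'post-0002.html' => 'Second Post',
// ]));
```

Escaping the titles with htmlspecialchars keeps post titles that contain & or < from breaking the page.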
Step 4: Add TOC Page to Calibre and Create Ebook
Optional: Edit the Metadata for title, author, cover image, and so on.
Optional: Assign Chapters: I’m not even sure this is necessary, but it doesn’t hurt to try. If you want each blog post to start on a new page, then under the Structure Detection tab you need to tell calibre which HTML tags mark the start of a chapter.
Now it’s as simple as hitting the convert button and waiting for the conversion to complete. Once finished, you have a Kindle-compatible ebook.
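I used the calibre GUI, but calibre also ships a command-line tool called ebook-convert, so the conversion could probably be scripted too. The sketch below only builds the command string. The helper name and file names are hypothetical, and I haven’t tested this route on the blogs above.

```php
<?php
// Sketch: build an ebook-convert command line for the TOC page.
// ebook-convert takes an input file, an output file, and then options;
// --title and --authors set the book metadata.
function buildConvertCommand(
    string $tocFile,
    string $outFile,
    string $title,
    string $author
): string {
    return sprintf(
        'ebook-convert %s %s --title %s --authors %s',
        escapeshellarg($tocFile),
        escapeshellarg($outFile),
        escapeshellarg($title),
        escapeshellarg($author)
    );
}

// Example usage (requires calibre to be installed and on the PATH):
// $cmd = buildConvertCommand('toc.html', 'blog.mobi', 'Blog Archive', 'Some Author');
// exec($cmd, $outputLines, $exitCode);
```

Quoting every argument with escapeshellarg matters here because blog titles often contain spaces and apostrophes.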
Bugs/Future Improvements:
There are a few bugs that I haven’t had the time or interest to work out yet. Just doing the above took many hours. Nevertheless, here are some bugs to keep in mind.
Embedding photos and images is still a challenge. It might be as simple as downloading the images onto my computer for Calibre to have access to, but I haven’t tried it yet.
Some of the posts are out of order because of the order in which FTP downloaded the pages and the way calibre organizes files by datetime stamp rather than by the TOC order of posts. A related issue was that I wanted to add an introduction page at the beginning. I used this trial program to set introduction.html’s datetime stamp to the past so that it would show up as the first chapter:
http://download.cnet.com/File-Property-Edit-Pro/3000-2248_4-10864040.html
You may find an early blog post at the end and vice versa. Nevertheless, most of it should be in order, and the TOC still works.
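If the ordering really does follow file timestamps as I observed, one untested idea is to set the timestamps from PHP instead of using a separate program. The sketch below uses PHP’s touch() to stamp the files in reading order; the function name and the one-minute spacing are arbitrary choices of mine.

```php
<?php
// Sketch: give each file a modification time that increases with its
// position in the reading order, so sorting by timestamp matches the TOC.
function stampInReadingOrder(array $filesInOrder, int $startTime): void
{
    foreach (array_values($filesInOrder) as $i => $file) {
        // One minute apart; the first file gets the oldest timestamp.
        touch($file, $startTime + $i * 60);
    }
    // filemtime() results are cached per path, so reset the cache.
    clearstatcache();
}

// Example usage with hypothetical file names:
// stampInReadingOrder(
//     ['introduction.html', 'post-0001.html', 'post-0002.html'],
//     strtotime('2000-01-01 00:00:00')
// );
```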
Code Optimization. I had one PHP file running everything, and it kept timing out on the server so I had to constantly re-run it. I’m sure there are ways to optimize the code.
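One simple way around the timeouts, sketched below, is to make the script resumable: skip any post that was already saved so that each re-run only does the remaining work. This is not what my original code did. The function name and the post-0001.html naming scheme are assumptions for illustration.

```php
<?php
// Sketch: download only the posts that have not been saved yet.
// $fetch is any callable that takes a URL and returns the page HTML,
// for example 'file_get_contents'.
function downloadMissing(array $urls, string $dir, callable $fetch): int
{
    $saved = 0;
    foreach (array_values($urls) as $i => $url) {
        // Zero-padded names keep alphabetical order equal to reading order.
        $path = sprintf('%s/post-%04d.html', $dir, $i + 1);
        if (is_file($path)) {
            continue; // Already saved by an earlier run.
        }
        file_put_contents($path, $fetch($url));
        $saved++;
    }
    return $saved;
}

// Example usage: re-run until it returns 0.
// $count = downloadMissing($postUrls, __DIR__ . '/posts', 'file_get_contents');
```

Passing the fetcher in as a callable also makes it easy to try the function with fake data before pointing it at a real blog.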
The Ebooks (mobi format zipped)
https://www.peterxpark.com/tynanebook
https://www.peterxpark.com/sebastianebook
I hope the blog authors don’t mind that I did this. I imagine their highest priority is reaching as many people as possible. They don’t make their livelihood off of their blog advertising, so it shouldn’t hurt them. In addition, this gives readers an easy way to read through the back catalog of posts without a computer. Be sure to subscribe to them for the latest content.
In the future, I’d like to figure out how to add the images in as well as a way to convert any blog by just inputting the address. But that would take a lot more time.
Please feel free to leave a comment below or email me with any additional comments or questions.