Category Archives: PHP Coding

PHP Coding

Blog to Ebook Conversion Program v1

Since I wrote Converting WordPress Blog to Kindle Ebook, I’ve built a PHP program that can take any WordPress blog and output a zip file of the blog’s posts. With this zip file, Calibre can create an ebook PDF or MOBI file.

The program consumes a great deal of memory and takes a few minutes to complete. For that reason, I’ve limited the script to a maximum of 20 posts; anything more simply cannot run in a shared hosting environment.

If you don’t already have it, you’ll need the ebook program Calibre to compile the HTML files and images in the zip file into a PDF, EPUB, or MOBI ebook file.

The actual program has four steps:

First, you enter the blog domain URL in the form and hit Submit. On the backend, the program will generate a text file with all the blog post URLs sorted by oldest first.

Then, the second form randomly takes 3 of those URLs and tries to figure out the page’s HTML tags for the post title and content. The biggest challenge is that each WordPress blog has a different structure, so figuring out how to extract only the title and post content can be difficult. At the bottom of the page, I provide different generic HTML tags that usually match the title and the post’s body. You can also input a custom HTML tag and attribute. Most blogs seem to use a <div id="post"></div>. Once you’ve selected the right HTML tags, the program will save all the blog posts as an HTML file, save all the embedded images, and make everything available for download as a zip file.
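To make the tag-matching idea concrete, here is a minimal sketch (not the program’s actual code) that tries a list of generic tag/attribute candidates against a sample page and reports the first one that matches exactly one element. The candidate list, the function name findMatchingCandidate, and the sample HTML are all assumptions made for illustration.

```php
<?php
// Hypothetical list of generic candidates: tag, attribute, attribute value.
$candidates = [
    ['div', 'id', 'post'],
    ['div', 'id', 'content'],
    ['div', 'class', 'entry-content'],
];

// Returns the first candidate that matches exactly one element in $html,
// or null when none does. Attribute values are compared exactly, so an
// element with several classes would not match a single class name.
function findMatchingCandidate(string $html, array $candidates): ?array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($candidates as [$tag, $attribute, $value]) {
        $query = sprintf('//%s[@%s="%s"]', $tag, $attribute, $value);
        if ($xpath->query($query)->length === 1) {
            return [$tag, $attribute, $value];
        }
    }
    return null;
}

// Stand-in for one of the randomly chosen sample posts.
$sample = '<html><body><div id="header">Blog</div>'
        . '<div class="entry-content"><p>Hello</p></div></body></html>';

$match = findMatchingCandidate($sample, $candidates);
echo $match === null ? "no match\n" : implode(' ', $match) . "\n";
```

Checking the same candidate against several sample posts, as the form does with 3 URLs, is a cheap way to catch a tag that only happens to match on one page.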

After downloading the provided file, extract its contents and drag and drop the Calibre.html file into the Calibre program.

Calibre will generate its own zip file, shown where it says “Formats: ZIP”. If you want the images included, you’ll need to manually move the images folder into Calibre’s zip file.

There are PHP libraries out there that can convert HTML files too, but they’re rather resource intensive. Maybe I’ll add one in the next version.

Some future possible upgrades:
– Have the PHP Program convert the blog automatically to MOBI/PDF rather than rely on Calibre (biggest issue is security and resource intensity)
– AJAX Interface, have everything handled on one page rather than multiple
– Frontend is rather ugly and not the most intuitive. I was focused primarily on getting it working before making it user friendly.

See it in action here:

http://www.peterxpark.com/blogebook/blogform.php


Converting WordPress Blog to Kindle Ebook

Recently, I was on a six-hour bus ride without internet. I thought it would be nice if I could read through the blog archives of my favorite authors. Most blogs are like magazines where the most recent posts stand out, but it’s difficult to read them from beginning to end like a book. Also, I prefer reading before sleeping, but I don’t want to be in front of a computer screen. Thus, the idea was born:

Figure out how to convert blogs into a Kindle-compatible ebook that can be read from start to finish.

So, I coded up a quick way to convert some of my favorite blogs into a Kindle compatible .mobi ebook. On this page, I’ll outline the steps on how I did it so that you can do the same. Note, this does require some programming knowledge. At the bottom of this page, I provide the mobi ebook for the blogs SebastianMarshall.com and Tynan.com.

To do this, you basically need an HTML copy of each individual blog post, then you create a table of contents containing a link to each post and let the calibre ebook manager do its magic.

I’m listing generic steps below. If anyone wants the full code, feel free to contact me. I coded my program very quickly without comments, and it’s not very optimized.

Step 1: Get a Listing of All the Blog Posts
Many blogs will have an archive page. For example, Tynan.com and Tim Ferriss’s blog are easy:

http://www.tynan.com/archives
http://www.fourhourworkweek.com/blog/sitemap/

However, Sebastian does not have an archive page as far as I could tell, so I had to come up with a more creative solution.

I could access all his blog posts by iterating through each of his past pages like so:

http://www.sebastianmarshall.com/page/1 (shows the most recent material)

I found his last page (as of January 2012) was http://www.sebastianmarshall.com/page/138.

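So, a normal for loop would do:
$postUrls = array();
// Pages are numbered from 1 (most recent) to 138 (oldest)
for ($i = 1; $i <= 138; $i++) {
    // go through each page: http://www.sebastianmarshall.com/page/$i
    // fetch the URLs of the individual blog posts using PHP's DOM library
    // add each URL to $postUrls
}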

Once this was completed, I had an array with each of his blog posts from most recent to oldest. However, I prefer reading from the first post onward. This shows the author’s natural progression in terms of quality of writing, evolution of ideas, and success. Simply going through the array backwards works.
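As a rough illustration of the loop body and the reversal, here is a minimal sketch that pulls post links out of listing-page HTML and then flips the order. It is not my original code: the function name extractPostLinks and the //h2/a query are assumptions (the right query depends on the blog’s theme), and two inline HTML strings stand in for pages that would normally be fetched over HTTP.

```php
<?php
// Hypothetical helper: returns the href of every link matched by $query
// in one listing page's HTML, in the order the links appear on the page.
function extractPostLinks(string $html, string $query): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query($query) as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Stand-in for two fetched listing pages, newest page first.
$pages = [
    '<html><body><h2><a href="/post-4">Four</a></h2>'
        . '<h2><a href="/post-3">Three</a></h2></body></html>',
    '<html><body><h2><a href="/post-2">Two</a></h2>'
        . '<h2><a href="/post-1">One</a></h2></body></html>',
];

$postUrls = [];
foreach ($pages as $html) {
    $postUrls = array_merge($postUrls, extractPostLinks($html, '//h2/a[@href]'));
}

// Listing pages run newest to oldest, so reverse to read oldest first.
$oldestFirst = array_reverse($postUrls);
print_r($oldestFirst);
```

Because both the pages and the posts within each page run newest to oldest, a single array_reverse over the combined list is enough to get chronological order.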

Step 2: Extracting the Post Content and Not the Layout/Misc Material

Now, each blog post is going to contain all the headers, sidebars, footers, and other content you don’t want. However, since all three of these blogs use WordPress, you can usually figure out how to extract just the body content based on the HTML structure.

For example, on Sebastian’s blog, the content is always nested inside <div id="content">CONTENT</div>.

Using PHP’s DOM library, we can extract this.
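Here is a minimal sketch of that extraction with DOMDocument and DOMXPath. The function name extractInnerHtmlById and the sample page are made up for illustration, and the sketch assumes the id is a plain identifier with no quote characters in it.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <div> whose id
// attribute equals $id, or null when no such element exists.
function extractInnerHtmlById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $node = $xpath->query('//div[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize only the children, leaving out the wrapping <div> itself.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return trim($inner);
}

// Stand-in for a downloaded blog post page.
$page = '<html><body><div id="sidebar">Links</div>'
      . '<div id="content"><h1>Title</h1><p>Body text.</p></div></body></html>';

$body = extractInnerHtmlById($page, 'content');
echo $body, "\n";
```

Returning null for a missing element makes it easy to skip or log posts whose layout differs from the rest.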

For some reason, some bizarre characters came up in the output, so I had to run the content through this to clean it up:

$strbody = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $body);

Now that we have the content, we just have to save the HTML page.

Step 3: Create a Table of Contents that Calibre Can Read

This took me a while to figure out, but calibre has a built-in ebook generator that can take a collection of HTML pages and turn them into an ebook.

http://manual.calibre-ebook.com/faq.html#how-do-i-convert-a-collection-of-html-files-in-a-specific-order

To do this, you need a basic HTML page that links to each of the individual HTML pages (one per blog post), in the order you want them to appear.
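A minimal sketch of generating such a page in PHP follows. The function name buildTocHtml, the file names, and the titles are hypothetical, and the markup is just one plausible layout: a plain list of links, one per saved post.

```php
<?php
// Builds a minimal table-of-contents page. $posts maps each saved HTML
// file name to the post title, already in reading order.
function buildTocHtml(string $bookTitle, array $posts): string
{
    $items = '';
    foreach ($posts as $file => $title) {
        $items .= sprintf(
            "    <li><a href=\"%s\">%s</a></li>\n",
            htmlspecialchars($file, ENT_QUOTES, 'UTF-8'),
            htmlspecialchars($title, ENT_QUOTES, 'UTF-8')
        );
    }

    return "<html>\n<head><title>"
        . htmlspecialchars($bookTitle, ENT_QUOTES, 'UTF-8')
        . "</title></head>\n<body>\n  <ul>\n"
        . $items
        . "  </ul>\n</body>\n</html>\n";
}

// Hypothetical file names and titles, oldest post first.
$toc = buildTocHtml('Blog Archive', [
    'post-001.html' => 'First Post',
    'post-002.html' => 'Tom & Jerry',
]);

echo $toc;
// To write it to disk: file_put_contents('toc.html', $toc);
```

Escaping the titles with htmlspecialchars keeps a stray ampersand or angle bracket in a post title from breaking the page.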

Step 4: Add TOC Page to Calibre and Create Ebook

Optional: Edit the Metadata for title, author, cover image, and so on.
Optional: Assign Chapters: I’m not even sure this is necessary, but it doesn’t hurt to try. If you want each blog post to start on a new page, then under the Structure Detection tab you need to specify the HTML that marks a chapter break.

Now, it’s as simple as hitting the convert button and waiting for the conversion to complete. Once finished, you have a Kindle-compatible ebook.

Bugs/Future Improvements:

There are a few bugs that I haven’t had the time or interest to work out yet. Just doing the above took many hours. Nevertheless, here are some bugs to keep in mind.

Embedding photos and images is still a challenge. It might be as simple as downloading the images onto my computer for Calibre to have access to, but I haven’t tried it yet.
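As a starting point for that idea, here is an untested-against-Calibre sketch that only finds the image URLs in a post’s HTML and picks a local file name for each. The helper names collectImageUrls and localImageName, the images/ folder, and the example.com URLs are all assumptions for illustration; the download itself is left as a comment.

```php
<?php
// Hypothetical helper: returns the distinct src values of all <img> tags.
function collectImageUrls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    return array_values(array_unique($urls));
}

// Hypothetical helper: the file name to use for the local copy,
// taken from the URL path so that query strings are dropped.
function localImageName(string $url): string
{
    return basename((string) parse_url($url, PHP_URL_PATH));
}

// Stand-in for the extracted body of one post.
$post = '<p><img src="http://example.com/uploads/a.jpg?w=300">'
      . '<img src="http://example.com/uploads/b.png">'
      . '<img src="http://example.com/uploads/a.jpg?w=300"></p>';

foreach (collectImageUrls($post) as $url) {
    // A real run would download here, for example with
    // copy($url, 'images/' . localImageName($url)), which needs
    // allow_url_fopen enabled and an existing images/ folder.
    echo $url, ' -> images/', localImageName($url), "\n";
}
```

The img src attributes in the saved HTML would also have to be rewritten to point at the local copies, which this sketch does not attempt.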

Some of the posts are out of order because of the order in which FTP downloaded the pages and because calibre organizes files by datetime stamp rather than by the TOC order of the posts. A related issue was that I wanted to add an introduction page at the beginning. I used this trial program to set introduction.html’s datetime stamp to a date in the past so that it would show up as the first chapter:

http://download.cnet.com/File-Property-Edit-Pro/3000-2248_4-10864040.html

You may find an early blog post at the end and vice versa. Nevertheless, most of it should be in order, and the TOC still works.

Code Optimization. I had one PHP file running everything, and it kept timing out on the server, so I had to constantly re-run it. I’m sure there are ways to optimize the code.

The Ebooks (mobi format zipped)

http://www.peterxpark.com/tynanebook

http://www.peterxpark.com/sebastianebook

I hope the blog authors don’t mind that I did this. I imagine their highest priority is reaching as many people as possible. They don’t make their livelihood off of their blog advertising, so it shouldn’t hurt them. In addition, this gives readers an easy way to read through the back catalog of posts without a computer. Be sure to subscribe to them for the latest content.

In the future, I’d like to figure out how to add the images in as well as a way to convert any blog by just inputting the address. But that would take a lot more time.

Please feel free to leave a comment below or email me with any comments or questions.