Scrapy + WP-CLI + Google Cache: Restore lost posts

drrobotnik · Last Updated: February 25, 2016

Always back up your database, right? In case you're learning that the hard way, inheriting a bad situation, or have some other clever use for it, here's one way to restore lost posts.

In order to use this method you must have a basic understanding of:

  1. shell: wget, wp-cli, composer
  2. XPath
  3. Python

Google Cache

Since every indexed site has a copy in Google's cache, you can recover your lost content from there while they still have it. Being responsible developers, we're going to download each page locally so that we only touch Google's servers once per page. This requires two steps.

  • A list of the pages, one per line. I recommend creating a Google Spreadsheet and, within the first column, executing the ImportXML function.

=ImportXML("http://www.google.com/search?q=site:{siteurl}&num=100", "//h3[@class='r']/a/@href")

If you have more than 100 results, add &start=101 after num=100 in a second formula on row 101 (see the example below). After you have a clean list, export it as a plain text file.
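For instance, the row-101 formula would look something like this (same XPath, just the added start parameter):

=ImportXML("http://www.google.com/search?q=site:{siteurl}&num=100&start=101", "//h3[@class='r']/a/@href")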

  • Download these cached pages so that we're not hitting the server any more than we need to. In the following command we use --user-agent to identify as a regular browser, --input-file to reference the file that has the list of pages, and --wait to slow the download interval to 6 seconds so Google doesn't block us. Again, we're using our powers for good.

$ wget -v --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --input-file input.txt --wait=6
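One caveat: the search results give you the live URLs. To fetch Google's cached copy of each page rather than the (possibly dead) original, each line of input.txt generally needs to be rewritten to point at the cache endpoint, something like the line below. The exact form can vary, so verify one URL in your browser first.

http://webcache.googleusercontent.com/search?q=cache:siteurl.com/post-name/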

Scrapy

Installing Scrapy might take a little work; I had to make sure libxml2 was installed first. Once it's installed, you can follow the official documentation to set up the tutorial project and modify it to fit your needs.
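Assuming you have pip available (and the libxml2 dependency sorted out), installation is usually just:

$ pip install Scrapy

Then create the tutorial project: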

$ scrapy startproject tutorial

I'm not going to outline every step of getting Scrapy to do your scraping, but you need a basic understanding of XPath and a little bit of Python. Every site is formatted differently, so your XPath will vary. I'll share some useful bits for extracting your WordPress post content, title, and post date so that you can reimport them via WP-CLI.

Inside def parse(self, response): is where you'll add your parsing logic (go fig).

# these imports go at the top of your spider file
from datetime import datetime

from scrapy.selector import Selector
from tutorial.items import RecoverItem


def parse(self, response):
    # store the original site url so we can strip it out of links later
    url = "http://siteurl.com/"

    sel = Selector(response)
    # enter the proper xpath to find post_title, link, post_date, and post_content
    title = sel.xpath('//h1[@class="entry-title"]/a/text()').extract()
    link = sel.xpath('//h1[@class="entry-title"]/a/@href').extract()
    content = sel.xpath('//div[@class="entry-content"]').extract()
    date = sel.xpath('//header/div[@class="entry-meta"]/span/text()').extract()

    # extract() returns a list; pop() pulls each value out as a clean string
    link = link.pop()
    title = title.pop()
    date = date.pop()

    # NOTE: This may vary in your config.
    # Build a filename from the permalink structure: `siteurl.com/post-name` -> `post-name.html`
    file_name = str(link).replace(url, '')

    # content contains formatting that will cause errors; encode it to ascii
    content = content.pop().encode('ascii', errors='ignore')

    # The date format is probably not going to be `%Y-%m-%d %H:%M:%S`.
    # Capture the date in its current format and convert it.
    d = datetime.strptime(date, '%m.%d.%y')
    day_string = d.strftime('%Y-%m-%d 00:00:00')

    items = []
    item = RecoverItem()

    # Scrapy lets you store this info to use later; we'll store it as in the tutorial lesson.
    # It's also nice because it prints in the terminal once the command is run.
    item['title'] = title
    item['link'] = file_name
    item['desc'] = content  # already ascii-encoded above
    item['date'] = day_string
    items.append(item)

    # build our wp-cli insert post command
    write = 'wp post create pages/' + file_name + '.html '
    write += '--post_title="' + str(item['title']) + '" '
    write += '--post_date="' + day_string + '" && '

    # write the post content out to its own html file
    with open('pages/' + file_name + '.html', 'wb') as page:
        page.write(content)

    # append the wp-cli command to a running list
    with open("insert.txt", "a") as myfile:
        myfile.write(write)

    return items
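For reference, RecoverItem comes from the project's items.py. The exact fields are up to you; a minimal sketch matching the fields used above would be:

import scrapy

class RecoverItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    date = scrapy.Field()

The spider class housing parse() should also be given name = "recover" so the crawl command below picks it up.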

Once you've modified it to your liking, run:

$ scrapy crawl recover

If all went well, you should now have a new folder called pages with a neat list of HTML pages containing each post's content. You should also have a file named insert.txt with a messy-looking chain of wp-cli commands.
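Based on the write string built in parse(), each entry in insert.txt will look roughly like this (the post name and date here are just illustrative):

wp post create pages/my-lost-post.html --post_title="My Lost Post" --post_date="2015-06-01 00:00:00" && ...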

WP-CLI

If you don't have WP-CLI installed locally, follow the official installation instructions. I recommend testing this on your local build of the site. From the root level of the project, simply paste in the long list of wp post create commands found in your insert.txt.
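At the time of writing, the recommended route is downloading the Phar build (Composer works too), roughly:

$ curl -O https://raw.githubusercontent.com/wp-cli/builds/gh-pages/phar/wp-cli.phar
$ chmod +x wp-cli.phar
$ sudo mv wp-cli.phar /usr/local/bin/wp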

WP-CLI will let you know whether each post was created successfully, but you should still check that every post was inserted properly, specifically that the titles, post dates, and content all came through intact.
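A quick way to spot-check is listing the freshly created drafts:

$ wp post list --post_status=draft --fields=ID,post_title,post_date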

Note: if you have any post titles with special characters, you'll need to remove them or modify the provided source to handle them more strictly.

You should now have your posts recovered, with their content saved as drafts. You'll probably still need to do some TLC, and not every bit of data can be recovered, but you're in a much better position than you were before.

Please back up your database often. <3