Scrapy + WP-CLI + Google Cache: Restore lost posts
Always back up your database, right? If you're learning that lesson the hard way, inheriting a bad situation, or have another clever use in mind, here's one way to restore lost posts.
To use this method you'll need a basic understanding of:
- shell: wget, wp-cli, composer
- XPath
- Python
Google Cache
Since Google caches the pages it indexes, you can recover your lost content from Google Cache while it's still available. Being responsible developers, we'll download each page so that we only touch it once. This requires two steps.
- A list of the pages, one per line. I recommend creating a Google Spreadsheet and running the ImportXML function in the first column:
=ImportXML("http://www.google.com/search?q=site:{siteurl}&num=100", "//h3[@class='r']/a/@href")
If you have more than 100 results, add a second ImportXML formula in row 101 with &start=101 appended after num=100. Once you have a clean list, export it as a plain text file.
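The pagination rule above can be sketched in a few lines of Python. This is only an illustration of how the &start offsets line up with batches of 100 results; search_urls is a hypothetical helper, and siteurl.com is a placeholder for your own domain.

```python
# Build one Google search URL per batch of 100 results.
# The first batch has no &start parameter; later batches start at 101, 201, ...
def search_urls(site, total_results, per_page=100):
    urls = []
    for start in range(0, total_results, per_page):
        url = "http://www.google.com/search?q=site:%s&num=%d" % (site, per_page)
        if start:
            url += "&start=%d" % (start + 1)
        urls.append(url)
    return urls

for u in search_urls("siteurl.com", 250):
    print(u)
```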
- Download the cached pages so that we aren't hitting the server any more than we need to. In the following command we use --user-agent to identify as a browser, --input-file to reference the file that has the list of pages, and --wait to slow the download interval to 6 seconds so Google doesn't block us. Again, we're using our powers for good.
$ wget -v --user-agent="Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" --input-file input.txt --wait=6
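Note that the spreadsheet export contains the live URLs, while wget needs the cache URLs. One way to build the input file is to prefix each live URL with Google's webcache endpoint. A minimal sketch, assuming one URL per line; to_cache_urls is a hypothetical helper, and the sample URLs are placeholders:

```python
# Google Cache serves a page at webcache.googleusercontent.com with a
# `cache:` query in front of the original URL.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

def to_cache_urls(urls):
    # skip blank lines, strip whitespace, and prepend the cache endpoint
    return [CACHE_PREFIX + u.strip() for u in urls if u.strip()]

# sample: the list you exported from the spreadsheet
live = ["http://siteurl.com/some-post", "http://siteurl.com/another-post"]
for u in to_cache_urls(live):
    print(u)
```

Write the result to input.txt and hand that to wget's --input-file.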
Scrapy
Installing Scrapy might take a little bit of work; I had to make sure libxml2 was installed first. Once it's installed, you can follow their documentation to set up their tutorial project and modify it based on your needs.
$ scrapy startproject tutorial
I'm not going to outline every step of getting Scrapy to do your scraping, but you need a basic understanding of XPath and a little bit of Python. Each site is formatted differently, so your XPath expressions will vary. I'll share some useful bits for grabbing your WordPress post content, title, and post date so that you can reimport them via WP-CLI.
Inside def parse(self, response): is where you'll add your parsing logic (go fig).
    # imports needed at the top of your spider file
    from datetime import datetime
    from scrapy.selector import Selector
    from tutorial.items import RecoverItem

    def parse(self, response):
        # store the original site url to strip out of the permalink later
        url = "http://siteurl.com/"
        sel = Selector(response)
        # enter the proper xpath to find post_title, link, post_date, and post_content
        title = sel.xpath('//h1[@class="entry-title"]/a/text()').extract()
        link = sel.xpath('//h1[@class="entry-title"]/a/@href').extract()
        content = sel.xpath('//div[@class="entry-content"]').extract()
        date = sel.xpath('//header/div[@class="entry-meta"]/span/text()').extract()
        # each result is a list; pop() converts it to a clean string
        link = link.pop()
        title = title.pop()
        date = date.pop()
        # NOTE: this may vary in your config.
        # Create a filename based on the site permalink structure:
        # `siteurl.com/post-name` becomes `post-name.html`
        file_name = str(link).replace(url, '')
        # content contains formatting that will cause errors; encode it to ascii
        content = str(content.pop().encode('ascii', errors='ignore'))
        # The date format is probably not going to be `%Y-%m-%d %H:%M:%S`.
        # Capture the date in its current format and convert it.
        d = datetime.strptime(date, '%m.%d.%y')
        day_string = d.strftime('%Y-%m-%d 00:00:00')
        items = []
        item = RecoverItem()
        # Scrapy lets you store this info to use later; we'll store it as in the
        # tutorial lesson. Also nice because it outputs in the terminal once run.
        item['title'] = title
        item['link'] = file_name
        item['desc'] = content
        item['date'] = day_string
        items.append(item)
        # build our wp-cli insert post command
        write = 'wp post create pages/' + file_name + '.html '
        write += '--post_title="' + str(item['title']) + '" '
        write += '--post_date="' + day_string + '" && '
        open('pages/' + file_name + '.html', 'wb').write(content)
        with open("insert.txt", "a") as myfile:
            myfile.write(write)
        return items
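As a sanity check on the date conversion step above, here's the strptime/strftime round trip in isolation. The "03.15.14" sample and the %m.%d.%y format are assumptions; match them to however your theme printed dates.

```python
from datetime import datetime

# Parse the theme's display format, then emit the MySQL-style
# datetime that wp post create --post_date expects.
d = datetime.strptime("03.15.14", "%m.%d.%y")
print(d.strftime("%Y-%m-%d 00:00:00"))  # 2014-03-15 00:00:00
```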
Once you've modified it to your liking, run:
$ scrapy crawl recover
If all went well you should now have a new folder called pages with a neat list of HTML pages containing each post's content. You should also have a file named insert.txt with a messy-looking wp-cli command.
WP-CLI
If you don't have WP-CLI installed locally, follow these instructions. I recommend testing this on your local build of the site. From the root level of the project, simply paste the long list of wp post create commands found in your insert.txt.
WP-CLI will let you know if each post was created successfully, but you should still check that every post was inserted properly: specifically, that the titles, post dates, and content came through correctly.
Note: if you have any post titles with special characters you'll need to remove them or modify the provided source to be more strict.
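One way to handle that is to strip anything outside a conservative whitelist before the title is interpolated into the shell command. This is a hypothetical clean_title helper, not part of the spider above; the whitelist is an assumption you should tune to your titles.

```python
import re

def clean_title(title):
    # keep letters, digits, underscores, whitespace, and basic punctuation;
    # drop everything else (quotes, ampersands, shell metacharacters, ...)
    return re.sub(r"[^\w\s.,:;!?'-]", '', title)

print(clean_title('My "Great" Post & More'))
```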
You should now have recovered the posts, created as drafts with their content intact. You'll probably still end up doing some TLC, and not all data can be recovered, but you're in a much better position than you were before.
Please back up your Database often. <3