Longer MapReduce Slices on AppEngine
The new version of AppEngine's MapReduce framework relies on the blobstore to store the intermediate results of mappers. At the beginning of each slice, the shard's intermediate result file on the blobstore is opened and kept open for the duration of the slice, with key/value pairs appended as they're emitted by the mapper.
The problem is that the Files API only allows you to hold a file open for about 30 seconds; any longer and you'll receive an opaque "ApplicationError: 10" exception.
Sometimes you want or need slices to run longer than the default 30 seconds: maybe it takes more than 30 seconds to map over a single input element, or maybe your mapper loads some state at the beginning of each slice that you'd prefer not to reload twice a minute.
We can fix this by buffering the output of the mappers and only opening the blobstore file when we flush the results to the blobstore. (See http://code.google.com/p/appengine-mapreduce/issues/detail?id=144#c3) As long as each flush can be written to the blobstore in under 30 seconds, we can make our slices as long as necessary, particularly if we run the job on an AppEngine backend rather than on frontend instances.
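To make the buffering idea concrete, here is a minimal sketch. It is not the actual patch from the issue linked above, and it does not touch the real Files API: BufferedShardWriter and FileOpener are hypothetical names, and FileOpener stands in for whatever opens the shard's blobstore file for appending. The tab-separated record format is also just an assumption for illustration. The point is only that the file is held open for the duration of a flush rather than for the whole slice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: buffer emitted pairs in memory, open the file only to flush. */
class BufferedShardWriter {

    /** Hypothetical stand-in for opening the shard's intermediate file for append. */
    interface FileOpener {
        OutputStream openForAppend() throws IOException;
    }

    private final FileOpener opener;
    private final int flushThresholdBytes;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private int openCount = 0;

    BufferedShardWriter(FileOpener opener, int flushThresholdBytes) {
        this.opener = opener;
        this.flushThresholdBytes = flushThresholdBytes;
    }

    /** Called for every key/value pair the mapper emits; no file is open here. */
    void emit(String key, String value) {
        byte[] record = (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
        buffer.write(record, 0, record.length);
        if (buffer.size() >= flushThresholdBytes) {
            flush();
        }
    }

    /** Opens the file, appends the buffered records, and closes it again. */
    void flush() {
        if (buffer.size() == 0) {
            return;
        }
        try (OutputStream out = opener.openForAppend()) {
            buffer.writeTo(out);
        } catch (IOException e) {
            // The buffer is left intact so the caller can retry the flush.
            throw new UncheckedIOException(e);
        }
        buffer.reset();
        openCount++;
    }

    /** Number of times the underlying file has been opened so far. */
    int getOpenCount() {
        return openCount;
    }
}
```

The mapper would call emit() for each pair and flush() once at the end of the slice. The flush threshold should be small enough that a single flush comfortably finishes within the 30-second limit.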
For example, define a dynamic backend with up to 10 instances in your backends.xml file:
<backends>
  <backend name="mapreduce-backend">
    <class>B2</class>
    <instances>10</instances>
    <options>
      <public>false</public>
      <dynamic>true</dynamic>
    </options>
  </backend>
</backends>
Now you can set up your mapreduce job to run on this backend, with longer slices:
MapReduceSettings settings = new MapReduceSettings()
    .setBackend("mapreduce-backend")
    .setWorkerQueueName("mapreduce-workers")
    .setControllerQueueName("default")
    .setMillisPerSlice(5 * 60 * 1000); // five-minute slices
You can pick up the patched version from bedatadriven's Maven repository:
<dependency>
  <groupId>com.google.appengine</groupId>
  <artifactId>appengine-mapreduce</artifactId>
  <version>r351-bedatadriven1</version>
</dependency>
<repository>
  <id>bedatadriven-public</id>
  <name>Bedatadriven Public Repository</name>
  <url>http://nexus.bedatadriven.com/content/groups/public</url>
</repository>