Last Updated: February 25, 2016 · miketheman

Using Chef's remote_file with GitHub raw content

Chef is a really awesome configuration management tool, and there may be times when you want to use it to place files from sources that are not your own on a target system.

It can be very beneficial to use resources that live outside your environment, especially ones that someone else is responsible for maintaining. In some cases, these might be files hosted on GitHub.

Chef provides a git resource to check out a repository to a given location, but this might be overkill if all you want is a single file.
Another option would be to copy the entire file into your cookbook and use it with the cookbook_file resource. However, this pollutes your cookbook with code that you may not be responsible for, and the copy may never get updated when the source file changes.

Chef provides a remote_file resource that has a bunch of sane default flags, and is built to retrieve files from remote locations, such as HTTP URLs, FTP, and more.

Since GitHub provides content over HTTP(s), I will be using that endpoint.

Here's an example of a remote_file resource to retrieve the ps_mem tool written by P. Brady.

remote_file '/usr/local/bin/ps_mem' do
  source 'https://raw.githubusercontent.com/pixelb/ps_mem/master/ps_mem.py'
  owner 'root'
  group 'root'
  mode '755'
end

This is pretty straightforward: we provide a source attribute telling the resource where to look for the file, use the resource's name as the target destination in /usr/local/bin, and set the owner, group, and mode so the file is executable.

When Chef executes this resource for the first time, it will perform an HTTP GET request to the web server hosting the source, download it to a /tmp location, calculate a checksum, and store some metadata about the file. Finally, it places the file into the correct destination and sets the ownership and mode.
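To make that bookkeeping step concrete, here is a minimal sketch in plain Ruby of what building such metadata could look like. This is not Chef's internal code: `build_cache_metadata` is a hypothetical helper, the temp file stands in for the downloaded content, and the ETag value is made up. It only assumes that the checksum is a SHA-256 digest of the file.

```ruby
require 'digest'
require 'json'
require 'tempfile'

# Hypothetical helper (not part of Chef): collect the validators from the
# HTTP response plus a SHA-256 checksum of the downloaded file.
def build_cache_metadata(path, etag:, last_modified:)
  {
    'etag' => etag,
    'mtime' => last_modified,
    'checksum' => Digest::SHA256.file(path).hexdigest
  }
end

# A temp file stands in for the freshly downloaded content.
Tempfile.create('ps_mem') do |tmp|
  tmp.write("#!/usr/bin/env python\n")
  tmp.flush
  metadata = build_cache_metadata(
    tmp.path,
    etag: '"example-etag"', # made-up value for illustration
    last_modified: 'Mon, 30 Jun 2014 19:30:27 GMT'
  )
  puts JSON.pretty_generate(metadata)
end
```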

This can be seen via a debug log of an initial run of this particular resource.

Here's a prettier version of the cached metadata file, located at /var/chef/cache/remote_file/<url>-<hash of url>.json:

{
"etag":"\"64d86a121a11f2ead5cfff2e2702e0ab3f4441f6\"",
"mtime":"Mon, 30 Jun 2014 19:30:27 GMT",
"checksum":"d5856de8f1a56a18c5e6c15cd529ad070a63a1d789c8286a3567371b069d82af"
}

The next time we run Chef with the same resource, my expectation would be that since the source file hasn't been modified, and the on-disk version hasn't been modified, nothing will happen.
One of Chef's design goals is to create idempotent resources, so executing the same resources multiple times will not change any state, no matter how many times they are executed, unless something has actually changed.

However, when we run Chef again, we can see that the file is being downloaded again to a new /tmp location, and since the calculated checksum of the existing file matches the freshly downloaded one, nothing happens.

The problem is that we are now downloading the file on every single Chef run, calculating the checksum locally, and then effectively throwing the result away because it's not needed.

We could add the checksum of the file to our remote_file resource, but it will still download the file every run and compare it with our provided checksum.
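The wasted work is easy to see in a small sketch. Assuming the comparison is a plain SHA-256 check between the file on disk and the fresh download (the helper name `content_changed?` is my own, not Chef's), the whole download only serves to produce a "nothing to do" answer:

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true only when the two files differ by SHA-256 digest.
def content_changed?(existing_path, downloaded_path)
  Digest::SHA256.file(existing_path).hexdigest !=
    Digest::SHA256.file(downloaded_path).hexdigest
end

# Two temp files stand in for the installed file and the re-downloaded copy.
Tempfile.create('existing') do |existing|
  Tempfile.create('downloaded') do |downloaded|
    [existing, downloaded].each do |file|
      file.write('same content')
      file.flush
    end
    puts content_changed?(existing.path, downloaded.path)
  end
end
```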

On the second run, we can see Chef trying to use well-known HTTP caching mechanisms to avoid downloading the file again, since we don't expect the remote file to change that often.
The expected response code is 304 (Not Modified), but we still get a 200 with all of the content.

We can see that we are submitting the metadata in the request headers sent to GitHub, so why is GitHub not respecting the If-Modified-Since header and giving us a 304?

There are GitHub-specific client libraries that wrap the API and handle all the edge cases, but I don't want to have to worry about the specifics.
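For reference, here is a rough sketch of how the cached metadata maps onto standard HTTP conditional request headers: the ETag goes out as If-None-Match and the modification time as If-Modified-Since. The helper `conditional_headers` is hypothetical and only illustrates the mapping, not Chef's actual implementation.

```ruby
# Hypothetical helper: turn cached metadata into conditional request headers.
def conditional_headers(metadata)
  headers = {}
  headers['If-None-Match'] = metadata['etag'] if metadata['etag']
  headers['If-Modified-Since'] = metadata['mtime'] if metadata['mtime']
  headers
end

# Values taken from the cached metadata file shown earlier.
cached = {
  'etag' => '"64d86a121a11f2ead5cfff2e2702e0ab3f4441f6"',
  'mtime' => 'Mon, 30 Jun 2014 19:30:27 GMT'
}

conditional_headers(cached).each { |name, value| puts "#{name}: #{value}" }
```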

In further debugging with curl and Feedbot's If-Modified-Since tool, I can see that GitHub's raw endpoint doesn't support the If-Modified-Since request header. Telling my resource not to send it means the date/time is no longer used as a qualifier. Instead, only the ETag is sent (in the If-None-Match header), asking the web server whether the content still matches this particular ETag and to send us the content only if it doesn't.

remote_file '/usr/local/bin/ps_mem' do
  source 'https://raw.githubusercontent.com/pixelb/ps_mem/master/ps_mem.py'
  use_last_modified false # true by default
  owner 'root'
  group 'root'
  mode '755'
end

This has the desired effect: when making the HTTP request to the source, we provide only the ETag for this file, and the web server responds correctly. See full debug output.

So feel free to use the remote_file resource when picking up a file from GitHub; just be sure to send the correct headers to avoid extra work for all parties: GitHub's servers serving the file, and the local chef-client downloading, checksumming, and diffing it.
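If you want to reproduce the ETag-only exchange outside of Chef, something like the following sketch with Ruby's standard Net::HTTP should be close. The helper names (`etag_only_request`, `fetch_if_changed`, `action_for`) are my own for illustration. The demo at the bottom only builds the request and never touches the network; calling `fetch_if_changed` would perform the real request.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: a GET request carrying only the ETag validator.
def etag_only_request(url, etag)
  request = Net::HTTP::Get.new(URI(url))
  request['If-None-Match'] = etag
  request
end

# Hypothetical helper: send the request (this one does hit the network).
def fetch_if_changed(url, etag)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(etag_only_request(url, etag))
  end
end

# Hypothetical helper: decide what to do based on the response status.
def action_for(response_code)
  case response_code.to_s
  when '304' then :keep_existing_file # content unchanged, nothing downloaded
  when '200' then :replace_file       # server sent the full content
  else :unexpected_response
  end
end

request = etag_only_request(
  'https://raw.githubusercontent.com/pixelb/ps_mem/master/ps_mem.py',
  '"64d86a121a11f2ead5cfff2e2702e0ab3f4441f6"'
)
puts "If-None-Match: #{request['If-None-Match']}"
puts "If-Modified-Since: #{request['If-Modified-Since'].inspect}"
puts action_for(304)
```

Keeping the request building separate from the sending makes it easy to inspect exactly which headers would go out before involving GitHub at all.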

Happy cooking!

1 Response

How can I use file:// instead of http://?

My recipe:
remote_file "/opt/tomcat/webapps" do
  source "file:///tmp/sample.war"
  mode 0775
  owner "root"
  group "root"
  backup 5
end

But I am getting this error message:
No such file or directory - /tmp/sample.war.war

over 1 year ago