The ‘backup’ gem needs a HOME

November 15, 2011

TL;DR

The backup gem contains a bug that causes it to crash when the HOME environment variable is not set. Notably, this is the case when jobs are started by ‘god’.

I have patched this and will soon send a pull request to the authors of the gem.

Read on to get a few hints on how to track down issues like this one.

The story of a painful bug hunt

0. Background

On several projects I’m using god to start and monitor daemon processes, especially resque workers. On one particular project, this just stopped working recently and as a workaround I had to start the resque workers manually under tmux.

Today I finally took the time to figure out what was going on.

1. Symptoms

In my god.conf I have a ‘resque’ watch configured which starts five workers. Here is the relevant snippet from god.conf:

num_workers = rails_env == 'production' ? 5 : 2
num_workers.times do |num|
  God.watch do |w|
    w.name     = "resque-#{num}"
    w.group    = 'resque'
    w.interval = 60.seconds
    w.env      = {"QUEUE"=>"*", "RAILS_ENV"=>rails_env}
    w.start    = "bundle exec bootup_rake -f #{rails_root}/Rakefile environment resque:work"
    w.dir      = "#{rails_root}"
...

When doing a “god start resque”, god tries to start up the workers but they crash almost immediately. If I do “ps aux | grep resque” within a few seconds I see the processes but then they just disappear.

2. What is god saying?

Finding the scroll that collects the utterings of god is easy:

/var/log/god.log

However, it quicly becomes apparent that god has nothing interesting to say:

I [2011-11-15 16:24:57]  INFO: Using pid file directory: /var/run/god
I [2011-11-15 16:30:55]  INFO: Syslog enabled.
I [2011-11-15 16:30:55]  INFO: Using pid file directory: /var/run/god
I [2011-11-15 16:39:30]  INFO: Syslog enabled.
I [2011-11-15 16:39:30]  INFO: Using pid file directory: /var/run/god
...

A bit of googling suggests that I try to run god in the foreground to see what is going on.

3. Non-daemonized god

So let’s see how god is being invoked:

root@foo:~# ps aux | grep god
root     27956  4.5  0.2 179304 19132 pts/1    Sl   10:12   \
   0:00 /usr/local/rvm/rubies/ruby-1.9.2-p290/bin/ruby      \
   /usr/local/rvm/gems/ruby-1.9.2-p290/bin/god -P /var/run/god.pid -l /var/log/god.log

We need to make a few tweaks:

  • don’t send output to a log file (remove -l)
  • load the config file with -c <config>
  • keep it in the foreground with -D

So our command becomes:

root@foo:~# /usr/local/rvm/rubies/ruby-1.9.2-p290/bin/ruby \ 
   /usr/local/rvm/gems/ruby-1.9.2-p290/bin/god             \
   -P /var/run/god.pid -c /etc/god.conf -D

This yields slightly more interesting output. Here are the lines for the worker ‘resque-1′:

I [2011-11-16 10:17:13]  INFO: resque-1 move 'unmonitored' to 'init'
I [2011-11-16 10:17:13]  INFO: resque-1 moved 'unmonitored' to 'init'
I [2011-11-16 10:17:13]  INFO: resque-1 [trigger] process is not running (ProcessRunning)
I [2011-11-16 10:17:13]  INFO: resque-1 move 'init' to 'start'
I [2011-11-16 10:17:13]  INFO: resque-1 start: bundle exec bootup_rake \ 
   -f /var/www/foo/current/Rakefile environment resque:work
I [2011-11-16 10:17:13]  INFO: resque-1 moved 'init' to 'start'
I [2011-11-16 10:17:13]  INFO: resque-1 [trigger] process is running (ProcessRunning)
I [2011-11-16 10:17:13]  INFO: resque-1 move 'start' to 'up'
I [2011-11-16 10:17:13]  INFO: resque-1 moved 'start' to 'up'
I [2011-11-16 10:17:13]  INFO: resque-1 [ok] memory within bounds [8952kb] (MemoryUsage)
I [2011-11-16 10:17:14]  INFO: resque-1 [ok] process is running (ProcessRunning)

OK, maybe that’s not so interesting after all. God starts the process and is satisfied with that.

What IS interesting is the fact that this actually works! The resque workers keep running and start processing jobs.

4 WTF?

So when I start god non-daemonized, it works. When god is started with ‘service god start’, it does not.

I need a way to see the log output from god when it is running in daemonized mode. This turns out to be as simple as:

god log resque

By default this gives me the output from the first worker in the group. I won’t repeat it here, but it is exactly the same as when running god in the foreground.

It seems to be safe to assume that god is behaving the same way whether started by me or via /etc/init.d/god. To be sure I also tried starting it by executing the exact two lines from this file. Same result.

5. Inspecting the environment

I have suspected for a while that this issue has to do with the environment in which the rake command is being run. This seems to be the only major source of difference between the working and the non-working experiments.

So I need to know the environment in which the failing rake invocations are being run. I know that for all running processes, the environment is available in the file /proc/<pid>/environ. But how can I get to this file when the process dies within seconds?

My solution is to run the following command immediately after god tries to start the workers:

less -f /proc/`ps aux | grep resque | head -1 | cut -d ' ' -f 4`/environ

What this does is

  • ps aux | grep resque finds the process info for the about-to-die resque workers
  • head -1 grabs the first one
  • cut -d ‘ ‘ -f 4 cuts out the PID (fourth field when using a space as the delimiter)
  • This PID is then used as part of the path passed to the outer less command

The environ file is in some kind of binary format (which is why I need -f on the less command) the details of which I haven’t bothered to look up.

I don’t really need to, because after running the above command I see this in my iTerm:

GEM_HOME=/usr/local/rvm/gems/ruby-1.9.2-p290^@TERM=xterm^@IRBRC=/usr/local/rvm/r\
ubies/ruby-1.9.2-p290/.irbrc^@MY_RUBY_HOME=/usr/local/rvm/rubies/ruby-1.9.2-p290\
^@QUEUE=*^@rvm_path=/usr/local/rvm^@PATH=/usr/local/rvm/gems/ruby-1.9.2-p290/bin\
:/usr/local/rvm/gems/ruby-1.9.2-p290@global/bin:/usr/local/rvm/rubies/ruby-1.9.2\
-p290/bin:/usr/local/rvm/bin:/var/www/foo/shared/bundle/ruby/1.9.1/bin:/usr/loca\
l/rvm/gems/ruby-1.9.2-p290/bin:/usr/local/rvm/gems/ruby-1.9.2-p290@global/bin:/u\
sr/local/rvm/rubies/ruby-1.9.2-p290/bin:/usr/local/rvm/bin:/usr/local/sbin:/usr/\
local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games^@PWD=/var/www/foo/releases/20\
111115153743^@LANG=en_US.UTF-8^@SHLVL=1^@BUNDLE_GEMFILE=/var/www/foo/releases/20\
111115153743/Gemfile^@RAILS_ENV=production^@rvm_ruby_string=ruby-1.9.2-p290^@GEM\
_PATH=/usr/local/rvm/gems/ruby-1.9.2-p290:/usr/local/rvm/gems/ruby-1.9.2-p290@gl\
obal^@RUBYOPT=-I/usr/local/rvm/gems/ruby-1.9.2-p290/gems/bundler-1.0.18/lib -rbu\
ndler/setup^@BUNDLE_BIN_PATH=/usr/local/rvm/gems/ruby-1.9.2-p290/gems/bundler-1.\
0.18/bin/bundle^@RUBY_VERSION=ruby-1.9.2-p290^@

Notice that the variables are separated by some byte that less renders as “^@”. Good!

6. Reproducing the bug

Off to Emacs to transform this into something useful. I won’t bore you with the details, but a quick keyboard macro transforms the above into:

export GEM_HOME=/usr/local/rvm/gems/ruby-1.9.2-p290
export TERM=xterm
export IRBRC=/usr/local/rvm/rubies/ruby-1.9.2-p290/.irbrc
export MY_RUBY_HOME=/usr/local/rvm/rubies/ruby-1.9.2-p290
export QUEUE=*
export rvm_path=/usr/local/rvm
export PATH=/usr/local/rvm/gems/ruby-1.9.2-p290/bin:/usr/local/rvm/gems/ruby-1.9.2-p290@global/bin:\
   /usr/local/rvm/rubies/ruby-1.9.2-p290/bin:/usr/local/rvm/bin:/var/www/foo/shared
export PWD=/var/www/foo/releases/20111115153743
export LANG=en_US.UTF-8
export SHLVL=1
export BUNDLE_GEMFILE=/var/www/foo/releases/20111115153743/Gemfile
export RAILS_ENV=production
export rvm_ruby_string=ruby-1.9.2-p290
export GEM_PATH=/usr/local/rvm/gems/ruby-1.9.2-p290:/usr/local/rvm/gems/ruby-1.9.2-p290@global
export RUBYOPT=-I/usr/local/rvm/gems/ruby-1.9.2-p290/gems/bundler-1.0.18/lib -rbundler/setup
export BUNDLE_BIN_PATH=/usr/local/rvm/gems/ruby-1.9.2-p290/gems/bundler-1.0.18/bin/bundle
export RUBY_VERSION=ruby-1.9.2-p2

Back on the server, I clear all existing environment vars with a snippet I found here:

unset  $ ( /usr/bin/env | egrep '^(\w+)=(.*)$' | \
  egrep -vw 'PWD|USER|LANG' | /usr/bin/cut -d= -f1);

Next, I paste all the exports above into iTerm and finally I am ready to try starting the worker again.

Voila!! It crashes:

rake aborted!
can't convert nil into String
/var/www/foo/shared/bundle/ruby/1.9.1/gems/backup-3.0.16/lib/backup.rb:40:in `join'
/var/www/foo/shared/bundle/ruby/1.9.1/gems/backup-3.0.16/lib/backup.rb:40:in `<module:Backup>'
/var/www/foo/shared/bundle/ruby/1.9.1/gems/backup-3.0.16/lib/backup.rb:8:in `<top (required)>'
...

Here are the lines around line 40 of backup.rb:

##                                                                                                                                                                                                                                                                                                                                                                      
# Backup's Environment paths                                                                                                                                                                                                                                                                                                                                            
PATH               = File.join(ENV['HOME'], 'Backup')
DATA_PATH          = File.join(ENV['HOME'], 'Backup', 'data')

Ah, so the backup gem assumes that the HOME environment variable is set. If not, it crashes.

7. Done

Fortunately the source for “backup” is on Github, so I could easily fork the project and apply the very simple patch that was needed.

About these ads

5 Responses to “The ‘backup’ gem needs a HOME”

  1. dave f Says:

    how is this related to emacs?

  2. chopmo Says:

    You’re right, it isn’t really. Removed that tag.

  3. David Says:

    It was probably tagged with emacs because of the environ reformatting? But normally just piping through “tr ” ‘\n'” is enough for this. No need to bother emacs with it.

    But you don’t even have to go over the proc/…/environ if you just use the ‘e’ option of ps. (then maybe use “tr ‘ ‘ ‘\n'” to get in line breaks)

    • chopmo Says:

      David,

      Yes, it was because of the reformatting that I tagged it with emacs. Originally I went into more detail with this part, but I edited that out before publishing.

      Great tips about tr (a tool I need to learn how to use) and the ‘-e’ option to ps, thanks!

      • David Says:

        My pleasure. The decisive first parameter of tr (backslash 0) seems to have filtered out somehow though.


Comments are closed.

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: