When I was writing mustardwatch, I realized that the Linux ptrace API was even
more complicated than I remembered.

It was impossible to handle stopping signals correctly before PTRACE_SEIZE,
introduced in Linux 3.4 (2012). If you don't need to support earlier kernels,
simplify your life by only using PTRACE_SEIZE. I'll treat all APIs that aren't
compatible with SEIZE (e.g. PTRACE_TRACEME) as deprecated, and ignore them here.

In particular, everything here assumes the following rules:
* Always attach with PTRACE_SEIZE.
* Always set PTRACE_O_TRACESYSGOOD.

Here's the way ptrace works:

* You (the tracer) seize a tracee.
* The tracee can be running, or in several kinds of stopped states.
* You can call waitpid() on it -- as if it's your child -- which can give you
  various events, including many ptrace-only events.
* When the tracee is stopped, you can inspect or change it with various ptrace
  calls.
* You can restart the tracee in various ways.

More details:


* To seize:

PTRACE_SEIZE takes a flags argument (the same flags as SETOPTIONS).
* Make sure to set TRACESYSGOOD to distinguish syscalls from other stops.
* If you want to trace children, set TRACE{CLONE,FORK,VFORK}.
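
For a process that is already running, the seize call could be sketched roughly
as below. The helper names seize_options() and seize_pid() are made up for
illustration; only the PTRACE_SEIZE request and the PTRACE_O_* flags come from
the real API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

// Hypothetical helper: build the option mask passed to PTRACE_SEIZE.
long seize_options(bool trace_children) {
  long opts = PTRACE_O_TRACESYSGOOD;  // Always set, per the rules above.
  if (trace_children)
    opts |= PTRACE_O_TRACECLONE | PTRACE_O_TRACEFORK | PTRACE_O_TRACEVFORK;
  return opts;
}

// Hypothetical helper: seize pid without stopping it.
// Returns 0 on success, -1 with errno set on failure.
int seize_pid(pid_t pid, bool trace_children) {
  // For PTRACE_SEIZE the addr argument must be 0; data carries the options.
  long rc = ptrace(PTRACE_SEIZE, pid, NULL, (void *)seize_options(trace_children));
  return rc == -1 ? -1 : 0;
}
```

Keeping the mask in its own function means the same value can be reused with
PTRACE_SETOPTIONS later.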


If you're forking the child yourself, you probably want it to be stopped before
it execs, so that you can seize it before the traced program starts running. So
your seize code might look like:

pid = fork();
if (pid == 0) {
  // Child.
  raise(SIGSTOP);
  execve(...);
} else {
  // Parent.
  waitpid(pid, &wstatus, WUNTRACED); // Wait for child to stop.
  ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACESYSGOOD); // Seize child.
  // XXX: Is this correct? Do we need to INTERRUPT?
  kill(pid, SIGCONT); // Continue child execution.
}

Followed by the main wait()/ptrace() loop.

The loop might look like this:
while (1) {
  poll(...); // Possibly on a signalfd to get SIGCHLD notifications.

  // If we got a SIGCHLD, run waitpid to get ptrace events from the child.
  while (1) {
    int wstatus;
    pid_t pid = waitpid(-1, &wstatus, WNOHANG|__WALL);
    if (pid <= 0) break; // No pending events (0), or no children left/error (-1).

    // We got a child event.
    if (WIFEXITED(wstatus) || WIFSIGNALED(wstatus) || WIFCONTINUED(wstatus)) {
      // This isn't a stop, e.g. the child exited.
    }

    if (WIFSTOPPED(wstatus)) {
      // This is a stop. See below.

      int stop_sig = WSTOPSIG(wstatus);
      int stop_event = wstatus >> 16;

      // ...

      ptrace(ptrace_restart, pid, 0, continue_sig);
      // In the case of a group-stop, you probably want to use PTRACE_LISTEN
      // rather than a normal restart call.
    }
  }
}

Seizing a tracee does not stop it.

Note that when a tracee forks and you have TRACEFORK set, you may get events
for the new pid before seeing the PTRACE_EVENT_FORK message (so you won't
necessarily recognize the pid from wait()).
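
One way to cope with that ordering is to create a tracee record on whichever
notification arrives first. The following is only a sketch: the table layout,
the fixed MAX_TRACEES limit, and the function names are assumptions, not part
of any ptrace API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_TRACEES 64  // Arbitrary limit for this sketch.

struct tracee {
  pid_t pid;
  bool announced;  // The parent's PTRACE_EVENT_FORK (or similar) was seen.
  bool reported;   // waitpid() has returned an event for this pid.
};

struct tracee_table {
  struct tracee slots[MAX_TRACEES];
  size_t count;
};

// Find the entry for pid, creating it if needed. NULL if the table is full.
struct tracee *tracee_get(struct tracee_table *t, pid_t pid) {
  for (size_t i = 0; i < t->count; i++)
    if (t->slots[i].pid == pid) return &t->slots[i];
  if (t->count == MAX_TRACEES) return NULL;
  struct tracee *e = &t->slots[t->count++];
  e->pid = pid;
  e->announced = false;
  e->reported = false;
  return e;
}

// Call for every pid returned by waitpid(), known or not.
struct tracee *on_wait_event(struct tracee_table *t, pid_t pid) {
  struct tracee *e = tracee_get(t, pid);
  if (e) e->reported = true;
  return e;
}

// Call with the child pid read via PTRACE_GETEVENTMSG after a fork event.
struct tracee *on_fork_event(struct tracee_table *t, pid_t child) {
  struct tracee *e = tracee_get(t, child);
  if (e) e->announced = true;
  return e;
}
```

Either order of on_wait_event() and on_fork_event() ends up with a single
record for the new pid.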


* Stopped states.

In various cases, a tracee may enter a stopped state. You can tell when you call
waitpid and WIFSTOPPED(wstatus) is true.

When you get a stop, you can inspect or modify the tracee, and then continue
execution. When you continue execution, you can pass ptrace a signal number as
the last argument. For each kind of stop, I give the no-op action: the one that
mimics untraced behavior.

The specific kind of ptrace stop is given in two parts of the wstatus, referred
to below as:

  int stop_sig = WSTOPSIG(wstatus); // Stopping signal.
  int stop_event = wstatus >> 16; // ptrace event identifier.

There are several kinds of stops:
* Syscall-stop:
  Delivered right before entering and right after exiting a system call, if you
  use PTRACE_SYSCALL.

  Can be identified with: stop_sig == (SIGTRAP|0x80) && stop_event == 0

  No-op action: Continue execution with continue_sig = 0.

* Event stop:
  Delivered to indicate other special ptrace-specific events. The event type is
  given in stop_event.

  Mostly you sign up for events explicitly: If you enable PTRACE_O_TRACEFORK,
  you might get PTRACE_EVENT_FORK events, and so on. For many events you can
  request additional event information with PTRACE_GETEVENTMSG.

  PTRACE_EVENT_STOP is special (you signed up for it when you used PTRACE_SEIZE):
  * If you get a STOP event and stop_sig is SIGTRAP, that indicates that you
    successfully interrupted a tracee (or a tracee's new forked/cloned child,
    which automatically gets interrupted at startup).
    (But note that if your INTERRUPT request happens at the same time as some
    other stop, you may get notified of that stop instead.)
  * If you get a STOP event and stop_sig is a stopping signal, that indicates a
    group-stop. See below.
    (Note: If you INTERRUPT a tracee which is currently in a non-ptrace stopped
    state, it goes into a group-stopped state. See below.)
  * There are no other STOP events. (TODO: Double-check this.)

  Can be identified with: stop_event != 0

  No-op action: For all event stops other than group-stops, continue execution
  with continue_sig = 0. For group-stops, see below.

* Signal-delivery-stop:
  When a process gets a signal (other than SIGKILL), its tracer is notified
  first, and may change the signal or drop it by passing a different signal
  (or 0) to the restart call.

  Can be identified with: stop_sig != (SIGTRAP|0x80) && stop_event == 0

  No-op action: Continue execution with continue_sig = stop_sig.

* Group-stop:
  A group-stop is a special kind of event stop that needs to be treated
  differently.

  When a process gets a stopping signal -- e.g. when the user presses ^Z --
  you're first notified as above, with a signal-delivery-stop (if the process
  has multiple threads, only one thread gets a signal-delivery-stop).

  Then, if you pass the signal on to the tracee, it gets stopped. Every thread
  in the process -- including the one that got the signal-delivery-stop --
  gets a group-stop event, indicated with stop_event == PTRACE_EVENT_STOP, and
  stop_sig one of the stopping signals (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU).

  A group-stop indicates that the tracee is stopped, but in a special
  ptrace-stop state, not a regular stopped state; SIGCONT won't be delivered in
  this state.

  A tracee is also put into this state if you INTERRUPT it while it's in a
  regular stopped state.

  Can be identified with: stop_event == PTRACE_EVENT_STOP && stop_sig != SIGTRAP

  The no-op action is *not* to continue execution -- since the tracee is
  supposed to be stopped -- but to put it in a regular stopped state (rather
  than the special ptrace stopped state it's currently in), with PTRACE_LISTEN
  (and continue_sig = 0).
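
The identification rules above can be collected into one function. This is a
minimal sketch; the enum and the name classify_stop() are invented here, and it
assumes the tracee was seized with PTRACE_O_TRACESYSGOOD set.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

enum stop_kind {
  STOP_NONE,             // Not a stop at all (exit, kill, continue).
  STOP_SYSCALL,
  STOP_EVENT,            // Any event stop other than a group-stop.
  STOP_GROUP,
  STOP_SIGNAL_DELIVERY,
};

enum stop_kind classify_stop(int wstatus) {
  if (!WIFSTOPPED(wstatus)) return STOP_NONE;

  int stop_sig = WSTOPSIG(wstatus);  // Stopping signal.
  int stop_event = wstatus >> 16;    // ptrace event identifier.

  if (stop_event == PTRACE_EVENT_STOP)
    return stop_sig == SIGTRAP ? STOP_EVENT : STOP_GROUP;
  if (stop_event != 0) return STOP_EVENT;
  if (stop_sig == (SIGTRAP | 0x80)) return STOP_SYSCALL;
  return STOP_SIGNAL_DELIVERY;
}
```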


* Inspecting and modifying the tracee.

The APIs available are mostly the ones you'd expect: Get/set register state,
memory, signal state, etc. (Note that a lot of the information you want is
exposed via /proc/pid/ rather than ptrace.)

For reading and writing memory, process_vm_{read,write}v is much better than
the direct ptrace API, which operates a machine word at a time.
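
As a rough illustration, a single contiguous read with process_vm_readv might
look like this. The wrapper name read_tracee_memory() is hypothetical, and a
real caller would also need to handle short reads.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

// Hypothetical helper: copy len bytes starting at remote_addr in process pid
// into buf. Returns the number of bytes read, or -1 with errno set.
ssize_t read_tracee_memory(pid_t pid, uintptr_t remote_addr, void *buf,
                           size_t len) {
  struct iovec local = { .iov_base = buf, .iov_len = len };
  struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };
  // One local and one remote segment; the flags argument must be 0.
  return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```

The return value can be smaller than len if part of the remote range is not
readable.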

Note: If you want to do a system call -- say, to allocate memory in the
tracee -- you'll need the instruction pointer to be on a syscall instruction
(or equivalent) on an executable page. There's no nice way to do this; your
reasonable options are:

* Run the tracee until the next time it does a system call;
* Scan its memory for a system call instruction (perhaps in the VDSO);
* Temporarily write a syscall instruction to executable memory.
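
For the scanning option, the search itself is simple once you have copied an
executable region of the tracee into a local buffer. The sketch below assumes
x86-64, where the syscall instruction is the two bytes 0x0f 0x05; other
architectures need a different pattern. The function name is made up.

```c
#include <stddef.h>

// Returns the offset of the first x86-64 `syscall` instruction (0x0f 0x05)
// in buf, or -1 if the buffer contains none.
long find_syscall_insn(const unsigned char *buf, size_t len) {
  for (size_t i = 0; i + 1 < len; i++)
    if (buf[i] == 0x0f && buf[i + 1] == 0x05) return (long)i;
  return -1;
}
```

Add the returned offset to the region's start address in the tracee to get the
value for the instruction pointer. Note that this matches the byte pair
anywhere, including in the middle of a longer instruction.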

(A special note on PTRACE_POKEDATA: It can write even to non-writable pages
in the tracee. If you write to a shared read-only page, it automatically gets
copied to a private mapping and unshared.)

* Restarting the tracee.

When the tracee is stopped, you can restart it with CONT, SYSCALL, or SINGLESTEP
(and with the related SYSEMU and SYSEMU_SINGLESTEP, which skip over executing
system calls to let you emulate them yourself). The restart calls take a signal
argument, as described above.

You can also "restart" a tracee after group-stop with PTRACE_LISTEN, as
described above.

You can also detach from the tracee with PTRACE_DETACH.
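
Putting the no-op actions from the stop descriptions together, choosing the
restart for a given wstatus might be sketched as follows. The struct and the
name noop_restart() are assumptions for illustration, and PTRACE_CONT stands in
for whichever of CONT/SYSCALL/SINGLESTEP you are actually using.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

struct restart {
  int request;  // PTRACE_CONT or PTRACE_LISTEN.
  int sig;      // Signal to pass as the last ptrace argument.
};

// Hypothetical helper: pick the restart that mimics untraced behavior.
// Only meaningful when WIFSTOPPED(wstatus) is true.
struct restart noop_restart(int wstatus) {
  int stop_sig = WSTOPSIG(wstatus);
  int stop_event = wstatus >> 16;
  struct restart r = { PTRACE_CONT, 0 };

  if (stop_event == PTRACE_EVENT_STOP && stop_sig != SIGTRAP) {
    r.request = PTRACE_LISTEN;  // Group-stop: leave the tracee stopped.
  } else if (stop_event == 0 && stop_sig != (SIGTRAP | 0x80)) {
    r.sig = stop_sig;           // Signal-delivery-stop: pass the signal on.
  }
  // Syscall-stops and all other event stops: continue with no signal.
  return r;
}
```

The result would then be passed to ptrace() as the request and the last
argument, in place of ptrace_restart and continue_sig in the loop above.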


// TODO: Write about ESRCH handling.
// BPF?
// TODO: Add a note about EINTR. Some system calls (such as epoll_wait, though
// not poll) will return EINTR when a process is interrupted, even though there
// was no signal. The ptrace man page says this is a kernel bug.


Requests:
  I {PEEK,POKE}{TEXT,DATA,USER}
  I {GET,SET}{FP,}REGS
  I {GET,SET}REGSET
  I {GET,SET,PEEK}SIGINFO
  I {GET,SET}SIGMASK
  I GETEVENTMSG
  I SECCOMP_GET_FILTER
  I {GET,SET}_THREAD_AREA
  I GET_SYSCALL_INFO [new]
  I SETOPTIONS

  R CONT
  R SYSCALL, SINGLESTEP
  R SYSEMU, SYSEMU_SINGLESTEP
  R LISTEN
  R DETACH

  G INTERRUPT
  G SEIZE

  x KILL
  x ATTACH
  x TRACEME


Options:
  EXITKILL
  TRACE{CLONE,FORK,VFORK}
  TRACEEXEC
  TRACEEXIT
  TRACESYSGOOD
  TRACEVFORKDONE
  TRACESECCOMP
  SUSPEND_SECCOMP