| ====================================== | 
 | XRay Flight Data Recorder Trace Format | 
 | ====================================== | 
 |  | 
 | :Version: 1 as of 2017-07-20 | 
 |  | 
 | .. contents:: | 
 |    :local: | 
 |  | 
 |  | 
 | Introduction | 
 | ============ | 
 |  | 
 | When gathering XRay traces in Flight Data Recorder mode, each thread of an | 
 | application will claim buffers to fill with trace data, which at some point | 
 | is finalized and flushed. | 
 |  | 
 | A goal of the profiler is to minimize overhead, the flushed data directly | 
 | corresponds to the buffer. | 
 |  | 
 | This document describes the format of a trace file. | 
 |  | 
 |  | 
 | General | 
 | ======= | 
 |  | 
 | Each trace file corresponds to a sequence of events in a particular thread. | 
 |  | 
 | The file has a header followed by a sequence of discriminated record types. | 
 |  | 
 | The endianness of byte fields matches the endianness of the platform which | 
 | produced the trace file. | 
 |  | 
 |  | 
 | Header Section | 
 | ============== | 
 |  | 
 | A trace file begins with a 32 byte header. | 
 |  | 
 | +-------------------+-----------------+----------------------------------------+ | 
 | | Field             | Size (bytes)    | Description                            | | 
 | +===================+=================+========================================+ | 
 | | version           | ``2``           | Anticipates versioned  readers. This   | | 
 | |                   |                 | document describes the format when     | | 
 | |                   |                 | version == 1                           | | 
 | +-------------------+-----------------+----------------------------------------+ | 
 | | type              | ``2``           | An enumeration encoding the type of    | | 
 | |                   |                 | trace. Flight Data Recorder mode       | | 
 | |                   |                 | traces have type == 1                  | | 
 | +-------------------+-----------------+----------------------------------------+ | 
 | | bitfield          | ``4``           | Holds parameters that are not aligned  | | 
 | |                   |                 | to bytes. Further described below.     | | 
 | +-------------------+-----------------+----------------------------------------+ | 
 | | cycle_frequency   | ``8``           | The frequency in hertz of the CPU      | | 
 | |                   |                 | oscillator used to measure duration of | | 
 | |                   |                 | events in ticks.                       | | 
 | +-------------------+-----------------+----------------------------------------+ | 
 | | buffer_size       | ``8``           | The size in bytes of the data portion  | | 
 | |                   |                 | of the trace following the header.     | | 
 | +-------------------+-----------------+----------------------------------------+ | 
 | | reserved          | ``8``           | Reserved for future use.               | | 
 | +-------------------+-----------------+----------------------------------------+ | 
 |  | 
 | The bitfield parameter of the file header is composed of the following fields. | 
 |  | 
 | +-------------------+----------------+-----------------------------------------+ | 
 | | Field             | Size (bits)    | Description                             | | 
 | +===================+================+=========================================+ | 
 | | constant_tsc      | ``1``          | Whether the platform's timestamp        | | 
 | |                   |                | counter used to record ticks between    | | 
 | |                   |                | events ticks at a constant frequency    | | 
 | |                   |                | despite CPU frequency changes.          | | 
 | |                   |                | 0 == non-constant. 1 == constant.       | | 
 | +-------------------+----------------+-----------------------------------------+ | 
 | | nonstop_tsc       | ``1``          | Whether the tsc continues to count      | | 
 | |                   |                | despite whether the CPU is in a low     | | 
 | |                   |                | power state. 0 == stop. 1 == non-stop.  | | 
 | +-------------------+----------------+-----------------------------------------+ | 
 | | reserved          | ``30``         | Not meaningful.                         | | 
 | +-------------------+----------------+-----------------------------------------+ | 
 |  | 
 |  | 
 | Data Section | 
 | ============ | 
 |  | 
 | Following the header in a trace is a data section with size matching the | 
 | buffer_size field in the header. | 
 |  | 
 | The data section is a stream of elements of different types. | 
 |  | 
 | There are a few categories of data in the sequence. | 
 |  | 
 | - ``Function Records``: Function Records contain the timing of entry into and | 
 |   exit from function execution. Function Records have 8 bytes each. | 
 |  | 
 | - ``Metadata Records``: Metadata records serve many purposes. Mostly, they | 
 |   capture information that may be too costly to record for each function, but | 
 |   that is required to contextualize the fine-grained timings. They also are used | 
 |   as markers for user-defined Event Data payloads. Metadata records have 16 | 
 |   bytes each. | 
 |  | 
 | - ``Event Data``: Free form data may be associated with events that are traced | 
 |   by the binary and encode data defined by a handler function. Event data is | 
 |   always preceded with a marker record which indicates how large it is. | 
 |  | 
 | - ``Function Arguments``: The arguments to some functions are included in the | 
 |   trace. These are either pointer addresses or primitives that are read and | 
 |   logged independently of their types in a high level language. To the tracer, | 
 |   they are all numbers. Function Records that have attached arguments will | 
 |   indicate their presence on the function entry record. We only support logging | 
 |   contiguous function argument sequences starting with argument zero, which will | 
 |   be the "this" pointer for member function invocations. For example, we don't | 
 |   support logging the first and third argument. | 
 |  | 
 | A reader of the memory format must maintain a state machine. The format makes no | 
 | attempt to pad for alignment, and it is not seekable. | 
 |  | 
 |  | 
 | Function Records | 
 | ---------------- | 
 |  | 
 | Function Records have an 8 byte layout. This layout encodes information to | 
 | reconstruct a call stack of instrumented function and their durations. | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Field         | Size (bits)  | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | discriminant  | ``1``        | Indicates whether a reader should read a      | | 
 | |               |              | Function or Metadata record. Set to ``0`` for | | 
 | |               |              | Function records.                             | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | action        | ``3``        | Specifies whether the function is being       | | 
 | |               |              | entered, exited, or is a non-standard entry   | | 
 | |               |              | or exit produced by optimizations.            | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | function_id   | ``28``       | A numeric ID for the function. Resolved to a  | | 
 | |               |              | name via the xray instrumentation map. The    | | 
 | |               |              | instrumentation map is built by xray at       | | 
 | |               |              | compile time into an object file and pairs    | | 
 | |               |              | the function ids to addresses. It is used for | | 
 | |               |              | patching and as a lookup into the binary's    | | 
 | |               |              | symbols to obtain names.                      | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | tsc_delta     | ``32``       | The number of ticks of the timestamp counter  | | 
 | |               |              | since a previous record recorded a delta or   | | 
 | |               |              | other TSC resetting event.                    | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 | On little-endian machines, the bitfields are ordered from least significant bit | 
 | bit to most significant bit. A reader can read an 8 bit value and apply the mask | 
 | ``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned | 
 | shift right by ``0x04`` to obtain the function_id field. | 
 |  | 
 | On big-endian machine, the bitfields are written in order from most significant | 
 | bit to least significant bit. A reader would read an 8 bit value and unsigned | 
 | shift right by 7 bits for the discriminant. The function_id field could be | 
 | obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``. | 
 |  | 
 | Function action types are as follows. | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Type          | Number       | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | Entry         | ``0``        | Typical function entry.                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Exit          | ``1``        | Typical function exit.                        | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Tail_Exit     | ``2``        | An exit from a function due to tail call      | | 
 | |               |              | optimization.                                 | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Entry_Args    | ``3``        | A function entry that records arguments.      | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 | Entry_Args records do not contain the arguments themselves. Instead, metadata | 
 | records for each of the logged args follow the function record in the stream. | 
 |  | 
 |  | 
 | Metadata Records | 
 | ---------------- | 
 |  | 
 | Interspersed throughout the buffer are 16 byte Metadata records. For typically | 
 | instrumented binaries, they will be sparser than Function records, and they | 
 | provide a fuller picture of the binary execution state. | 
 |  | 
 | Metadata record layout is partially record dependent, but they share a common | 
 | structure. | 
 |  | 
 | The same bit field rules described for function records apply to the first byte | 
 | of MetadataRecords. Within this byte, little endian machines use lsb to msb | 
 | ordering and big endian machines use msb to lsb ordering. | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Field         | Size         | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | discriminant  | ``1 bit``    | Indicates whether a reader should read a      | | 
 | |               |              | Function or Metadata record. Set to ``1`` for | | 
 | |               |              | Metadata records.                             | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | record_kind   | ``7 bits``   | The type of Metadata record.                  | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | data          | ``15 bytes`` | A data field used differently for each record | | 
 | |               |              | type.                                         | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 | Here is a table of the enumerated record kinds. | 
 |  | 
 | +--------+---------------------------+ | 
 | | Number | Type                      | | 
 | +========+===========================+ | 
 | | 0      | NewBuffer                 | | 
 | +--------+---------------------------+ | 
 | | 1      | EndOfBuffer               | | 
 | +--------+---------------------------+ | 
 | | 2      | NewCPUId                  | | 
 | +--------+---------------------------+ | 
 | | 3      | TSCWrap                   | | 
 | +--------+---------------------------+ | 
 | | 4      | WallTimeMarker            | | 
 | +--------+---------------------------+ | 
 | | 5      | CustomEventMarker         | | 
 | +--------+---------------------------+ | 
 | | 6      | CallArgument              | | 
 | +--------+---------------------------+ | 
 |  | 
 |  | 
 | NewBuffer Records | 
 | ----------------- | 
 |  | 
 | Each buffer begins with a NewBuffer record immediately after the header. | 
 | It records the thread ID of the thread that the trace belongs to. | 
 |  | 
 | Its data segment is as follows. | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Field         | Size (bytes) | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | thread_Id     | ``2``        | Thread ID for buffer.                         | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | reserved      | ``13``       | Unused.                                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 |  | 
 | WallClockTime Records | 
 | --------------------- | 
 |  | 
 | Following the NewBuffer record, each buffer records an absolute time as a frame | 
 | of reference for the durations recorded by timestamp counter deltas. | 
 |  | 
 | Its data segment is as follows. | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Field         | Size (bytes) | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | seconds       | ``8``        | Seconds on absolute timescale. The starting   | | 
 | |               |              | point is unspecified and depends on the       | | 
 | |               |              | implementation and platform configured by the | | 
 | |               |              | tracer.                                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | microseconds  | ``4``        | The microsecond component of the time.        | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | reserved      | ``3``        | Unused.                                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 |  | 
 | NewCpuId Records | 
 | ---------------- | 
 |  | 
 | Each function entry invokes a routine to determine what CPU is executing. | 
 | Typically, this is done with readtscp, which reads the timestamp counter at the | 
 | same time. | 
 |  | 
 | If the tracing detects that the execution has switched CPUs or if this is the | 
 | first instrumented entry point, the tracer will output a NewCpuId record. | 
 |  | 
 | Its data segment is as follows. | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Field         | Size (bytes) | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | cpu_id        | ``2``        | CPU Id.                                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | absolute_tsc  | ``8``        | The absolute value of the timestamp counter.  | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | reserved      | ``5``        | Unused.                                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 |  | 
 | TSCWrap Records | 
 | --------------- | 
 |  | 
 | Since each function record uses a 32 bit value to represent the number of ticks | 
 | of the timestamp counter since the last reference, it is possible for this value | 
 | to overflow, particularly for sparsely instrumented binaries. | 
 |  | 
 | When this delta would not fit into a 32 bit representation, a reference absolute | 
 | timestamp counter record is written in the form of a TSCWrap record. | 
 |  | 
 | Its data segment is as follows. | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Field         | Size (bytes) | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | absolute_tsc  | ``8``        | Timestamp counter value.                      | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | reserved      | ``7``        | Unused.                                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 |  | 
 | CallArgument Records | 
 | -------------------- | 
 |  | 
 | Immediately following an Entry_Args type function record, there may be one or | 
 | more CallArgument records that contain the traced function's parameter values. | 
 |  | 
 | The order of the CallArgument Record sequency corresponds one to one with the | 
 | order of the function parameters. | 
 |  | 
 | CallArgument data segment: | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Field         | Size (bytes) | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | argument      | ``8``        | Numeric argument (may be pointer address).    | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | reserved      | ``7``        | Unused.                                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 |  | 
 | CustomEventMarker Records | 
 | ------------------------- | 
 |  | 
 | XRay provides the feature of logging custom events. This may be leveraged to | 
 | record tracing info for RPCs or similarly trace data that is application | 
 | specific. | 
 |  | 
 | Custom Events themselves are an unstructured (application defined) segment of | 
 | memory with arbitrary size within the buffer. They are preceded by | 
 | CustomEventMarkers to indicate their presence and size. | 
 |  | 
 | CustomEventMarker data segment: | 
 |  | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | Field         | Size (bytes) | Description                                   | | 
 | +===============+==============+===============================================+ | 
 | | event_size    | ``4``        | Size of preceded event.                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | absolute_tsc  | ``8``        | A timestamp counter of the event.             | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 | | reserved      | ``3``        | Unused.                                       | | 
 | +---------------+--------------+-----------------------------------------------+ | 
 |  | 
 |  | 
 | EndOfBuffer Records | 
 | ------------------- | 
 |  | 
 | An EndOfBuffer record type indicates that there is no more trace data in this | 
 | buffer. The reader is expected to seek past the remaining buffer_size expressed | 
 | before the start of buffer and look for either another header or EOF. | 
 |  | 
 |  | 
 | Format Grammar and Invariants | 
 | ============================= | 
 |  | 
 | Not all sequences of Metadata records and Function records are valid data. A | 
 | sequence should be parsed as a state machine. The expectations for a valid | 
 | format can be expressed as a context free grammar. | 
 |  | 
 | This is an attempt to explain the format with statements in EBNF format. | 
 |  | 
 | - Format := Header ThreadBuffer* EOF | 
 |  | 
 | - ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End | 
 |  | 
 | - BodySequence := NewCPUId | TSCWrap | Function | CustomEvent | 
 |  | 
 | - Function := (Function_Entry_Args CallArgument*) | Function_Other_Type | 
 |  | 
 | - CustomEvent := CustomEventMarker CustomEventUnstructuredMemory | 
 |  | 
 | - End := EndOfBuffer RemainingBufferSizeToSkip | 
 |  | 
 |  | 
 | Function Record Order | 
 | --------------------- | 
 |  | 
 | There are a few clarifications that may help understand what is expected of | 
 | Function records. | 
 |  | 
 | - Functions with an Exit are expected to have a corresponding Entry or | 
 |   Entry_Args function record precede them in the trace. | 
 |  | 
 | - Tail_Exit Function records record the Function ID of the function whose return | 
 |   address the program counter will take. In other words, the final function that | 
 |   would be popped off of the call stack if tail call optimization was not used. | 
 |  | 
 | - Not all functions marked for instrumentation are necessarily in the trace. The | 
 |   tracer uses heuristics to preserve the trace for non-trivial functions. | 
 |  | 
 | - Not every entry must have a traced Exit or Tail Exit. The buffer may run out | 
 |   of space or the program may request for the tracer to finalize toreturn the | 
 |   buffer before an instrumented function exits. |