Skip to content

Latest commit

 

History

History
236 lines (188 loc) · 11.6 KB

split-inputs.md

File metadata and controls

236 lines (188 loc) · 11.6 KB

How To Split A Fuzzer-Generated Input Into Several

This document describes several recommended strategies for splitting a single fuzzer-generated input into several parts (sub-inputs).

Examples

Splitting a fuzzer-generated input into several independent parts (sub-inputs) is required very often. Some examples:

  • Fuzzing a regular expression library requires

    • The regular expression (RE)
    • Flags for RE compilation and matching
    • A string to search the RE in
  • Fuzzing an audio/video format decoder often requires

    • Decoding flags
    • Several frames
  • Fuzzing a XSLT or CSS library requires

    • The stylesheet input
    • The XML/HTML input
  • Fuzzing a font-rendering library requires

    • The font file
    • The text to render
    • The rendering flags
  • Fuzzing a database library may require

    • The query text
    • The database state

TODO: more examples?

Common Data Format

When trying to split the fuzzer-generated input into several parts, the first question one needs to ask is whether the input format is common, i.e. is it used or processed by other libraries, APIs, of fuzz targets.

If the data format is common (e.g. a widely used media format or network packet format) then it is highly desirable for a fuzz target to consume exactly this data format, and not some custom modification. This way it will be easier to procure a seed corpus for this fuzz target and to use the generated corpus to test/fuzz other targets.

Multiple Options

If the data format may be processed by a fuzz target in a small number of different ways, it is often the best approach to split the fuzz target into several ones, each processing the input in exactly one way.

Make sure to cross-pollinate the corpora between these targets. OSS-Fuzz does that automatically.

Embedding / Comments

When a fuzz target for a common data format requires some flags, options, or additional auxiliary sub-input(s), it is sometimes possible to embed the extra input inside a custom section or a comment of the main data format.

Examples:

  • PNG allows custom "chunks", and so a fuzz target for a PNG decoder can hide the flags used during PNG processing in a separate PNG chunk, e.g. fUZz (example).
  • When fuzzing C/C++/Java/JavaScript inputs one may hide a sub-input in a single-line // comment. TODO: example?

Hash

When only one small fixed-size sub-input is required (such as flags / options), the fuzz target may compute a hash function on the full input and use it as the flag bits. This option is very easy to implement, but it's applicability is limited to relatively simple cases. The major problem is that a small local mutation of the input leads to a large change in the sub-input, which often makes fuzzing less efficient. Try this approach if the flags are individual bits and the input type allows some bit flips in the inputs (e.g. a plain text).

TODO: example.

Custom Serialization Format

If you do not intend to share the corpus with any other API or fuzz targets, then a custom serialization format might be a good option for a multi-input fuzz target.

First / Last Bytes

When only one fixed-size sub-input is required (such as flags / options), it is possible to treat the first (or last) K bytes of the input as sub-input, and the rest of the bytes as the main input.

Just remember to copy the main input into a separate heap buffer of Size - K bytes, so that buffer under/overflows on the main input are detected.

TODO: example.

Magic separator

Choose a 4-byte (or 8-byte) magic constant that will serve as a separator between the inputs. In the fuzz target, split the input using this separator. Use memmem to find the separator in the input -- memmem is known to be friendly to fuzzing engines, at least to libFuzzer.

Example (see full code here):

// Splits [data,data+size) into a vector of strings using a "magic" Separator.
std::vector<std::vector<uint8_t>> SplitInput(const uint8_t *Data, size_t Size,
                                     const uint8_t *Separator,
                                     size_t SeparatorSize) { ... }

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  const uint8_t Separator[] = {0xDE, 0xAD, 0xBE, 0xEF};
  auto Inputs = SplitInput(Data, Size, Separator, sizeof(Separator));
  // Use Inputs.size(), Inputs[0], Inputs[1], ...
}

It is relatively easy for a modern fuzzing engine to discover the separator, but nevertheless we recommend to provide several seed inputs with the desired number of separators.

Fuzzed Data Provider

FuzzedDataProvider (FDP) is a single-header C++ library that is helpful for splitting a fuzz input into multiple parts of various types. It is a part of LLVM and can be included via #include <fuzzer/FuzzedDataProvider.h> directive. If your compiler doesn't have this header (in case it's an older Clang version or some other compiler), you can copy the header from here and add it to your project manually. It should just work, as the header doesn't depend on LLVM.

An advantage and disadvantage of using this library is that the input splitting happens dynamically, i.e. you don't need to define any structure of the input. This might be very helpful in certain cases, but would also make the corpus to be no longer in a particular format. For example, if you fuzz an image parser and split the fuzz input into several parts, the corpus elements will no longer be valid image files, and you won't be able to simply add image files to your corpus.

Main concepts

  • FuzzedDataProvider is a class whose constructor accepts const uint8_t*, size_t arguments. Usually, you would call it in the beginning of your LLVMFuzzerTestOneInput and pass the data, size parameters provided by the fuzzing engine.
  • Once an FDP object is constructed using the fuzz input, you can consume the data from the input by calling the FDP methods listed below.
  • If there is not enough data left to consume, FDP will consume all the remaining bytes. For example, if you call ConsumeBytes(10) when there are only 4 bytes left in the fuzz input, FDP will return a vector of length 4.
  • If there is no data left, FDP will return the default value for the requested type or an empty container (when consuming a sequence of bytes).
  • If you consume data from FDP in a loop, make sure to check the value returned by remaining_bytes() between loop iterations.
  • Do not use the methods that return std::string unless your API requires a string object or a C-style string with a trailing null byte. This is a common mistake that hides off-by-one buffer overflows from AddressSanitizer.

Methods for extracting individual values

  • ConsumeBool, ConsumeIntegral, ConsumeIntegralInRange methods are helpful for extracting a single boolean or integer value (the exact type is defined by a template parameter), e.g. some flag for the target API, or a number of iterations for a loop, or length of a part of the fuzz input.
  • ConsumeProbability, ConsumeFloatingPoint, ConsumeFloatingPointInRange methods are very similar to the ones mentioned above. The difference is that these methods return a floating point value.
  • ConsumeEnum and PickValueInArray methods are handy when the fuzz input needs to be selected from a predefined set of values, such as an enum or an array.

These methods are using the last bytes of the fuzz input for deriving the requested values. This allows to use valid / test files as a seed corpus in some cases.

Methods for extracting sequences of bytes

Many of these methods have a length argument. You can always know how many bytes are left inside the provider object by calling remaining_bytes() method on it.

  • ConsumeBytes and ConsumeBytesWithTerminator methods return a std::vector of the requested size. These methods are helpful when you know how long a certain part of the fuzz input should be. Use .data() and .size() methods of the resulting object if your API works with raw memory arguments.
  • ConsumeBytesAsString method returns a std::string of the requested length. This is useful when you need a null-terminated C-string. Calling c_str() on the resulting object is the best way to obtain it.
  • ConsumeRandomLengthString method returns a std::string as well, but its length is derived from the fuzz input and typically is hard to predict, though always deterministic. The caller can provide the max length argument.
  • ConsumeRemainingBytes and ConsumeRemainingBytesAsString methods return std::vector and std::string objects respectively, initialized with all the bytes from the fuzz input that left unused.
  • ConsumeData method copies the requested number of bytes from the fuzz input to the given pointer (void *destination). The method is useful when you need to fill an existing buffer or object (e.g. a struct) with fuzzing data.

For more information about the methods, their arguments and implementation details, please refer to the FuzzedDataProvider source code. Every method has a detailed comment in that file, and the implementation is relatively small.

Examples of fuzz targets using FuzzedDataProvider

  • net_verify_name_match_fuzzer splits the fuzz input into two parts.
  • net_http2_frame_decoder_fuzzer reads data in small chunks in a loop in order to emulate a sequence of frames coming from the network connection.
  • net_crl_set_fuzzer initialized multiple parameters and uses the rest of the fuzz input for the main argument (i.e. data to be parsed / processed). Note that using Protobufs based fuzzing might be more efficient for such a target.
  • net_parse_cookie_line_fuzzer is a slightly more sophisticated fuzz target that emulates different actions with different parameters initialized with the fuzz input.

Type-length-value

A custom Type-length-value, or TLV, may sound like a good solution. However, we typically do not recommend using a custom TLV to split your fuzzer-generated input for the following reasons:

  • This is more test-only code for you to maintain, and easy to get wrong
  • Typical mutations performed by fuzzing engines, such as inserting a byte, will break the TLV structure too often, making fuzzing less efficient

However, a TLV input combined with a custom mutator might be a good option. See Structure-Aware Fuzzing.

Protobufs

Yet another option is to use one of the general-purpose serialization formats, such as Protobufs, in combination with a custom mutator. See Structure-Aware Fuzzing.