summaryrefslogtreecommitdiff
path: root/docs/DataFlowSanitizerDesign.rst
diff options
context:
space:
mode:
authorPeter Collingbourne <peter@pcc.me.uk>2013-08-07 22:47:34 +0000
committerPeter Collingbourne <peter@pcc.me.uk>2013-08-07 22:47:34 +0000
commit2eeed711beec49dfad5d3a3f16fdfca4b2f3acf0 (patch)
treefeb3e0f5f4982f52c99545e1bb33ccb41aaf170c /docs/DataFlowSanitizerDesign.rst
parent7ae9745a54d2f02f2adf95d91fa74827a3a69b14 (diff)
downloadclang-2eeed711beec49dfad5d3a3f16fdfca4b2f3acf0.tar.gz
clang-2eeed711beec49dfad5d3a3f16fdfca4b2f3acf0.tar.bz2
clang-2eeed711beec49dfad5d3a3f16fdfca4b2f3acf0.tar.xz
DataFlowSanitizer; Clang changes.
DataFlowSanitizer is a generalised dynamic data flow analysis. Unlike other Sanitizer tools, this tool is not designed to detect a specific class of bugs on its own. Instead, it provides a generic dynamic data flow analysis framework to be used by clients to help detect application-specific issues within their own code. Differential Revision: http://llvm-reviews.chandlerc.com/D966 git-svn-id: https://llvm.org/svn/llvm-project/cfe/trunk@187925 91177308-0d34-0410-b5e6-96231b3b80d8
Diffstat (limited to 'docs/DataFlowSanitizerDesign.rst')
-rw-r--r--docs/DataFlowSanitizerDesign.rst142
1 files changed, 142 insertions, 0 deletions
diff --git a/docs/DataFlowSanitizerDesign.rst b/docs/DataFlowSanitizerDesign.rst
new file mode 100644
index 0000000000..b704035f2f
--- /dev/null
+++ b/docs/DataFlowSanitizerDesign.rst
@@ -0,0 +1,142 @@
+DataFlowSanitizer Design Document
+=================================
+
+This document sets out the design for DataFlowSanitizer, a general
+dynamic data flow analysis. Unlike other Sanitizer tools, this tool is
+not designed to detect a specific class of bugs on its own. Instead,
+it provides a generic dynamic data flow analysis framework to be used
+by clients to help detect application-specific issues within their
+own code.
+
+DataFlowSanitizer is a program instrumentation which can associate
+a number of taint labels with any data stored in any memory region
+accessible by the program. The analysis is dynamic, which means that
+it operates on a running program, and tracks how the labels propagate
+through that program. The tool shall support a large (>100) number
+of labels, such that programs which operate on large numbers of data
+items may be analysed with each data item being tracked separately.
+
+Use Cases
+---------
+
+This instrumentation can be used as a tool to help monitor how data
+flows from a program's inputs (sources) to its outputs (sinks).
+This has applications from a privacy/security perspective in that
+one can audit how a sensitive data item is used within a program and
+ensure it isn't exiting the program anywhere it shouldn't be.
+
+Interface
+---------
+
+A number of functions are provided which will create taint labels,
+attach labels to memory regions and extract the set of labels
+associated with a specific memory region. These functions are declared
+in the header file ``sanitizer/dfsan_interface.h``.
+
+.. code-block:: c
+
+ /// Creates and returns a base label with the given description and user data.
+ dfsan_label dfsan_create_label(const char *desc, void *userdata);
+
+ /// Sets the label for each address in [addr,addr+size) to \c label.
+ void dfsan_set_label(dfsan_label label, void *addr, size_t size);
+
+ /// Sets the label for each address in [addr,addr+size) to the union of the
+ /// current label for that address and \c label.
+ void dfsan_add_label(dfsan_label label, void *addr, size_t size);
+
+ /// Retrieves the label associated with the given data.
+ ///
+ /// The type of 'data' is arbitrary. The function accepts a value of any type,
+ /// which can be truncated or extended (implicitly or explicitly) as necessary.
+ /// The truncation/extension operations will preserve the label of the original
+ /// value.
+ dfsan_label dfsan_get_label(long data);
+
+ /// Retrieves a pointer to the dfsan_label_info struct for the given label.
+ const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label);
+
+ /// Returns whether the given label label contains the label elem.
+ int dfsan_has_label(dfsan_label label, dfsan_label elem);
+
+ /// If the given label label contains a label with the description desc, returns
+ /// that label, else returns 0.
+ dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc);
+
+Taint label representation
+--------------------------
+
+As stated above, the tool must track a large number of taint
+labels. This poses an implementation challenge, as most multiple-label
+tainting systems assign one label per bit to shadow storage, and
+union taint labels using a bitwise or operation. This will not scale
+to clients which use hundreds or thousands of taint labels, as the
+label union operation becomes O(n) in the number of supported labels,
+and data associated with it will quickly dominate the live variable
+set, causing register spills and hampering performance.
+
+Instead, a low overhead approach is proposed which is best-case O(log\
+:sub:`2` n) during execution. The underlying assumption is that
+the required space of label unions is sparse, which is a reasonable
+assumption to make given that we are optimizing for the case where
+applications mostly copy data from one place to another, without often
+invoking the need for an actual union operation. The representation
+of a taint label is a 16-bit integer, and new labels are allocated
+sequentially from a pool. The label identifier 0 is special, and means
+that the data item is unlabelled.
+
+When a label union operation is requested at a join point (any
+arithmetic or logical operation with two or more operands, such as
+addition), the code checks whether a union is required, whether the
+same union has been requested before, and whether one union label
+subsumes the other. If so, it returns the previously allocated union
+label. If not, it allocates a new union label from the same pool used
+for new labels.
+
+Specifically, the instrumentation pass will insert code like this
+to decide the union label ``lu`` for a pair of labels ``l1``
+and ``l2``:
+
+.. code-block:: c
+
+ if (l1 == l2)
+ lu = l1;
+ else
+ lu = __dfsan_union(l1, l2);
+
+The equality comparison is outlined, to provide an early exit in
+the common cases where the program is processing unlabelled data, or
+where the two data items have the same label. ``__dfsan_union`` is
+a runtime library function which performs all other union computation.
+
+Further optimizations are possible, for example if ``l1`` is known
+at compile time to be zero (e.g. it is derived from a constant),
+``l2`` can be used for ``lu``, and vice versa.
+
+Memory layout and label management
+----------------------------------
+
+The following is the current memory layout for Linux/x86\_64:
+
++---------------+---------------+--------------------+
+| Start | End | Use |
++===============+===============+====================+
+| 0x700000008000|0x800000000000 | application memory |
++---------------+---------------+--------------------+
+| 0x200200000000|0x700000008000 | unused |
++---------------+---------------+--------------------+
+| 0x200000000000|0x200200000000 | union table |
++---------------+---------------+--------------------+
+| 0x000000010000|0x200000000000 | shadow memory |
++---------------+---------------+--------------------+
+| 0x000000000000|0x000000010000 | reserved by kernel |
++---------------+---------------+--------------------+
+
+Each byte of application memory corresponds to two bytes of shadow
+memory, which are used to store its taint label. As for LLVM SSA
+registers, we have not found it necessary to associate a label with
+each byte or bit of data, as some other tools do. Instead, labels are
+associated directly with registers. Loads will result in a union of
+all shadow labels corresponding to bytes loaded (which most of the
+time will be short circuited by the initial comparison) and stores will
+result in a copy of the label to the shadow of all bytes stored to.