Security Analysis - Diving into Dataflow Analysis and Reaching Definitions
4 min read

Security Analysis - Diving into Dataflow Analysis and Reaching Definitions

Static analysis comes in all shapes and sizes. A very common technique is dataflow analysis, where we look at how variables affect each other. In this article we’ll look at reaching definitions analysis, one of the core techniques enabling dataflow analysis.
Security Analysis - Diving into Dataflow Analysis and Reaching Definitions

Automatic security analysis tools use dataflow analysis to automatically look for vulnerabilities. One vulnerability that it can detect this way is an SQL injection.

📚 Find out more about SQL queries at OWASP  

A simple (but imprecise) vulnerability detector would just raise a warning for every line that executes an SQL query. A simple grep would suffice:

$ grep -i "execute_query(.*)" vulnerableFile.py

Unfortunately, this raises too many false positives. To remove these false positives, we want the vulnerability detector to only warn us when there is a significant risk of a potential SQL injection.

Checking whether user input actually flows into the arguments of execute_query(...) is a good additional check that helps us achieve this because we're only really interested in situations where an attacker can change the query.

However, we won't be able to add this step by only using grep. We need to figure out if the variables tied to input to the program are ever (indirectly) passed/ combined with an SQL query. We need dataflow analysis!


🙋‍♂️ Reaching Definitions

We start with what's called reaching definitions analysis; which will help us determine a dataflow graph.

In short; reaching definitions analysis determines for each statement which variable definitions reach that point.

Definition

A variable definition is a program statement where we define the value of a variable; The following is an example definition of a variable x :

x = b
definition of variable x

Reaching

It’s easiest to demonstrate how a definition can reach with an example:

1:    def vulnerable_function(user_input, condition=True)
2:	     if condition:
3: 	 	     query = "SELECT * FROM USERS where id == " + user_input
4:	
5:    query = "SELECT * FROM USERS where id == 0"
6:  
7:    return server.execute_query(query) 
sample reaching definitions code

In the example, on line 3, we see that the programmer has constructed an SQL query that could be exploited if it ever evaluates.
However, you’ll also see that query directly gets overwritten afterwards, with a safe SQL query. It’s re-defined.
There is no possible way to reach the final line (where we evaluate query) with the vulnerable definition of query still valid.

In short; the definition on line 3 doesn’t reach line 7, but the definition of query on line 5 does.

The Algorithm

The algorithm for discovering the reaching definitions is relatively straightforward.

We’ll implement a so-called fixed-point algorithm. This algorithm applies a simple transformation until it reaches a state where applying the rule doesn’t result in any changes, a fixed-point.

So what’s this rule? Well, it's is two-fold:

  1. The reaching definitions after a statement in the code are the same as for the statement itself, except for the variables re-defined in that statement.
  2. The reaching definitions before a statement are equal to the union of the reaching definitions of all the statements that can come directly before this one (predecessors).

We’ll initialise our program by assuming that there are no reaching definitions at all. Then we’ll apply these two rules until we reach our fixed point. At which point we know exactly variable definitions reach throughout the program.

💡 Interested in algorithms like this? Search online for "Abstract Interpretation"

🕸️ The dataflow graph

It might not be clear immediately what we’re supposed to do with the results from this reaching definitions analysis, but it's all we need to construct a dataflow graph.

Let’s quickly define what a dataflow graph is before we continue:

  • The nodes in the graph are all the variable definitions  in the program.
  • We’ll add directed edges for each statement that uses a previously defined variable.

Remember: there can be multiple definitions for the same variable!

A Dataflow graph for the code sample above
A dataflow graph where user input flows into variable_c
💡 Did you know compilers use dataflow analysis to find dead code?

A common way to use this dataflow graph is taint analysis. In taint analysis, we determine sources and sinks and see if there is any dataflow from sources to sinks.

For example, we can see if there are paths in the graph from user input (a source) to sensitive functions like SQL query evaluation (sinks).
If we find such a path from a source to a sink,  we have a potential path through the program where user input flows into an SQL query that is evaluated.

💡 By building a graph out of the dataflow relations in a program, we can use efficient algorithms to do the pathfinding for us!

Conclusion

In this article, we had a look at reaching definitions analysis and how it helped us construct a dataflow graph. Dataflow analyses are used in both compilers and security analysis tools.

📚 Advanced techniques like context-sensitivity and points-to analysis are necessary to improve accuracy and model more complex programs.

This article is the first in a series tackling the various techniques used to build security analysis tools.
Be sure to stick around (maybe signup 😉) & let me know what you want to read about next!

Enjoying these posts? Subscribe for more