## Tuesday, January 27, 2015

### Streaming @ SODA: Part I

This two part series is written by my student Samira Daruki

Modern graph data sets are too large to fit in the memory. And so the streaming model is one of the more popular and attractive ones for analyzing massive graphs: in this model, for an input graph $G = (V, E)$ with $n$ vertices and $m$ edges, the edges arrive in an arbitrary order in a stream and the algorithm can only use $\tilde{O}(n)$ space. The algorithm is allowed to have several passes over the stream but usually the ideal case is to have just one pass.

For many graph problems such as matching, spanning tree, graph sparsification, approximate distance and counting subgraphs there now exist streaming algorithms with space complexity $O(n \text{poly} (\log n))$. In these algorithms, usually we assume that the nodes of the graphs can be stored in memory but edges cannot. Notice that for constructing  matchings, spanners and sparsifiers, the output size is often $\Omega(n)$, so it forces the streaming algorithm to use $\Omega(n)$ space.

But if you don't need to report the output, then this can be avoided. For an example, see the work of Kapralov, Khanna and Sudan from SODA 2014 which approximates the matching size to a $\text{poly}(\log n)$ factor using $\text{poly}(\log n)$ space in a random stream (where edges appear in a random order rather than arbitrarily)

Thus, the question now is:
can we obtain a $\text{poly}\log$ space streaming algorithm for approximating the solution cost for other graph problems?
Consider for instance MAX-CUT. There are several results on approximating the maximum cut in graphs in a non-streaming model; the trivial approach is to take a random cut. This selects half of the edges in expectation, resulting in a factor $1/2$-approximation.

Thus implies that in a streaming model we can just count the number of edges $m$ and output $\frac{m}{2}$ which results in a $O(\log n)$-space algorithm. By keeping a sample of the edge set we can get a different tradeoff: a $(1+\epsilon)$-approximation algorithm which uses $O(\frac{n}{\epsilon^2})$ space.

Can we get a streaming algorithm with better than factor-$2$ approximation using just $poly(\log n)$ space?

A paper by Kapralov, Khanna and Sudan in the streaming session of SODA15 this year answers this question. This is the latest in a line of results on streaming graph problems by Kapralov and others from SODA 2012, 2013 and 2014 (mentioned above)

Here is their main result
For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating the value of MAX-CUT  to a factor $2 − \epsilon$ requires $\Omega(\sqrt{n})$ space, even in the random order model.
This result rules out the possibility of $\text{poly}(\log n)$ space, but suggests that $\tilde{O}(\sqrt{n})$ space cost might be possible in some specific settings.

Another major result of this paper is as follows:

For any constant $\epsilon > 0$ a single pass streaming algorithm for approximating MAX-CUT value to factor $1 + \epsilon$ requires $n^{1−O(\epsilon)}$ space in adversarial streams.

The main fact they  use here is the connection between the MAX CUT value and (distance from) bipartiteness:
if a graph $G$ with $m$ edges is $\beta$-far from being bipartite, then maxcut value of $G$ is at most $(1-\beta)m$.
The other simple observation is that any algorithm that computes a $\gamma$-approximation to MAX CUT distinguishes between bipartite graphs and graphs that are $1-\frac{1}{\gamma}$-far from being bipartite. Thus to show that no streaming algorithm using space $c$ can achieve a $\gamma$- approximation with failure probability at most $\delta$, it's enough enough to show no streaming algorithm using space $c$ can distinguish between bipartite graphs and graphs which are $1- \frac{1}{\gamma}$-far from being bipartite with probability at least $1- \delta$.

Given these facts, now the core idea to prove the main result here is exhibiting a distribution over inputs where $(2-\epsilon)$ approximation to MAX-CUT requires $\Omega(\sqrt{n})$ space.

More precisely, the input graph instances are based on random Erdos-Renyi graphs, which are either bipartite in YES case or non-bipartite in the NO case. In order to achieve a $(2-\epsilon)$-factor gap for the MAX CUT in this structure, we choose the expected degree of vertices to be $\Theta(\frac{1}{\epsilon^2})$.

This way, the input streaming graph can be partitioned and given to the algorithm in $\Omega(\frac{1}{\epsilon^2})$ phases, which can be simulated as a $\frac{1}{\epsilon^2}$-party one-way communication game.

Then, by giving a reduction from a variation of Boolean Hidden Matching(BHM)  called Distributional Boolean Hidden Partition(D-BHP) to the MAX-CUT on the input instance of the problem, and showing that $\Omega(\sqrt{n})$ space is necessary to differentiate between these two cases, the main streaming lower bound result for obtaining approximate MAX-CUT is straightforward.

There are many technical details in performing this reduction, but roughly speaking they show that any algorithm that solves MAX-CUT on the constructed instances must solve the two-party communication problem in at least one of the phases.

There are still some main open problems left in this paper:

• One is whether breaking $2$-approximation barrier in $\sqrt{n}$ space is possible if we are allowed to have $poly(\log n)$ passes over the input stream?
• Also it is interesting to think about designing streaming algorithms for other graph problems using $o(n)$ space.

This brings us to another paper presented in this session. In Streaming Algorithms for Estimating the Matching Size in Planar Graphs and Beyond (by Esfandiari, Hajiaghayi, Liaghat, Monemizadeh and Onak), this latter question is answered about finding the maximum matching for planar graphs using $o(n)$ space.

Here is the main result:
If the underlying graph is planar, then there is a streaming algorithm which provides a $O(1)$-approximation solution to maximum matching with high probability using $O(n^{\frac{2}{3}})$ space.
The main idea for proving the result here is to use a known structural graph property:
If we characterize the nodes of the input graph based on the degrees to two groups of light (deg < 9) and heavy (deg > 9) vertices and then define the shallow edges as the ones with  two light endpoints, then we have the following nice property: (Assuming |maximum matching| = m, |shallow edges| = s and | heavy vertices| = h):
$$\frac{\max (s, h)}{12} \leq m \leq h + s$$

Then using this structural fact as the main tool, constructing a small size matching (bounded by $c n^{\frac{2}{3}}$) as we read the edges in a greedy manner, and estimating the number of shallow edges and heavy vertices in the induced subgraph by a subset of sampled vertices with size $c n^{\frac{2}{3}}$, we can approximate the size of the maximum matching by a constant factor.

In addition to planar case, they show that similar results for approximating maximum matching in other graph structures such as $d$-degenerate graphs and forests are achievable.

Coming up: parametrized streaming and VERTEX COVER.