Monday, May 17, 2021


I've been at the U of Utah and Salt Lake City for 14 years (14.5, really). It was my first academic job, and the longest I've spent anywhere in my whole life. So it's a little hard to accept that I'm moving to my next adventure. 

It's a two-part adventure, because why make one move when you can make two. 

Firstly, as of today, I'm going to be working with Alondra Nelson at the White House Office of Science and Technology Policy, advising on matters relating to fairness and bias in tech systems. This is a scary and exciting new position, and I hope to help nudge things along just a bit further in the direction of tech that helps more than it harms, especially for those who've been left behind in our rush to an algorithmically controlled future. 

Secondly, I'm moving to Brown University to join the CS department there as well as their Data Science Initiative. Together with Seny Kamara and others, I'm going to start a new center on Computing for the People, to help think through what it means to do computer science that truly responds to the needs of people, instead of hiding behind a neutrality that merely gives more power to those already in power. 

Lots of changes, and because of the pandemic, all this will happen in slow motion, but it's a whirlwind of emotions (and new clothes - apparently tech conference T-shirts don't work in formal settings - WHO KNEW!!!). 

Friday, December 25, 2020

Lars Arge.

 Not a post I'd have wanted to make on Christmas day, but that's how it goes sometimes. 

Lars Arge just passed away, on Dec 23. For those of us who've been following his battles with cancer, this might not come as a total shock, but there was always hope, and that's no longer an option. 

It's hard to imagine this in 2020, but there was a time not that long ago (at least in my mind) when "big data" wasn't really a thing. Companies were acquiring lots of data, and "gigabyte" was a thing, but there was no real appreciation of the computational challenges associated with big data. 

A paper by Aggarwal and Vitter in 1988 made the first step towards changing that, introducing the external memory model as a way to think about computations when you have memory accesses that are cheap (in RAM) and expensive (on disk). 

It's a diabolically simple model: all main memory access is free, and any disk access costs 1 unit (but you can get a block of data of size B for that one unit of access). It's not meant to be realistic, but like the best computational models, it's meant to isolate the key operations that are expensive so that we can study how algorithm design needs to change. 
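To make that cost accounting concrete, here's a toy Python sketch (my illustration, not anything from the actual literature; the class name and the simplification of keeping only one block in memory are mine) that charges one I/O per block transfer. It shows the model's basic lesson: a sequential scan of N items costs only N/B transfers, while the same reads in random order cost close to N.

```python
import random

class DiskSim:
    """Toy model of the external memory setting: computation in RAM is
    free, and each disk access transfers one block of B consecutive
    items at a cost of 1 I/O. (For simplicity, only one block is kept
    in memory at a time.)"""

    def __init__(self, data, block_size):
        self.data = data
        self.B = block_size
        self.io_count = 0
        self._resident = None  # index of the block currently in RAM

    def read(self, i):
        block = i // self.B
        if block != self._resident:   # block fault: pay one I/O
            self._resident = block
            self.io_count += 1
        return self.data[i]

N, B = 1000, 100
data = list(range(N))

scan = DiskSim(data, B)
for i in range(N):                    # sequential scan: exactly N/B = 10 I/Os
    scan.read(i)

random.seed(0)
shuffled = DiskSim(data, B)
for i in random.sample(range(N), N):  # random order: far more block transfers
    shuffled.read(i)

print(scan.io_count, shuffled.io_count)
```

This gap between scanning and random access is exactly what the model is built to expose, and it's why external memory algorithms lean so heavily on batching and sequential passes: for instance, sorting takes Θ((N/B) log_{M/B}(N/B)) I/Os rather than the Θ(N log N) cost a RAM-model analysis would suggest.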

Lars was one of the foremost algorithm designers for this new world of external memory. His Ph.D. thesis laid out ideas for how to build data structures that are external-memory efficient, and his research over the following decades, in true Tarjan/Hopcroft form, built the fundamental structures and concepts one would need to even think about efficient algorithm design, with many clever ideas around batching queries, processing data in main memory to prepare for queries, and streaming access to disk when appropriate. 

Formal algorithmic models are often misunderstood. They look simplistic, miss many of the details that seem relevant in practice, and appear to encourage theoretical game playing divorced from reality. But a formal model at its best does its work invisibly. It shifts the way we think about a framework. It fosters the design of new paradigms for efficient algorithms, and it allows us to layer on optimizations that move a system from theory to practice without ever having to compromise the underlying design principles.

Lars was a force of nature in this area. I first remember meeting him in 1998 at AT&T Labs when I was interning and he was visiting there. He had boundless energy for this space, and seemingly wanted to turn everything into an external memory algorithm, whether it was geometry, data structures, or even the most basic algorithms like sorting. His intuition was the best kind of algorithmic intuition: build up the core primitives, and the rest would follow. 

And this is exactly what happened. The field exploded. For a while, "big data algorithms" WERE external memory algorithms. There was no other way to even talk about big data. And that spawned even more models. Streaming algorithms were inspired by external memory and the realization that a one-pass stream was an effective way to work with large data. Cache-oblivious algorithms asked what would happen if we took the same two-level hierarchy of main memory and disk and extended it to the cache. Semi-external memory models asked how we might modify the base model for graph computations. The MapReduce framework from the early 2000s generalized the external memory model to handle newer kinds of streaming/memory-limited architectures, in turn to be followed by Spark and so many other models. 

I'd go as far as to say this: all of the conceptual developments we see today in big data computations at some level can be traced back to work on external memory algorithms, and that was driven by Lars (and his collaborators). 

It wasn't just the papers he wrote. Lars was a leader in shaping the field. Early in the 2000s he moved back from Duke University to Aarhus University, and from there started to build what would become one of the foremost institutes for thinking about big data, first as a BRICS center and then as the appropriately named MADALGO Institute. 

Many of us who had anything to do with big data visited MADALGO at some point in our careers. I spent one of the best summers of my life being hosted by him during my sabbatical - my children still remember that summer we spent in Aarhus and wish we could go back each year. He instinctively knew that the best way to foster the area was to facilitate a generation of researchers who would bring their own ideas to Aarhus, mix and exchange them, and then go away and share them with the world. 

And he wasn't merely content with that. He wanted to demonstrate the power of his perspective beyond just the realm of academia. He started a company, SCALGO, that applied the principles of external memory algorithms (and so much more) to modeling geospatial data. I distinctly remember him telling me about the first time he demonstrated SCALGO's products in a forum alongside other companies doing GIS work, and how the performance of his system blew the other products out of the water. As someone (at the time) deeply embedded in theoretical computer science, I was astounded and encouraged by this validation of formal thinking. 

Lars was a giant in our field (his email address was always large@..., and the name fit better than one could ever dream). But he was also a giant both in real life and in his personality. He was the warmest, most fun person to be around. He seemed almost ego-free, and often downplayed his own accomplishments, claiming that his main talent was hanging around with smarter people. He was extremely generous with his time and resources (which is why so many of us were able to visit Aarhus and benefit from being at MADALGO).

He was the life of any party -- I still remember when he hosted the Symposium on Computational Geometry in Denmark. It felt like we were at a post-battle Viking celebration (and yes he got up on a table and shouted "SKÅL" over and over again while an actual pig was roasting on a spit nearby). I remember him taking me to a Denmark-Sweden soccer game and warning me not to wear anything with blue on it. I remember us going for go-kart racing and his stream of trash talking. 

Lars was the entire package: a great person, a great researcher, a visionary leader, and a canny entrepreneur. I will miss him greatly. 

Thursday, April 11, 2019

New conference announcement

Martin Farach-Colton asked me to mention this, which is definitely NOT a pox on computer systems. 
ACM-SIAM Algorithmic Principles of Computer Systems (APoCS20), January 8, 2020
Hilton Salt Lake City Center, Salt Lake City, Utah, USA
Colocated with SODA, SOSA, and Alenex 
The First ACM-SIAM APoCS is sponsored by SIAM SIAG/ACDA and ACM SIGACT. 
Important Dates: 

  • August 9: Abstract Submission and Paper Registration Deadline
  • August 16: Full Paper Deadline
  • October 4: Decision Announcement
Program Chair: Bruce Maggs, Duke University and Akamai Technologies 
Submissions: Contributed papers are sought in all areas of algorithms and architectures that offer insight into the performance and design of computer systems. Topics of interest include, but are not limited to, algorithms and data structures for: 

  • Databases
  • Compilers
  • Emerging Architectures
  • Energy Efficient Computing
  • High-performance Computing
  • Management of Massive Data
  • Networks, including Mobile, Ad-Hoc and Sensor Networks
  • Operating Systems
  • Parallel and Distributed Systems
  • Storage Systems

A submission must report original research that has not been previously published and is not concurrently submitted for publication. Manuscripts must not exceed twelve (12) single-spaced double-column pages, in addition to the bibliography and any pages containing only figures. Submissions must be self-contained, and any extra details may be submitted in a clearly marked appendix. 
Steering Committee: 

  • Michael Bender
  • Guy Blelloch
  • Jennifer Chayes
  • Martin Farach-Colton (Chair)
  • Charles Leiserson
  • Don Porter
  • Jennifer Rexford
  • Margo Seltzer

Tuesday, March 26, 2019

On PC submissions at SODA 2020

SODA 2020 (in SLC!!) is experimenting with a new submission guideline: PC members will be allowed to submit papers. I had a conversation about this with Shuchi Chawla (the PC chair) and she was kind enough (thanks Shuchi!) to share the guidelines she's provided to PC members about how this will work.

SODA is allowing PC members (but not the PC chair) to submit papers this year. To preserve the integrity of the review process, we will handle PC member submissions as follows. 
1. PC members are required to declare a conflict for papers that overlap in content with their own submissions (in addition to other CoI situations). These will be treated as hard conflicts. If necessary, in particular if we don't have enough confidence in our evaluation of a paper, PC members will be asked to comment on papers they have a hard conflict with. However, they will not have a say in the final outcome for such papers.  
2. PC submissions will receive 4 reviews instead of just 3. This is so that we have more confidence on our evaluation and ultimate decision. 
3. We will make early accept/reject decisions on PC members submissions, that is, before we start considering "borderline" papers and worrying about the total number of papers accepted. This is because the later phases of discussion are when subjectivity and bias tend to creep in the most. 
4. In order to be accepted, PC member submissions must receive no ratings below "weak accept" and must receive at least two out of four ratings of "accept" or above.  
5. PC member submissions will not be eligible for the best paper award.

My understanding is that this was done to solve the problem of not being able to get people to agree to be on the PC - this year's PC has substantially more members than prior years.

And yet....

Given all the discussion about conflicts of interest, implicit bias, and double blind review, this appears to be a bizarrely retrograde move, and in fact one that sends a very loud message that issues of implicit bias aren't really viewed as a problem. As one of my colleagues put it sarcastically when I described the new plan:

"why don't they just cut out the reviews and accept all PC submissions to start with?"
and as another colleague pointed out:

"It's mostly ridiculous that they seem to be tying themselves in knots trying to figure out how to resolve COIs when there's a really easy solution that they're willfully ignoring..."

Some of the arguments I've been hearing in support of this policy frankly make no sense to me.

First of all, the idea that heightened scrutiny of PC papers can alleviate the bias associated with reviewing papers of your colleagues goes against basically all of what we know about implicit bias in reviewing. The most basic tenet of human judgment is that we are very bad at filtering our own biases, and this policy only makes things worse. The one thing that theory conferences (compared to other venues) had going for them regarding issues of bias was that PC members couldn't submit papers, but now....

Another claim I've heard is that the scale of SODA makes double blind review difficult. It's hard to hear this claim without bursting out into hysterical laughter (and judging from the reaction of the people I mentioned it to, I'm not the only one). Conferences that manage double blind review (and PC submissions, btw) are at least an order of magnitude bigger (think of all the ML conferences). Most conference software (including EasyChair) is capable of managing the conflicts of interest without too much trouble. Given that SODA (and theory conferences in general) are less familiar with this process, I've recommended in the past that there be a "workflow chair" whose job is to manage the unfamiliarity associated with dealing with the software. Workflow chairs are common at bigger conferences that typically deal with thousands of reviewers and conflicts.

Further, as a colleague points out, what one should really be doing is "aligning nomenclature and systems with other fields: call current PC as SPC or Area Chairs, or your favorite nomenclature, and add other folks as reviewers. This way you (i) get a list of all conflicts entered into the system, and (ii) recognize the work that the reviewers are doing more officially as labeling the PC members. "

Changes in format (and culture) take time, and I'm still hopeful that the SODA organizing team will take a lesson from ESA 2019 (and their own resolution, passed a year or so ago, to look at double blind review more carefully) and consider exploring double blind review. But this year's model is certainly not going to help.

Update: Steve Blackburn outlines how PLDI handles PC submissions (in brief, double blind + external review committee)

Update: Michael Ekstrand takes on the question that Thomas Steinke asks in the comments below: "How is double blind review different from fairness-through-blindness?".

Tuesday, February 19, 2019

OpenAI, AI threats, and norm-building for responsible (data) science

All of twitter is .... atwitter?... over the OpenAI announcement and partial non-release of code/documentation for a language model that purports to generate realistic-sounding text from simple prompts. The system actually addresses many NLP tasks, but the one that's drawing the most attention is the deepfakes-like generation of plausible news copy (here's one sample).

Most consternation is over the rapid PR buzz around the announcement, including somewhat breathless headlines (that OpenAI is not responsible for) like

  • OpenAI built a text generator so good, it’s considered too dangerous to release
  • Researchers, scared by their own work, hold back “deepfakes for text” AI

There are concerns that OpenAI is overhyping solid but incremental work, that they're disingenuously allowing for overhyped coverage in the way they released the information, or, worse, that they're deliberately controlling hype as a publicity stunt.

I have nothing useful to add to the discussion above: indeed, see posts by Anima Anandkumar, Rob Munro, Zachary Lipton, and Ryan Lowe for a comprehensive discussion of the issues relating to OpenAI. Jack Clark from OpenAI has been engaging in a lot of twitter discussion on this as well.

But what I do want to talk about is the larger issues around responsible science that this kerfuffle brings up, with one caveat, as Margaret Mitchell puts it in this searing thread: these discussions are far from new.

To understand the kind of "norm-building" that needs to happen here, let's look at two related domains.

In computer security, there's a fairly well-established model for finding weaknesses in systems. An exploit is discovered, the vulnerable entity is given a chance to fix it, and then the exploit is revealed, often simultaneously with patches that rectify it. Sometimes the vulnerability isn't easily fixed (see Meltdown and Spectre). But it's still announced.

A defining characteristic of security exploits is that they are targeted, specific and usually suggest a direct patch. The harms might be theoretical, but are still considered with as much seriousness as the exploit warrants.

Let's switch to a different domain: biology. Starting from the sequencing of the human genome through the million-person precision medicine project to CRISPR and cloning babies, genetic manipulation has provided both invaluable technology for curing disease as well as grave ethical concerns about misuse of the technology. And professional organizations as well as the NIH have (sometimes slowly) risen to the challenge of articulating norms around the use and misuse of such technology.

Here, the harms are often more diffuse, and the harms are harder to separate from the benefits. But the harm articulation is often focused on the individual patient, especially given the shadow of abuse that darkens the history of medicine.

The harms with various forms of AI/ML technology are myriad and diffuse. They can cause structural damage to society - in the concerns over bias, the ways in which automation affects labor, the way in which fake news can erode trust and a common frame of truth, and so many others - and they can cause direct harm to individuals. And the scale at which these harms can happen is immense.

So where are the professional groups, the experts in thinking about the risks of democratization of ML, and all the folks concerned about the harms associated with AI tech? Why don't we have the equivalent of the Asilomar conference on recombinant DNA?

I appreciate that OpenAI has at least raised the issue of thinking through the ethical ramifications of releasing technology. But as the furore over their decision has shown, no single imperfect actor can really claim to be setting the guidelines for ethical technology release, and "starting the conversation" doesn't count when (again as Margaret Mitchell points out) these kinds of discussions have been going on in different settings for many years already.

Ryan Lowe suggests workshops at major machine learning conferences. That's not a bad idea. But it will attract the people who go to machine learning conferences. It won't bring in the journalists, the people getting SWAT'd (and in one case killed) by fake news, the women being harassed by trolls online with deep-fake porn images. 

News is driven by news cycles. Maybe OpenAI's announcement will lead to us thinking more about issues of responsible data science. But let's not pretend these are new, or haven't been studied for a long time, or need to have a discussion "started".

Monday, January 28, 2019

FAT* Session 2: Systems and Measurement.

Building systems that have fairness properties and monitoring systems that do A/B testing on us.

Session 2 of FAT*: my opinionated summary.
