UW Allen School Colloquium: Hodgepodge


Good afternoon. All right. Welcome, everyone,
to the kickoff for the 2019 Allen
School Colloquium. I can now officially say this is
the second annual Allen School Colloquium, and I hope you
are all as excited as I am. I do need to do one
thing really quickly. Hold on, one second. So I’m not teaching
this quarter, but my mom always
wants to see my class. And so I figured I’d show
her all of you, which will be the best-looking
photo I’ve sent her so far since I’ve been at UW. So for those of you who
are new to the department, the Allen School
Colloquium is an event we have each year, where
you give the talks. It’s a great way to see what
different groups are up to and hear about your
fellow students’ research. So today we have a wonderful
lineup of Sami, Max, Aditya, and Robbie. And Sami’s going to
kick it off, telling us about trace reconstruction. So I’ll hand it off to her. SAMI: Thank you, Zach. I have to wait for this
thing to connect again. There we go. OK. So I am Sami, and we call this
the Hodgepodge Colloquium, which is partially true. But really, this is the
Softball Team Colloquium. So this is all people
who are on the softball team and either love being
on it, like Max and Robbie, or hate being on it, like me,
and would like more of you to play so that I don’t
have to show up to as many of the games. Or Aditya, who is new and
excited to join softball. So you should join
the Slack channel. Also, theory’s boring, so
just join the Slack channel right now; you don’t even have to
listen to trace reconstruction. It’s all good. So what is trace reconstruction? So trace reconstruction
is this problem originally from the computational
biology community. I’ll talk a little bit more
about that in a second. But as theory does, we have
stolen the problem from other fields and completely changed
it into a problem that we think is interesting,
and now no one else really likes it anymore. So the way that the theory
community looks at it is the following– there is a
binary string X of length n, and then we assume
that you’re given what we call some traces
of this unknown string. So your goal is to try to
figure out what the string is. So these traces that
you get are basically sampled substrings,
where you can imagine that each bit is deleted
with some fixed probability. Just imagine, probability
1/2, the bit is deleted, so you get to see these
little sampled substrings. So, for instance,
the last substring is what you would see if those
gray bits had been deleted. And again, the goal is
to learn the labels of X with high probability using
the minimum number of traces. You can assume that you know
the length of the string and that you know the
deletion probability. So as I said, we
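The deletion channel just described is easy to simulate; here is a minimal Python sketch (the function name and parameters are illustrative, not from the talk):

```python
import random

def trace(x: str, q: float = 0.5) -> str:
    """Pass a binary string through a deletion channel:
    each bit is deleted independently with probability q."""
    return "".join(bit for bit in x if random.random() >= q)

# Each trace you receive is a random subsequence of the unknown string X.
X = "0110100111"
print(trace(X, q=0.5))
```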
stole this problem from the computational
biology community. In particular, this is actually
a problem that comes up for the MISL group. And you can ask them
exactly how it arises, because I don’t
know, and I didn’t go to my college bio classes. But a lot of them did. The important thing to know is
that the full problem actually involves not only deletions,
as I described it, but also insertions and substitutions. But in the theory community,
we have studied substitutions extremely well, and it’s really
the insertions and deletions that we don’t know how
to handle very well. So hence, that’s why
I stated the problem with only deletions. So what do we know
about this problem? Remember that n is the
length of the unknown string. We know that exponential
in n to the 1/3 traces suffice to find the
original string. This is using something
called a mean-based algorithm. We are basically just looking
at the expected single-bit statistics of the traces. As for lower bounds, we know
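To give a flavor of what "mean-based" means, here is a toy illustration of my own (not the actual algorithm): since each bit survives with probability 1 - q, even the average number of 1s across traces already reveals the number of 1s in the unknown string.

```python
import random

def trace(x, q):
    # deletion channel: each bit survives with probability 1 - q
    return [b for b in x if random.random() >= q]

random.seed(0)
X = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # unknown string with six 1s
q = 0.5
traces = [trace(X, q) for _ in range(20000)]
# E[ones per trace] = (1 - q) * (number of 1s in X), so divide out (1 - q)
est = sum(map(sum, traces)) / len(traces) / (1 - q)
print(est)  # close to 6
```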
that n to the 3/2 traces are required. And that’s for arbitrary
worst case strings. If you assume that you
sample the string randomly, then we know about exponential
and log n to the 1/3 traces suffice. So the important thing
to really get out of that is that we know that
the trace complexity is somewhere between n to the
3/2 and exponential in n to the 1/3. But the big question
that all of us who care about this problem have been thinking about is, can we reconstruct
any binary string with high probability using
subexponential traces? So 2 to some poly log n traces. And the sad thing is
that really nobody knows. People have been stuck
on this for a few years. Which, in the theory community,
being stuck for a few years is really not
a long time, but it seems like the
kind of thing where we think people are going to
be stuck on this for a very long time unfortunately. So what have people
been doing who are interested in this
problem, when I just told you that people are
really stuck on it? So instead, people have been
really exploring variants around the space of trace
reconstruction with the hope that it gives intuition
as to how you could maybe make progress on that
worst case problem that I talked about
or just exploring other interesting
areas in the space. So, for instance, I
described the problem as saying that each
bit has to be deleted with the same probability. Maybe you relax that
or change it somehow. You can also imagine that
instead of randomly sampling your string or it just being
like any binary string, you could imagine that you
create some alphabet which you sample from instead. You can characterize
trace complexity in terms of the number of
1s in your original string. There’s also other earlier work
on this more biology-related question about reconstructing
labeled trees when you are only given access to the
leaves of the tree. And so, our work kind of
fits into this realm of, how do we explore the space
of trace reconstruction in a way which may be meaningful
to the theory community? So we’ve been looking
at this problem that we call tree
trace reconstruction. So my goal is to sell you on
the problem, not really my work. So I’ll briefly talk about
some of our work here, but I hope I’ve
just convinced you that trace reconstruction
is kind of interesting. So tree trace reconstruction
is the following– you assume that you have
an unknown labeled tree X on n nodes. You can assume that you
know the structure of X– say, you know that it’s a binary
tree or something like that. You know the number
of nodes that it’s on. You know the
deletion probability, and you know the deletion model. Meaning, you know when
one node is deleted, how it affects
the rest of the tree. The goal is to learn
the labels of X with high probability from
some samples that you receive. So to define how
you get a sample, we need to define
some deletion models. In the deletion models
that we consider, you’re going to assume that each
node is deleted independently with some fixed probability q,
just like in the string case. You can assume
that the siblings– siblings of a certain node– all
have a left to right ordering, and that the root
node’s not deleted. Those are just technicalities. So we introduced this
problem because we really wanted to bring more
combinatorial ideas back into reconstruction problems, whereas existing techniques had been very much based in complex analysis– so just
expanding upon the techniques that we can use. And practically, there’s been
some recent work in the DNA nanotechnology
world which has led us to believe that, hey, maybe
actually practically traces of trees are interesting
to look at as well. So a deletion model that
we define, called the TED deletion model– the Tree
Edit Distance model– is the following– when
you delete a node– so when you delete that
gray node up there– the children of it
move up in the tree. So, for instance, when
that gray node is deleted, its children have moved up to
take its place in the tree. And this is inspired by
computational biology settings, again. And then I’ll just
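Here is a sketch of the TED model as I understand it from the talk (the Node class and function names are mine): each non-root node is deleted independently, and a deleted node's children are promoted, in order, into its position.

```python
import random

class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def ted_trace(node, q=0.5, is_root=True):
    """Return the list of surviving nodes: the node itself (with its
    surviving subtrees), or, if it is deleted, its promoted children."""
    kids = []
    for child in node.children:
        kids.extend(ted_trace(child, q, is_root=False))
    if not is_root and random.random() < q:
        return kids                  # deleted: children move up
    return [Node(node.label, kids)]  # survives (root is never deleted)

X = Node("r", [Node("a", [Node("b"), Node("c")]), Node("d")])
sample = ted_trace(X, q=0.5)[0]
```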
state our results for complete k-ary trees. And the big takeaway
here is the stuff in red. The point is that when you
have a k-ary tree for k being the degree, you
can actually reconstruct with polynomially many traces. And then you can do
the same thing again using a different technique
when the degree is actually much larger. So the point is that with
this addition of structure, we are actually able to break
that exponential barrier that I had mentioned for the
string trace reconstruction case. So open problems– like I said,
string trace reconstruction is very open. People really don’t know how to
improve particularly that upper bound for the problem. And then there’s lots
of other variants that we are interested in. One that I like is
instead of reconstructing the original string
exactly, can you recover a string which is
close in Hamming distance? And maybe you can do
that with fewer traces. And this is another
example of a problem in the space which is well
motivated technically but also practically. So thank you. And again, join
the softball team. [APPLAUSE] ZACH: OK. And I think we’re going to do
a couple of questions really quickly in between. SAMI: What’s up? AUDIENCE: So [INAUDIBLE] model, would you actually– if the tree was just a path, don’t you just get back– SAMI: Exactly. Yeah. So it’s a generalization of
the string trace reconstruction problem, where a
string is just a path. AUDIENCE: better than what– SAMI: Oh, sorry. So that’s only for
complete k-ary trees; I meant to state that. Yeah. Yeah. Mm-hmm. AUDIENCE: Well,
pardon the naivety of asking the question, but how bad
is this exponential complexity in practice? Like, in the area
of DNA sequencing? Does it actually have
exponentially [INAUDIBLE]? SAMI: No. I believe you actually have a constant. A MISL person could correct me on that. But yeah, you actually want a
constant– very, very small. So exponential is bad. OK. Thanks. [APPLAUSE] MAX: All right. I’m Max. And so, unfortunately,
I can’t answer any of those MISL
questions, because I’m the stupid MISL member. So I’m going to be talking
about fluidics plus computation. And because I’ve got 30
slides to do in 10 minutes, we’re just going to skip
most of the computation part. So I’m going to start
by just telling you what fluidics is,
because a year and a half ago I don’t really know
what politics was either. So I’m really talking
about microfluidics, which is the practice of
manipulating small amounts of fluid. So here’s a bunch of
different technologies that you might use to do that. You basically have a little
series of tubes on the left, and that’s nice. It’s cheap. It’s single purpose. So that has some uses. And then, in the middle, you’ve
got a pipetting robot, which is basically a
robotic grad student. It picks up pipettes and
squirts them into little tubes, just like a grad student would. And then, on the right,
you have a newer technology called digital microfluidic
device, which I’ll explain a little bit more about in a second. This is a
video of one we’ve built here at UW
called PurpleDrop. And I like to think
about this progression of fluidics as kind
of the same thing that we saw in electronics. On the left, you’ve got
simple single-purpose devices, which we still use,
because they’re cheap when you want to mass
produce something at volume that has a fixed function. But, of course, people
want general purpose programmable things. And then, in electronics,
we eventually got that, but they were expensive and big. And then, of course, now
it’s cheap and ubiquitous. And so I think that’s the
way we’re moving in fluidics. And so that’s why I’m excited
about this digital microfluidic device. So how does it work? So that was the top view. Here’s a side view. So we’re slicing
through it here. You have a grid of electrodes. So you’re seeing a slice through the plane here. And you’ve got a
droplet squished between two hydrophobic layers. And then that white
electrode is on. And if you turn the
other one on, it moves. It’s as simple as that. And basically, that’s
the key primitive. And if you put that together,
you can move things around, you can mix and split
them in parallel, and you can move around
to other peripherals you might have on the board. So in this video here, the
little legs on the bottom– the little squares that
they’re not really moving to– those would be heaters. And you could attach
any peripherals you might want onto the board. But the question I
want to talk about is, how do you
program this thing? Because there’s a lot going on. You might want to
make sure– you don’t want your droplets to
bump into each other, and you have to
manage where they are. So how might you
program this thing? So you certainly
don’t want to talk about where the droplets are. You don’t want to say, turn
on electrode 1 at time step 1. That would be really bad. So what people currently do
is they make these graphs. So they basically
make a data flow graph that describes your protocol. This is a simple one. You input two things, mix
them, then output them. And the edges connect
the data flow– the droplet flow, in this case. And currently, the way
things work is you take that and you run it through
some big synthesis tool that uses place and
route techniques from VLSI. Because you’re basically
doing place and route. You need to figure out, where
would you mix this thing? How do you get the
droplets there? Same thing with wires
and modules on a chip. So then you get out a sequence
of electrode actuations. And then you run
that on the hardware, and it zips that droplet around. And then someone goes
and collects the data. And that’s all well and good if
you want to run one experiment. But a scientist probably
wanted to run that experiment, so they designed that graph. And they’re probably going
to collect that data. And science isn’t a
one and done thing, so they’re probably
going to want to run another experiment, right? And so this is where this kind
of static workflow falls down. And this is where we want to
put computation in the loop. So maybe the scientist is going
to run a machine learning model to figure out what
experiments to do next. Maybe it’s just a
simple if statement. Maybe they’re sweeping
over a parameter list. So instead of trying to come
up with some new programming tool that can solve all of
this with some fancy, schmancy programming language,
we’re just going to use the ones
we’ve already got. So instead of trying to
create a fluidic programming language that tries to overthink
what is a fluidic function and what does it mean for
an if statement to execute on the fluidic device, we’re
going to punt on all that and instead create a fluidic
operating system that allows you to write a
regular program in whatever your favorite
programming language is. You can use all your
machine learning models, your if statements,
and all those wacky things. And you can run it
on a microfluidic chip. So let’s see what
that looks like. So at the top here, this
is kind of our stack. These are all the
cool things you might want to do with fluidics. These are all the
awesome domains that I know nothing about that
scientists are really good at. And at the bottom, you
have this cool hardware that people might want to use
to automate their experiments. So our key idea here is to put in
a general purpose programming language, instead of some
special baked fluidic programming stack. And, of course, to
do this, we need to fill this abstraction
gap with a fluidic operating system. And what this is going
to do for us is basically automate resource management. So just like when
you write to a file, the Linux kernel
is like, OK, well, you don’t really know how many disks you have. You don’t know where they are. You don’t know where the
blocks are on that disk. That’s OK, I’ll find it for
you, and I’ll just give you back a file descriptor. So a simple number
that represents, OK, I actually know,
under the hood, where those bits
are sitting on disk. And maybe that disk
gets corrupted– it’ll move it somewhere else for you. So it’s the same kind of thing–
dynamic resource management for fluidics. So we call that stack
Puddle, and the hardware we’ve called PurpleDrop. And one nice thing about
this kind of abstraction is that you can write programs
on an operating system and it can run on
diverse sets of hardware. Maybe it runs something
on a GPU, or maybe you have different kinds
of disks or something. By having a dynamic
abstraction, you can imagine kind of
plugging and playing different types of hardware. So even if people want to
use those pipetting robots, you could imagine using it
with a system stack like this. So we have back ends for
the microfluidic chip, but we’re working on
the ones for the robots. So let’s blow that
stack up and break it down a little bit more. So the high level programs that
people write in Puddle today, you just write in Python. We have front ends in
Python, Rust, and JavaScript. But I’ll show you some
snippets of the Python one, because that’s
what scientists use. So you can write
high level programs. You can even write
domain-specific libraries. Because you can share code,
because it’s just Python code. And it all hits
puddle through an API. So there’s no compilation step. You’re talking to,
basically, the runtime system through a well-defined API. And then there’s
Puddle, the actual thing in the middle that’s
doing all the analysis and figuring out where
the droplets should go. We’re not really going to
talk about that at all. Basically, I’m going to show
you how you would use it and some cool stuff
we can do with it. And then there’s the hardware. Again, that, I’m going to not
really have time to talk about. So what kind of programs
can you write with Puddle? So there are programs
that look familiar to you, because they’re programs. So I promise this is only
code that I’ll really walk through and make you read. So this code has a
user defined function. And so all the things I’m
going to highlight in green are actual innovations that
when we published this paper, this was the first time people
have done in microfluidic land. So they don’t sound
big to you all, because they are things that
are very run-of-the-mill in programming, but these
are things that if you look at microfluidics in the static
way that people have been doing, it couldn’t
really be done. So you can define
your own functions, and you can share
code with people, and that’s really nifty. You can also loop over
unbounded data structures. So here we’re defining
a thermocycle function. Which, basically, thermocycling
means you repeatedly heat something up. So you’re taking the thing
you’re going to heat up– the droplet– and a list of the
temperatures and times that you’re going
to heat it up to. So what do you do? Well, you loop over that list. So that’s something that
you couldn’t do before. And then the underlined call
to heat here is an API call. So this is actually
just patching into the Puddle
runtime that routes the droplet to the heater
and executes the heat. Another fancy thing you can
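The thermocycle function being described might look roughly like this; this is my own sketch, and `session.heat` and its signature are guesses at the Puddle API, not the real interface.

```python
def thermocycle(session, droplet, schedule):
    """Repeatedly heat a droplet according to a list of
    (temperature, seconds) pairs -- an unbounded data structure
    that a static dataflow graph could not loop over."""
    for temp, seconds in schedule:
        # API call into the Puddle runtime: it routes the droplet
        # to a heater and holds it at `temp` for `seconds`.
        droplet = session.heat(droplet, temp=temp, seconds=seconds)
    return droplet
```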
do is run an if statement. So you can take a
sensor reading– here, you’re detecting the
volume of the droplet– and get the data
back, and do something with it, all in
the same program. So you’re computing
over the data that you got back from the real
world all in the same program. So here, we’re taking a
volume reading of the droplet. This is done with a camera. And we compared
to some threshold. And if too low– so maybe the
droplet has evaporated a little bit– we pump in some more water
with the input command, and that’s an API call. And we just add it
right to the droplet. And because this is just
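The evaporation check might be sketched like this; again, the API names `volume`, `input`, and `mix` are my guesses at what the calls look like, not the real Puddle interface.

```python
def top_up(session, droplet, min_volume=1.0):
    """If the droplet has evaporated below a threshold, pump in
    water and merge it in -- an if statement over sensor data."""
    v = session.volume(droplet)   # camera-based volume reading
    if v < min_volume:
        water = session.input("water", volume=min_volume - v)
        droplet = session.mix(droplet, water)
    return droplet
```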
regular, vanilla Python code, it’s compositional. So if I gave you the
thermocycle code in a library, you could write PCR or a
Polymerase Chain Reaction, which is something you
might do to duplicate DNA. And it’s basically a simple
instance of thermocycle. So you wouldn’t have to write
thermocycle on your own, you could use the
method that I gave you. So that’s really nice. You can write all this
fancy code in Python, and it will run on the device. So like I said, all these
go through API calls, which hit the Puddle runtime. Which I am not going
to talk about at all, because I don’t
have enough time. At the end of the day, Puddle
crunches on the commands that you gave it, and it eventually
and it eventually calls out to PurpleDrop. Which, here’s a nice
big image of PurpleDrop. That eventually runs your code. Cool. So again, just to tell
you some cool things you can do with this,
there’s three things you could do with Puddle. And they’re all
sorts of decisions that you couldn’t make before. So you can make decisions at
the execution level, which is where you take sensor
data and you feed it back into the planning loop. So Puddle’s using this to
make your program run better. And one good example of
this is droplet detection. So PurpleDrop has
a camera on top, and it’s constantly
looking at your droplets. It’s using a very, very
simple computer vision, and it’s detecting how
big your droplets are and where they might be. And what it’s doing,
is it’s comparing that with internally where it’s
expecting your droplets to be. And so, if, as is common
on this technology, a droplet fails to
move as planned, we can catch that and
reroute around it. And we can do that
dynamically, because we’re not committed to a particular
plan ahead of time. And that’s the annotated
computer vision algorithm working in real time. So it works in real time,
and this graph basically says it works. And if you don’t have
this error correction, it’s hard to do
much on the device. Because errors are common,
and you need to correct them. [MUSIC PLAYING] AUDIENCE: Your time’s up. MAX: My time’s up! [LAUGHTER] OK, that program
that we showed you, that was the first time anyone
could run PCR on this device. So that’s pretty cool. And we wrote a Key-DNA store. It’s like a key value store,
where you give it a key and you get back DNA. And that’s it. You can watch the
animations finish. [APPLAUSE] ZACH: Are there any
questions for Max? MAX: Sorry, I need this. AUDIENCE: Really cool talk. What is the debugging
experience like? So if I make a mistake, as I
do when I write Python code, what’s my experience in
figuring out what went wrong and how to fix it? MAX: So debugging
your protocol probably isn’t going to be too much
different from debugging your Python code. That’s one nice thing is that
you’re just running Python, too. If you want to do
print line debugging or even invoke the PDB and run
your protocol interactively, you could totally do that. Debugging the hardware
is a nightmare. It’s like magic when it works,
which means just like magic when it doesn’t. So it’s very hard to
figure out what’s going on. [LAUGHTER] Yeah. AUDIENCE: So you set up a very nice analogy with conventional operating systems. I just want to write; I don’t want to know how many disks I have. But the thing is, maybe once
a decade I go to write a file, and it says, sorry, there’s
no room on the disk. I imagine you have constant
resource exhaustion issues, because it’s this big. Right? MAX: Yeah. So you can totally run
into resource exhaustion. It’s really kind of related
to how many live droplets you have. So the scheduler,
because you don’t get to say when
things run, has a lot of freedom in sequencing
things that you thought might run in parallel. But if you don’t have enough resources to run it– yeah, you totally could come up with a case that does that. We are picturing a case
where, instead of one of these devices, you
have a gajillion of them all linked together. In which case, this
starts to make more sense. There was– I think
one, maybe over there. No. OK. [APPLAUSE] ADITYA: Yeah, I’m a new
grad student, so most of you haven’t met me. Can I have a quick show
of hands, how many of you do that PixieDust
named machine learning? Yeah, cool. So the PixieDust named
machine learning? AUDIENCE: [INAUDIBLE]. ADITYA: Yeah, so the fancy term. So essentially, I’m
trying to combine two of these hard topics,
like machine learning for internet of things. So my research, which I did
at Microsoft Research India, was to enable these complex
machine learning algorithms in 2 kilobytes of RAM for
the internet of things. And we’re talking about
one particular algorithm which enables deep learning
in 2 kilobytes of RAM for time series data. So I call this The Edge
of Machine Learning. But people thought it
was for Microsoft Edge because I was in
Microsoft, but it is not. [LAUGHTER] So let me be very clear about
the devices I’m talking about. Like these ARM Cortex-M0 class
devices which have 2 kilobytes of RAM and maybe 48
megahertz clock– maybe the weakest of
the devices out there. And you can’t even run a
proper for loop for long enough without running out of battery. So that’s essentially
the scale we’re tackling. And the flash is
not great as well. We had 32 kilobytes. So that essentially means a
working memory of anything should be less than 2
kilobytes and our model can’t be more than 32 kilobytes. So what is the impact? Most of these things
are futuristic. But one of them, actually,
the smart agriculture for disconnected
farms– that’s something called FarmBeats, which happens
across the lake in Microsoft Research, Redmond, that
has been pretty successful. But the rest of the things
are still futuristic. So let’s hope they
come to fruition. So just to be clear,
the models will be trained on the cloud,
infinite resources. We don’t care about how
many GPUs, how many icebergs we melt. Finally,
we want them to be deployed on the tiny
IoT devices which have very, very small resources. So a shameless plugging
of advertisement. So we developed a library in
Microsoft Research India which has four cool algorithms,
two in ICML in 2017, two in NeurIPS in 2018. We have two more coming up,
one in [INAUDIBLE] 2019, and in NeurIPS 2019. So essentially it
has six algorithms. Bonsai is our classical
tree-based algorithm which [INAUDIBLE] 2 kilobytes
for classical machine learning tasks. Essentially, you can
take any classifier which you have been using
in your code bases and replace it with
Bonsai or ProtoNN. ProtoNN is a
kNN-based classifier. Whereas, coming to the much
cooler thing– that’s deep learning– so EMI-RNN
is a signal-localizing tool for your time series data. And the thing which
I’ll be talking about today is FastGRNN. It expands to Fast, Accurate, Stable and Tiny Gated Recurrent Neural Network. So it’s a recurrent
neural net, which we expect to be 45 times faster than [INAUDIBLE], while maintaining
the same accuracy. So [INAUDIBLE] time
series like IoT essentially gets the
streaming time series data every single day,
so we just wanted to see how we can
leverage time series data and make some sort
of insights out of it using machine learning. So this has been a
joint collaboration with [INAUDIBLE] who is at
MSRI and a couple of people from Berkeley. So just a refresher
for people who don’t know what recurrent
neural networks are, they have been out there
from 1989 probably. And they essentially
are an intelligent way of recursive matrix
multiplications which keep on accumulating
some memory into the future so that you can leverage the
entire [INAUDIBLE] memory for some classification task. So all I want to show here
is the most important term is the U matrix multiplying with
the hidden state [INAUDIBLE] over and over. So if there is a long
time series which keeps on accumulating
memory, this U essentially becomes something like
U to the power of T, where T is the total
number of time steps. And this comes into
the gradient step during the optimization,
which leads to something called exploding
and vanishing gradient. So you either have your
neural network exploding in gradients– they don’t work– or you don’t have your gradient
moving in any direction, so they are just saturated. So you don’t learn anything
at the end of the day, because of this problem. And this has been a known
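That blow-up is easy to see numerically: the gradient through T time steps contains a factor like U to the power of T, so a spectral radius above or below 1 explodes or vanishes. A toy numpy illustration, with numbers of my own choosing:

```python
import numpy as np

h = np.ones(4)
for scale, name in [(1.5, "exploding"), (0.5, "vanishing")]:
    U = scale * np.eye(4)          # spectral radius = scale
    g = h.copy()
    for _ in range(50):            # T = 50 time steps
        g = U @ g                  # the U^T factor in the gradient
    print(name, np.linalg.norm(g))
```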
problem from like 1990s. And people have been
trying to solve it. So one of the techniques
which people came up with is called unitary RNNs. Because the eigenvalues of
the U to the power of T term explode, what they did
is they just constrained U to be a unitary matrix. So the highest eigenvalue and the lowest eigenvalue are always set to 1. So you always project the
matrix back into unitary space. So there is no chance
your optimization is going to go bad. But the problem
is because you’re constraining the
[INAUDIBLE] of the model, your accuracy takes a big hit, and
you require much bulkier models to work on this. So essentially, this stops
us from deploying them onto the edge. Then comes the standard
state-of-the-art, LSTMs, GRUs, which have been there since 1994. They’ve got the
state-of-the-art everywhere, but they’re too bulky
because they’re too complex. And most of these processors
are single-threaded, so we can’t even leverage
parallel processing. So everything is
pretty expensive if you want to go for these. So we come up with
something called FastRNN. So FastRNN is the
predecessor of FastGRNN. FastRNN is one of
the primitive RNNs which you can see out there. It’s essentially a weighted
residual connection that implies you are adding
the current memory multiplied with the weight to the previous
memory with another weight. And both of these weights
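As I read it, that update is roughly h_t = alpha * tanh(W x_t + U h_{t-1}) + beta * h_{t-1}; this numpy sketch and its dimensions are my own, not taken from the talk.

```python
import numpy as np

def fastrnn_step(x, h, W, U, alpha=0.1, beta=0.9):
    """One FastRNN step: a weighted residual connection adding the
    current candidate memory (weight alpha) to the previous hidden
    state (weight beta)."""
    candidate = np.tanh(W @ x + U @ h)   # standard RNN candidate
    return alpha * candidate + beta * h  # weighted residual

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8)) * 0.1   # hidden size 16, input size 8
U = rng.standard_normal((16, 16)) * 0.1
h = np.zeros(16)
for x in rng.standard_normal((10, 8)):   # a 10-step time series
    h = fastrnn_step(x, h, W, U)
```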
are coupled, essentially, and you train them with
standardized [INAUDIBLE], and you hope everything works. And surprisingly,
everything works. [LAUGHTER] But the only problem
is we are still 1% to 2% short of
LSTMs and GRUs, so we thought, what can we do? Because our inference
[INAUDIBLE] is as good as RNNs. So there’s no overhead
over RNN, so we thought, what can we do better? And there are some
theoretical [INAUDIBLE]. If someone wants to
talk, we can talk later. But essentially, we,
for the first time, proved that there can be a non-time-dependent bound for generalization and
convergence in RNNs. And this is an [INAUDIBLE]
comparison, so it’s not fair, but still it’s a first
of its kind result. Then we thought, why can’t
we match the LSTMs and GRUs? And we came up with a slightly
intuitive and intelligent way of leveraging gating, which
has been there in LSTMs and GRUs for a long time. So we’ve used most of the
information which RNN is using, and we essentially do some sort
of affine transformations– very, very simple intelligent
transformations– which essentially are
multiplying it with a scalar and adding it with
a scalar for a– just a subtraction
from one, assuming their entire information
is bounded by one. And we don’t know
why, but this works as well as LSTMs and GRUs,
and it’s four times faster. So we thought, why
not push it further? So we took the
matrices and made them low-rank, sparse,
byte quantized. So at the end of the day, we
are 45 times faster than LSTM while maintaining the accuracy. So whenever you see
your LSTM or GRU and you are finding it hard to deploy, just replace it with FastGRNN. And hopefully you’ll get
within 1% of the accuracy. Otherwise, just hit me up. [LAUGHTER] So I was surprised
initially, because when we were designing
this algorithm we thought this would be
only for the IoT task. But it generalized pretty well. So the first three data sets
are actually speech recognition data sets, where we
have short utterances of words, like hello,
Marvin, 1, 2, 3, and so on. And Wakeword-2 is
a data set which is actually a
Microsoft proprietary Hey, Cortana data set. So we wanted to do Hey,
Cortana detection on-device for privacy reasons. I think just after we released
the paper, Apple [INAUDIBLE] on-device. So before that,
nothing was actually happening on-device. Everything was being
streamed to your cloud. So every single word you
utter, even before Hey, Siri, is actually in the cloud. So we don’t want that to happen,
and you can use this to avoid that. We generalized it to NLP
like sentiment analysis, language modeling. And then we also ran experiments
on activity recognition essentially. So the left three
bars are [INAUDIBLE]. So we are really good, otherwise the paper wouldn’t have been accepted. But the model size is something which I’m very happy about, because this is log scale, and they’re super small. This is a comparison
with unitary RNNs. Against unitary RNNs, we even get about 200 times better model sizes while beating them [INAUDIBLE] by up to 2%. This is the crown jewel
of the entire paper. We actually did Hey, Cortana
detection on 1 kilobyte of RAM, and we deployed it. So the only caveat
here is we assume that we get the featurized
version of your [INAUDIBLE] in, because we can’t afford to run the FFT on the device. So if you get the features for Hey, Cortana, we can always make the decisions
using our 1 kilobyte model. We ran a bunch of experiments
on real-world IoT devices, and we see it’s up to
two orders of magnitude faster than most of the
other algorithms like LSTMs and unitary RNNs. If you want to deploy an
RNN-based model for IoT, I think, [INAUDIBLE] RNN is the way to go. It’s just a new upgrade. So we have about 700 stars. So please, get me 60
more stars, and it improves my GitHub profile. So we don’t support TensorFlow
2.0, but we support Python, C++, and TensorFlow. We even have code optimized
RNN [INAUDIBLE] for Python. So give it a shot. Thank you. [APPLAUSE] ZACH: Do you have any questions? AUDIENCE: So that accuracy– the 1% accuracy–
right now, I understand that’s sort of just a
general measurement. But are there– is there future
work planned to actually peg down why that happens? How far it generalizes? Is that something
you’re planning on doing in the future? ADITYA: Not exactly. But we tried to deploy this
in huge products at Microsoft, and this 1% is something
which I observed there. In the IoT regime, I still
match the performance of LSTM and GRU, but it has
some [INAUDIBLE] issues when it goes to large-scale data sets. AUDIENCE: OK. So what you’re saying is there’s not
sort of a general equivalence between these two things. ADITYA: Yeah. Because we– AUDIENCE: You don’t really
have a consensus of when the [INAUDIBLE] ADITYA: Yeah. It’s just experimental. So as long as LSTM works, I can
guarantee you within 1% error. But yeah, that’s that. I can’t always guarantee
matching LSTM performance. AUDIENCE: OK. Fair enough. It was cool. The experiment was [INAUDIBLE]
with very good results. ADITYA: Sure. Thanks. AUDIENCE: What size
[INAUDIBLE] was the library? Because you said you did Hey,
Cortana in 1k [INAUDIBLE] you did it– like, the general library
size, how big is it? ADITYA: So the code
is pretty small. The inference codes are
written in native [INAUDIBLE]. So inference codes are just a
couple, 20 to 30 lines of code. And once you write the
serialization you can do that. And we also have
a compiler which was published in PLDI which
can take in all of our models and do the inference for you. Yeah. AUDIENCE: You said
that the unitary RNN is inherently stable, but
then it’s really bulky. ADITYA: Yeah. AUDIENCE: So what’s the
trade-off for FastRNN? Is there a similar [INAUDIBLE]? ADITYA: No. So FastRNN is only limited
by its expressivity. Because even though the universal approximation theorem says all RNNs are all-powerful, you are using a convex optimization routine to solve a non-convex problem. So we don’t know when we hit the sweet spot. But empirically, when we
actually do the experiments, like unitary RNNs
are [INAUDIBLE] 5% lower than FastRNN. But FastRNN is still [INAUDIBLE] 1% to 2% less than [INAUDIBLE]. So this is all empirical. But theoretically, FastRNN’s
training is always stable. So you will never have exploding
or vanishing gradient problems. ZACH: These are great questions. I think we’re going to thank
Aditya again and have Robbie– [APPLAUSE] And remember, these
are your colleagues. So if you have more
questions, find them in the research [INAUDIBLE]. AUDIENCE: Could I add a quick note? Just, if you’re using a
mic, make sure you’re holding it close
enough to your mouth that it actually picks up– ZACH: Everybody hear that? (SHOUTING) If you
want to be heard– Just a second, guys. ROBBIE: All right. So this is joint work that
I did with three undergrads here at UW– Nick, [INAUDIBLE] and Sierra. The title for the draft that
we have of this paper right now is A Strategy-Proof Alternative
to Round-Robin Subtournaments. But for this talk,
I’ve added a subtitle, which is, Are People
Good at Their Jobs? And that’s actually
the question I’m going to be asking you multiple
times throughout this talk. So I’m going to
start by asking you about some Olympic
athletes, and asking you, are they good at their jobs? So I’m going to
bring up a video. This is from the 2012
Olympic Badminton tournament. And you can tell it’s 2012,
because this video is in 360p. So I will be giving
play-by-play, because you can’t actually
entirely see what’s going on. So we’re going to
watch the team in gold. And specifically, we’re going
to watch this person in the back here. She’s going to take a big swing
and send the birdie directly into the net. Sending things
directly into the net is going to be a bit of a theme. Here’s a serve. It’s going to go
directly into the net. Another serve
directly into the net– and a little later, a third
one directly into the net. At this point, this gentleman
is going to come here. He’s one of the organizers
of the tournament. He’s going to give the teams
a little bit of a pep talk and suggest maybe
they try hitting the birdie into the net
a little bit less often. The gold team really
takes this to heart. And instead of hitting
the birdie into the net, they just let it
drop onto the ground. [LAUGHTER] And lest you think this is
just the yellow team involved, a little bit later, the
purple team is going to serve, and they are going to
miss by about that much. [LAUGHTER] So these are Olympic athletes
playing their Olympic sport. And I’m going to ask
you the question, are these Olympians
good at their jobs? You do have to vote. Who votes yes, these Olympians
are good at their jobs? Who votes no? It’s about 50-50. So watching just
that 20-second video, you would probably say that
they’re not playing very well. In fact, you might
even say it looks like they’re trying to lose. And it doesn’t just look
like they’re trying to lose, they are trying to lose. And in fact, that’s
why I’m going to argue that they’re
good at their jobs. So what is the job
of an Olympian? Well, it’s to win
the gold medal. And the way this
tournament was set up, those teams actually
improve their chances of winning the gold medal
by losing the match. So how could that happen? I’m going to break
down very briefly how this tournament works. So there are 16 teams. They’re going to be divided
into groups of four, and you will play each of the
other teams in your group. So that’s three matches. After those three matches, the
top two teams in every group are going to move on, the bottom
two are going to be eliminated. So the second phase of the
tournament, for those top two teams, depending on
which group you’re in and whether you’ve
finished first or second, you’re going to be put into
this bracket over here. And in this bracket– this
is just a single elimination tournament– these top two
teams play, the next two play, the next two play,
and the next two play. Losers are eliminated,
winners move on. These two winners
play each other, those two play each other,
and then the last game, the two remaining teams
play, and the winner gets the gold medal. So the two teams we are watching
are teams W and X in group A. They know, whether
they win or lose, they’re going to move
on to the next round. It’s just a question of are
they going to be A1 or A2. And because of some other
results in the tournament, it turned out that D2, the
second place team in group D, was one of the best
teams– the team that they were
most worried about. And so W and X looked
at this bracket, they looked at where D2 is
going to end up, and they said, well, I don’t want to play
D2 for as long as possible. And given a choice of
A2, way down here, or A1, a little bit closer to D2,
I want to be further away. And so they said, I
would rather be A2. And to do that, I want
to lose the match. So losing their
match, not winning it, gave them the best chance
of winning the gold medal. Or at least, it was supposed to. Remember that guy
giving them a pep talk? That was a little bit
less of a pep talk and more of a threat talk. They were disqualified
immediately after the match for not using their best efforts
and for conducting themselves in a manner which was clearly
abusive and detrimental to the sport. So you could tell from
the quality of that video that it was from
quite a while ago. It was from 2012. And the Olympics happen
every four years, right? So there was another
Olympics in 2016. And after the international
news coverage of these badminton players looking silly
trying to lose a match, the badminton organizers
had to do something. So they had to do something
to change the tournament to make this not happen again. So what did they do? They kept the same
basic structure, but they sprinkled in a little
randomness between the two phases. So instead of knowing
exactly where you’re going to end up in
the bracket, they changed it a little
bit so there’s a little bit of randomness
in where you end up. It’s not uniformly
random where you end up, but it’s not completely
determined anymore. And what they said
when they announced this change is that they
had eliminated any players’ thoughts about actively trying
to lose a match or matches, irrespective of
other match results, and that they had ensured
such a regrettable spectacle would never be witnessed
in badminton again. [LAUGHTER] And the 2016
tournament happened. There were no incidents. So did the Badminton
World Federation, when they redesigned this
tournament, do a good job? Who thinks yes,
they did a good job? Who thinks no? Again, split. My vote is going to be no. It’s not too hard to
construct another example where your teams are still
going to want to lose. So it’s not too hard. Here’s what you do– you start with the scenario
that happened in 2012, where the teams wanted to lose, and then you just run it again. Without changing anything
about the scenario, the teams are still
incentivized to lose. Now, you do have to
think about it a little bit, and you do have to do a
probability calculation, but they’re still
incentivized to lose. So some of you are saying,
wait a minute, Robbie, we’re seven minutes
into your talk, and all you’ve talked
about is sports. Is this just a
talk about sports? And no, it’s not just
a talk about sports, there is some
computer science here. Let me get to it. Specifically, this is
going to be a problem in algorithmic game theory. So what is algorithmic
game theory, or what’s the problem here
in algorithmic game theory? The tournament organizers
did not properly align their incentives. The designers of the
tournament wanted a tournament that was exciting for the fans,
where every team was trying their best to win every
match, and every match was very exciting. The players wanted to
win the gold medal. So what we have here is a
misalignment of incentives. So mechanism design
is a subfield in algorithmic game theory,
where you design algorithms assuming that the inputs
are coming from people who have incentives. And what we want to do is
design a tournament here– an algorithm– where the incentives
of the competitors and the incentives of the
designers are aligned. Essentially, where wanting to
win the gold medal inherently makes you, from the
tournament design, want to win all of your matches. So that’s what
we’re going to try to do in the last couple
of minutes of this talk. So we want a tournament
that aligns incentives. So it’s not too hard to come
up with such a tournament. What you do is you just have a
single elimination tournament for the whole thing. So take that second
phase that they had and just make that phase
the whole tournament. And you are never
going to want to lose a match in this tournament,
because if you lose, you have no chance of
winning the gold medal, because you’ve been eliminated. That does work, but it’s
not a very exciting answer, and it’s not a
very useful answer. In fact, this was the way the
badminton tournament worked until 2008. And in 2012, what the
badminton organizers decided is that they didn’t
want that anymore. And why didn’t they want that? What they want, I believe,
is to ensure every team plays more than one match. This is the Olympics. They have teams flying in
from all over the world. These teams are
going to come, and if they play in a single elimination
tournament, half of them will play one match,
they will lose, and then they will
have to go home. And that’s very disappointing
for all the teams that flew across
the entire world to play in this tournament. It’s disappointing for
the fans who get to watch them once, be
disappointed that they lose, and then have to go home. So these are the
two goals we want– to align incentives
and ensure every team is going to play
more than one game. So your first thought
might be, well, let’s just keep the group
play and maybe add a little bit more randomness
to that second phase. Unfortunately,
that doesn’t work. This is a result
by Pauly from 2014. Under some pretty
reasonable assumptions, you can’t have this group
play as the first phase and do anything in the second
phase that will properly align the incentives. So if we’re going to
design this tournament, we can’t be messing
with the second phase. We have to actually
change that first phase, that group play phase. So this is what
we’re going to do. So my conjecture when we
started this project about nine months ago is that what
you should do is you should replace the group play
with a double elimination tournament. So what’s a double
elimination tournament? It starts out a lot like a
single elimination tournament. On the top half, it looks
almost exactly the same. But when you lose, you’re
not eliminated right away. When you lose, you go
down here to what’s called the losers
bracket, and you’d stay in the losers bracket
as long as you keep winning. So when you get your second
loss, you are eliminated. Until you have two losses,
you continue playing. And you just continue like that
until there is exactly one team remaining with less
than two losses. And then they can
advance out of this group play, out of this first phase. And my conjecture
nine months ago is that this works– that this
properly aligns incentives. Now, one more time,
I’m going to ask you, is someone good at their job? And of course, the job of a
theoretical computer scientist is to make conjectures and then
prove those conjectures to be correct. So do we think I’m
good at my job? Who says yes? Who says no? You are far too optimistic. [LAUGHTER] That conjecture is false. Luckily, my co-authors are
very good at their jobs. And so, together, we
were able to come up with a different conjecture
that is actually true. So you can’t use a double
elimination tournament, but you can make just
the tiniest modification, and then it will properly
align incentives. So what we do is as soon as we
have two teams left, instead of those two teams playing
until someone has two losses, they play just a single game. And whoever wins that
game is who moves on. So I’m going to
actually prove to you right now that this
works– that this properly aligns the incentives. So in any game
where you’re going to be eliminated
if you lose, you are not going to want to lose. So all of these
matches in green here, we’ve handled just
with that statement. Then we just have to worry
about the winners bracket. In this semifinal game, if you
lose, what you’ll have to do is beat the other two teams who
are remaining in the tournament one right after the other. On the other hand,
if you win, you just have to beat one
of those two teams. So if you lose, you
will have to play a superset of the opponents that
you would play if you had won. So you’re definitely
going to choose to play fewer opponents to need
to beat fewer teams to win. And in the first round,
we have, essentially, the same statement. If you lose, you’re going
to have to beat all three of the other teams
in the tournament one right after the other
in elimination games. If you win, you
only have to beat two of the three teams
remaining in the tournament in elimination games in
order to win the tournament. And that is the entire
proof of the theorem. This theorem, one of the reasons
that I’m so happy with it, is it’s simple enough to
really fit onto a slide. And actually, if you take
maybe a little bit longer than I took here, I think
you could explain it to Olympic athletes, even
if they don’t have much math background. [LAUGHTER] And really do convince them not
to lose anymore in tournaments. So that is our theorem. So what are the
takeaways from my talk? The first takeaway is, watch the
badminton tournament in 2020. It could get really interesting. Secondly, tell the Badminton World Federation about the modified double
elimination tournament instead of group play and help
us ensure that such a regrettable spectacle is never
witnessed in badminton again. Thank you. [APPLAUSE] Dan? AUDIENCE: So you didn’t
give us any intuition on why the traditional double
elimination doesn’t work. ROBBIE: Yeah, so why does
traditional double elimination not work? So the problem is, essentially,
it’s only in that first round. If you lose in the
first– if there’s a team that’s better than you,
that you don’t have a very good chance against,
you might be able to intentionally
lose in the first round, face them in the
loser’s bracket, only need to get an upset once. As opposed to, if you
win, you might play them in the final round, where
you have to beat them twice. And even though it’s
two out of three, you might prefer, if
you’re worse than them, to just have to beat them once
down here, rather than two times out of three naturally. Other questions? [INAUDIBLE] AUDIENCE: Actually,
no– never mind. ROBBIE: OK. Yeah? AUDIENCE: If you’re telling
us to watch badminton in 2020, wouldn’t this be evidence
that the committee has done a great job
designing the tournament? [LAUGHTER] ROBBIE: I suppose,
under the “there’s no such thing as bad press” standard, then yeah, this is
quite possibly what they should keep around. Yeah. AUDIENCE: It seems like,
from a meta perspective, the committee has aligned
their incentives just fine by the method of, if you try to pull this stunt, we will just kick you out. ROBBIE: Yeah. This is a good point. So there’s this
external statement that they can make of,
if you try to lose, we’re going to kick you out. And therefore, you
can’t try to lose. This was a particularly
bad example of teams trying to lose,
where it was very obvious. You do have to
be worried, though, about teams trying to lose and
being subtle enough about it that you don’t catch them. Paul. AUDIENCE: So there are other
types of sports– curling and softball– the College World Series,
and things like that, where, in
College World Series, you have multiple games
to decide whether you want to win or lose. So a single pairing isn’t
enough for things like that. Are there modifications
that would allow you to generalize in
some of these other scenarios? ROBBIE: Yeah. So are there things we
could do to generalize? So this is actually one
of the strange things about the statement, it
holds for four teams. If you go up to eight
teams, we actually don’t know how to make it so
that teams don’t want to lose. So this double
elimination, the proof is actually very fragile to
having only four teams around. So as long as you keep
everything in groups of four, you might be able to modify it. I actually don’t know
ways to get bigger than 4, except by saying, handle
the size 4 and recurse. Brett. AUDIENCE: One quick
one, one moderate one. The quick one is
just to clarify, you’re proposing the
modified double elimination to replace the group play and
then you proceed from there to the standard knockout. ROBBIE: Yes. Yeah. This replaces the group play,
then you can do either one. You can either go to
the standard knockout or you could do another version
of this– whichever works. AUDIENCE: And the
longer question is, is there a 30-second or
so summary of the Pauly result about why group play inherently must [INAUDIBLE] aligned incentives. ROBBIE: Yeah, is there a 30-second version of the Pauly result? Unfortunately, no. His proof is a proof by
exhaustion that was– [LAUGHTER] –had to be checked
via computer, because he literally just said,
here are about 5,000 cases. I wrote a program in C++ that
checks them all, and trust me, none of them work. [LAUGHTER] Getting a short
version of that proof is actually an interesting
open problem. Lauren. AUDIENCE: So the one
benefit to group play is you only have to
play three games. Which, I’m thinking,
in terms of– I’ve been to so many
ultimate Frisbee tournaments, that time-wise you have
a weekend to do this. This actually, the
double elimination, winds up adding time. ROBBIE: Yeah. So it looks like it adds time. It actually doesn’t. So we’re only having one
team advance out of the four, instead of two. So we’ve actually gotten
rid of one of the rounds in the knockout round. And there are actually only
four rounds necessary here. Every team is going to
play, at most, four matches. AUDIENCE: True. But you have to play
A, B, C, D, and then L1 and L2 actually happens after
A, B, C, D happened, right? So doesn’t it– it’s
not really balanced. ROBBIE: So A, B, C,
D play all at once. The losers and the winners
play– so 3 and 4 happen. 5 happens by itself,
and then 6 happens. And then, remember,
there is no 7, because we modified
it to get rid of 7. So it’s actually
only four rounds, and it’s taking the
place of four rounds in the original tournament. AUDIENCE: Oh, four rounds. ROBBIE: Yeah. Should we– ZACH: One more. One more? OK, Nathan. AUDIENCE: So what if you
know that you can’t get gold, does this still work? If you’re trying to get [INAUDIBLE]. ROBBIE: Oh. That’s very interesting. I don’t know. But we have not put the
paper on arXiv yet, so we can figure it out
and type it up quickly. You can get an acknowledgment
for asking that question. ZACH: Sorry, [INAUDIBLE]. We’ll take more
questions offline. So thanks, everybody,
for showing up. Please, make sure to
come back October 24, for the second installment of
2019’s Allen School Colloquium. And until then, I hope everybody
has a wonderful beginning of the quarter. Let’s thank the
speakers one last time. [APPLAUSE]
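Robbie’s modified four-team double elimination, and the exhaustive-checking style he mentions for Pauly’s result, can be sketched in code. This is my own illustration, not the paper’s implementation: the game order, the seeding, and the `p[i][j]` win-probability model are all assumptions. It enumerates every possible run of the bracket and then verifies the talk’s claim: in every reachable state, a team’s chance of advancing is at least as high if it wins the current match as if it loses it.

```python
import itertools
import random
from collections import defaultdict

def enumerate_runs(p, seeds=(0, 1, 2, 3)):
    """All 2^6 outcome sequences of the modified four-team double
    elimination, where p[i][j] is the probability that i beats j.
    Returns (games, advancer, probability) triples; games is a list
    of (home, away, winner) in play order."""
    a, b, c, d = seeds
    runs = []
    for bits in itertools.product((0, 1), repeat=6):
        pr = 1.0
        games = []

        def play(i, j, bit):  # resolve one match per the outcome bit
            nonlocal pr
            w, l = (i, j) if bit == 0 else (j, i)
            pr *= p[w][l]
            games.append((i, j, w))
            return w, l

        w1, l1 = play(a, b, bits[0])    # winners bracket, round 1
        w2, l2 = play(c, d, bits[1])
        w3, l3 = play(w1, w2, bits[2])  # winners final; loser drops down
        w4, _ = play(l1, l2, bits[3])   # losers bracket; loser has 2 losses, out
        w5, _ = play(w4, l3, bits[4])   # losers final; loser is out
        w6, _ = play(w3, w5, bits[5])   # the modification: one single game
        runs.append((games, w6, pr))
    return runs

def incentive_violations(p, seeds=(0, 1, 2, 3), eps=1e-9):
    """For every reachable history and every game, compare a team's
    chance of advancing given it wins that game vs. given it loses.
    Returns the states where losing is strictly better (should be none)."""
    runs = enumerate_runs(p, seeds)
    bad = []
    for k in range(6):
        groups = defaultdict(list)
        for games, adv, pr in runs:
            prefix = tuple(g[2] for g in games[:k])  # winners so far fix game k
            groups[prefix].append((games[k], adv, pr))
        for prefix, rs in groups.items():
            i, j, _ = rs[0][0]  # participants are determined by the prefix
            for t in (i, j):
                w_tot = sum(pr for g, a_, pr in rs if g[2] == t)
                w_adv = sum(pr for g, a_, pr in rs if g[2] == t and a_ == t)
                l_tot = sum(pr for g, a_, pr in rs if g[2] != t)
                l_adv = sum(pr for g, a_, pr in rs if g[2] != t and a_ == t)
                if w_tot > eps and l_tot > eps:
                    if w_adv / w_tot + eps < l_adv / l_tot:
                        bad.append((k, prefix, t))
    return bad

if __name__ == "__main__":
    rng = random.Random(0)
    for trial in range(20):  # random pairwise win probabilities
        p = [[0.0] * 4 for _ in range(4)]
        for i in range(4):
            for j in range(i + 1, 4):
                p[i][j] = rng.random()
                p[j][i] = 1.0 - p[i][j]
        assert abs(sum(pr for _, _, pr in enumerate_runs(p)) - 1.0) < 1e-9
        assert incentive_violations(p) == []
    print("no state where losing beats winning")
```

For any win-probability matrix, `incentive_violations` comes back empty for the modified format, which is the strategy-proofness claim from the talk: winning the semifinal leaves you a convex combination of single opponents, while losing forces you to beat both remaining teams in a row, and similarly for the first round.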