McLaren Stanley
@StanTwinB

Alright folks, gather round and let me tell you the story of (almost) the biggest engineering disaster I’ve ever had the misfortune of being involved in. It’s a tale of politics, architecture and the sunk cost fallacy [I’m drinking an Aberlour Cask Strength Single Malt Scotch] https://twitter.com/StanTwinB/status/1336835936538087428

The year was 2016. Donald Trump was not yet President, so the #DeleteUber movement hadn’t happened yet. TK was still the CEO, we were still in the hypergrowth phase of international rollout, public sentiment was overwhelmingly positive, and Uber was riding high.

But hypergrowth is not without its problems, and the app itself was starting to show some cracks. The engineering org had doubled in size almost every year prior, and when you grow that fast you end up with an incredibly wide range of skill. That, paired with a hacking mentality

that we called “Let builders build”, meant that the app architecture was complicated and fragile. Uber at the time was extremely heavy on client-side logic, so the app would break a lot. We were constantly doing hot fixes, burning releases, etc. The design was also scaling badly.

As a consequence of all these problems, a movement began growing across all levels of the org, rallying around the idea of “rewriting the app from scratch”. The general sentiment was that the architecture was slowing us down and that starting over would be faster.

So a team was formed to build a new mobile architecture for this new app. The driving charter for the team was to build an architecture that would “sustain mobile development at Uber for the next 5 years”. We did both platforms at once. Product and Design also started over.

On the iOS side of the world, the rewrite presented an opportunity to adopt Swift (which was in version 2.x during this timespan). Uber had tried Swift before but, like many who had adopted it that early on, found it extremely problematic, so it had been banned prior to the rewrite.

But the general feeling of the architecture team was that most of Swift’s problems centered around the flakiness of the Objective-C interop back then so if we wrote a pure Swift app we could avoid the major issues.

There was also a push to use the same major architectural patterns on both Android and iOS. The Android folks at the time were big RxJava fans, and there was an equivalent RxSwift library that took advantage of the functional programming paradigms in Swift. Seemed straightforward.

So this smaller core team of Design, Product, and Architecture went off into a room for a few months with their new functional/reactive patterns, new language, and new app. Everything went well. The architecture relied heavily on the advanced language features of Swift.

The UI design was scalable for the growing number of products that Uber offered, the functional programming paradigm was powerful (albeit with a bit of a learning curve), and the architecture centered around our new realtime stream-based networking protocol (that’s the part I wrote).

A few months and a number of flashy demos later, the momentum was building. The project was looking like a success. They had built amazing experiences in a short time with a small number of engineers. Most of the core product was built out. The execs were sold.

So the company-wide rollout began. Teams began shifting all their focus to bringing their features to the new app. At first the excitement of the new created a flurry of motivation and productivity. The architecture was built for feature isolation, which allowed teams to move fast.

But once Swift started to scale past ten engineers the wheels started coming off. The Swift compiler is still much slower than Objective-C’s today, but back then it was practically unusable. Build times went through the roof. Typeahead/debugging stopped working entirely.

There’s a video somewhere in one of our talks of an Uber engineer typing a single-line statement in Xcode and then waiting 45 seconds for the letters to appear in the editor slowly, one by one.

Then we hit a wall with the dynamic linker. At the time you could only link Swift libraries dynamically. Unfortunately the linker executed in polynomial time, so Apple’s recommended maximum number of libraries in a single binary was 6. We had 92 and counting.

As a result, it took 8-12 seconds after tapping the app icon before main was even called. Our shiny new app was slower than the old clunky one. Then the binary size problem hit.

But to answer @tapbot_paul ’s original question, when the problems started showing up in earnest, we were already way past the point of no turning back (sunk cost fallacy). At this point the whole company was pouring its energy into the new app.

Thousands of people across every discipline, millions and millions (I can’t tell you the real number but it was way more than 1) of dollars had been spent. The whole management chain was fully bought in. I had privately had the “we need to stop” conversation with my director.

He told me that if this project fails he might as well pack his bags. The same was true for his boss all the way up to the VP. There was no way out.

So we rolled up our sleeves, and put our best people on each of the problems and prioritized the launch critical issues (dynamic linking, binary size). I was assigned to both dynamic linking and binary size in that order.

We quickly discovered that putting all of our code in the main executable solved the linking problem at App start up. But as we all know, Swift conflates namespacing with frameworks; so to do so would take a huge code change involving countless namespace checks.

That’s when the brilliant Richard Howell (not sure if he’s on Twitter) discovered while reading the Xcode build output that he could take all the intermediate object files and re-link them back into the main executable with a custom script after the build was complete.

Since Swift mangles the object namespace into the symbol name itself at compile time, this meant that he could safely preserve the namespacing while doing this. This allowed us to effectively statically link our libraries and cut our pre-main time from about 10 seconds to basically 0.
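To see why baking the namespace into the symbol name makes re-linking safe, here is a toy model in Python. It is only a sketch: the `mangle` function uses a simplified length-prefixed scheme, not Swift's actual mangling format, `link` is a stand-in for merging symbol tables and not the script described above, and the module names `Maps` and `Payments` are made up for illustration.

```python
def mangle(module, symbol):
    """Length-prefix each part so the qualified name is unambiguous.

    Simplified illustration only; this is not Swift's real mangling format.
    """
    return f"{len(module)}{module}{len(symbol)}{symbol}"


def link(object_files):
    """Merge per-library symbol tables into one, failing on duplicates."""
    merged = {}
    for symbols in object_files:
        for name, address in symbols.items():
            if name in merged:
                raise ValueError(f"duplicate symbol: {name}")
            merged[name] = address
    return merged


# Two hypothetical libraries that both define a type called "Session".
maps_lib = {mangle("Maps", "Session"): 0x1000}
payments_lib = {mangle("Payments", "Session"): 0x2000}

# Because the module is part of each symbol name, both can live in one table.
main_executable = link([maps_lib, payments_lib])
print(sorted(main_executable))
```

If the symbols were stored under the bare name `Session`, the same `link` call would raise a duplicate-symbol error, which is the collision the mangled names avoid.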

Next problem: App Size. At the time we were planning to include the new app in the old app bundle and slowly roll it out at runtime as a safety net. First thing we did to buy space was to just remove the old app. We called this release strategy “Yolo”. TK himself made the call.

We also replaced all of our Swift structs with classes. Value types in general have a ton of overhead due to object flattening and the extra machine code needed for the copy behavior and auto-initializers etc. This saved us space so we pressed on.

But the app kept growing. Soon we hit the cellular download limit (100 MB) for our universal binaries (iOS 8 and earlier). This represented a substantial amount of lost signups (in dollars, it would cost us on the order of 8 figures from people who hadn’t upgraded yet).

At this point we were weeks away from the public launch date. We had graciously received help from a certain company that I’m still under NDA with, but they couldn’t solve our problem. The only thing we could do was regenerate all the model code (25% of the total line count)

back into Objective-C or drop support for iOS 8. Since iOS 9 had introduced individual architecture slicing, the iOS 9 binary was effectively half the size (give or take). With only a week left we decided to eat the 8 figures and drop support for iOS 8.

The general thinking was that at half the size we still had plenty of runway with the iOS 9 binary, and after the rewrite was done we could solve the problem sometime way down the road, because things would slow down a bit. We were unfortunately completely wrong about that.

After the app release we threw a huge party. The app was well received by the press. It was fast and snappy, with a flashy new design.

A bunch of people got promoted. We all breathed a sigh of relief. The 90-hour work weeks stopped for a few weeks.

But then the public sentiment started to shift. The new app was centered around letting the customer enter the destination first so they could get the price upfront (in the old days you just got a multiplier number next to the rate).

Without manual pickup location entry, people’s location would just show up as whatever GPS location was last received. This can be very inaccurate (especially in cities with tall buildings) and drivers would end up on the wrong block. This was a horrible customer experience.

So to improve location pickup we changed the location permission to collect signal in the background so we could send the drivers to your current location. People freaked out. Some of my ex-Twitter colleagues called on me to quit such an evil company that would track you like this.

As a result of said freakout, (there’s a whole other thread about this involving @gruber and @TechCrunch that I’ll explain some other time) people turned off location permission. But the new app hadn’t designed any experience to handle this use case.

So we scrambled to back fill the experience. We debated turning background location off, but then it would destroy the customer experience at pickup time again.

Then Trump got into the White House (this was about three months after the new app release), which set off a chain reaction that led to the start of the #DeleteUber movement. That is also another thread, but basically the NYC taxi union seized upon the outrage created by

the travel ban to accuse Uber of strike breaking by turning off surge at LaGuardia. This was a complete lie: without surge pricing the supply immediately dries up (no one will drive to the airport without the extra incentive for them to go there). But the lie went viral.

All this time the Swift code growth continued. The continued problems and slow developer environment created two warring political factions within the iOS engineers at Uber. I’ll call them the Swift Zealots and the Objective-C curmudgeons.

So the combined external pressure and internal factions meant that tensions were high. The Zealots were in denial about the problems that Swift created. The Curmudgeons complained about everything you could imagine without providing much in the way of solutions.

It was right around this time that the app size problem caught up with us. I was on call and the release team was having trouble submitting the app. Turns out our brilliant solve to the dynamic linking problem now created a main executable that was too large for some archs.

So after fixing that problem, @aqua_geek and I did some digging and discovered that our compiled code size was growing at a rate of 1.3 MB a week. I threw an alarm up the chain. At that rate we would hit the cellular download limit in 3 weeks if we didn’t do something.

But the internal divisions were growing so strong that we were in complete denial. One of the tech leads in the Swift camp wrote a two-page paper on how the cellular download limit didn’t matter (Facebook blew past it a long time ago, after all). We were also so tired of fighting fires.

So one of our data scientists designed a test by artificially pushing one of the architecture slices over the limit and measuring the effect on the business metrics. The next week we pulled that slice back down and pushed another slice over the limit (to control for arch).

The effect was catastrophic. The negative business impact was a few orders of magnitude larger than the entire cost of the year long Swift re-write. Turns out a ton of people are on a cellular network the first time they download the Uber app (who knew?).

So we formed another strike team. We started decompiling our object files and going over the symbols line by line to identify why our Swift code size was so much larger. We deleted unused features. Tyler had to rewrite the watchOS app back into ObjC.

(The watch app was only about 4400 lines, but since the processor arch was different and there was no ABI compatibility, we had to include an entire extra copy of the Swift runtime in the watch bundle.)

We were at our breaking point. So tired. But everyone rallied. This is when the real brilliant engineers started to shine. One of the devs in Amsterdam figured out how to rearrange the compiler optimization passes. For those of us that aren’t compiler engineers I’ll explain.

Modern compilers do a ton of passes on our code. For example one pass might inline your functions. Another might replace constant expressions with their values. Depending on the order they execute the machine code might be smaller or larger.

If your inlined function gets passed a constant the compiler can reason about that and replace the whole thing so

func addFour(_ x: Int) -> Int {
    return x + 4
}
let y = addFour(3)

would become just a constant value (7) if the inline pass goes first (which is much less code).

If inlining goes second, the constant pass won’t be able to reason about the function body and you’ll end up with more code. This all of course depends entirely on what the code you’re writing looks like, so it’s hard to optimize the order of passes generically.
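The ordering effect is easy to reproduce with a toy model. The Python sketch below is not how a real compiler is structured: the tuple-based expression format, the `inline_pass` and `fold_pass` functions, and the node-count `size` metric are all invented for illustration. It runs the same two passes in both orders on a call that passes the constant 3 to a function that adds 4.

```python
# Toy expression IR made of nested tuples; every name here is illustrative.
FUNCTIONS = {"add_four": ("x", ("add", ("var", "x"), ("const", 4)))}


def substitute(expr, name, value):
    """Replace every ("var", name) inside expr with value."""
    kind = expr[0]
    if kind == "var":
        return value if expr[1] == name else expr
    if kind == "add":
        return ("add", substitute(expr[1], name, value),
                substitute(expr[2], name, value))
    if kind == "call":
        return ("call", expr[1], substitute(expr[2], name, value))
    return expr  # constants are left alone


def inline_pass(expr):
    """Replace each call with the callee's body, argument substituted in."""
    kind = expr[0]
    if kind == "add":
        return ("add", inline_pass(expr[1]), inline_pass(expr[2]))
    if kind == "call":
        param, body = FUNCTIONS[expr[1]]
        return substitute(body, param, inline_pass(expr[2]))
    return expr


def fold_pass(expr):
    """Replace additions of two constants with a single constant."""
    kind = expr[0]
    if kind == "add":
        left, right = fold_pass(expr[1]), fold_pass(expr[2])
        if left[0] == "const" and right[0] == "const":
            return ("const", left[1] + right[1])
        return ("add", left, right)
    if kind == "call":
        return ("call", expr[1], fold_pass(expr[2]))
    return expr


def size(expr):
    """Count nodes as a crude proxy for machine code size."""
    kind = expr[0]
    if kind == "add":
        return 1 + size(expr[1]) + size(expr[2])
    if kind == "call":
        return 1 + size(expr[2])
    return 1


def run_passes(expr, passes):
    for optimization in passes:
        expr = optimization(expr)
    return expr


PROGRAM = ("call", "add_four", ("const", 3))
inline_first = run_passes(PROGRAM, [inline_pass, fold_pass])
fold_first = run_passes(PROGRAM, [fold_pass, inline_pass])
print(inline_first, size(inline_first))  # ('const', 7) 1
print(fold_first, size(fold_first))      # ('add', ('const', 3), ('const', 4)) 3
```

With inlining first, the folding pass sees `3 + 4` and collapses it to a single node. With folding first, it sees only an opaque call, so the addition survives into the output.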

So said brilliant engineer in Amsterdam built an annealing algorithm in the release build to reorder the optimization passes in such a way as to minimize size. This shaved a whopping 11 MB off the total machine code size and bought us enough runway to keep development going.
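For readers who haven't seen annealing before, here is a minimal sketch of the general technique applied to an ordering problem. It is not the algorithm that was actually built: the pass names are placeholders, and `code_size` is a made-up scoring function standing in for "compile with this order and measure the binary", which is the expensive step a real version would run.

```python
import math
import random

# Placeholder pass names, for illustration only.
PASSES = ["inline", "const_fold", "dead_code", "cse", "loop_unroll"]


def code_size(order):
    """Hypothetical cost: pretend some orderings yield smaller output."""
    position = {name: index for index, name in enumerate(order)}
    size = 100
    if position["inline"] < position["const_fold"]:
        size -= 20
    if position["const_fold"] < position["dead_code"]:
        size -= 15
    if position["loop_unroll"] < position["cse"]:
        size -= 10
    return size


def anneal(initial, cost, steps=2000, start_temp=10.0, seed=0):
    """Simulated annealing over permutations using random pairwise swaps."""
    rng = random.Random(seed)
    current = list(initial)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Linear cooling; the small epsilon avoids dividing by zero.
        temp = start_temp * (1 - step / steps) + 1e-9
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; sometimes accept regressions while hot.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


start = ["dead_code", "const_fold", "cse", "inline", "loop_unroll"]
best_order, best_size = anneal(start, code_size)
print(code_size(start), best_size, best_order)
```

Accepting the occasional worse ordering early on is what lets the search escape local minima, which matters when passes interact in ways that are hard to predict.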

This terrified the Swift compiler engineers. They were worried that untested compiler pass orders would expose untested bugs (even though each pass is supposed to be internally safe, it’s hard to reason about the possible combinations). We didn’t have any major issues though.

We applied a bunch of other solutions too (linting for code patterns that were particularly expensive). We measured each one in the number of normal development weeks that the savings unlocked. But the real problem was the growth curve. It was always eating our winnings back.
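As a rough illustration of that "weeks unlocked" unit, the arithmetic can be sketched in Python. This assumes growth stays constant at the 1.3 MB a week measured earlier in the thread, which is a simplification; the function name `weeks_unlocked` is made up for this example.

```python
def weeks_unlocked(saved_mb, growth_mb_per_week):
    """Convert a size saving into weeks of development at a steady growth rate."""
    if growth_mb_per_week <= 0:
        raise ValueError("growth rate must be positive")
    return saved_mb / growth_mb_per_week


GROWTH_MB_PER_WEEK = 1.3  # figure quoted earlier in the thread

# The 11 MB saved by reordering compiler passes, expressed in weeks.
print(round(weeks_unlocked(11, GROWTH_MB_PER_WEEK), 1))  # 8.5

# Headroom implied by "3 weeks until the limit" at the same rate, in MB.
print(round(3 * GROWTH_MB_PER_WEEK, 1))  # 3.9
```

Framed this way, even an 11 MB win is only about two months of runway, which is why the growth curve kept eating the winnings back.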

Eventually we bought enough runway to make it to Apple upping the cellular download limit to 150 MB. They also added a number of compiler features to help with the size optimizations (-Osize). By their own admission Swift will never compile as small as Objective-C.

But as of this year they’ve gotten Swift down to 1.5x the machine code size of Objective-C, and they eventually upped the limit again to an optional 200 MB. We had enough runway to make it a few more years.

We almost failed though. If Apple hadn’t upped the limit we would have been forced to rewrite the Uber app back in ObjC. Eventually we were able to fix the other problems too. The brilliant @alanzeino and team got Swift support added to BUCK, which hugely improved build times.

A bunch of folks burned out along the way. A ton of money was spent, hard lessons were learned, but still to this day most people insist the rewrite was all worth it. New engineers who joined up loved the architectural consistency and never knew the pain it took to get there.

The larger community benefited from our learnings. @ellsk1 put together an amazing presentation and went on a speaking tour to share our knowledge. I was able to take my experience with me after I moved on and teach other teams how to make better decisions.

So my advice. Everything in Computer Science is a tradeoff. There is no universally superior language. Whatever you do, understand what the tradeoffs are and why you are making them. Don’t let it descend into a political war between opinionated factions.

Build in failure points. Figure out how to identify the tradeoffs and give yourself a way out if you get to a certain point and realize that you made a mistake. Big efforts are hard, but the cost grows the longer you stick with the wrong tradeoff.

Don’t be a Curmudgeon who doesn't contribute to the solution. Don't be a Zealot who creates bigger problems for everyone else. The best engineers I’ve ever worked with are really good at not falling into either of these traps.

Thank you all for coming on this journey with me. It was rather therapeutic. Good night!

[end thread]
