Scaling with common sense #2: Being future ready.

28 Jul 2021

Over the last year, owing to the unexpected rally in capital markets, Zerodha’s customer base has more than tripled, significantly increasing the number of concurrent users on our platforms along with the traffic and load they generate on numerous systems in the background. For context, in January 2020, we were handling 2+ million retail trades daily. In April 2020, it had risen to 7+ million. Today, it goes up to 12+ million. Our user base is now at 6+ million users, up from 2+ million last year.

We never foresaw that our client base, which had grown slowly and predictably over ten years, would triple in less than a year. Not that we ever attempted to forecast it either. However, when it did happen, we were able to adapt our software, infrastructure, and organisation with only a nominal impact on our stack and only a negligible increase in technology and infrastructure costs. This includes transitioning 1000+ of our employees to full-time remote work pretty much overnight owing to the Covid lockdowns in 2020. Meanwhile, our tech team that was 30 members strong last year now stands at 31.

The media and social media frenzy around the growth in business has piqued considerable interest, and there have been numerous queries along the lines of:

  • How did we manage our systems and costs to cope with the overnight explosion in capital markets activity?
  • How difficult or frantic were these changes for the tech team to handle?
  • How future-proof were our systems really and how did we plan them?

This post is an assortment of notes on the real-life practices that we employ at Zerodha, and many first-hand insights from the industry, which, to us, have been great lessons on how not to run technology organisations.

These notes demonstrate that the ability of a technology organisation and its systems to cope with future uncertainties and extreme volatility can be rooted in common sense first principles. Such principles can increase the odds of things falling into the right places naturally at opportune, almost serendipitous moments. A number of factors here are more human, more philosophical than technical, and so are these notes. It is not to say that these are universally applicable, but as they have worked for us, they may work for others too.

As always, these anecdotes should be read with a healthy and rational awareness of context and the trade-offs involved.

1. Future-proofing software is hard.

In fact, “proofing” is borderline oxymoronic. Software can change rapidly and become obsolete in no time. Just like with physical objects, shiny new software can make older software look dated, biasing technical decision-making and significantly altering the older software's perceived future readiness.

Paradoxically, little changes in software over long periods of time when one zooms out a little. The half-a-century-old C language is still widely used and has changed little syntactically or semantically, and new languages continue to borrow heavily from it. Same with SQL. Same with paradigms like functional and object-oriented programming; synchronous and asynchronous network communication; RPC and IPC; kernels; RDBMSs and row- and column-based data modelling; lists, dictionaries, and maps; and on and on … While computing concepts change rarely and slowly, their specific avatars and manifestations can change rapidly and go in and out of fashion. What had become boring over the years can suddenly be re-invented and be in vogue again, like rendering HTML server-side and sending the results to the browser over the wire (!), or stateless “functions” (programs) in a shared environment (remember /*.php and /cgi-bin/?).

Software trends tend to be surprisingly cyclical. Reminds me of the quote “The beginning is the end and the end is the beginning” from the sci-fi series Dark, which Karan casually tried to pass off as his own after watching just one episode. He didn’t actually, but as the saying goes, “All Threads Digress To Karan”.

This fundamental realisation, that computing concepts seldom change unlike their specific implementations, is the first step to building systems that have better odds of being future ready. It is important to understand the past and present landscapes of software to gain perspective on what really is new and what is old, important factors that influence technical decisions. We use this as a yardstick when evaluating technologies, be it a Javascript framework or an entirely new technology. For instance, our decision to port our mobile applications to Flutter in 2018 (after writing a quick prototype in it over two weeks), when Flutter was still in alpha, was a calculated risk that has paid off really well.

2. Pragmatism over perfection.

Knowing when and where to draw the line when building something is hard. Developers (and I certainly do this) have the tendency to tweak systems endlessly to get them into a “perfect” state to ward off a sense of incompleteness; “It’s not ready yet. Just a bit more”. This feeling never really goes away, but with practice and make-and-break production experience, it becomes possible to intuit the point where something is good enough to be shipped by all practical means. Also, in rapidly changing business environments, chasing a “perfect” state may be a fool’s errand. In a heavily regulated industry like the capital markets, regulations can come out of the blue and change how businesses operate overnight, forcing software to be refactored and rewritten with little time to plan.

For example, in mid 2020, the Indian capital markets regulator gave a deadline of about four weeks to completely change how equities were sold on trading platforms up until that point. What used to be a realtime transaction that was sent to the exchanges when users clicked on a sell button now required a PIN-based authorisation on an external gateway operated by market institutions known as depositories. Needless to say, this forced a series of significant cascading changes through critical realtime financial systems with little time to test. The UI and UX of selling securities had to change on the trading platform across the web, both mobile platforms, and multiple external platforms built on top of our systems. Not only that, millions of users needed to be onboarded onto this external gateway to set up their PINs and be educated about this entirely new flow on short notice.

Shortly thereafter, the regulation changed yet again to introduce a new SMS OTP to this gateway in addition to the PIN. And right after that, another regulation changed the validity of these authorisations forcing even more changes across systems. While these were in the interest of retail investors, to developers, they were extremely risky changes on unbelievably tight deadlines. Thus, in many environments, being pragmatic is not a choice but a necessity.

Such pragmatism has to continuously strike the right balance between core features and capabilities; code quality and extensibility; the stability to run and serve its function; market conditions; laws and regulations; and also, a well-defined API/UI spec that is unlikely to change rapidly. API/UI here is used in an abstract sense. It can mean a command line interface, a file or wire format, or network calls, RPCs, or an ABI: any layer that lays down certain promises when it comes to interactions with other systems. Paying attention to interface design has yielded us wildly unexpected benefits.

3. Stable interfaces: More art than science.

When a system inevitably has to be refactored or rewritten because business requirements change in unexpected ways, often the only silver lining is a stable API spec that maintains compatibility with other systems, allowing something to be swapped out without jeopardising its dependents. A stable API establishes a separation of concerns between different parts of the same program, between different systems, and even between whole departments in an organisation: good old modularity, which is commonly, and incorrectly, conflated with microservices. It can often mean that one department in an organisation takes one planned hit rather than several at the same time, something that we have done countless times, and something that would not have been possible if multiple departments and the overlapping systems they use were not modular enough.

It is a common misconception that internal APIs and architectures can be easily changed and replaced, which often results in the design of internal APIs and integrations not getting the same care and attention that public facing APIs usually get. The probability of poorly thought out integrations becoming technical debt as internal systems grow complex is very high. It is thus important to try and get the API spec as right as possible early on, even for internal systems.
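To make the idea concrete, here is a minimal, hypothetical Python sketch of what a stable internal interface can look like. The order-gateway example and every name in it are made up; the point is only that dependents are written against the contract, so a concrete implementation can be swapped out without touching them.

```python
# Illustrative only: a hypothetical "order gateway" contract, not an actual Zerodha API.
from typing import Protocol


class OrderGateway(Protocol):
    """The stable contract that other systems depend on."""

    def place_order(self, symbol: str, quantity: int, side: str) -> str:
        """Place an order and return an order ID."""
        ...


class VendorXGateway:
    """One concrete implementation; can be replaced without breaking dependents."""

    def place_order(self, symbol: str, quantity: int, side: str) -> str:
        # Talk to the vendor's API here; dependents never see these details.
        return "ORDER123"


def sell_all(gateway: OrderGateway, symbol: str, quantity: int) -> str:
    # Written against the interface, not the implementation.
    return gateway.place_order(symbol, quantity, "SELL")
```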

4. Unexpected wins from common sense design: Kite Connect.

Common sense software design, architecture, and separation of services can yield unexpected benefits far beyond performance wins. My favorite example at Zerodha is Kite Connect, our HTTP/JSON API platform that allows building full-fledged trading and investment platforms over our infrastructure.

4.1. The idea of Kite.

All through 2013 and 2014 in the early days of our tech team, we were busy digitising things inside the organisation, mostly writing Python scripts to automate mundane manual processes. Building a trading platform was not even a thought. Sometime in 2014, our OMS (Order Management System) vendor came out with a white label, web-based trading platform that we could offer to our users. Web-based trading platforms were a tiny niche, and Windows-only desktop trading platforms dominated the industry. When I saw it first hand, I was shocked by what passed for a web application. If I remember correctly, its standout feature was that IE6 was no longer a requirement … in 2014. We abruptly decided that we needed to build a usable web-based trading platform, a pivotal, unplanned move that would eventually transform Zerodha into a full-fledged technology firm.

4.2. Kite needs clean APIs.

There were only about four or five of us in the tech team at that point, and none of us had any prior experience building a trading platform let alone any familiarity with capital markets or finance. Thankfully, the OMS vendor had a set of SOAP-like HTTP/XML APIs on top of the OMS engine. This is what would power the buy and sell functions of the trading platform. I would spend the next several weeks trying to understand the highly inconsistent route and naming conventions in the poorly documented API, wrapping the calls one by one into HTTP/JSON calls with consistent field names in a Python web server. It made sense to write this middleware to hide vendor weirdness behind a consistent API that we would be comfortable working with. Of course, it would then take months of back and forth with the vendor to actually get the upstream APIs to be in a usable state, followed by years of follow ups for incremental improvements.
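As a rough illustration of that middleware (the vendor route, tag names, and fields below are invented; the real ones were far messier), the idea was simply to translate each inconsistent vendor XML call into a clean, consistently named JSON endpoint:

```python
# A minimal, hypothetical sketch: hide an inconsistent vendor XML API behind
# a consistent JSON API. Routes and field names here are made up.
import xml.etree.ElementTree as ET

import requests
from flask import Flask, jsonify

app = Flask(__name__)
VENDOR_URL = "http://vendor-oms.internal/GetOrdBk"  # hypothetical vendor endpoint


@app.route("/orders")
def orders():
    # The vendor speaks XML with cryptic, inconsistent tag names.
    resp = requests.get(VENDOR_URL, timeout=5)
    root = ET.fromstring(resp.text)

    # Normalise each record into predictable, consistently named JSON fields.
    out = []
    for node in root.iter("OrdDtl"):
        out.append({
            "order_id": node.findtext("NOrdNo"),
            "symbol": node.findtext("Trsym"),
            "quantity": int(node.findtext("Qty") or 0),
            "status": (node.findtext("Status") or "").lower(),
        })
    return jsonify(out)


if __name__ == "__main__":
    app.run()
```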

This new API spec was easy to use and understand, and was documented well enough for Vivek to start integrating it into the Angular web frontend he had started building based on my UI “design”. We were winging it. The name “Kite” implied something simple and lightweight to operate, something that could reach metaphorical heights.

4.3. Serendipity.

Around the same time, we had started realising that good quality end user software had to exist in the industry for the ecosystem to grow, whether or not we built it. While we were seeing numerous innovative end user applications launch by the day in India, the capital markets saw none of it thanks to thick red tape and the extremely high entry barrier owing to the legacy of a century-old industry and status quo in general. For example, up until 2017, a user had to print, sign, and courier a ~40 page form to open an account with a brokerage, as online account opening was not permitted by regulations. Who would want to experiment in an industry like that! It was (and still is) extremely complex to start a brokerage firm in India, and the only way to invite innovation would have been to help technologists cut through the red tape and build on top of existing brokerage firms.

Coincidentally, it occurred to us that the APIs for Kite were generic and clear enough to build any trading platform on top of them, not just Kite. As developers, we were excited by what was essentially a clean set of well-documented APIs, something that was alien to the industry. In the hope that eventually there would be people who would come and build on top of it, we named it Kite Connect, designed an OAuth-like authentication flow, wrote better documentation, and built a self-service developer portal around this, while Kite itself was still being built. It is important to note that something like this could only happen in organisations where technical decision-making powers are in the hands of technical people.
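For flavour, an OAuth-like flow of the kind described here typically looks something like the sketch below. This is a generic illustration with hypothetical endpoints, parameters, and checksum scheme, not Kite Connect's actual spec: the user logs in on the platform's own page, the third-party app receives a short-lived request token, and exchanges it for an access token by proving it holds the API secret.

```python
# Generic, illustrative OAuth-like token exchange; all names are hypothetical.
import hashlib

import requests

API_KEY = "your_api_key"
API_SECRET = "your_api_secret"
BASE = "https://api.example.com"  # hypothetical API base URL

# Step 1: the user logs in on the platform's own page; the app receives a
# short-lived request_token via a redirect back to its own URL.
request_token = "token_received_from_redirect"

# Step 2: the app exchanges the request_token for an access_token, proving it
# holds the secret by sending a checksum rather than the secret itself.
checksum = hashlib.sha256((API_KEY + request_token + API_SECRET).encode()).hexdigest()
resp = requests.post(f"{BASE}/session/token", data={
    "api_key": API_KEY,
    "request_token": request_token,
    "checksum": checksum,
})
access_token = resp.json()["access_token"]

# Step 3: subsequent API calls carry the access_token.
orders = requests.get(f"{BASE}/orders", headers={
    "Authorization": f"token {API_KEY}:{access_token}",
}).json()
```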

This wild, spontaneous side project very quickly bore fruit when smallcase approached us with a proposal to build a thematic investment platform. We gave them Kite Connect API keys, help with regulatory approvals, and seed funding from the modest profits we were making. Our users who had painstakingly opened accounts with us to invest and trade could now log in to smallcase with a single click. We would handle the legalities, compliance, financial processes, and of course the underlying technology and infrastructure, and smallcase could focus on building their platform.

4.4. Rainmatter.

This kickstarted Rainmatter, our fintech investment fund, and the Kite Connect APIs became a “Brokerage as a Platform” (BaaP) offering to companies that wanted to build bespoke financial platforms, at a time when user onboarding still required signing and couriering 40+ pages. Given this state of the industry, investing time and effort in, let alone envisioning, an API platform complete with a developer portal may not have made sense. But Kite Connect was not planned; it was a fortuitous side effect of good API design.

Today, there are numerous applications and several companies built on top of Kite Connect. We dog-food it heavily, integrating all our end user investment and trading apps and several internal systems seamlessly via its APIs. It is the foundation on top of which significant parts of our business and product strategy are built. Amusingly, the Kite Connect platform on its own is a profitable vertical. All this has enabled us to start the Rainmatter Foundation, where Zerodha has committed a significant portion of its profits to fund environmental conservation efforts.

Many big unexpected wins.

5. Clever software, dumb code.

I cry-smile (🥲 our new favorite emoji in the tech team that accurately captures a wide range of developer emotions) when I struggle to read and make sense of code I wrote just a few weeks ago: “Why did I do this? What was I thinking?”. Imagine the cognitive burden of having to read, parse, and mentally compile somebody else’s code when getting into complex systems. It is a misconception that clever software needs clever code. Whether or not there are single-line nested list comprehensions and lambdas in Python code has no bearing on the cleverness and utility of the final software it produces. However, such cryptograms have a strong bearing on how readable and maintainable a codebase is, affecting its future development.

For software to be future ready, its source code should be easy to read and parse in the future by its own authors and future maintainers, whoever they may be. Paradoxically, the original authors themselves often end up in the shoes of future developers as their mental models of the software they wrote fade with time. It took a long time for this to sink in, but when it did, it changed the way I wrote code and killed my fascination for magical multi-layered abstractions and cleverness. The value of simple, almost dumb, explicit code that is easy for a developer to pick up and work on cannot be overstated. In the same vein, writing comments that explain the intention of the developer, why something is being done in a program, is significantly more valuable than comments that explain the what, which explicit, self-explanatory code conveys anyway.
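A contrived Python illustration of the difference (the data and names are hypothetical), including a comment that records the why rather than the what:

```python
# "Clever": a one-liner that works but takes real effort to read.
# totals = {u: sum(t["qty"] * t["price"] for t in trades if t["user"] == u)
#           for u in {t["user"] for t in trades}}


# Plain: the same logic, spelled out.
def trade_totals(trades):
    # WHY: downstream reports need per-user turnover, and a user can appear in
    # any number of trades, so we accumulate instead of assuming one entry each.
    totals = {}
    for trade in trades:
        value = trade["qty"] * trade["price"]
        totals[trade["user"]] = totals.get(trade["user"], 0) + value
    return totals
```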

In our code reviews at Zerodha, we flag “clever code” that, while functional, is not easy for others to read and understand. These code review sessions where anyone in the team can participate irrespective of their primary projects help us keep ourselves in check when we start straying from these best practices from time to time.

6. Software estimation is hard.

We may well arrive at a grand unified theory of the universe, discover the meaning of life, and make contact with intelligent extraterrestrial beings before there is a framework for accurately estimating software development complexity and timelines. I am only half joking. It is mind blowing how comically wrong software timelines generally are; forget large government projects, even the addition of a single new number on a report or a checkbox that changes a certain UI behaviour. Things that we estimate to take days can take weeks or even months.

As I write this paragraph, I am into the fourth day of debugging, testing, and rolling back what seemed like a trivial change that I estimated to take a few hours. On its own, the change is trivial, but we could not foresee the number of subtle edge cases it would introduce because of the nuances of the multiple systems involved and the timings of certain business processes.

Software complexity and timeline estimation is often hindered less by the code itself than by the countless subtleties in dependencies and in business and user behaviour. In reality, writing code often takes only a fraction of the time compared to figuring out what to write, and then dealing with all the unexpected side effects of what has been written. It is important to be conscious of this fact and actively factor it into project plans, leaving enough room for unexpected delays that are, ironically, expected. Unfortunately, this is seldom understood by non-technical decision makers who are often in charge of software development timelines.

7. Slow down to speed up.

Technical debt is inevitable. The older the debt, the harder it is to get rid of. In organisations that develop and iterate software rapidly to meet feature goals and deadlines, there is little time to pause, refactor, and do things the right way, paving the way for technical debt. That one missed feature in a quarter (yes, as absurd as it sounds, we know of organisations that have X “features” per quarter as a goal) could mean losing competitive advantage, right? This is a myth. Organisations often overestimate the importance of the features they continuously ship (and underestimate the importance of the features they don’t ship). In fact, not constantly adding “features” and changing software plays an important role in its prospects of being future ready. That improving the quality and performance of existing software is in itself an important implicit “feature” is a fact lost on many. We regularly delay shipping new features to slow down and clean up technical debt. Unfortunately, in many organisations, such technical decisions are often not in the hands of developers (cries in Section 8).

7.1. Don’t fix what is not broken, but fix what might soon break.

The first version of Kite, our trading platform, was written in Python. When it was launched in 2015, there were barely a thousand concurrent users on it. Today, there are well over a million at any given moment. Version 2, a significant refactor, was launched in 2016, and version 3, a complete rewrite, was launched in 2017. Between these versions, we exhausted the possible avenues of performance optimisation in Python and rewrote the whole backend in Go. The Angular frontend, which very quickly became a pain to maintain, was scrapped and rewritten in Vue. The Kite mobile applications, originally written in native code (which became a multi-codebase maintenance headache) and then in React Native (which became a performance headache), were scrapped and rewritten a third time in Flutter. These decisions have worked out amazingly well for us.

Another instance is the ticker, the component that streams tens of millions of realtime market data packets every second to end users. It has been refactored and rewritten at least five times in the last six years as we got better at writing Go and as we started sensing impending performance bottlenecks. In addition, its internal streaming technology has changed from a custom TCP protocol to ZeroMQ to Nanomsg/Mango and finally to NATS, which is rock solid and highly scalable. The refactored Go Kite web application backend that went live in 2017 was replaced last month with a version rewritten from scratch to be consistent with our current Go idioms and patterns. This rewrite, which we did as a side project over many months, has ~50% fewer lines of Go and SQL code, shows visible performance improvements, and is far simpler to read and understand, which also plays an important role in security. For Kite, we maintain a 40ms roundtrip latency on the client side over a good internet connection as the baseline, and any time this starts fluctuating, we look at avenues to refactor. Such baselines can be indicators of looming bottlenecks.

Again, we recently scrapped the Python version of Console, our reporting and analytics platform that crunches through hundreds of billions of rows of financial data, and rewrote the whole stack in Go. Console only gets a small portion of the user activity Kite gets, but this rewrite pushes it lightyears ahead of its past avatar. It now computes and presents reports in milliseconds, and has a much cleaner and leaner codebase that gets rid of the past cruft and reflects the better engineering practices and maturity as developers that we have gained over the years. For Console, we have now started experimenting with ClickHouse for moving certain kinds of fast growing immutable financial data that runs into the TBs, for significantly more performance gains and a massive reduction in storage space. While we can keep provisioning more storage, the current rate of growth of data indicates that in a couple of years, there would be cost implications that, while immaterial to the organisation, we would not appreciate.

Our user onboarding portal has been scrapped and rebuilt multiple times over its lifetime, each time gaining usability improvements and also incorporating drastic regulatory / KYC changes.

When we detect code smell or impending performance problems in our software, we sit down and assess its criticality, dependencies, risk, future readiness, and more importantly, our own technical competence and confidence to decide how, if, and when to refactor or rewrite. This requires regular self-reflection and reality checks to avoid slipping into territories of delusion. These decisions are taken entirely at the discretion of the tech team and do not involve non-technical teams. The refactors are generally done by two to four developers who are the principal developers of the respective projects. Of the dozens of big, carefully weighed refactors and rewrites, every single one has been more than worth it for us, not just in terms of technical, performance, and operational benefits, but also cost.

7.2. Freeze new stuff to improve old stuff.

For existing software to be rewritten or refactored, many new developments may have to be paused, which businesses are generally paranoid about, because: “new” features. At Zerodha, we are okay with delaying a new feature by a few weeks or even months to clean up technical debt and make room to meaningfully assimilate it. To be doubly sure, sometimes we test new features and make partial blue-green deployments of changes over many months, even for features that may give us a perceived competitive advantage. Changes dictated by compliance and regulations are an exception to this though.

Funnily, such planned short term delays for maintaining the sanctity of software have enabled us to iterate and ship new features at a much faster rate than our industry counterparts. And in most cases, with every refactor, the need for another refactor, and the potential for technical debt, reduces considerably. Paradoxically, with calculated slowdowns, we have gained speed in unexpected ways.

7.3. Technical debt is a reality of life.

All of this sounds great in hindsight. Things could’ve gone wrong spectacularly too. However, irrespective of the outcome, which is a matter of odds, there is little doubt that technical debt needs to be addressed proactively because:

  • That cleaner software has higher odds of being future ready than poorly built software is a fact.
  • Technical debt is inevitable in all complex systems.
  • Technical debt can seriously hamper the future readiness of systems and entire organisations.
  • With passing time, technical debt gets worse and harder to service.

No business goals, vision, strategy, or competitive advantage changes the fact that technical debt is inevitable and that it needs to be handled.

7.4. Fear.

Then why is it that many organisations rarely pause to clear technical debt? Why are developers afraid of suspending new development to periodically clean up their existing systems? Is refactoring and improving existing software not “new” development? From interactions with organisations of all shapes and sizes, I have come to the realisation that a key reason is fear, and not that of technical risk.

Often, technical leaders are afraid of change as they are obliged to justify even the minutest technical changes to the non-technical management that holds the strings. “We are going to stop all new features on the trading platform for two months to rewrite a component for a 10ms latency improvement because the traffic pattern indicates that the request throughput might deteriorate in the near future” is not something that will go down well with non-technical management in most organisations. Extrapolate this to whole industries and it creates a vicious cycle of technical leaders whose confidence to take calculated decisions to scrap and rewrite poor software keeps eroding. Software development becomes muddled with business development under the supervision of non-technical leaders making technical decisions. A great shame.

8. The tyranny of non-technical “tech leaders”.

“Tech leaders” with no current hands-on experience or expertise in technology who hold the reins of tech teams, actively calling technical shots, have to be the biggest impediment to writing good software and building good tech teams. Far worse than technical incompetence, which, at least in the right environment, can be addressed. Note that this is specifically about non-technical leaders who make technical decisions.

One could argue, a little dramatically, that they are enemies of good software. This includes people who stopped being hands-on with technology decades ago, still making nuanced decisions on technical minutiae backed by nothing but ego and delusion of expertise—a total lack of self-awareness and a strong case of intellectual dishonesty. It is mind boggling when one sees senior “executives” with no technical knowledge not only throw around meaningless technobabble in enterprise webinars, but actually impose even more nonsensical technical decisions in their organisations, producing terrible software monstrosities that are technical debt on arrival, cost a fortune to maintain, scale poorly, and drag back entire organisations and industries. An ironic anecdotal observation is that this seems to be prevalent in organisations that are most vocal about their “digital transformation journey” and “tech-first approach”.

Organisational hierarchies and power structures in countless organisations are such that software developers who actually write and maintain software rarely have a voice. They often have no choice but to implement random decrees that filter down through many levels of management from the utterly out-of-touch, unempathetic leadership at the very top. And one wonders why so much enterprise and government software globally is a black hole that sucks in money, effort, and developer souls.

If an organisation wants to build future ready software, the leadership should be self-aware enough to trust, and delegate technical decision-making to hands-on technical people, build empathy across teams, and remove disingenuous “thought leader” executives from positions of technical decision-making. One does disappointingly acknowledge that such charlatanism is as old as civilisation itself and not unique to software development.

9. Abstraction, collaboration, and trust.

How can developers who have never traded build financial trading platforms that are good enough and future ready?

In software development, some work is sometimes interesting but most work is ultimately boring. If one looks underneath a trading platform, it is forms, CRUD, RPC and API design, data serialization, naming confusion across packages, I/O bottlenecks, user management and sessions, and on and on. Sounds familiar? If one abstracts software enough, be it a trading platform or an e-commerce system, things start looking eerily similar. This is not in the same simplistic vein as taking a table and a chair apart where they just look like pieces of wood and nails. CRUD is the crux, not just a small component, of pretty much all user-facing software that stores and manipulates data. Similarly, API design is critical, not just incidental, to software. So, with the right abstractions, it is possible to separate significant parts of implementation detail from the domain. For the parts that require domain expertise, developers can collaborate with domain experts and gradually build domain knowledge themselves.

When we started building Kite, none of us in the tech team had any expertise in trading, let alone anything remotely financial. However, we knew how to write software with good enough interfaces and a focus on usability for end users. With Kite, we completely let go of the overly complicated trading interfaces that were the status quo in the industry and built common sense UIs that were simple to use. The domain experts in the organisation trusted us enough to do this. In the beginning, it was just Nithin and I, and today, it has grown to a number of non-technical teams with domain expertise with whom the tech team collaborates meaningfully, and vice versa, without interfering with each other’s work. Sometimes, when this clear line of communication blurs, and it does indeed as with all human group endeavours, we are mutually respectful enough to step back and clean the slate. I am now convinced that this philosophy has to be rigorously applied from the very beginning at a grassroots level so that teams grow up imbued with the right culture and expectations, and that it can only be instilled from the top down with empathetic and self-aware leadership.

Over time, many non-technical teams have learnt to communicate abstract requirements and problem statements which the tech team attempts to implement optimally. Some still have not though (you know who you are!). This fundamental principle, that non-technical teams should generally communicate problem statements instead of specific implementation details, is critical to the future readiness of evolving systems. So is the ability of tech teams to say “No” to requirements that do not make sense. For example: “Add a popup with X content to the trading platform that shows up whenever a user logs in” vs. “We need to inform users of X. Figure out the best way to do it”. The latter is an abstract problem statement that communicates the crux of the problem to the technical folks, allowing them to figure out the best way to solve it, taking into consideration technical and product nuances. The former is a contextless, non-technical decree that, when repeated enough times, ruins software, and of course, affects the morale of developers. A developer should know why something is being done to the software they write and maintain. That is when they can truly “own” it, maintain a current mental model without outright hating the codebase, and assimilate changes meaningfully rather than shoehorning them in.

10. Sometimes, connect by disconnecting.

As the number of standalone but interconnected systems within Zerodha grew, the effort to keep them in sync also grew significantly. For instance, several different kinds of customer databases and their subsets across various departments had to be kept in sync with certain sources of truth. Any change to data in certain systems had to be pushed to many other disjoint systems via API calls. This meant: a) every system had to have APIs for receiving updates; b) network and security implications of connecting disjoint systems; c) failure, retry, queuing, and reconciliation logic for the APIs; and d) the burden of communication between developers working on different systems and technologies who had to design and agree on these APIs. Since these setups had grown organically over many years, starting with just one system and eventually sprawling into dozens, it was a maze. Such scenarios are prevalent in all organisations.

This maze of API connections between systems, however, turned out to have a silver lining. Most systems just required data to be synced to them to maintain consistent states, a typical trait of large organisations with many departments. Good old PubSub to the rescue. We identified source-of-truth systems that publish state, spec’d out and standardised the data structures and documented them, removed unnecessary API layers from all systems, physically disconnecting them from each other, and instead set up a central message bus and queue (Kafka) to which the source-of-truth systems published data on standardised topics as records were created or changed. Whoever needed data simply consumed from one or more of the topics and did whatever they needed with it. This was implemented trivially as small sidecar utilities sitting alongside more complex systems, connecting disparate systems like the CRM, the trading platform’s user database, and the mailing list without actually connecting them. Thanks to this, when a user deactivates their brokerage account, they instantly get unsubscribed from the mailing list manager without having to wait for “up to 7 days”, an industry first! This setup, which I visualise like a potato tuber network, including the Kafka installation, has incurred practically zero maintenance overhead over the years.
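A minimal sketch of what one such sidecar might look like, here in Python with kafka-python; the topic name, event fields, and the mailing list call are hypothetical, but the shape is the point: the consumer knows only the published topic, never the system that produced the data.

```python
# Hypothetical sidecar: consume account-state events and keep the mailing list in sync.
import json

from kafka import KafkaConsumer


def unsubscribe_from_mailing_list(email: str) -> None:
    # Placeholder: in reality this would call the mailing list manager's own API.
    print(f"unsubscribing {email}")


consumer = KafkaConsumer(
    "user.account.updates",                 # standardised, documented topic
    bootstrap_servers="kafka.internal:9092",
    group_id="mailinglist-sidecar",         # each downstream system has its own group
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    event = msg.value
    # The source-of-truth system publishes state changes; this sidecar only
    # consumes them and updates its own system. The two never talk directly.
    if event.get("status") == "deactivated":
        unsubscribe_from_mailing_list(event["email"])
```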

Systems thus stopped being aware of each other. This has helped us evolve, change, and even remove systems completely across departments with minimal disruptions to the organisation and continues to play a critical role in moving the organisation forward.

11. Self-hosting FOSS: Obvious secret sauce.

Changes to complex intra-organisational systems require a deep understanding, ownership, and control of software. A seemingly trivial feature can necessitate cascading changes through multiple systems. I mentioned earlier that we have removed and swapped out entire systems with minimal disruptions to the organisation. That has only been possible because we self-host and manage as many systems as possible giving us complete control, insight, and liberty to change them as we please.

We self-host and self-manage everything from databases like Postgres, MySQL, ClickHouse, and Redis, Kafka clusters, ELK instances, monitoring and tracking systems including Sentry and Grafana, GitLab for hosting code and managing deployments, numerous back office systems, the employee intranet, accounting systems, support and sales CRMs built on top of ERPNext, Metabase for data analysis, mailing lists, zero-trust + VPN network instances, a Discourse forum for employees, and on and on. Contrary to popular misconceptions, our tiny tech team has been able to install, run, and efficiently manage these systems all while writing software. In reality though, many of these battle tested systems are rock solid and require little oversight or “maintenance”. Do things break once in a while? Of course they do, but so do managed and proprietary systems.

Here is an anecdote of an unexpected and critical win that common sense self-hosting of FOSS systems enabled. We have been able to put all our self-hosted back office systems and dashboards that employees from various departments access behind organisational SSO+2FA and self-hosted VPN instances. This was instituted in 2017 as a common sense security measure and for the ease of access management. When the SEBI cybersecurity regulations for the industry came out in 2018, this setup automatically ensured that we were fully compliant with many critical clauses on day one. In March 2020, when the Covid lockdown was announced, thanks to the same architecture, we were able to transition to a fully remote company, something we had never envisioned, practically overnight with minimal effort and little added cost, enabling 1000+ of our employees to connect remotely and work.

12. Cost: Frugal hacking pays big.

Over the last eight years, we have established that self-hosting FOSS systems is an extremely cost efficient and common sense way of building and scaling a technology organisation even with a tiny tech team—our stack on our terms under our control.

By building most of our systems ourselves, self-hosting and self-managing, we have been able to incorporate drastic regulatory changes rapidly, scale systems with the growing user base, innovate faster than our industry counterparts, and keep our “IT costs” laughably negligible for the size of our operations. The cost savings that we gain from building our own technology (build vs. buy: we build much and buy very little) and self-hosting FOSS rather than relying on managed services or proprietary vendors contribute significantly to Zerodha’s profitability. Not only that, it is one of the key reasons why we were able to remain completely bootstrapped and never raise a single Rupee in investor funding.

Two anecdotes on costs out of countless:

  • Self-hosting a support ticketing system (a combination of OSTicket and brokerage specific modules built on top of ERPNext) for 1000+ users in the organisation costs us less than $10,000 per year in EC2 instances, storage, backups etc. The initial developer time has also been minimal. Maintenance effort is practically nil except for archiving old tickets once a year to a secondary DB instance when they run into the millions, which we have scripted. If we were to use what is the most popular SaaS ticketing system out there, the bill would have been upwards of $1 million per year.

  • Similarly, the heavily customised sales and support CRM on top of Frappe (the Python business framework that ERPNext uses) that we tailor-built for our workflows costs us a few thousand dollars to self-host every year, as opposed to any programmable SaaS sales CRM out there that would have cost us many millions of dollars yearly given the size of the organisation. The silver lining is that Frappe provides hundreds of out-of-the-box components, everything from a seamless UI to accounting to lead management to users to permissions to workflows, so that one does not have to re-invent yet another CRUD CRM or back office system. 97.42%* of development effort is thus eliminated.

Disclosure: For the obvious aforementioned reasons, in 2020, Zerodha made a financial investment in Frappe Technologies, developers of ERPNext, one of the few FOSS companies in India.

Thus, the cost and ownership of technology are significant factors that affect the future readiness of not only systems, but an organisation’s very ability to exist into the future. It is worrying that a growing number of technical decision makers in savvy tech companies are apprehensive of, even afraid of, self-hosting FOSS within their organisations. Afraid of the “maintenance headache” strawman, afraid of the status quo, afraid of missing out on “enterprise grade” software powered by marketing blitz, and of course, afraid of non-technical management who assert technical decisions. Like everything else, self-hosting vs. not self-hosting should be a carefully weighed objective trade-off and not an emotional one.

On that note, if I had a Rupee for every time I spoke to a fledgling startup struggling to pay growing monthly SaaS bills, struggling to institute changes after being locked into proprietary SaaS, that would be enough Rupees to have a nice dosa+vada+coffee meal in South Bengaluru. On the other extreme are large, legacy, cash-rich organisations that are so dependent on external vendors for everything that their attempt to be “tech-first” is a steep uphill battle, like Karan’s inability to drive a car uphill, where his definition of a hill is any incline that is greater than 5 degrees.

13. The spider’s web of IT vendors.

We once spoke to a big “digital” bank in India that had several hundred developers on their office floor, where none of the developers ever wrote any code. Their job was to liaise with the bank’s IT vendors, raising, tracking, and executing endless requirement tickets. Here, the commissioning of a single CRUD API that recorded two fields in a database would take months of endless conference calls and committees. Then there was an industry counterpart who had several dozen IT vendors building and servicing systems for them. I wish these were fictional organisations, but I shudder at the realisation that there are countless such “tech first” organisations with abysmal in-house technical capacity, all stuck in a spider’s web of IT vendors, unable to move, change, innovate, or even update their static website or an e-mail template fast enough, let alone be future ready. Many such organisations do manage to crawl into the future, albeit by burning huge amounts of cash and wasting time and effort servicing mammoth technical debt, while ironically posturing increased “IT spend” as a measure of technological advancement. All it takes is one big SaaS or IT vendor lock-in to skyrocket expenses perpetually.

These organisations could always start using more FOSS and give technical decision-making powers to hands-on technical people allowing them to build in-house tech teams and capacity gradually. But, we already know why it rarely happens (Section 8). If IT spend was the right metric to gauge the quality of technology, banking applications and government portals would have been shining examples of good software, but they are not, globally.

14. Future is always uncertain and systems always go down.

The unfortunate truth is that, be it trillion dollar technology corporations with the best engineering forces and unlimited resources, stock exchanges, social networks, or airline systems, complex systems can go down in the most unexpected ways at the most unexpected moments. There is no such thing as 100% uptime, and one would be deluded to think that it is practically achievable. It is the nature of complex systems that constantly face change. If nothing else, the laws of physics at play eventually get them.

In the capital markets, there are so many undefined, previously unseen events, like complex corporate actions (stock mergers, splits, reverse-merges etc.), that regularly break systems, to the point that we schedule critical end-of-the-day batch processes to leave enough leeway for us to scramble at ungodly hours and address new edge cases. When such glitches inevitably creep in, the stakes and stress are high for us given the fact that we handle significantly more user concurrency, orders, and data than the next biggest broker in India.

Speaking of undefined events, remember the global fiasco of 2020 when crude oil started trading at negative prices? Not only were many exchange and trading systems globally not equipped to handle negative numbers, few seemed to even know that commodities could trade at negative prices. A shock to the system that wiped out a lot of organisations globally.

One cannot help but be pragmatic about downtime in systems. The better a system’s engineering, and the better the environment the people who maintain it work in, the better its odds of minimising breakage and recovering from catastrophes.

15. So … ?

These meandering, tangential anecdotes attempt to illustrate how we develop software at Zerodha, and how common sense principles have enabled us from the very beginning to transition into a future that we never envisioned. These principles have enabled us to build complex software on top of precarious, legacy institutional infrastructure, and changed the course of the organisation for the better, despite wave after wave of highly volatile and unexpected shifts.

Our work at Zerodha has afforded me first hand experiences and visibility into a number of organisations that reaffirm my simple conviction that in software development, common sense and pragmatism trump any one-size-fits-all methodologies or frameworks. Thinking about it, it applies to many things in life.

TL;DR

  • The time and effort spent on attempting to write better software can have far-reaching positive consequences beyond just software.
  • Not depending on external IT vendors for everything and building technology in-house, owning, self-hosting, and self-managing grows technical capacity and improves the odds of being future ready.
  • The importance of self-hosting FOSS for being future ready is unparalleled (and underappreciated).
  • One of the biggest impediments to technological progress in organisations is the tyranny of disingenuous non-technical “tech leaders” (a leitmotif in this post and an obvious personal gripe) forcing technical decisions that are beyond their depth.
  • Everything is a trade-off, a matter of odds, and being future ready with software is more about avoiding delusions and sailing into uncharted waters equipped with common sense, than it is about software.