Databases Were Built for Humans – AI Agents Change the Equation

For more than a decade, the industry has been preparing for a data explosion.

Zettabytes. Exponential curves. Hockey sticks on slides. Whether it was IDC’s DataSphere forecasts or countless vendor keynotes, the message was consistent: the amount of data created and stored worldwide was about to grow very, very fast.

And to be fair, that part largely went to plan.

Enterprises adapted. Storage scaled out. Cloud elasticity became normal. Analytical workloads were pushed away from systems of record. The industry did the work required to survive — and even thrive — in a world of exploding data volumes.

What almost nobody questioned, however, was a much quieter assumption baked into all of that planning.

The Assumption Nobody Revisited

All of those forecasts — explicit or implicit — assumed that the users of enterprise systems would remain human.

Humans are slow. Humans are bursty. Humans sleep.

Even power users have natural limits, predictable working patterns and an instinct for self-preservation when systems start pushing back. Entire generations of database design, connection management and capacity planning quietly depend on those characteristics.

It wasn’t a bad assumption. It was a reasonable one. Until it wasn’t.

A Step-Change, Not a Trend

What has changed is not just how much data exists, but who — or what — is accessing it.

AI agents introduce a new class of user into enterprise computing: non-human, machine-speed actors operating directly against application logic and data sources. This isn’t a continuation of an existing trend. It’s a step-change.

You’re not adding more users along the same curve. You’re changing the curve itself.

The data explosion was predicted. The user explosion — at least in this form — was not.

Why AI Agents Break Old Rules

AI agents don’t just behave like very enthusiastic humans.

They are fundamentally different:

Speed: they operate at machine speed, turning milliseconds into meaningful units of work
Relentlessness: they don’t pause, sleep or slow down unless explicitly forced to
Unpredictability: agentic workflows fan out, retry, amplify and cascade in ways humans never could

These aren’t “power users”. They’re closer to autonomous load generators.
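To put a rough shape on that amplification, here is a minimal sketch of how one agent task can multiply into database calls. The function name and every number in it (fan-out width, depth, attempts per step) are illustrative assumptions rather than measurements from any real system.

```python
def worst_case_queries(fan_out: int, depth: int, max_attempts: int) -> int:
    """Upper bound on queries issued by a single task.

    Hypothetical model: every step spawns `fan_out` sub-steps, down to
    `depth` levels, and each step may issue its query up to
    `max_attempts` times (the first try plus any retries).
    """
    # Steps per level: 1, fan_out, fan_out**2, ..., fan_out**depth
    steps = sum(fan_out ** level for level in range(depth + 1))
    return steps * max_attempts


# A human clicking a button: one step, one attempt.
human = worst_case_queries(fan_out=1, depth=0, max_attempts=1)

# An agent splitting work 5 ways, 3 levels deep, up to 3 attempts per step.
agent = worst_case_queries(fan_out=5, depth=3, max_attempts=3)

print(f"human: {human} query, agent: up to {agent} queries")
```

Even with these modest made-up parameters, the bound grows geometrically with depth, which is why agent load does not sit on the same curve as human load.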

When Agents Hit Systems of Record

Critically, AI agents don’t want last night’s report.

They want now.

That pulls them towards operational systems of record — the RDBMS platforms that were carefully protected for the last twenty years from exactly this kind of access pattern. Read replicas help, until they don’t. Caches help, until coherence matters. Copy lag becomes a business problem, not a technical detail.

The long-standing truce between OLTP and everything else is under strain.

Capacity Planning Enters the Chaos Zone

Traditional infrastructure planning assumes that tomorrow looks broadly like yesterday, just a bit bigger.

AI agents break that assumption.

Sudden workload spikes. Non-linear fan-out. Cost curves that move faster than budgeting cycles. Organisations are forced into an uncomfortable choice: over-provision aggressively and accept unpredictable cloud bills, or under-provision and risk outages in systems that now sit directly on critical decision paths.

Capacity planning stops being optimisation. It becomes risk management.

This Is Already Happening

None of this is theoretical.

Organisations are already talking openly about AI agents as part of their workforce — not as tools, but as actors performing work at scale.

Enterprises are comfortable counting tens of thousands of AI agents as “workers”, but it shouldn’t be surprising when those workers behave very differently to humans — and place very different demands on the systems beneath them.

The Equation Has Changed

The data explosion followed the forecast.

The explosion in users did not.

Databases were built for humans — slow, bursty, predictable ones — and that assumption shaped everything from architecture to cost models. AI agents don’t fit that mould… and pretending they do is how organisations drift into outages, runaway costs or both.

Databases were built for humans. AI agents didn’t get the memo — and they’re already in production.

Inferencing Is a Database Problem Disguised as an AI Problem

I have a habit of becoming interested in technology trends only once they collide with reality. Flash memory wasn’t interesting to me because it was new – it was interesting because it broke long-held assumptions about how databases behaved under load.

Cloud computing wasn’t interesting to me because infrastructure became someone else’s problem. It became interesting when database owners started making uncomfortable compromises just to get revenue-affecting systems to run acceptably in the cloud. Compute was routinely overprovisioned to compensate for storage performance, leading to large bills for resources that were mostly idle. At the same time, “modernisation” began to feel less like an architectural necessity and more like a convenient justification for expensive consultancy services.

And now, just when I thought flashdba had nothing left to say, AI is following the same path.

We’ve Seen This Movie Before

For the last couple of years, most of the attention has been on training. Bigger models, more parameters, more GPUs, massive share prices. That focus made sense because training is visible, centralised and easy to reason about in isolation. But as inferencing starts to move up into the enterprise, something changes.

In the enterprise, inferencing stops being an interesting AI capability and starts becoming part of real business workflows. It gets embedded into customer interactions, operational decisions and automated processes that run continuously, not just when someone pastes a prompt into a chat window. At that point, the constraints change dramatically.

Enterprise inferencing is no longer about what a model knows. It is about what the business knows right now. And that is where things begin to feel very familiar to anyone responsible for systems of record.

Because once inferencing depends on real-time access to authoritative operational data, the centre of gravity shifts away from models and back towards databases. Latency matters. Consistency matters. Concurrency matters. Security boundaries matter. Above all, correctness matters.

This is the point at which inferencing stops looking like an AI problem and starts looking like what it actually is: a database problem, wearing an AI costume.

Inferencing Changes Once It Becomes Operational

While inferencing remains something that sits at the edge of the enterprise, its demands are relatively modest: a delayed response is tolerable… slightly stale data is acceptable. If an answer is occasionally wrong, the consequences are usually limited to a poor user experience rather than a failed business process.

That changes quickly once inferencing becomes operational. When it is embedded directly into business workflows, inferencing is no longer advisory… it becomes participatory. It influences decisions, triggers actions and – increasingly – operates in the same execution path as the systems of record themselves. At that point, inferencing stops consuming convenient snapshots of data and starts demanding access to live context data.

What is Live Context?

By live context, I don’t mean training data, feature stores or yesterday’s replica. I mean current, authoritative operational data, accessed at the point a decision is being made. Data that reflects what is happening in the business right now, not what was true at some earlier point in time. This context is usually scoped to a specific customer, transaction or event and must be retrieved under the same consistency, security and governance constraints as the underlying system of record. In other words, a relational database. Your relational database.

Live Context gravitates towards RDBMS systems of record. It does not appear spontaneously – it is created at the moment a business state changes: when an order is placed, a payment is authorised, an entitlement is updated or a limit is breached, that change becomes real only when the transaction is committed to the RDBMS. Until then, it is provisional.

Analytical platforms can consume that state later, but they do not create it. Feature stores, caches and replicas can approximate it, but they do so after the fact. The only place where the current state of the business definitively exists is inside the operational production databases that process and commit transactions.

As inferencing becomes dependent on live context, it is therefore pulled towards those databases. Not because they are designed for AI workloads, and certainly not because this is desirable, but because they are the source of truth. If an inference is expected to reflect what is true right now, it must, in some form, depend on the same data paths that make the business run.

This is where the tension becomes unavoidable.

Inferencing Is Now A Database Problem

Once inferencing becomes dependent on live context, it inherits the constraints of the systems that provide that context. Performance, concurrency, availability, security and correctness are no longer secondary considerations. They become defining characteristics of whether inferencing can be trusted to operate inside business-critical workflows at all.

This is why enterprise AI initiatives are unlikely to succeed or fail based on model accuracy alone. They will succeed or fail based on how well inferencing workloads coexist with production databases that were never designed, built or costed with AI in mind. At that point, inferencing stops being an AI problem to be delegated elsewhere and becomes a database concern that must be understood, designed for and owned accordingly.

The Biggest Gap In The Clouds? High Performance RDBMS

Over the course of the last few blog posts, we’ve looked at how an increasing number of database workloads are migrating to the cloud, how there is more than one path to get there… and why overprovisioning is one of the biggest challenges to overcome.

We’re talking about business-critical application workloads here: big, complex, demanding, mission-critical, sensitive, performance-hungry… When on-prem, they are almost certainly running on dedicated, high-end infrastructure. And that’s a potential issue when you then migrate them to run on “someone else’s computer”.

As we’ve discussed before, the cloud is really a big pool of discrete resources and services, all of which are available on demand. You want a managed PostgreSQL instance? Click! It’s yours. You want three hundred virtual machines on which you can install your own software? Clickety-click! Off you go. If you’ve got the budget, the cloud has got a way for you to spend it. But underneath it all, whether you are using PaaS databases supplied by the cloud provider or installing the database software on IaaS systems, you are sharing that infrastructure – and the available performance – with the rest of the world.

Cloud Outcomes: Optimization versus Modernization

For some database workloads moving to the cloud, the modernization path will be the best fit, which means they will likely move to Platform-as-a-Service solutions where the day-to-day management of the database, operating system and infrastructure is taken care of by the cloud provider. Some examples of this path: on-prem SQL Server databases moving to Azure SQL Managed Instance; Oracle databases moving to Amazon RDS for Oracle, etc.

But there is usually a certain class of database workload which doesn’t easily fit into these pre-packaged PaaS solutions: the big, the complex, the gnarly… the monsters of your data centre. And they inevitably end up in Infrastructure-as-a-Service… or stuck on-prem. For customers choosing the IaaS route (the “optimization path” in cloud-speak), the cloud provider manages the infrastructure but the customer is still responsible for the database and operating system.

Obviously, IaaS has a higher management overhead than PaaS, but often the journey to IaaS is simpler (essentially more of a lift and shift approach), while PaaS solutions often require a more complex migration. This is especially true with cloud providers whose recommended PaaS solution is actually a different database product entirely (for example, Google Cloud and Microsoft Azure will steer Oracle customers towards Cloud SQL and Managed PostgreSQL respectively).

I/O Performance Is The Biggest Challenge

My view is that PaaS solutions are the best path for all appropriate workloads, but there will always be some outliers which need to move to IaaS. Almost by definition, those are the most high-profile, demanding, expensive, revenue-affecting… in fact… the most interesting workloads. And in all the cases I’ve seen, I/O performance has been the limiting factor.

It’s relatively easy to get a lot of compute power in the cloud. But as soon as you start ramping up the amount of data you need to read and write, or demanding that those reads and writes have very fast, predictable response times, you hit problems. In other words, if latency, IOPS or throughput are your metrics of choice, you’d better be ready to start doing unnatural things.

And it’s not necessarily the case that your required level of performance cannot be achieved. Often, it’s more correct to say that your required levels of performance cannot be achieved at an acceptable cost. Because it turns out that the following statement is just as true in the cloud as it ever was on-prem:

Performance and Cost are two sides of the same coin…

This is why I believe that the biggest gap in the cloud providers’ product portfolios today is in the area of high performance relational databases: primarily Oracle Database and Microsoft SQL Server. The PaaS solutions are designed for the average workloads, not the high-end. A complex database running on, for example, Oracle Exadata will struggle to run on a vanilla IaaS deployment – while the refactoring required to take that database and migrate it to Managed PostgreSQL is almost unimaginable.

Cloud Compromises: Constrained and Optimized CPUs

Imagine the scenario where you wander into a clothing store to buy a t-shirt. You find a design you like in size “Medium” but it’s too tight (I guess #lockdown has been unkind to us all…) so you ask for the next size up. But when it arrives, you notice something bizarre: the “Large” is not only wider and longer, it also has an extra arm hole. Yes, there are enough holes for three arms as well as your head. Even more bizarrely, the “XL” size has four sets of sleeves, while the “Small” has only one and the “XS” none at all!

Surprisingly, this analogy is very applicable to cloud computing, where properties like compute power, memory, network bandwidth, capacity and performance are often tied together. As we saw in the previous post, a requirement for a certain amount of read I/O Operations Per Second (IOPS) can result in the need to overprovision unwanted capacity and possibly even unnecessary amounts of compute power.

But there is one situation where this causes extra levels of pain: when the workload in question is database software which is licensable by CPU cores (e.g. Oracle Database, Microsoft SQL Server).

To extend the opening analogy into total surrealism, imagine that the above clothing store exists in a state which collects a Sleeve Tax of 100% of the item value per sleeve. Now, your chosen t-shirt might be $40 but the Medium size will cost you $120, the Large $160 and the XXXXXL (suitable for octopods) a massive $360.
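For anyone who wants to check the Sleeve Tax sums, here is a small sketch. The function name is made up for illustration; the prices and sleeve counts are the ones from the analogy, with sleeves standing in for licensable CPU cores.

```python
def shirt_price(base_price: float, sleeves: int, tax_rate: float = 1.0) -> float:
    """Shelf price plus a per-sleeve tax of `tax_rate` x the item value."""
    return base_price * (1 + sleeves * tax_rate)


BASE = 40  # the $40 t-shirt

medium = shirt_price(BASE, sleeves=2)   # the usual two sleeves
large = shirt_price(BASE, sleeves=3)    # the bonus third arm hole
octopod = shirt_price(BASE, sleeves=8)  # the XXXXXL

print(medium, large, octopod)
```

The shirt itself never changes price. Everything above $40 is tax on sleeves you may not have arms for, which is the position a per-core database licence puts you in when a bigger VM arrives with cores you did not ask for.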

Luckily, the cloud providers have a way to help you out here. But it kind of sucks…

Constrained / Optimized VM Sizes

If you need large amounts of memory or I/O, the chances are you will have to pick a VM type which has a larger number of cores. But if you don’t want to buy database licenses for these additional cores (because you don’t need the extra CPU power), you can choose to restrict the VM instance so that it only uses a subset of the total available cores. This is similar to the concept of logical partitioning which you may already have used on-prem. Here are two examples of this practice from the big hyperscalers:

Microsoft Azure: Constrained vCPU capable VM sizes

Amazon Web Services: Introducing Optimize CPUs for Amazon EC2 Instances

As you can see, Microsoft and AWS have different names for this concept, but the idea is the same. You provision, let’s say, a 128 vCPU instance and then you restrict it to only using, for example, 32 vCPUs. Boom – you’ve dropped your database license requirement to 25% of the total number of vCPUs. Ok so you only get the compute performance of 25% too, but that’s still a big win on the license cost… right?

Well yes but…

There’s a snag. You still have to pay the full cost of the virtual machine despite only using a fraction of its resources. The monthly cost from the cloud provider is the same as if you were using the whole machine!

To quote Amazon (emphasis mine):

Please note that CPU optimized instances will have the same price as full-sized EC2 instances of the same size.

Or to quote the slightly longer version from Microsoft (emphasis mine):

The licensing fees charged for SQL Server or Oracle are constrained to the new vCPU count, and other products should be charged based on the new vCPU count. This results in a 50% to 75% increase in the ratio of the VM specs to active (billable) vCPUs. These new VM sizes allow customer workloads to use the same memory, storage, and I/O bandwidth while optimizing their software licensing cost. At this time, the compute cost, which includes OS licensing, remains the same one as the original size.

It’s great to be able to avoid the (potentially astronomical) cost of unnecessary database licences, but this is still a massive compromise – and the cost will add up over each month you are billed for compute cores that you literally cannot use. Again, this is the public cloud demonstrating that inefficiency and overprovisioning are to be accepted as a way of life.
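A rough sketch of that trade-off follows. All prices here are invented placeholders, and licensing is simplified to a flat fee per active vCPU (real vendor rules, such as core factors, are more involved), so treat this as an illustration of the shape of the bill and not a pricing calculator.

```python
def monthly_bill(active_vcpus: int, vm_price: float,
                 license_per_vcpu: float) -> float:
    """VM charge plus licence charge for the vCPUs left switched on.

    The VM charge is deliberately independent of `active_vcpus`:
    a constrained instance is billed as the full-sized one.
    """
    return vm_price + active_vcpus * license_per_vcpu


TOTAL_VCPUS = 128           # the size you had to pick for memory and I/O
ACTIVE_VCPUS = 32           # the size you actually want to license
VM_PRICE = 10_000.0         # hypothetical monthly VM price
LICENSE_PER_VCPU = 500.0    # hypothetical monthly licence cost per vCPU

full = monthly_bill(TOTAL_VCPUS, VM_PRICE, LICENSE_PER_VCPU)
constrained = monthly_bill(ACTIVE_VCPUS, VM_PRICE, LICENSE_PER_VCPU)
idle_vcpus = TOTAL_VCPUS - ACTIVE_VCPUS

print(f"full: {full}, constrained: {constrained}, idle vCPUs paid for: {idle_vcpus}")
```

With these made-up numbers the licence line shrinks to a quarter, but the VM line does not move at all, and 96 of the 128 vCPUs you are paying the provider for sit disabled.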

Surely there must be a better way?

Spoiler alert: there IS a better way

Overprovisioning: The Curse Of The Cloud

I want you to imagine that you check in to a nice hotel. You’ve had a good day and you feel like treating yourself, so you decide to order breakfast in your room for the following morning. Why not? You fill out the menu checkboxes… Let’s see now: granola, toast, coffee, some fruit. Maybe a juice. That will do nicely.

You hang the menu on the door outside, but later a knock at the door brings bad news: You can only order a maximum of three items for breakfast. What? That’s crazy… but no amount of arguing will change their rules. Yet you really don’t want to choose just three of your five items. So what do you do? The answer is simple: you pay for a second hotel room so you can order a second breakfast.

Welcome to the world of overprovisioning.

Overprovisioning = Inefficiency

Overprovisioning is the act of deploying – and paying for – resources you don’t need, usually as a compromise to get enough of some other resource. It’s a technical challenge which results in a commercial or financial penalty. More simply, it’s just inefficiency.

The history of Information Technology is full of examples of this as well as technologies to overcome it: virtualization is a solution designed to overcome the inefficiency of deploying multiple physical servers; containerisation overcomes the inefficiency of virtualising a complete operating system many times… it’s all about being more efficient so you don’t have to pay for resources you don’t really need.

In the cloud, the biggest source of overprovisioning is the way that cloud resources like compute, memory, network bandwidth, storage capacity and performance are packaged together. If you need one of these in abundance, the chances are you will need to pay for more of the others regardless of whether they are required or not.

Overprovisioning = Compromise

As an example, at the time of writing, Google Cloud Platform’s pd-balanced block storage options provide 6 read IOPS and 6 write IOPS per GB of capacity:

[Table: pd-balanced performance limits, including a “Read IOPS per instance” row.]

* Persistent disk IOPS and throughput performance depends on disk size, instance vCPU count, and I/O block size, among other factors.

Consider a 1TB database with a reasonable requirement of 30,000 read IOPS during peak load. To build a solution capable of this, 5000GB (i.e. 5TB) of capacity would need to be provisioned… meaning 80% of the capacity is wasted!
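The arithmetic behind that 80% figure, as a short sketch. The 6 IOPS per GB ratio and the 1TB / 30,000 IOPS requirement come from the example above; the function name is mine, 1TB is treated as 1000GB, and the per-instance limits that also apply are ignored here.

```python
import math


def capacity_needed_gb(required_iops: int, iops_per_gb: float) -> int:
    """Smallest disk size (GB) whose IOPS allowance meets the requirement."""
    return math.ceil(required_iops / iops_per_gb)


DATA_GB = 1000            # the 1TB database
REQUIRED_IOPS = 30_000    # peak read IOPS requirement
IOPS_PER_GB = 6           # ratio quoted above for pd-balanced

provisioned_gb = capacity_needed_gb(REQUIRED_IOPS, IOPS_PER_GB)
wasted_fraction = 1 - DATA_GB / provisioned_gb

print(f"provision {provisioned_gb} GB, {wasted_fraction:.0%} of it unused")
```

Because performance is sold as a function of capacity, the size of the disk is set by the IOPS target and not by the amount of data you actually have.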

Worse still, the “Read IOPS per instance” row of the table tells us that some of the available GCP instance types may not be able to hit our 30,000 requirement, meaning we may have to (over)provision a larger virtual machine type and pay for cores and RAM that aren’t necessary (by the way, I’m not picking on GCP here, this is common to all public clouds).

But the real sucker punch is that, if this database is licensed by CPU cores (e.g. Oracle, SQL Server) and we are having to overprovision CPU cores to get the required IOPS numbers, we now have to pay for additional, unwanted – and very expensive – database licenses.

Overprovisioning = Overpaying

Let’s not imagine that this is a new phenomenon. If you’ve ever over-specced a server in your data centre (me), if you’ve ever convinced your boss that you need the Enterprise Edition of something because you thought it would be better for your career prospects (also me), or if you’ve ever spent £350 on a thermal imaging camera just so you can win an argument about whether you need a new front door (I neither admit nor deny this) then you have been overprovisioning.

It’s just that the whole nature of cloud computing, with its self-service, on-demand, limitlessly-scalable characteristics, makes it so easy to overprovision things all the time. So while the amounts may seem small when shown on the cloud provider’s Price per hour list, when you multiply them by the number of VMs, the number of regions and the number of hours in a year, they start to look massive on your bill.
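Here is that multiplication as a quick sketch. The hourly figure, VM count and region count are invented for illustration; only the hours in a (non-leap) year are real.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def annual_cost(extra_per_hour: float, vms: int, regions: int) -> float:
    """Yearly cost of an hourly overprovisioning premium across an estate."""
    return extra_per_hour * vms * regions * HOURS_PER_YEAR


# Hypothetical: each VM is one size too big, costing an extra $0.50 per hour,
# across 40 VMs in each of 3 regions.
yearly_waste = annual_cost(extra_per_hour=0.50, vms=40, regions=3)

print(f"${yearly_waste:,.0f} per year")
```

Fifty cents an hour looks like a rounding error on the price list. Under these assumptions it is more than half a million dollars a year on the bill.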

And when you consider the knock-on effects on database licensing, things really get painful. But let’s save that for the next blog post…

Choosing The Right Path To The Cloud

What happens when customers with on-prem databases decide they want to embrace the public cloud? If you’ve been following the story so far, it is my assertion that most of “the easy stuff” has already moved to the cloud: backups, websites, test/dev suites, videos of cats etc. We are now in the next phase of enterprise cloud adoption, where all the difficult, complex, gnarly stuff is being considered – and that usually means business-critical databases.

The guys at IBM Cloud have a name for this: they call it “Chapter Two”. In fact here’s a quote from page one of a recent IBM annual report (emphasis added by me):

“… the most challenging and complex work of these digital transformations still lies ahead. We call this work ‘Chapter 2,’ in which our clients modernize and move their mission-critical workloads to the cloud, and infuse AI deep into the decision-making workflows of their business.”

It seems that IBM knows it missed out on Chapter 1, but is determined to have a different result for this second, complex wave of cloud transformation. It has a long way to go to catch up, though, because Microsoft and AWS are dominating this market right now – while Oracle and Google are both racing hard to build their own shares of this massive opportunity.

Regardless of which public cloud is being considered, the question for most customers isn’t so much “Who?” as the more thorny issue of “How?”

So with that in mind, let’s have a look at the three main approaches to moving complex database applications into the public cloud.

Three Journeys To The Public Cloud

When it comes to existing, business-critical database-based applications, there are three high-level methods to consider when moving to the cloud. Yes that’s right, migrating to the public cloud is as easy as 1, 2, 3…

1. Optimize: The ‘Lift and Shift’ Approach [IaaS]

The chances are that your on-prem database is going to be one of the usual suspects: Oracle, SQL Server, Postgres, MySQL, DB/2 etc. And as probabilities have it, that database is more than likely running on Linux or Windows (if you are still running on big iron UNIX or – heaven help you – some kind of mainframe, you can leave now please) and there’s a fair chance you have a virtualization layer in there too. So just pick it up and wazz it into an Infrastructure-as-a-Service (IaaS) offering, will you? Quit fooling around reading this, you could have done it by now.

Welcome to lift and shift. You have now immediately realized the benefits of the cloud: all that on-prem hardware has been turned off and junked, the Capex bills have been replaced by monthly Opex costs to the cloud vendor and the DBAs have been rebadged as Site Reliability Engineers (SREs) and given new, cooler t-shirts to wear. Somebody give the CIO a bonus!

Of course, life is never this simple and there are inevitable pros and cons. IaaS still has to be managed by your own operations teams (which is expensive), database licenses, where applicable, still have to be managed (and are expensive) and those cloud infrastructure bills are looking awfully large. Did somebody say the cloud was going to save money?

Performance is a problem too. The Dream Of The Cloud™️ is to have infinite scale of resources on demand, but your architecture was designed for on-prem and simply cannot take advantage of cloud scalability. When the DBAs (sorry, SREs) deployed this in the cloud, they had to choose between architecting for the average workload – which means performance sucks at peak times – or architecting for the max, which is way more expensive. Inevitably, the compromise fell on the side of cost and so the result is that application latency is high, user experience is low and nobody dares to run any analytical workloads for fear of taking the whole platform down.

Maybe there’s an alternative?

2. Modernize: Managed Database Services [PaaS]

Every cloud vendor has managed database offerings – in fact, most have a plethora of different offerings. Microsoft, for example, has Azure SQL Database as well as Cosmos DB. Oracle has the managed version of its eponymous database, Google has Cloud SQL and AWS has so many database services that there aren’t enough electrons on the internet to list them all. So why choose managed databases?

The Dream Of The Cloud™️ is to rid your business of all the low-level drudgery that comes with running IT infrastructure, so that your operations staff can rebadge themselves as DevOps and spend their time on more valuable activities, like drinking artisan coffees or breaking CI/CD pipelines. IaaS doesn’t really deliver on that dream, but PaaS gets a lot closer. Now, the cloud vendor takes care of much more drudgery and also – for licensable database products – manages the licenses so that you only pay for what you consume.

Of course, life is never this simple and there are inevitable pros and cons. Managed database services can be notoriously expensive if you need all the enterprise features you took for granted on-prem, while performance can often be problematic. Remember that managed SQL databases are designed for the average workload and not the peak, so if your system is an edge case in any way – if you’re not running with the pack – it won’t be a perfect fit. Maybe far from perfect.

Another potential issue is that many business-critical database applications are full of business logic. Think of Oracle database with PL/SQL packages written by developers long since retired. Can that be easily migrated into Managed Postgres on Azure, or Cloud SQL on GCP? Maybe the code calls UTL_FILE to write files which are then sent elsewhere using UTL_TCP. Try feeding that code into an automatic migration service.

Managed Databases are a great solution for the hundreds of boring databases you may have on-prem. Imagine never having to patch stuff again! But for anything even remotely unusual, or anything that regularly causes you pain, the chances are slim that PaaS will be the right fit.

Of course, there is another option…

3. Transform: Refactor to Cloud Native

Ahh, the path of the truly enlightened! Rip up your existing applications and rewrite everything to be cloud native. Fill out a bingo card with words like microservices, containers, serverless, Kubernetes and immutable infrastructure; tick them off one by one as your DevOps team write the whole application in Rust. Move to open source quasi-database platforms like Postgres, Cassandra and Elasticsearch.

Boom! You have now achieved The Dream Of The Cloud™️ which means that your application is truly distributed, scalable and has virtually no performance limits. I sound like I’m being sarcastic, but I’m really not – I have recently been working with customers who have built environments exactly like this and I could not be more impressed with the results (although these were “born-in-the-cloud” companies who were building new application stacks, rather than toiling with the technical debt of “legacy” on-prem apps). It’s the future.

But guess what? Life is never this simple and there are inevitable pros and cons. If you are starting with the baggage of an on-prem deployment which needs to be migrated, this is quite clearly the most complex and time-consuming option. It’s a proper migration project – and everybody in IT has a story about a long-term migration project which ran over time, over budget and ultimately didn’t deliver on its starting goals. Also, it may require specialist skills which your organisation doesn’t have. Do you really want to engage a team of consultants and pay them on a time and materials basis?

No matter which way you look at it, this option is the most expensive and carries the most risk.

So Which Option Is Best?

To state the obvious, there is no best option and everything has to be evaluated on a case by case basis. It makes sense to look at the anticipated lifetime of the application in question, because if it’s only going to be around for another couple of years, why expend the effort of rewriting anything? Just lift and shift, or use PaaS if possible. But most important of all, keep in mind that the options above don’t have to be mutually exclusive. It’s possible to lift and shift multiple applications to achieve an immediate goal of reducing your on-prem data centre footprint, then consider a smaller selection of those for further adaptation to PaaS. It’s also possible to move into the cloud using IaaS or PaaS while, at the same time, starting a longer-term project to refactor to cloud native.

In summary, there is no perfect journey to the cloud. The bigger and more complex the application/database, the more you’ll have to compromise on the expected result. But, after all, when was that not the case in Enterprise IT?

The Battle For Your Databases

There’s a battle going on right now between all of the public cloud vendors – a war in the clouds. And you might be surprised to hear what they are fighting over… They are fighting over you. Or, more specifically, your business-critical databases.

Everybody has something in the cloud these days. On a personal level, we are all keeping our photos, our music and our emails in the cloud. Corporations have followed suit: email, document collaboration and workflow, backups, websites… Almost everything is in the cloud. Almost.

The Big Scary Stuff That Nobody Wants To Move

Pretty much every company with an on-prem presence will have one or more relational databases underpinning their critical applications. Oracle Database, Microsoft SQL Server, PostgreSQL, DB/2 (the forgotten database of yesteryear: it’s still out there, but nobody likes to talk about it), MySQL… these products support mission critical applications like CRM, ERP, e-commerce, all those SAP modules that I can never remember the names of… And in each industry vertical, there are critical systems: healthcare has Electronic Patient Records, retail has its warehouse management platforms, finance has all manner of systems labelled Do Not Touch.

These workloads are the last bastion of on-prem, the final stand of the privately-managed data centre. And just like mainframes, on-prem may never completely die, but we should expect to see it fade away this decade. The challenge, though, is the inertia caused by such massive amounts of complexity and the associated risk of disturbing it. I have witnessed DBA teams who draw lots over which unfortunate will have to log on to “that database”, the one in the corner that nobody understands or wants to touch when it’s working ok. So how are they going to migrate that entire thing into AWS or Azure? Everybody knows a story about an eighteen-month migration project that overran budget by 1000% and then failed, right?

The View From The Clouds

So you may ask, if all this complex, gnarly stuff is full of risk, why do the hyperscalers want it? The answer is, because this is the biggest game left on the hunting ground. These vast technology stacks are the crown jewels of on-prem data estates. If you are Cloud Vendor A, there are some important reasons why you really want to capture this workload into your cloud:

Big applications and databases require a large recurring spend on premium cloud infrastructure
Customers are used to spending large amounts of money to run these services
The surrounding application ecosystem offers potential for the upsell of further cloud services (analytics, AI, business intelligence etc)
Once that workload comes into your cloud, it’s probably never leaving. In other words, it’s a long-term guaranteed revenue stream.

The last point is especially important: vendors use the term sticky to describe workloads like this. The effort of migrating all that sensitive, critical data and all that impenetrable business logic (written ten years ago by developers who have long since moved on) means you are never going to want to do this more than once. Once it’s in, it’s in.

A Massive Anchor

Working with one of the hyperscalers, I have heard these databases described as anchor workloads (credit: Kellyn Pot’vin Gorman) because they are what holds back the migration of large, juicy and complex environments into the public cloud. Like the biggest beast on the savannah, they are the hardest to take down… but a successful capture means everybody gets to eat until they are full.

So if this is you – if you are in fact a massive anchor – it’s probably worth keeping this in mind. Migrating your complex, challenging workload to the public cloud might seem like a mammoth task from your perspective, but to the hyperscalers you are the goose that lays the golden egg. And they can’t wait to get cracking.

Side note: I originally planned to call this post “Cloud Wars”, but I discovered that my former Oracle colleague, the inestimable Bob Evans, had beaten me to it…

How To Look Stupid (Part #612)

Now is the winter of our discontent. But rather than dwell on what a terrible year 2020 has been, I thought I’d make my final post of the year something more positive… so I am going to look back on one of the (many) times I made a fool of myself, in the hope that 2021 will give me the chance to do so again.

When Computers Go Bad

In the late 1990s, I was fresh out of university and working in my first job, for a small company (5 people!) at London’s Heathrow Airport, as a developer and database admin. We provided cargo handling software for all of the big airlines and freight companies. And on this particular day, “Dave”* at Air Canada had a problem with his system.

My company’s software managed the customs clearance of all inbound air freight for most of the airport. In order for inbound freight to leave the secure warehouses on a truck, this software (which, for Air Canada, ran on their main HPUX server) would send a message to the central HM Customs computer and then, upon receiving clearance, print out an official “air waybill” document. The waybill was legal proof that goods had clearance to leave the warehouse: no waybill = no clearance = no freight.

An hour ago, Dave had called in with a major problem: goods were being cleared by customs, but no waybills were being printed. Air Canada now had a queue of lorries backed up at the warehouse and a crew that couldn’t do any work. There was nothing wrong with the printers, it was our software. Fix it, Dave begged us. Fix it now!

When DBAs Go Rogue

A senior colleague of mine, Denis**, was working on the problem and trying to test a fix on our lab system. He was also dialled in to Air Canada’s production system, on which our software ran – a crucial fact which turned out to be very important.

So when he called through to me from the server room to say, “Hey could you reboot the lab box?” I wandered over to his desktop and typed the magic reboot command on the first root window I found. Hey, one terminal session looks like another, right?

“Are you going to reboot it?” called Denis.

“I already have,” I yelled back, mildly irritated.

Denis stuck his head out of the door and stared at me, puzzled. I was then able to watch a whole range of emotions pass over his face: confusion changed to comprehension which in turn became outright horror.

I had just hard rebooted Air Canada’s entire UNIX platform with no warning to them at all.

Knowing When To Own Up

It took a little while for Air Canada to realise what (or who) had happened to them. Remember, this was the 1990s, so big iron UNIX systems took about 15-30 mins to restart – and everybody was connected via dumb terminals which would have just suddenly gone blank.

Fred was a DBA until he accidentally truncated the wrong table

I mainly spent this time in purgatory, thinking about alternative careers, planning my new life in a Tibetan monastery or hoping for a natural disaster to divert attention.

But eventually, my desk phone rang and our receptionist said, “Dave from Air Canada wants to speak to you”.

I can vividly remember the dry mouth, my sweaty palms holding the phone, my voice about three octaves too high.

“Yes?” I stammered.

“I don’t know what you’ve done,” said Dave, “but all the waybills are coming out again now. Thanks very much!”

It’s important, I think, to be honest in these situations. But not that honest. So I let Dave get back to his busy job and made a mental note to confess to what had really happened some time within the next 25 years. And then I filed that next to the other mental note – the one about never, ever typing reboot without triple checking which system you are connected to.

Aspirations for 2021

When I look back at this story – and the many other times in my career when I made myself look stupid – I am grateful for the fact that things turned out ok. The whole year 2020 has felt like an elongated version of the purgatory I experienced above. But, as anybody who has ever rebooted a 1990s-era big iron UNIX server will attest, the login window only appears about ten seconds after you’ve finally admitted to yourself that it’s never coming back.

So let’s hope that 2021, like Dave and his waybill printouts, gets us back on track fast.

* The names of innocent parties have been changed to protect their identities

** Denis really was called Denis though

The Public Cloud: The Hotel For Your Applications

Unless you are Larry Ellison (hi Larry!), the chances are you probably live in a normal house or an apartment, maybe with your family. You have a limited number of bedrooms, so if you want to have friends or relatives come to stay with you, there will come a point where you cannot fit anybody else in without it being uncomfortable. Of course, for a large investment of time and money, you could extend your existing accommodation or maybe buy somewhere bigger, but that feels a bit extreme if you only want to invite a few people On to your Premises for the weekend.

Another option would be to sell up and move into a hotel. Pick the right hotel and you have what is effectively a limitless ability to scale up your accommodation – now everybody can come and stay in comfort. And as an added bonus, hotels take care of many dull or monotonous daily tasks: cooking, cleaning, laundry, valet parking… Freeing up your time so you can concentrate on more important, high-level tasks – like watching Netflix. And the commercial model is different too: you only pay for rooms on the days when you use them. There is no massive up-front capital investment in property, no need to plan for major construction works at the end of your five year property refresh cycle. It’s true pay-as-you-go!

It’s The Cloud, Stupid

The public cloud really is the hotel for your applications and databases. Moving from an investment model to a consumption-based expense model? Tick. Effectively limitless scale on demand? Tick. Being relieved of all the low-level operational tasks that come with running your own infrastructure? Tick. Watching more Netflix? Definite Tick.

But, of course, the public cloud isn’t better (or worse) than On Prem, it’s just different. It has potential benefits, like those above, but it also has potential disadvantages which stem from the fact that it’s a pre-packaged service, a common offering. Everyone has different, unique requirements but the major cloud providers cannot tailor everything they do to your individual needs – that level of customisation would dilute their profit margins. So you have to adapt your needs to their offering.

To illustrate this, we need to talk about car parking:

Welcome To The Hotel California

So… you decide to uproot your family and move into one of Silicon Valley’s finest hotels (maybe we could call it Hotel California?) so you can take advantage of all those cloud benefits discussed above. But here’s the problem: your $250/day suite only comes with one allocated parking bay in the hotel garage, yet your family has two cars. You can “burst” up by parking in the visitor spaces, but that costs $50/day and there is no guarantee of availability, so the only solution which guarantees you a second allocated bay is to rent a second room from the hotel!

This is an example of how the hotel product doesn’t quite fit with your requirements, so you have to bend your requirement to their offering – at the sacrifice of cost efficiency. (Incurring the cost of a second room that you don’t always need is called overprovisioning.) It happens all the time in every industry: any time a customer has to fit a specific requirement to a vendor’s generic offering, something somewhere won’t quite fit – and the only way to fix it is to pay more.
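To put some rough numbers on the parking problem, here is a minimal sketch in Python. The $50/day visitor charge comes from the example above; the price of the second room is not stated, so I am assuming it matches the $250/day suite, and I am assuming a 30-day month. The function names are purely illustrative.

```python
# Figures from the hotel example; SECOND_ROOM_PER_DAY is an assumption.
VISITOR_BAY_PER_DAY = 50    # "burst" parking, with no guarantee of availability
SECOND_ROOM_PER_DAY = 250   # assumed: same price as the first suite


def burst_cost(days_needed: int) -> int:
    """Pay for a visitor bay only on the days the second car is there."""
    return days_needed * VISITOR_BAY_PER_DAY


def overprovisioned_cost(days_in_month: int = 30) -> int:
    """Rent a second room all month just to guarantee the extra bay."""
    return days_in_month * SECOND_ROOM_PER_DAY


def cost_per_used_day(days_needed: int, days_in_month: int = 30) -> float:
    """Effective daily price of the guaranteed bay, spread over actual use."""
    return overprovisioned_cost(days_in_month) / days_needed


if __name__ == "__main__":
    days = 10  # second car is at the hotel on 10 days out of 30
    print("burst:", burst_cost(days))
    print("second room:", overprovisioned_cost())
    print("per day actually used:", cost_per_used_day(days))
```

Under those assumptions the guarantee is the expensive part: the fewer days you actually need the second bay, the more each of those days costs you.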

The public cloud is full of situations like this. The hyperscalers have extensive offerings but their size means they are less flexible to individual needs. Smaller cloud companies can be more attentive to an individual customer’s requirements, but lack the economies of scale of companies like Amazon Web Services, Microsoft and Google, meaning their products are less complete and their prices potentially higher. The only real way to get exactly what you want 100% of the time is… of course… to host your data on your own kit, managed by you, on your premises.

Such A Lovely Place

I should state here for the record that I am not anti-public cloud. Far from it. I just think it’s important to understand the implications of moving to the public cloud. There are a lot of articles written about this journey – and many of them talk about “giving up control of your data”. I’m not sure I entirely buy that argument, other than in a literal data-sovereignty sense, but one thing I believe to be absolutely beyond doubt is that a move to the public cloud will require an inevitable amount of compromise.

That should be the end of this post, but I’m afraid that I cannot now pass up the opportunity to mention one other compromise of the public cloud, purely because it fits into the Hotel California theme. I know, I’m a sucker for a punchline.

You and your family have enjoyed your break at the hotel, but you feel that it’s not completely working – those car parking charges, the way you aren’t allowed to decorate the walls of your room, the way the hotel suddenly discontinued Netflix and replaced it with Crackle. What the …? So you decide to move out, maybe to another hotel or maybe back to your own premises. But that’s when you remember about the egress charges; for every family member checking out of the hotel, you have to pay $50,000. Yikes!

I guess it turns out that, just like with the cloud, you can check out anytime you like… but you can never leave.

Cloud DBA: The Next Generation of Database Administrator?

In the previous post, I ~~ranted~~ discussed the evolution of the DBA role, looking at how many additional functions the database administrator has inherited over the years: code fixer, virtualisation tamer, Linux / Windows juggler, reluctant storage administrator, application server hater, firewall botherer and all round fixer of any product badged as Oracle.

But the real change I am interested in comes as a result of databases moving into the cloud. Because this exposes the DBA to ownership of a new problem: cost. Specifically, ongoing operational costs – or Opex. It is my belief that this is in fact A New Thing – and New Things are not to be trusted. Sure, in the on prem world, DBAs were involved in decisions concerning capital expenditure (Capex) like the scoping of database servers, the calculation of how many database licenses were needed, the justification of additional license options (e.g. Enterprise Edition instead of Standard Edition). But in most cases, those decisions were made by a collective and then signed off by the business.

Cloud is different. Everything you do in the public cloud costs money. You want to spin up an instance? Kerching. You want to use some SSD storage? Kerching! You want to download copies of your data to an on prem location? Egress charges ahoy… KERCHING!

Bills, Bills, Bills…

Decisions taken by DBAs in the normal course of their day jobs can now have a significant effect on the next invoice from the cloud vendor. Do you remember in the early days of cell phones, if you used your phone a lot you were never entirely sure what the bill would look like at the end of the month? Could be a little more than usual, could be so massive you need a loan from the World Bank. Sometimes, the cloud has a similar feel.

Most cloud vendors have remarkably complex pricing structures (some say this complexity is deliberate!) and this has in fact spawned a whole industry of experts (“cloud economists”) who can help customers understand and reduce their cloud costs, often using the two step principle of 1) turn stuff off, and 2) negotiate harder for discounts.
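The two step principle is easy to sketch as arithmetic. The snippet below is only an illustration: the hourly rate, the working pattern and the discount are all made-up numbers, not real prices from any vendor, and `monthly_compute_cost` is a hypothetical helper rather than anything a cloud provider supplies.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_compute_cost(hourly_rate: float, hours_running: float,
                         discount: float = 0.0) -> float:
    """Cost of one instance for a month, after a fractional discount (0.0 to 1.0)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hourly_rate * hours_running * (1.0 - discount)


if __name__ == "__main__":
    rate = 0.20  # hypothetical on-demand price in $/hour
    always_on = monthly_compute_cost(rate, HOURS_PER_MONTH)
    # Step 1: turn stuff off. Run 12 hours a day on 22 working days.
    office_hours = monthly_compute_cost(rate, 12 * 22)
    # Step 2: negotiate harder. A hypothetical 20% discount on top.
    discounted = monthly_compute_cost(rate, 12 * 22, discount=0.20)
    print(f"always on:    ${always_on:.2f}")
    print(f"office hours: ${office_hours:.2f}")
    print(f"discounted:   ${discounted:.2f}")
```

With these invented figures, switching the instance off outside working hours saves far more than the discount does, which is probably why step one comes first.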

Into this new minefield steps that brave warrior, the DBA. Often charged with the apparently simple task of “move that database into the cloud”, not only must a new technical language be learned (e.g. “it’s not a VM in the cloud, it’s an instance”) and a new set of TLAs be absorbed (“In my AWS VPC, I use EC2, EBS, S3 and ZXP”)… but also a new understanding must be gained of what each checkbox and pulldown option does to the operating cost.

Another Plate To Spin

It’s a whole new area of expertise to take on – and it’s complex. What’s more, it’s subtly different between cloud vendors – and even if you only use one cloud, it’s subject to change over time. Usually in the direction of more expensive.

Here’s a simple example: provisioning an instance. You are a DBA (congrats!) and you need to migrate your on prem database into, say, Amazon Web Services. You first of all need to configure a Linux instance and some disks. There are many different ways of doing this – including templates, infrastructure-as-code and so on – but let’s do it in the GUI for fun. First, you’ll need some compute power, so let’s provision some from the Elastic Compute Cloud (EC2). Which type shall we choose?

If you are new to this, there are a lot of options. I mean, really a lot. Let me see now, there’s categories of General Purpose, Compute Optimized, Memory Optimized, Accelerated Computing, or Storage Optimized. These are just the categories… each of which contains many types, which in turn contain many options! But “General Purpose” sounds kinda normal, so let’s choose that. Now you need to choose the instance type:

Amazon Web Services – Elastic Compute Cloud choices for General Purpose instance types

Amazon Web Services – EC2 M5 Large instance types

If we go for instance type of M5, we are told that “This family provides a balance of compute, memory, and network resources, and is a good choice for many applications”. Cool, so now you have to pick the instance size:

This screenshot only shows a fraction of the total choices, with each config of vCPUs and Memory replicated again in the m5d.* range (adds NVMe SSD storage), plus some further options around bare metal. It is a labyrinthine set of options to consider.
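For what it’s worth, the mechanical part of that choice can be sketched in a few lines of Python: list the sizes, then pick the smallest one that satisfies both the vCPU and the memory requirement. The vCPU and memory figures below are the M5 sizes as I understand them at the time of writing, so check the AWS documentation before relying on them. `InstanceSize` and `smallest_fit` are names I have made up for illustration, and the sketch deliberately ignores price, network and storage.

```python
from typing import NamedTuple, Optional, Sequence


class InstanceSize(NamedTuple):
    name: str
    vcpus: int
    memory_gib: int


# M5 sizes listed smallest to largest; verify against current AWS
# documentation, since instance families change over time.
M5_SIZES = [
    InstanceSize("m5.large", 2, 8),
    InstanceSize("m5.xlarge", 4, 16),
    InstanceSize("m5.2xlarge", 8, 32),
    InstanceSize("m5.4xlarge", 16, 64),
    InstanceSize("m5.8xlarge", 32, 128),
    InstanceSize("m5.12xlarge", 48, 192),
    InstanceSize("m5.16xlarge", 64, 256),
    InstanceSize("m5.24xlarge", 96, 384),
]


def smallest_fit(vcpus_needed: int, memory_gib_needed: int,
                 sizes: Sequence[InstanceSize] = M5_SIZES) -> Optional[InstanceSize]:
    """Return the first (smallest) size that satisfies both requirements."""
    for size in sizes:
        if size.vcpus >= vcpus_needed and size.memory_gib >= memory_gib_needed:
            return size
    return None  # nothing in this family is big enough


if __name__ == "__main__":
    # A database that needs 6 vCPUs but 48 GiB of memory.
    print(smallest_fit(6, 48))
```

In that example the memory requirement pushes the choice past m5.2xlarge (8 vCPUs, 32 GiB) to m5.4xlarge, so you end up paying for 16 vCPUs to get the memory you need. Because vCPUs and memory scale together within a family, one dimension often drags the other along with it.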

If you haven’t undertaken the myriad training courses for this cloud vendor, how do you know which instance size to choose? Well, maybe the same way that you specced up the config of your on prem database servers before… right? Except most DBAs didn’t do that, they were allocated servers without really playing a part in their procurement. But my real point here is that the choice you make reflects the ongoing monthly cost. And there are more choices to make! After all, you are going to need some storage from Elastic Block Store on which to place your database:

Amazon Web Services – Elastic Block Store volume types

Amazon recommends one of two different options for “I/O-intensive NoSQL and relational databases” plus a third for data warehouses. I’ll tell you right now, if your database is even mildly transactional, you will want to use io1 or io2. Whatever you choose, it will have an effect on the monthly cost – you can see this by checking it out on the AWS Calculator.

And you know what we didn’t even cover at the start? The region – the geographical location in which this instance runs – also changes the cost, sometimes significantly. Pricing for European regions is often surprisingly higher than regions in the US.
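To show how those three choices (instance, storage and region) combine into one monthly number, here is a rough sketch. Every price in it is invented for illustration and the region names are placeholders. The only structural assumption is that provisioned-IOPS storage such as io1 or io2 is billed both per GB and per provisioned IOPS, which is worth verifying against the current AWS pricing pages.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices for one region (not real AWS prices)."""
    instance_per_hour: float     # $ per instance-hour
    storage_per_gb_month: float  # $ per GB of provisioned storage per month
    iops_per_month: float        # $ per provisioned IOPS per month


# Placeholder regions with made-up rates, purely to show the effect.
RATES = {
    "us-region": Rates(0.40, 0.125, 0.065),
    "eu-region": Rates(0.45, 0.140, 0.072),
}


def monthly_estimate(region: str, storage_gb: int, provisioned_iops: int) -> float:
    """Estimate one always-on instance plus one provisioned-IOPS volume."""
    r = RATES[region]
    compute = r.instance_per_hour * HOURS_PER_MONTH
    storage = r.storage_per_gb_month * storage_gb
    iops = r.iops_per_month * provisioned_iops
    return round(compute + storage + iops, 2)


if __name__ == "__main__":
    for region in RATES:
        print(region, monthly_estimate(region, storage_gb=1000, provisioned_iops=5000))
```

Even with invented numbers the shape of the problem is visible: the same database, with the same instance and the same disks, produces a different bill depending on a pulldown you chose before you had even picked an instance type.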

Why This Matters (TL;DR)

What I am trying to show here is that, in the course of provisioning databases in the cloud, DBAs are having to make complicated choices which not only affect the performance of their databases but also the ongoing cost. In fact, it’s a balancing act: performance and cost are two sides of the same coin. Amazon Web Services, in the example above, offers a huge and dazzling array of options which offer different trade offs for these two dimensions. That’s not a bad thing by the way – I am not criticising AWS for giving us a choice – but it’s bewildering to the uninitiated.

What’s more, if you put a database in Microsoft Azure, or Google Cloud Platform, or Oracle Cloud Infrastructure, or Alibaba Cloud or … I can’t think of any other clouds … then be prepared for the fact that everything changes again.

It’s time for DBAs to learn to juggle with yet another ball.