No plan survives contact with the enemy: How we moved to a new payment service

For me, January of this year was not only a string of holidays but also an occasion to celebrate one year since we completed our transition to a new payment gateway. For a perfume subscription service, the payment gateway is a critical system, and when it has problems, everyone in the company feels them. The transition itself can be compared to a good book. There is the exposition: the problems with the previous service. The inciting incident: how we decided between in-house development and a ready-made service. The rising action: when we started working on the architecture and making the first commits. The climax: the initial launch and the challenges that came with it. And the denouement: when we finished migrating all our customers. Sound interesting? Then read on.

Treasure Island

First, a few words about our business. Scentbird is a US-based subscription service for perfumes and colognes. Once a month, a customer automatically pays us $16 and we send them 8 ml of a perfume of their choice. In principle, everything is very simple, but the devil is in the details. What are the special payment-related features of a subscription service?

  • A variety of plans. There are lots of nuances related to switching from plan to plan, canceling a prepaid plan, and so on.
  • Subscription management. Pause, skip a month, cancel.
  • Recovering a customer when some error (no money on the card, problems with the bank, etc.) causes payment to fail.
  • Financial reporting and everything related to it.

Subscription services generally use third-party solutions that are well known in the market: Stripe, Recurly, Chargebee, Vindicia, etc. All of these services can do all of the above, and each one tries to offer something of its own, usually a focus on subscription renewal or useful integrations. There is another option: to build your own payment system from scratch. There are plenty of examples of this too. This is precisely the path taken by HelloFresh (one of the largest meal kit services).

Part one. If you go to the right, you will lose your horse… (Folk story). How it all started.

At the beginning of 2020, we were using a certain subscription billing service, which we had switched to at some point because they were very good at recovering customers with failed payments (this process is called either recovery or dunning). But over time, as the CTO, I became dissatisfied with the price, the performance (sometimes a day would pass between when a payment was initiated and when it was actually made), and the technical support (we had to build logging of each step ourselves, because they had no detailed logs on their side and we had to prove everything). With 400,000 active customers and several hundred thousand former customers, this is a serious matter. So we decided to migrate, and the first step was to decide where to: a third-party solution or our own.

I decided to start by listing the functionality that we would need. Here's what we came up with:

  • One-time charges (capture/refund/void)
  • Subscription plan management
  • Subscription management (change next billing day, change plan, etc.)
  • Recurring charges
  • Recovery logic
  • Fraud prevention
  • Discount management
  • Tax calculation
  • Customer health management (e.g. duplicate charge prevention)
  • Technical performance monitoring & logging
  • Security
  • Support for alternative payment methods (PayPal, ApplePay, Google Wallet)
  • Analytics & dashboards
  • Mobile SDK
  • Chargeback management (a chargeback is when a customer contacts his or her bank to cancel and reverse a transaction)

After speaking with representatives of the relevant services, we created this spreadsheet:

When we compared these services by price, Recurly looked the most attractive. What about in-house development?

According to our analysis, we would need:

  • 4 software developers for at least 5-6 months of active development
  • Infrastructure improvements
  • A permanent team to support and improve the service
  • Regular auditing for PCI DSS compliance

True, we would save a lot in the long term, and the service wouldn't become more expensive as the number of transactions increased. But having an application on the balance sheet that would be outside our main area of expertise did not look very attractive, so we decided to integrate with a third-party service. By the way, this decision (to buy or to build your own) is one of the biggest headaches of any CTO. There is no silver bullet here. In the end, the decision comes down to what it means for the growth or loss of the business.

Part two. Plans are worthless, but planning is everything. (Attributed to Dwight D. Eisenhower). Planning the project.

The main challenge when moving to a different payment gateway is migrating customer data. If you make the slightest mistake in your logic, then hundreds (thousands) of angry users send you rays of hatred, and all your business metrics go down the tubes. That's why the transition needs to be done in stages. What's more, it's essential to preserve everybody's various payment methods and different statuses, and so on.

To account for all this and more, it's best to have a plan. Will everything go according to plan? Not 100%, but planning is still useful. The basic elements of the document:

  • Timeline: the phases of the project and when they start and end (spoiler: the project took 2 months longer than we expected).
  • Success metrics: what we measure at the end of the project in order to understand whether the migration was worthwhile.
  • RACI matrix: who is responsible for what in the project (so that later nobody says someone's opinion wasn't accounted for). There's an example below.
  • Functional requirements — all of these should have a 1-to-1 mapping in the task tracker.
  • Data migration plan.
  • Risks and a plan for dealing with them.

I'll describe the migration plan to you separately.

First, it's important to understand how to transfer payment data. We work with 4 payment methods:

  • ApplePay
  • PayPal
  • AmazonPay
  • CreditCard

The simplest methods were AmazonPay and PayPal. We knew the agreement id (a unique id issued when payment is authorized), which let us easily create a subscription in Recurly, set the required dates for the next payment, and that's all it takes. This was only possible because Recurly's API provided the right method, which again speaks to the importance of choosing the right partner.

The situation is roughly the same with ApplePay. When authorizing a payment, the gateway receives JSON, which can then be used for recurring transactions and data transfer. Unfortunately, our old gateway let us down here — they did not have the needed data.

But the most interesting story has to do with transferring card data. First, it happens without our involvement: the payment gateways do the work between themselves. Second, it takes quite a long time. Third, gateways generally let you make one import for free, and each subsequent import costs extra money. That means it is very important to properly manage how you switch your gateway in a production environment.

This is the second important part of working with data. You should be very careful when enabling a new gateway in a production environment because, if you don't roll out all the functionality at once, some of your users won't be able to pay for this or that service, which will hurt the business. Here you have two options: either first move all the functionality to the new gateway and then transfer your customers, or follow the 80-20 rule, transfer some of them early, and make peace with some losses. The upside of the second option is that you start receiving feedback early.

We chose the second option because by covering the scenarios for "subscribe", "change plan", "one-time transaction", "unsubscribe", and "pause", we were able to cover most customers and could start letting them into the new gateway.

Back to customer migration. Because we could transfer customers using PayPal/AmazonPay ourselves, we transferred them all in 3 passes (1,000 / 20,000 / all the rest) and were able to check all the scenarios before transferring customers using cards.
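The pass-based approach can be sketched roughly as follows. This is a minimal illustration, not our production code: the names `migration_passes`, `run_migration`, `migrate_one`, and `verify_pass` are hypothetical, and the only idea taken from the text is splitting customers into a few passes of growing size and checking each pass before starting the next one.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


def migration_passes(
    customer_ids: Iterable[str], pass_sizes: Sequence[int]
) -> Iterator[List[str]]:
    """Yield one batch per pass; the final batch holds everyone left over."""
    remaining = iter(customer_ids)
    for size in pass_sizes:
        batch = list(islice(remaining, size))
        if batch:
            yield batch
    rest = list(remaining)
    if rest:
        yield rest


def run_migration(
    customer_ids: Iterable[str],
    pass_sizes: Sequence[int],
    migrate_one: Callable[[str], None],      # hypothetical: moves one customer
    verify_pass: Callable[[List[str]], bool],  # hypothetical: checks a finished pass
) -> int:
    """Migrate pass by pass and stop once a pass fails verification.

    Returns the number of customers migrated so far.
    """
    migrated = 0
    for batch in migration_passes(customer_ids, pass_sizes):
        for customer_id in batch:
            migrate_one(customer_id)
        migrated += len(batch)
        if not verify_pass(batch):
            break  # investigate before touching a larger group
    return migrated
```

With `pass_sizes=[1000, 20000]` this produces the three passes described above: 1,000 customers, then 20,000, then all the rest.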

Part three. Shoot first. Ask questions later! (Napoleon Bonaparte). Creating the architecture and setting up integrations.

As you can imagine, our plot thickened as we started development. Before all else, a few words about implementation. Below I'll provide a couple of diagrams that explain in a simplified way what was done. Let's start with the business logic:

I'll highlight the main points. First, we now support the discount mechanism ourselves. Why? Because every gateway has its limitations, but the marketing team doesn't. For example, they introduce discounts that decrease over several months, or make the third month free, and so on. Additionally, we have a lot of promotions linked to the physical side of the business (for example, a gift in the first month), and billing must account for this.
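To make the discount examples concrete, here is a small sketch of how a per-month discount schedule could be expressed. The $16 price comes from our plan; the function name `charge_for_month`, the schedule format, and the specific percentages are assumptions made up for illustration, not our actual promotion rules.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict

CENT = Decimal("0.01")


def charge_for_month(
    base_price: Decimal, month: int, schedule: Dict[int, Decimal]
) -> Decimal:
    """Return the amount to charge in a given billing month (1-based).

    `schedule` maps a month number to a discount fraction between 0 and 1.
    Months that are not listed are charged at full price.
    """
    discount = schedule.get(month, Decimal("0"))
    amount = base_price * (Decimal("1") - discount)
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


# Hypothetical promotions, for illustration only.
DECREASING = {1: Decimal("0.50"), 2: Decimal("0.25"), 3: Decimal("0.10")}
THIRD_MONTH_FREE = {3: Decimal("1")}

price = Decimal("16.00")
decreasing_charges = [charge_for_month(price, m, DECREASING) for m in range(1, 5)]
free_month_charges = [charge_for_month(price, m, THIRD_MONTH_FREE) for m in range(1, 5)]
```

Keeping the schedule as plain data on our side is what makes it independent of whatever coupon model a particular gateway happens to offer.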

Second, the bills themselves. This is an important artifact for the finance team, because bills are documents that must be provided during an audit. Bills clearly describe everything that the customer paid for, and they let you correctly calculate financial metrics. Also, technical support should always be able to refund any line item in a bill. So accuracy is everything for us.
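A bill with refundable line items might be modeled along these lines. This is only a sketch: the `Bill` and `LineItem` classes, their fields, and the shipping surcharge amount are hypothetical and are not taken from our codebase.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class LineItem:
    description: str
    amount: Decimal
    refunded: Decimal = Decimal("0")

    def refundable(self) -> Decimal:
        """Amount of this line item that has not been refunded yet."""
        return self.amount - self.refunded


@dataclass
class Bill:
    customer_id: str
    items: List[LineItem] = field(default_factory=list)

    def total(self) -> Decimal:
        """Everything the customer was charged on this bill."""
        return sum((item.amount for item in self.items), Decimal("0"))

    def net_total(self) -> Decimal:
        """What the customer has paid after refunds."""
        return sum((item.refundable() for item in self.items), Decimal("0"))

    def refund(self, index: int, amount: Decimal) -> None:
        """Refund part or all of one line item, never more than was paid."""
        item = self.items[index]
        if amount <= 0 or amount > item.refundable():
            raise ValueError("refund must be positive and within the unrefunded amount")
        item.refunded += amount


bill = Bill(
    "customer-1",
    [
        LineItem("Monthly subscription", Decimal("16.00")),
        LineItem("Shipping surcharge", Decimal("5.00")),  # hypothetical amount
    ],
)
bill.refund(1, Decimal("5.00"))  # support refunds only the surcharge
```

Tracking refunds per line item rather than per bill is what lets support reverse one charge without touching the rest, and it keeps the totals that finance reports on easy to recompute.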

Third, the add-on mechanism. An add-on is a paid addition to a subscription. For example, we use add-ons for surcharges for shipping to Canada or a +1 subscription, which is only available on top of a main subscription. Plus, this makes it possible to make some tricky changes to the subscription itself, for example, to test a price increase on the main subscription.

The diagram above shows how the overall architecture changed. That's right, almost nothing has changed, and that was the objective. We still process incoming information through the combination of the API Gateway (in order to leave the service inside the VPC) + AWS Lambda + RabbitMQ. For logging purposes, we still keep these events on S3. And so on. This setup had already proved its reliability, so we decided not to change it.
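The flow of an incoming event can be sketched as a Lambda-style handler. This is a simplified illustration under several assumptions: the `archive` and `publish` callables stand in for the S3 and queue clients (they are injected so the sketch runs without AWS), and the `id` field on the payload and the `events/` key prefix are hypothetical.

```python
import json
from typing import Any, Callable, Dict


def make_handler(
    archive: Callable[[str, str], None],      # stand-in for writing to S3
    publish: Callable[[Dict[str, Any]], None],  # stand-in for publishing to the queue
) -> Callable[..., Dict[str, Any]]:
    """Build a handler that archives each valid event and then queues it."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        # With API Gateway proxy integration the request body arrives as a string.
        raw_body = event.get("body") or ""
        try:
            payload = json.loads(raw_body)
        except json.JSONDecodeError:
            return {"statusCode": 400, "body": "invalid JSON"}
        if not isinstance(payload, dict) or "id" not in payload:
            return {"statusCode": 400, "body": "missing event id"}

        # Keep the raw event for later investigation, then hand it to the workers.
        archive(f"events/{payload['id']}.json", raw_body)
        publish(payload)
        return {"statusCode": 200, "body": "ok"}

    return handler


# Local usage with in-memory stand-ins instead of real AWS and queue clients.
archived: Dict[str, str] = {}
queued = []
handler = make_handler(archived.__setitem__, queued.append)
response = handler({"body": '{"id": "evt_1", "type": "payment"}'})
```

Archiving the raw body before publishing means that every event a worker sees also has a stored copy, which is the kind of step-by-step record we had to build ourselves with the previous gateway.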

Some funny moments. Our account has a number of integrations, including Kount (anti-fraud protection) and Avalara (tax calculation service). Neither of them has a free test environment:

  • Kount does not have one at all. All integrations with Kount use the production environment. This matters because we pay them for the number of verified transactions. Accordingly, we have this integration turned off in our test environment.
  • Avalara has a paid test environment. It costs about $15k per year and is only available for a limited time. That's why our test environment uses Recurly's own tax service instead. What's the problem? In e-commerce, tax is calculated at the level of each line item and depends on the tax code. Recurly's service does not support any of this, so the tax amount may differ between production and test.
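A small worked example shows why the two environments can disagree on tax. Everything in it is hypothetical: the tax codes, the rates, and the amounts are invented for illustration, and the flat-rate function is only an assumption about what a simpler calculation could look like, not a description of any provider's actual logic.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Dict, List, Tuple

CENT = Decimal("0.01")
Item = Tuple[Decimal, str]  # (amount, tax code)


def tax_per_line_item(items: List[Item], rates: Dict[str, Decimal]) -> Decimal:
    """Tax each line item at the rate for its own tax code, then add up."""
    total = Decimal("0")
    for amount, tax_code in items:
        total += (amount * rates[tax_code]).quantize(CENT, rounding=ROUND_HALF_UP)
    return total


def tax_flat(items: List[Item], rate: Decimal) -> Decimal:
    """Apply a single rate to the whole bill, ignoring tax codes."""
    subtotal = sum((amount for amount, _ in items), Decimal("0"))
    return (subtotal * rate).quantize(CENT, rounding=ROUND_HALF_UP)


# Hypothetical bill: taxable goods plus a shipping line that is tax-exempt.
items = [(Decimal("16.00"), "GOODS"), (Decimal("5.00"), "SHIPPING")]
rates = {"GOODS": Decimal("0.08"), "SHIPPING": Decimal("0.00")}

production_like = tax_per_line_item(items, rates)  # 1.28
test_like = tax_flat(items, Decimal("0.08"))       # 1.68
```

The two results differ by 40 cents on a $21 bill, so any test that asserts exact totals has to know which tax calculation is behind the environment it runs in.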

Part four. Only fools repeat their mistakes. Smart people make new ones. (Folk wisdom). What eventually went wrong and how we dealt with it.

And now for the most fascinating part: the mistakes that we (I) made as part of this project.

Mistake No. 1. No matter how many smart programmers you gather, you still need someone to play the role of a manager and bring everybody together. We started this project with 2 principal engineers and 2 senior engineers. They did everything right, but there were problems with focus and transparency. And we all know that programmers like to solve engineering problems, not business problems.

Mistake No. 2. Insufficient analysis of dependent systems. As part of the transition to the new gateway, we also decided to update the internal CRM system used for technical support. The problem here is that without some person responsible for the project and pushing forward the necessary tasks, this work does not get done. This functionality slipped in the schedule and delayed customer migration.

Mistake No. 3. Insufficient involvement of other departments in the discussion of the migration plan. This was probably my biggest mistake, because as soon as we started talking about customer migration, the finance and technical support teams suddenly asked, "How will we know that everything is working well?" and requested additional reports and monitoring. All this also pushed out the deadlines.

Mistake No. 4. Over-optimizing how development tasks were split. In order to finish faster, each developer was responsible for his or her own area and basically did not delve into the next developer's code, except during code review. As a result, we ran into the bus factor problem, and developers from other teams had implementation questions that only one person could answer.

Problem No. 5. During the migration of customers' card data, we discovered that the old gateway did not do any data normalization and there were some errors when importing the customer data. These customers had to be transferred by hand.

Results

What was the result? In the end, we completed the project, with a 2-month delay. Now, a year later, we can say with 100% certainty that we have achieved our goals:

  • We cut our gateway cost by a third.
  • We solved the duplicate payment problem caused by bugs on the gateway side.
  • We doubled the processing speed for recurring payments, which has allowed us to start sending packages a day earlier.

Leave a comment and let me know what other technical solutions and problems related to subscription services you would be interested to learn about.