Lessons From My First Product Launch

Today I launched CloudVise, an AI advisor giving personalized advice across multiple topics. It was a long journey, starting all the way back in July.

Here are some lessons I learnt.

Timeframes

I started work on the product right after I left my previous company, at the start of July.

The first 1-2 weeks were spent reading some nice fantasy books, relaxing, and brainstorming. Eventually I decided I would make a skeleton of an app first - something that could handle payments and logins, on a Python backend (FastAPI) and a NextJS frontend, hosted on AWS.

Initially I was obsessed with Infrastructure-as-Code (IaC), since a lot of the deployment at my previous company was manual. I started with a production-grade, scalable setup on AWS (with load balancers, containers via Elastic Container Service, and databases via Relational Database Service).

After I set that up, I realized that it would cost me ~$50/month just for hosting a single product. So, I spent another week optimizing that setup (I managed to drop the monthly cost per product to $2.50 - you can read about that here).

I worked on the frontend in the last week prior to the product launch. Using AI tools was quite a big help here (I'm not very familiar with TailwindCSS).

Takeaways

Having a solid backend is important

Most people brush the backend aside because users never see it directly. In my experience, though, a malfunctioning or unreliable backend can really turn users away. I was fortunate to have a codebase that I had spent months refining and could adapt for this purpose.

A well designed backend should:

  • be simple to understand (modular code)
  • have functions which do only one thing
  • not hardcode variables (so you can develop locally)
  • have unit tests (ideally)

The usual principles of Clean Code apply.
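As an illustration of the "not hardcode variables" point, here is a minimal sketch of reading configuration from environment variables with local-friendly defaults. The variable names (DATABASE_URL, STRIPE_API_KEY, DEBUG) and the default values are assumptions made up for this example, not the actual settings of my app.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    stripe_api_key: str
    debug: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return Settings(
        # Defaults let the app start locally without any cloud setup.
        database_url=env.get("DATABASE_URL", "sqlite:///./local.db"),
        stripe_api_key=env.get("STRIPE_API_KEY", "sk_test_placeholder"),
        debug=env.get("DEBUG", "false").lower() == "true",
    )
```

Passing a plain dict instead of os.environ also makes the function easy to unit test, which ties in with the last bullet.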

Automated deployments speed up development time

I use GitHub Actions for CI/CD, with OpenTofu deploying the infrastructure. Once I have made changes on the dev branch, merging them to prod triggers the pipeline, and the production site is updated ~5 minutes later.

This is very helpful for testing production-only things like Stripe payments on the live account, or Google Analytics.

Nice frontend stuff

Tailwind CSS (for theming), Shadcn (for components) and Lucide Icons are a great combination. It's easy to swap themes via tweakcn, dark mode comes for free, and mainstream LLMs are very familiar with this stack.

Random thoughts about handling long-running jobs on the backend

I was randomly thinking about the best way to handle the scenario where the backend needs to make an API call to a vendor which might take a long time (e.g. video generation).

I initially thought the frontend could just wait for the backend to complete, but I realized this wasn't really feasible, because long-running HTTP connections can time out at many layers (API Gateway has a 29s timeout, App Runner has a 120s hard timeout). And what happens to the result if the connection is interrupted but the vendor's request still finishes?

There are 2 ways to solve the problem. Both involve asynchronous processing.

The first is to use a webhook from the vendor. This way, the vendor can notify the backend when the long-running job is complete, and the backend can then update the database accordingly. Meanwhile, the frontend continuously polls a backend endpoint to check the status of the job.
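A minimal sketch of that flow, assuming an in-memory dict in place of the database and a made-up webhook payload shape (job_id, ok, result_url). A real vendor's payload will look different, and in my stack these functions would sit behind FastAPI routes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    status: str = "pending"  # "pending" | "succeeded" | "failed"
    result_url: Optional[str] = None


JOBS: Dict[str, Job] = {}  # stand-in for the database


def start_job(job_id: str) -> None:
    """Called when the backend submits the request to the vendor."""
    JOBS[job_id] = Job()


def handle_vendor_webhook(payload: dict) -> None:
    """Called when the vendor reports completion; records the outcome."""
    job = JOBS.get(payload["job_id"])
    if job is None or job.status != "pending":
        return  # unknown job or duplicate delivery: ignore it
    job.status = "succeeded" if payload["ok"] else "failed"
    job.result_url = payload.get("result_url")


def get_job_status(job_id: str) -> dict:
    """What the endpoint polled by the frontend would return."""
    job = JOBS.get(job_id)
    if job is None:
        return {"status": "not_found"}
    return {"status": job.status, "result_url": job.result_url}
```

The key property is that the result is written to storage by the webhook handler, so it survives even if the user's original HTTP connection is long gone.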

If the vendor does not offer a webhook, the next best option is to have the backend poll the vendor itself, driven by a queue (e.g. Simple Queue Service). A queue offers features like message retention, dead letter queues (DLQs) to handle failures, and FIFO ordering.

In particular, a low-cost implementation of a queue + function system (SQS + Lambda) could be:

Queue:

  • message retention period 12 hours
  • default visibility timeout: 30s (how long a message stays hidden while one worker processes it, before it becomes available to other workers again)
  • maxReceiveCount: 900, which allows up to 1800s (30min) of waiting for the vendor
  • DLQ after that
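For reference, these settings map onto SQS queue attributes roughly as below. This is only a sketch: I deploy with OpenTofu rather than Python, and the helper name queue_attributes and the DLQ ARN in the usage note are placeholders.

```python
import json


def queue_attributes(dlq_arn: str) -> dict:
    """Queue settings in the shape SQS expects (attribute values are strings)."""
    return {
        "MessageRetentionPeriod": str(12 * 60 * 60),  # 12 hours, in seconds
        "VisibilityTimeout": "30",  # default visibility timeout, in seconds
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,  # where messages go after too many receives
            "maxReceiveCount": 900,
        }),
    }
```

With boto3, a dict like this could be passed as the Attributes argument of create_queue, e.g. queue_attributes("arn:aws:sqs:us-east-1:123456789012:jobs-dlq").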

A message comes in.

Worker (a Lambda function):

  • first, check timestamp: if > 20min in queue, mark failed and send to DLQ
  • check job status from the vendor:
      • Success: get result, update DB, remove message.
      • Pending: if this is the first run, block for 30s, then set the visibility timeout to something short like 5s. On each retry, check message age. If < 60s old → 2s visibility; if < 5m → 5–10s; if < 20m → 15–30s.
      • Failed: send to DLQ, remove message.
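The worker's decision logic can be sketched as pure functions, which keeps it testable without touching AWS. The function names and action labels are made up for illustration; I picked the low end of each visibility range, and the "block for 30s on the first run" step is left out for brevity.

```python
from typing import Optional

MAX_AGE_SECONDS = 20 * 60  # give up after 20 minutes in the queue


def next_visibility_timeout(age_seconds: float) -> Optional[int]:
    """Seconds to hide a still-pending message, or None if it is too old."""
    if age_seconds < 60:
        return 2
    if age_seconds < 5 * 60:
        return 5  # low end of the 5-10s range
    if age_seconds < MAX_AGE_SECONDS:
        return 15  # low end of the 15-30s range
    return None


def decide(age_seconds: float, vendor_status: str) -> dict:
    """Map message age and vendor status to the worker's next action."""
    if age_seconds > MAX_AGE_SECONDS:
        return {"action": "send_to_dlq"}  # age check comes first
    if vendor_status == "success":
        return {"action": "save_result_and_delete"}
    if vendor_status == "failed":
        return {"action": "send_to_dlq"}
    timeout = next_visibility_timeout(age_seconds)
    if timeout is None:
        return {"action": "send_to_dlq"}
    return {"action": "retry_later", "visibility_timeout": timeout}
```

In a real Lambda, "retry_later" would translate to an SQS ChangeMessageVisibility call with the returned timeout, so the message reappears for the next check.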

DLQ handler:

  • Update DB, remove message.

In the interest of saving money, I also checked the costs on GCP, which offers Cloud Run Worker Pools. For an always-on worker, the cost is $0.000011244 per second × 2,628,000 seconds in a month (730 hours) = $29.549232 (≈ $29.55/month). The minimum machine size is 1 vCPU and 512 MB RAM. This is still pretty expensive compared to using AWS Lambda.
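The monthly figure above can be reproduced with a couple of lines. The rate is the one quoted above; the helper name monthly_cost is just for this example.

```python
SECONDS_PER_MONTH = 730 * 60 * 60  # 730 hours = 2,628,000 seconds
RATE_PER_SECOND = 0.000011244  # rate quoted above, in USD


def monthly_cost(rate_per_second: float, seconds: int = SECONDS_PER_MONTH) -> float:
    """Cost in USD of keeping one worker running for the whole month."""
    return round(rate_per_second * seconds, 2)


print(monthly_cost(RATE_PER_SECOND))
```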
