Temporal Python – A durable, distributed asyncio event loop (2023)

temporal.io

metadat

55 comments

I found out about Temporal 2+ years ago, and we were early adopters of their cloud. It's a bit of a paradigm shift when you start using it, but it is amazing at solving some types of problems in a very simple manner. There are some trade-offs, there always are; for us that was migrating long-lived workflows. But the resulting simplicity and maintainability of the code has been great. One thing that was hard with Temporal was to "sell it" to business leaders because it's not a turnkey solution, it's more like a piece of infra for engineers to build on top of. Kind of a higher-level database-queue-workflow engine thing that simplifies work for engineers.

In short, we were working on automating things like onboarding a new employee, which involves creating accounts for their SaaS apps, buying and shipping their device, email confirmations, satisfaction surveys, etc. So a workflow could last up to 3 months, with some fully automated systems and some that required integrating with people (listening to Jira events to trigger things, etc).

The error handling was the thing that sold me on Temporal, because things can break just about anywhere in unpredictable ways (not just code, can be process, employee quits during the onboarding, customer is out of licenses etc), so we need everything to be robust and be fixable by a person. With homegrown queue based systems or with BPEL it can be hard handling these situations (what if you need to roll back 3 steps?). With code you can use exceptions, write unit tests etc. We use the typescript sdk, promises made it very intuitive to code even some otherwise complicated scenarios (say event listeners etc).

troebr

11 days ago

Temporal is really neat but I think it's marketed at too many use cases.

After a year of high-scale Temporal work, I found it was only good for low-scale work.

The onboarding and learning curve were insanely difficult and complex. Ultimately it doesn't scale as well as you think. The temporal team invented their own database to get around this limitation.

ub-volta-toss

11 days ago

Would love to hear more about the scale issues you saw. How many workflows or actions was too many? Which components started breaking down, and what were their failure modes?

claytonjy

11 days ago

See above. It's not so straightforward. You need enough headroom on each component that a negative feedback loop can start, eat resources, and have enough time and resources to calm itself before hitting some limit or degrading itself further.

ub-volta-toss

4 days ago

Can you tell us more about your scaling issues with Temporal?

I haven't yet used it in production, but I would've expected that a system which evolved out of Uber's Cadence [0] (and which I believe is used at Uber extensively) would've scaled very well.

[0]: https://stackoverflow.com/a/61281435/1579058

wojcikstefan

5 days ago

I'm not sure how Uber does it, but it might be because they're using Cadence instead.

The Temporal team has acknowledged that Cassandra-backed Temporal hits scaling limits pretty fast.

The limitations aren't a clean "X actions/sec", they're sneakier. Because you can run X/sec for days and then the memory on the history service will spike, or any tiny slowdown in the DB will cause looping degradation. There are nasty feedback loops hidden in Temporal that turn small problems into very very large problems.

I think the core problem with Temporal is the way it's sharded. This affects the history service and its caches. If anything tells them to reload or restart, or that any of the nodes are unreachable, you get a retry storm on the DB.

In addition to these issues, Temporal can create feedback loops within itself. I've seen cases where it would not return to health, even with 0 workers requesting work for 10s of minutes.

We could have kept using and scaling Temporal, but it required 10-30x the resources of building something else. And it was scary to administer. You really need an entire team. You can't have somebody who isn't a dedicated engineer take on-call for it.

ub-volta-toss

4 days ago

What did you move to instead?

claytonjy

3 days ago

Invented their own database? They use Cassandra IIRC

NoThisIsMe

10 days ago

Nope. They hit the scale limit with Cassandra and now have an in-house storage layer

ub-volta-toss

4 days ago

As someone familiar with asyncio, I don't understand what this is or what it's for. What's an activity, workflow, or worker?

> See the asyncio.sleep in there? That’s no normal local-process sleep; that’s a durable timer backed by Temporal.

That's the normal asyncio.sleep. What does backed by Temporal mean? Reading further, it appears that Temporal is replacing the default asyncio event loop. I don't understand why every third party async Python library/framework feels the need to take over the default event loop instead of just building on top of it.

bmitc

11 days ago

Temporal is completely above and beyond Asyncio. It's a full scheduling of work and queues that's cross-machine, cross-language, and very transparent.

A workflow is the code that handles only deterministic actions and calls activities.

Activities are functions that do anything you want, typically affecting other systems with network or file calls.

A worker is the running process connected to Temporal with registered workflows & activities for it to pass work to.

I'm doing a lot of work with alert handling and provisioning systems using Temporal. Temporal in two minutes is a great video explanation: https://www.youtube.com/watch?v=f-18XztyN6c

storyinmemo

11 days ago

It's like that old Joel Spolsky article[0] says:

> you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it’s a zillion times faster which means it can be used to tackle huge problems in an instant

If you can replace the thing that people use with a distributed version, then that can make it easy to write distributed code.

[0] https://www.joelonsoftware.com/2006/08/01/can-your-programmi...

robertlagrant

11 days ago

The thing with event loops in python is that they are not a single, all-governing scheduler (as e.g. in the BEAM).

Event loops are instead a mid-layer concept that sits below other infrastructure such as threads and processes. And (perhaps somewhat frustratingly) it is not too uncommon to have multiple event loops in parallel. See for example the proxy.py project, which offers to run one async loop per process for a speedup.

As a result, there are some incentives to swap out the loop itself, e.g. for faster implementations like uvloop, because they are somewhat pluggable anyways.

uniqueuid

11 days ago

Good design dictates that you start one loop and build the whole program around it, no? The docs for asyncio.run say as much.

blegr

11 days ago

Yes, that is good design and the event loop should basically be shared process-wide (asyncio objects are usually not thread safe and cannot be shared across event loops). Temporal only does custom event loops in isolated workflows.

kodablah

11 days ago

> all-governing scheduler (as e.g. in the BEAM).

Does this also mean they have preemptive multitasking like in BEAM?

mapcars

11 days ago

To add to other responses here, Temporal doesn't take over the default event loop in general (and users still use it for clients and activities and such). Temporal workflows must be deterministic and durable which means they are guaranteed to run and are resumable on other machines. Therefore Temporal workflows specifically operate on a custom event loop implementation. It doesn't affect anything outside the workflow.

kodablah

11 days ago

The OP seems targeted at devs who are already quite familiar with Temporal and are interested in using the new Python exposure.

FWIW, as someone who has never previously encountered Temporal, and has only a vague sense of the specific problem set it's trying to tackle and architectural approach it has taken, I find the post to be fairly impenetrable.

I'd love to read a proper introduction to Temporal by way of Python (and probably also by comparison to Celery).

davepeck

11 days ago

I think it means that your code could be resumed on a different machine.

adhamsalama

11 days ago

I just had to write a Python program that handled multiple async events - from a serial line, and with a tkinter GUI. The only way to make it 'truly' async was to handle the runloops myself, and add a separate queue, onto which I push coroutines, for processing. UI events and Serial I/O events (involving passing of messages to update states on both sides) all have to be pushed through the same mechanism in order to gain the functionality I need.

Sure, I 'could just use asyncio', or somehow work out how to crowbar serial I/O into tkinter's runloop. But in the end, writing my own just made more sense, and more importantly: it works great. I can have serial I/O and UI acting independently, but coordinating through a single queue .. this works so well.

>I don't understand why every third party async Python library/framework feels the need to take over the default event loop instead of just building on top of it.

Because you don't always have what you need to get the crowbar in place, nor big enough leverage to make space for what you have to do, asynchronously, in the app.

helpfulContrib

11 days ago

But that can indeed just be done with the standard `asyncio` loop. You run your GUI in a thread, run the `asyncio` event loop in its own thread, pass the `asyncio` loop messages with an `asyncio.Queue` and `asyncio.run_coroutine_threadsafe`, and then use `asyncio.to_thread` for the serial communication within the `asyncio` event loop.
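A rough sketch of that arrangement, using only the standard library. The serial read is simulated with `time.sleep`, and `read_serial_blocking`, `handle_ui_event`, `start_background_loop` and `demo` are made-up names for illustration; the `asyncio.Queue` part is left out to keep it short.

```python
import asyncio
import threading
import time


def read_serial_blocking():
    """Stand-in for a blocking serial read (hypothetical; no real port is opened)."""
    time.sleep(0.05)
    return "sensor=42"


async def handle_ui_event(name, results):
    # Runs on the asyncio thread. The blocking read is pushed to a worker
    # thread so the event loop stays free for other coroutines.
    line = await asyncio.to_thread(read_serial_blocking)
    message = f"{name}: {line}"
    results.append(message)
    return message


def start_background_loop():
    # One asyncio loop, running forever in its own daemon thread.
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


def demo():
    loop, thread = start_background_loop()
    results = []
    # This call is what a GUI callback would make from the GUI thread.
    future = asyncio.run_coroutine_threadsafe(
        handle_ui_event("button-click", results), loop
    )
    future.result(timeout=5)  # a real GUI would poll or use a callback instead
    # Shut the loop down cleanly from the calling thread.
    loop.call_soon_threadsafe(loop.stop)
    thread.join(timeout=5)
    loop.close()
    return results
```

Calling `demo()` submits one coroutine from the main thread and waits for it; in a real tkinter app you would avoid blocking on `future.result()` inside a UI callback.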

bmitc

11 days ago

Read the example code, and I have a sinking feeling that it is not taken from a real tested example. Either there are multiple unexplained symbols or the code does not actually run.

For example, in "Implementing a Workflow" the execute_activity refers to Purchaser.purchase, which is not declared anywhere.

If the execute_activity times out after 1 minute, the status does not seem to be updated anywhere.

In "Running a Worker", do_purchaser is passed as an activity, without explanation. (I guess I'd need to read the fundamental Temporal docs?)

pierrebai

11 days ago

Yes, it has undergone revisions since, which caused a function name mismatch (EDIT: fixed). The execute_activity there uses start_to_close_timeout, which is per attempt, and it will retry forever by default (customizable).
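To make "per attempt" concrete, here is a toy retry loop in plain asyncio. It is not how Temporal implements retries (the server drives those); `run_with_retries` and its parameters are invented for this sketch and only loosely echo the SDK's naming.

```python
import asyncio


async def run_with_retries(make_attempt, *, start_to_close_timeout,
                           initial_interval=0.01, backoff=2.0,
                           maximum_attempts=0):
    """Call make_attempt() until one attempt finishes in time.

    The timeout covers each attempt separately, not the whole operation.
    maximum_attempts=0 means "keep retrying forever" in this toy.
    Returns (result, number_of_attempts).
    """
    attempt = 0
    delay = initial_interval
    while True:
        attempt += 1
        try:
            result = await asyncio.wait_for(
                make_attempt(), timeout=start_to_close_timeout
            )
            return result, attempt
        except (asyncio.TimeoutError, ConnectionError):
            if maximum_attempts and attempt >= maximum_attempts:
                raise
            await asyncio.sleep(delay)  # back off before the next attempt
            delay *= backoff


async def demo():
    calls = {"n": 0}

    async def flaky_purchase():
        # Hypothetical activity: the first two attempts hang past the timeout.
        calls["n"] += 1
        if calls["n"] < 3:
            await asyncio.sleep(1)
        return "purchased"

    return await run_with_retries(flaky_purchase, start_to_close_timeout=0.05)
```

With these settings the first two attempts are cancelled at the 0.05 s mark and the third succeeds, so the caller never sees the timeouts unless a maximum attempt count is set.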

This is more of a primer on the Python part of Temporal rather than an explanation of all Temporal concepts in depth. Definitely would recommend reading the fundamental docs at https://docs.temporal.io/encyclopedia/. For more exact samples, see https://github.com/temporalio/samples-python.

kodablah

11 days ago

I like the concept.

The nice thing is that it abstracts the condition checks on whether something is done, has succeeded or should be retried.

The bad thing is that it abstracts the condition checks on whether something is done, has succeeded or should be retried.

It's nice because that's something you do again and again, and that's a lot of code. A lot of ways it can go wrong.

But it's bad because that's a huge chunk of black box magic that may execute remotely. If you need a custom or more optimized behavior anywhere in this logic, you are done for. If there is a bug/problem in this logic, it's game over. I also have to imagine debugging and error reporting is likely not super fun.

One point in particular that strikes me is that idempotence is generically guaranteed with something like "has this task executed without error last time". But usually, what I want is something much more specific, like "has that entry been updated", "has that file been created" and so on. From a bird's-eye view, it looks the same, but from a system reliability point of view, they are not at all the same.
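A small sketch of the second kind of check, where the task inspects the end state itself instead of trusting "it ran without error last time". The function name and the file-based example are assumptions made for illustration, not anything from Temporal.

```python
import tempfile
from pathlib import Path


def ensure_report_written(path, content):
    """Make sure `path` holds `content`. Returns True only if work was done."""
    if path.exists() and path.read_text() == content:
        return False  # desired state already holds, so a retry is a no-op
    # Write to a temporary sibling first, then rename over the target, so a
    # crash mid-write never leaves a half-written file under the final name.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(content)
    tmp.replace(path)
    return True


def demo():
    with tempfile.TemporaryDirectory() as directory:
        target = Path(directory) / "report.txt"
        first = ensure_report_written(target, "hello")   # does the work
        second = ensure_report_written(target, "hello")  # retry: nothing to do
        return first, second, target.read_text()
```

A retry framework can safely run something like this any number of times, but the state check is still code you write yourself inside the task.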

Hard to see how they avoid duplicate results, overlapping tasks, etc.

I don't think they really can at that level of abstraction, which means you need to implement it manually.

Ultimately, it seems like a huge dep to bring in for the actual practical problem it really solves well.

But I'm willing to be proven wrong on this one, because the tech is really damn cool.

BiteCode_dev

11 days ago

It is interesting seeing the comments here. The comments from adopters say there is a lot of value, but that it takes time to get up to speed. Those new to Temporal have a lot of questions seeking understanding.

I have spent a lot of time in the adjacent space of event-driven systems and there, like here, it seems like one of the biggest challenges is just education.

I wouldn't say that EDA or workflow-based systems are preferable to traditional API services with DBs, just that the space they occupy in the industry is so large that I think it is really, really hard to introduce any different paradigms, even when you focus on domains where API services aren't a great fit (like here, with long-running, complex operations).

My point with this comment is simply that I think if you are trying to build anything that does things differently, developer education is as important or even more important than design and architecture, but often not considered because those building these systems are already so deep into it that they can't approach the problem as an outsider.

addisonj

11 days ago

I love Temporal-- we use it at my company. It's very very good for our use case, but took a while to understand how to use it. We're still figuring things out (Workflow versioning is one thing we suck at still).

That said, I'm not sure why this post from 2023 was posted here today. There've been multiple updates to the Python SDK since this post.

amackera

11 days ago

> Workflow versioning is one thing we suck at still

well it's not entirely your fault :) what practices have you adopted now that you have some experience with it?

lower down OP mentions that they got the link from a HN discussion on asyncio 2 days ago https://news.ycombinator.com/item?id=40287354 . i guess the upvotes are today's lucky 10,000 learning about it for the first time.

swyx

11 days ago

Relying on external APIs or databases within activities might lead to variability in workflow execution.

Also, handling HTTP errors in activities by raising an "ApplicationError" based on the status code might simplify error handling, but I'd need to see how it accounts for more complex scenarios where errors are transient, or where a retry could succeed even for some client errors such as rate limiting or temporary unavailability.

The asyncio library itself has a steep learning curve, so when integrating asyncio with workflow systems like Temporal that also use Python's native asynchronous features, developers should be careful about indirect or subtle bugs, especially in error handling and task management.

avi_vallarapu

11 days ago

> Relying on external APIs or databases within activities might lead to variability in workflow execution.

This is why they are activities. Their results are stored in history, the workflow remains deterministic.
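A toy model of that idea, not Temporal's actual machinery: a context records each activity result in a history list, and when the workflow function runs again it gets the recorded results back instead of re-running the activities. `WorkflowContext`, `call_activity` and the two activities are invented names.

```python
import random


class WorkflowContext:
    """Toy context: records activity results and replays them on re-execution."""

    def __init__(self, history=None):
        self.history = list(history or [])  # [(activity_name, result), ...]
        self._cursor = 0
        self.executed = 0  # activities actually run during this execution

    def call_activity(self, name, fn, *args):
        if self._cursor < len(self.history):
            # Replaying: reuse the recorded result and check the order matches.
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise RuntimeError(
                    f"non-deterministic workflow: expected {recorded_name!r}, got {name!r}"
                )
        else:
            # New ground: run the activity and record what it returned.
            result = fn(*args)
            self.executed += 1
            self.history.append((name, result))
        self._cursor += 1
        return result


def fetch_price(item):
    return random.randint(1, 100)  # stands in for a non-deterministic external call


def charge(item, price):
    return f"charged {price} for {item}"


def purchase_workflow(ctx, item):
    # Workflow code only sequences activities; it does no I/O itself.
    price = ctx.call_activity("fetch_price", fetch_price, item)
    return ctx.call_activity("charge", charge, item, price)


def demo():
    first = WorkflowContext()
    result1 = purchase_workflow(first, "book")
    # Pretend the worker crashed after the first activity and another one resumes.
    resumed = WorkflowContext(first.history[:1])
    result2 = purchase_workflow(resumed, "book")
    # Replaying a finished workflow runs no activities at all.
    replayed = WorkflowContext(first.history)
    result3 = purchase_workflow(replayed, "book")
    same = result1 == result2 == result3
    return same, first.executed, resumed.executed, replayed.executed
```

The random price is only ever produced once; every later execution sees the recorded value, which is why the workflow code has to issue the same activities in the same order.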

> might need to see how it accounts for more complex scenarios where errors are transient or where a retry could be successful even for some client errors like rate limiting or temporary unavailability etc.

Temporal allows you to specify whether an error is retryable or not.
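As a sketch of what that classification might look like for HTTP status codes: the `ActivityError` class below is a stand-in defined here so the snippet runs on its own, loosely modeled on the idea of a non-retryable flag on the error. Which statuses count as retryable is an assumption for illustration.

```python
# Client errors that are usually worth retrying (assumed set for this sketch):
# 408 Request Timeout and 429 Too Many Requests.
RETRYABLE_CLIENT_STATUSES = {408, 429}


class ActivityError(Exception):
    """Hypothetical error type carrying a 'do not retry' marker."""

    def __init__(self, message, *, non_retryable=False):
        super().__init__(message)
        self.non_retryable = non_retryable


def error_for_status(status):
    """Map an HTTP status code to None (success) or an ActivityError."""
    if status < 400:
        return None
    # Server errors and a few client errors are treated as transient.
    retryable = status >= 500 or status in RETRYABLE_CLIENT_STATUSES
    return ActivityError(f"HTTP {status}", non_retryable=not retryable)
```

So a 404 would be marked non-retryable and fail the activity right away, while a 429 or 503 would be left to the retry policy.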

kodablah

11 days ago

Credit: @kodablah and @chippiewill, thanks for turning me on to this!

https://news.ycombinator.com/item?id=40282650

metadat

13 days ago

Anyone migrated from celery, with / without regrets?

benakh

11 days ago

Many Temporal users used Celery in the past. There was a popular blog post a while back about issues with celery: https://steve.dignam.xyz/2023/05/20/many-problems-with-celer.... Here's a brief heading-by-heading listing of how Temporal addresses those issues: https://community.temporal.io/t/suggestion-for-blog-post-abo....

(disclaimer, I'm the author of the post)

kodablah

11 days ago

The "API isn’t Pythonic" examples are misleading, the first and third are using more verbose forms of:

  add.delay(1, 2)

The verbose forms are for when you want extra functionality like in the second example.

It's relatively small compared to the other issues but it sticks out because it's one of only two listed as "you'll have to live with it".

Izkata

11 days ago

(to clarify my ambiguous disclaimer, I am the author of OP's Temporal post, not the Celery one)

kodablah

11 days ago

We migrated from an in-house redis queuing system.

Temporal has its own way of doing things; there's rules about what you can and can't do in workflows, what has to live in activities, etc. It's generally quite easy to adapt existing code to work with it. We use TypeScript.

The worst part for us has been error/anomaly handling. Workflows can sometimes hit a state where the status reads in progress and errors aren't reported anywhere except buried in the event log; which surfaces great in the UI but we still haven't figured out how to programmatically respond to this condition.

A good example is: we use a home-grown version of this [1] to proxy large payloads to S3. However, if those payloads get REALLY large, they can take some time to upload and download; and if that "some time" is longer than 5 seconds, the control plane will believe that the worker has died, it won't reschedule, and the workflow just sits in In Progress. There's always a beautiful error on the temporal dashboard, and we can manually terminate/retry, but the world just seems to die when this happens and we can't do error-level cleanup stuff like alert the user that the thing they were doing didn't finish.

Temporal is also challenging to get support for. It's new, open source, we don't pay for Temporal Cloud, and there's not a ton of resources or people using it. The documentation is quite bad (if you like 500,000 word pages, codegen'd library sites with no comments, and one example for each feature, you'll like their documentation). Given we run our own Temporal cluster, we've also had pretty large challenges in the self-hosting world. We work through them, usually after deep-diving into the Temporal server code itself, but there's startlingly little documentation on self-hosting, and even less community support.

Overall, we don't regret adopting it, but if we had a time machine we wouldn't do it again. I feel it makes a series of sacrifices in order to create a system that has extremely high standards for processing, like financial/bank/healthcare level stuff. But, not only are we not building that, but the system has never behaved in a way which makes me think I'd even want to use it if I worked in those industries. Obviously I feel like I'm the one in the wrong here, and I'm sure it's just a matter of "we screwed up something somewhere", but that leads back to: bad documentation, no way to get professional support without being on their cloud, and a lack of community support.

[1] https://github.com/DataDog/temporal-large-payload-codec

015a

11 days ago

Would you not like it if you didn’t self host?

If I’m being honest, if it is a big issue to self-host but its value to developers is obvious and apparent, why not pay?

no_wizard

11 days ago

Nah, we'd probably be fine paying Temporal Cloud to host the control plane. Their billing is a little weird; I know quite a bit about Temporal-the-technology, and the pricing page is literally the first time I've ever seen the word "action" used. I'm familiar with workflows, activities, sinks, codecs, events, but not actions; so when they bill $N/million actions I have no idea what that means, and it's surprising to me that that's how they bill it. But I'm sure there's an answer somewhere.

Temporal Cloud is really, really new. Like, it was in some kind of closed beta for a while, with a "contact us" form, as recently as a couple months ago? So, the main reason we don't use it is because it simply wasn't available. It looks like it's more widely available now though.

015a

11 days ago

I'd question if you really need to distribute work across machines. It's great this makes distributed systems easier to write but it's much better to reject the premise and avoid writing them in the first place.

siliconc0w

11 days ago

Distributing the work across machines essentially comes for free with the ability to replay workflows from any step. You could run all of your worker processes on a single machine if it had enough capacity, but resuming a workflow on a different machine is transparent to your workloads assuming there's no local state.

dandandan

11 days ago

Is this equal to Azure Durable Functions?

7bit

11 days ago

Not necessarily "equal" but the basic premise is the same, yes, and there is a common lineage. Azure Durable Functions sits on Azure Durable Task Framework which was created by the co-founder of Temporal (https://temporal.io/about).

(disclaimer, I'm the author of the post)

kodablah

11 days ago

Ohh, that's great to hear! I do like ADF, but the Python worker is full of bugs and weird behaviour and tickets stay open for months without progress. I will definitely check that out!

7bit

11 days ago

During an evaluation I found a bug in their library. I went to their Slack, posted about it, and they gave a workaround in 15 minutes, created an issue in 30, and had a bugfix PR ready the next day. Pretty impressed with their team.

Excited to get to use it at some point.

rjbwork

11 days ago

Working on a freelance job 3 years ago, I got sucked down a rabbit-hole for months trying to get Azure Durable Functions to work. Too many bugs, no visibility into its workings and the worker would always grind to a halt. That job did not go well.

Avoid ADF (let alone Azure) until you've got some innovation tokens to spend, or go with another provider.

11 days ago

It got much better. The internals are explained well enough, although there is room for improvement. The bugs can be worked around. They're super annoying, but are the lesser evil when compared to what ADF brings to the table. However, I'm looking into Temporal to see if that's the better app.

7bit

10 days ago

I can’t understand what layer provides the state orchestration. Like, in Celery it's Redis. What about here?

sscarduzio

11 days ago

The Temporal server stores events and distributes tasks. There is a cloud offering or it can be self-hosted (with support for Cassandra, Postgres, MySQL, and SQLite persistence). This post focuses more on the Temporal Python SDK and not the general platform.

kodablah

11 days ago

Could you or anyone else with experience with Temporal share how hard it is to self-host in practice? Like, is this more like Redis (self-hosting is trivial) or Supabase (nominally self-hostable, but if you try to do it you'll quickly realize it's a pain and the happy path is to use their hosted platform).

kcorbitt

11 days ago

We offer a full guide to help here at https://docs.temporal.io/self-hosted-guide and many users of all sizes self-host Temporal. Having said that, it has challenges, as does running any highly available production system. We offer cloud to ease this burden. You still run all your code/workers and you can end-to-end encrypt all data.

kodablah

11 days ago

Is this in the same space as Windmill.dev? Love to hear from anyone with experience with both.

darkteflon

10 days ago

I find Temporal itself to be effectively a clone of Amazon's Simple Workflow (SWF)

fortylove

11 days ago

That's no coincidence, Temporal is founded by the creators of Amazon Simple Workflow. See https://temporal.io/about.

kodablah

11 days ago

Isn't this just threads but with more surprise gotos?

KaiserPro

11 days ago

no, the whole point of temporal is to distribute work across machines, but without worrying too much about the orchestration.

workflows and activities are called remotely, and you can have an autoscaled worker pool handling these calls.

you can retry any unit easily on failure and specify the non retryable errors. What it requires in exchange is full determinism - the same input should produce the same activities in the same order, as a good starting point.

src: I'm a user since over a year ago.

ikari_pl

11 days ago

Threads are neither durable nor distributed.

anonzzzies

11 days ago