I found out about Temporal 2+ years ago, and we were early adopters of their cloud. It's a bit of a paradigm shift when you start using it, but it is amazing at solving some types of problems in a very simple manner. There are trade-offs, as there always are; for us it was migrating long-lived workflows. But the resulting simplicity and maintainability of the code has been great. One thing that was hard with Temporal was "selling it" to business leaders, because it's not a turnkey solution; it's more like a piece of infra for engineers to build on top of. Kind of a higher-level database-queue-workflow engine thing that simplifies work for engineers.
In short, we were working on automating things like onboarding a new employee, which involves creating accounts for their SaaS apps, buying and shipping their device, email confirmations, satisfaction surveys, etc. So a workflow could last up to 3 months, with some fully automated systems and some that required integrating with people (listening to Jira events to trigger things, etc.).
The error handling was the thing that sold me on Temporal, because things can break just about anywhere in unpredictable ways (not just code; it can be process: an employee quits during the onboarding, the customer is out of licenses, etc.), so we need everything to be robust and fixable by a person. With homegrown queue-based systems or with BPEL it can be hard to handle these situations (what if you need to roll back 3 steps?). With code you can use exceptions, write unit tests, etc. We use the TypeScript SDK; promises made it very intuitive to code even some otherwise complicated scenarios (say, event listeners).
troebr
11 days ago
Temporal is really neat but I think it's marketed at too many use cases.
After a year of high-scale Temporal work, I found it was only good for low-scale work.
The onboarding and learning curve were insanely difficult and complex. Ultimately it doesn't scale as well as you think. The Temporal team invented their own database to get around this limitation.
ub-volta-toss
11 days ago
Would love to hear more about the scale issues you saw. How many workflows or actions was too many? Which components started breaking down, and what were their failure modes?
claytonjy
11 days ago
See above. It's not so straightforward. You need enough headroom on each component that a negative feedback loop can start, eat resources, and have enough time and resources to calm itself before hitting some limit or degrading itself further.
ub-volta-toss
4 days ago
Can you tell us more about your scaling issues with Temporal?
I haven't yet used it in production, but I would've expected that a system which evolved out of Uber's Cadence [0] (and which I believe is used at Uber extensively) would've scaled very well.
[0]: https://stackoverflow.com/a/61281435/1579058
wojcikstefan
5 days ago
I'm not sure how Uber does it, but it might be because they're using Cadence instead.
The Temporal team has acknowledged that Cassandra-backed Temporal hits scaling limits pretty fast.
The limitations aren't a clean "X actions/sec", they're sneakier. Because you can run X/sec for days and then the memory on the history service will spike, or any tiny slowdown in the DB will cause looping degradation. There are nasty feedback loops hidden in Temporal that turn small problems into very very large problems.
I think the core problem with Temporal is the way it's sharded. This affects the history service and its caches. If anything tells them to reload or restart, or that any of the nodes are unreachable, you get a retry storm on the DB.
In addition to these issues, Temporal can create feedback loops within itself. I've seen cases where it would not return to health, even with 0 workers requesting work for 10s of minutes.
We could have kept using and scaling Temporal, but it required 10-30x the resources of building something else. And it was scary to administer. You really need an entire team. You can't have somebody who isn't a dedicated engineer take on-call for it.
ub-volta-toss
4 days ago
What did you move to instead?
claytonjy
3 days ago
Invented their own database? They use Cassandra IIRC
NoThisIsMe
10 days ago
Nope. They hit the scale limit with Cassandra and now have an in-house storage layer
ub-volta-toss
4 days ago
As someone familiar with asyncio, I don't understand what this is or what it's for. What's an activity, workflow, or worker?
> See the asyncio.sleep in there? That’s no normal local-process sleep; that’s a durable timer backed by Temporal.
That's the normal asyncio.sleep. What does backed by Temporal mean? Reading further, it appears that Temporal is replacing the default asyncio event loop. I don't understand why every third party async Python library/framework feels the need to take over the default event loop instead of just building on top of it.
bmitc
11 days ago
Temporal is completely above and beyond asyncio. It's a full system for scheduling work and queues that's cross-machine, cross-language, and very transparent.
A workflow is the code that handles only deterministic actions and calls activities.
Activities are functions that do anything you want, typically affecting other systems with network or file calls.
A worker is the running process connected to Temporal with registered workflows & activities for it to pass work to.
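To make those three terms concrete, here is a minimal in-process sketch. It does not use the Temporal SDK at all; the `activity` and `workflow` decorators, the registries, and `run_worker` are hypothetical stand-ins invented for illustration, and a plain `queue.Queue` plays the role of the task queue.

```python
import queue
from typing import Callable, Dict

# Hypothetical in-process stand-ins; none of these names come from the Temporal SDK.
ACTIVITIES: Dict[str, Callable] = {}
WORKFLOWS: Dict[str, Callable] = {}


def activity(fn):
    """Register a function that is allowed to have side effects."""
    ACTIVITIES[fn.__name__] = fn
    return fn


def workflow(fn):
    """Register deterministic orchestration code."""
    WORKFLOWS[fn.__name__] = fn
    return fn


@activity
def send_email(address: str) -> str:
    # A real activity would talk to a mail server here.
    return f"sent to {address}"


@workflow
def onboard(address: str) -> list:
    # The workflow only decides which activities run and in what order.
    return [ACTIVITIES["send_email"](address)]


def run_worker(task_queue: queue.Queue) -> list:
    """Drain the queue, dispatching each task to a registered workflow."""
    results = []
    while not task_queue.empty():
        name, arg = task_queue.get()
        results.append(WORKFLOWS[name](arg))
    return results


tasks = queue.Queue()
tasks.put(("onboard", "new.hire@example.com"))
results = run_worker(tasks)
```

In the real system the queue lives on the Temporal server and workers on any machine poll it; the sketch only shows the division of roles.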
I'm doing a lot of work with alert handling and provisioning systems using Temporal. Temporal in two minutes is a great video explanation: https://www.youtube.com/watch?v=f-18XztyN6c
storyinmemo
11 days ago
It's like that old Joel Spolsky article[0] says:
> you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it’s a zillion times faster which means it can be used to tackle huge problems in an instant
If you can replace the thing that people use with a distributed version, then that can make it easy to write distributed code.
[0] https://www.joelonsoftware.com/2006/08/01/can-your-programmi...
robertlagrant
11 days ago
The thing with event loops in Python is that they are not a single, all-governing scheduler (as e.g. in the BEAM).
Event loops are instead a mid-layer concept that sits below other infrastructure such as threads and processes. And (perhaps somewhat frustratingly) it is not too uncommon to have multiple event loops in parallel. See for example the proxy.py project, which offers to run one async loop per process for a speedup.
As a result, there are some incentives to swap out the loop itself, e.g. for faster implementations like uvloop, because loops are somewhat pluggable anyway.
uniqueuid
11 days ago
Good design dictates that you start one loop and build the whole program around it, no? The docs for asyncio.run say as much.
blegr
11 days ago
Yes, that is good design and the event loop should basically be shared process-wide (asyncio objects are usually not thread safe and cannot be shared across event loops). Temporal only does custom event loops in isolated workflows.
kodablah
11 days ago
> all-governing scheduler (as e.g. in the BEAM).
Does this also mean they have preemptive multitasking like in BEAM?
mapcars
11 days ago
To add to other responses here, Temporal doesn't take over the default event loop in general (and users still use it for clients and activities and such). Temporal workflows must be deterministic and durable which means they are guaranteed to run and are resumable on other machines. Therefore Temporal workflows specifically operate on a custom event loop implementation. It doesn't affect anything outside the workflow.
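A rough way to picture "a custom event loop just for the workflow": whoever drives a coroutine decides what awaiting a sleep actually means. The sketch below is not how the Temporal SDK is implemented; `durable_sleep` and `drive` are hypothetical names, and the driver simply records timer requests instead of blocking.

```python
import types


@types.coroutine
def durable_sleep(seconds: float):
    # Hand a timer request to whoever drives the coroutine instead of blocking.
    yield ("timer", seconds)


async def reminder_workflow() -> str:
    await durable_sleep(3600)
    await durable_sleep(86400)
    return "done"


def drive(coro):
    """Minimal custom 'loop': step the coroutine and log each timer request."""
    timers = []
    try:
        while True:
            request = coro.send(None)
            # A durable system would persist the request and resume later,
            # possibly in another process; here we just record it.
            timers.append(request)
    except StopIteration as stop:
        return timers, stop.value


timers, result = drive(reminder_workflow())
```

Nothing here touches the process-wide asyncio loop, which is the point made above: the special behaviour is confined to the code being driven.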
kodablah
11 days ago
The OP seems targeted at devs who are already quite familiar with Temporal and are interested in using the new Python exposure.
FWIW, as someone who has never previously encountered Temporal, and has only a vague sense of the specific problem set it's trying to tackle and architectural approach it has taken, I find the post to be fairly impenetrable.
I'd love to read a proper introduction to Temporal by way of Python (and probably also by comparison to Celery).
davepeck
11 days ago
I think it means that your code could be resumed on a different machine.
adhamsalama
11 days ago
I just had to write a Python program that handled multiple async events - from a serial line, and with a tkinter GUI. The only way to make it 'truly' async was to handle the runloops myself, and add a separate queue, onto which I push coroutines, for processing. UI events and Serial I/O events (involving passing of messages to update states on both sides) all have to be pushed through the same mechanism in order to gain the functionality I need.
Sure, I 'could just use asyncio', or somehow work out how to crowbar serial i/o into tkinters' runloop. But in the end, writing my own just made more sense, and more importantly: it works great. I can have serial i/o and UI acting independently, but coordinating through a single queue .. this works so well.
>I don't understand why every third party async Python library/framework feels the need to take over the default event loop instead of just building on top of it.
Because you don't always have what you need to get the crowbar in place, nor big enough leverage to make space for what you have to do, asynchronously, in the app.
helpfulContrib
11 days ago
But that can indeed just be done with the standard `asyncio` loop. You run your GUI in a thread, run the `asyncio` event loop in its own thread, pass the `asyncio` loop messages with an `asyncio.Queue` and `asyncio.run_coroutine_threadsafe`, and then use `asyncio.to_thread` for the serial communication within the `asyncio` event loop.
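A minimal sketch of that arrangement, using only the standard library. Assumptions: the main thread stands in for the GUI thread (tkinter generally wants the main thread), `blocking_serial_read` is a hypothetical placeholder for a real blocking serial call, and `None` is used as a stop sentinel.

```python
import asyncio
import threading
import time


def blocking_serial_read() -> str:
    # Hypothetical stand-in for a blocking serial read.
    time.sleep(0.01)
    return "OK"


async def make_queue() -> asyncio.Queue:
    # Created inside the loop so the queue belongs to that loop.
    return asyncio.Queue()


async def consumer(inbox: asyncio.Queue, handled: list) -> None:
    while True:
        msg = await inbox.get()
        if msg is None:  # sentinel: stop consuming
            return
        reply = await asyncio.to_thread(blocking_serial_read)
        handled.append((msg, reply))


def start_loop(loop: asyncio.AbstractEventLoop) -> None:
    asyncio.set_event_loop(loop)
    loop.run_forever()


loop = asyncio.new_event_loop()
thread = threading.Thread(target=start_loop, args=(loop,), daemon=True)
thread.start()

handled: list = []
inbox = asyncio.run_coroutine_threadsafe(make_queue(), loop).result()
done = asyncio.run_coroutine_threadsafe(consumer(inbox, handled), loop)

# "GUI" side: never touch the queue directly, always go through the loop.
for msg in ("button-1", "button-2", None):
    asyncio.run_coroutine_threadsafe(inbox.put(msg), loop).result()

done.result(timeout=5)
loop.call_soon_threadsafe(loop.stop)
thread.join(timeout=5)
```

`asyncio.Queue` is not thread-safe, which is why every `put` from the other thread is wrapped in `asyncio.run_coroutine_threadsafe`.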
bmitc
11 days ago
Read the example code, and I have a sinking feeling that it is not taken from a real tested example. Either there are multiple unexplained symbols or the code does not actually run.
For example, in "Implementing a Workflow" the execute_activity refers to Purchaser.purchase, which is not declared anywhere.
If the execute_activity times out after 1 minute, the status does not seem to be updated anywhere.
In "Running a Worker", do_purchaser is passed as an activity, without explanation. (I guess I'd need to read the fundamental Temporal docs?)
pierrebai
11 days ago
Yes, it has undergone revisions since which caused function name mismatch (EDIT: fixed). The execute_activity there uses start_to_close_timeout which is per attempt and will retry forever by default (customizable).
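The "per attempt" distinction can be sketched with plain asyncio. This is not the SDK's implementation: `execute_with_retry` and `purchase` are hypothetical, and the sketch caps the number of attempts (where the default described above retries forever) so that it always terminates.

```python
import asyncio


async def execute_with_retry(make_attempt, attempt_timeout: float, max_attempts: int):
    """Hypothetical helper: the timeout bounds each attempt, not the total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(make_attempt(attempt), attempt_timeout)
        except asyncio.TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last timeout


async def purchase(attempt: int) -> str:
    # Simulated activity: the first two attempts hang, the third is quick.
    await asyncio.sleep(1.0 if attempt < 3 else 0.0)
    return f"purchased on attempt {attempt}"


result = asyncio.run(
    execute_with_retry(purchase, attempt_timeout=0.05, max_attempts=5)
)
```

Two attempts time out individually, and the call as a whole still succeeds on the third.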
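This is more of a primer on the Python part of Temporal rather than an explanation of all Temporal concepts in depth. Definitely would recommend reading the fundamental docs at https://docs.temporal.io/encyclopedia/. For more exact samples, see https://github.com/temporalio/samples-python.
kodablah
11 days ago
I like the concept.
The nice thing is that it abstracts the condition checks on whether something is done, has succeeded or should be retried.
The bad thing is that it abstracts the condition checks on whether something is done, has succeeded or should be retried.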
It's nice because that's something you do again and again, and that's a lot of code. A lot of ways it can go wrong.
But it's bad because that's a huge chunk of black box magic that may execute remotely. If you need a custom or more optimized behavior anywhere in this logic, you are done for. If there is a bug/problem in this logic, it's game over. I also have to imagine debugging and error reporting is likely not super fun.
One point in particular that strikes me is that idempotence is generically guaranteed with something like "has this task executed without error last time". But usually, what I want is something much more specific, like "have those entries been updated", "has that file been created" and so on. From a bird's-eye view, they look the same, but from a system reliability point of view, they are not at all the same.
Hard to see how they avoid duplicate results, overlapping tasks, etc.
I don't think they really can at that level of abstraction, which means you need to implement it manually.
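For what "implement it manually" can look like, here is a minimal sketch of a task that checks the specific end state ("has that file been created, with this content") rather than "did I run before". `write_report` is a hypothetical example task, not part of any framework.

```python
import tempfile
from pathlib import Path


def write_report(path: Path, body: str) -> bool:
    """Return True if this call did the work, False if it was already done."""
    if path.exists() and path.read_text() == body:
        return False  # the desired end state already holds; skip
    path.write_text(body)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "report.txt"
    first = write_report(target, "Q1 totals")
    second = write_report(target, "Q1 totals")  # e.g. a retry after a lost ack
    final = target.read_text()
```

Running it twice leaves the same result as running it once, regardless of what any scheduler believes about the first run.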
In the end, it seems like a huge dep to bring in for the actual practical problem it really solves well.
But I'm willing to be proven wrong on this one, because the tech is really damn cool.
BiteCode_dev
11 days ago
It is interesting seeing the comments here. The comments from adopters say there is a lot of value but that it takes time to get up to speed. Those new to Temporal have a lot of questions, seeking understanding.
I have spent a lot of time in the adjacent space of event-driven systems, and there, like here, it seems like one of the biggest challenges is just education.
I wouldn't say that EDA or workflow-based systems are preferable to traditional API services with DBs, just that the space API services occupy in the industry is so large that I think it is really, really hard to introduce any different paradigm, even when you focus on domains where API services aren't a great fit (like here, with long-running, complex operations).
My point with this comment is simply that I think if you are trying to build anything that does things differently, developer education is as important or even more important than design and architecture, but often not considered because those building these systems are already so deep into it that they can't approach the problem as an outsider.
addisonj
11 days ago
I love Temporal-- we use it at my company. It's very very good for our use case, but took a while to understand how to use it. We're still figuring things out (Workflow versioning is one thing we suck at still).
That said, I'm not sure why this post from 2023 was posted here today. There've been multiple updates to the Python SDK since this post.
amackera
11 days ago
> Workflow versioning is one thing we suck at still
well, it's not entirely your fault :) what practices have you adopted now that you have some experience with it?
lower down OP mentions that they got the link from a HN discussion on asyncio 2 days ago https://news.ycombinator.com/item?id=40287354 . i guess the upvotes are today's lucky 10,000 learning about it for the first time.
swyx
11 days ago
Relying on external APIs or databases within activities might lead to variability in workflow execution.
Also, handling HTTP errors in activities by raising an "ApplicationError" based on the status code might simplify error handling, but might need to see how it accounts for more complex scenarios where errors are transient or where a retry could be successful even for some client errors like rate limiting or temporary unavailability etc.
As the asyncio library itself has a steep learning curve, and Temporal also uses Python's native asynchronous features, developers integrating the two should be careful about indirect or subtle bugs, especially in error handling and task management.
avi_vallarapu
11 days ago
> Relying on external APIs or databases within activities might lead to variability in workflow execution.
This is why they are activities. Their results are stored in history, the workflow remains deterministic.
> might need to see how it accounts for more complex scenarios where errors are transient or where a retry could be successful even for some client errors like rate limiting or temporary unavailability etc.
Temporal allows you to specify whether an error is retryable or not.
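The idea of marking some failures as not worth retrying can be sketched without the SDK. Everything here is hypothetical: the `NonRetryableError` class, the policy that 429 and 5xx are transient while other 4xx are permanent, and the list of canned status codes standing in for real HTTP calls.

```python
class NonRetryableError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def classify(status: int) -> Exception:
    # Assumed policy: 429 and 5xx are transient; other 4xx are permanent.
    if status == 429 or status >= 500:
        return RuntimeError(f"transient HTTP {status}")
    return NonRetryableError(f"permanent HTTP {status}")


def call_with_retry(responses, max_attempts: int = 5):
    """Walk canned status codes as if each were one attempt at the call."""
    attempts = 0
    for status in responses:
        attempts += 1
        if status < 400:
            return "ok", attempts
        error = classify(status)
        if isinstance(error, NonRetryableError) or attempts == max_attempts:
            return type(error).__name__, attempts
    return "exhausted", attempts


rate_limited = call_with_retry([429, 503, 200])
bad_request = call_with_retry([400, 200])
```

A rate-limited call is retried until it succeeds, while a bad request stops after the first attempt.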
kodablah
11 days ago
Credit: @kodablah and @chippiewill, thanks for turning me into this!
The "API isn’t Pythonic" examples are misleading, the first and third are using more verbose forms of:
add.delay(1, 2)
The verbose forms are for when you want extra functionality like in the second example.
It's relatively small compared to the other issues but it sticks out because it's one of only two listed as "you'll have to live with it".
Izkata
11 days ago
(to clarify my ambiguous disclaimer, I am the author of OP's Temporal post, not the Celery one)
kodablah
11 days ago
We migrated from an in-house redis queuing system.
Temporal has its own way of doing things; there are rules about what you can and can't do in workflows, what has to live in activities, etc. It's generally quite easy to adapt existing code to work with it. We use TypeScript.
The worst part for us has been error/anomaly handling. Workflows can sometimes hit a state where the status reads in progress and errors aren't reported anywhere except buried in the event log; which surfaces great in the UI but we still haven't figured out how to programmatically respond to this condition.
A good example is: we use a home-grown version of this [1] to proxy large payloads to S3. However, if those payloads get REALLY large, they can take some time to upload and download; and if that "some time" is longer than 5 seconds, the control plane will believe that the worker has died, it won't reschedule, and the workflow just sits in In Progress. There's always a beautiful error on the temporal dashboard, and we can manually terminate/retry, but the world just seems to die when this happens and we can't do error-level cleanup stuff like alert the user that the thing they were doing didn't finish.
Temporal is also challenging to get support for. It's new, open source, we don't pay for Temporal Cloud, and there's not a ton of resources or people using it. The documentation is quite bad (if you like 500,000 word pages, codegen'd library sites with no comments, and one example for each feature, you'll like their documentation). Given we run our own Temporal cluster, we've also had pretty large challenges in the self-hosting world. We work through them, usually after deep-diving into the Temporal server code itself, but there's startlingly little documentation on self-hosting, and even less community support.
Overall, we don't regret adopting it, but if we had a time machine we wouldn't do it again. I feel it makes a series of sacrifices in order to create a system that has extremely high standards for processing, like financial/bank/healthcare level stuff. But, not only are we not building that, but the system has never behaved in a way which makes me think I'd even want to use it if I worked in those industries. Obviously I feel like I'm the one in the wrong here, and I'm sure it's just a matter of "we screwed up something somewhere", but that leads back to: bad documentation, no way to get professional support without being on their cloud, and a lack of community support.
If I'm being honest: if self-hosting is a big issue but its value to developers is obvious and apparent, why not pay?
no_wizard
11 days ago
Nah, we'd probably be fine paying Temporal Cloud to host the control plane. Their billing is a little weird; I know quite a bit about Temporal-the-technology, and the pricing page is literally the first time I've ever seen the word "action" used. I'm familiar with workflows, activities, sinks, codecs, events, but not actions; so when they bill $N/million actions I have no idea what that means, and it's surprising to me that that's how they bill it. But I'm sure there's an answer somewhere.
Temporal Cloud is really, really new. Like, it was in some kind of closed beta for a while, with a "contact us" form, as recently as a couple months ago? So, the main reason we don't use it is because it simply wasn't available. It looks like it's more widely available now though.
015a
11 days ago
I'd question if you really need to distribute work across machines. It's great this makes distributed systems easier to write but it's much better to reject the premise and avoid writing them in the first place.
siliconc0w
11 days ago
Distributing the work across machines essentially comes for free with the ability to replay workflows from any step. You could run all of your worker processes on a single machine if it had enough capacity, but resuming a workflow on a different machine is transparent to your workloads assuming there's no local state.
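A toy sketch of why resuming elsewhere falls out of replay: if every completed step's result is recorded in a history, any process that can load the history can skip what is already done. This is an illustration under simplifying assumptions, not Temporal's mechanism; `run_workflow`, the step names, and the JSON "storage" are all hypothetical, and the crash is simulated by running only the first two steps.

```python
import json


def run_workflow(steps, history):
    """Run steps in order, skipping any whose result is already in history."""
    executed = []
    for index, (name, fn) in enumerate(steps):
        if index < len(history):
            continue  # already done: the recorded result stands
        history.append({"step": name, "result": fn()})
        executed.append(name)
    return executed


steps = [
    ("create_account", lambda: "acct-1"),
    ("ship_laptop", lambda: "tracking-42"),
    ("send_survey", lambda: "survey-sent"),
]

# "Machine A" completes two steps, then crashes; only the history survives.
history_a = []
ran_on_a = run_workflow(steps[:2], history_a)
saved = json.dumps(history_a)

# "Machine B" loads the history and resumes without redoing earlier steps.
history_b = json.loads(saved)
ran_on_b = run_workflow(steps, history_b)
```

This only works when the steps keep no local state outside the history, which is the caveat in the comment above.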
dandandan
11 days ago
Is this equal to Azure Durable Functions?
7bit
11 days ago
Not necessarily "equal" but the basic premise is the same, yes, and there is a common lineage. Azure Durable Functions sits on Azure Durable Task Framework which was created by the co-founder of Temporal (https://temporal.io/about).
(disclaimer, I'm the author of the post)
kodablah
11 days ago
Ohh, that's great to hear! I do like ADF, but the Python worker is full of bugs and weird behaviour and tickets stay open for months without progress. I will definitely check that out!
7bit
11 days ago
During an evaluation I found a bug in their library. I went to their Slack, posted about it, and they gave a workaround in 15 minutes, created an issue in 30, and had a bugfix PR ready the next day. Pretty impressed with their team.
Excited to get to use it at some point.
rjbwork
11 days ago
Working on a freelance job 3 years ago, I got sucked down a rabbit-hole for months trying to get Azure Durable Functions to work. Too many bugs, no visibility into its workings and the worker would always grind to a halt. That job did not go well.
Avoid ADF (let alone Azure) until you've got some innovation tokens to spend, or go with another provider.
pm
11 days ago
It got much better. The internals are explained well enough, although there is room for improvement. The bugs can be worked around. They're super annoying, but they're the lesser evil when compared to what ADF brings to the table. However, I'm looking into Temporal to see if that's the better app.
7bit
10 days ago
I can't understand what layer provides the state orchestration. Like, in Celery it's Redis. What about here?
sscarduzio
11 days ago
The Temporal server stores events and distributes tasks. There is a cloud offering or it can be self-hosted (with support for Cassandra, Postgres, MySQL, and SQLite persistence). This post focuses more on the Temporal Python SDK and not the general platform.
kodablah
11 days ago
Could you or anyone else with experience with Temporal share how hard it is to self-host in practice? Like, is this more like Redis (self-hosting is trivial) or Supabase (nominally self-hostable, but if you try to do it you'll quickly realize it's a pain and the happy path is to use their hosted platform).
kcorbitt
11 days ago
We offer a full guide to help here at https://docs.temporal.io/self-hosted-guide and many users of all sizes self-host Temporal. Having said that, it has challenges, as does running any highly available production system. We offer cloud to ease this burden. You still run all your code/workers and you can end-to-end encrypt all data.
kodablah
11 days ago
Is this in the same space as Windmill.dev? Love to hear from anyone with experience with both.
darkteflon
10 days ago
I find Temporal itself to be effectively a clone of Amazon's Simple Workflow (SWF)
fortylove
11 days ago
That's no coincidence, Temporal is founded by the creators of Amazon Simple Workflow. See https://temporal.io/about.
kodablah
11 days ago
Isn't this just threads but with more surprise gotos?
KaiserPro
11 days ago
no, the whole point of temporal is to distribute work across machines, but without worrying too much about the orchestration.
workflows and activities are called remotely, and you can have an autoscaled worker pool handling these calls.
you can retry any unit easily on failure and specify the non retryable errors. What it requires in exchange is full determinism - the same input should produce the same activities in the same order, as a good starting point.
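The determinism requirement can be illustrated with a toy check: run the orchestration logic several times with the same input and compare the sequence of activities it would schedule. The names (`commands_for`, `is_deterministic`) and the fake clock are hypothetical; real replay checks compare against recorded history rather than re-running side by side.

```python
import itertools


def commands_for(order_id: int, clock=None) -> list:
    """Return the activity names a toy workflow would schedule, in order."""
    commands = ["reserve_stock", "charge_card"]
    if order_id % 2 == 0:
        commands.append("gift_wrap")  # depends only on the input: fine
    if clock is not None and next(clock) % 2 == 0:
        commands.append("send_coupon")  # depends on outside state: not fine
    return commands


def is_deterministic(make_commands, runs: int = 5) -> bool:
    first = make_commands()
    return all(make_commands() == first for _ in range(runs))


fake_clock = itertools.count()  # stands in for wall-clock time or randomness
stable = is_deterministic(lambda: commands_for(42))
unstable = is_deterministic(lambda: commands_for(42, fake_clock))
```

Branching on the input is repeatable; branching on anything read from outside the workflow is what has to move into an activity.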
I found out about Temporal 2+y ago now and were early adopters of their cloud. It's a bit of a paradigm shift when you start using it, but it is amazing at solving some types of problems in a very simple manner. There are some trade-offs, there always are, for us that was migrating long lived workflows. But the resulting simplicity and maintainability of the code has been great. One thing that was hard with Temporal was to "sell it" to business leaders because it's not a turn key solution, it's more like a piece of infra for engineers to build on top of. Kind of a higher level database-queue-workflow engine thing that simplifies work for engineers.
In short we were working on automating things like onboarding a new employee, which involves creating accounts for their saas apps, buying and shipping their device, email confirmations, satisfactions surveys etc. So a workflow could last up to 3 months with some fully automated systems, and some that required integrating with people (listening to jira event to trigger things, etc).
The error handling was the thing that sold me on Temporal, because things can break just about anywhere in unpredictable ways (not just code, can be process, employee quits during the onboarding, customer is out of licenses etc), so we need everything to be robust and be fixable by a person. With homegrown queue based systems or with BPEL it can be hard handling these situations (what if you need to roll back 3 steps?). With code you can use exceptions, write unit tests etc. We use the typescript sdk, promises made it very intuitive to code even some otherwise complicated scenarios (say event listeners etc).
troebr
11 days ago
Temporal is really neat but I think its marketed at too many use cases.
After a year of high-scale Temporal work, I found it was only good for low-scale work.
The onboarding and learning curve were insanely difficult and complex. Ultimately it doesn't scale as well as you think. The temporal team invented their own database to get around this limitation.
ub-volta-toss
11 days ago
Would love to hear more about the scale issues you saw. How many workflows or actions was too many? which components started breaking down, what were their failure modes?
claytonjy
11 days ago
See above. Its not so straightforward. You need enough headroom on each component that a negative feedback loop can start, eat resources, and have enough time and resources to calm itself before hitting some limit or degrading itself further
ub-volta-toss
4 days ago
Can you tell us more about your scaling issues with Temporal?
I haven't yet used it in production, but I would've expected that a system which evolved out of Uber's Cadence [0] (and which I believe is used at Uber extensively) would've scaled very well.
[0]: https://stackoverflow.com/a/61281435/1579058
wojcikstefan
5 days ago
I'm not sure how Uber does it, but it might be because they're using Cadence instead.
The Temporal team has acknowledged that Cassandra-backed Temporal hits scaling limits pretty fast.
The limitations aren't a clean "X actions/sec", they're sneakier. Because you can run X/sec for days and then the memory on the history service will spike, or any tiny slowdown in the DB will cause looping degradation. There are nasty feedback loops hidden in Temporal that turn small problems into very very large problems.
I think the core problem with Temporal is the way its sharded. This affects history service and its caches. If anything tells them to reload or restart, or that any of the nodes are unreachable, you get a retry storm on the DB.
In addition to these issues, Temporal can create feedback loops within itself. I've seen cases where it would not return to health, even with 0 workers requesting work for 10s of minutes.
We could have kept using and scaling Temporal, but it required 10-30x the resources of building something else. And it was scary to administer. You really need an entire team. You can't have somebody who isn't a dedicated engineer take on-call for it.
ub-volta-toss
4 days ago
What did you move to instead?
claytonjy
3 days ago
Invented their own database? They use Cassandra IIRC
NoThisIsMe
10 days ago
Nope. They hit the scale limit with Cassandra and now have an in-house storage layer
ub-volta-toss
4 days ago
As someone familiar with asyncio, I don't understand what this is or what it's for. What's an activity, workflow, or worker?
> See the asyncio.sleep in there? That’s no normal local-process sleep; that’s a durable timer backed by Temporal.
That's the normal asyncio.sleep. What does backed by Temporal mean? Reading further, it appears that Temporal is replacing the default asyncio event loop. I don't understand why every third party async Python library/framework feels the need to take over the default event loop instead of just building on top of it.
bmitc
11 days ago
Temporal is completely above and beyond Asyncio. It's a full scheduling of work and queues that's cross-machine, cross-language, and very transparent.
A workflow is the code that handles only deterministic actions and calls activities.
Activities are functions that do anything you want, typically affecting other systems with network or file calls.
A worker is the running process connected to Temporal with registered workflows & activities for it to pass work to.
I'm doing a lot of work with alert handling and provisioning systems using Temporal. Temporal in two minutes is a great video explanation: https://www.youtube.com/watch?v=f-18XztyN6c
storyinmemo
11 days ago
It's like that old Joel Spolsky article[0] says:
> you only have to get one supergenius to write the hard code to run map and reduce on a global massively parallel array of computers, and all the old code that used to work fine when you just ran a loop still works only it’s a zillion times faster which means it can be used to tackle huge problems in an instant
If you can replace the thing that people use with a distributed version, then that can make it easy to write distributed code.
[0] https://www.joelonsoftware.com/2006/08/01/can-your-programmi...
robertlagrant
11 days ago
The thing with event loops in python is that they are not a single, all-governing scheduler (as e.g. in the BEAM).
ev loops instead are a mid-layer concept that sits below other infrastructure such as threads and processes. And (perhaps somewhat frustratingly) it is not too uncommon to have multiple ev loops in parallel. See for example the proxy.py project, which offers to run one async loop per process for a speedup.
As a result, there are some incentives to swap out the loop itself, e.g. for faster implementations like uvloop, because they are somewhat pluggable anyways.
uniqueuid
11 days ago
Good design dictates that you start one loop and build the whole program around it, no? The docs for asyncio.run say as much.
blegr
11 days ago
Yes, that is good design and the event loop should basically be shared process-wide (asyncio objects are usually not thread safe and cannot be shared across event loops). Temporal only does custom event loops in isolated workflows.
kodablah
11 days ago
> all-governing scheduler (as e.g. in the BEAM).
Does this also mean they have preemptive multitasking like in BEAM?
mapcars
11 days ago
To add to other responses here, Temporal doesn't take over the default event loop in general (and users still use it for clients and activities and such). Temporal workflows must be deterministic and durable which means they are guaranteed to run and are resumable on other machines. Therefore Temporal workflows specifically operate on a custom event loop implementation. It doesn't affect anything outside the workflow.
kodablah
11 days ago
The OP seems targeted at devs who are already quite familiar with Temporal and are interested in using the new Python exposure.
FWIW, as someone who has never previously encountered Temporal, and has only a vague sense of the specific problem set it's trying to tackle and architectural approach it has taken, I find the post to be fairly impenetrable.
I'd love to read a proper introduction to Temporal by way of Python (and probably also by comparison to Celery).
davepeck
11 days ago
I think it means that your code could be resumed on a different machine.
adhamsalama
11 days ago
I just had to write a Python program that handled multiple async events - from a serial line, and with a tkinter GUI. The only way to make it 'truly' async was to handle the runloops myself, and add a separate queue, onto which I push coroutines, for processing. UI events and Serial I/O events (involving passing of messages to update states on both sides) all have to be pushed through the same mechanism in order to gain the functionality I need.
Sure, I 'could just use asyncio', or somehow work out how to crowbar serial i/o into tkinters' runloop. But in the end, writing my own just made more sense, and more importantly: it works great. I can have serial i/o and UI acting independently, but coordinating through a single queue .. this works so well.
>I don't understand why every third party async Python library/framework feels the need to take over the default event loop instead of just building on top of it.
Because you don't always have what you need to get the crowbar in place, nor big enough leverage to make space for what you have to do, asynchronously, in the app.
helpfulContrib
11 days ago
But that can indeed just be done with the standard `asyncio` loop. You run your GUI in a thread, run the `asyncio` event loop in its own thread, pass the `asyncio` loop messages with an `asyncio.Queue` and `asyncio.run_coroutine_threadsafe`, and then use `asyncio.to_thread` for the serial communication within the `asyncio` event loop.
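A minimal sketch of that layout, stdlib only. Assumptions: the main thread stands in for the GUI thread (no real tkinter window), `blocking_serial_read` is a made-up placeholder for a real blocking serial call, and `None` is used as a shutdown sentinel.

```python
import asyncio
import threading
import time


def blocking_serial_read() -> str:
    """Hypothetical stand-in for a blocking serial read (no real port)."""
    time.sleep(0.01)
    return "serial-line"


async def consumer(queue: asyncio.Queue, results: list) -> None:
    """Runs on the asyncio thread; UI and serial messages share one queue."""
    while True:
        msg = await queue.get()
        if msg is None:  # sentinel: stop consuming
            break
        if msg == "read-serial":
            # Blocking I/O goes to a worker thread so the loop stays responsive.
            results.append(await asyncio.to_thread(blocking_serial_read))
        else:
            results.append(f"ui:{msg}")


def run_demo() -> list:
    results: list = []
    loop = asyncio.new_event_loop()
    loop_thread = threading.Thread(target=loop.run_forever, daemon=True)
    loop_thread.start()

    async def setup():
        # Create the queue and the consumer task on the loop's own thread.
        queue = asyncio.Queue()
        task = asyncio.create_task(consumer(queue, results))
        return queue, task

    queue, task = asyncio.run_coroutine_threadsafe(setup(), loop).result()

    # The "GUI" thread hands messages to the loop in a thread-safe way.
    for msg in ("click", "read-serial", None):
        asyncio.run_coroutine_threadsafe(queue.put(msg), loop).result()

    async def wait_for_consumer():
        await task

    asyncio.run_coroutine_threadsafe(wait_for_consumer(), loop).result(timeout=5)
    loop.call_soon_threadsafe(loop.stop)
    loop_thread.join(timeout=5)
    loop.close()
    return results
```

In a real app the tkinter callbacks would make the `run_coroutine_threadsafe` calls, and results would go back to the GUI through whatever thread-safe mechanism the toolkit offers.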
bmitc
11 days ago
Read the example code and have a sinking feeling that it is not taken from a real, tested example. Either there are multiple unexplained symbols or the code does not actually run.
For example, in "Implementing a Workflow" the execute_activity refers to Purchaser.purchase, which is not declared anywhere.
If the execute_activity times out after 1 minute, the status does not seem to be updated anywhere.
In "Running a Worker", do_purchaser is passed as an activity, without explanation. (I guess I'd need to read the fundamental Temporal docs?)
pierrebai
11 days ago
Yes, it has undergone revisions since, which caused a function name mismatch (EDIT: fixed). The execute_activity there uses start_to_close_timeout, which applies per attempt; the activity will retry forever by default (customizable).
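To illustrate what "per attempt, retry forever by default" means in practice, here is a rough stdlib-only sketch. It is not how the SDK implements it; `run_with_retries`, `flaky_purchase` and the parameter names are made up for the example.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_with_retries(
    make_attempt: Callable[[], Awaitable[str]],
    per_attempt_timeout: float,
    max_attempts: Optional[int] = None,  # None means keep retrying
    backoff: float = 0.0,
) -> str:
    attempt = 0
    while True:
        attempt += 1
        try:
            # The timeout bounds a single attempt, not the whole operation.
            return await asyncio.wait_for(make_attempt(), per_attempt_timeout)
        except asyncio.TimeoutError:
            if max_attempts is not None and attempt >= max_attempts:
                raise
            await asyncio.sleep(backoff)


def retry_demo() -> tuple:
    calls = {"n": 0}

    async def flaky_purchase() -> str:
        calls["n"] += 1
        # The first two attempts hang past the timeout; the third returns.
        await asyncio.sleep(1.0 if calls["n"] < 3 else 0.0)
        return "purchased"

    result = asyncio.run(
        run_with_retries(flaky_purchase, per_attempt_timeout=0.05)
    )
    return result, calls["n"]
```

The point is that a timed-out attempt does not fail the caller; it just triggers another attempt unless a cap is configured.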
This is more of a primer on the Python part of Temporal rather than an explanation of all Temporal concepts in depth. Definitely would recommend reading the fundamental docs at https://docs.temporal.io/encyclopedia/. For more exact samples, see https://github.com/temporalio/samples-python.
kodablah
11 days ago
I like the concept.
The nice thing is that it abstracts the condition checks on whether something is done, has succeeded or should be retried.
The bad thing is that it abstracts the condition checks on whether something is done, has succeeded or should be retried.
It's nice because that's something you do again and again, and that's a lot of code. A lot of ways it can go wrong.
But it's bad because that's a huge chunk of black box magic that may execute remotely. If you need a custom or more optimized behavior anywhere in this logic, you are done for. If there is a bug/problem in this logic, it's game over. I also have to imagine debugging and error reporting is likely not super fun.
One point in particular that strikes me is that idempotence is generically guaranteed with something like "has this task executed without error last time". But usually, what I want is something much more specific, like "has that entry been updated", "has that file been created" and so on. From a bird's-eye view, they look the same, but from a system reliability point of view, they are not at all the same.
Hard to see how they avoid duplicate results, overlapping tasks, etc.
I don't think they really can at that level of abstraction, which means you need to implement it manually.
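For what it's worth, the kind of outcome-specific check meant here ("has that file been created") might look roughly like this when done by hand. This is a sketch with made-up names (`write_report_idempotent`, `report.txt`), not anything Temporal provides.

```python
import os
import tempfile


def write_report_idempotent(path: str, content: str) -> bool:
    """Write the file only if the desired outcome is not already in place.

    Returns True if work was done, False if the step was skipped.
    """
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == content:
                return False  # outcome already present, safe to skip
    # Write to a temp file, then rename atomically, so a crash mid-write
    # never leaves a partial file that a later check could misread.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(content)
    os.replace(tmp_path, path)
    return True


def idempotence_demo() -> list:
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "report.txt")
        first = write_report_idempotent(target, "done")
        second = write_report_idempotent(target, "done")  # simulated retry
        return [first, second]
```

The check is on the state of the world, not on "did the previous run finish without error", so a retry after a crash is harmless.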
Eventually it seems it's a huge dep to bring in for the actual practical problem it really solves well.
But I'm willing to be proven wrong on this one, because the tech is really damn cool.
BiteCode_dev
11 days ago
It is interesting seeing the comments here: the comments from adopters say there is a lot of value but that it takes time to get up to speed. Those new to Temporal have a lot of questions seeking understanding.
I have spent a lot of time in the adjacent space of event driven systems and there, like here, it seems like some of the biggest challenge is just education.
I wouldn't say that EDA or workflow based systems are preferable to traditional API services with DBs, just that the space they occupy in the industry is so large that I think it is really really hard to introduce any different paradigms, even when you focus on domains where API services aren't a great fit (like here with long running, complex operations).
My point with this comment is simply that I think if you are trying to build anything that does things differently, developer education is as important or even more important than design and architecture, but often not considered because those building these systems are already so deep into it that they can't approach the problem as an outsider.
addisonj
11 days ago
I love Temporal-- we use it at my company. It's very very good for our use case, but took a while to understand how to use it. We're still figuring things out (Workflow versioning is one thing we suck at still).
That said, I'm not sure why this post from 2023 was posted here today. There've been multiple updates to the Python SDK since this post.
amackera
11 days ago
> Workflow versioning is one thing we suck at still
well it's not entirely your fault :) what practices have you adopted now that you have some experience with it?
lower down OP mentions that they got the link from a HN discussion on asyncio 2 days ago https://news.ycombinator.com/item?id=40287354 . i guess the upvotes are today's lucky 10,000 learning about it for the first time.
swyx
11 days ago
Relying on external APIs or databases within activities might lead to variability in workflow execution.
Also, handling HTTP errors in activities by raising an "ApplicationError" based on the status code might simplify error handling, but one might need to see how it accounts for more complex scenarios where errors are transient or where a retry could be successful even for some client errors like rate limiting or temporary unavailability etc.
As the asyncio library itself has a steep learning curve, when integrating asyncio with workflow systems like Temporal that also use Python's native asynchronous features, developers should be careful about indirect or subtle bugs, especially in error handling and task management.
avi_vallarapu
11 days ago
> Relying on external APIs or databases within activities might lead to variability in workflow execution.
This is why they are activities. Their results are stored in history, the workflow remains deterministic.
> might need to see how it accounts for more complex scenarios where errors are transient or where a retry could be successful even for some client errors like rate limiting or temporary unavailability etc.
Temporal allows you to specify whether an error is retryable or not.
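As a rough illustration of the retryable vs. non-retryable split for HTTP status codes: `StepError` below is a hypothetical stand-in defined locally, not the SDK's exception class, and the choice of 408 and 429 as retryable client errors is an assumption for the example.

```python
from typing import Sequence, Tuple


class StepError(Exception):
    """Hypothetical error type carrying a retryability flag."""

    def __init__(self, message: str, non_retryable: bool = False):
        super().__init__(message)
        self.non_retryable = non_retryable


# Client errors where a later attempt may still succeed (assumed set).
RETRYABLE_CLIENT_ERRORS = {408, 429}


def error_for_status(status: int) -> StepError:
    if 400 <= status < 500 and status not in RETRYABLE_CLIENT_ERRORS:
        # Most 4xx responses will not change on retry.
        return StepError(f"HTTP {status}", non_retryable=True)
    # 5xx and the listed 4xx codes are treated as transient.
    return StepError(f"HTTP {status}", non_retryable=False)


def run_step(statuses: Sequence[int], max_attempts: int = 5) -> Tuple[str, int]:
    """Simulate attempts; `statuses` are the codes successive calls return."""
    attempts = 0
    for status in statuses[:max_attempts]:
        attempts += 1
        if status < 400:
            return ("ok", attempts)
        if error_for_status(status).non_retryable:
            return ("failed", attempts)
    return ("gave up", attempts)
```

So a 429 followed by a 503 followed by a 200 succeeds on the third attempt, while a 404 fails immediately without burning retries.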
kodablah
11 days ago
Credit: @kodablah and @chippiewill, thanks for turning me on to this!
https://news.ycombinator.com/item?id=40282650
metadat
13 days ago
Anyone migrated from celery, with / without regrets?
benakh
11 days ago
Many Temporal users used Celery in the past. There was a popular blog post a while back about issues with celery: https://steve.dignam.xyz/2023/05/20/many-problems-with-celer.... Here's a brief heading-by-heading listing of how Temporal addresses those issues: https://community.temporal.io/t/suggestion-for-blog-post-abo....
(disclaimer, I'm the author of the post)
kodablah
11 days ago
The "API isn’t Pythonic" examples are misleading, the first and third are using more verbose forms of:
The verbose forms are for when you want extra functionality like in the second example. It's relatively small compared to the other issues but it sticks out because it's one of only two listed as "you'll have to live with it".
Izkata
11 days ago
(to clarify my ambiguous disclaimer, I am the author of OP's Temporal post, not the Celery one)
kodablah
11 days ago
We migrated from an in-house redis queuing system.
Temporal has its own way of doing things; there's rules about what you can and can't do in workflows, what has to live in activities, etc. It's generally quite easy to adapt existing code to work with it. We use typescript.
The worst part for us has been error/anomaly handling. Workflows can sometimes hit a state where the status reads in progress and errors aren't reported anywhere except buried in the event log; which surfaces great in the UI but we still haven't figured out how to programmatically respond to this condition.
A good example is: we use a home-grown version of this [1] to proxy large payloads to S3. However, if those payloads get REALLY large, they can take some time to upload and download; and if that "some time" is longer than 5 seconds, the control plane will believe that the worker has died, it won't reschedule, and the workflow just sits in In Progress. There's always a beautiful error on the temporal dashboard, and we can manually terminate/retry, but the world just seems to die when this happens and we can't do error-level cleanup stuff like alert the user that the thing they were doing didn't finish.
Temporal is also challenging to get support for. It's new, open source, we don't pay for temporal cloud, and there's not a ton of resources or people using it. The documentation is quite bad (if you like 500,000 word pages, codegen'd library sites with no comments, and one example for each feature, you'll like their documentation). Given we run our own temporal cluster, we've also had pretty large challenges in the self-hosting world. We work through them, usually after deep-diving into the temporal server code itself, but there's startlingly little documentation on self-hosting, and even less community support.
Overall, we don't regret adopting it, but if we had a time machine we wouldn't do it again. I feel it makes a series of sacrifices in order to create a system that has extremely high standards for processing, like financial/bank/healthcare level stuff. But, not only are we not building that, but the system has never behaved in a way which makes me think I'd even want to use it if I worked in those industries. Obviously I feel like I'm the one in the wrong here, and I'm sure it's just a matter of "we screwed up something somewhere", but that leads back to: bad documentation, no way to get professional support without being on their cloud, and a lack of community support.
[1] https://github.com/DataDog/temporal-large-payload-codec
015a
11 days ago
Would you not like it if you didn’t self host?
If I’m being honest, if it is a big issue to self-host but its value to developers is obvious and apparent, why not pay?
no_wizard
11 days ago
Nah we'd probably be fine paying temporal cloud to host the control plane. Their billing is a little weird; I know quite a bit about temporal-the-technology, and the pricing page is literally the first time I've ever seen the word "action" used. I'm familiar with workflows, activities, sinks, codecs, events, but not actions; so when they bill $N/million actions I have no idea what that means, and it's surprising to me that that's how they bill it. But I'm sure there's an answer somewhere.
Temporal Cloud is really, really new. Like, it was in some kind of closed beta for a while, with a "contact us" form, as recently as a couple months ago? So, the main reason we don't use it is because it simply wasn't available. It looks like it's more widely available now though.
015a
11 days ago
I'd question if you really need to distribute work across machines. It's great this makes distributed systems easier to write but it's much better to reject the premise and avoid writing them in the first place.
siliconc0w
11 days ago
Distributing the work across machines essentially comes for free with the ability to replay workflows from any step. You could run all of your worker processes on a single machine if it had enough capacity, but resuming a workflow on a different machine is transparent to your workloads assuming there's no local state.
dandandan
11 days ago
Is this equal to Azure Durable Functions?
7bit
11 days ago
Not necessarily "equal" but the basic premise is the same, yes, and there is a common lineage. Azure Durable Functions sits on Azure Durable Task Framework which was created by the co-founder of Temporal (https://temporal.io/about).
(disclaimer, I'm the author of the post)
kodablah
11 days ago
Ohh, that's great to hear! I do like ADF, but the Python worker is full of bugs and weird behaviour and tickets stay open for months without progress. I will definitely check that out!
7bit
11 days ago
During an evaluation I found a bug in their library. I went to their Slack, posted about it, and they gave a workaround in 15 minutes, created an issue in 30, and had a bugfix PR ready the next day. Pretty impressed with their team.
Excited to get to use it at some point.
rjbwork
11 days ago
Working on a freelance job 3 years ago, I got sucked down a rabbit-hole for months trying to get Azure Durable Functions to work. Too many bugs, no visibility into its workings and the worker would always grind to a halt. That job did not go well.
Avoid ADF (let alone Azure) until you've got some innovation tokens to spend, or go with another provider.
pm
11 days ago
It got much better. The internals are explained well enough, although there is room for improvement. The bugs can be worked around. They're super annoying, but are the lesser evil when compared to what ADF brings to the table. However, I'm looking into Temporal to see if that's the better app.
7bit
10 days ago
I can’t understand what layer provides the state orchestration. Like, in Celery it's Redis. What about here?
sscarduzio
11 days ago
The Temporal server stores events and distributes tasks. There is a cloud offering or it can be self-hosted (with support for Cassandra, Postgres, MySQL, and SQLite persistence). This post focuses more on the Temporal Python SDK and not the general platform.
kodablah
11 days ago
Could you or anyone else with experience with Temporal share how hard it is to self-host in practice? Like, is this more like Redis (self-hosting is trivial) or Supabase (nominally self-hostable, but if you try to do it you'll quickly realize it's a pain and the happy path is to use their hosted platform).
kcorbitt
11 days ago
We offer a full guide to help here at https://docs.temporal.io/self-hosted-guide and many users of all sizes self-host Temporal. Having said that, it has challenges, as does running any highly available production system. We offer cloud to ease this burden. You still run all your code/workers and you can end-to-end encrypt all data.
kodablah
11 days ago
Is this in the same space as Windmill.dev? Love to hear from anyone with experience with both.
darkteflon
10 days ago
I find Temporal itself to be effectively a clone of Amazon's Simple Workflow (SWF)
fortylove
11 days ago
That's no coincidence, Temporal is founded by the creators of Amazon Simple Workflow. See https://temporal.io/about.
kodablah
11 days ago
Isn't this just threads but with more surprise gotos?
KaiserPro
11 days ago
no, the whole point of temporal is to distribute work across machines, but without worrying too much about the orchestration.
workflows and activities are called remotely, and you can have an autoscaled worker pool handling these calls.
you can retry any unit easily on failure and specify the non retryable errors. What it requires in exchange is full determinism - the same input should produce the same activities in the same order, as a good starting point.
src: I'm a user since over a year ago.
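A toy model of why determinism is the price of admission, assuming nothing about Temporal's real internals: activity results are recorded in a history, and on resume the workflow function is re-run against that history instead of re-executing finished steps. All names here (`Context`, `onboarding_workflow`, the activity names) are invented for the sketch.

```python
from typing import Any, Callable, Dict, List, Tuple

History = List[Tuple[str, Any]]  # (activity name, recorded result)


class NonDeterminismError(Exception):
    pass


class Context:
    def __init__(self, activities: Dict[str, Callable[[Any], Any]], history: History):
        self.activities = activities
        self.history = history
        self.position = 0

    def execute_activity(self, name: str, arg: Any) -> Any:
        if self.position < len(self.history):
            # Replaying: reuse the recorded result instead of running again.
            recorded_name, result = self.history[self.position]
            if recorded_name != name:
                raise NonDeterminismError(f"expected {recorded_name}, got {name}")
        else:
            # New work: run the activity and record its result.
            result = self.activities[name](arg)
            self.history.append((name, result))
        self.position += 1
        return result


def onboarding_workflow(ctx: Context, employee: str) -> str:
    account = ctx.execute_activity("create_account", employee)
    return ctx.execute_activity("ship_laptop", account)


def replay_demo() -> tuple:
    history: History = []
    calls: List[str] = []

    def create_account(employee: str) -> str:
        calls.append("create_account")
        return f"{employee}@example.com"

    def ship_crashes(account: str) -> str:
        raise RuntimeError("worker crashed")

    def ship_ok(account: str) -> str:
        calls.append("ship_laptop")
        return f"laptop shipped for {account}"

    # "Machine A" dies during the second step.
    try:
        onboarding_workflow(
            Context({"create_account": create_account, "ship_laptop": ship_crashes}, history),
            "ada",
        )
    except RuntimeError:
        pass
    # "Machine B" resumes from the same history; step one is not repeated.
    result = onboarding_workflow(
        Context({"create_account": create_account, "ship_laptop": ship_ok}, history),
        "ada",
    )
    return result, calls
```

If the workflow code asked for activities in a different order on replay, the recorded history would no longer line up, which is what the `NonDeterminismError` branch stands for.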
ikari_pl
11 days ago
Threads are neither durable nor distributed.
anonzzzies
11 days ago