Am I reading this correctly? Looking at the benchmark code, why not compare against the default Tornado setup, which is to fork one process per core? As it stands, STM Tornado is allowed to use multiple cores in this benchmark, but vanilla Tornado is not:
from tornado.httpserver import HTTPServer
from tornado.ioloop import IOLoop
from tornado.web import Application

port = 8888  # assumed here for completeness
http_server = HTTPServer(Application(), xheaders=True)
http_server.bind(port)
http_server.start(0)  # forks one sub-process per core
IOLoop.instance().start()
The problem with multiple processes is the "share nothing" model. It works for some problems, but it blatantly fails for a whole variety of other problems. STM tries to address those problems where "share nothing" does not work, e.g. because there is interesting data to be shared (albeit with few conflicts) or the memory overhead of N processes is just too much.
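To make that concrete, here is a minimal sketch (my illustration, not from the post) of the "interesting shared data" case: several ordinary threads updating one shared dict. Under CPython the GIL serializes them; pypy-stm aims to run them in parallel and resolve the rare conflicting writes as transactions, which N forked processes cannot give you without explicit IPC.

import threading

shared = {}  # the shared data that a "share nothing" model can't offer

def worker(offset):
    # disjoint keys per thread, so transactional conflicts are rare
    for i in range(offset, 400000, 4):
        shared[i] = i * i

threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(shared))  # 400000 entries, potentially built on 4 cores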
I recently read this paper about Julia (http://arxiv.org/abs/1411.1607). Now when I hear about PyPy I can't help but think about this quote from that paper:
New users also want a quick explanation as to why Julia is fast, and whether somehow the same “magic dust” could also be sprinkled on their traditional scientific computing language. [...] Julia is fast because we, the designers, developed it that way for us, the users. Performance is fragile, like accuracy, one arithmetic error can ruin an entire otherwise correct computation. We do not believe that a language can be designed for the human, and retrofitted for the computer. Rather a language must be designed from the start for the human and the computer.
To me, Python and Ruby are both perfect examples of languages that were designed for the human, and ever since have seen extensive effort to retrofit them for fast execution by computers.
I respect the work of the PyPy team, particularly given the raving reviews I've seen lately of how RPython is a boon to language designers who can use it to prototype their languages and get a decently-performing VM in not very much time: http://tratt.net/laurie/blog/entries/fast_enough_vms_in_fast...
But I can't help but think that languages like Python and Ruby will start to fall to languages like Swift, Julia, Go, etc. that were designed with performance in mind. I'm not saying this will happen soon, but these languages are showing that you can have your cake and eat it too.
I'm not sure how JavaScript and Lua fit into this analysis. They weren't specifically designed for performance, but have been very successfully optimized. Lua is a very simple language and Mike Pall is a genius, so LuaJIT has been very successful at speeding up Lua. JavaScript is a little more complicated, but an immense amount of resources has been poured into optimizing it, and those efforts have also been quite successful at making it fast.
I will want to read that paper in its entirety later, but I think they are wrong. Many languages that were designed primarily, or even exclusively, for humans are now fast (see Lisp for the latter).
Python is probably the language specification most impaired on multi-core performance, because its threading semantics were specified so strongly. PyPy is now making a compelling argument that even that is not an insurmountable barrier.
If by "retrofitted" they allow backwards-compatible changes to the languages in question (e.g. optional type annotations for dynamic languages) then I would estimate that a language not designed at all with performance in mind would suffer less than a 2x penalty given enough engineering effort.
I don't think that's quite true. A lot of languages are a mix of designed-for-humans and designed-for-machine. They do aim to be higher-level than machine code, but it's quite common to have design decisions at least partly driven by considerations from the compiler side as well. Not necessarily only designing for efficient execution (though that is one); other design-for-machine considerations can include ease of parsing and ease of compiler implementation.
Python was designed to be a glue language. It does a terrific job at connecting together different pieces of C code. The more I use it, the more I realize that this is Python's calling. I envision a future where I code numerical stuff in Julia, and use Python to connect it to my main application, my GUI, web frontends and file juggling code.
Personally, I have found A* particularly difficult to scale across cores because it's never a shared-nothing problem, for two reasons. One, each core has to know about the other cores' search space and avoid it (to avoid duplicating effort), so you will contend on some sort of 'visited node' cache. Two, the graph you're building must itself be shared, obvs.
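To make the contention point concrete, here is a minimal sketch (my own illustration; a parallel breadth-first search standing in for A*, with a toy graph): every worker must check and update the shared visited set, so every node expansion goes through one lock.

import threading
from collections import deque

visited = set()                  # the shared 'visited node' cache
visited_lock = threading.Lock()

def search_from(graph, start):
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        with visited_lock:            # every expansion contends here
            if node in visited:
                continue              # another core already did this node
            visited.add(node)
        frontier.extend(graph[node])  # the graph itself is shared too

graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
threads = [threading.Thread(target=search_from, args=(graph, s))
           for s in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(visited))  # [0, 1, 2, 3]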
Are most of the interesting problems to solve always limited by Amdahl's Law? Will we never see the gains of single-core speed we saw in the last century again?
> One, each core has to know about the other cores' search space and avoid it (to avoid duplicating effort), so you will contend on some sort of 'visited node' cache.
Just make the visited node cache public and immutable.
> Two, the graph you're building must itself be shared, obvs.
If the graph is immutable then there's zero problem with it being shared.
You can't make the cache immutable, because then it will be empty at the start and stay that way ;)
The cache has to mutate and be shared, as that's the work-completed list. As each thread completes a bit of work (visits a node), it needs to communicate that to the other threads.
:+= returns a copy of this sequence with an element appended.
Meaning the cache is no longer shared, as all threads end up with a thread-local cache. You can't update shared, changing state and have immutability at the same time.
Immutability is great when one can have it, but sometimes it's not possible. Shared changing state is something to be avoided as much as possible, but sometimes we need it.
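A tiny illustration of the preceding point (mine, not from the thread): "updating" an immutable structure only produces a copy bound to a local name, so other threads never see the addition.

cache = frozenset()  # shared, immutable 'visited node' cache

def visit(local_cache, node):
    return local_cache | {node}  # returns a copy; 'cache' is unchanged

c1 = visit(cache, "a")  # thread 1's view
c2 = visit(cache, "b")  # thread 2's view
print(cache, c1, c2)    # frozenset() frozenset({'a'}) frozenset({'b'})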
If it's public and immutable, each core will see only its own cache, which would be pointless. I like your thinking, though; perhaps there can be a way to always pass the latest 'visited node' cache around in a timely way.
Note: if anyone was wondering what STM is, it stands for Software Transactional Memory. The Read the Docs page gives a good overview[1] of what pypy-stm is.
Better that it be the team account, to avoid having donors try to decide "who contributes how much"; let the team decide. Also, please offer it as an alternative to PayPal on the project website.
We can't really be creating accounts everywhere for minor donations. I support your sentiment, but the official PyPy bookkeeping has to be done in a proper way via the Software Freedom Conservancy, so going through all those services for a few $$$ is simply not worth it.
Why doubt the amount that can be received from Gratipay? What would you need to see to consider using it for PyPy funding? Would a promise of, say, $500/week be enough to make it worth the bother?
But I should say that there is still too much overhead in using STM; you will still be able to very easily (and by a large margin) outperform STM-4 by running 4 instances of Tornado with HAProxy or some other lightweight router on top. A comparison graph for this should have been the benchmark.
I like that they instrumented the STM code enough that you can debug slowdowns when writing your code. Never underestimate the power of well-instrumented languages.
Python's feature set is basically what's easy to do in a naive interpreter. Everything is a dictionary. Anything can be changed from anywhere. With "setattr", one thread can patch objects out from under another. It's elegant, and very difficult to speed up. Google tried, with van Rossum on board. Their "Unladen Swallow" JIT compiler project crashed and burned.
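A minimal illustration of that dynamism (my own, not from the comment): methods are just dictionary entries, and any code, on any thread, can rebind them at runtime.

class Worker:
    def step(self):
        return "original"

w = Worker()
print(w.step())                  # original

# any code, running on any thread, can patch the class at runtime:
Worker.step = lambda self: "patched"
print(w.step())                  # patched
print(Worker.__dict__["step"])   # the method is just a dict entry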
The PyPy group has made a fast Python compiler/interpreter/JIT system. It's really hard. The initial funding from the European Union got them started, but wasn't enough. They really try to handle all the hard cases. This requires two interpreters and a JIT. They have to handle "oh no, someone patched object A from thread R, invalidating code that's running in thread S". There's a "backup interpreter" that kicks in for hard cases, and once control is out of the area in trouble, the JIT can recompile it. (This is an old and oversimplified description.)
This transactional memory thing is very clever. It has to separate things at run time that probably should have been separated at compile time, of course. It's impressive that they can get it to work. It's a lot like how a superscalar CPU works, including transaction commit and backup at the retirement unit.
Python gets into this mess because, like C and C++, the language doesn't really know about concurrency. (Threads came late to UNIX, and C predates threads. So C has an excuse for backing into concurrency.) Python has the C model of concurrency; at the user level, it's treated as a library issue. Internally, though, it needs a lot of locking, because there's so much mutable state in the interpreter.
It would be a lot easier if the language were restricted a little. But then It Wouldn't Be Python(tm). The price of this is huge complexity layered on a simple model, and probably years of obscure bugs in PyPy.
> Python has the C model of concurrency; at the user level, it's treated as a library issue.
Other than Erlang, which modern language doesn't?
Java has e.g. the "synchronized" keyword, but I regard it as merely syntactic sugar over a standard library implementing threads; semantically, it still basically has the C model.
Go has channels and stuff, but it's still library-level (in fact, it's basically syntactic sugar over the Plan9/Inferno/Alef/DontRememberName standard library primitives).
There were other OSes with threads and co-routines being explored as designs while UNIX was being developed at AT&T.
As for the rest I agree with you.
Personally I don't have any use for Python besides the occasional shell script, but as a user of applications written in Python, I would like them to run fast.
PyPy does not really try to do that. CPython is great for a lot of things: it's small, easy to install on a myriad of platforms, and is, after all, Python. PyPy instead tries to push what's possible to do with Python as a language and with dynamic language interpreters. You can do real-time image processing in pure Python with PyPy, now STM, fast numerics in the future, etc. The goal is to expand the Python ecosystem, not to "replace" CPython.
I don't think this is applicable to Python. Only so many developers can work on CPython without stepping on toes. This led to the decision to treat CPython as the implementation standard rather than an experimental language, while PyPy led the effort to expand with a different team of developers. This is a good decision, because there are still more than enough developers to maintain a stable, full-featured CPython that PyPy can conform to (to varying degrees). There aren't any competing standard libraries, for instance, and it's made very clear which libraries work on which implementations.
Maybe if there were competing implementations of Ruby, it would still be popular outside the Rails community. But it seems as if the vast majority of Ruby core developers work on the mainline implementation, or forks thereof.
Generally speaking, for a language like Python, having only one implementation is considered a bad sign. Anything serious has multiple implementations.
CPython is hurting the ecosystem by being slow. I'll let you in on a little secret. People used to using C, C++, D, even Java and C#, laugh at the slowness of languages like Python and Ruby. Blah, blah, blah productivity gains. We laugh. We think, "I could write this in Java and go to production with three servers or I could write it in Python and use eleven."
People don't like to say it to your face, but Python is often derided and considered to be pretty silly. Oh, and its syntax is like Lego. Neat at first until you want something like a multiline lambda.
In most cases I would rather pay for 7 more servers than for the 10,000 extra lines of code your C++ or Java implementation is going to cost me. Hardware is cheap. Engineers are expensive.
C, C++, D, Java and C# are hurting the ecosystem by being slow. I'll let you in on a little secret. People used to coding assembly laugh at the slowness of languages like C and C++. Blah, blah, blah productivity gains. We laugh. We think, "I could write this in assembly and go to production on an Arduino or I could write it in C and have to use expensive servers."
People don't like to say it to your face, but C is often derided and considered to be pretty silly. Oh, and its syntax is like Lego. Neat at first until you want to understand what something is a pointer to.
The difference is that you can achieve similar performance to assembly by writing in those other languages, while you cannot achieve anything close to similar performance with Python.
It used to be the assumption that there was a direct tradeoff between performance and convenience/productivity of a programming language. I think newer languages are showing that this doesn't have to be true, at least not to nearly the same degree.
It was warranted. CPython needs to be derided and PyPy needs to keep up the good work until Python isn't unreasonably slow. Python should be humiliated into being fast or just go away.
PyPy is a project to bring some optimization to Python. Basically make Python run faster. It is also effectively another implementation of Python. CPython is the default one (the one you download at python.org). There is also Jython (running Python on the JVM), and PyPy, and a few others.
NodeJS is the marriage of the V8 JavaScript interpreter and JIT with an asynchronous IO library (libuv) + a large ecosystem of modules.
You might want to compare nodejs with PyPy+Tornado or with PyPy+eventlet. Read about STM and the idea behind it. STM lets you take advantage of multiple cores; nodejs is single-threaded. In practice nodejs might currently be faster just because V8 is very good and, depending on the workload, if most of the stuff it does is just proxying data from one stream to another, it might do pretty well. But if you start doing a large number of concurrent requests where each request has to do some logic, PyPy might come out on top.
In general, anywhere with complicated business logic or a large number of steps needed in the backend to handle requests, I wouldn't use nodejs. I never liked the callback/errback paradigm for large concurrent applications. It works for demos and short web tutorials; in practice, I don't like how it looks. I like green threads and lightweight processes better.
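For contrast, a sketch of the green-thread style in question (assuming eventlet is installed; the handler and its sleep are stand-ins for real request logic): each handler reads as straight-line code instead of nested callbacks.

import eventlet

def handle(request_id):
    eventlet.sleep(0.01)  # stands in for a blocking I/O call
    return "handled request %d" % request_id

pool = eventlet.GreenPool(size=100)  # up to 100 concurrent green threads
for result in pool.imap(handle, range(5)):
    print(result)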
"In practice nodejs might be faster currently just because V8 is very good" and I presume you claim PyPy is not so good? Well if so, I would really like to say [citation needed], especially for workloads and not say computer language shootout, which has little to do with performance of doing any actual workload.
Of course not, Maciej ;-) PyPy is awesome; thank you and the whole team, you have been doing a great job so far. Just looking at the performance graphs and the speedups gained over the years, it looks very impressive.
Yeah, I don't have a citation, but from experience I have noticed that where there are only a few steps in each request's processing (think proxies), solutions based on event loops (epoll, kqueue and friends) can outperform those that spawn a thread/process/context. For example, haproxy is certainly a very well done, fast proxy; it is single-threaded, and that seems to work for it.
Again sorry for misunderstanding, I was just speculating without any benchmarks or even particular applications in mind.
Cheeky comment aside, PyPy is an alternative interpreter for Python and, more broadly, a meta-framework that makes writing a tracing JIT for the language of your choice easy.
It's much faster than regular Python. The only downside (which, admittedly, is very, very big) is that the PyPy developers released their own FFI library for interfacing with C code (which works brilliantly), but using ctypes and the CPython C API is very kludgy. This means that a lot of very important Python libraries which are basically thin wrappers around C/Fortran code (SciPy, NumPy, etc.) are not really working.
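For reference, a minimal sketch of that FFI library, cffi (the library name below assumes glibc Linux; it varies by platform):

from cffi import FFI

ffi = FFI()
ffi.cdef("double sqrt(double x);")  # plain C declaration, parsed at runtime
libm = ffi.dlopen("libm.so.6")      # the C math library on glibc Linux
print(libm.sqrt(2.0))               # 1.4142135623730951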
There is some effort around reimplementing NumPy in PyPy, but it's going very slowly.
Edit:
PyPy is also a heroic effort by a small team of developers who receive some donations. JS is probably the language that has received the most money and attention toward making it run fast, from Google/Mozilla, etc.; in terms of results/money, PyPy is incredible.
To be a bit more specific, PyPy can be much faster for certain use cases. For others it may not be, or it may even be significantly slower; writing a JIT for Python is not an easy task.
Sigh, I'll give you the only honest answer it seems you will get here. Node.js runs JavaScript faster than anything runs Python, PyPy included. If you want speed and you like JavaScript just as much as Python, then use Node.js. When I say "faster" I mean you will use fewer servers/cores for your application with Node.js than you will with CPython or PyPy. V8 is just a better and faster VM for its target language.
"Node.js runs JavaScript faster than anything runs Python, PyPy included" that's a [citation needed] right here. Do you have some facts that I don't happen to have? Please share. That said, I don't care about a recursive fibonacci or computer language shootout problems.
Benchmarks are all you need to know unless a language runtime has IO problems. Fewer instructions and less memory mean fewer servers. Benchmarks are really good at predicting general performance. Languages with slow benchmarks need more servers to run your app. Languages with fast benchmarks need far fewer servers to run your app. This fact is so obvious and testable that I don't know why slow-language people bother throwing up the argument that benchmarks aren't everything.
Repeat after me: languages don't have speed.
Maybe to someone who is just now getting into the industry, but it wasn't long ago JS had no speed.
Python also varies in speed over time and implementation. Implementations have speed. There are reasons that people use "slow languages". It's not just developer happiness. You really do have to look at performance holistically.
If you want to speak to microbenchmarks, I can. For example, CPython's standard library JSON processing is done in C. It's very fast. As of Go 1.2, PyPy blew the doors off Go in my tests processing 10,000 JSON records. CPython 2 and 3 beat Go in my tests as well. What does this mean? Does it mean Go is a slow language? Well, it's one microbenchmark vs. another. It really doesn't mean much to your application.
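For the curious, a minimal sketch of that kind of microbenchmark (the 10,000-record count is from the comment; the record shape and timing harness are my own assumptions):

import json
import timeit

# 10,000 small records, serialized once, then decoded repeatedly; on
# CPython the json hot path is C, which is why it is hard to beat.
records = [{"id": i, "name": "user%d" % i, "score": i * 0.5}
           for i in range(10000)]
payload = json.dumps(records)

print(timeit.timeit(lambda: json.loads(payload), number=100))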
"I don't know why slow-language people bother throwing up the argument that benchmarks aren't everything."
I'd go even further than 'aren't everything'. Benchmarks that aren't your application do not mean anything.
Many are ignorant of how CPython is even built and are shocked when I show them how fast many standard library modules are. It's just not so simple. Don't believe the hype.
What benchmarks? Benchmarks are specific to the application area. N-body simulations are not good benchmarks for a concurrent application that serves web requests, connects to databases, and processes credit card data.
But that wouldn't be a true statement. Google's engineers didn't laugh at Node.js for using V8 due to arrogance. V8 is limited to a single CPU thread per instance. It's great for I/O and a typical webpage, but a CPU task stops everything. PyPy with STM does not have this limitation. Once this project is out of beta, V8 wouldn't compare to PyPy/STM. A better comparison would be the JVM or CLR.
I think your comment is moderately off-topic. This is really not about Tornado at all; rather, it's that you can take a medium-sized program and get STM in PyPy not to crash.