I'm investigating some alternatives, the whole lot of them suck in one way or another (true of all software, to be fair). As I write this I'm testing [Kestrel](http://github.com/robey/kestrel) but it's part of a long list.
37
[deleted]May 11, 2010
+15
Question: Are you running RabbitMQ with Beam or HIPE? (I ask because I've been loosely studying the Beam interpreter's memory management model.)
15
ketralnisMay 11, 2010
+14
We're running it with the beam VM, but performance with it hasn't been a problem for us so far, so unless HiPE has a significantly better memory manager or garbage collector it probably wouldn't make a difference
14
old_soundMay 12, 2010
+9
Have you tried the new RabbitMQ Persister? They say that memory problems are gone with it. Also, you can ask about your problem on the mailing list; I've seen RabbitMQ replying there quite often.
9
ketralnisMay 12, 2010
+7
We're using the unconfigured default in the most recent version of rabbitmq, so whichever persister that is
7
rabbitmqMay 12, 2010
+37
David, we offered you help on this some weeks ago. Please do accept our offer of help because your problem is almost certainly easy to fix, and much easier than switching queues. We are very happy to help you if you describe the issue on email, or on here, whatever is easiest.
Best wishes, alexis@rabbitmq.com
37
[deleted]May 12, 2010
+11
I doubt Microsoft has ever showed up and defended MSMQ on a social site to a site admin. Kudos to you and your service. Hopefully they take you up and work through proper tuning and give RabbitMQ another chance. You can't run web products at this level in defaults and expect it to be perfect for a site this large, a lot of the times it takes vendor involvement. Site admins generally don't know *everything*. You at least need to give the vendor a chance to be involved to see if there are better ways to architect things.
Defaults are for average sites, this site is not average.
11
JohnAppsMay 12, 2010
+2
Fascinating to read of someone running a very large, production web site, using default parameters for one of the main components! I have seen these crashes and managed to work around them by changing the memory parameters and introducing memory checking which will tell the clients to back off if things are getting a bit rough (flow control).
Do check with the folks from RabbitMQ - the support I have received has been nothing but superb!
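The back-off idea mentioned above (flow control) can be sketched roughly as follows. This is only an illustration of the pattern, not RabbitMQ's actual mechanism or API: `MemoryWatermark`, `publish_with_backoff`, and the 80% threshold are all hypothetical names and values.

```python
import time


class MemoryWatermark:
    """Hypothetical broker-side check: asks producers to pause above a high-water mark."""

    def __init__(self, limit_bytes, high_fraction=0.8):
        self.limit_bytes = limit_bytes
        self.high_fraction = high_fraction  # assumed threshold, not a RabbitMQ default
        self.used_bytes = 0

    def should_back_off(self):
        return self.used_bytes >= self.limit_bytes * self.high_fraction


def publish_with_backoff(watermark, send, message, max_wait=1.0, sleep=time.sleep):
    """Publish once memory pressure clears; give up after roughly max_wait seconds.

    Returns True if the message was sent, False if the producer gave up.
    """
    delay = 0.01
    waited = 0.0
    while watermark.should_back_off():
        if waited >= max_wait:
            return False
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_wait)  # exponential back-off, capped
    send(message)
    return True
```

The point of the pattern is that a producer slows down or drops work instead of pushing the broker past its memory limit.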
2
cowholio4May 12, 2010
+3
Have you tried using [Beanstalkd](http://kr.github.com/beanstalkd/)? It's not AMQP compliant but I have had great results with it.
RabbitMQ and I did not get along. :D
3
sophaclesMay 12, 2010
+6
If listnook seriously considers this, I will make pybeanstalk not suck as much. Seriously, that is just the type of motivation I need :). (PS listnook people if you do investigate, tell me you are listnook when you ask questions etc).
6
cantonistaMay 12, 2010
+5
In case it's not on the list, also check out [ØMQ](http://www.zeromq.org/)
5
fuzzyman45May 12, 2010
+3
With much of listnook running on Amazon's web service platform, is the Simple Queue Service for Message Queuing a good fit? We haven't had the need to utilize it yet so I can't speak for its usability compared to other MQ options.
3
eminenceMay 12, 2010
+2
would amazon sqs be usable for listnook's needs?
2
dillonaMay 12, 2010
+2
He's stated in the past that it is not fast enough. Sorry I don't have a source.
2
ketralnisMay 12, 2010
+3
It didn't used to be. I've since spoken to some of their engineers and they've undergone a total rewrite since we tested it. It's an option, the problem is that it would screw our open-source contributors without AWS accounts.
3
[deleted]May 12, 2010
+5
It's in his PyCon talk (which can be found at us.pycon.org/2010 somewhere)
5
dillonaMay 12, 2010
+6
Have you looked at [Gearman](http://gearman.org/)?
If I understand the architecture, it might be close to what you are looking for.
6
raldiMay 11, 2010
+91
We were kinda hoping you would tell us.
91
brownmattMay 11, 2010
+9
If I understand Cassandra correctly, it provides durability of data by keeping copies of data on multiple nodes in the ring. A write request isn't successful until at least N nodes acknowledge receiving the write.
So just curious but if there were only 3 in the initial cluster, how many nodes did each piece of data live on - two?
9
ketralnisMay 11, 2010
+7
> If I understand Cassandra correctly, it provides durability of data by keeping copies of data on multiple nodes in the ring
Yes
> A write request isn't successful until at least N nodes acknowledge receiving the write.
Yes, and you specify that *per operation*.
> So just curious but if there were only 3 in the initial cluster, how many nodes did each piece of data live on - two?
Two, yes. We're increasing that to three in the near future.
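A rough sketch of the arithmetic behind these answers, assuming the usual Dynamo-style rule that a read is guaranteed to overlap the latest acknowledged write when write acks plus read acks exceed the number of replicas. The function names are made up for illustration and are not Cassandra's API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when every read set must share at least one replica with every write set."""
    return write_acks + read_acks > replication_factor


# With two copies of each row, a majority is both replicas.
print(quorum(2))
# With three copies, a majority is two, so one replica can be down.
print(quorum(3))
print(reads_see_latest_write(3, write_acks=2, read_acks=2))
print(reads_see_latest_write(3, write_acks=1, read_acks=1))
```

Because the acknowledgement count is chosen per operation, each read or write can trade consistency against latency on its own.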
7
brownmattMay 11, 2010
+2
Cool! Any idea when the code for the Cassandra backend will make its way to the public git repo?
2
KeyserSosaMay 12, 2010
+3
In fact we're cobbling back together the public branch right now. If not for the last week of down time, it would already be out.
3
iBeenieMay 11, 2010
+12
When I read
>For performance we stuck a memcached cluster in front of it
I thought it said "memecache."
Now I am disappoint.
12
ketralnisMay 11, 2010
+10
What, you want us to be regenerating the "i am disappoint" comments on every new post when they could perfectly well be generated from cache?
10
iBeenieMay 11, 2010
+6
Forgive me. I assumed that there was no memecache, so I would have to manually add memes.
6
easternguyMay 11, 2010
+10
I really appreciate the update. It shows class.
No offense intended, but it really sounds like you have some serious design problems for a site with your traffic levels.
(I've been responsible for provisioning a top-50 internet site, so I'm not just blowing smoke.)
10
ketralnisMay 11, 2010
+9
Can you be more specific?
9
easternguyMay 12, 2010
+8
Sorry, I don't intend to anonymously and blindly criticize people I do respect during some terribly difficult times. I never meant it to come across that way. I'm a big Listnook fan (despite some weird trends/votes on the site lately).
Can I be more specific? About the site I founded, not publicly. Ex-wives, ex-employees, ex-partners, ugh, I'd prefer not to have them digging around my Listnook posts (which are generally for fun, relaxation, and stress relief).
In general, I'll say that one should know all the components of their system, how they perform under all circumstances (including going down/up), and be prepared for those situations. They're easy to investigate/simulate if you have any resources at all. We did it at a site that I'm sure had far fewer resources and backing than Listnook, and at a similar level of traffic.
Often, for a site like Listnook, an existing database or caching system will temporarily alleviate the pain, but not deal with the core problem.
One needs to look into the relationships of your core data, and determine if an existing solution will do the job, and scale. Usually, for a site with the traffic of Listnook, the answer is no. Existing solutions might work fine for now, and the immediate future, but it's a band-aid. There are custom optimizations you can do for inherent constraints in your system, that will blow away any generic database, by an order of magnitude.
At my company, we had a very customized data solution initially. It was pretty decent for the nature of our site. We then flipped over to MySQL, which worked not-too-bad. Then, due to VC and financing pressures, flipped in turn to Oracle. Ugh. It worked. F'ing expensive, but it worked.
Then the .COM/911 crashes hit, so we down-scaled. We turfed the Oracle/big-iron solution (in favor of a stack of PC's running Linux) and went back to our original efficient solution. Without reducing our level of service. In fact, things were better/faster with the data management solution that we custom-designed for the site from the start.
Again, I shouldn't criticize without knowing all the intricate details. But overall, general purpose packages often won't suffice for a super-high-traffic site with specific data relationships.
Cassandra does sound somewhat more specialized than a MySQL or whatever, but obviously wasn't up to the task at hand (or was managed insufficiently for that task).
Google had a specific application need at an unprecedented level of traffic; they developed their own tech to deal with it. Others have done the same. Taking something off the shelf for a top-100 site, generally won't work. You need some custom tech.
I'm sorry I can't be more specific, via PM, maybe.
Hopefully you catch my gist, and don't take me as just a complaining crackpot. :)
I'm sure many of my comments were made in ignorance of certain Listnook specifics. For that, I apologize in advance.
Cheers.
8
jedbergMay 12, 2010
+20
> They're easy to investigate/simulate if you have any resources at all. We did it at a site that I'm sure had far fewer resources and backing than Listnook, and at a similar level of traffic.
> Then, due to VC and financing pressures, flipped in turn to Oracle. Ugh. It worked. F'ing expensive, but it worked.
I can promise you that if you could afford Oracle, you had far more resources than we do. :)
Our entire annual budget is $250K.
The rest of your post makes a lot of sense though.
20
easternguyMay 12, 2010
+9
> I can promise you that if you could afford Oracle, you had far more resources than we do. :)
Well, that was at our peak (financing-wise, especially). We couldn't dream of Oracle (nor would we want to) in the earlier days, nor in the post .COM/911 crash (nor would we want to :). If I were a grandparent, I would have rolled over in my grave when Oracle took control of MySQL. (That's why my clients' sites use Postgres now.)
250K is indeed *very* lean for the traffic/prominence/visibility of Listnook. Very lean. So I take back a lot of my criticisms. (No wonder you haven't implemented search yet. Ziiiinggggggg!)
Seriously, with those limited resources, I'm impressed you keep it up at all (that's what she said)...
9
jedbergMay 12, 2010
+7
I'm glad you can appreciate that $250K isn't very much -- most people can't. :)
7
easternguyMay 12, 2010
+6
Agreed. "A quarter mil! Wow!", people say.
Bandwidth at Listnook's level ain't cheap at all. I'm out of touch, but I know it'll take a *big* bite out of that.
Add a couple of admin support staff. They aren't going to work for $10K/year. So add that up.
A programmer or two? Not for $30K for anyone half decent that knows coding, html, css, javascript. Maybe at twice that, *if* someone likes Listnook and is taking a salary hit to work for a site they like.
Oh yeah, you'll need office space, power, lights, heat, insurance, benefits, yadda, yadda, yadda.
Would be nice to have some graphic/html/css design talent there, too. Factor that in.
Very lucky to do it at all for $250K, IMHO. In fact, it's probably doing it "ghetto style" at that level. (Which isn't always the worst thing.)
EDIT: jedberg replies, $250K just operations, not including salaries. Still impressive. As I said, I'm sure the hosting/bandwidth takes a *huge* bite. It doesn't take much overhead to consume that. My memory is starting to fade, but for our top-50 site, I'm pretty sure we were paying 10's of thousands a month just for our bandwidth. In Cdn $ which were about .60c US at the time. (Almost par these days, though. :)
And forget about something like Oracle. They'd take that $250K solely for themselves.
Does operations include some admin/maintenance/operational-staff salaries? If it doesn't, it should. Although not knowing your overall budgets/finances, it's hard to say.
When you're running your cool web site off your cable modem or basic hosted service, it's easy to significantly underestimate the costs of dealing with millions of hits an hour.
If the $250K-ish budget is for operations, I'm curious what the budget is for salaries, programmers, developments, enhancements. Operational budgets are great, but without serious ongoing investment in the tech. to handle growth (especially with a popular site like listnook), you're going to be toast. Which would be very sad indeed. You can only throw more servers and caches at a problem so far, until scalability tanks without proper design.
6
jedbergMay 12, 2010
+8
Oh, that $250K is just operations -- that doesn't include salaries.
8
raldiMay 12, 2010
+9
Now imagine your staff was 1/4th of whatever it actually was, and your operations budget 1/8th. Then what would you have done?
9
easternguyMay 12, 2010
+4
Again, never meaning to criticize, just to comment. Hopefully positively.
I've dealt with the 1/4th staff (more like 1/20th) and 1/8th budget (more like 1/40th). That's why we went back to Linux servers. And it worked, with the right design.
Listnook is prominent enough, owned by wired (or Condé, etc.), that I assume they have a wee bit more resources than we did. But it's likely comparing apples to oranges in any case, so any squabbling is irrelevant.
4
[deleted]May 12, 2010
+11
[deleted]
11
[deleted]May 12, 2010
+7
>You might need to get some Cloud
That sounds like a *great* path forward. This guy is a straight-shooter with upper-management written all over him!
7
[deleted]May 12, 2010
+6
If the cache was preventing you from seeing that the Cassandra cluster was underprovisioned, could you perhaps deliberately induce a certain percentage of cache misses every once in a while to test the waters? It could be done in parallel to the regular processing so as to not slow down user access, but of course that complicates things a great deal.
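A minimal sketch of that idea, assuming a dict-like cache and backing store. It uses the simpler synchronous variant (the sampled request itself goes to the store) rather than the parallel one, and `get_with_sampled_miss` and the 1% default are hypothetical, not anything from the site's code.

```python
import random


def get_with_sampled_miss(key, cache, backing_store, miss_rate=0.01, rng=random.random):
    """Read through the cache, but bypass it for a small random share of requests.

    The bypassed reads keep a trickle of real load on the backing store, so its
    latency under cache misses shows up in monitoring before the cache is lost.
    """
    if rng() >= miss_rate:
        value = cache.get(key)
        if value is not None:
            return value
    # Forced or genuine miss: read from the store and refresh the cache.
    value = backing_store[key]
    cache[key] = value
    return value
```

Raising `miss_rate` gradually would show how much read load the store can absorb without taking the cache away all at once.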
6
PrototekMay 11, 2010
+2
I have a complaint! Everyday it seems that from 1:15pm PST for about an hour, comments take *FOREVER* to load! I've been attributing it to people on the east coast getting off work but I swear it happens everyday like clockwork. At 1:15pm PST, comments take 20-60 seconds to load! The frontpage loads fine but clicking on the comments to a link take forever!
2
ketralnisMay 11, 2010
+5
Interesting. Most of our traffic is actually *during* work hours, not after them, so it's more likely from other PSTers getting off lunch rather than ESTers getting off work. Do you have the skills to log your ping-times and HTTP-response times to us once per hour for, say, two days or more? That could help us see how consistent it is and compare it with our own load graphs. Is it logged-in, or logged-out that you see the problem?
5
PrototekMay 11, 2010
+2
This happens to me while logged in. Do you have any software recommendations on how to log that? The extent of my network diagnostic experience is using ping and tracert in command line. If you suggest some software, I could probably figure it out and supply you with the logs.
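One option that needs no special software is a short script, run once an hour from cron or the Windows Task Scheduler. This is a minimal sketch assuming Python 3 is installed; the function names and the CSV layout are made up for illustration, and it measures HTTP response time only, not ping.

```python
import csv
import time
import urllib.request
from datetime import datetime, timezone


def time_request(url, opener=urllib.request.urlopen, clock=time.perf_counter):
    """Fetch url once and return (HTTP status, seconds taken to read the full body)."""
    start = clock()
    with opener(url, timeout=120) as response:
        response.read()
        status = response.status
    return status, clock() - start


def log_sample(url, path, **kwargs):
    """Append one timestamped measurement to a CSV file at path."""
    status, seconds = time_request(url, **kwargs)
    with open(path, "a", newline="") as handle:
        row = [datetime.now(timezone.utc).isoformat(), url, status, f"{seconds:.3f}"]
        csv.writer(handle).writerow(row)
    return status, seconds
```

Calling `log_sample` with the URL of a slow comments page and a file name each hour would build up a log that can be compared against the site's own load graphs.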
2
aftliMay 11, 2010
+3
> As we write this a mapreduce job is running to recalculate those listings from the canonical store at about 20/sec, and should be done by the end of the week.
Does this have anything to do with my user page missing three years of comments? Will I ever get them back? I assume it's just a matter of linking them back up with my page, since the posts are still there, just not on my user page.
3
GenTiradentesMay 11, 2010
+5
> For now we've upgraded to the latest version of Erlang (rabbitmq is written in Erlang)
Hey, a useful program written in a functional language!
_Ducks_
5
ketralnisMay 11, 2010
+6
kestrel, a competitor to rabbitmq, is in scala
6
ItsAConspiracyMay 12, 2010
+6
Sounds like if Cassandra took items off the internal queue upon timeout, this wouldn't have happened. Any idea if that's a feasible change to Cassandra?
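To make the suggestion concrete, here is a small sketch of a queue that discards requests whose caller has already timed out. It is not how Cassandra's internal queue works; `DeadlineQueue` is a hypothetical name used only to show the idea.

```python
import time
from collections import deque


class DeadlineQueue:
    """FIFO queue that skips requests nobody is waiting for any more."""

    def __init__(self, clock=time.monotonic):
        self._items = deque()
        self._clock = clock

    def put(self, request, timeout):
        # Store the absolute deadline alongside the request.
        self._items.append((self._clock() + timeout, request))

    def get(self):
        """Return the oldest request that is still live, or None if there is none."""
        now = self._clock()
        while self._items:
            deadline, request = self._items.popleft()
            if deadline > now:
                return request
            # Expired: the client gave up, so doing the work would be wasted effort.
        return None
```

Without the expiry check, an overloaded node keeps working through requests whose clients have long since retried elsewhere, which makes the overload worse.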
6
sgorfMay 12, 2010
+3
Also, given that the point of using something like Cassandra is to scale, shouldn't it be able to handle bringing in extra emergency nodes more gracefully and incrementally?
3
ItsAConspiracyMay 12, 2010
+3
I think the problem there is that Cassandra's designed for large scale. If you've got 100 nodes, adding a single node is going to have a lot less impact than if you've only got three.
3
ketralnisMay 12, 2010
+3
That sounds like it's accurate, yeah. Hopefully it's a change that can be made. After I can put together a more technical version of that I will file a JIRA ticket with them.
3
idreamincodeMay 12, 2010
+4
Thanks guys.
Where can I send beer for your troubles?
4
KeyserSosaMay 12, 2010
+8
Please send all donations to:
Listnook Infinite Sorrows' Fund
520 3rd St
Third Floor
San Francisco, CA 94107
Please make sure to put an extra large "This is not a bomb!" on the oblong, crudely-wrapped, probably ticking package.
We will also accept crudely drawn tokens of appreciation in crayon.
8
Zeische_StabbingtonMay 12, 2010
+5
Is a crudely-drawn spider with seven legs acceptable?
5
seagramsextradryginMay 12, 2010
+7
Is this message from the future?
7
megaman821May 12, 2010
+9
How does memcached/Cassandra 0.5 compare to just Cassandra 0.6? If the row-caching 0.6 performs well are you going to dump memcached?
9
paranoidinfidelMay 11, 2010
+8
Where did all the missing comment history go if you clicky on someone's name? Is that still an outstanding issue or did I miss the fix announcement as well as the fix? I don't care if my history is gone as my comments aren't insightful but I am curious as to the resolution of that item.
8
jedbergMay 11, 2010
+6
You didn't read to the end, did you?
6
paranoidinfidelMay 11, 2010
+5
Thanks for the clarification, I mostly skimmed towards the end but in my brain "broken listings" doesn't have much to do with "missing comment history" ;)
As i clicked on "context" of my post, I am "viewing a single comment's thread." and I can "view the rest of the comments -->" but I am not "viewing listings".
But i understand now and will accept this alternate terminology.
Thanks for all your hard work!
5
[deleted]May 11, 2010
+2
All listnookors need to watch Mr. Huffman's presentation on how listnook works on the back end.
I hope one of the current team can do periodic updated reports.
2
jedbergMay 11, 2010
+5
Here was my talk:
http://us.pycon.org/2010/conference/schedule/event/148/
5
SidtheMagicLobsterMay 12, 2010
+3
If I may offer a bit of advice to the admins here, it sounds like your servers want a human sacrifice.
3
specialk16May 11, 2010
+212
>We've written to Trend Micro explaining that we're actually neither a spammer nor an individual end user, but rather an honest website that's kind of a big deal, and they sent us a form letter explaining how to configure Outlook Express and encouraging us to ask our ISP for further information. We'll try to figure something out as soon as time allows.
hahaha oh wow.
212
bdunderscoreMay 12, 2010
+25
Easiest way to fix this is to get a cheap VPS somewhere that's not blocked and just use it as a smarthost. Spammers can and do exploit EC2, so good luck getting it unblocked.
25
WasterDaveMay 12, 2010
+23
Exactly. The ability to send email from pretty well anywhere has become collateral damage in the war against spammers. You basically *need* to outsource to an SMTP company of some description. F***.
23
strollsMay 12, 2010
+9
IMO the ability to send email from pretty well anywhere has become collateral damage because so many mail admins are lazy.
There are [now ways](http://en.wikipedia.org/wiki/DKIM) you can accept mail from dial-up pools and EC2 and be quite confident the senders are not spammers, but it's just easier for mail admins to say "oh, you should be using your ISP's mail servers".
9
PoromenosMay 12, 2010
+5
They do use SPF, which should be about as useful (clients can recognise that mail from this machine comes from listnook), but Trend Micro still does that.
5
zemMay 12, 2010
+4
the willingness to accept collateral damage is why i no longer have any respect for blackhole-list-type spam fighters
4
PoromenosMay 11, 2010
+82
You laugh, but correctly configuring their Outlook clients solves the problem!
If only they had tried it...
82
DimeShakeMay 12, 2010
+5
This is what's called a Policy Black List, other VPS hosts are affected by it as well. Rackspace Cloud servers are on Spamhaus' PBL by default, however their interface for removing explicit hosts is quite clean and quick. I'm unsure how Trend Micro's PBL removal process goes.
5
uhhhclemMay 12, 2010
+8
My favorite sentence in the whole post.
8
redderritterMay 11, 2010
+18
yeah trend micro can kiss my failass
18
[deleted]May 11, 2010
+63
Remember ketralnis, every time you break the site raldi's [going to post another digit of your phone number](http://www.listnook.com/r/announcements/comments/c0snf/new_feature_inboxes_show_you_your_new_mail_rather/c0pilkf?context=3).
Thanks as always for letting the community know what happened, it's that shit which keeps us here.
63
P3GMay 11, 2010
+10
I stay for the burnt pizza smell
10
Armitage1May 12, 2010
+1
You should switch from Python to Cocao. I don't know what that is but I hear its super awesome!
Seriously though, it sounds like you have App Developers also acting as Sys Admins. Is that the case? Have you considered splitting those roles?
1
timberspineMay 12, 2010
+2
what does an "internal message bus" do?
from the rabbitmq site and your description, i gathered that it acts as messaging system for enterprise apps ... i'm just curious as to what listnook apps use the internal message bus ... ?
2
ckwalshMay 11, 2010
+34
I run a site that gets a trillion hits per day and have never had these sorts of problems. Just move everything to Windows Server, re-write it in Visual Basic, and empty the tubes before you plug it back in and you should be fine. You guys are idiots for not doing so already!
On a serious note, thank you very much for the explanation, and we appreciate the hard work you put in for us.
34
[deleted]May 12, 2010
+11
[deleted]
11
[deleted]May 12, 2010
+6
No he didn't - his Access database is still loading his comment....
6
kisielkMay 12, 2010
+22
Jeff Atwood, is that you?
22
[deleted]May 11, 2010
+162
If all companies were as open about their mistakes as listnook was, everyone could benefit by learning from each other's mistakes.
162
jedbergMay 11, 2010
+132
That's really the main reason we are so open about it. We are also very open about our costs, in large part because if everyone were, the prices would come down. High traffic hosting is very overpriced in large part because of the lack of pricing information available.
132
[deleted]May 12, 2010
-6
[deleted]
-6
jedbergMay 12, 2010
+30
Really? Every [talk that I give](http://us.pycon.org/2010/conference/schedule/event/148/) starts with that info, but sure, I'll repeat it for you.
> how many servers are used to operate listnook?
Here are the numbers for today:
28 c1.xlarge
26 m1.large
17 m1.xlarge
You can see the descriptions of the instance types [here](http://aws.amazon.com/ec2/instance-types/).
> roughly how much bandwidth is used per month?
In April:
3.9TB Inbound from Akamai
15.4TB Out to Akamai
24.6TB between datacenters.
I don't have quick access to our Akamai bandwidth, but it is many TBs
> roughly how much ($) is spent on bandwidth?
In April we spent $2,358.26 on bandwidth.
30
[deleted]May 12, 2010
+12
[deleted]
12
jedbergMay 12, 2010
+4
If only the other guys would share their info... :/ Then I would have a basis for comparison.
4
[deleted]May 12, 2010
+5
I use about 250Gb/mo for [Simutrans](http://simutrans.com/), but I'm not the "other guys" you mean. :-)
5
jedbergMay 12, 2010
+3
How much do you pay for that bandwidth?
3
[deleted]May 12, 2010
+4
I have two dedicated servers with iWeb.ca; included with first server is 2Tb; 1.5Tb with second; I got a deal for both servers for $130/mo (with cPanel)... Additional bandwidth is 25 cents per GB if I exceed - and unlike some companies, all the listed resources are mine to actually use if I do... heh.
4
[deleted]May 12, 2010
+3
>In April we spent $2,358.26 on bandwidth.
Do you have someone who fills out a corporate checkbook every month with a pretty listnook-logo check to Amazon? That number is absurdly precise, and now I'm imagining you going through the same ritual that I do with my monthly bills, just on a larger scale.
3
jedbergMay 12, 2010
+3
Of course! The accounting department. The number was precise because Amazon rolls up all data transfer into a single line item on the bill, and I just copied it from there. :)
3
ketralnisMay 12, 2010
+10
> Let's see how transparent you guys can be
This sounds more like you're trolling than that you actually care, but we've publicised that information
10
[deleted]May 12, 2010
+6
It might or might not have been trolling - I didn't realize y'all shared this info readily until jedberg's reply, and I could see their wording go either way... But I do think it's awesome you do. :)
6
NeebatMay 12, 2010
+55
I wonder if this sort of thing could happen in other industries? Imagine if it were hard to get price quotes for something *really* important, like a life-saving medical procedure... prices could sky rocket.
55
jedbergMay 12, 2010
+65
Seriously.
Not to diverge too much here, but there is a procedure that has almost perfect price transparency -- LASIK. It isn't covered by any insurance, and tons of people do it every day.
Amazingly, it is quite affordable.
65
NeebatMay 12, 2010
+36
> Not to diverge too much here
You realize this is listnook, right? All we do here is diverge.
36
ScriptoriusMay 12, 2010
+55
Poor guy must be new here.
55
jedbergMay 12, 2010
+22
Speaking of diverging, did you guys watch the Felicia Day interview?
22
SeriousWormMay 12, 2010
+13
Just seen it.
I now have permanent blind spots in my eyes in the shape of Felicia Day.
13
NeebatMay 12, 2010
+7
Someday, you'll regret that. When you're walking down the street, she's walking down the street, you can't see her and you walk right into her.
Wait, that sounds pretty nice actually. She'd probably apologize. Because she's so damned nice!
7
archivatorMay 12, 2010
+11
Precious, precious blind spots!
11
[deleted]May 12, 2010
+3
Yeah... let's just say that Listnook is very ADHD-friendly (I know, cuz I is one)... hehehe
3
Jonathan_the_NerdMay 12, 2010
+7
> almost perfect price transparency ... **isn't covered by any insurance** ... **Amazingly, it is quite affordable.**
Interesting. Third-party payment destroys price transparency, which causes prices to go up, which causes demand for more third-party payment, ad infinitum. So, remove most third-party payment, and you restore price transparency and price competition.
How's *that* for diverging from the topic?
7
[deleted]May 12, 2010
+15
It's even more affordable if you [do it yourself](http://www.lasikathome.com/).
15
jedbergMay 12, 2010
+28
Please dear god tell me that is fake.
28
[deleted]May 12, 2010
+4
Actually trying to order generates this:
>ERROR
>[SQL SERVER] Error Code = 3949 SQL SERVER: SQL1
>LINE 41
>TABLE NOT FOUND
heh.
4
raldiMay 12, 2010
+17
Also for the karma.
17
pobodyMay 11, 2010
+1
So, um, I hate to ask this since last time it apparently caused the whole site to go down, but are we still (eventually) getting our missing comments back?
1
KeyserSosaMay 11, 2010
+179
Pre-emptive snarky counter-argument: if your inclination is to begin your comment with "Why didn't you just..." please step away from your keyboard, and assume you are operating in a universe where we are not idiots.
179
PoromenosMay 11, 2010
+19
Quick few questions (hopefully one of you guys will make a blog post later on, this is valuable information):
* How did Cassandra/memcached/mongodb/whatever else you tried fare as caching layers?
* Why did you go with Cassandra, if memcached is so damn fast?
* You said you tried mongodb as a cache, how did that go (speed/reliability/etc)?
Thanks!
19
ketralnisMay 11, 2010
+25
> How did Cassandra/memcached/mongodb/whatever else you tried fare as caching layers?
Cassandra/memcached fared *extremely* well until we dumped the memcached. After adding more nodes (an extremely transparent process) it fared well even on the empty memcached.
> Why did you go with Cassandra, if memcached is so damn fast?
memcached isn't persistent, we need something that survives a reboot. Putting memcached in front is just a performance hack (and it worked so well for that that we didn't detect the read load on Cassandra)
> You said you tried mongodb as a cache, how did that go (speed/reliability/etc)?
I tried several other databases, yes. Mongo specifically suffered more than one bout of catastrophic data loss in my brief testing with it, but performed reasonably well for basic sequential reads/writes, and significantly worse for random reads/writes (especially writes)
25
PoromenosMay 11, 2010
+7
Ah, thanks for that. Did you try memcachedb as a persistent store?
Also, your experience with mongodb is consistent with mine. It looks fantastic in theory and if you try it with few data, but if you start putting more data in it becomes slower and prone to bugs... It also won't help if, due to the confusing versioning, you run 1.1 or 1.3 (an odd minor release denotes a testing version)...
7
ketralnisMay 12, 2010
+9
> Ah, thanks for that. Did you try memcachedb as a persistent store?
Yes [1](http://blog.listnook.com/2010/01/what-day.html) [2](http://blog.listnook.com/2010/03/and-fun-weekend-was-had-by-all.html) [3](http://blog.listnook.com/2010/03/she-who-entangles-men.html)
9
matjamMay 12, 2010
+3
We're using TimesTen here as an in-memory database (it can also operate as a cache in front of a full Oracle database) and it works very well for us. But I guess you guys are not considering commercial stuff.
I've heard good things about MonetDB but have no personal experience with it.
If you guys have the budget, take a good long look at TimesTen.
3
ketralnisMay 12, 2010
+5
> I guess you guys are not considering commercial stuff
That's not necessarily true, but because we're open source and want to continue to accept outside contributions we need things that our open source contributors can run, and they typically don't want to pay for it. Also, our budget isn't huge and just paying for the servers to run said commercial software already occupies our entire budget. More still, most of our application is in Python, so we'd need Python bindings for said software, and those don't tend to exist for non-free software
5
matjamMay 12, 2010
+2
TimesTen has two ODBC interfaces, one that connects to timesten over a network, and is a little slow, and the other that uses the shm stuff to go "really really fast" that requires the client to run on the same machine as the TimesTen instance. Getting Python to work with it should be pretty straightforward; use one of the python ODBC implementations like pyODBC, linked against the appropriate timesten ODBC library, and Bob's your uncle. Actually, he's my uncle. You might have an uncle called Bob, I don't know. Stop looking at me like that.
Anyway. Because it's ODBC, someone could substitute any database that provides an ODBC driver, which is I think most of them, if they can't get TimesTen. I've not tested the python interface; everything I do is in C, so I just use the native TimesTen ODBC driver directly.
We use cluster pairs of about 16 instances to collect usage data for a large ISP here in Australia. This usage data comes in as Radius, XML usage records, etc, and we process them and write them to the local TimesTen instances. We then have an XLA application that watches the commit log for each instance and writes session data to a central site for reporting and billing purposes.
Check it out at least. It's free to try, I believe. Something like $15k per instance if you run it in a production environment. Talk to an Oracle rep. I don't think it's 15k per year; I think that's a one-off. The cost comes in if you need it supported in the long run :)
BTW, I know you guys are fairly positive about your switch to EC2, and the flexibility it's given you, the ability to scale on demand. I could never consider going that way myself for a high profile site like listnook; I think if you are running systems that are critical to your business, you need to be in control of your own destiny. You never, ever want to have to say "yeah, performance is broken right now because ... I don't know. Something, someone, is using too much CPU. Or IO. Or something. I'll need to call Amazon.". At least when you're hosting your own stuff, if it breaks, it's your own fault. There is nothing worse than sitting at your desk, staring at a dying system, thinking "if only I could see the big picture, I'd be able to fix this now.".
2
robotsongsMay 12, 2010
>Mongo suffered more than one bout of catastrophic data loss...
Is that why I lost over two years of saved articles a couple weeks ago? I'm still trying to figure out if those are ever coming back.
0
ketralnisMay 12, 2010
+7
We've never run Mongo in production
7
robotsongsMay 12, 2010
+2
So would there be any explanation for everything just disappearing? I've read as much of the post as I can but the musician in me starts hearing white noise whenever I try deciphering what precached keys on Cassandra's node sssssssssssssshhhhhhhhhhh.
2
libcryptoMay 12, 2010
+2
Stand back everyone. I speak musician.
So.
You know how in *Sweet Emotion*, Tyler sings, "you can't catch me 'cause the rabbit done died"? Well, it's the same thing here. The rabbit(mq) done died, so you can't catch Tyler Durden, er, Data. Pretty simple, eh?
2
JulianMorrisonMay 12, 2010
+3
Mongo's characteristics are 100% obvious when you grok it's really a wrapper around mmap.
Specifically:
- It can trash data, because it's overwriting it in mapped RAM and letting the OS do the writeback.
- It's fast sequentially, because it just slurps whole pages into RAM.
- It's slow randomly, because it *has to* slurp whole pages into RAM. Even if you only needed a couple of bytes.
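
A small sketch of the mmap behaviour those three points rely on, using Python's standard `mmap` module on a scratch file. It illustrates memory-mapped I/O in general and is not a claim about MongoDB's actual code; the function names are made up for the example.

```python
import mmap
import os
import tempfile


def make_data_file(pages=4):
    """Create a zero-filled scratch file spanning a few OS pages."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * (mmap.PAGESIZE * pages))
    return path


def update_record(path, offset, payload, durable=True):
    """Overwrite bytes in place through a memory mapping."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as m:
        # This assignment only dirties pages in mapped RAM.
        m[offset:offset + len(payload)] = payload
        if durable:
            # Ask the OS to write the dirty pages back now (msync on Unix).
            # Without this, writeback happens whenever the OS gets to it.
            m.flush()


def read_two_bytes(path, offset):
    """Read 2 bytes; the OS still faults in the whole page that holds them."""
    with open(path, "rb") as f, mmap.mmap(
        f.fileno(), 0, access=mmap.ACCESS_READ
    ) as m:
        return m[offset:offset + 2]


path = make_data_file()
update_record(path, mmap.PAGESIZE + 10, b"hello")
print(read_two_bytes(path, mmap.PAGESIZE + 10))
os.remove(path)
```

Skipping the `flush()` call is what leaves an in-place overwrite at the mercy of the OS writeback schedule, which is the data-loss risk described in the first bullet.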
3
davidreiss666May 11, 2010
+7
>and assume you are operating in a universe where we are not idiots.
Yes, we know, you Listnook admins are lovable idiots. :-)
7
IcommentonthingsMay 12, 2010
-7
This is ridiculous, some of us actually work on much higher availability and complex systems than this... and the fact that you and others want to believe you are masters that have nothing to learn from anyone else makes you an idiot.
I love the IT ego... such a fragile little creature.
-7
ketralnisMay 12, 2010
+8
> want to believe you are masters that have nothing to learn from anyone else
I don't think anyone believes this. His point was just "before you assume that we haven't thought of something, assume that we've thought about the problem for at least thirty seconds". It doesn't imply that we think we know everything, just that no amount of "why aren't you using my favourite toy?" or "why can't you just reboot the server?" is a help.
8
IcommentonthingsMay 12, 2010
-6
I get it, but it is not worded that way... in fact there is a general lack of interest in engaging those of us who use and love Listnook daily who could be a big help. I understand you probably get a lot of simplistic "you should totally use Joomla on a Mac mini!1!" junk, but there are I'm sure more than a few of us who have run some serious HA and large data centers. 4,000+ server data centers, supercomputers, complex core routing and networking, etc.
It's the attitude that keeps folks like myself from even offering some (real) free advice.
-6
ketralnisMay 12, 2010
+6
> there is a general lack of interest in engaging those of us who use and love Listnook daily who could be a big help
So you haven't been reading this entire thread at all, then?
> there are I'm sure more than a few of us who have run some serious HA and large data centers
And we'd love to hear from them. We don't pretend to know everything, we just want to keep out the "you should use my favourite toy instead of your favourite toy" because it pollutes the conversation for those offering actual advice.
6
glengyronMay 12, 2010
+3
Not a 'why didn't you just', but it's still a 'why did you', and that's why did you use memcached & Cassandra, rather than just Cassandra?
I know in hindsight it sucked, but was the decision made based on performance or preserving other parts of the system that presumably worked with memcached currently?
**Edit**: They wanted row-cache support.
3
KeyserSosaMay 12, 2010
+3
You've already self-answered, but to summarize: we started using cassandra when it was still at 0.5 and didn't have row-cache support.
The intention of the comment was to avoid the responses which usually boil down to "why didn't you just use *my favorite solution X which is roughly equivalent*" or "why didn't you just *do the thing that you too would have realized in 30 seconds that you probably actually did try*". The last thing I want to do here is stifle the actual conversation.
3
kwhMay 11, 2010
+87
Why didn't you just reroute power from cargo bay four transporter to the auxiliary replicator system, and remodulate the shields to initiate an inverse tachyon pulse through the main deflector dish?
87
KeyserSosaMay 11, 2010
+84
The phase inverter wouldn't take the additional magnetic flux. We even tried reversing the polarity and re-calibrating the phase compensators.
84
pobodyMay 11, 2010
+61
Reversing the polarity? Are you *trying* to cause a subspace rupture?!
61
Mutiny32May 12, 2010
+15
We all know subspace weapons were banned in 2293 under the Khitomer Accords.
15
atheist_creationistMay 12, 2010
+40
Why didnt u just use wordpress for the sight LOL. My websiet has lots of comments and users and has never gone down. Listnook admins are noobs rofl.
40
[deleted]May 11, 2010
+229
Why didn't you just put that in the blog post? :P
229
thebellmaster1xMay 11, 2010
+12
Assuming that people read links before commenting? You must be new here.
12
3SecndsOfUrLifeWastdMay 11, 2010
+48
Whoring for karma, duh
48
FausterMay 12, 2010
+10
For greater TPS impact, why don't you just leverage transformative web 2.0 synergistic contingency planning to replace rabbitmq vis-a-vis paradigm-shifting enduser empowerment strategies?
10
mattyvilleMay 12, 2010
+8
You could do that, but it would just be easier to install Mac OS X on the whole thing. Or hell, why don't you just run Listnook from your iPads? Gosh.
8
IvyMikeMay 11, 2010
+101
Why didn't you just call the Best Buy Geek Squad?
101
poeirMay 11, 2010
+60
Please do this one day and post the result.
60
tedivmMay 11, 2010
+37
If they recorded the call it would totally be worth the downtime.
37
mattyvilleMay 12, 2010
+6
"Results" seem like an overly generous term for what they would accomplish. But yes, please do this for next April 1st.
6
raldiMay 11, 2010
+16
Why didn't you dust the furniture?
16
anarchmanMay 11, 2010
+6
>We've also discovered that a lot of the verification emails we've been sending out haven't been going through.
It's pretty common knowledge that mail from the EC2 cloud is marked as spam across the board. Some companies (e.g., Acquia) handle this by having an external e-mail server for each instance. They use exim.
6
jeffbarrMay 11, 2010
+6
You can set up reverse DNS for your EC2 instances to solve this problem. More info at http://aws.typepad.com/aws/2010/03/reverse-dns-for-ec2s-elastic-ip-addresses.html .
6
ketralnisMay 11, 2010
+16
We've done this, and it works for some ISPs, but it doesn't solve 100% of the problem.
16
covatiMay 12, 2010
+4
I ran into this problem with my servers. We went with rackspace hosted email. It's very easy to set up (you just configure the mail tool for your language to the login parameters for the smtp server) and off you go.
If you want a more robust solution, then you can set up your SMTP server to queue the messages, but send them out through another server. Which I guess is similar to a mail relay, but without all the nasty 'being a spammer' parts of it.
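As a rough sketch of the first approach (pointing your language's mail tool at a hosted SMTP server), here is what it might look like with Python's standard `smtplib`. Every host name, port, address, and credential below is a placeholder, not a real Rackspace setting, and the actual send is left commented out so the example runs without a network.

```python
import smtplib
from email.message import EmailMessage


def build_verification_email(sender, recipient, verify_url):
    """Build a plain-text verification message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Please verify your email address"
    msg.set_content(f"Click to verify your address:\n{verify_url}\n")
    return msg


def send_via_relay(msg, host, port, username, password):
    """Hand the message to an external SMTP server instead of delivering from EC2."""
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.starttls()                  # upgrade the connection to TLS
        smtp.login(username, password)   # authenticate to the hosted server
        smtp.send_message(msg)


msg = build_verification_email(
    "noreply@example.com",
    "user@example.org",
    "https://example.com/verify?token=abc123",
)
# Placeholder relay settings; substitute whatever your mail provider gives you.
# send_via_relay(msg, "smtp.example.net", 587, "relay-user", "relay-password")
```

The receiving ISP then sees the hosted server's IP address rather than an EC2 address, which is the whole point of going through another server.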
This is a great article describing how to deal with this in EC2: http://pauldowman.com/2008/02/17/smtp-mail-from-ec2-web-server-setup/
Good luck and thanks for all the listnooks.
4
AzuredMay 11, 2010
+272
TLDR: Raldi cheated on some b**** called Cassandra and she threw a fit in the server room.
272
shauncMay 11, 2010
+110
And maybe wound up pregnant. :(
110
chrxsMay 12, 2010
+5
This
>Sometimes a hedge fund or biotech company will come through and fire up 1000 instances in a single EC2 zone to do some number crunching. (..) Since the bulk of our application is in a zone that had this capacity problem on Thursday, we had trouble getting the instance sizes we needed alongside them.
makes it sound like EC2 is a better fit for a model where you need highly variable amounts of processing power with a relatively low priority, while your model is one where you need relatively constant, plannable amounts with a high priority. Is there anything to that thought?
5
GunnerMcGrathMay 11, 2010
+155
> (Still with me?)
Nope. I'm gonna go look at more pictures of cats.
155
onthesubMay 11, 2010
+61
They are so cute :)
61
[deleted]May 12, 2010
+56
Ha. The captions make it look like they can speak English but have bad grammar.
56
clausyMay 12, 2010
+18
they'll fit right in here then.
18
YourDadMay 11, 2010
+19
Last week? Oh yeah, I remember. I pushed and pushed the lever but no brain snacks came out.
19
LuckyDragonNo5May 12, 2010
+14
I wish I knew what any of that meant. So, the files are IN the computer?
14
zemmekkisMay 12, 2010
+1
I realize that you guys moved to Cassandra because you obviously felt it would solve your problems.
Do you feel that this problem was due to the maturity (or lack thereof) of Cassandra?
1
VindexusMay 11, 2010
+24
>We invite all listnook artists to contribute your own images of the alien in strange, exotic locales.
My attempt: http://i.imgur.com/EKj38.png
24
salvadorwiiMay 11, 2010
+40
Here's my attempt
http://imgur.com/KFr2m.png
40
jevonMay 11, 2010
+5
You should get rid of the block shadow and replace it with something more blurred, to go with the rest of the scene. Otherwise it looks awesome :D
5
soyabstemioMay 12, 2010
+9
And mine [http://imgur.com/TjM7m.jpg](http://imgur.com/TjM7m.jpg)
9
unusualbobMay 12, 2010
+4
Here's mine: [http://imgur.com/N7MTF.png](http://imgur.com/N7MTF.png)
Context: Video game party from Halloween '08
4
raldiMay 11, 2010
+11
Anyone know the owners of those characters? With their permission, we'd totally use that.
11
[deleted]May 12, 2010
+14
i think that's the simpsons.
14
raldiMay 12, 2010
+10
I mean *personally.*
10
[deleted]May 12, 2010
+4
So, do I submit my entry here?
http://i.imgur.com/jDjTg.jpg
4
MMxRicoMay 11, 2010
+4
Thank you so much for the explanation, but even with the picture I still don't know what happened. All I know is that listnook is back, thanks to some kick-ass people.
4
tehguywithahatMay 11, 2010
+50
Well I'm sure this will be the last time listnook goes down.
50
dakboyMay 12, 2010
+6
The going down happens over on /r/gonewild
6
Firefoxx336May 12, 2010
So, layman's terms translation?
0
meltedlaundryMay 11, 2010
+35
soon to be: digg's May 2010 "State of the Servers" report (or: Why Digg was down on Wednesday)
35
MidnightTofuRunMay 11, 2010
+21
Don't you mean Thursday?
21
jevonMay 11, 2010
+3
Isn't that the Trend Micro "how to configure Outlook Express" e-mail?
3
russellvtMay 12, 2010
+3
I have to say, that's one of the most awesome pseudo root cause analysis documents I think I have ever read... I compliment you on your transparency, here (tends to be how all great admins learn, though), as well as your seeming sense of humor about the whole thing (if you don't laugh, you'll cry?).
Let us know if you need help configuring your Outlook Express (buhahahahaha... freakin' Trend, how I loathe thee!)
Edit: BTW... thank you guys for the often completely thankless job you perform! You rock!
3
omeganonMay 12, 2010
+2
With regard to "email verification irony", it's well known among mail server admins that you should not send or accept mail from EC2 addresses. I'm really surprised that it's taken this long for you guys to know that. Since Amazon doesn't provide dedicated IP space for you, there's no really accurate way to distinguish mail from Listnook as different from mail from Joe Spammer who's also using EC2. Admins could go the analysis/heuristics route to analyze the content of the message but that's fraught with uncertainty and expensive. It's much easier to treat them as exactly what they are, dynamic IP space, and block all content from them. From our standpoint they are little different than any other major ISP's dynamic customer IP space which we block already.
My suggestion to you would be to configure one or more relay servers in IP space you own, warm them up by gradually increasing the number of messages you send through them and relay all your EC2 originated mail through them.
2
Shiggityx2May 12, 2010
+6
Guys, I got this. Just gonna head down to Tosche Station to pick up some power converters.
6
A-punkMay 11, 2010
+7
So what I'm gathering from the cartoon is that you all dropped a tab of acid and had an epic snail war in the server room where you also planted a massive oak tree to provide listnook with good 'chi' from the east.
And they say computer programming is hard.
7
[deleted]May 12, 2010
-1
Listnook was down on Wednesday? I didn't even notice.
-1
d-cupMay 11, 2010
+26
Who else skipped to the comic before reading?
26
[deleted]May 11, 2010
+24
[deleted]
24
ketralnisMay 11, 2010
+35
Maybe I should have added a love interest
35
d-cupMay 11, 2010
+17
Isn't Cassandra playing that part? Or is she just a crazy ex-girlfriend?
17
[deleted]May 11, 2010
+78
Black magic. Got it.
78
krispykrackersMay 11, 2010
+19
♫ ♪ Listnook's got a black magic woman,
Got me so blind I can't see
That Cassandra's a black magic woman
And she's tryin to make a devil out of me. ♪ ♫
19
AzuredMay 11, 2010
+11
Don't let your ROW-READ-STAGE's PENDING operations grow unbounded, baby
Don't let your ROW-READ-STAGE's PENDING operations grow unbounded, baby
11
adpowersMay 11, 2010
+2
Queues should not be allowed to grow without bound. (not my blog, BTW)
http://gamertroll.blogspot.com/2010/01/maxim-5-no-queue-should-be-without.html
2
ketralnisMay 11, 2010
+10
So if listnook is slow during a load-spike, we should just start throwing away votes? That sounds shitty, maybe your queue system should deal with queues that briefly grow without bound.
10
Arizona_BayMay 12, 2010
+4
This deserves more upvotes (not that ketralnis needs them).
The admins are more concerned with data loss over performance, while trying their best to be equally concerned with both.
4
ketralnisMay 12, 2010
+4
> not that ketralnis needs them
They only pay me in karma
4
[deleted]May 12, 2010
+3
[deleted]
3
adpowersMay 12, 2010
+2
Exactly.
Unless I read the blog incorrectly, it sounded like Cassandra was letting its ROW-READ-STAGE's PENDING operations queue grow unbounded. This resulted in all incoming requests having no chance of being fulfilled since they would time out before reaching the head of the queue. Instead it should immediately reject requests it knew it had no chance of fulfilling. This way some fraction of requests would continue to be processed and your application would quickly be able to handle the error.
I wasn't suggesting that rabbitmq drop its messages, which as I understand do offline processing that keeps all denormalized data consistent. It is reasonable to expect that a queue service would log messages to disk sufficiently long that the site operators could step in and fix everything. However, this queue would still be bounded by the size of your disk. I would expect the queue to start rejecting new messages before it completely fills up the disk, in order to prevent corruption and other nasties.
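To make the idea concrete, here is a minimal sketch of a bounded stage that rejects work up front when it is full and skips requests whose callers have already timed out. The class and method names are invented for illustration; this is not how Cassandra implements its stages.

```python
from collections import deque


class BoundedStage:
    """A work queue with a hard cap that also sheds requests past their deadline."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()

    def submit(self, request, deadline):
        """Queue a request, or return False right away if the stage is full."""
        if len(self.pending) >= self.max_pending:
            return False  # reject now so the caller can handle the error quickly
        self.pending.append((request, deadline))
        return True

    def next_live(self, now):
        """Pop the next request whose caller is still waiting; drop expired ones."""
        while self.pending:
            request, deadline = self.pending.popleft()
            if deadline > now:
                return request
            # Otherwise the caller has already given up, so doing the work is wasted.
        return None


stage = BoundedStage(max_pending=2)
accepted = [
    stage.submit(name, deadline)
    for name, deadline in [("read-a", 5.0), ("read-b", 12.0), ("read-c", 12.0)]
]
print(accepted)               # the third request is rejected up front
print(stage.next_live(10.0))  # read-a has timed out by t=10, so read-b is served
```

Passing `now` in explicitly keeps the sketch deterministic; a real stage would read a clock and would need locking if several threads shared it.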
2
JulianMorrisonMay 12, 2010
+5
Sounds like Cassandra needs some fixing, because the queue full of dead sockets and the mistaken "load" estimation are pure bugs.
5
tiajuanatMay 12, 2010
+5
So the 404 template alien looks like he's staring at goatse...
5
lesighMay 11, 2010
+5
Obligatory: *This is unacceptable! I want my money back.*
5
resephMay 11, 2010
+17
http://www.listnook.com/r/refund
17
onthesubMay 11, 2010
+6
*there doesn't seem to be anything here*
6
DDay629May 11, 2010
-2
27 minutes and still no user submitted 404's? I am disappoint.
Also, I am lazy.
EDIT: Awesome, let's compile them in this thread!
-2
infiniteMay 15, 2010
+2
The fact that you have to put in contingency for someone firing up a quant calculation and taking resources makes ec2 a no deal for me. Coding something that (a) detects such a situation and then (b) handles it is very complicated and prone to failure. Since my stuff simply cannot go down, ever, ever, ec2 just won't fit so I'll continue using my own dedicated servers. It's not as dynamic, but it's way more predictable. If you can't trust the underlying system, you're in for a world of pain.
2
Armitage1May 12, 2010
+2
FYI, Comcast and some others block our website emails as well. The only outgoing email we have is transaction confirmations. Contacting the company yielded similar results to yours. Other widespread issues from customers, even with large petitions from paying customers, are ignored. So I don't expect anything different with a request from a few companies. They clearly don't give a shit.
2
dmanwwMay 12, 2010
+3
I skipped to the cartoon and found that it was blocked at work. If only there was a meme to describe my feelings about this.
198 Comments