[insert witty title]

Can tame almost any shrew




Category: Statistics


Fairtrade?

22 December, 2007 (21:48) | Discussion, Statistics | By: cmb

Imagine for one second that you own a coffee shop (Not a Dutch coffee shop, just a shop selling cups of coffee). Assuming that after all costs have been tallied up it costs you exactly one pound to produce a cup of coffee, how much should you sell that cup for?

No matter what price you pick, the result will always be suboptimal. If you charge a high price, say three pounds a cup, the bleary-eyed commuter may happily pay it for a morning caffeine fix on his daily stumble to work, but you’ll miss out on the custom of the minimum wage employee, who may only be willing to pay £1.10. Similarly, if you price the coffee too cheaply, you’ll do a roaring trade, but all the people who would be willing to pay more end up keeping money in their pockets.
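To put some numbers on this, here is a minimal sketch in Python. Only the £1 cost, the three-pound commuter and the £1.10 customer come from the text above; the other willingness-to-pay figures and the function names are made up for illustration. It compares the best you can do with one price for everybody against charging each customer exactly their limit.

```python
COST = 100  # cost of producing one cup, in pence (the £1 from the post)

# Hypothetical maximum prices, in pence, that five customers would pay.
# Only 300 and 110 appear in the post; the rest are invented.
willingness = [300, 250, 180, 150, 110]


def single_price_profit(price, willingness, cost=COST):
    """Profit when every customer is charged the same price."""
    buyers = sum(1 for w in willingness if w >= price)
    return buyers * (price - cost)


def perfect_discrimination_profit(willingness, cost=COST):
    """Profit when each customer pays exactly what they are willing to pay."""
    return sum(w - cost for w in willingness if w >= cost)


# The best single price is always one of the customers' own limits,
# so it is enough to try each of those.
best_price = max(set(willingness), key=lambda p: single_price_profit(p, willingness))
best_single = single_price_profit(best_price, willingness)
ideal = perfect_discrimination_profit(willingness)

print(f"best single price: {best_price}p, profit {best_single}p")
print(f"charging each customer their limit: profit {ideal}p")
```

With these invented numbers the single-price shop leaves a fair chunk of profit on the table, which is the gap the rest of the post is about.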

In an ideal world you would find out how much an individual customer would be willing to pay and charge them exactly that much. This approach would maximise your profits, whilst keeping a maximal number of people happy. Unfortunately, without a magical telepathic supercomputer at every checkout this will never happen. Coffee shops therefore use a different way of allowing customers to self-select how much they pay, and the mechanism is choice. Take, for example, Fairtrade coffee:

Cafedirect pay farmers an extra 40-55 pence per pound of coffee (enough to almost double their wages). However, a typical cup requires only a quarter of an ounce of coffee beans, meaning that the additional cost to the coffee company is less than a penny per cup. For a time coffee houses were selling Fairtrade coffee at a premium of ten pence, using the Fairtrade label as a way of allowing customers who are willing to pay a bit more for a cup of coffee to do so.
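The "less than a penny per cup" claim is easy to check. This is a quick sketch using only the figures quoted above plus the fact that there are 16 ounces in a pound; the variable names are my own.

```python
OUNCES_PER_POUND = 16
COFFEE_PER_CUP_OZ = 0.25          # a quarter of an ounce per cup (from the post)
PREMIUM_PENCE_PER_LB = (40, 55)   # extra paid to farmers per pound of coffee
MENU_PREMIUM_PENCE = 10           # extra charged to the customer per cup

# How many cups one pound of beans makes.
cups_per_pound = OUNCES_PER_POUND / COFFEE_PER_CUP_OZ

for premium in PREMIUM_PENCE_PER_LB:
    extra_cost = premium / cups_per_pound
    kept = MENU_PREMIUM_PENCE - extra_cost
    print(f"{premium}p per lb -> {extra_cost:.2f}p extra per cup, "
          f"{kept:.2f}p of the {MENU_PREMIUM_PENCE}p premium left over")
```

One pound of beans makes 64 cups, so even at the top of the range the extra cost per cup stays under a penny.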

Indeed, none of the choices on the menu at a coffee shop have production costs that differ by more than a few pence (the actual raw ingredients are pretty cheap; it’s running a supply chain and nationwide stores that dominates the price), whereas the menu price will change by a factor of 2-3 between the cheapest and most expensive options. Choice provides a way for consumers to pay as much as they like. A penny-pincher can get a black coffee for a pound or two, whereas a tourist looking to treat themselves can pay four pounds for a double mochalattechino. The cost to the actual coffee house isn’t too different.

This behaviour is pretty ubiquitous. For a time Amazon would track customer buying habits and alter book pricing accordingly (hey! he bought six Harry Potter books, let’s stick a couple of quid on the price of the last one), but it stopped the practice after customer complaints. Supermarkets sell about a billion different types of onions, from “cheap and cheerful value onions” to “deluxe, organic onions”, and on that note I have two last questions:

How much extra does it cost for a supermarket to sell organic food? How does this difference compare to the markup they charge?

Why do you never see ‘organic strawberries’ on display next to ‘normal strawberries’?

(This post was inspired by (i.e. stolen from) The Undercover Economist)

Of Murder Rates and Methods

10 September, 2007 (00:58) | Statistics | By: cmb

Recently I was reading a Guardian article about how murder rates are changing in Britain. If you listen to the daily news or read tabloid newspapers it becomes very easy to believe that “Britain is burning and we’re getting swept under a tidal wave of gun crime”. The reality is (as usual) a bit more nuanced, and would benefit from an in-depth analysis. Fortunately I stumbled over a paper (Shaw, Tunstall & Dorling (pdf)) in which this analysis is carried out in some detail. In many ways the results are unsurprising, but it still makes for interesting reading so I’ll summarise a few points here.

The most obvious question to ask is “Have murder rates actually increased over time?”. The answer is a resounding yes. Here we see the total murder rate as a function of year (and that the authors of a scientific paper don’t know how to label axes, grrr)

murder0.png

The bar chart and left hand scale represent the total number of murders, and the blue line and right hand scale represent the number of murders per million people. There has been a steady increase in the murder rate for as long as the numbers have been analysed. However, it is useful to break these numbers down into smaller populations.

To look at the effect of social class on murder rate we need to define a measure of poverty and in this case we use the breadline index, a measure of poverty defined on a region by region basis using six variables: unemployment, lack of owner occupied accommodation and lack of car ownership, as well as three ‘at risk’ variables: limiting long term illness, lone parent households, and low social class.

Using this we can see how murder rates changed as a function of social class:

murder1.png

OK so murder rates have stayed approximately constant or even decreased amongst the well off, but have absolutely exploded in the more deprived regions of the country, which gives us our first handy hint:

Ways to avoid being murdered #1: Do not be poor

The authors of the study also look at the ages of murder victims (label your axes, people!):

murder3.png

The bar chart to the left represents males, the one on the right represents females. The vertical axis is ‘Age’ and the horizontal one is ‘Number of murders’. This leads us immediately to our second hint:

Ways to avoid being murdered #2: Do not be male or of working age

Finally the authors break down murders by method:

murder2.png

Amongst the poor, knife crime is overwhelmingly popular, whereas the rich (for some reason) are partial to a good old strangling or shooting.

Ways to avoid being murdered #3: Wear a stabproof jacket if you’re poor, and perhaps a steel gorget if you’re rich

In summary: There has been a substantial increase in the British murder rate over the past few decades (although compared to other countries it is still very low). This increase has been almost completely confined to poor areas and young men, and was on the rise long before all this hysteria about a ‘gun epidemic’ started up in the press. The poor have always been more likely to be victims of crime than others in society. This implies that the increase in crime is most likely to be a reflection of an increase in the number of poor people in the community, and that addressing poverty is the best way to address crime.

I’ll give the final words to the authors of the study:

murderconclusion.png

Book Review: The Long Tail

9 August, 2007 (00:42) | General, Statistics, Text | By: cmb

Walking through Waterstones the other day I saw a copy of The Long Tail, an economics book I have long wanted to read. The Long Tail is all about the power of power laws and the shifting patterns of supply and demand in society. The book was inspired by an article in Wired magazine, which has been mentioned multiple times on this blog and can be read in full here.

So what about power laws? Well, they’re ubiquitous. The sales figures for books? Power law. Number of rentals of DVDs? Power law. Names given to children? Power law. Box office takings? Power law. TV ratings? Power law. Citations per scientist? Power law. Length of encyclopedia entries? Yep. Colours people choose for their kettle? You guessed it. The popularity of blogs? Power law. Youtube video watches? Power. Law.

The central theme of the book is summed up really well by this graph:

ff_170_tail5_f.gif

Here we see how the popularity of titles is distributed. There are relatively few hit products that sell in an incredibly large volume, whereas there are millions of less popular products. This is a pretty typical power law shape.

“Bricks and Mortar” shops can only stock a few items, given the limitations of shelf space. Retailers naturally choose to sell the most popular items and as such can only carry items from the “head” of the curve, i.e. blockbuster movies and hit singles. However, there are many hundreds of thousands of less popular products that shops cannot afford to stock, and together their sales add up to rival those of the traditional hits. Online retailers aren’t so constrained by having to maintain a network of shops and so can afford to offer a much larger range; their catalogues stretch further down the long tail, offering an amount of choice that would have been unimaginable only a couple of decades ago. The extreme case of this phenomenon is something such as iTunes, where the goods are purely digital. The cost of adding a product to the iTunes library is negligible (upload it to a server), so the popular ‘hits’ and more common ‘misses’ are on an equal economic footing and iTunes can quite happily sell them all. It is interesting to note that every single song on iTunes has now sold at least once, showing there is demand, however small, for pretty much anything you can imagine. To further underline this point: 57% of Amazon’s revenue now comes from products far enough down the curve that they are not even sold in brick and mortar shops. This is an absolutely incredible shift in the way we consume things. Chris Anderson describes it as follows:

Hit-driven economics is a creation of an age without enough room to carry everything for everybody. Not enough shelf space for all the CDs, DVDs, and games produced. Not enough screens to show all the available movies. Not enough channels to broadcast all the TV programs, not enough radio waves to play all the music created, and not enough hours in the day to squeeze everything out through either of those sets of slots.

This is the world of scarcity. Now, with online distribution and retail, we are entering a world of abundance. And the differences are profound.

So what does this mean for us? A massive amount of choice is useless on its own: how do we know what we want to see? There is a lot of crap down amongst the unpopular items of music and literature. This is where the second half of the theory of the long tail comes in: filters. Think Amazon’s “Other people who bought this item…”, think specialised blogs, think Google’s PageRank algorithm, think specialised search and recommendation engines, think user reviews. The collective wisdom of the people (along with some clever coding) allows us to navigate the sea of niches, tracking down obscure things that we like in a way that is completely impossible in a brick and mortar shop.

This is already beginning to have an effect on society. As more and more avenues for consumption open up, more and more people get interested in (well… they already were interested, but it’s now possible to pursue) niche pursuits. Be it ambient dub music, the history of lace making or ceramic crocodiles, it is now possible to track down communities, blogs, product recommendations, and out of print books incredibly easily. Monoculture is fragmenting as people begin to follow their own interests. Just look at the numbers for television: in the 1950s over 70% of American households watched the most popular TV show, ‘I Love Lucy’, on a Sunday night. Today the TV shows with the highest ratings (CSI) attract only 15% of the population. The total amount of TV consumption certainly hasn’t decreased since the 50s, so the explanation we’re left with is that when offered choice people actually do have different tastes and those steep power laws are suddenly beginning to flatten off…

“TV is not vulgar and prurient and dumb because the people who compose the audience are vulgar and dumb. Television is the way it is simply because people tend to be extremely similar in their vulgar and prurient and dumb interests and wildly different in their refined and aesthetic and noble interests.”
–David Foster Wallace

It’s a fantastic book and I would heartily recommend that everybody reads it. The author additionally keeps a long tail blog here, which itself makes for a fantastic read.

InterTrends

20 May, 2007 (20:11) | Statistics | By: cmb

Recently I have been playing with Google Trends (yes I know it has been mentioned previously on this site). Google Trends is a tool that lets you chart how often people searched for different terms over time. I love it! Some of the trends that come out of the tool are really amusing.

Sometimes yearly trends are apparent:

We can chart the relative growth of different companies:

See that they have Mother’s Day on a different day in America:

Watch people search for the current year:

And see them dread getting a hangover on new year’s day.

The Fabulous Graph Machine

29 January, 2007 (20:16) | Statistics | By: cmb

I guess most readers of this blog have figured out by now that I am a big fan of graphs, and when I found this website it was like being invited into graph heaven. It allows you to plot a variety of graphs using statistics about many of the countries in the world.

It’s really easy to use: just select what you want on the x and y axes and then what metric you want to use to scale the point size, and the machine plots a pretty graph. Just be sure to remember the difference between correlation and causation, since here we appear to have proved that the internet makes you infertile:


(point size proportional to life expectancy)

and here that carbon dioxide saves lives:


(point size proportional to population)

The really fun things happen when you notice that you can animate these plots through time (the little play button at the bottom right) and watch e.g. the income vs. life expectancy evolve for all of the African nations.

This is really good fun and I’d heartily recommend that everybody has a go. If you make anything cool let me know.

WikiCounting #2: Battle Royale

26 January, 2007 (00:35) | Statistics | By: cmb

In the last episode of WikiCounting I investigated how Wikipedia treated history, by calculating the size (in characters) of the Wikipedia entries for each individual year. The results were very interesting. In the comments of that post PEB mentioned that it would be fun to see how the size of the Wikipedia article for a specific year is correlated with the number of links that Google turns up.

Here is what you find if you plot the number of links turned up by Google against the number of characters on the Wikipedia page:

I used Google search queries of the form “year XXXX” (with the quotes), which should ensure that most of the results returned are relevant to the given year. There is a really good correlation (the odd points at the lower end of each spectrum are mainly years around the 1500s).

It’s interesting to note that there are only two points with over ten million Google links: 2000 and 2005 (the graph stops at 2005). In the year 2000 every single blog, news site and poorly designed geocities webpage was ranting about the Y2K bug and so that point got boosted massively above the others around it. Just look at the numbers:

(year, number of google links)
1998 1,520,000
1999 1,550,000
2000 15,600,000 !!!!
2001 2,760,000
2002 4,480,000
2003 5,890,000
2004 8,100,000
2005 15,400,000

2006 37,500,000
2007 23,200,000 (and we’re still in January!)

It looks like the number of links has grown by a factor of somewhere between 1.3 and 2.4 every single year since 2001; some may say that Google is growing at a geometric rate!
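For anybody who wants to check the growth claim, here is a small sketch that works out the year-on-year ratios from the table above. I have left out 2000 (the Y2K spike) and 2007 (still in progress); the variable names are just for illustration.

```python
# Google link counts for "year XXXX" queries, copied from the table above.
links = {
    2001: 2_760_000,
    2002: 4_480_000,
    2003: 5_890_000,
    2004: 8_100_000,
    2005: 15_400_000,
    2006: 37_500_000,
}

years = sorted(links)

# Ratio of each year's count to the previous year's count.
ratios = {y: links[y] / links[y - 1] for y in years[1:]}
for year, ratio in ratios.items():
    print(f"{year - 1} -> {year}: x{ratio:.2f}")

# Geometric mean growth factor over the whole 2001-2006 span.
span = years[-1] - years[0]
mean_growth = (links[years[-1]] / links[years[0]]) ** (1 / span)
print(f"average yearly growth factor: {mean_growth:.2f}")
```

The individual ratios bounce around quite a bit, so the geometric mean is probably the fairer single number to quote.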

next time: We learn which deity is preferred by the people of the internet!

Also give me more ideas please

—–

WikiCounting #1: The Inevitable March of Time

24 January, 2007 (22:09) | Statistics | By: cmb

After introducing my wiki wordcount script in this post I asked for ideas of things to count. One great suggestion from Abbie was to do ‘years’.

Here it is.

Here we have plotted year against ‘number of characters in the wikipedia entry’. There are a couple of lines marked on the graph (I gave them fancy names because that’s just the way I roll).

The dark, horizontal line is the ‘Base Historical Interest’ (BHI). Further back than a certain point (around 1850) ‘normal people’ stop being interested in what happened on a month by month basis in a certain year, and so the articles are written by and for a small minority (historians? those with family connections to a certain event?). The BHI level represents the amount of work that Wikipedians put into an average, uninteresting year. Most of the letters are actually in the form of the sidebar and headers, which my script does not yet automatically remove.

The light, diagonal line is the ‘Linear Interest Growth’ (LIG) trend. As we get closer and closer to the present day people become more involved with the subject and there is a corresponding increase in the number of words that get written.

I find it really interesting that when something historically significant happens, the trend lifts itself from the LIG and we see an increase in wikipedia interest. I have labelled a few LIG-excess events here:

The mystery peak of 1903 is due to somebody having messed up the formatting on the 1903 wikipedia page so all the tags get counted in the wordcounter. Naturally after wikipedia was introduced (2001) the number of entries on the year page exploded as people began to add things in real-time.

late edit: I just ran the script all the way back to 1500. The results are here. With this extra data I’m beginning to think that an exponential may make for a better fit.

The point is moot really: just like trying to observe an uncertain electron, the act of using Wikipedia as a measure of history has been affected by the existence of Wikipedia itself, and any and all trends will have been smashed by 2001.

late edit 2: Give me more ideas for things to count

The Wild World of Wiki Wordcounting

22 January, 2007 (21:51) | Statistics | By: cmb

It is a sad reality of life that my usual routine of ‘doing shit on the internet’ has been disrupted by ‘doing work’ and as such the update rate on the blog has dropped.

I thought that today I would share something I have been doing in my spare time for the past couple of days: Wordcounting Wikipedia. I put together a little script that takes a list of words, finds the corresponding Wikipedia page for each word and counts the size of it. Here is an example (click for full size): it contains the size of the Wikipedia entry for every country with a name starting in the range A-K.

Cool, Eh? I have had far too much fun just seeing who the big winners and losers are in the world of internet. I’ll finish off the graph soon, but I’m feeling lazy right now. Here are a few facts I’ve learnt about the world (A through K inclusive)

  • Argentina has a freakishly long wiki entry, I have no idea why
  • It took me three attempts to spell Kyrgyzstan correctly
  • There are no countries that start with the letter X, and only one each that starts with Q, O and Y
  • The peoples of Brunei do not enjoy writing on wikipedia

What about digging deeper into the data, using new and powerful statistics to let us view the world in a whole new way? The statistic I’m going to introduce now is probably as powerful as the legendary MEPP. It is called the WDQ or Wiki Drama Quotient. Every page on Wikipedia has an associated ‘Talk’ page, where beardy internet people can argue about what gets put in the article itself. I would contend that the ratio of the length of a discussion page to its article gives a measure of how much drama a particular term carries with it. For example, let’s take a look at the age old battle between Tea and Coffee:

Coffee, 29027
Talk:Coffee, 64620
Tea, 43991
Talk:Tea, 35143

Although Tea has a longer article, its WDQ is much lower than that of Coffee (0.80 vs. 2.23), suggesting that in the internet world tea is more popular than coffee, but it has a calmer history and there is not much in the way of disagreement about it (although on the talk pages there are a few sharp exchanges about the merits of different shapes of teabag).
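For the record, the WDQ is just talk-page length divided by article length. Here is a minimal sketch of the sum using the four sizes listed above; the `wdq` function name and the dictionary layout are mine, not part of the original script.

```python
# Page sizes in characters, as reported in the post.
sizes = {
    "Coffee": 29027,
    "Talk:Coffee": 64620,
    "Tea": 43991,
    "Talk:Tea": 35143,
}


def wdq(title, sizes):
    """Wiki Drama Quotient: talk-page length divided by article length."""
    return sizes[f"Talk:{title}"] / sizes[title]


for title in ("Coffee", "Tea"):
    print(f"{title}: WDQ = {wdq(title, sizes):.2f}")
```

Anything with a WDQ above 1 has more argument than article.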

What about figuring out what the best colour is?

Blue. Obviously. Anybody that disagrees is fooling themselves.

The main point of this blog post is that I need ideas for things to investigate: places? football teams? branches of the physical sciences? Using this stunning technique we can, once and for all, settle a lot of arguments.

p.s. also could somebody write a guest post for me, I’ll buy you a beer. You can even be anonymous if you wish.

Battle of the Giants

16 January, 2007 (17:46) | Internet, Statistics | By: cmb

(this post in collaboration with NPR)

Over the past few days a couple of us have been trying to figure out the economy of the internet. Or more specifically: Of ‘internet gambling’ and ‘internet pornography’, which industry is larger?

Firstly what does the ‘largeness’ of an industry mean? Its turnover? profits? number of users? It’ll likely be quite hard to get reliable numbers of pornography users (although the number of online gamblers has been estimated at 12 million as of 2003) so I’ll restrict this analysis to talking about the gross revenues for each business.

Gambling: In 2001 the gaming board for Great Britain estimated that “Total online gambling – £21 billion”. This figure is split into:

Online Horseracing - £2 billion
Online Sports betting - £9 billion
Online casinos - £5 billion
Online Lotteries - £5 billion

These figures represent the turnover (that is: total bets placed); gross profits (i.e. bets placed - winnings) are much smaller, of order 2.2 to 3 billion pounds circa 2000. In the period 2000-2006 online gambling experienced absolutely explosive growth and I would not be surprised if the figures today were a factor of 5 or more higher than these (backed up by the graph on this page, showing a revenue growth from 2bn in 2001 to just shy of 15bn in 2007). I’ll therefore estimate that total Online Gambling Revenue is around £10+ billion.

This seems pretty much in line with future predictions, e.g. $48bn (£24bn) by 2010 (Merrill Lynch).
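Here is a rough sketch of the arithmetic behind that estimate. The start and end values are the ones quoted from the graph (with "just shy of 15" rounded up to 15, which is my assumption), and the variable names are purely illustrative.

```python
# Revenue in billions, as quoted from the graph; 15.0 is "just shy of 15" rounded up.
start_year, start_revenue = 2001, 2.0
end_year, end_revenue = 2007, 15.0

# Overall growth and the equivalent constant yearly growth factor.
growth_factor = end_revenue / start_revenue
annual_growth = growth_factor ** (1 / (end_year - start_year))
print(f"overall growth: x{growth_factor:.1f}")
print(f"equivalent yearly growth: x{annual_growth:.2f}")

# Applying the post's more cautious factor of 5 to the circa-2000
# gross profits of 2.2 to 3 billion pounds.
CAUTIOUS_FACTOR = 5
low, high = 2.2 * CAUTIOUS_FACTOR, 3.0 * CAUTIOUS_FACTOR
print(f"estimated range today: £{low:.0f}bn to £{high:.0f}bn")
```

The graph implies a bigger multiplier than the factor of 5 used for the estimate, so "£10+ billion" sits at the cautious end.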

Pornography: I have really struggled to find figures on the size of this business. The BBC recently reported it to be ‘worth £29bn’, but I don’t know whether this is revenue, net worth, or turnover, so I need to discount this number.

NBC wrote recently that the industry is a “$2.5bn business”. Other sites put the revenue at $1bn in 1998, and a 2001 report from UK group Analysys forecast that broadband erotica would be worth US$3 billion by 2003. In 1998 Datamonitor forecast that adult content would generate US$1.7 billion revenue in 2000. Finally it has been estimated that the global “online adult market” was worth around US$2 billion a year. (last three numbers from here)

These numbers are all a bit out of date and I wouldn’t be surprised if with the advent of widespread broadband coverage this market grew in the same way as the gambling market.

Summary: From a completely untrained and haphazard look at the numbers I think that around the start of the millennium the online gambling and online pornography businesses were both valued at around $2bn, although there is a lot of scatter in the numbers from different sources. Today the size of the online gambling business is slightly greater than that of the online pornography business.

Obviously I’m no accountant, so this conclusion should be taken with a shovelful of salt. Additionally I have not factored in how the US government’s recent ban on online gambling will affect the gambling numbers.

Miscellaneous Mindblowing Facts:

  • Total global gambling turnover (offline and online) - £638 billion.
  • About 200 pornographic films are shot in the US every week
  • Adult entertainment model Jasmine Mai told the BBC: “The adult industry
    is bigger than every professional sport combined. It’s part of life - it’s
    mainstream now.”
  • Annual PPARC budget: 450-500mil

—–

Saddam Poll

30 December, 2006 (18:44) | News, Statistics | By: cmb

Courtesy of everybody’s favourite news source, Fox News, here is a poll about the execution of Saddam Hussein:

I don’t know whether to laugh or cry. Think I’ll probably do a bit of both.

Money Can’t Buy Happiness…

22 November, 2006 (23:37) | Statistics | By: cmb

As regular readers of this blog may know, I like graphs! Especially when they let us look at the world around us in a slightly different way. Here is one I found today:


Source: Modernization and Postmodernization (Princeton, 1997).

This is a plot of the percentage of the population who are either satisfied or happy with their lives against the GNP of that country. It is pretty clear that money buys happiness (well, unless we want to argue correlation and causation again! but I think this one is probably sound).

I think what we are seeing here is that in poor countries, a tiny boost in GNP brings great benefits to quality of life (and hence happiness), as previously unaffordable conveniences (electricity, running water, a good sewage system) become affordable and life becomes much more pleasant. After some point, people have all the things needed to make them comfortable, and further increases in GNP go towards luxuries (cars, televisions, etc.), which increase happiness by a small amount.*

If I were rich I’d surely be too busy firing money out of my solid gold cannon, eating diamonds for breakfast and killing and eating endangered animals to be sad. Also I could probably buy some really good friends :)

*late edit: On further viewing I think my initial assessment may be wrong. Most of the countries on the tail to the left of the graph are the ex-Soviet republics, so maybe the collapse of the USSR and subsequent fragmentation of these countries has something to do with it. Most of the African nations seem pretty happy. Maybe the correlation between GNP and happiness is not that strong after all. That said, I still want to be a millionaire, just to find out!

Mapping Madness!

11 July, 2006 (14:20) | Statistics | By: cmb

A while ago I stumbled over some deformable maps that everybody seemed to find pretty interesting. Here are a few more showing data from the 2004 US presidential election. Firstly coloured by state:

Oh my! That is far too much red. Things change a little bit if we show the individual counties in each state:

And it gets really interesting if we scale each county in proportion to its population:

Finally, we can make the map a much more pleasing blue colour by shading each county by the proportion of the vote that went each way:


I’d love it if somebody could find the page where these maps came from. I lost the URL ages ago and luckily had these ones sat on my hard drive.

—–

Warped Worldview

11 May, 2006 (00:40) | Statistics | By: cmb

This is a map:


Maps are pretty awesome, and are very useful for things like navigating in unfamiliar territory. What about different ways of visualising the map? Well a researcher at the University of Michigan managed to create an algorithm that can deform countries based on any number*. For example here is a map of the earth with each country weighted by its population:

But the really interesting stuff happens when we pick other numbers to examine. For example, the number of patents issued from a country:

You see that massive deep purple blob to the right of the map? That’s Japan.

What about looking at the number of children (10-14) in the workforce:


Why hello thar Africa and Asia. We really appreciate all the cheap goods you’re sending our way.

I would encourage everybody to take a look at some of the maps on the worldmapper.org site. It is sometimes really enlightening to look at the world in a different way.

late edit: Turns out I was a little unclear when describing what the maps show. In each one the total area of land on the earth is unchanged, but each country is given a size proportional to the quantity being measured.
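In other words, the sizes work like this. The sketch below only computes the target area for each country, not the clever deformation that actually redraws the borders, and the three "countries" and their numbers are completely made up for illustration.

```python
def cartogram_areas(true_areas, quantity):
    """Share out the unchanged total land area in proportion to `quantity`."""
    total_area = sum(true_areas.values())
    total_quantity = sum(quantity.values())
    return {
        name: total_area * quantity[name] / total_quantity
        for name in true_areas
    }


# Made-up figures, purely for illustration.
true_areas = {"A": 100.0, "B": 300.0, "C": 600.0}
population = {"A": 50, "B": 30, "C": 20}

new_areas = cartogram_areas(true_areas, population)
for name, area in new_areas.items():
    print(f"{name}: {true_areas[name]:.0f} -> {area:.0f}")
```

The small but crowded country A balloons while the big, empty country C shrinks, and the total stays the same.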

All images from SASI Group, Univ. Sheffield/M Newman, Univ. Michigan

*According to the New Scientist article, his method involves the physics of heat transfer, molecular mixing and a mathematical tool called the fast Fourier transform… sounds nice.

Quantifying Life

9 May, 2006 (23:43) | Statistics | By: cmb

I have recently been reading Freakonomics, which aside from the horrendous cover makes for a very interesting insight into various and sundry facets of life. One thing in particular that caught my eye was a table used by the state of Connecticut to compensate workers for work-related injuries:


I guess there are two things that really struck me about this table. Firstly for the table to exist there must be people out there whose job consists of trying to quantify such horrific events as losing an eye or an arm. It must take a special sort of detachment to be able to appraise many hundreds of cases of people being incredibly hurt, look through every case history and at the end of it all pluck out one number that decides how bad–in the eyes of the law–their injury actually is.

Secondly becoming disabled is a threat that hangs over every single one of us (I’d like to suggest that we don’t call able bodied people ‘able bodied’ but rather ‘not yet disabled’) and an utterly life destroying accident is only seconds around the corner. And, you know what…

…if the worst happens to you the proper procedures, regulations and compensations for your tragedy and loss have already been set to paper by some faceless economist in a city miles away.

Sobering, no?