Posts Tagged ‘statistics’

How to: What to do when your model fails?

Sometimes (well, most of the time) your favorite data mining methods and the most obvious attributes are not good enough. What to do then? A common idea is to try every other model your software provides and/or add every attribute you can think of, whatever its relation to your problem. In this post, I will try to put together a kind of “how to” for this case.

Step 1: What is my model?

If your model is a neural network, it’s quite hard to get any insight into how it works by looking at the weights or neuron functions. How could you improve something you don’t understand?

Read more…

Categories: How To

Data mining tools

When it comes to data mining, the tool you use is very important. It seems that people use many software packages (see How many software packages is too much?). I’m currently using three tools: Weka, R and Microsoft Excel. When I have to, I also program my own tools. Here is why I need all of them.
Read more…

Categories: Tools

What is the value of your work?

It’s a damn good question, and one whose answer should be tightly correlated with your salary in a utopian world. In other words, how do you justify your existence?

Suppose you are building a new version of a product (software, a website, a car, whatever). It can be used internally in your company or sold to clients. In the latter case, you might think you can look at the sales to see whether your work was valuable. You can assume the difference in profit (positive at minimum) is a direct consequence of your work. Considering your salary and other costs, you can compute your ROI. If you don’t sell it, the people who use your new product could be instrumented, i.e. you can see whether they improved their sales or profits. Not so easy.
Read more…

Categories: Thoughts

First statistics about Twitter users

Recently I discussed how to get some data from Twitter. So far, I have downloaded 6859 profiles. Here I will give some information about them. Of course, it’s only a very small subset of the whole Twitter community.

First, the location field. Here are the 20 most frequent locations:

+-------------------+-------+---------+
| location          | count | proba(%)|
+-------------------+-------+---------+
|                   |  1787 | 26.0534 |
| london            |   327 |  4.7675 |
| los angeles       |   159 |  2.3181 |
| los angeles ca    |   113 |  1.6475 |
| uk                |    67 |  0.9768 |
| new york          |    65 |  0.9477 |
| london uk         |    55 |  0.8019 |
| usa               |    53 |  0.7727 |
| washington dc     |    47 |  0.6852 |
| new york ny       |    44 |  0.6415 |
| california        |    44 |  0.6415 |
| san francisco ca  |    40 |  0.5832 |
| canada            |    31 |  0.4520 |
| everywhere        |    31 |  0.4520 |
| nyc               |    31 |  0.4520 |
| san francisco     |    30 |  0.4374 |
| chicago           |    28 |  0.4082 |
| la                |    27 |  0.3936 |
| new york city     |    26 |  0.3791 |
| manchester        |    23 |  0.3353 |
+-------------------+-------+---------+

A quarter of the users don’t fill in the location field. The same real location can appear under many different field values: Los Angeles, for instance, shows up as los angeles, los angeles ca, la, … Merging such synonyms, I found that 6.25% of the declared locations are Los Angeles, 9.56% London and 4.69% New York. These figures are slightly overestimated, since some locations called london are not London in the UK, for instance, but such cases are relatively few. It would be interesting to try to extract an OLAP dimension from such data, at least (country, state, city).
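As a rough illustration of the synonym merging described above, here is a minimal Python sketch. The counts are copied from the top-20 table; the synonym map (`CITY_SYNONYMS`) and the function names are assumptions made for this example, not the mapping actually used for the figures in the post. Because it only covers the top 20 values, it yields lower shares than the percentages quoted above, which presumably include rarer spellings as well.

```python
from collections import Counter

# Counts copied from the top-20 table (the empty location is left out).
top_locations = {
    "london": 327, "los angeles": 159, "los angeles ca": 113, "uk": 67,
    "new york": 65, "london uk": 55, "usa": 53, "washington dc": 47,
    "new york ny": 44, "california": 44, "san francisco ca": 40,
    "canada": 31, "everywhere": 31, "nyc": 31, "san francisco": 30,
    "chicago": 28, "la": 27, "new york city": 26, "manchester": 23,
}

# Hypothetical synonym map: raw field value -> canonical city name.
# "la" and "new york" are ambiguous (Louisiana, New York State), so this
# mapping slightly overestimates the cities.
CITY_SYNONYMS = {
    "los angeles": "Los Angeles", "los angeles ca": "Los Angeles",
    "la": "Los Angeles",
    "london": "London", "london uk": "London",
    "new york": "New York", "new york ny": "New York",
    "nyc": "New York", "new york city": "New York",
}

TOTAL_PROFILES = 6859   # profiles downloaded
EMPTY_LOCATIONS = 1787  # profiles with an empty location field


def merge_synonyms(counts, synonyms):
    """Sum the counts of all raw values that map to the same canonical name."""
    merged = Counter()
    for raw_value, n in counts.items():
        merged[synonyms.get(raw_value, raw_value)] += n
    return merged


def share_of_declared(count, total=TOTAL_PROFILES, empty=EMPTY_LOCATIONS):
    """Percentage of the non-empty (declared) locations."""
    return 100.0 * count / (total - empty)


merged = merge_synonyms(top_locations, CITY_SYNONYMS)
for city in ("London", "Los Angeles", "New York"):
    print(f"{city}: {merged[city]} profiles, "
          f"{share_of_declared(merged[city]):.2f}% of declared locations")
```

The same dictionary approach could be extended to a (country, state, city) tuple per raw value, which would be a first step toward the OLAP dimension mentioned above.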

Next, I want to see how unrepresentative my Twitter subset is of the whole Twitter database. I know that, with my procedure, the probability that a profile is selected grows linearly with its number of followers. If there is no trouble with Twitter, the total number of following links equals the total number of follower links, since they are the same links seen from both ends: if a follows b, then b is followed by a. Over the whole population, the average number of followers must therefore equal the average number of accounts followed.

In my subset, the average number of followers is around 12,000 and the average number of following is 1,500. On average, a user has 8 times as many followers as following. That is very far from the real population. Thus my subset can hardly be used to make inferences about the whole population.
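The effect can be reproduced on synthetic data. The sketch below is only a toy simulation under assumed parameters (2,000 users, 20,000 links, a Pareto-shaped popularity), not a model of the real Twitter graph or of my crawler. It builds random follow links, checks that followers and following have the same average over the whole population, then draws a sample where each profile is picked with probability proportional to its follower count.

```python
import random


def simulate(n_users=2000, n_links=20000, sample_size=200, seed=0):
    """Toy follow graph plus a follower-weighted sample of its users."""
    rng = random.Random(seed)
    followers = [0] * n_users
    following = [0] * n_users

    # Assumed popularity weights: a few users attract most of the links.
    popularity = [rng.paretovariate(1.2) for _ in range(n_users)]
    followed_users = rng.choices(range(n_users), weights=popularity, k=n_links)

    # Each link "a follows b" adds one following to a and one follower to b.
    # Duplicate links and self-follows are not filtered out (simplification).
    for b in followed_users:
        a = rng.randrange(n_users)
        following[a] += 1
        followers[b] += 1

    # Selection probability proportional to the number of followers.
    sample = rng.choices(range(n_users), weights=followers, k=sample_size)

    def mean(values):
        return sum(values) / len(values)

    return {
        "population_followers": mean(followers),
        "population_following": mean(following),
        "sample_followers": mean([followers[u] for u in sample]),
        "sample_following": mean([following[u] for u in sample]),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

In the whole population both averages equal the number of links divided by the number of users. In the weighted sample, the average number of followers is inflated, while the average number of following is not pushed up in the same way. That is the same kind of imbalance as the one observed in my subset.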

The overall correlation between these two attributes is 0.34. That is less than I would expect, but I suspect this correlation depends heavily on the type of user (and currently we don’t know the type of each user).
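For reference, here is a minimal sketch of how such a correlation can be computed, assuming the usual Pearson coefficient (the post does not say which coefficient was used). The `pearson_by_group` helper and its `user_type` label are hypothetical: they show how the coefficient could be computed separately per type of user once such a label exists.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pearson_by_group(rows):
    """rows: iterable of (user_type, followers, following) tuples.

    Returns one coefficient per user_type (hypothetical label).
    """
    groups = {}
    for user_type, n_followers, n_following in rows:
        xs, ys = groups.setdefault(user_type, ([], []))
        xs.append(n_followers)
        ys.append(n_following)
    return {t: pearson(xs, ys) for t, (xs, ys) in groups.items()}
```

Calling `pearson(followers, following)` on the two columns of the profile table gives the overall coefficient; `pearson_by_group` would show whether a weak overall value hides strong but different relationships inside each group.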