Posts Tagged ‘data mining’

Data mining tools

When it comes to data mining, the tool you use is very important. It seems that people use many software packages (see How many software packages is too much?). I’m currently using three tools: Weka, R and Microsoft Excel. When I have to, I also program my own tools. Here is why I need all of them.

A Twitter users segmentation

Now it’s time to create some clusters from our Twitter data. In this post, we focus only on biographical tags and use the good old k-means algorithm to find significant clusters. At least we hope so.

When is a token a tag?

After some first statistics about the Twitter dataset, I try to go further. In an earlier post, I discussed how to extract tokens from Twitter, more precisely from the biography and the last tweets of users. The problem is that the bio attribute is composed of 11702 tokens. No way all these tokens are interesting (do itv1, rmer or 112/xm give you any insight?). But how do we remove all the uninteresting tokens while keeping the good ones? As always, there is a tradeoff between keeping too much noise and removing some gold nuggets. In my view, an interesting token is a tag (as a tag usually gives you some knowledge about an item). The problem is that for each user, you have a set of tokens. What you want is a set of tags, i.e. tokens of interest.

I found two ways to build the set of kept tags from all the tokens. The first, the whitelist approach, checks each token against a whitelist: if the token is not in the list, it is removed; if it is, the token is a tag. The second, the blacklist approach, also checks each token against a list, but keeps only the tokens that are not in it. Both methods have drawbacks. The whitelist is more likely to remove interesting tokens that were not spotted when the list was built: twitter should be a tag, but if you made the whitelist two years ago, you won’t have whitelisted it (unless you’re a visionary). Thus, you could miss some emerging trends. The blacklist has the opposite drawback: you keep too much. Considering my goals, I chose the whitelist approach. Now it’s time to build the list of tokens that are really tags.
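To make the difference between the two approaches concrete, here is a minimal Python sketch. The function names, the sample tokens and both lists are made up for illustration; they are not taken from the actual dataset.

```python
def keep_whitelisted(tokens, whitelist):
    """Whitelist approach: a token is a tag only if it appears in the whitelist."""
    return [t for t in tokens if t in whitelist]


def drop_blacklisted(tokens, blacklist):
    """Blacklist approach: a token is a tag unless it appears in the blacklist."""
    return [t for t in tokens if t not in blacklist]


# Hypothetical bio tokens for one user and two small hand-made lists.
bio_tokens = ["i", "love", "music", "twitter", "rmer"]
whitelist = {"love", "music"}  # an old list, built before 'twitter' mattered
blacklist = {"i"}              # only the noise someone thought of listing

whitelist_tags = keep_whitelisted(bio_tokens, whitelist)
blacklist_tags = drop_blacklisted(bio_tokens, blacklist)
```

With these toy lists, the whitelist drops the emerging token twitter, while the blacklist lets the noise token rmer through, which is exactly the tradeoff described above.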

The easiest solution is to check each token in the dataset by hand. While probably the best solution, I found it boring and not in the spirit of data mining. Thus I use some tricks, which you can call a priori knowledge or choices:

  • Trick 1: A tag is in general an English noun. English because handling multiple languages would be a pain. I have no idea why it should be a noun, but all the tags I can think of are nouns. Thus, we can use WordNet, which is an English lexical database. We just have to dump all its nouns into our whitelist. This trick takes out 7159 tokens (61%).
  • Trick 2: A tag should be used many times. Could we extract a pattern from a tag used only once? I know that none of the algorithms I can use on this dataset could do that. Thus, it is useless to keep such tokens. Of course, this decision could not be made if the dataset were growing (the tag could be used more in the future). With a threshold of 10 occurrences, this discards 10911 tokens (93%).
  • Trick 3: A tag has more than 3 characters. As for the first trick, all the interesting tags I have in mind have more than 3 letters. The most frequent tokens are ‘i’, ‘my’, ‘you’ (and ‘love’, but the pattern ‘i love you’ occurs only once, in a profile whose bio attribute is “give me a love, and i will love you more :3”, funny). This trick dismisses 1193 tokens (10%).
  • Maybe you have other tricks?
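The three tricks can be combined into a single filter. The Python sketch below is only an illustration: `select_tags`, its parameters and the sample data are hypothetical, the `nouns` set stands in for a real WordNet noun dump, and counting every occurrence of a token across users is an assumption about how the token table is laid out.

```python
from collections import Counter


def select_tags(user_tokens, nouns, min_count=10, min_length=4):
    """Return {token: count} for tokens that pass all three tricks.

    Trick 1: the token is in the noun whitelist.
    Trick 2: the token occurs at least min_count times overall.
    Trick 3: the token has more than 3 characters (length >= min_length).
    """
    counts = Counter(
        token for tokens in user_tokens.values() for token in tokens
    )
    return {
        token: n
        for token, n in counts.items()
        if token in nouns and n >= min_count and len(token) >= min_length
    }


# Hypothetical toy data: three users and a tiny stand-in for the noun list.
user_tokens = {
    "u1": ["i", "love", "music"],
    "u2": ["music", "news", "i"],
    "u3": ["music", "itv1", "news"],
}
nouns = {"i", "love", "music", "news"}

# The threshold is lowered to 2 because the toy dataset is tiny.
tags = select_tags(user_tokens, nouns, min_count=2)
```

Here ‘itv1’ fails trick 1, ‘love’ fails trick 2 and ‘i’ fails trick 3, so only ‘music’ and ‘news’ remain.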

Thus, using all 3 tricks we end up with only 430 tags, which is easier to manage and read. Here are the most used tags:

+-----------+----------------------+
| token     | count(distinct user) |
+-----------+----------------------+
| twitter   |                  243 |
| love      |                  238 |
| news      |                  208 |
| life      |                  170 |
| music     |                  167 |
| world     |                  143 |
| social    |                  138 |
| more      |                  132 |
| writer    |                  130 |
| people    |                  119 |
| like      |                  111 |
| time      |                  108 |
| have      |                  104 |
| business  |                  102 |
| marketing |                   98 |
| blogger   |                   88 |
| work      |                   85 |
| internet  |                   81 |
| lover     |                   81 |
| real      |                   79 |
+-----------+----------------------+

As you can see, there are still many tokens which are likely to be meaningless (‘more’, ‘have’). Nevertheless, it’s easier to see things. Uninteresting tags are also unlikely to be considered as a pattern by our algorithms. At least we hope so :-)

OK, that was it for the biography. Now the content attribute. It’s more challenging, with 126,566 tokens! Using these 3 tricks reduces the number to 3207. Could all these tags give some insight?

PS: For the record, here is the SQL query:

select twi.token, count(*)
from twitterBioToken twi
join WordNetTokens dico on dico.token = twi.token
where type = 'noun' and length(twi.token) > 3
group by twi.token
having count(*) >= 10
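To try the query without the original database, here is a small Python sketch using the standard sqlite3 module. Only the table names and the token and type columns come from the query; the `userId` column, the exact table layouts and all the rows are assumptions made up for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE twitterBioToken (userId INTEGER, token TEXT);
    CREATE TABLE WordNetTokens (token TEXT, type TEXT);
""")

# Toy rows: four tokens used by 10 users each, plus 'news' used by only 9.
rows = []
for user_id in range(10):
    rows += [(user_id, "music"), (user_id, "art"),
             (user_id, "rmer"), (user_id, "write")]
for user_id in range(9):
    rows.append((user_id, "news"))
conn.executemany("INSERT INTO twitterBioToken VALUES (?, ?)", rows)

# Tiny stand-in for the WordNet dump ('rmer' is deliberately absent).
conn.executemany(
    "INSERT INTO WordNetTokens VALUES (?, ?)",
    [("music", "noun"), ("news", "noun"), ("art", "noun"), ("write", "verb")],
)

QUERY = """
    select twi.token, count(*)
    from twitterBioToken twi
    join WordNetTokens dico on dico.token = twi.token
    where type = 'noun' and length(twi.token) > 3
    group by twi.token
    having count(*) >= 10
"""
tags = conn.execute(QUERY).fetchall()
```

On this toy data only ‘music’ survives: ‘art’ is too short, ‘rmer’ is not in the dictionary table, ‘write’ is not a noun there, and ‘news’ is below the threshold.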

Book review : Competing on analytics

Competing on Analytics: The New Science of Winning is an interesting book even if it doesn’t explain anything about how to do real analytics. That’s just not the point. Its objective is to give you some insight into why and how you need to move your business in a more analytical direction.

The book is full of real analytics examples. For instance, baseball teams use analytics to manage their team composition. You don’t need to have the best players; you need to hire good players with cheap salaries in order to maximize your profit.

The book gives you ideas on how you can use analytics in the different areas of a company: finance, manufacturing, R&D, human resources, CRM and suppliers. I know how to do analytics and data mining, but applying them to business processes is trickier. In the end, nobody cares about the validity of your models if they don’t earn a dollar. This book is about that, showing why and how you could make more money using analytics.

Book review : Collective Intelligence in Action

The last book I read was Collective Intelligence in Action by Satnam Alag (ed. Manning). It covers data mining from a web 2.0 point of view. Data is generated by users in many forms (ratings, tags, blogs, web pages, …). Such data is not well defined. A user can create a new tag like gloupy without giving you its meaning. There are also some text mining issues: how do you understand the meaning of a sentence?

The book is divided into three parts. The first (half of the book) describes the data and especially how to get it (web crawling, blog trackers). The second part is about exploiting the data, i.e. data mining (clustering and prediction). There is also a chapter on converting text into tokens. The last part covers example applications: building an intelligent search engine or a recommendation engine (with an interesting discussion of the Amazon, Google News and Netflix solutions).

Being based on Java code, it relies on libraries like Nutch for web crawling, Lucene for text handling and Weka for data mining. I think there is too much Java code in the book. Indeed, it’s boring and you easily skip some pages. For instance, the book uses k-means with self-made code, Weka code and JDM (a Java data mining API) code. It seems quite useless to see the same thing three times.

Nevertheless, I found this book very interesting and a very good introduction to web mining, an area I have little knowledge of.