A Twitter users segmentation
Now it's time to create some clusters from our Twitter data. In this post, we focus only on biographical tags and use the good old k-means algorithm to find significant clusters. At least we hope so.
After some first statistics about the Twitter dataset, I try to go further. In a previous post, I discussed how to extract tokens from Twitter, more precisely from the biography and the last tweets of users. The problem is that the bio attribute is composed of 11,702 tokens. There is no way all these tokens are interesting (do itv1, rmer or 112/xm give you any insight?). But how do we remove all the uninteresting tokens while keeping the good ones? As always, there is a trade-off between keeping too much noise and removing some gold nuggets. In my view, an interesting token is a tag (as a tag usually gives you some knowledge about an item). The problem is that for each user you have a set of tokens. What you want is a set of tags, i.e. tokens of interest.
I found two ways to build the kept tags from all the tokens. The first, the whitelist approach, checks each token against a whitelist. If it is not in the list, the token is removed. If it is, the token is a tag. The second, the blacklist approach, also checks each token against a list, but keeps only the tokens which are not in the blacklist. Both methods have drawbacks. The whitelist is more likely to remove interesting tokens which were not spotted in advance. For example, twitter should be a tag, but if you made the whitelist two years ago, you won't have whitelisted it (or you're a visionary). Thus, you could miss some emerging trends. The blacklist has the opposite drawback: you keep too much. Considering my goals, I chose the whitelist approach. Now it's time to construct the list of tokens which are really tags.
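To make the difference between the two approaches concrete, here is a minimal Python sketch. The function names, the example bio tokens and both lists are made up for illustration; they are not taken from the actual dataset.

```python
def keep_whitelisted(tokens, whitelist):
    """Whitelist approach: a token becomes a tag only if it is listed."""
    return [t for t in tokens if t in whitelist]


def drop_blacklisted(tokens, blacklist):
    """Blacklist approach: every token is a tag unless it is listed."""
    return [t for t in tokens if t not in blacklist]


# Hypothetical tokens from one user's bio, plus made-up lists.
bio_tokens = ["writer", "itv1", "music", "rmer", "twitter"]
whitelist = {"writer", "music", "news"}  # "twitter" was never added
blacklist = {"itv1"}                     # "rmer" was never spotted

whitelist_tags = keep_whitelisted(bio_tokens, whitelist)
blacklist_tags = drop_blacklisted(bio_tokens, blacklist)

print(whitelist_tags)  # misses the emerging tag "twitter"
print(blacklist_tags)  # still contains the noise token "rmer"
```

The whitelist result loses twitter because nobody thought of listing it, while the blacklist result keeps rmer because nobody thought of banning it.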
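The easiest solution is to check each token in the dataset by hand. While that is probably the best solution, I found it boring and not in the data mining spirit. Thus I use some tricks, which you can call a priori knowledge or choices. They are the three filters in the SQL query given at the end of this post: keep only tokens that are nouns in WordNet, that are longer than 3 characters, and that appear at least 10 times.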
Using these three tricks, we end up with only 430 tags, which is easier to manage and read. Here are the most used tags:
+-----------+----------------------+
| token     | count(distinct user) |
+-----------+----------------------+
| twitter   |                  243 |
| love      |                  238 |
| news      |                  208 |
| life      |                  170 |
| music     |                  167 |
| world     |                  143 |
| social    |                  138 |
| more      |                  132 |
| writer    |                  130 |
| people    |                  119 |
| like      |                  111 |
| time      |                  108 |
| have      |                  104 |
| business  |                  102 |
| marketing |                   98 |
| blogger   |                   88 |
| work      |                   85 |
| internet  |                   81 |
| lover     |                   81 |
| real      |                   79 |
+-----------+----------------------+
As you can see, there are still many tokens which are likely to be meaningless ('more', 'have'). Nevertheless, it's easier to see things. Uninteresting tags are also unlikely to be picked up as a pattern by our algorithms. At least we hope so.
OK, that was it for the biography. Now the content attribute. It is more challenging, with 126,566 tokens! Using the same three tricks reduces the number to 3,207. Could all these tags give an insight?
PS: As a reminder, here is the SQL query:
select twi.token, count(*)
from twitterBioToken twi
join WordNetTokens dico on dico.token = twi.token
where type = 'noun'
  and length(twi.token) > 3
group by twi.token
having count(*) >= 10
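For readers who prefer code to SQL, the three filters of the query can be mirrored in a few lines of Python. This is only a sketch: `select_tags`, the `(user, token)` row layout and the tiny `nouns` set standing in for the WordNetTokens table are assumptions, and the example rows are invented.

```python
from collections import Counter


def select_tags(bio_rows, nouns, min_count=10):
    """Keep tokens that are nouns, longer than 3 characters and frequent enough.

    bio_rows is an iterable of (user, token) pairs, one per row of the
    (assumed) twitterBioToken table. nouns plays the role of WordNetTokens
    restricted to type = 'noun'.
    """
    counts = Counter(token for _user, token in bio_rows)
    return {
        token: n
        for token, n in counts.items()
        if token in nouns and len(token) > 3 and n >= min_count
    }


# Invented rows: each token is repeated for a number of distinct users.
rows = (
    [(u, "music") for u in range(12)]   # noun, long enough, frequent: kept
    + [(u, "love") for u in range(3)]   # noun but too rare: dropped
    + [(u, "art") for u in range(15)]   # noun but too short: dropped
    + [(u, "rmer") for u in range(11)]  # frequent but not a noun: dropped
)
nouns = {"music", "love", "art"}

tags = select_tags(rows, nouns)
print(tags)
```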
Recently I discussed how to get some data from Twitter. At this time, I have downloaded 6,859 profiles. Here I will give some information about them. Of course, it's only a very small subset of the whole Twitter community.
First, the location field. Here are the 20 most frequent locations:
+-------------------+-------+---------+
| location          | count | proba(%)|
+-------------------+-------+---------+
|                   |  1787 | 26.0534 |
| london            |   327 |  4.7675 |
| los angeles       |   159 |  2.3181 |
| los angeles ca    |   113 |  1.6475 |
| uk                |    67 |  0.9768 |
| new york          |    65 |  0.9477 |
| london uk         |    55 |  0.8019 |
| usa               |    53 |  0.7727 |
| washington dc     |    47 |  0.6852 |
| new york ny       |    44 |  0.6415 |
| california        |    44 |  0.6415 |
| san francisco ca  |    40 |  0.5832 |
| canada            |    31 |  0.4520 |
| everywhere        |    31 |  0.4520 |
| nyc               |    31 |  0.4520 |
| san francisco     |    30 |  0.4374 |
| chicago           |    28 |  0.4082 |
| la                |    27 |  0.3936 |
| new york city     |    26 |  0.3791 |
| manchester        |    23 |  0.3353 |
+-------------------+-------+---------+
A quarter of the users don't use the location field. The same real location can have many different field values: Los Angeles, for instance, appears as los angeles, los angeles ca, la, … Grouping such synonyms, I found that 6.25% of the declared locations are Los Angeles, 9.56% London and 4.69% New York. These figures are slightly overestimated (there are places called london which are not London in the UK, for instance), but such cases are relatively few. It would be interesting to try to extract an OLAP dimension from such data, at least (country, state, city).
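The synonym grouping can be sketched as follows. The `CITY_SYNONYMS` table only covers a few of the values listed above, and the counts in the example are invented round numbers, so the resulting percentages are not the ones reported in this post.

```python
from collections import Counter

# Hypothetical synonym table: raw location value -> canonical city.
CITY_SYNONYMS = {
    "los angeles": "Los Angeles",
    "los angeles ca": "Los Angeles",
    "la": "Los Angeles",
    "london": "London",
    "london uk": "London",
    "new york": "New York",
    "new york ny": "New York",
    "new york city": "New York",
    "nyc": "New York",
}


def city_shares(location_counts):
    """Return each canonical city's share (in %) of the declared locations."""
    # Empty values mean the user did not declare a location at all.
    declared = {loc: n for loc, n in location_counts.items() if loc.strip()}
    total = sum(declared.values())
    per_city = Counter()
    for loc, n in declared.items():
        city = CITY_SYNONYMS.get(loc)
        if city is not None:
            per_city[city] += n
    return {city: 100.0 * n / total for city, n in per_city.items()}


# Invented counts, chosen so that the declared locations sum to 100.
example_counts = {
    "": 10, "london": 30, "london uk": 10,
    "la": 5, "los angeles": 15, "paris": 40,
}
shares = city_shares(example_counts)
print(shares)
```

A plain lookup table like this one also reproduces the overestimation mentioned above: every london is counted as London in the UK.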
Next, I want to see how unrepresentative my Twitter subset is of the whole Twitter database. I know that with my procedure the probability of a profile being selected grows linearly with its number of followers. If there is no trouble with Twitter, the total number of following links equals the total number of follower links, as each link is counted once on each side: if a follows b, then b is followed by a.
In my subset, the average number of followers is around 12,000 and the average number of following is 1,500. On average, each user has 8 times more followers than following. That is very far from the real population, where both averages must be equal. Thus my subset can hardly be used to make inferences about the whole population.
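A toy calculation shows how strongly this kind of selection can inflate the average. The sketch assumes the selection probability is exactly proportional to the number of followers (the post only says it is linear), and the five follower counts are invented.

```python
def mean(values):
    """Plain average over the whole population."""
    return sum(values) / len(values)


def size_biased_mean(values):
    """Expected value when each item is drawn with probability
    proportional to its own value: sum(v * v) / sum(v)."""
    return sum(v * v for v in values) / sum(values)


# Invented population: four ordinary users and one celebrity.
followers = [1, 1, 1, 1, 96]

population_mean = mean(followers)           # what the whole population looks like
crawled_mean = size_biased_mean(followers)  # what a follower-biased crawl would see

print(population_mean, crawled_mean)
```

In this made-up population, the crawl would report an average more than four times higher than the true one, simply because the celebrity is picked almost every time.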
The overall correlation between these two attributes is 0.34. That is less than I would expect, but I suspect this correlation depends highly on the type of user (and currently we don't know the type of each user).
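The post does not say which correlation coefficient was used. Assuming it is the usual Pearson coefficient, it could be computed as in this sketch, where the follower and following counts are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented counts for five hypothetical users.
nb_followers = [120, 4000, 35, 900, 360000]
nb_following = [80, 150, 40, 1000, 70]

r = pearson(nb_followers, nb_following)
print(round(r, 2))
```

With heavy-tailed counts like these, a single celebrity account can dominate the coefficient, which is one more reason to look at it per type of user.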
Twitter is a famous social website. It works like a blog but limits the message length (140 characters). Thus, it is also called microblogging, and is meant to be updated more frequently, with every thought you could have. Could we make something of such atrophied data?
I'm only at the beginning of this project. I have set up a basic crawl infrastructure in order to extract a dataset from Twitter and mine it.
The collected data have six attributes: user name, location, followers count, following count, biography (a small "who am I" field) and the concatenation of the user's last messages. Below is an example of a profile, a public person named Richard Bacon. In this example, you can see how complex this information is. The location is quite unclear (GPS coordinates). The biography is quite small (but really clear in this example). And the content is … confusing.
id: 1351
name: richardpbacon
location: iphone 51.511682 0.224661
nbFollowing: 72
nbFollowers: 360574
bio: minor celebrity bbc radio fivelive presenter
content: yep she tweeted sunday her tweet alone theyd have
run monday news 10 asking susan boyle backlash she overrated
sounds like someone team listened 5live way work sounds like
someone news 10 team (...)
Actually, the content field displayed above was already processed. I used Lucene to tokenize and clean the text. Below is the text before and after applying Lucene to get tokens instead of free-form text.
before : News at 10 asking, is there a Susan Boyle backlash / is she overrated? Sounds like someone on the team listened to 5live on the way to work.
after : news 10 asking susan boyle backlash she overrated sounds like someone team listened 5live way work
As you can see, there are still a lot of meaningless tokens like 5live.
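The cleaning step can be imitated in a few lines of Python. This is not the Lucene pipeline itself, only a rough stand-in: the regular expression and the `STOP_WORDS` set (modelled on a typical English stop-word list) are assumptions, chosen so that the example above comes out the same way.

```python
import re

# Assumed stop-word list; the real Lucene analyzer settings are not given.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


before = (
    "News at 10 asking, is there a Susan Boyle backlash / is she "
    "overrated? Sounds like someone on the team listened to 5live "
    "on the way to work."
)
after = " ".join(tokenize(before))
print(after)
```

Tokens such as 5live or 10 survive because nothing in this kind of cleaning knows whether a token carries meaning; that is the job of the tag selection.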
I have done a quick segmentation (not much data, not a good algorithm, not much cleaning) on the biography tokens only. Nevertheless, trying with 25 clusters, things start to emerge. For instance, one cluster has a high relative frequency of tokens like university, engineering, computer, student, science, studying, school. This is a students cluster (3% of my dataset). There is also a cluster for official public people (twitter, page, official, feed), some geek clusters (one for geek users of Mac or Linux, one for open source software developers, another for web developers), a cluster of company Twitter accounts (tokens like company, services, production, advertising, leading) and a photographers one (photography, make-up, light, photo, traveler).
More work has to be done, but the first insights are encouraging.
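For readers who want to try this kind of segmentation, here is a minimal k-means sketch in plain Python. It assumes each user is represented as a binary vector over the tag vocabulary, which may differ from the representation actually used for the results above. The six bios are invented, and k is 2 rather than 25 to keep the example readable.

```python
import random


def to_vector(tags, vocabulary):
    """Binary bag-of-tags vector: 1.0 if the user has the tag, else 0.0."""
    return [1.0 if tag in tags else 0.0 for tag in vocabulary]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Invented bios: three student-like users, then three photographer-like users.
bios = [
    {"university", "student", "science"},
    {"student", "school", "science"},
    {"university", "school", "student"},
    {"photography", "photo", "light"},
    {"photo", "traveler", "photography"},
    {"light", "photo", "traveler"},
]
vocabulary = sorted(set().union(*bios))
vectors = [to_vector(bio, vocabulary) for bio in bios]

clusters = kmeans(vectors, k=2)
print(clusters)
```

On real data the clusters would then be described as above, by looking at the tags with the highest relative frequency in each cluster.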