Archive

Archive for June, 2009

Using MySQL as a Data Warehouse

Mysql LogoI often use a database not only to store data but also to do some treatment before mining and some analysis. I use MySQL as back-end. I have also considered other free software alternatives, but I feel uncomfortable with Firebird and PostgreSQL lacks of analytical functions like WITH ROLLUP.

MySQl is an atypical DBMS in the sense that it has many different storage engines for the data. Here is a short description of each.

MyISAM

MyISAM is the historical MySQL storage engine. It doesn’t support transaction, neither foreign keys. His big problem is his lack of data cache, he never store any row in memory in case you access it again. This defect is partially compensated by the operating system cache.

InnoDB

InnoDB is the transactional engine of MySQL. He store rows in the primary index thus search with the primary index should be quite fast. He has a data cache.

Memory

Memory, store data only in memory. All content (exception of the structure) is lost with a shutdown of the database. thus you have to populated these table at each MySQL launch. He’s very fast but he’s limited by the amount of memory you allocate.

Archive

Archive is the solution for read only (infrequently) data. There is an on the fly compression to reduce storage cost. Thus the I/O operation are reduced but the CPU is more used (for decompression). There is currently no support for index, each request imply a linear read of the table.

I’ve modeled a star schema with two dimension of 100k rows and a fact table of 10M rows. The footprint of the fact table is dependant on the storage engine used :

Engine Footprint
innodb 416Mo
myisam 204Mo (+184Mo of index)
archive 24Mo

Archive does a great work at reducing the footprint.  InnoDB and MyIsam takes around the same space.

Now, let’s look the tme taken for a query which extract some statistics about the dataset.

SELECT d.yeard, c.country, c.city, sum(qty) FROM fact f  JOIN client c  ON f.clientId = c.id
     JOIN dated d ON f.dateId = d.id GROUP BY yeard, country, city

/* Result. */
+-------+-----------+-----------+----------+
| yeard | country   | city      | sum(qty) |
+-------+-----------+-----------+----------+
|  1981 | Allemagne | Berlin    |  3161889 |
|  1981 | Allemagne | Bohn      |  3468275 |
|  1981 | Allemagne | Frankfort |  3467403 |
|  1981 | France    | Lille     |  2244612 |
|  1981 | France    | Lyon      |  2243573 |
|  1981 | France    | Paris     |  2652841 |
|  1981 | UK        | Liverpool |  3979256 |
|  1981 | UK        | Londres   |  4283599 |
+-------+-----------+-----------+----------+

We use the following command while running the query in order to disable OS cache effect.

watch "echo 3 > /proc/sys/vm/drop_caches"

Results depending on the used storage engine are  :

Engine for fact Engine for client Engine for date Query time
myisam myisam myisam 64s
innodb innodb innodb 33s
archive innodb innodb 28s
innodb memory memory 28s
myisam innodb innodb 23s
archive memory memory 21s
myisam memory memory 19s

As you can see, results greatly depends on the type of engine you use. The better configuration I see is to use MyIsam for fact table. It’s very fast in sequential read. For dimension tables, innodb is a better choice. Dimension table are used in a look-up way and could fit in the allocated memory, thus the data cache of InnoDB does a great work.

Categories: Tools Tags: , ,

Mining Twitter data

twitter1

Twitter is a famous social website. It works like a blog but limits the message length (160 characters). Thus, it is also called micro blogging and should be get more frequent update about every thought you could have. Could we do something of such atrophied data?

I’m only at the begining of this project. I have settle a basic crawl infrastructure in order to extract a dataset from twitter and mine in it.

The taken data have five attributes : user name, location, followers count, following count, biography (a small who am i field) and the concatenation of theirs last messages. Below is a exemple of a profile, a public person named Richard Bacon. In this example, you could figure how complex these information are. The location is quite unclear (GPS coordinates). The biography is quite small (but really clear on this example). And the content is … confusing.

         id: 1351
       name: richardpbacon
   location: iphone 51.511682 0.224661
nbFollowing: 72
nbFollowers: 360574
        bio: minor celebrity bbc radio fivelive presenter
    content: yep she tweeted sunday her tweet alone theyd have
run monday news 10 asking susan boyle backlash she overrated
sounds like someone team listened 5live way work sounds like
someone news 10 team (...)

Actually, the content field displayed above was already treated. I’ve use Lucene in order to tokenize and clean the text part. Bellow is the text before and after applying Lucene in order to get tokens instead of free form text.

before :
News at 10 asking, is there a Susan Boyle backlash / is she
overrated? Sounds like someone on the team listened to 5live
on the way to work.

after :
news 10 asking susan boyle backlash she overrated sounds like
someone team listened 5live way work

As you can see, there is still a lot of meaningless tokens like 5live.

I have done a quick (not so much data, not a god algorithm, not so much cleaning) segmentation on only the biography tokens. Nevertheless, trying with 25 clusters, things start to emerge. For instance, a cluster has a high relative frequency of tokens like university, engineering, computer, student, science, studying, school. This is a students cluster (3% of my dataset). There is also a cluster for official public people (twitter, page, official, feed), some geeks clusters (one for geek users of mac or linux, one for open source software developers, another for web developers), a companies twitter account cluster (tokens like company, services, production, advertising, leading) and a photographs one (photography, make-up, light, photo, traveler).

More work has to be done, but the first insight are encouraging.

Book review : Collective Intelligence in Action

Collective Intelligence in ActionLast book I read was Collective Intelligence in Action from Satman Alag (ed. Manning). It covers data mining from a web 2.0 related view.  Data is generated by users in many form (ratings, tags, blogs, web pages,  …). Such data are not well defined. An user can create a new tag like gloupy without giving you the meaning. There is also some text mining issues. How to understand the meaning of a sentences?

The book is divided in three parts. First (half of the book) describe data and more especially how to get them (web crawling, blog trackers). The second part is about exploiting the data, i.e. data mining (clustering and prediction). There is also a chapter on converting text into tokens. The last part is on examples of applications. Making an intelligent search engine or a recommendation engine (with an interesting discussion on Amazon, Google News and Netflix solutions).

Being based on Java code, it relies upon some libraries like Nutch for web crawling,  Lucene for text handling and Weka for the data mining. I think there is too much java code in the book. Indeed, it’s boring an you skip easily some pages. For instance, the book use kmeans with self made code, Weka code and JDM (an data mining java api) code. It seems quite useless to see three times the same thing.

Nevertheless, I have found this book very interesting and a very good introduction to web mining, an area where I have little knowledge of.