Lately, I have been thinking about the difference between machine learning and simulation (for prediction). Machine learning uses historical inputs and outputs to predict subsequent outputs. Simulation, on the other hand, assumes you already have the knowledge, i.e. the underlying model, so you don’t need historical data to learn it. Sometimes you can use both methods to answer a question; sometimes only one of them is available. After thinking about it, I find that the distinction between them is thinner than I thought.
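To make the contrast concrete, here is a minimal sketch on a toy problem that is not from the original post: how far does an object fall in a given time? The simulation steps a known physical law forward, while the "learning" side fits a single coefficient from past observations. The function names, the made-up measurements, and the choice of free fall are all assumptions for illustration.

```python
# Hypothetical toy problem: how far does an object fall in t seconds?

G = 9.81  # gravitational acceleration in m/s^2 (the "known model")


def simulate_fall(t_end, dt=0.001):
    """Simulation: step the known physics forward; no historical data needed."""
    velocity = 0.0
    distance = 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        velocity += G * dt          # update speed from the known law
        distance += velocity * dt   # update position from the new speed
    return distance


def fit_quadratic_coefficient(times, distances):
    """Learning: least-squares fit of distance = a * t**2 from past observations."""
    numerator = sum((t ** 2) * d for t, d in zip(times, distances))
    denominator = sum(t ** 4 for t in times)
    return numerator / denominator


# Made-up "historical" measurements (generated from the same law, for illustration only).
history_t = [0.5, 1.0, 1.5, 2.0]
history_d = [0.5 * G * t ** 2 for t in history_t]

a = fit_quadratic_coefficient(history_t, history_d)
learned_prediction = a * 3.0 ** 2          # prediction from data alone
simulated_prediction = simulate_fall(3.0)  # prediction from the model alone

print(learned_prediction, simulated_prediction)
```

Both routes end up estimating the same quantity. The learned coefficient `a` is just an estimate of a parameter the simulation takes as given, which is one way to see why the line between the two approaches is thin.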
Read more…
A new data mining contest is available here. The domain is medical, and there are two tasks. First, we need to predict whether a given patient will be transferred to another hospital. The second task is to predict whether the patient will die (the medical domain definitely lacks fun). For each task, we give every patient a score, ranking them from the most probable to the least. The dataset presents many challenges. In this post, I propose my personal ideas for handling them.
Read more…
Predicting the number of sales representatives needed at a particular time in a particular store is harder than expected. If you instrument the whole process, you can know the activity of your representatives (number of customers, average transaction time, activity rate, …). We could then predict the number of required representatives. We know the cost of scheduling too many of them, but what is the cost of having too few? How do we value a missed opportunity, a customer’s dissatisfaction with the quality of service, or the behaviour of an overstressed employee?
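The trade-off can be sketched as an expected-cost calculation. Everything numeric below is hypothetical: the demand distribution, the number of customers one representative can serve, the wage, and above all the cost attached to a missed customer, which is exactly the quantity that is hard to value. The function names are assumptions as well.

```python
def expected_cost(reps, demand_probs, customers_per_rep, wage, missed_cost):
    """Staffing cost plus the expected cost of customers nobody could serve."""
    cost = reps * wage
    capacity = reps * customers_per_rep
    for customers, prob in demand_probs.items():
        missed = max(0, customers - capacity)  # customers beyond capacity are lost
        cost += prob * missed * missed_cost
    return cost


def best_staffing(max_reps, demand_probs, customers_per_rep, wage, missed_cost):
    """Return the number of representatives with the lowest expected cost."""
    return min(
        range(max_reps + 1),
        key=lambda reps: expected_cost(
            reps, demand_probs, customers_per_rep, wage, missed_cost
        ),
    )


# Hypothetical hourly demand: number of customers -> probability.
demand = {10: 0.3, 20: 0.5, 30: 0.2}

# Same store, two different guesses at what a missed customer costs.
high_value = best_staffing(6, demand, customers_per_rep=5, wage=20, missed_cost=8)
low_value = best_staffing(6, demand, customers_per_rep=5, wage=20, missed_cost=2)

print(high_value, low_value)
```

The recommended head count swings widely with `missed_cost` alone, so the prediction is only as good as the answer to the valuation question.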
Read more…

Programming Collective Intelligence is a great book. It covers many of the common data mining algorithms and presents many applications for them. It covers clustering (k-means, hierarchical), supervised classification (k-nearest neighbours, Naïve Bayes, decision trees, SVM), data analysis (non-negative matrix factorization), optimisation (hill climbing, simulated annealing and genetic algorithms) and ends with genetic programming. Along the way, it presents applications like spam detection, pricing, recommendation, … If you want to start in data mining, this is a very good way to begin.
Read more…
Sometimes (well, most of the time) using your favorite data mining method and the most obvious attributes is not good enough. What to do then? A common idea is to try every other model your software provides and/or add every attribute you can think of, whatever its relation to your problem. In this post, I will try to elaborate a kind of “how to” for this case.
Step 1: What is my model?
If your model is a neural network, it’s quite hard to get any insight into how it works by looking at the weights or activation functions. How could you improve something you don’t understand?
Read more…