<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Data into results</title>
	<atom:link href="http://www.dataintoresults.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dataintoresults.com</link>
	<description>A data miner diary</description>
	<lastBuildDate>Thu, 29 Dec 2011 19:23:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Big data and mobile BI : New hype but same old issue</title>
		<link>http://www.dataintoresults.com/2011/12/new-hype/</link>
		<comments>http://www.dataintoresults.com/2011/12/new-hype/#comments</comments>
		<pubDate>Thu, 29 Dec 2011 19:22:46 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[data warehouse]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=326</guid>
		<description><![CDATA[As 2011 draws to a close, many around the blogosphere are forecasting what will be hyped next year. I often read that big data and mobile BI are at the top of the list. In this article, I will discuss those technologies. Don&#8217;t believe the ads: those technologies aren&#8217;t game changers at all. Big data First there [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/08/the-cost-of-reducing-cost/' rel='bookmark' title='The cost of reducing costs'>The cost of reducing costs</a></li>
<li><a href='http://www.dataintoresults.com/2011/10/marketingcalculator/' rel='bookmark' title='Book review : Marketing calculator'>Book review : Marketing calculator</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/what-is-the-value-of-your-work/' rel='bookmark' title='What is the value of your work?'>What is the value of your work?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dataintoresults.com/wp-content/uploads/2009/07/datamining.jpg"><img class="alignright size-thumbnail wp-image-143" title="datamining" src="http://www.dataintoresults.com/wp-content/uploads/2009/07/datamining-150x150.jpg" alt="" width="122" height="122" /></a>As 2011 draws to a close, many around the blogosphere are forecasting what will be hyped next year. I often read that big data and mobile BI are at the top of the list. In this article, I will discuss those technologies. Don&#8217;t believe the ads: those technologies aren&#8217;t game changers at all.</p>
<h3><span id="more-326"></span></h3>
<h3>Big data</h3>
<p>First, there is a lot of hype around big data. The definition is still unclear, but its two main components are:</p>
<ul>
<li>too big to handle with current methods, i.e. a regular DBMS</li>
<li>mostly unstructured</li>
</ul>
<p>For the first issue, which is size, let me point out that eBay has <a href="http://www.dbms2.com/2010/10/06/ebay-followup-greenplum-out-teradata-10-petabytes-hadoop-has-some-value-and-more/">two data warehouses holding many petabytes</a> running on Teradata. Obviously, Teradata is far from cutting-edge new stuff. I haven&#8217;t heard of a Hadoop cluster above the single-digit petabyte range, and the number of instances over a petabyte seems about equal for Hadoop and Teradata (between 15 and 20). Therefore, we can at least say that old technology can handle big data. To be fair, it&#8217;s all about a trade-off. Hadoop is good for scaling, i.e. it costs little capital at the beginning and can grow with the business. That comes at the cost of effectiveness: computational effectiveness, but human effectiveness as well.</p>
<p>For the unstructured issue, I would say that it&#8217;s just not an issue. The first time I saw unstructured data was in an Oracle instance using XML fields (Oracle 9, from a decade ago, can handle XML). What was interesting was the reason for adding an unstructured XML field to an old relational table: the business guys were having ideas at a velocity the development team couldn&#8217;t handle. One can say it&#8217;s agility at the cost of usability. To me it&#8217;s just increasing the <a href="http://en.wikipedia.org/wiki/Technical_debt">technical debt</a>. Structuring data is boring because we have to think about how the data should be structured to benefit the business. Leaving it unstructured pushes the boring part to later, and that&#8217;s not a good idea.</p>
<div>
<h3>Mobile BI</h3>
<p>That&#8217;s hot, or at least predicted to be hot for <a href="http://smartdatacollective.com/timoelliott/43752/sap-businessobjects-mobile-bi-directions">SAP Business Objects</a>. Obviously, being able to answer business questions within a meeting using a smartphone is amazing. But think a little about how likely such a scenario is. In my experience there are only three cases:</p>
<ul>
<li>the data is a KPI: if it&#8217;s a KPI and people don&#8217;t have an idea of its value, mobile BI can&#8217;t help. The BI stack has simply failed. If the C-level suite doesn&#8217;t know the performance of the business and its main drivers, what is the purpose of BI?</li>
<li>the data is not a KPI but is covered by the usual BI stack: nobody knows that the data is in the daily report they get. I have seen people getting a lot of daily automated reports, some of them more than 50 pages long. That&#8217;s hundreds of indicators. Nobody can know them all, and some are not even computed correctly (at least not the way you think they should be).</li>
<li>they want insight and not a piece of data: a question like &#8220;What is the behavior of this kind of customer when we launch this kind of marketing operation?&#8221; will never be handled by any mobile BI, or by any self-service BI at all. It just needs some work. Maybe the conversion rate increases but the lifetime decreases.</li>
</ul>
<h3>The old issue</h3>
</div>
<div>To me, the root issue of both those hyped technologies is the same. BI is still mainly a geek and IT topic. Answering a business question or doing things the old way is just &#8230; boring. Who cares what the acquisition trend is when we can build a sexy new Hadoop cluster (with HBase, ZooKeeper, HDFS and many funky tools), or when an <a href="http://en.wikipedia.org/wiki/Support_vector_machine">SVM</a> increases accuracy by 0.1%? So the idea is simple: let&#8217;s give some fancy tools to the business so they don&#8217;t come back, and let&#8217;s build something awesome. Yahoo is very strong in Hadoop, but it&#8217;s hard to see any business result.</div>
<div>Technology is just a tool. Having petabytes of data and a huge Hadoop cluster is easy, it&#8217;s just time and money. Making reports or apps for smartphones is easy, again just time and money. Extracting useful insights from data, that&#8217;s hard. Most of those insights can be extracted using a regular database and a simple Excel sheet.</div>


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/08/the-cost-of-reducing-cost/' rel='bookmark' title='The cost of reducing costs'>The cost of reducing costs</a></li>
<li><a href='http://www.dataintoresults.com/2011/10/marketingcalculator/' rel='bookmark' title='Book review : Marketing calculator'>Book review : Marketing calculator</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/what-is-the-value-of-your-work/' rel='bookmark' title='What is the value of your work?'>What is the value of your work?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2011/12/new-hype/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Book review : Marketing calculator</title>
		<link>http://www.dataintoresults.com/2011/10/marketingcalculator/</link>
		<comments>http://www.dataintoresults.com/2011/10/marketingcalculator/#comments</comments>
		<pubDate>Sun, 02 Oct 2011 15:14:07 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Book review]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[evaluation]]></category>
		<category><![CDATA[marketing]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=282</guid>
		<description><![CDATA[Measuring and managing return on marketing investment, that&#8217;s the promise of the book from Guy R. Powell. A famous quote in marketing is: Half of my advertising is wasted; I just don&#8217;t know which half (John Wanamaker). Indeed, how many marketing initiatives are rigorously evaluated? No doubt price cuts increase volumes, but how often [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/06/book-review-actionable-web-analytics/' rel='bookmark' title='Book review : Actionable Web Analytics'>Book review : Actionable Web Analytics</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/book-review/' rel='bookmark' title='Book review : Competing on analytics'>Book review : Competing on analytics</a></li>
<li><a href='http://www.dataintoresults.com/2009/08/book-review-programming-collective-intelligence/' rel='bookmark' title='Book review : Programming Collective Intelligence'>Book review : Programming Collective Intelligence</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-full wp-image-283" title="Marketing calculator" src="http://www.dataintoresults.com/wp-content/uploads/2011/10/marketing-calculator.jpg" alt="" height="120" /></p>
<p>Measuring and managing return on marketing investment, that&#8217;s the promise of the book from Guy R. Powell. A famous quote in marketing is: <em>Half of my advertising is wasted; I just don&#8217;t know which half</em> (John Wanamaker). Indeed, how many marketing initiatives are rigorously evaluated? No doubt price cuts increase volumes, but how often do they improve the top line, let alone the bottom line? This book is about putting some analytics into marketing and knowing the truth.</p>
<p><span id="more-282"></span></p>
<h3>The five steps of marketing effectiveness</h3>
<p>The book starts with a presentation of its marketing effectiveness framework, which is a very good one indeed. It&#8217;s based on the 4P (Price, Product, Place, Promotion), 3C (Competition, Consumers, Channel) and 1E (Exogenous factors like holidays, weather, regulation, &#8230;), all of which can impact your KPIs whether you have control over them (like the P elements) or not. After that presentation, the book goes through 5 levels of marketing effectiveness.</p>
<h4>Activity tracker</h4>
<p>The first step seems easy: you just have to track your marketing activities and how much was spent on each one. The book proposes a test: you should be able to provide, within half an hour, a spreadsheet of all activities in the last 3 years and all expected activities in the next year. Many companies fail the test.</p>
<h4>Campaign measurer</h4>
<p>This is what we can expect from a campaign analysis. Each sale driven by the marketing campaign is counted as additional revenue thanks to the campaign. This is usually a last-touch attribution method (if one week I use a coupon to buy my weekly pack of beer, the sale will be attributed to the campaign even if I don&#8217;t care about the campaign). Then you have the revenue generated by the campaign on one side and its cost on the other, and you can easily compute an ROI.</p>
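<p>As a rough illustration of that computation, here is a minimal sketch. The sales records, campaign names and cost are invented for the example, and ROI is taken on revenue rather than margin, which is an assumption and not necessarily what the book uses.</p>

```python
def last_touch_revenue(sales, campaign_id):
    """Sum the sales whose last marketing touch was the given campaign."""
    return sum(s["amount"] for s in sales if s["last_touch"] == campaign_id)


def campaign_roi(attributed_revenue, campaign_cost):
    """ROI as a fraction: 0.25 means the campaign returned its cost plus 25%."""
    return (attributed_revenue - campaign_cost) / campaign_cost


# Hypothetical sales, each tagged with the last campaign the buyer touched.
sales = [
    {"amount": 40.0, "last_touch": "coupon_week_12"},
    {"amount": 60.0, "last_touch": "coupon_week_12"},
    {"amount": 25.0, "last_touch": None},
    {"amount": 50.0, "last_touch": "tv_spot"},
]

revenue = last_touch_revenue(sales, "coupon_week_12")
roi = campaign_roi(revenue, campaign_cost=80.0)
print(revenue, roi)
```

<p>The weakness mentioned above is visible here: every sale tagged with the coupon counts as campaign revenue, including purchases that would have happened anyway.</p>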
<h4>Mix modeler</h4>
<p>As marketing campaigns use more and more channels, analyzing a single one becomes more and more tricky. The mix modeler approach takes the 4P3C1E as input and outputs the sales volumes. Learning is usually done by regression analysis. Now you are able to know which channel is the most relevant and what to expect if a competitor launches a marketing campaign. You can also optimize expenditure for each channel to achieve greater ROI.</p>
<h4>Consumer analyser</h4>
<p>While the mix modeler approach is very campaign focused, the consumer analyser approach focuses on the customer by modeling each one as an agent. I think this is very valuable when you already have some customer segments which behave differently. With such systems you can generate any what-if scenario and see what would happen. It&#8217;s a bit like the <a href="http://en.wikipedia.org/wiki/Psychohistory_(fictional)">psychohistory</a> science from Asimov, which could predict the future of mankind by mathematical modeling. Definitely the holy grail for any business analyst.</p>
<h4>Brand optimizer</h4>
<p>This chapter focuses on managing a portfolio of brands to maximize shareholder value.</p>
<h3>My analysis</h3>
<p>This book is about marketing meeting analytics and will give you a lot of ideas. Thanks to the 14 case studies you can already find out where to start. Sadly, the book stops there. Both the mix modeler and the consumer analyser are really valuable approaches, but the journey to obtain them is long even if it seems easy in the book. The case studies are not very detailed, so it&#8217;s hard to know which shortcuts they take. Nevertheless, the underlying idea of the book is very strong. Considering the amount of money spent on marketing, and that much of it is wasted or misallocated, putting at least a bit of analytics into the process can only help.</p>


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/06/book-review-actionable-web-analytics/' rel='bookmark' title='Book review : Actionable Web Analytics'>Book review : Actionable Web Analytics</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/book-review/' rel='bookmark' title='Book review : Competing on analytics'>Book review : Competing on analytics</a></li>
<li><a href='http://www.dataintoresults.com/2009/08/book-review-programming-collective-intelligence/' rel='bookmark' title='Book review : Programming Collective Intelligence'>Book review : Programming Collective Intelligence</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2011/10/marketingcalculator/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Manipulation Part 2 : ETL</title>
		<link>http://www.dataintoresults.com/2011/09/data-manipulation-part-2-etl/</link>
		<comments>http://www.dataintoresults.com/2011/09/data-manipulation-part-2-etl/#comments</comments>
		<pubDate>Thu, 01 Sep 2011 22:01:40 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Tools]]></category>
		<category><![CDATA[data manipulation]]></category>
		<category><![CDATA[data warehouse]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=239</guid>
		<description><![CDATA[My last post discussed SQL queries. Nevertheless, sometimes data comes from different databases. In such cases, it is no longer possible to use plain SQL. ETL tools (ETL stands for Extract, Transform, Load) are designed to make data transformations easy. I have used three tools so far: Talend, SAP Business Objects Data Integrator and Kettle. [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2010/05/data-manipulation-sql/' rel='bookmark' title='Data Manipulation Part 1 : SQL'>Data Manipulation Part 1 : SQL</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/using-mysql-as-a-data-warehouse/' rel='bookmark' title='Using MySQL as a Data Warehouse'>Using MySQL as a Data Warehouse</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-thumbnail wp-image-143" title="datamining" src="http://www.dataintoresults.com/wp-content/uploads/2009/07/datamining-150x150.jpg" alt="datamining" width="120" height="120" />My <a href="http://www.dataintoresults.com/2010/05/data-manipulation-sql/" target="_blank">last post</a> discussed SQL queries. Nevertheless, sometimes data comes from different databases. In such cases, it is no longer possible to use plain SQL. ETL tools (ETL stands for Extract, Transform, Load) are designed to make data transformations easy. I have used three tools so far: Talend, SAP Business Objects Data Integrator and Kettle. I will review them and explain one or two tips I&#8217;ve learned using ETL tools.</p>
<p><span id="more-239"></span></p>
<p><strong>Generalities</strong></p>
<p>The aims of an ETL tool are numerous. Kimball defines <a href="http://intelligent-enterprise.informationweek.com/showArticle.jhtml?articleID=202405400">34 subsystems</a>. To stay more general, I primarily focus on:</p>
<ul>
<li>faster development: ETL tools generally offer a visual interface with components in order to speed up the process;</li>
<li>easier maintenance and evolution: plain SQL queries and hand-written programs tend to become quite complex very quickly;</li>
<li>functionality: if one of your databases contains XML fields, or you want to parse web pages, you need to hope that your ETL tool supports it. If not &#8230; very dirty things will happen.</li>
</ul>
<div>Let&#8217;s look at the three ETL tools I have used.</div>
<p><strong>SAP Business Objects Data Integrator</strong></p>
<p>It&#8217;s the only non-open-source tool I&#8217;ve used (version XI R2). An ETL process is composed of work flows and data flows. Work flows provide the structure, as they can contain work flows, data flows and orchestration elements (e.g. an if-then-else block). Data flows only contain data manipulation objects: table, join, switch, &#8230; Data Integrator is not designed to let you write complex SQL (it&#8217;s possible but not recommended): a table element reads a table and that&#8217;s all. What is great is that Data Integrator will merge table and join elements to create the complex SQL query itself if everything can be done in one database (in an ELT way, so to speak). You quickly end up with many elements in your data flows but no big chunk of SQL. Data Integrator is environment aware: you can define a data source and decide at run time which database it represents (production, dev, &#8230;). On the downside, don&#8217;t expect copy and paste to work, and don&#8217;t expect to do tricky things.</p>
<p><a href="http://www.dataintoresults.com/wp-content/uploads/2011/09/Data_Integrator_Client.jpg"><img class="aligncenter size-medium wp-image-248" title="Data_Integrator_Client" src="http://www.dataintoresults.com/wp-content/uploads/2011/09/Data_Integrator_Client-300x206.jpg" alt="" width="300" height="206" /></a></p>
<p><strong>Talend</strong></p>
<p><a href="http://www.talend.com/" target="_blank">Talend</a> is an open source product. While Data Integrator has a few very generic components to build with, Talend has hundreds of components which cover everything you could want. For instance, you have one table input component for each database (even several table input components for one database). There is no separation between work flow and data flow, only jobs (but you can have sub jobs). On the downside, it&#8217;s amazingly slow to use. I&#8217;m also not a big fan of the fact that it generates a Java program (it could be very useful in some cases, but not for a serious process). It has a lot of bugs too (that&#8217;s the cost of containing so much).</p>
<p><a href="http://www.dataintoresults.com/wp-content/uploads/2011/09/Talend.png"><img class="aligncenter size-medium wp-image-250" title="Talend" src="http://www.dataintoresults.com/wp-content/uploads/2011/09/Talend-300x187.png" alt="" width="300" height="187" /></a></p>
<p><strong>Kettle</strong></p>
<p><a href="http://kettle.pentaho.com/" target="_blank">Kettle</a> is also an open source product, integrated in the Pentaho suite. To describe it, let&#8217;s say that if Talend is shiny and marketing enhanced, Kettle is &#8230; functional and impractical. Again there are jobs (work flows) and transformations (data flows). Jobs are executed sequentially while transformations are multi-threaded, so expect some fun understanding what happens. There are several programs: Spoon (the designer, which can execute both jobs and transformations), Kitchen (executes jobs) and Pan (executes transformations). Nevertheless, it&#8217;s currently my favorite. Once you understand the underlying logic, everything is easier and behaves as expected. It is not as complete as Talend, but it is enough for 99% of what I want, and you can always use some Java code for the rest.</p>
<p><a href="http://www.dataintoresults.com/wp-content/uploads/2011/09/kettle.png"><img class="aligncenter size-medium wp-image-253" title="kettle" src="http://www.dataintoresults.com/wp-content/uploads/2011/09/kettle-300x188.png" alt="" width="300" height="188" /></a></p>
<p><strong>How I use ETL in everyday work</strong></p>
<p>Every time I have to extract a big chunk of data I use an ETL tool (usually a simple input query and an output to a CSV or Excel file). It takes almost no time and you keep a file describing the whole process in case you need it later. And of course I use ETL to load the data warehouse. Here are some tips:</p>
<ul>
<li>Split stuff: divide your ETL by subject (accounting, CRM, &#8230;); the higher-level view should stay simple.</li>
<li>Keep it ETL: I always dump all the data needed into the staging area, then do most transformations in SQL, since everything is in the same database (so one table input and one table output to the data warehouse, but no &#8220;create table as&#8221; statement).</li>
<li>Make it environment friendly: you should have at least a dev environment and a production environment. That way you can test your jobs in the dev environment before putting them in production; the job itself does not change, only the runtime configuration.</li>
</ul>
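<p>The last tip can be sketched outside of any ETL tool. The snippet below is only an illustration: the environment names, hosts, database names and the <code>ETL_ENV</code> variable are all hypothetical, and a real ETL tool would provide its own mechanism for this.</p>

```python
import os

# Hypothetical connection settings; hosts and database names are placeholders.
ENVIRONMENTS = {
    "dev": {"host": "dev-db.example.internal", "database": "dwh_dev"},
    "production": {"host": "prod-db.example.internal", "database": "dwh"},
}


def resolve_environment(name=None):
    """Pick the target from an explicit name or the ETL_ENV variable (default: dev)."""
    name = name or os.environ.get("ETL_ENV", "dev")
    if name not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {name}")
    return ENVIRONMENTS[name]


def run_job(env_name=None):
    """The job logic is identical everywhere; only the resolved target differs."""
    target = resolve_environment(env_name)
    return f"loading into {target['database']} on {target['host']}"


print(run_job("dev"))
print(run_job("production"))
```

<p>The point is that nothing inside <code>run_job</code> mentions a specific database, so what was tested in dev is exactly what runs in production.</p>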
<div>When doing intensive data transformation, ETL tools are great. Even if you only do analysis, they can make your work cleaner and your life easier. Have you never done a lot of manual transformations, only to have your computer crash just before the end? With an ETL tool, all your steps are still there. Someone asks for an update of your work? With an ETL tool, you just have to run the job again to get the results. Clean and easy.</div>


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2010/05/data-manipulation-sql/' rel='bookmark' title='Data Manipulation Part 1 : SQL'>Data Manipulation Part 1 : SQL</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/using-mysql-as-a-data-warehouse/' rel='bookmark' title='Using MySQL as a Data Warehouse'>Using MySQL as a Data Warehouse</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2011/09/data-manipulation-part-2-etl/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data Manipulation Part 1 : SQL</title>
		<link>http://www.dataintoresults.com/2010/05/data-manipulation-sql/</link>
		<comments>http://www.dataintoresults.com/2010/05/data-manipulation-sql/#comments</comments>
		<pubDate>Fri, 07 May 2010 12:08:47 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Tools]]></category>
		<category><![CDATA[data manipulation]]></category>
		<category><![CDATA[data warehouse]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=224</guid>
		<description><![CDATA[Data manipulation is a big part of a data mining process. Some authors claim it can take 80% of a data mining project. I can only agree. If the data comes from the data warehouse, it can be a lot faster. If you have to dig into (and understand) operational systems or add some external data, the [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/06/using-mysql-as-a-data-warehouse/' rel='bookmark' title='Using MySQL as a Data Warehouse'>Using MySQL as a Data Warehouse</a></li>
<li><a href='http://www.dataintoresults.com/2011/09/data-manipulation-part-2-etl/' rel='bookmark' title='Data Manipulation Part 2 : ETL'>Data Manipulation Part 2 : ETL</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-thumbnail wp-image-143" title="datamining" src="http://www.dataintoresults.com/wp-content/uploads/2009/07/datamining-150x150.jpg" alt="datamining" width="120" height="120" />Data manipulation is a big part of a data mining process. Some authors claim it can take 80% of a data mining project. I can only agree. If the data comes from the data warehouse, it can be a lot faster. If you have to dig into (and understand) operational systems or add some external data, the work takes even more time. Therefore it is of the greatest importance to be efficient at data manipulation. Currently I use two ways to do this task: big SQL queries or ETL, depending on the situation.</p>
<p><span id="more-224"></span><strong>Big SQL Queries</strong></p>
<p>Usually, to extract data from a database, you write an SQL query and export the result set to a CSV (comma-separated values) or Excel file. Depending on your knowledge of Excel and SQL, you may or may not do some post-processing in Excel. I think it is more convenient to do everything in the query, in order to be able to reproduce the dataset at any time. I have found some SQL tips very useful for big analyses.</p>
<p><strong>Common Table Expressions (CTE) </strong></p>
<p>Using subqueries (and sub sub sub &#8230; sub queries) makes a query just impossible to understand (at least for me). The WITH keyword makes things simple by letting you write the query in a more functional or procedural fashion (without any performance hit, since the query is still set-based).</p>
<pre>with customers as (
/* Do some cleanup in clients */
    select CLIE_CLIENT_ID as id,
        lower(CLIE_NAME) as name,
        decode(CLIE_TYPE, 'I', 'Individual',
            'P', 'Professional', 'Unknown') as category,
        CLIE_DISTRICT as district,
        case when CLIE_DT_BEGIN &gt; sysdate - 365 then 'New'
            else 'Old' end as age
    from CLIENT
    where CLIE_DISABLED = 'F'
),
location as (
/* Find some geographical attributes */
    select DIST_DISTRICT_ID as district,
        DIST_POPULATION as district_pop,
        COUN_POPULATION as country_pop
    from DISTRICT join COUNTRY on DIST_COUNTRY_ID = COUN_COUNTRY_ID
)
/* Merge all */
select id, name, category, age, district_pop, country_pop
from customers natural join location;</pre>
<p>A query is easier to understand when it is broken into small separate parts, which is what WITH does.</p>
<p><strong>Analytical aggregations</strong></p>
<p>Analytical aggregations have been part of SQL since SQL:2003 (so support may vary depending on your DBMS). They allow you to extend each row of a table with aggregated values. You can find a good description <a href="http://www.orafaq.com/node/55">here</a>. For time series, you can add the moving average at each point (which is sometimes more predictive than the point value). For each customer transaction, you can add the sum of the amounts of all transactions of this customer.</p>
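<p>Both uses can be tried quickly from Python with the bundled SQLite, as in the sketch below. This is only an illustration: the <code>tx</code> table and its rows are made up, SQLite stands in for the Oracle database discussed in this post, and window functions require SQLite 3.25 or later.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tx (customer text, day integer, amount real)")
conn.executemany(
    "insert into tx values (?, ?, ?)",
    [("alice", 1, 10.0), ("alice", 2, 20.0), ("alice", 3, 60.0),
     ("bob", 1, 5.0), ("bob", 2, 15.0)],
)

# Each row keeps its own values and gains two aggregated columns:
# the customer's total, and a 2-point moving average over time.
rows = conn.execute(
    """
    select customer, day, amount,
        sum(amount) over (partition by customer) as customer_total,
        avg(amount) over (partition by customer order by day
            rows between 1 preceding and current row) as moving_avg
    from tx
    order by customer, day
    """
).fetchall()

for row in rows:
    print(row)
```

<p>Unlike a GROUP BY, the number of rows is unchanged: the five transactions come back as five rows, each extended with the aggregates.</p>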
<p><strong>Analytical join</strong></p>
<p>Analytical queries aren&#8217;t the same as transactional ones. Usually we need to join many rows together. When joining two tables, an RDBMS generally reads each row of one table, say A (or the subset selected by the <em>where</em> part of the query), and looks up the matching rows in the second table, say B, using an index. This requires a lot of random reads and is an iterative operation, which is something we want to avoid in a database. The complexity is a × log(b), with a the number of rows which need to be read in A, and b the total number of rows in B. Oracle can also perform MERGE joins and HASH joins; I will briefly explain how they work.</p>
<p>A MERGE join takes the subset of A (selected by the WHERE clause) and the subset of B. It orders both by the join key, then performs the join. If the join is an equijoin (using just &#8216;=&#8217;), a single read of each subset is enough to perform the join. The sort operation is what is costly here, so you need to restrict the subsets of A and B to the minimum, or allow the use of an ordered index.</p>
<p>A HASH join works like the index lookup, but it first builds a hash map from the subset of B, so each access is very fast (constant time on average). Building the hash map takes time proportional to b&#8217; on average, where b&#8217; is the number of rows in the subset of B. If b&#8217; is a lot smaller than b, it&#8217;s very efficient.</p>
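<p>To make the two strategies concrete, here is a small in-memory sketch of an equijoin done both ways. It only illustrates the general idea on Python dictionaries; it is not how Oracle implements these operators, and the sample rows are invented.</p>

```python
from collections import defaultdict


def hash_join(a_rows, b_rows, key):
    """Build a hash map on B once, then probe it for each row of A."""
    index = defaultdict(list)
    for b in b_rows:                  # build phase: one pass over B
        index[b[key]].append(b)
    joined = []
    for a in a_rows:                  # probe phase: average constant-time lookups
        for b in index.get(a[key], []):
            joined.append({**a, **b})
    return joined


def merge_join(a_rows, b_rows, key):
    """Sort both inputs on the join key, then walk them together."""
    a_sorted = sorted(a_rows, key=lambda r: r[key])   # the costly part
    b_sorted = sorted(b_rows, key=lambda r: r[key])
    joined, i, j = [], 0, 0
    while i < len(a_sorted) and j < len(b_sorted):
        ka, kb = a_sorted[i][key], b_sorted[j][key]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Emit every B row sharing this key, then move to the next A row.
            k = j
            while k < len(b_sorted) and b_sorted[k][key] == ka:
                joined.append({**a_sorted[i], **b_sorted[k]})
                k += 1
            i += 1
    return joined


# Hypothetical data: customers joined to their district.
customers = [
    {"district": 2, "name": "bob"},
    {"district": 1, "name": "alice"},
    {"district": 3, "name": "carol"},
]
districts = [{"district": 1, "pop": 1000}, {"district": 2, "pop": 5000}]

by_hash = hash_join(customers, districts, "district")
by_merge = merge_join(customers, districts, "district")
print(by_hash)
print(by_merge)
```

<p>Both functions return the same matches (carol has no district row and is dropped), only in a different order: the hash join follows the order of A, the merge join follows the sort order of the key.</p>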
<p>You can read more about optimizing <a href="http://psoug.org/reference/hints.html">queries</a> in Oracle. I don&#8217;t discuss bitmap indexes here; they are of great help but require DBA action. The ANALYZE TABLE operation is of course very useful too: it gives Oracle the statistics it needs to find the best join strategy for you.</p>


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/06/using-mysql-as-a-data-warehouse/' rel='bookmark' title='Using MySQL as a Data Warehouse'>Using MySQL as a Data Warehouse</a></li>
<li><a href='http://www.dataintoresults.com/2011/09/data-manipulation-part-2-etl/' rel='bookmark' title='Data Manipulation Part 2 : ETL'>Data Manipulation Part 2 : ETL</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2010/05/data-manipulation-sql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>About evaluation</title>
		<link>http://www.dataintoresults.com/2009/12/about-evaluation/</link>
		<comments>http://www.dataintoresults.com/2009/12/about-evaluation/#comments</comments>
		<pubDate>Sun, 06 Dec 2009 17:44:47 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[data mining]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=206</guid>
		<description><![CDATA[When deploying a model, one very important thing is to monitor the results. Does it work as you expected? I&#8217;m not talking about pre-production tests but about following the life of your model. I use two kinds of reports to do that: preventive reports and corrective reports.  As you would expect, the first one is [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/10/machine-learning-vs-simulation/' rel='bookmark' title='Machine learning vs simulation'>Machine learning vs simulation</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/what-to-do-when-model-fails/' rel='bookmark' title='How to : What to do when your model fails?'>How to : What to do when your model fails?</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-thumbnail wp-image-220" title="evaluation" src="http://www.dataintoresults.com/wp-content/uploads/2009/12/evaluation-150x150.jpg" alt="evaluation" width="126" height="126" />When deploying a model, one very important thing is to monitor the results. Does it work as you expected? I&#8217;m not talking about pre-production tests but about following the life of your model. I use two kinds of reports to do that: preventive reports and corrective reports. As you would expect, the first one is created just after the prediction and the second is created once the consequence of the prediction is known.</p>
<p><span id="more-206"></span></p>
<p><strong>Preventive reporting</strong></p>
<p>My latest model predicts five weeks ahead. The result can still be changed within one week; after that, it&#8217;s just too late. As these predictions can have dramatic impacts, I want to be sure they won&#8217;t cause any mess (learning is done each time on an updated dataset). As there are thousands of predictions each time, it would be impossible for me to check everything (besides the fact that it&#8217;s sometimes tricky).</p>
<p>I predict not only 5 weeks ahead but also 6, 7 and 8 weeks ahead. Thus I can watch how the prediction evolves. If results change a lot from one horizon to the next, it&#8217;s worth digging deeper. It&#8217;s also interesting to see whether the results are much the same 10 weeks ahead and 5 weeks ahead. If they are always the same, you could publish results 5 weeks sooner, which could unlock a business opportunity.</p>
<p>I also compare with past data, as my predictions are comparable from year to year (though of course not exactly the same). This can prevent big mistakes on a particular prediction.</p>
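<p>As an illustration of this kind of check, here is a minimal Python sketch. It is not the report I actually use: the function name <em>unstable_predictions</em>, the 10% threshold and the store data are all assumptions made up for the example. It flags the items whose prediction moved a lot between two consecutive horizons.</p>

```python
def unstable_predictions(by_horizon, threshold=0.10):
    """Return {item_id: largest relative change} for items whose prediction
    moved by more than `threshold` between two consecutive horizons.

    by_horizon maps a horizon in weeks to {item_id: predicted value}.
    The 10% default threshold is an arbitrary, hypothetical choice.
    """
    horizons = sorted(by_horizon, reverse=True)  # e.g. 8, 7, 6, 5
    flagged = {}
    for older, newer in zip(horizons, horizons[1:]):
        for item_id, new_value in by_horizon[newer].items():
            old_value = by_horizon[older].get(item_id)
            if old_value is None or old_value == 0:
                continue  # nothing to compare against
            change = abs(new_value - old_value) / abs(old_value)
            if change > threshold:
                flagged[item_id] = max(change, flagged.get(item_id, 0.0))
    return flagged


# Hypothetical predictions for the same target week, made 8 to 5 weeks ahead.
predictions = {
    8: {"store_a": 100.0, "store_b": 50.0},
    7: {"store_a": 102.0, "store_b": 52.0},
    6: {"store_a": 101.0, "store_b": 70.0},
    5: {"store_a": 103.0, "store_b": 71.0},
}
flagged = unstable_predictions(predictions)
```

<p>Here only <em>store_b</em> is flagged, because its prediction jumps from 52 to 70 between the 7-week and 6-week horizons. With thousands of predictions, such a filter reduces the manual check to a short list.</p>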
<p><strong>Corrective reporting</strong></p>
<p>Corrective reporting takes place once the prediction can be evaluated. At the top of such a report I have one or more dial charts which give an aggregated indicator, i.e. whether the error is acceptable or not. If it seems fine, there is nothing more to do. If not, I have many statistics to find where the error is. Using only aggregated error indicators like root mean squared error doesn&#8217;t give any hint about why the error is so big. As I&#8217;m making hundreds of numeric predictions each time, I compute the mean deviation from the real values. Usually this value should be around 0%, as positive and negative deviations compensate for each other. Nevertheless, sometimes I see 5%, i.e. on average each prediction is 5% higher than the value which should have been predicted. This reveals an obvious problem (learning set too old or containing mistakes, a change in the process, &#8230;). You could also compute RMSE on different subspaces of the evaluation set, but it seems quite complicated to obtain useful insights that way.</p>
<p>In conclusion, I have found that prediction is not the end of the data miner&#8217;s job if you want quality. How do you follow your models?</p>
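<p>To make the difference between the two indicators concrete, here is a small Python sketch. The numbers are invented for the example and the function names are mine; the mean deviation is taken here as the average of (predicted - actual) / actual, which is one reasonable reading of the indicator described above.</p>

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: always positive, the sign of errors is lost."""
    squared = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_relative_deviation(actual, predicted):
    """Average of (predicted - actual) / actual; the sign is kept,
    so errors in opposite directions cancel out."""
    ratios = [(p - a) / a for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


# Hypothetical evaluation set.
actual = [100.0, 200.0, 50.0, 400.0]
noisy = [110.0, 180.0, 55.0, 360.0]    # errors of +10%, -10%, +10%, -10%
biased = [105.0, 210.0, 52.5, 420.0]   # every prediction is 5% too high

noisy_deviation = mean_relative_deviation(actual, noisy)
biased_deviation = mean_relative_deviation(actual, biased)
```

<p>In this toy case the noisy predictions have the larger RMSE but a mean deviation of about 0%, while the biased predictions have a smaller RMSE and a mean deviation of 5%. RMSE alone would not reveal that the second model is systematically too high.</p>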


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/10/machine-learning-vs-simulation/' rel='bookmark' title='Machine learning vs simulation'>Machine learning vs simulation</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/what-to-do-when-model-fails/' rel='bookmark' title='How to : What to do when your model fails?'>How to : What to do when your model fails?</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2009/12/about-evaluation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Machine learning vs simulation</title>
		<link>http://www.dataintoresults.com/2009/10/machine-learning-vs-simulation/</link>
		<comments>http://www.dataintoresults.com/2009/10/machine-learning-vs-simulation/#comments</comments>
		<pubDate>Thu, 29 Oct 2009 12:29:53 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[data mining]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=193</guid>
		<description><![CDATA[Lately, I was thinking about the difference between machine learning and simulation (for prediction).  Machine learning uses historical inputs and outputs to find subsequent outputs.  Simulation, on the other hand, assumes you already have the knowledge, i.e. the underlying model, so you don&#8217;t need historical data to learn it.  Sometimes you can use both methods to [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/12/about-evaluation/' rel='bookmark' title='About evaluation'>About evaluation</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/what-to-do-when-model-fails/' rel='bookmark' title='How to : What to do when your model fails?'>How to : What to do when your model fails?</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/what-is-the-value-of-your-work/' rel='bookmark' title='What is the value of your work?'>What is the value of your work?</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-medium wp-image-143" title="datamining" src="http://www.dataintoresults.com/wp-content/uploads/2009/07/datamining-278x300.jpg" alt="datamining" width="117" height="126" />Lately, I was thinking about the difference between machine learning and simulation (for prediction).  Machine learning uses historical inputs and outputs to find subsequent outputs.  Simulation, on the other hand, assumes you already have the knowledge, i.e. the underlying model, so you don&#8217;t need historical data to learn it.  Sometimes you can use both methods to learn something, sometimes only one method is available. After thinking about it, I find that the distinction between them is thinner than I thought.</p>
<p><span id="more-193"></span>I think there is a continuum and it depends on the amount of domain knowledge you add versus the amount of learning. Pure machine learning doesn&#8217;t take anything other than data. In theory everything could be learnt from it. At the other end, pure simulation considers that everything is known and that no noise is present.</p>
<p>Of course, in real life, it&#8217;s a bit more complex than that. You have to inject some domain knowledge into your models to <em>help</em> them: post-processing rules, or adding some nodes and leaves to a decision tree.  For simulation, often you can&#8217;t model everything, or there is always a parameter you need to estimate from historical data (it&#8217;s more statistics than machine learning, but you still need data).</p>
<p>Machine learning never gives perfect results because of the learning approximation. Nevertheless, the same applies to simulation and it&#8217;s sometimes worse. If you give wrong input data to a simulator it will produce garbage. Machine learning doesn&#8217;t suffer much if the error is consistent (for example, all your numerical inputs divided by 2). Moreover, when the underlying process changes, machine learning can change its model faster than a simulation or a model with too much domain knowledge.  I&#8217;m sure you have already built a model with some domain knowledge where removing that knowledge gave better results. As always there is a tradeoff to make.</p>
<p>Both tools can be useful and, to some extent, merged together.</p>
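<p>The point about consistent input errors can be sketched in a few lines of Python. This is only a toy illustration with invented data: the learned model is a one-variable least squares line, and the &#8220;simulator&#8221; is a hypothetical fixed formula y = 2x + 1 standing in for domain knowledge.</p>

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate(x):
    """Hypothetical simulator: the formula is fixed and cannot adapt."""
    return 2 * x + 1


true_x = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

# The same measurement error (inputs divided by 2) is present
# at learning time and at prediction time.
halved_x = [x / 2 for x in true_x]
slope, intercept = fit_line(true_x, ys)
slope_h, intercept_h = fit_line(halved_x, ys)

new_x = 6.0
learned_clean = slope * new_x + intercept
learned_halved = slope_h * (new_x / 2) + intercept_h

simulated_clean = simulate(new_x)
simulated_halved = simulate(new_x / 2)
```

<p>The learned model gives the same prediction with clean or halved inputs, because the slope it learns on halved inputs is simply twice as large. The fixed formula moves from 13 to 7. Of course this only holds when the error is the same during learning and prediction.</p>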


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/12/about-evaluation/' rel='bookmark' title='About evaluation'>About evaluation</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/what-to-do-when-model-fails/' rel='bookmark' title='How to : What to do when your model fails?'>How to : What to do when your model fails?</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/what-is-the-value-of-your-work/' rel='bookmark' title='What is the value of your work?'>What is the value of your work?</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2009/10/machine-learning-vs-simulation/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>INFORMS Data Mining Contest Part 1</title>
		<link>http://www.dataintoresults.com/2009/08/informs-data-mining-contest-1/</link>
		<comments>http://www.dataintoresults.com/2009/08/informs-data-mining-contest-1/#comments</comments>
		<pubDate>Sun, 16 Aug 2009 19:03:06 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Applications]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[relational mining]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=175</guid>
		<description><![CDATA[A new data mining contest is available here.  The functional domain is medical; more precisely, there are two tasks. First, we need to predict whether a given patient will be transferred to another hospital. The second task is to predict whether the patient will die (the medical domain definitely lacks fun). For each task, [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/mining-twitter-data/' rel='bookmark' title='Mining Twitter data'>Mining Twitter data</a></li>
<li><a href='http://www.dataintoresults.com/2010/05/data-manipulation-sql/' rel='bookmark' title='Data Manipulation Part 1 : SQL'>Data Manipulation Part 1 : SQL</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-medium wp-image-174" title="trophe" src="http://www.dataintoresults.com/wp-content/uploads/2009/08/trophe-185x300.jpg" alt="trophe" width="62" height="101" />A new data mining contest is available <a title="Informs dta mining contest" href="http://www.informsdmcontest2009.org/" target="_blank">here</a>.  The functional domain is medical; more precisely, there are two tasks. First, we need to predict whether a given patient will be transferred to another hospital. The second task is to predict whether the patient will die (the medical domain definitely lacks fun). For each task, we give each patient a score, ranking patients from the most probable to the least. The dataset contains many challenges. In this post, I propose my personal ideas for handling these challenges.</p>
<p><span id="more-175"></span></p>
<p><strong>Sequences</strong></p>
<p>Each patient is represented by a sequence (previous visits and the current one). For a given patient we have many lines in the file. Sequences do not have a fixed length, so we can&#8217;t just concatenate everything onto one line.</p>
<p><strong>Set-valued attributes</strong></p>
<p>There are also some set-valued attributes (attributes whose value is a set). In the data file such a set is represented by <em>Other-Dx-Code-1</em>, <em>Other-Dx-Code-2</em>, &#8230; with <em>Other-Dx-Code-9</em> often missing. There are also <em>Principal-Dx-Code</em> and <em>Admit-Dx-Code</em>, which I see as part of the set.</p>
<p><strong>Hierarchical attributes</strong></p>
<p>Some attributes form a hierarchy. For instance, Hospital-ID and Region-ID are two levels of a geographical hierarchy. I don&#8217;t know how a hierarchy can be used in data mining (well, in a cleverer way than as standard attributes). It could be interesting for generalization purposes and for reducing overfitting.</p>
<p><strong>It&#8217;s relational</strong></p>
<p>These three problems have their relational nature in common. I think it&#8217;s madness to use the data directly as a single table; we need to formalize the problem better first. Then we could construct a single table using two features of relational data mining, <a href="http://www.cs.iastate.edu/~honavar/Papers/ilpfinal.pdf" target="_blank">selection graphs</a> and <a href="http://pages.stern.nyu.edu/~fprovost/Papers/claudia-kdd03-final.pdf" target="_blank">aggregation</a> (either manually or automatically). Notice that the last link is a paper by the contest organiser Claudia Perlich, so I think I can&#8217;t be too wrong. I don&#8217;t know if it&#8217;s the best way, but if I do something it will clearly be in this direction.</p>
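<p>To give an idea of what the manual aggregation could look like, here is a minimal Python sketch. It is only an assumption about how one might proceed, not the method of the cited papers: the field names (<em>patient</em>, <em>hospital</em>, <em>dx_codes</em>, <em>stay_days</em>) and the values are invented, and the real contest file has a different layout.</p>

```python
from collections import defaultdict

# Hypothetical visit rows: several lines per patient, with the diagnosis
# codes of each visit already gathered into a list.
visits = [
    {"patient": "p1", "hospital": "h1", "dx_codes": ["250", "401"], "stay_days": 3},
    {"patient": "p1", "hospital": "h2", "dx_codes": ["401"], "stay_days": 5},
    {"patient": "p2", "hospital": "h1", "dx_codes": ["486"], "stay_days": 2},
]


def aggregate_by_patient(rows):
    """Turn variable-length visit sequences into one fixed-width row per patient."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["patient"]].append(row)

    table = {}
    for patient, sequence in grouped.items():
        codes = set()
        for row in sequence:
            codes.update(row["dx_codes"])  # union of the set-valued attribute
        table[patient] = {
            "n_visits": len(sequence),
            "total_stay_days": sum(r["stay_days"] for r in sequence),
            "max_stay_days": max(r["stay_days"] for r in sequence),
            "n_hospitals": len({r["hospital"] for r in sequence}),
            "n_distinct_dx": len(codes),
        }
    return table


single_table = aggregate_by_patient(visits)
```

<p>Each patient ends up with the same number of columns whatever the length of the sequence, so a standard single-table algorithm can be applied afterwards. Which aggregates are useful (count, sum, max, distinct codes, &#8230;) is exactly the part that has to be chosen manually or searched automatically.</p>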


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/mining-twitter-data/' rel='bookmark' title='Mining Twitter data'>Mining Twitter data</a></li>
<li><a href='http://www.dataintoresults.com/2010/05/data-manipulation-sql/' rel='bookmark' title='Data Manipulation Part 1 : SQL'>Data Manipulation Part 1 : SQL</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2009/08/informs-data-mining-contest-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The cost of reducing costs</title>
		<link>http://www.dataintoresults.com/2009/08/the-cost-of-reducing-cost/</link>
		<comments>http://www.dataintoresults.com/2009/08/the-cost-of-reducing-cost/#comments</comments>
		<pubDate>Fri, 07 Aug 2009 20:53:59 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Thoughts]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[marketing]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=166</guid>
		<description><![CDATA[Predicting the number of sales representatives needed at a particular time in a particular store is harder than expected. If you instrument the whole process, you could know the activity of your representatives (number of customers, average time of a transaction, activity rate, &#8230;). We could then predict the number of required representatives. We know the [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/07/what-is-the-value-of-your-work/' rel='bookmark' title='What is the value of your work?'>What is the value of your work?</a></li>
<li><a href='http://www.dataintoresults.com/2011/12/new-hype/' rel='bookmark' title='Big data and mobile BI : New hype but same old issue'>Big data and mobile BI : New hype but same old issue</a></li>
<li><a href='http://www.dataintoresults.com/2011/10/marketingcalculator/' rel='bookmark' title='Book review : Marketing calculator'>Book review : Marketing calculator</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-full wp-image-167" title="Cost killing" src="http://www.dataintoresults.com/wp-content/uploads/2009/08/cost_killer.jpg" alt="Cost killing" width="99" height="126" />Predicting the number of sales representatives needed at a particular time in a particular store is harder than expected. If you instrument the whole process, you could know the activity of your representatives (number of customers, average time of a transaction, activity rate, &#8230;). We could then predict the number of required representatives. We know the cost of scheduling too many of them, but what is the cost of having too few representatives? How do you value a missed opportunity, a customer dissatisfied with the quality of service, or the behaviour of an overstressed employee?</p>
<p><span id="more-166"></span></p>
<p>This problem can be extended to other areas where human presence is important (call centers, postal services, &#8230;) or when dealing with supplies. Having too many items of a product on the shelf is costly, but having too few is costly too because of missed opportunities, and that cost is hard to measure. If you have this value, the whole problem is just an optimisation one: find the quality of service which maximizes earnings while minimizing costs.</p>
<p>I don&#8217;t have many clues right now (happy to hear yours if you have one).  I read that a telco in New Zealand uses 0.001 as the probability that a customer switches to a competitor if they can&#8217;t reach an operator on a call. However, finding this magic number is not easy (I don&#8217;t know how they obtained it).</p>
<p>One idea is to collect data on what happened when the quality of service was degraded. But you can&#8217;t collect much of this data; if you can, your business has a big issue. Moreover, such data will be quite complex to handle. For instance, a customer comes into your shop and needs to wait 15 minutes before seeing a sales representative. If this customer never shows up again, is it because of the quality of service the last time he came? Or because of an external factor you can&#8217;t capture (like a new ad campaign by your competitor)? With a small dataset, huge complexity and external factors, there is only a thin chance of getting something statistically relevant.</p>
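<p>To show how such a number would be used once you have it, here is a deliberately naive Python sketch of the optimisation. Only the 0.001 probability comes from the telco anecdote; the call volume, the wage, the capacity of an operator, the customer values and the linear model of unanswered calls are all invented assumptions.</p>

```python
def expected_cost(operators, calls, calls_per_operator, wage,
                  churn_probability, customer_value):
    """Staffing cost plus the expected value lost through unanswered calls."""
    unanswered = max(0, calls - operators * calls_per_operator)
    return operators * wage + unanswered * churn_probability * customer_value


def best_staffing(max_operators, **scenario):
    """Number of operators (0..max_operators) with the lowest expected cost."""
    return min(range(max_operators + 1),
               key=lambda n: expected_cost(n, **scenario))


# Hypothetical hour: 95 calls, 10 calls per operator, 20 per operator-hour.
scenario = dict(calls=95, calls_per_operator=10, wage=20.0,
                churn_probability=0.001)

staff_if_valuable = best_staffing(20, customer_value=5000.0, **scenario)
staff_if_cheap = best_staffing(20, customer_value=500.0, **scenario)
```

<p>Under these toy assumptions the best staffing is 10 operators when a lost customer is worth 5000, and 0 operators when a lost customer is worth 500. The answer swings completely with the value put on a missed call, which is why the magic number matters so much.</p>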
<p>Feel free to comment if you have a hint on this problem.</p>


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/07/what-is-the-value-of-your-work/' rel='bookmark' title='What is the value of your work?'>What is the value of your work?</a></li>
<li><a href='http://www.dataintoresults.com/2011/12/new-hype/' rel='bookmark' title='Big data and mobile BI : New hype but same old issue'>Big data and mobile BI : New hype but same old issue</a></li>
<li><a href='http://www.dataintoresults.com/2011/10/marketingcalculator/' rel='bookmark' title='Book review : Marketing calculator'>Book review : Marketing calculator</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2009/08/the-cost-of-reducing-cost/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Book review : Programming Collective Intelligence</title>
		<link>http://www.dataintoresults.com/2009/08/book-review-programming-collective-intelligence/</link>
		<comments>http://www.dataintoresults.com/2009/08/book-review-programming-collective-intelligence/#comments</comments>
		<pubDate>Sat, 01 Aug 2009 14:34:32 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[Book review]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[web mining]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=151</guid>
		<description><![CDATA[Programming Collective Intelligence is a great book. It covers most of the existing data mining algorithms and presents many applications for them.  It covers clustering (k-means, hierarchical), supervised classification (k-nearest neighbours, Naïve Bayes, decision trees, SVM), data analysis (non-negative matrix factorization), optimisation (hill climbing, simulated annealing and genetic algorithms) and ends with genetic programming. [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/06/book-review-collective-intelligence-in-action/' rel='bookmark' title='Book review :  Collective Intelligence in Action'>Book review :  Collective Intelligence in Action</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/book-review/' rel='bookmark' title='Book review : Competing on analytics'>Book review : Competing on analytics</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/book-review-actionable-web-analytics/' rel='bookmark' title='Book review : Actionable Web Analytics'>Book review : Actionable Web Analytics</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="size-full wp-image-152 alignright" title="pcolint" src="http://www.dataintoresults.com/wp-content/uploads/2009/07/pcolint.jpg" alt="Programming Collective Intelligence" width="122" height="159" /></p>
<p><a href="http://oreilly.com/catalog/9780596529321/" target="_blank">Programming Collective Intelligence</a> is a great book. It covers most of the existing data mining algorithms and presents many applications for them.  It covers clustering (k-means, hierarchical), supervised classification (k-nearest neighbours, Naïve Bayes, decision trees, SVM), data analysis (non-negative matrix factorization), optimisation (hill climbing, simulated annealing and genetic algorithms) and ends with genetic programming. Along the way, it presents applications like spam detection, pricing, recommendation, &#8230; If you want to get started in data mining, this is a very good way to do it.</p>
<p><span id="more-151"></span>Examples are given in Python, a language I have never used. Nevertheless, they are quite easy to follow. Python has a very concise syntax, which avoids having hundreds of lines of code in the book. Many third-party libraries are used, especially to connect to third-party services (Facebook, eBay, &#8230;) to produce the datasets.</p>
<p>In comparison to <a title="Collective Intelligence in Action" href="http://www.dataintoresults.com/2009/06/book-review-collective-intelligence-in-action/" target="_blank">Collective Intelligence in Action</a>, this book is more focused on data mining; there is, for instance, no discussion of how to scale to big datasets.  Nevertheless, it contains a lot more information, so I would recommend this book over the other.</p>


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/06/book-review-collective-intelligence-in-action/' rel='bookmark' title='Book review :  Collective Intelligence in Action'>Book review :  Collective Intelligence in Action</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/book-review/' rel='bookmark' title='Book review : Competing on analytics'>Book review : Competing on analytics</a></li>
<li><a href='http://www.dataintoresults.com/2009/06/book-review-actionable-web-analytics/' rel='bookmark' title='Book review : Actionable Web Analytics'>Book review : Actionable Web Analytics</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2009/08/book-review-programming-collective-intelligence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to : What to do when your model fails?</title>
		<link>http://www.dataintoresults.com/2009/07/what-to-do-when-model-fails/</link>
		<comments>http://www.dataintoresults.com/2009/07/what-to-do-when-model-fails/#comments</comments>
		<pubDate>Sat, 25 Jul 2009 17:28:06 +0000</pubDate>
		<dc:creator>Sébastien Derivaux</dc:creator>
				<category><![CDATA[How To]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.dataintoresults.com/?p=142</guid>
		<description><![CDATA[Sometimes (well, most of the time) using your favorite data mining methods and the most obvious attributes is not good enough. What to do then? A common idea is to try every other model your software provides and/or add every attribute you can think of, whatever its relation to your problem. In this post, I [...]


Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/10/machine-learning-vs-simulation/' rel='bookmark' title='Machine learning vs simulation'>Machine learning vs simulation</a></li>
<li><a href='http://www.dataintoresults.com/2009/12/about-evaluation/' rel='bookmark' title='About evaluation'>About evaluation</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-full wp-image-143" title="datamining" src="http://www.dataintoresults.com/wp-content/uploads/2009/07/datamining.jpg" alt="datamining" width="123" height="132" />Sometimes (well, most of the time) using your favorite data mining methods and the most obvious attributes is not good enough. What to do then? A common idea is to try every other model your software provides and/or add every attribute you can think of, whatever its relation to your problem. In this post, I will try to work out a kind of &#8220;how to&#8221; for this case.</p>
<p><strong>Step 1 : What is my model?</strong></p>
<p>If your model is a neural network, it&#8217;s quite hard to get any insight into how it works by looking at the weights or activation functions. How could you improve something you don&#8217;t understand?</p>
<p><span id="more-142"></span></p>
<p><strong>Step 1.1  : Is there a human version, please?</strong></p>
<p>There are two models which are easy to understand: decision trees (for classification) and linear regression (for regression). It costs nothing to use them. If they give results close to those of the initial model, they can be a good approximation and capture most of the inner logic of that model.</p>
<p>If results of such a simple model are too far from your model, you could consider using ensembles of models learned on different subsets of the learning set.</p>
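<p>Here is a minimal Python sketch of this &#8220;human version&#8221; idea for the regression case. The <em>black_box</em> function is a made-up stand-in for an opaque model, and measuring closeness with an R&#178; between the two sets of predictions is my assumption about what &#8220;close to the initial model&#8221; could mean.</p>

```python
import math


def black_box(x):
    """Hypothetical stand-in for an opaque model such as a neural network."""
    return 3.0 * x + 2.0 + 0.5 * math.sin(x)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(reference, approximation):
    """Share of the variance of `reference` reproduced by `approximation`."""
    mean_ref = sum(reference) / len(reference)
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    ss_res = sum((r - a) ** 2 for r, a in zip(reference, approximation))
    return 1.0 - ss_res / ss_tot


xs = [float(i) for i in range(20)]
opaque = [black_box(x) for x in xs]

# The simple model is fitted on the outputs of the opaque model,
# not on the real target values.
slope, intercept = fit_line(xs, opaque)
surrogate = [slope * x + intercept for x in xs]
fidelity = r_squared(opaque, surrogate)
```

<p>A fidelity close to 1 suggests that the slope and intercept of the simple model describe most of what the opaque model does; a low fidelity means the simple model should not be trusted as an explanation.</p>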
<p><strong>Step 1.2  : Does it use the attributes I provide?</strong></p>
<p>Once you have these simpler models, you gain insight into which attributes are used and how. In a linear regression you can look at the p-values to know which attributes have an impact. Be aware that p-values can be misleading when attributes are collinear.</p>
<p><strong>Step 2 : Where and why is my model failing?</strong></p>
<p>Maybe your model works fine half of the time, but really fails in some cases. Try to find where the model makes its biggest mistakes.</p>
<p>Using the simpler version from step 1, you should be able to run the model manually on these failing cases and see why it doesn&#8217;t work.</p>
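<p>One simple way to locate the failures is to rank segments of the evaluation set by their error. The Python sketch below is only an example of that idea: the segments (weekday, weekend, holiday), the values and the choice of mean absolute error are all assumptions.</p>

```python
from collections import defaultdict


def mean_abs_error_by_segment(records):
    """records holds (segment, actual, predicted) tuples.
    Returns (segment, mean absolute error) pairs, worst segment first."""
    errors = defaultdict(list)
    for segment, actual, predicted in records:
        errors[segment].append(abs(predicted - actual))
    means = {seg: sum(errs) / len(errs) for seg, errs in errors.items()}
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical evaluation records.
records = [
    ("weekday", 100.0, 98.0),
    ("weekday", 120.0, 123.0),
    ("weekend", 200.0, 150.0),
    ("weekend", 180.0, 230.0),
    ("holiday", 90.0, 91.0),
]
ranking = mean_abs_error_by_segment(records)
```

<p>In this toy data the weekend segment comes first with a mean absolute error of 50, far ahead of the others, so that is where the manual inspection should start.</p>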
<p><strong>Step 3 : What can I do?</strong></p>
<p>At the time of writing I can think of two main actions that could improve the model: adding more attributes and segmenting the problem. There is another one which doesn&#8217;t directly improve the model but improves the results; I call it cheating (from a machine learning perspective).</p>
<p>An idea for a new attribute can arise when finding where the model fails. But in general, you can&#8217;t get this attribute. For instance, if you need the oil price three months ahead, it would be quite challenging to find. And if you were able to predict it, you would get rich enough to forget the initial problem and drink mojitos on a beach all day long.</p>
<p>Segmenting the problem means building a separate model for each subset of your population which behaves differently. It&#8217;s like building a decision tree to choose which model to use; if the underlying models are already decision trees, doing it manually would generally be useless.</p>
<p>Cheating can be used when you know something about your problem but your model can&#8217;t express it. Maybe it&#8217;s a multiplicative factor. Of course a linear regression can&#8217;t do the trick. Thus, you could use a filter approach (pre-processing and post-processing) to use it.</p>
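<p>As one possible reading of this filter approach, here is a Python sketch where the known multiplicative behaviour is handled with a logarithm before the regression and an exponential after it. The log transform is my choice for the example, and the data (a value multiplied by 1.5 at each step) is invented.</p>

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical process: the value is multiplied by 1.5 at each step.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * 1.5 ** x for x in xs]
true_value = 2.0 * 1.5 ** 6.0

# Plain linear regression on the raw values.
slope, intercept = fit_line(xs, ys)
linear_prediction = slope * 6.0 + intercept

# Filter approach: log as pre-processing, exp as post-processing.
log_slope, log_intercept = fit_line(xs, [math.log(y) for y in ys])
filtered_prediction = math.exp(log_slope * 6.0 + log_intercept)
```

<p>On this noise-free toy data the filtered prediction matches the true value at x = 6 up to rounding, while the plain linear regression falls well short of it. The model itself is unchanged; only the knowledge injected around it differs.</p>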
<p><strong>Step 4 : What if it does not work?</strong></p>
<p>If you end up here, I see only one solution: use genetic programming, give it access to all the data you have and to every possible mathematical, logical or other function, and wait. You could look at the <a title="GenIQ" href="http://www.dmstat1.com/DMtechnique.html">GenIQ</a> pages for a description. I think it will not work 99.9% of the time. But if you have an unused computer, you can run it and come back some days, weeks or months later. In the meantime you can do something else. It&#8217;s the computer that is at work.</p>
<p><strong>Another idea?</strong></p>
<p>This only presents the process I&#8217;m currently using. If you have another idea, please feel free to leave a comment.</p>


<p>Related posts:<ol><li><a href='http://www.dataintoresults.com/2009/10/machine-learning-vs-simulation/' rel='bookmark' title='Machine learning vs simulation'>Machine learning vs simulation</a></li>
<li><a href='http://www.dataintoresults.com/2009/12/about-evaluation/' rel='bookmark' title='About evaluation'>About evaluation</a></li>
<li><a href='http://www.dataintoresults.com/2009/07/data-mining-tools/' rel='bookmark' title='Data mining tools'>Data mining tools</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.dataintoresults.com/2009/07/what-to-do-when-model-fails/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

