Archive for the ‘Modeling’ Category



The Seven Secrets of Successful Data Scientists

Tuesday, September 7th, 2010 by baqmarblog
by Michael E. Driscoll

At O’Reilly’s “Making Data Work” seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.

What follows is a blog-ified and amended version of that talk, originally entitled “Secrets of Successful Data Scientists.”

1. Choose The Right-Sized Tool

Or, as I like to say, you don’t need a chainsaw to cut butter.

If you’ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it (yes, curse the Flying Spaghetti Monster, I’ve just endorsed that dull knife called Excel).

In fact, Excel’s and Emacs’ program-by-example keyboard macros can be fantastic tools for quick and dirty data clean-up.

Alternatively, if you’ve got 600 million lines of data and you need something simple, piping together several Unix tools (cut, uniq, sort) with a dash of Perl one-liner foo may get you there.

But don’t confuse this kind of data exploration, where the goal is to size up the data, with building proper data plumbing, where you want robustness and maintainability. Perl and bash scripts are nice for the former, but can be a nightmare for building data pipelines.
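For the exploratory case, the cut | sort | uniq -c idiom can also be sketched in a few lines of Python. This is only an illustration: the function name count_column and the sample rows are made up here, and a real file would be opened with open() instead of the in-memory buffer.

```python
import csv
import io
from collections import Counter

def count_column(lines, column, delimiter=","):
    """Stream rows and tally one column, roughly like cut | sort | uniq -c."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > column:  # skip blank or short rows
            counts[row[column]] += 1
    return counts

# Hypothetical sample rows standing in for a large file on disk.
sample = io.StringIO("alice,CA\nbob,AK\ncarol,CA\n")
print(count_column(sample, column=1).most_common())
# prints [('CA', 2), ('AK', 1)]
```

Because the rows are streamed one at a time, memory use grows with the number of distinct values, not with the number of lines.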

When your data gets very large, so big it can’t fit reasonably on your laptop (in 2010, that’s north of a terabyte), then you’re in Hadoop, parallelized database, or overpriced Big Iron territory.

So, when it comes to choosing tools: scale them up as you need, and focus on getting results first.

2. Compress Everything

We live in an IO-bound world, where the dominant bottlenecks to data flow are disk read-speed and network bandwidth.

As I was writing this, I was downloading an uncompressed CSV file via a web API. Uncompressed, it was 257MB, ZIP-compressed: 9MB.

Compression gives you a 6-8x bump out of the gate. When moving or crunching data of a certain heft, compress everything, always: it will save you time and money.
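As a rough sketch of what “compress everything” can look like in practice, Python’s standard gzip module reads and writes compressed text as a stream, so the data never has to sit uncompressed on disk. The helper names, the file name rows.csv.gz, and the synthetic rows are assumptions for illustration; actual ratios depend heavily on the data.

```python
import gzip
import os
import tempfile

def write_gzip_lines(path, lines):
    """Write text lines straight into a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        for line in lines:
            f.write(line + "\n")

def count_gzip_lines(path):
    """Stream the compressed file back without unpacking it on disk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)

# Synthetic, highly repetitive rows (ASCII only, so characters == bytes).
rows = [f"{i},user{i % 10},CA" for i in range(10000)]
raw_bytes = sum(len(r) + 1 for r in rows)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rows.csv.gz")
    write_gzip_lines(path, rows)
    gz_bytes = os.path.getsize(path)
    n_lines = count_gzip_lines(path)

print(n_lines, raw_bytes, gz_bytes)
```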

That said, because compression can render data difficult to introspect, I don’t recommend compressing TBs of data into a single tarball, but rather splitting it up, as I discuss next.

3. Split Up Your Data

“Monolithic” is a bad word in software development.

It’s also, in my experience, a bad word when it comes to data.

The real world is partitioned – whether as zip codes, states, hours, or top-level web domains – and your data should be too. Respect the grain of your data, because eventually you’ll need to use it to shard your database or distribute it across your file system.

Even more, it’s this splitting up of data that enables the parallel execution in Hadoop and commercial data platforms (such as Greenplum, Aster, and Netezza).

Splitting is part of a larger design pattern succinctly identified in a paper by Hadley Wickham as: split, apply, combine.

This is, in my mind, a more lucid formulation of “map, reduce” to include key selection (“split”) as a distinct step before any map/apply.
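A minimal sketch of the pattern in plain Python, with key selection as its own step. The function split_apply_combine and the visits records are invented for illustration and are not taken from Wickham’s paper or from any library.

```python
from collections import defaultdict

def split_apply_combine(records, key, apply):
    """Group records by key (split), run apply on each group, collect results (combine)."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)  # split
    return {k: apply(group) for k, group in groups.items()}  # apply + combine

# Hypothetical records: mean visits per state.
visits = [
    {"state": "CA", "visits": 4},
    {"state": "CA", "visits": 2},
    {"state": "AK", "visits": 5},
]
means = split_apply_combine(
    visits,
    key=lambda r: r["state"],
    apply=lambda g: sum(r["visits"] for r in g) / len(g),
)
print(means)
# prints {'CA': 3.0, 'AK': 5.0}
```

Since each group is processed independently, the apply step is the part that can be farmed out to separate workers or machines.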

4. Sample Your Data

Let’s say hypothetically you’ve got 200 GB of data from your portmanteau of a start-up, FaceLink. Someone wants to know if more people visit on Mondays or Fridays. What do you do?

Before you wonder “if only I had 64 GB of RAM on my MacBook Pro”, or fire up a Hadoop streaming job, try this: look at a 10k sample of data.

It’s easy to visually inspect, or pull into R and plot.

Sampling allows you to quickly iterate your approach, and work around edge cases (say, pesky unescaped line terminators), before running a many-hour job on the full monty.
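One way to pull a fixed-size sample out of a file too big to load is reservoir sampling, which makes a single pass and keeps only k items in memory. The sketch below is the textbook “Algorithm R”; the function name and the seed parameter are choices made here for illustration, not something from the original talk.

```python
import random

def reservoir_sample(iterable, k, seed=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(iterable):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (i + 1)
    return sample

# Usage: a 10k sample from a (here simulated) stream of a million rows.
picked = reservoir_sample(range(1_000_000), k=10_000, seed=42)
print(len(picked))
# prints 10000
```

Any iterable works as input, including an open file handle, so the same function can sit at the end of a decompression stream.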

That said, sampling can bite you if you’re not careful: when data is skewed, which it always is, it can be hard to estimate joint distributions – comparing the means of California vs Alaska, for example, if your sample is dominated by Californians (an issue that statistics, that sexy skill, can address).
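Stratified sampling is one of the standard statistical remedies for the California vs Alaska problem: sample a fixed number of records per group rather than from the data as a whole. The sketch below keeps one small reservoir per group; the function name, the per_group parameter, and the example records are assumptions for illustration only.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_group, seed=None):
    """Keep up to per_group uniformly chosen records for each group, in one pass."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for rec in records:
        group = key(rec)
        seen[group] += 1
        if len(reservoirs[group]) < per_group:
            reservoirs[group].append(rec)
        else:
            j = rng.randrange(seen[group])  # uniform over 0 .. seen - 1
            if j < per_group:
                reservoirs[group][j] = rec
    return dict(reservoirs)

# Hypothetical skewed data: 1000 Californians, 5 Alaskans.
people = [("CA", i) for i in range(1000)] + [("AK", i) for i in range(5)]
strata = stratified_sample(people, key=lambda p: p[0], per_group=5, seed=7)
print({state: len(rows) for state, rows in strata.items()})
# prints {'CA': 5, 'AK': 5}
```

Group means computed from such a sample are comparable across groups, but any overall average has to be reweighted by the true group sizes.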

5. Smart Borrows, But Genius Uses Open Source

Before you create something new out of whole cloth, pause and consider that someone else may have already seen it, solved it, and open-sourced it.

A Google Code Search may turn up a regular expression for that obscure data format.

The open source community allows you, if not to stand on the shoulders of giants, to at least rely on the gruntwork of fellow geeks.

6. Keep Your Head in the Cloud

This past week, an engineer friend was just thinking about buying a dream desktop: a high RAM, multi-core box to run machine learning code over TBs of data.

I told him it was a terrible idea.

Why? Because the data he wants to work on isn’t local, it’s on an Amazon EC2 cluster. It’d take hours to download those TBs over a cable connection.
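A back-of-the-envelope check makes the point concrete. The link speed below (100 Mbit/s, fully sustained) is an assumption chosen for illustration, not a figure from the conversation, and real throughput is usually lower.

```python
def transfer_hours(size_bytes, link_mbit_per_s):
    """Hours needed to move size_bytes over a link sustaining link_mbit_per_s."""
    bits = size_bytes * 8
    seconds = bits / (link_mbit_per_s * 1_000_000)
    return seconds / 3600

# One decimal terabyte over an assumed 100 Mbit/s connection.
print(round(transfer_hours(1e12, 100), 1))
# prints 22.2
```

At that assumed rate a single terabyte takes most of a day, and the time scales linearly with the number of terabytes.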

If you want to compute locally, pull down a sample. But if your data is in the cloud, that’s where your tools and code should be.

7. Don’t Be Clever

I once heard Brewster Kahle discuss managing the Internet Archive’s many-petabyte data platform: “Every time one of our engineers comes to me with a new, ingenious and clever idea for managing our data, I have a response: ‘You’re fired.’”

Hyperbole aside, his point is well-taken: cleverness doesn’t scale.

When dealing with big data, embrace standards and use commonly available tools. Most of all, keep it simple, because simplicity scales.

I know of a firm that, several years ago, decided to fork one part of Hadoop because they had a more clever approach. Today, they are several versions behind the latest release, and devoting time & energy to back-porting changes.

Cleverness rarely pays off. Focus your precious programmer-hours on the problems that are unsolved, not simply unoptimized.

IBM Acquires Unica

Friday, August 13th, 2010 by Geert Verstraeten

After SPSS (for predictive analytics), IBM has now acquired Unica as well. Unica is very well known for its campaign management solutions, a domain that is of course highly complementary to predictive analytics. Some more info, hot off the press:

Marketing software vendor Unica has been acquired by software giant IBM for 480 million dollars. Unica’s software is meant to further help IBM’s customers analyze and predict customer needs so that they can run more targeted campaigns. With this deal, the giant dips ever further into its budget for acquisitions and investments (twenty billion dollars).
IBM’s acquisition of Unica follows a series of acquisitions the software giant made earlier this year, including Coremetrics and Sterling Commerce. Last year IBM also acquired analytics player SPSS for 1.2 billion dollars. The acquisition of Unica still has to be approved by the company’s shareholders, but the deal is expected to close in the last quarter of 2010.

IBM increasingly wants to meet marketers’ wish to run targeted and personal marketing campaigns. It aims to achieve this by offering more analysis tools, for which it acquired SPSS and Coremetrics, but also by automating marketing activities ever better. That is where the acquisition of Unica comes in.

Unica was a clear choice for IBM, according to Craig Hayman, general manager at IBM Industry Solutions, because of its strength in automating a broad set of marketing activities and its reputation for delivering customer success in marketing for organizations around the world. For Unica, the acquisition means it can reach more customers worldwide.

Unica’s five hundred employees will be absorbed into IBM’s software solutions group.

Author: ITcommercie editorial staff

Source: ITcommercie

World Programming System: An Alternative to SAS

Wednesday, August 4th, 2010 by Sandro Saitta

In an earlier post, I mentioned two ways to reduce SAS licence costs. The first one, Carolina, consists of translating the SAS code into Java code. However, it does not seem easy to do, and the solution is not widely known (and thus there is no real support for it). Another solution is to interpret your SAS code using the World Programming System (WPS).

WPS is a SAS code interpreter. Its main advantage is that the licence cost is much lower. Also, you don’t need to change your code too much (see below). This is, to my knowledge, the easiest way to make a SAS program run without using SAS. However, since WPS has some issues reading and writing the .sas7bdat format, it is advisable to use WPS’s own format (converting from .sas7bdat to the WPS format is easy). WPS has its own editor, which is even better than Enterprise Guide 4.2, at least for coding and debugging. As with SAS, the WPS support team is very helpful and professional.

However, running a SAS program within WPS is not straightforward. Here is a list of issues I had when executing my SAS code with WPS:

  • Can’t update a data set in the SAS format (need to use the WPS format)
  • Unknown fdelete function
  • Unable to use ODS PDF option
  • noquotelenmax is not supported
  • Cannot use SYSTASK
  • Format is8601dt can’t be read

For information, I’m using SAS 9.2 and I tried WPS 2.4. Finally, keep in mind that WPS can interpret SAS code, but it is not (yet) able to read Enterprise Miner generated code, for example.

For more information: World Programming System (WPS).

Ethnicity and Geography of Facebook Users

Monday, July 26th, 2010 by Matthew Hurst

ePluribus: Ethnicity on Social Networks, by the Facebook data science team, includes some interesting estimates of the geographic distributions of Facebook users. My guess is that these are not dissimilar to those found in the census data, confirming the team’s suggestion that the ethnic distribution of Facebook users is approaching that of the internet population in general.

[Figures: FacebookEthnicity1, FacebookEthnicity2]

Challenges: Together we build the future of our industry!

Wednesday, July 14th, 2010 by baqmarblog

Challenge 5: SEARCH FOR TALENT
How could we attract bright and talented graduates to come and work in our industry? How could we make our industry more COOL to work in?

Challenge 6: NEW MEDIA IN MARKET RESEARCH
How will involving customer feedback on an ongoing basis change the way we do our job? (research communities, social media monitoring, … = constant feedback loops)