Featured Posts (202)

Guest blog post by Max Wegner

What’s the first thing you think of when you hear the phrase “artificial intelligence”? Perhaps it’s the HAL 9000 from 2001: A Space Odyssey, or maybe it’s chess Grandmaster Garry Kasparov losing to IBM’s Deep Blue supercomputer. While those are indeed examples of artificial intelligence, examples of AI in the real world of today are a bit more mundane and a whole lot less sinister.

In fact, many of us use AI, in one form or another, in our everyday lives. The personal assistant on your smartphone that helps you locate information, the facial recognition software on Facebook photos, and even the gesture control on your favourite video game are all examples of practical AI applications. Rather than being a part of a dystopian world view in which the machines take over, current AI makes our lives a whole lot more convenient by carrying out simple tasks for us.

What’s more, there’s a lot of money flowing into a lot of companies working on AI developments. This means that in the near future, we could see even more practical uses for AI, from smart robots to smart drones and more.

To give you a better understanding of the current state of AI, our friends have put together this helpful Artificial Intelligence infographic. It will give you the full rundown, from categories to geography to finances. Check it out, and you’ll see why AI is so essential to our everyday lives, and why the future of AI looks so bright.

Originally posted on Data Science Central

This article was posted by Bethany Cartwright. Bethany is the blog team's Data Visualization Intern. She spends most of her time creating infographics and other visuals for blog posts.

Whether you’re writing a blog post, putting together a presentation, or working on a full-length report, using data in your content marketing strategy is a must. Using data helps enhance your arguments by making your writing more compelling. It gives your readers context. And it helps provide support for your claims.

That being said, if you’re not a data scientist yourself, it can be difficult to know where to look for data and how to best present that data once you’ve got it. To help, below you'll find the tools and list of resources you need to source credible data and create some stunning visualizations. 

Resources for Uncovering Credible Data

When looking for data, it’s important to find numbers that not only look good, but are also credible and reliable.

The following resources will point you in the direction of some credible sources to get you started, but don’t forget to fact-check everything you come across. Always ask yourself: Is this data original, reliable, current, and comprehensive?

Tools for Creating Data Visualizations

Now that you know where to find credible data, it’s time to start thinking about how you’re going to display that data in a way that works for your audience.

At its core, data visualization is the process of turning basic facts and figures into a digestible image, whether it’s a chart, graph, timeline, map, infographic, or other type of visual.

While understanding the theory behind data visualization is one thing, you also need the tools and resources to make digital data visualization possible. Below we’ve collected 10 powerful tools for you to browse, bookmark, or download to make designing data visuals even easier for your business.


Guest blog post by Mike Waldron

Originally posted on Data Science Central

This blog was originally published on the AYLIEN Text Analysis blog

We wanted to gather and analyze news content in order to look for similarities and differences in the way two journalists write headlines for their respective news articles and blog posts. The two reporters we selected operate in, and write about, two very different industries/topics and have two very different writing styles:

  • Finance: Akin Oyedele of Business Insider, who covers market updates.
  • Celebrity: Carly Ledbetter of the Huffington Post, who mainly writes about celebrities.

Note: For a more technical, in-depth and interactive representation of this project, check out the Jupyter notebook we created. It includes sample code and more in-depth descriptions of our approach.

The Approach

  1. Collect news headlines from both of our journalists
  2. Create parse trees from collected headlines (we explain parse trees below!)
  3. Extract information from each parse tree that is indicative of the overall headline structure
  4. Define a simple sequence similarity metric to quantitatively compare any pair of headlines
  5. Apply the same metric to all headlines collected for each author to find similarity
  6. Use K-Means and tSNE to produce a visual map of all the headlines so we can clearly see the differences between our two journalists

Creating Parse Trees

In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence according to some pre-defined grammar. For a simple sentence like “The cat sat on the mat”, a parse tree might look like this:

Thankfully parsing our extracted headlines isn’t too difficult. We used the Pattern Library for Python to parse the headlines and generate our parse trees.


In total we gathered about 700 article headlines for each journalist using the AYLIEN News API, which we then analyzed using Python. If you’d like to give it a go yourself, you can grab the pickled data files directly from the GitHub repository, or use the data collection notebook we prepared for this project.

First we loaded all the headlines for Akin Oyedele, then we created parse trees for all 700 of them, and finally we stored them together with some basic information about the headline in the same Python object.

Then using a sequence similarity metric, we compared all of these headlines two by two, to build a similarity matrix.
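The pairwise comparison can be sketched in Python as follows. This is a minimal illustration rather than the notebook's code: the post does not say which sequence similarity metric was used, so the sketch assumes SequenceMatcher from Python's standard difflib module as a stand-in, and the chunk-type sequences are hypothetical.

```python
from difflib import SequenceMatcher


def sequence_similarity(a, b):
    """Return a similarity score in [0, 1] for two sequences of chunk types."""
    return SequenceMatcher(None, a, b).ratio()


def similarity_matrix(sequences):
    """Compare every pair of sequences and return a symmetric matrix."""
    n = len(sequences)
    matrix = [[1.0] * n for _ in range(n)]  # a sequence is identical to itself
    for i in range(n):
        for j in range(i + 1, n):
            score = sequence_similarity(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Hypothetical chunk-type sequences standing in for parsed headlines.
chunk_sequences = [
    ["NP", "VP"],
    ["NP", "VP", "ADJP"],
    ["ADVP", "VP", "NP"],
]
matrix = similarity_matrix(chunk_sequences)
```

With about 700 headlines the matrix has 700 rows and 700 columns, which is why a dimensionality reduction step is needed before plotting.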


To visualize headline similarities for Akin, we generated a 2D scatter plot in the hope that similarly structured headlines would be grouped close together.

To achieve this, we first reduced the dimensionality of our similarity matrix using tSNE and applied K-Means clustering to find groups of similar headlines. We also used a visualization library for the chart. The steps are outlined below:

  • tSNE to reduce the dimensionality of our similarity matrix from 700 down to 2
  • K-Means to identify 5 clusters of similar headlines and add some color
  • Plotted the actual chart using Bokeh
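To make the clustering step concrete, here is a small pure-Python sketch of K-Means (Lloyd's algorithm) on 2D points. It is an illustration under stated assumptions, not the code from the notebook: the tSNE step is not reproduced, the points are hypothetical stand-ins for tSNE output, and it uses 2 clusters instead of 5 to keep the example short.

```python
import random


def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=50, seed=0):
    """Cluster 2D points; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2D coordinates, as if produced by the dimensionality reduction.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

The resulting labels can then be mapped to colors when plotting each headline as a dot.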

The chart shows a number of dense groups of headlines, as well as some sparse ones. Each dot represents a headline, as you can see when you hover over one in the interactive version. Similar titles are grouped together quite cleanly. Some of the groups that stand out are:

  • The circular group left of center typically consists of short, snappy stock update headlines such as “Viacom is crashing”.
  • The large circular group on the top right consists mostly of announcement-style headlines such as the “Here come the…” format.
  • The small green circular group towards the bottom left contains headlines that reuse the same phrases, such as “Industrial production falls more than expected” or “ADP private payrolls rise more than expected”.

Comparing the two authors

By repeating the process for our second journalist, Carly Ledbetter, we were then able to compare both authors and see how many common patterns exist between the two in terms of how they write their headlines.

We observed that roughly 50% (347/700) of the headlines had a similar structure.

Here we can see the same dense and sparse patterns, as well as groups of points that are somewhat unique to each author, or shared by both authors. The yellow dots represent our celebrity-focused author and the blue dots our finance author.

  • The bottom right cluster is almost exclusive to the first author, as it covers the short financial/stock report headlines such as “Here comes CPI”, but it also covers some of the headlines from the second author, such as “There’s Another Leonardo DiCaprio Doppelgänger”. The same could be said about the top middle cluster.
  • The top right cluster mostly contains single-verb headlines about celebrities doing things, such as “Kylie Jenner Graces Coachella With Her Peachy Presence” or “Kate Hudson Celebrated Her Birthday With A Few Shirtless Men” but it also includes market report headlines from the first author such as “Oil rig count plunges for 7th straight week”.

Conclusion and future work

In this project we’ve shown how you can retrieve and analyze news headlines, evaluate their structure and similarity, and visualize the results on an interactive map.

While we were quite happy with the results and found them interesting, there were some areas that we thought could be improved. Some of the weaknesses of our approach could be addressed by:

 - Using entire parse trees instead of just the chunk types

 - Using a tree or graph similarity metric instead of a sequence similarity one (ideally a linguistically aware one too)

 - Better pre-processing to identify and normalize Named Entities, etc.

Next up

In our next post, we’re going to study the correlations between various headline structures and some external metrics like number of Shares and Likes on Social Media platforms, and see if we can uncover any interesting patterns. We can hazard a guess already that the short, snappy Celebrity style headlines would probably get the most shares and reach on social media, but there’s only one way to find out.

If you’d like to access the data used or see the sample code, head over to our Jupyter notebook.

Guest blog post by Jeff Pettiross

For almost as long as we have been writing, we’ve been putting meaning into maps, charts, and graphs. Some 1,300 years ago, Chinese astronomers recorded the position of the stars and the shapes of the constellations. The Dunhuang star maps are the oldest preserved atlas of the sky:

More than 500 years ago, the residents of the Marshall Islands learned to navigate the surrounding waters by canoe in the daytime—without the aid of stars. These master rowers learned to recognize the feel of the currents reflecting off the nearby islands. They visualized their insights on maps made of sticks, rocks, and shells.

In the 1800s, Florence Nightingale used charts to explain to government officials how treatable diseases were killing more soldiers in the Crimean War than battle wounds. She knew that pictures would tell a more powerful story than numbers alone:

Why Visualized Data Is So Powerful

Since long before spreadsheets and graphing software, we have communicated data through pictures. But we’ve only begun, in the last half-century, to understand why visualizations are such effective tools for seeing and understanding data.

It starts with the part of your brain called the visual cortex. Located near the bony lump at the back of your skull, it processes input from your eyes. Thanks to the visual cortex, our sense of sight provides information much faster than the other senses. We actually begin to process what we see before we think about it.

This is sound from an evolutionary perspective. The early human who had to stop and think, “Hmm, is that a jaguar sprinting toward me?” probably didn’t survive to pass on their genes. There is a biological imperative for our sense of sight to override cognition—in this case, for us to pay sharp attention to movement in our peripheral vision.

Today, our sight is more likely to save us on a busy street than on the savannah. Moving cars and blinking lights activate the same peripheral attention, helping us navigate a complicated visual environment. We see other cues on the street, too. Bright orange traffic cones mark hazards. Signs call out places, directions, and warnings. Vertical stripes on the street indicate lanes while horizontal lines indicate stop lines.

We have designed a rich, visual system that drivers can comprehend quickly, thanks to perceptual psychology. Our visual cortex is attuned to color hues (like safety orange), position (signs placed above road), and line orientation (lanes versus stop lines). Research has identified other visual features. Size, clustering, and shape also help us perceive our environment almost immediately.

What This Means for Us Today

Fortunately, our offices and homes tend to be safer than the savannah or the highway. Regardless, our lightning-quick sense of vision jumps into action even when we read email, tweets, or websites. And that right there is why data visualization communicates so powerfully and immediately: It takes advantage of these visual features, too.

A line graph immediately reveals upward or downward changes, thanks to the orientation of each segment. The axes of the graph use position to communicate values in relationship to each other. If there are multiple, colored lines, the color hue lets us rapidly tell the lines apart, no matter how many times they cross. Bar charts, maps with symbols, area graphs—these all use the visual superhighway in our brains to communicate meaning.

The early pioneers of data visualization were led by their intuition to use visual features like position, clustering, and hue. The longevity of those works is a testament to their power.

We now have software to help us visualize data and to turn tables of facts and figures into meaningful insights. That means anyone, even non-experts, can explore data in a way that wasn’t possible even 20 years ago. We can, all of us, analyze the world’s growing volume of data, spot trends and outliers, and make data-driven decisions.

Today, we don’t just have charts and graphs; we have the science behind them. We have started to unlock the principles of perception and cognition so we can apply them in new ways and in various combinations. A scatter plot can leverage position, hue, and size to visualize data. Its data points can interactively filter related charts, allowing the user to shift perspectives in their analysis by simply clicking on a point. Animating transitions as users pivot from one idea to the next brings previously hidden differences to the foreground. We’re building on the intuition of the pioneers and the conclusions of science to make analysis faster, easier, and more intuitive.

When humanity unlocked the science behind fire and magnets, we learned to harness chemistry and physics in new ways. And we revolutionized the world with steam engines and electrical generators.

Humanity is now at the dawn of a new revolution, and intuitive tools are putting the beautiful science of data visualization into the hands of millions of users.

I’m excited to see where you take all of us next.

Note: This post first appeared in VentureBeat.

Investigating Airport Connectedness

Guest blog post by SupStat

Contributed by Sricharan Maddineni, a neuroscientist with a strong passion and talent for data science. He attended the NYC Data Science Academy 12-week boot camp from January 11th to April 1st, 2016. This post is based on his second project, posted on February 16th (due in the fourth week of the program). He acquired publicly available transportation data, sought advice through social media, and visualized the economic and business insights he found in the data.

Why Are Airports Important?

Aviation infrastructure has been a bedrock of the United States economy and culture for many decades, and it was the first instrument through which we connected with the world. Before the invention of flight, humans were inexorably confined by the immenseness of Earth's oceans.

All the disdain and unpleasantries we endure on flights are quickly forgotten once we safely land at our destinations and realize we have just been transported to a new place on our vast planet. Every time I have flown and landed in a new country or city, I am overwhelmed with feelings of how beautiful our world is and how much I wish I could visit every corner of our planet. My love of aviation has led me to investigate the connectedness of United States airports and the passenger-disparity between the developed and developing countries.

The App

The interactive map can be used as a tool to investigate the connectedness of the US airports. Users can choose from a list of airports including LAX, JFK, IAD and more to visualize the connections out of that airport. The 'Airport Connections' table shows us the combinations of connections by Airline Carrier. For example, we can see that American Airlines (AA) had 8058 flights out of LAX to JFK (2009 dataset). The 'Carriers' table shows us the total flights out of LAX by American Airlines (76,670).

If we select Hartsfield-Jackson Atlanta International, we see that it is the most connected airport in the United States. *Please note that I am not plotting all possible connections, just major airport connections and only within the United States (the map would be filled solid if I plotted all connections!). The size of the airport bubble is calculated by the number of connections. Therefore, all large bubbles are international airports, and smaller bubbles are regional/domestic airports.

I also plotted Voronoi tessellations between the airports using one nearest neighbor to show the area differences between airports on the East Coast, on the West Coast, and in the Midwest. The largest polygons are found in the Midwest because airports are far apart in all directions. These airports are generally more connected as well, since they connect the east and west coasts (see Denver International or Salt Lake City International). Clicking on a Voronoi polygon brings up the nearest airport within that area.
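A Voronoi polygon is, by definition, the set of locations that are closer to its airport than to any other, so looking up the polygon for a clicked location amounts to a nearest-neighbor search. The Python sketch below illustrates that idea only; it is not the app's code, the function names are hypothetical, and the airport coordinates are approximate values included for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def nearest_airport(lat, lon, airports):
    """Return the code of the airport whose Voronoi cell contains (lat, lon)."""
    return min(airports, key=lambda code: haversine_km(lat, lon, *airports[code]))


# Approximate coordinates (latitude, longitude), for illustration only.
airports = {
    "DEN": (39.86, -104.67),
    "SLC": (40.79, -111.98),
    "ATL": (33.64, -84.43),
}

print(nearest_airport(39.74, -104.99, airports))  # a location in downtown Denver
```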

Why is it important for countries to improve their airport infrastructure?

Looking at the Motion/Bubble Chart, we observe that developing countries move horizontally whereas developed countries move vertically. This indicates that developed countries' populations have remained steady while their air passenger numbers have risen. On the flip side, developing countries have seen their populations boom, but the number of air travelers has remained stagnant.

Most importantly, countries moving upward show noticeable gains in GDP whereas countries moving horizontally show minimal gains over the last four decades (GDP is represented by the size of the bubble). We can also notice that airline passenger counts plunge during recessions for first world countries but remain comparatively steady for developing countries (1980, 2000, 2009). We can interpret this to mean that developing countries are not as connected to the rest of the world since their economies are unaffected by global economic crises.

Passenger Counts during weekends and Holidays

The calendar heatmap shows us the Daily flight count in the United States. We can recognize that airlines operate significantly fewer flights on Saturdays and National Holidays such as July 4th and Thanksgiving. The days leading up to and after National Holidays show an increase in flights as expected. Looking carefully, you can also notice there are fewer flights on Tuesdays and Wednesdays, and there are more flights during the summer season.

If you select a day on the calendar, a table shows us the top 20 airline carrier flight counts on that day. Southwest, American Airlines, SkyWest, and Delta seem to operate the most flights in the United States.


The Data

1. Interactive Map

I utilized comprehensive datasets provided by the United States Department of Transportation and Open Data by Socrata that allowed me to map airport connections in the United States. The first airport dataset included airport locations (city/state) and their latitude and longitude degrees, and the second dataset included the airport connections (LAX - JFK, LAX-SFO, ...). First, I used these datasets to calculate the size of the airport based on how many connections each had.

2. Motion Chart

The second analysis was done using the airline passenger, population, and GDP numbers for the world's countries over the last 45 years. Most of the work here was in transforming the three datasets provided by the World Bank from wide to long. See the code below.
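The author's own reshaping code is not reproduced on this page. As a rough illustration of what a wide-to-long transformation does, here is a minimal Python sketch. It assumes one row per country with one column per year, and all names and values in it are made-up placeholders, not World Bank figures.

```python
def wide_to_long(rows, id_column, value_name):
    """Turn one-column-per-year rows into one row per (id, year) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column or value is None:
                continue  # skip the identifier itself and missing years
            long_rows.append({
                id_column: row[id_column],
                "year": int(column),
                value_name: value,
            })
    return long_rows


# Placeholder data in wide format: one column per year.
wide = [
    {"country": "Country A", "1970": 1.0, "1971": None, "1972": 1.5},
    {"country": "Country B", "1970": 4.0, "1971": 4.2, "1972": 4.4},
]

long_format = wide_to_long(wide, id_column="country", value_name="passengers")
```

Once the passenger, population, and GDP tables are all in long format, they can be joined on country and year to feed a motion chart.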

3. Calendar Chart

Lastly, I used the Transtats database to obtain the daily flight counts by Airline Carrier for the years 2004-2007. Some transformation was done to create two separate data frames - flight counts per day and flight counts per carrier. While trying to calculate flight counts by day, I tried this code:

f2007_2 <- f2007 %>% group_by(UniqueCarrier, month) %>% summarise(sum = n()) 

I knew there was an error by looking at the resulting heatmap, but I didn't realize this was showing me a cumulative sum by month rather than the daily flight count, so I turned to Twitter to see if I could get help diagnosing my problem. I tweeted Jeff Weis, who appeared as the Aviation Analyst on CNN during the Malaysia Airlines MH370 disappearance, and he caught my mistake! After he had pushed me in the right direction, I corrected my code to:

f2007_2 <- f2007 %>% group_by(UniqueCarrier, date) %>% summarise(count = n())
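The difference between the two groupings is easy to see on a toy example. The Python sketch below is only an illustration of the same idea, not a translation of the author's pipeline, and the flight records in it are hypothetical.

```python
from collections import Counter

# Hypothetical flight records: (carrier code, date as YYYY-MM-DD).
flights = [
    ("WN", "2007-07-03"),
    ("WN", "2007-07-03"),
    ("WN", "2007-07-04"),
    ("AA", "2007-07-03"),
    ("AA", "2007-07-04"),
]

# Grouping by carrier and month lumps every day of the month together.
per_carrier_month = Counter((carrier, date[:7]) for carrier, date in flights)

# Grouping by carrier and date gives the per-carrier daily counts.
per_carrier_day = Counter((carrier, date) for carrier, date in flights)

# Dropping the carrier gives the total flights per day for the calendar.
per_day = Counter(date for _, date in flights)

print(per_carrier_month[("WN", "2007-07")])    # 3
print(per_carrier_day[("WN", "2007-07-03")])   # 2
print(per_day["2007-07-04"])                   # 2
```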

The Code

Creating Voronoi Polygons

Connection Lines

The second step was creating the line connections between the airports. To do this, I used the polylines function in Leaflet to add connecting lines between airports filtered by user input. input$Input1 catches the user selected airport and subsets the dataset by all origin airports that equal the selected airport. The gcIntermediate function makes those lines curved.
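The curved lines come from sampling intermediate points along the great circle between two airports. The Python sketch below shows the standard spherical interpolation formula for that; it illustrates the idea only and is not the gcIntermediate implementation, and the function name and parameters are hypothetical.

```python
import math


def great_circle_points(lat1, lon1, lat2, lon2, n=10):
    """Return n + 2 (lat, lon) points along the great circle, endpoints included.

    Assumes a spherical Earth; antipodal endpoints are not handled.
    """
    phi1, lam1, phi2, lam2 = map(math.radians, (lat1, lon1, lat2, lon2))
    # Angular distance between the endpoints (haversine formula).
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    d = 2 * math.asin(min(1.0, math.sqrt(a)))
    if d == 0:
        return [(lat1, lon1)] * (n + 2)  # both endpoints coincide
    points = []
    for i in range(n + 2):
        f = i / (n + 1)  # fraction of the way from start to end
        w1 = math.sin((1 - f) * d) / math.sin(d)
        w2 = math.sin(f * d) / math.sin(d)
        x = w1 * math.cos(phi1) * math.cos(lam1) + w2 * math.cos(phi2) * math.cos(lam2)
        y = w1 * math.cos(phi1) * math.sin(lam1) + w2 * math.cos(phi2) * math.sin(lam2)
        z = w1 * math.sin(phi1) + w2 * math.sin(phi2)
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        points.append((lat, lon))
    return points


# One intermediate point between two locations on the equator.
path = great_circle_points(0.0, 0.0, 0.0, 90.0, n=1)
```

Drawing a polyline through these points on a flat map projection is what produces the arc.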

Calendar json capture

The calendar chart required two parameters: datevar, which reads the date column, and numvar, which supplies the value plotted for each day on the calendar. Then I utilized a gvis.listener.jscode method to capture the user-selected date and filter the dataset for the table.

The interactive Shiny App was built by Sricharan Maddineni.

Guest blog post by Irina Papuc

Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.

The supply of able ML designers has yet to catch up to this demand. A major reason for this is that ML is just plain tricky. This tutorial introduces the basics of Machine Learning theory, laying down the common themes and concepts, making it easy to follow the logic and get comfortable with the topic.

What is Machine Learning?

So what exactly is “machine learning” anyway? ML is actually a lot of things. The field is quite vast and is expanding rapidly, being continually partitioned and sub-partitioned ad nauseam into different sub-specialties and types of machine learning.

There are some basic common threads, however, and the overarching theme is best summed up by this oft-quoted statement made by Arthur Samuel way back in 1959: “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.”

And more recently, in 1997, Tom Mitchell of Carnegie Mellon University gave a “well-posed” definition that has proven more useful to engineering types: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E) and, if it has successfully “learned”, it will then do better at predicting future traffic patterns (performance measure P).

The highly complex nature of many real-world problems, though, often means that inventing specialized algorithms that will solve them perfectly every time is impractical, if not impossible. Examples of machine learning problems include, “Is this cancer?”, “What is the market value of this house?”, “Which of these people are good friends with each other?”, “Will this rocket engine explode on take off?”, “Will this person like this movie?”, “Who is this?”, “What did you say?”, and “How do you fly this thing?”. All of these problems are excellent targets for an ML project, and in fact ML has been applied to each of them with great success.

Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning:

  • Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data.

  • Unsupervised machine learning: The program is given a bunch of data and must find patterns and relationships therein.

We will primarily focus on supervised learning here, but the end of the article includes a brief discussion of unsupervised learning with some links for those who are interested in pursuing the topic further.

Supervised Machine Learning

In the majority of supervised learning applications, the ultimate goal is to develop a finely tuned predictor function h(x) (sometimes called the “hypothesis”). “Learning” consists of using sophisticated mathematical algorithms to optimize this function so that, given input data x about a certain domain (say, square footage of a house), it will accurately predict some interesting value h(x) (say, market price for said house).

In practice, x almost always represents multiple data points. So, for example, a housing price predictor might take not only square-footage (x1) but also number of bedrooms (x2), number of bathrooms (x3), number of floors (x4), year built (x5), zip code (x6), and so forth. Determining which inputs to use is an important part of ML design. However, for the sake of explanation, it is easiest to assume a single input value is used.

So let’s say our simple predictor has this form:

h(x) = θ0 + θ1 x

where θ0 and θ1 are constants. Our goal is to find the perfect values of θ0 and θ1 to make our predictor work as well as possible.

Optimizing the predictor h(x) is done using training examples. For each training example, we have an input value x_train, for which a corresponding output, y, is known in advance. For each example, we find the difference between the known, correct value y, and our predicted value h(x_train). With enough training examples, these differences give us a useful way to measure the “wrongness” of h(x). We can then tweak h(x) by tweaking the values of θ0 and θ1 to make it “less wrong”. This process is repeated over and over until the system has converged on the best values for θ0 and θ1. In this way, the predictor becomes trained, and is ready to do some real-world predicting.

A Simple Machine Learning Example

We stick to simple problems in this post for the sake of illustration, but the reason ML exists is because, in the real world, the problems are much more complex. On this flat screen we can draw you a picture of, at most, a three-dimensional data set, but ML problems commonly deal with data with millions of dimensions, and very complex predictor functions. ML solves problems that cannot be solved by numerical means alone.

With that in mind, let’s look at a simple example. Say we have the following training data, wherein company employees have rated their satisfaction on a scale of 1 to 100:

First, notice that the data is a little noisy. That is, while we can see that there is a pattern to it (i.e. employee satisfaction tends to go up as salary goes up), it does not all fit neatly on a straight line. This will always be the case with real-world data (and we absolutely want to train our machine using real-world data!). So then how can we train a machine to perfectly predict an employee’s level of satisfaction? The answer, of course, is that we can’t. The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

It is somewhat reminiscent of the famous statement by British mathematician and professor of statistics George E. P. Box that “all models are wrong, but some are useful”.

ML builds heavily on statistics. For example, when we train our machine to learn, we have to give it a statistically significant random sample as training data. If the training set is not random, we run the risk of the machine learning patterns that aren’t actually there. And if the training set is too small (see law of large numbers), we won’t learn enough and may even reach inaccurate conclusions. For example, attempting to predict company-wide satisfaction patterns based on data from upper management alone would likely be error-prone.

With this understanding, let’s give our machine the data we’ve been given above and have it learn it. First we have to initialize our predictor h(x) with some reasonable values of θ0 and θ1. Now our predictor looks like this when placed over our training set:

If we ask this predictor for the satisfaction of an employee making $60k, it would predict a rating of 27:

It’s obvious that this was a terrible guess and that this machine doesn’t know very much.

So now, let’s give this predictor all the salaries from our training set, and take the differences between the resulting predicted satisfaction ratings and the actual satisfaction ratings of the corresponding employees. If we perform a little mathematical wizardry (which I will describe shortly), we can calculate, with very high certainty, that values of 13.12 for θ0 and 0.61 for θ1 are going to give us a better predictor.

And if we repeat this process, say 1500 times, our predictor will end up looking like this:

At this point, if we repeat the process, we will find that θ0 and θ1 won’t change by any appreciable amount anymore and thus we see that the system has converged. If we haven’t made any mistakes, this means we’ve found the optimal predictor. Accordingly, if we now ask the machine again for the satisfaction rating of the employee who makes $60k, it will predict a rating of roughly 60.

Now we’re getting somewhere.

A Note on Complexity

The above example is technically a simple problem of univariate linear regression, which in reality can be solved by deriving a simple normal equation and skipping this “tuning” process altogether. However, consider a predictor that looks like this:

This function takes input in four dimensions and has a variety of polynomial terms. Deriving a normal equation for this function is a significant challenge. Many modern machine learning problems take thousands or even millions of dimensions of data to build predictions using hundreds of coefficients. Predicting how an organism’s genome will be expressed, or what the climate will be like in fifty years, are examples of such complex problems.

Fortunately, the iterative approach taken by ML systems is much more resilient in the face of such complexity. Instead of using brute force, a machine learning system “feels its way” to the answer. For big problems, this works much better. While this doesn’t mean that ML can solve all arbitrarily complex problems (it can’t), it does make for an incredibly flexible and powerful tool.
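For contrast, here is what the closed-form route looks like in the univariate case mentioned at the start of this section. The Python sketch below uses the standard ordinary least squares solution for a straight line, θ1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and θ0 = ȳ − θ1·x̄. The function name and the sample data are assumptions made for illustration, not part of the original tutorial.

```python
def fit_line(xs, ys):
    """Closed-form least squares fit of h(x) = theta0 + theta1 * x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx  # requires at least two distinct x values
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


# Made-up points lying exactly on y = 2 + 3x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 8.0, 11.0, 14.0]
theta0, theta1 = fit_line(xs, ys)
```

No iteration is needed here, but this shortcut depends on the simple form of the predictor, which is the point the section makes about more complex functions.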

Gradient Descent - Minimizing “Wrongness”

Let’s take a closer look at how this iterative process works. In the above example, how do we make sure θ₀ and θ₁ are getting better with each step, and not worse? The answer lies in our “measurement of wrongness” alluded to previously, along with a little calculus.

The wrongness measure is known as the cost function (a.k.a. loss function), J(θ). The input θ represents all of the coefficients we are using in our predictor. So in our case, θ is really the pair θ₀ and θ₁. J(θ₀, θ₁) gives us a mathematical measurement of how wrong our predictor is when it uses the given values of θ₀ and θ₁.

The choice of the cost function is another important piece of an ML program. In different contexts, being “wrong” can mean very different things. In our employee satisfaction example, the well-established standard is the linear least squares function:

With least squares, the penalty for a bad guess goes up quadratically with the difference between the guess and the correct answer, so it acts as a very “strict” measurement of wrongness. The cost function computes an average penalty over all of the training examples.
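For reference, the least squares cost is usually written in the form below. This is the standard textbook form, not a formula taken from this post. The symbol m (the number of training examples) and the superscript (i) that indexes the examples are notation assumed here, and the factor of 1/2 is a common convention that tidies up the derivative.

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h\big(x^{(i)}\big) - y^{(i)} \right)^2,
\qquad h(x) = \theta_0 + \theta_1 x
```

Squaring each difference is what makes the penalty grow quadratically, and dividing by m is what makes it an average over the training set.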

So now we see that our goal is to find θ₀ and θ₁ for our predictor h(x) such that our cost function J(θ₀, θ₁) is as small as possible. We call on the power of calculus to accomplish this.

Consider the following plot of a cost function for some particular ML problem:

Here we can see the cost associated with different values of θ₀ and θ₁. We can see the graph has a slight bowl to its shape. The bottom of the bowl represents the lowest cost our predictor can give us based on the given training data. The goal is to “roll down the hill”, and find θ₀ and θ₁ corresponding to this point.

This is where calculus comes into this machine learning tutorial. For the sake of keeping this explanation manageable, I won’t write out the equations here, but essentially what we do is take the gradient of J(θ₀, θ₁), which is the pair of derivatives of J(θ₀, θ₁) (one over θ₀ and one over θ₁). The gradient will be different for every different value of θ₀ and θ₁, and tells us what the “slope of the hill” is and, in particular, “which way is down”, for these particular θs. For example, when we plug our current values of θ into the gradient, it may tell us that adding a little to θ₀ and subtracting a little from θ₁ will take us in the direction of the cost function-valley floor. Therefore, we add a little to θ₀, and subtract a little from θ₁, and voilà! We have completed one round of our learning algorithm. Our updated predictor, h(x) = θ₀ + θ₁x, will return better predictions than before. Our machine is now a little bit smarter.

This process of alternating between calculating the current gradient, and updating the θs from the results, is known as gradient descent.
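To make the loop concrete, here is a minimal sketch of gradient descent for a two-coefficient predictor h(x) = θ₀ + θ₁x with a least squares cost. The function names, the learning rate, the step count and the tiny data set are all assumptions made for illustration; they are not the salary data or the settings used above.

```python
def predict(theta0, theta1, x):
    """Linear predictor h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Least squares cost: average squared error, halved by convention."""
    m = len(xs)
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(theta0, theta1, xs, ys):
    """Partial derivatives of the cost with respect to theta0 and theta1."""
    m = len(xs)
    errors = [predict(theta0, theta1, x) - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m
    return d_theta0, d_theta1


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Alternate between computing the gradient and stepping 'downhill'."""
    theta0, theta1 = 0.0, 0.0  # arbitrary starting guess
    for _ in range(steps):
        d_theta0, d_theta1 = gradient(theta0, theta1, xs, ys)
        theta0 -= learning_rate * d_theta0
        theta1 -= learning_rate * d_theta1
    return theta0, theta1


# Made-up training data that lies exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # should end up very close to 1 and 2
```

The learning rate controls how big each “little” adjustment is: too large and the steps overshoot the bottom of the bowl, too small and convergence takes many more rounds.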

That covers the basic theory underlying the majority of supervised Machine Learning systems. But the basic concepts can be applied in a variety of different ways, depending on the problem at hand.

Classification Problems

Under supervised ML, two major subcategories are:

  • Regression machine learning systems: Systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of “How much?” or “How many?”.

  • Classification machine learning systems: Systems where we seek a yes-or-no prediction, such as “Is this tumor cancerous?”, “Does this cookie meet our quality standards?”, and so on.

As it turns out, the underlying theory is more or less the same. The major differences are the design of the predictor h(x) and the design of the cost function J(θ).

Our examples so far have focused on regression problems, so let’s now also take a look at a classification example.

Here are the results of a cookie quality testing study, where the training examples have all been labeled as either “good cookie” (y = 1) in blue or “bad cookie” (y = 0) in red.

In classification, a regression predictor is not very useful. What we usually want is a predictor that makes a guess somewhere between 0 and 1. In a cookie quality classifier, a prediction of 1 would represent a very confident guess that the cookie is perfect and utterly mouthwatering. A prediction of 0 represents high confidence that the cookie is an embarrassment to the cookie industry. Values falling within this range represent less confidence, so we might design our system such that a prediction of 0.6 means “Man, that’s a tough call, but I’m gonna go with yes, you can sell that cookie,” while a value exactly in the middle, at 0.5, might represent complete uncertainty. This isn’t always how confidence is distributed in a classifier, but it’s a very common design and works for the purposes of our illustration.

It turns out there’s a nice function that captures this behavior well. It’s called the sigmoid function, g(z), and it looks something like this:

z is some representation of our inputs and coefficients, such as:

so that our predictor becomes:

Notice that the sigmoid function transforms our output into the range between 0 and 1.
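As a small sketch, the standard logistic form of the sigmoid is g(z) = 1 / (1 + e^(-z)). The code below assumes that form, and it assumes a single-input z = θ₀ + θ₁x purely for illustration; a real cookie classifier would likely combine several inputs.

```python
import math


def sigmoid(z):
    """Standard logistic function: squashes any real z into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)


def predict(theta0, theta1, x):
    """Classification predictor h(x) = g(z), with an assumed z = theta0 + theta1 * x."""
    z = theta0 + theta1 * x
    return sigmoid(z)


print(sigmoid(0.0))  # exactly 0.5: the "complete uncertainty" midpoint
```

Large positive values of z push the prediction toward 1, and large negative values push it toward 0.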

The logic behind the design of the cost function is also different in classification. Again we ask “what does it mean for a guess to be wrong?” and this time a very good rule of thumb is that if the correct guess was 0 and we guessed 1, then we were completely and utterly wrong, and vice-versa. Since you can’t be more wrong than absolutely wrong, the penalty in this case is enormous. Alternatively, if the correct guess was 0 and we guessed 0, our cost function should not add any cost. If the guess was right, but we weren’t completely confident (e.g. y = 1, but h(x) = 0.8), this should come with a small cost, and if our guess was wrong but we weren’t completely confident (e.g. y = 1 but h(x) = 0.3), this should come with some significant cost, but not as much as if we were completely wrong.

This behavior is captured by the log function, such that:

Again, the cost function gives us the average cost over all of our training examples.
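One common way to get this behavior is the logistic (cross-entropy) cost, which charges -log(h(x)) when y = 1 and -log(1 - h(x)) when y = 0. The sketch below assumes that form; the function names are made up for illustration.

```python
import math


def example_cost(h, y):
    """Penalty for one prediction h (strictly between 0 and 1) given the true label y (0 or 1)."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)


def average_cost(predictions, labels):
    """Average penalty over all training examples."""
    return sum(example_cost(h, y) for h, y in zip(predictions, labels)) / len(labels)


right_but_unsure = example_cost(0.8, 1)   # small cost
wrong_but_unsure = example_cost(0.3, 1)   # significant cost
almost_all_wrong = example_cost(0.001, 1)  # very large cost

print(right_but_unsure, wrong_but_unsure, almost_all_wrong)
```

The closer a wrong guess gets to full confidence, the faster the logarithm blows up, which matches the “enormous penalty” described above.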

So here we’ve described how the predictor h(x) and the cost function differ between regression and classification, but gradient descent still works fine.

A classification predictor can be visualized by drawing the boundary line; i.e., the barrier where the prediction changes from a “yes” (a prediction greater than 0.5) to a “no” (a prediction less than 0.5). With a well-designed system, our cookie data can generate a classification boundary that looks like this:

Now that’s a machine that knows a thing or two about cookies!

An Introduction to Neural Networks

No discussion of ML would be complete without at least mentioning neural networks. Not only do neural nets offer an extremely powerful tool to solve very tough problems, but they also offer fascinating hints at the workings of our own brains, and intriguing possibilities for one day creating truly intelligent machines.

Neural networks are well suited to machine learning problems where the number of inputs is gigantic. The computational cost of handling such a problem is just too overwhelming for the types of systems we’ve discussed above. As it turns out, however, neural networks can be effectively tuned using techniques that are strikingly similar to gradient descent in principle.

A thorough discussion of neural networks is beyond the scope of this tutorial, but I recommend checking out our previous post on the subject.

Unsupervised Machine Learning

Unsupervised learning typically is tasked with finding relationships within data. There are no training examples used in this process. Instead, the system is given a set of data and tasked with finding patterns and correlations therein. A good example is identifying close-knit groups of friends in social network data.

The algorithms used to do this are very different from those used for supervised learning, and the topic merits its own post. However, for something to chew on in the meantime, take a look at clustering algorithms such as k-means, and also look into dimensionality reduction systems such as principal component analysis. Our prior post on big data discusses a number of these topics in more detail as well.


We’ve covered much of the basic theory underlying the field of Machine Learning here, but of course, we have only barely scratched the surface.

Keep in mind that to really apply the theories contained in this introduction to real-life machine learning examples, a much deeper understanding of the topics discussed herein is necessary. There are many subtleties and pitfalls in ML, and many ways to be led astray by what appears to be a perfectly well-tuned thinking machine. Almost every part of the basic theory can be played with and altered endlessly, and the results are often fascinating. Many grow into whole new fields of study that are better suited to particular problems.

Clearly, Machine Learning is an incredibly powerful tool. In the coming years, it promises to help solve some of our most pressing problems, as well as open up whole new worlds of opportunity. The demand for ML engineers is only going to continue to grow, offering incredible chances to be a part of something big. I hope you will consider getting in on the action!

This article was originally published in Toptal.

Read more…

Guest blog post by Chris Atwood

Recently, I rediscovered a TED Talk by David McCandless, a data journalist, called “The beauty of data visualization.” It’s a great reminder of how charts (though scary to many) can help you tell an actionable story about a topic in a way that bullet points alone usually cannot. If you have not seen the talk, I recommend you take a look for some inspiration about visualizing big ideas.


In any social media report you make for the brass, there are several types of data charts to help summarize the performance of your social media channels; the most common ones are bar charts, pie/donut charts and line graphs. They are tried and true but often overused, and are not always the best way to visualize the data to then inform and justify your strategic decisions. Below are some less common charts to help you tell the story about your social media strategy’s ROI.


For our examples here, we’ll primarily be examining a brand’s Facebook page for different types of analyses on its owned post performance.


Scatter plots

Figure 1: Total engagement vs total reach, colored by post type (Facebook Insights)

What they are: Scatter plots measure two variables against each other to help users determine where a correlation or relationship between those variables might be.


Why they’re useful:  One of the most powerful aspects of a scatter plot is its ability to show nonlinear relationships between variables. They also help users get a sense of the big picture. In the example above, we’re looking for any observable correlations between total engagement (Y axis) and total reach (X axis) that can guide this Facebook page’s strategy. The individual dots are colored by the post type — status update (green), photo (blue) or video (red).


This scatter plot shows that engagement and reach have a direct relationship for photo posts, because the points form a fairly clear, straight line from the bottom left to the upper right. For other types of posts, the relationships are less clear, although it can be noted that video posts have extremely high reach even though engagement is typically low.


Box plots

Figure 2: Total reach benchmark by post type (Facebook Insights)


What they are: Box plots show the statistical distribution of different categories in your data, and let you compare them against one another and establish a benchmark for a certain variable. They are not commonly used because they’re not always pretty, and sometimes can be a bit confusing to read without the right context.


Why they’re useful: Box plots are excellent ways to display key performance indicators. Each category (with more than one post) will show a series of lines and rectangles. The box spans the interquartile range (IQR), from the first quartile to the third, and the whiskers typically extend to the most extreme values that are not flagged as outliers. When you look at all the posts, you can split the values up into groups called quartiles based on the distribution of the values. You can use the median (the value of the second quartile) as a benchmark for “average” performance.
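As a rough sketch, the numbers behind a box plot can be computed directly. The reach values below are made up, and the 1.5 × IQR rule for flagging outliers is one common convention (charting tools may use a different one).

```python
import statistics

# Hypothetical total reach for nine photo posts (made-up values).
reach = [120, 150, 180, 200, 220, 260, 300, 340, 1500]

# Three cut points split the sorted values into four quartile groups.
q1, median, q3 = statistics.quantiles(reach, n=4, method="inclusive")

iqr = q3 - q1                   # height of the box
lower_fence = q1 - 1.5 * iqr    # values below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr    # values above this are flagged as outliers
outliers = [v for v in reach if v < lower_fence or v > upper_fence]

print(q1, median, q3, outliers)
```

Here the median serves as the benchmark for a typical post, and the one post far above the upper fence is the kind of outlier worth a closer look.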


In this example, we’re once again looking at different post types on a brand’s Facebook page, and seeing what the total reach is like for each. For videos (red), you can see that the lower boundary for the reach is higher than the majority of photo posts, and that it doesn’t have any outliers. Photos, however, tell a different story. The first quartile is very short, while the fourth quartile is much longer. Since most of the posts fall above the second quartile, you know that many of these posts are performing above average. The dots above the whisker indicate outliers, i.e., posts whose reach falls well outside the range of the rest of the data. You should take a closer look at outliers to see what you can learn based on what they have in common (seasonality/timing, imagery, topic, audience targeting, or word choices).

Heat maps

Figure 3: Average total engagement per day by post type (Facebook Insights)


What they are: Heat maps are a great way to determine factors like which posts have the highest engagement or impressions, on average, on a given day. Heat maps cross two categorical variables (for example, post type and day of the week) and compare a single quantitative variable (average total reach, average total engagement, etc.) across every combination.


Why they’re useful: The difference in shade shows how the values in each column differ from one another. If the shades are all light, the values do not differ much from category to category; if a column mixes light and dark colors, the values are very different from each other (more interesting!).
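Behind the colors, a heat map is just a grid of averages. Here is a minimal sketch of building that grid from raw posts; the post records and the function name are made up for illustration and are not the Facebook Insights data shown in the figure.

```python
from collections import defaultdict

# Hypothetical posts: (post type, day of week, total engagement).
posts = [
    ("photo", "Mon", 40), ("photo", "Mon", 60), ("photo", "Thu", 50),
    ("video", "Mon", 10), ("video", "Thu", 200), ("video", "Thu", 100),
]


def average_grid(rows):
    """Average the quantitative value for every (category, category) cell."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [running sum, count]
    for post_type, day, engagement in rows:
        cell = totals[(post_type, day)]
        cell[0] += engagement
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


grid = average_grid(posts)
for (post_type, day), value in sorted(grid.items()):
    print(post_type, day, value)
```

Each cell's average is what gets mapped to a shade, so a cell backed by only one or two posts can look dramatic without meaning much.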


You could run a similar analysis to see what times  of day your posts get the highest engagement or reach, and find the answer to the classic question, “When should I post for the highest results?” You can also track competitors this way, to see how their content performs throughout the day or on particular days of the week. You can time your own posts around when you think shared audiences may be paying less attention to competitors, or make a splash during times with the best performance.


In the above example, you can see that three post types from a brand’s Facebook page have been categorized by their average total engagement on a given day of the week. Based on the chart, photos do not differentiate much from day to day. Looking closer at the data from the previous box plot, we know that photo posts are the most common post, and make up a large amount of the data set; we can conclude that the user must be used to seeing those posts so they perform about the same day to day. We also see that video posts either perform far above or far below average, and that it appears the best day to post videos for this brand is typically on Thursdays.

Tree maps

Figure 4: Average total engagement by content pillar and post type (Facebook Insights)


What they are: Tree maps use qualitative information, usually represented as a diagram that grows from a single trunk and ends with many leaves. Tree maps typically have three main components that help you tell what’s going on — the size of each rectangle, the relative color and what the hierarchy is.


Why they’re useful: Tree maps are a fantastic way to get a high-level look at your social data and figure out where you want to dig in for further analysis. In this example, we’re able to compare the average total engagement between different post types, broken out by content pillar.

For our brand’s Facebook page, we have trellised the data by post type (figure 4); in other words, we created a visualization that comprises three smaller visualizations, so we can see how the post type impacts the average total engagement for each content pillar. It answers the question, “Do my videos in category X perform differently than my photos in the same category?” You can also see that the rectangles vary in size from content pillar to content pillar; they  are sized by the number of posts in each subset. Finally, they are colored by the average total engagement for that content pillar’s subset of the post type. The darker the color, the higher the engagement.


We immediately learn that posts in the status trellis aren’t performing anywhere near the other post types (it only has one post), and that photos have the greatest number of content pillars or the greatest variety in topic. You can see from the visualization that you want to spend more of your energy digging into why posts in the Timely, Education and Event categories perform well in both photos and videos.


TL;DR: Better Presentations are made with Better Charts

In your next analysis, you shouldn’t disregard the tried and true bar charts, pie graphs and line charts. However, these four different visualizations may offer a more succinct way to summarize your data and help you explain the performance of your campaigns. They’ll also make your reports and wrapups look distinctive when they’re used correctly. Although there are other chart types that are also useful for making better analyses and presentations, the ones discussed here are fairly simple to put together and nearly all of them can be put together in Microsoft Excel or visualization/analysis software such as TIBCO's Spotfire. 

Read more…

Guest blog post by Divya Parmar

To once again demonstrate the power of MySQL (download), MySQL Workbench (download), and Tableau Desktop (free trial version can be downloaded here), I wanted to walk through another data analysis example. This time, I found a publicly available Medicare dataset and imported it using the Import Wizard as seen below.



Let’s take a look at the data: it has hospital location information, measure name (payment for heart attack patient, pneumonia patient, etc), and payment information.


I decided to look at the difference in lower and higher payment estimates for heart attack patients for each state to get a sense of variance in treatment cost. I created a query and saved it as a view.


One of the convenient features of Tableau Desktop is the ability to connect directly to MySQL, so I used that connection to load my view directly into Tableau.


 I wanted to see how the difference between lower and higher payment estimate varies by state. Using Tableau’s maps and geographic recognition of the state column, I used a few drag-and-drop moves and a color fill to complete the visualization.

You can copy the image itself to use elsewhere, choosing to add labels and legends if necessary. Enjoy. 

About: Divya Parmar is a recent college graduate working in IT consulting. For more posts every week, and to subscribe to his blog, please click here. He can also be found on LinkedIn and Twitter

Read more…

Guest blog post by Ujjwal Karn

I created an R package for exploratory data analysis. You can read about it and install it here.  

The package contains several tools to perform initial exploratory analysis on any input dataset. It includes custom functions for plotting the data as well as for performing univariate, bivariate and multivariate analyses, which are the first step of any predictive modeling pipeline. This package can be used to get a good sense of any dataset before jumping into building predictive models.

The package is constantly under development and more functionalities will be added soon. Pull requests to add more functions are welcome!

The functions currently included in the package are mentioned below:

  • numSummary(mydata) function automatically detects all numeric columns in the dataframe mydata and provides their summary statistics
  • charSummary(mydata) function automatically detects all character columns in the dataframe mydata and provides their summary statistics
  • Plot(mydata, dep.var) plots all independent variables in the dataframe mydata against the dependent variable specified by the dep.var parameter
  • removeSpecial(mydata, vec) replaces all special characters (specified by vector vec) in the dataframe mydata with NA
  • bivariate(mydata, dep.var, indep.var) performs bivariate analysis between dependent variable dep.var and independent variable indep.var in the dataframe mydata

More functions to be added soon. Any feedback on improving this is welcome!

Read more…

Originally posted on Data Science Central

Contributed by Belinda Kanpetch, a current architecture graduate student at Columbia University. With a strong sense of urban design, she is fascinated by urban installation art and eager to find elements that can improve urban space. To gather information systematically and apply it to her work, she took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from April 11th to July 1st, 2016. This post is based on her first class project (due in the 2nd week of the program).

Why Street Trees?

The New York City street tree can sometimes be taken for granted or go unnoticed. Located along paths of travel, they stand steady and patient, quietly going about their business of filtering pollutants out of our air, bringing us oxygen, providing shade during the warmer months, blocking winds during cold seasons, and relieving our sewer systems during heavy rainfall. All of this while beautifying our streets and neighborhoods. Some recent studies have found a link between the presence of street trees and lower stress levels in urban citizens.

So what makes a street tree different from any other tree? Mainly its location. A street tree is defined as any tree that lives within the public right of way; not in a park or on private property. Although they reside in the public right of way (or within the jurisdiction of The Department of Transportation) they are the property of and cared for by the NYC Department of Parks and Recreation.

With the intent to understand the data and explore what the data was telling me I started with some very basic questions:

  • How many street trees are there in Manhattan?
  • How many different species are there?
  • What is the general condition of the street trees?
  • What is the distribution of species by community district?
  • Is there a connection between the median income of a community district and the number of street trees?

The Dataset

The dataset used for this exploratory visualization was downloaded from the NYC Open Data Portal and was collected as part of TreeCount!2015, a street tree census maintained by the NYC Department of Parks and Recreation. The first census was conducted in 1995, and it has been repeated every 10 years by trained volunteers.

This dataset posed some challenges, including missing values in the form of unidentifiable species types. There were 2,285 observations with an unclassifiable species type and 487 observations with unclassifiable community districts. The geographic information (longitude and latitude) was stored as character strings that had to be split into separate variables. Species were given as 4-letter codes without any reference to genus, species, or cultivar, so I had to find another dataset to decipher them.

Visualizing the data

A quick summary of the dataset revealed a total of 51,660 trees in Manhattan, with 91 identifiable species plus one ‘species’ category for missing values.

A bar plot of all 92 species gave an interesting snapshot of the range in total number of trees per species. It was quite obvious that one species had a dominant presence. To get a better understanding of the counts and of which species were common, I broke them down by quartiles and plotted them.

Plotting the first quartile (< 3.75) revealed that there were several species with only one tree in all of Manhattan!

The distribution within the 4th quartile (181.75 < total < 11529) was informative in that it helped to visualize the dominance of two specific species, the Honeylocust and the Ornamental Pear, which make up 23% and 15% of all the trees in Manhattan respectively. Coming in close were Ginkgo trees with 9.47% and London Plane with 7.8%. This quartile also contained the missing species group ‘0’.

A palette of the top 4 species in Manhattan.

Looking at trees by Community District

I wanted to look at community districts as opposed to zip codes because, in my opinion, community districts are more representative of community cohesiveness and character. So I plotted the distribution by community district and tree condition.

Plotting the species distribution by community board using a facet grid helped visualize other species that were not showing up as dominant in the previous graphs. It would be interesting to look further into what those species are and why they are more dominant within some community districts and not others.

Attempts at mapping

The ultimate goal was to map each individual tree location on a map of Manhattan with the community districts outlined or shaded in. I attempted to plot them on a map using leaflet, bringing in shape files and converting to a data frame, and ggplot but neither yielded anything useful. The only visualization I was able to get was using qplot which took over 2 hours to render.

Read more…

The bagged trees algorithm is a commonly used classification method. By resampling our data and creating trees for the resampled data, we can get an aggregated vote of classification predictions. In this blog post I will demonstrate how bagged trees work by visualizing each step.
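For readers who prefer code to pictures, here is a minimal sketch of the resample-then-vote idea. To keep it short, it uses decision stumps (trees with a single axis-aligned split) in place of full trees, and the data, function names and parameters are all made up for illustration; this is not the code behind the visualizations in this post.

```python
import random
from collections import Counter


def fit_stump(points, labels):
    """Pick the single split (feature, threshold) that misclassifies the fewest points."""
    best = None
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            left = [l for p, l in zip(points, labels) if p[feature] <= threshold]
            right = [l for p, l in zip(points, labels) if p[feature] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, point):
    feature, threshold, left_label, right_label = stump
    return left_label if point[feature] <= threshold else right_label


def fit_bagged(points, labels, n_trees=25, seed=0):
    """Fit one stump per bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(points)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([points[i] for i in idx], [labels[i] for i in idx]))
    return stumps


def predict_bagged(stumps, point):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(predict_stump(s, point) for s in stumps)
    return votes.most_common(1)[0][0]


# Made-up 2-dimensional data: class 0 on the left, class 1 on the right.
points = [(1, 1), (2, 1), (3, 2), (4, 1), (6, 2), (7, 1), (8, 2), (9, 1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

stumps = fit_bagged(points, labels)
print(predict_bagged(stumps, (0, 0)), predict_bagged(stumps, (10, 0)))
```

Because every split is made on a single feature, each stump draws a border perpendicular to one axis, and the vote combines those borders.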

Visualizing Bagged Trees as Approximating Borders, Part 1

Visualizing Bagged Trees, Part 2

Conclusion: Other tree aggregation methods differ in how they grow trees, and they may compute a weighted average. But in the end we can visualize the result of an algorithm as borders between classified sets in the shape of connected perpendicular segments, as in this 2-dimensional case. In higher dimensions these become multidimensional rectangular pieces of hyperplanes which are perpendicular to each other.

Read more…

Contributed by Bin Lin. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Jan 11th to Apr 1st, 2016. This post is based on his second class project (due in the 4th week of the program).


The consumption pattern is an important driver of the development pattern of the industrialized world. The consumption price changes reflect the economic performance and income of households in a country. In this project, the focus is on the food price changes. The goals of the project were:

  • Utilize Shiny for interactive visualization (the Shiny app is hosted at
  • Explore food price changes over time, from 1974 to 2015.
  • Compare food price changes to All-Items price changes (All-items include all consumer goods and services, including food).
  • Compare Consumer Food Price Changes vs. Producer Price Changes (producer price changes are the average change in prices paid to domestic producers for their output).




Consumer Food Price Changes Dataset:

  • Data dimension: 42 rows x 21 columns
  • Missing data: There are 2 missing values in the column of "Eggs"

Producer Price Changes Dataset:

  • Data dimension: 42 rows x 17 columns
  • Missing data: There are 25 missing values in the column of "Processed.fruits.vegetables".

Consumer Food Categories:

  • Data dimension: 20 rows x 2 columns

Data Analysis and Visualization:

Food Consumption Categories:

Food consumption is broken out into 20 categories. Among all of them, the categories with high share based on consumer expenditures are (see Figure 1 and Figure 2):

  • Food.away.from.Home (eat out): 40.9%
  • Other.foods: 10.5% (note this is the sum of the remaining uncategorized foods)
  • Cereals.and.bakery.products: 8.0%
  • Nonalcoholic.beverages: 6.7%
  • Dairy.products: 6.3%
  • Beef.and.veal: 4.1%
  • Fresh.fruits: 4.0%

The high share of nonalcoholic beverages/soft drinks (6.7%) seems concerning, as high consumption of soft drinks might pose a health risk.

Figure 1: Pie Chart on Food Categories Share of Consumer Expenditures

Figure 2: Bar Chart on Food Categories Share of Consumer Expenditures 

Food Price Changes over Time:

The Consumer Price Index (CPI) is a measure that examines average change over time in the prices paid by consumers for goods and services. It is calculated by taking price changes for each item in the predetermined basket of goods and averaging them; the goods are weighted according to their importance. Changes in CPI are used to assess price changes associated with the cost of living.
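To illustrate just the weighted-averaging idea, here is a simplified sketch. It is not the official CPI methodology, and the items, price changes and weights are hypothetical numbers chosen for the example.

```python
def weighted_price_change(changes, weights):
    """Weighted average of per-item percent price changes."""
    total_weight = sum(weights[item] for item in changes)
    return sum(changes[item] * weights[item] for item in changes) / total_weight


# Hypothetical basket: percent price change and relative importance of each item.
changes = {"food": 2.0, "housing": 3.0, "transport": -1.0}
weights = {"food": 0.2, "housing": 0.5, "transport": 0.3}

overall = weighted_price_change(changes, weights)
print(round(overall, 2))  # 1.6
```

Because housing carries the largest weight here, its 3% increase pulls the overall figure up more than the drop in transport pulls it down.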

In the Shiny app, I created a line chart that shows price changes for different food categories, which are selected from a drop-down list. See Figure 3 for a screenshot of the Different Food Categories Price Changes over Time chart.

As I was looking at the food price changes, I noticed a dramatic increase during the late 70s. Reviewing the history of the 1970s showed that a lot happened during that period, including the "Great Inflation".

Figure 3: Screenshot of Different Food Categories Price Changes over Time

Yearly Food Category Price Changes:

To view the food price changes for each category in a year, I created a bar chart in the Shiny app. Users can select a year from the slider, and the chart will show the food price changes of each category for that year. I actually created two bar charts side by side in case users want to compare the food price changes between any two years.

A quick look at 2015 shows that the price of "Egg" had the biggest increase, while the price of "Pork" dropped the most. In fact, many food categories dropped in price. Compared with 2015, the year 2014 had fewer categories with price drops; the price of "Beef and Veal" had the biggest increase.

Figure 4: Screenshot of Food Category Price Changes by Year

Food Price Changes vs All-Items Price Changes

The Consumer Price Index (CPI) for food is a component of the all-items CPI. That led me to compare the two. From the line chart, I observed:

  • Food price changes mostly align with all-items price changes.
  • Food price inflation has outpaced the economy-wide inflation in recent years.

Figure 5: Price Changes in All-Items vs Price Changes in Food

Food Price Changes vs Producer Price Changes

According to the United States Department of Agriculture (USDA), changes in farm-level and wholesale-level PPIs (Producer Price Indexes) are of particular interest in forecasting food CPIs. Therefore, I created a chart to show the overall food price changes vs. producer price changes. Users can choose one or more producer food categories.

The chart shows that food price changes mostly align with the producer price changes. However, farm-level milk, farm-level cattle and farm-level wheat prices seem to have fluctuated since 2000 without affecting the overall food price change that much. Though the impact on the overall food price was small, I suspect they may have impacted individual food categories. I would like to add a new drop-down list to allow users to select food categories from the consumer food categories.

Figure 6: Food Price Changes vs Producer Price Changes

Correlation Tile Map:

To see the relationship among the different categories in terms of price changes, I created a correlation tile map.


Food prices have been increasing, though by different percentages over time. Since 1990, food price changes have stayed at small percentages. The degree of food price inflation varies depending on the type of food.

Looking ahead to 2016, ERS predicts food-at-home (supermarket) prices to rise 2.0 to 3.0 percent, a rate of inflation that remains in line with the 20-year historical average of 2.5 percent. For future work, I would love to try to fit a time-series model to predict the price changes for the coming five years.

Again, this project was done in Shiny, and most of the information in this blog post comes from the Shiny app.

Originally posted on Data Science Central


There are many ways to choose features from given data, and it is always a challenge to pick the ones with which a particular algorithm will work best. Here I will consider data from monitoring the performance of physical exercises with wearable accelerometers, for example, wrist bands.

The data for this project come from this source:

In this project, researchers used data from accelerometers on the belt, forearm, arm, and dumbbell of a few participants. The participants were asked to perform barbell lifts correctly, marked as "A", and incorrectly with four typical mistakes, marked as "B", "C", "D" and "E". The goal of the project is to predict the manner in which they did the exercise.

There are 52 numeric variables and one classification variable, the outcome. We can plot density graphs, which are in effect smoothed-out histograms, for the first 6 features.

We can see that the data behave in complicated ways. Some of the features are bimodal or even multimodal. These properties could be caused by the participants' different sizes, training levels or something else, but we do not have enough information to check. Nevertheless, it is clear that our variables do not follow a normal distribution. Therefore we are better off with algorithms that do not assume normality, like trees and random forests. We can visualize how these algorithms work as finding vertical lines that divide the areas under the curves such that the areas to the right and to the left of the line are significantly different for different outcomes.

There are a number of ways in functional analysis to distinguish functions analytically on an interval. The most suitable here seems to be to consider the areas between curves. Clearly, we should scale these areas with respect to the size of the curves. For every feature I consider all pairs of density curves to find out whether they are sufficiently different. Here is my final criterion:

If any pair of curves for a feature satisfies it, the feature is chosen for prediction. As a result I got 21 features for a random forest algorithm, which yielded 99% accuracy both for the model itself and on a validation set. I checked how many variables we would need for the same accuracy with PCA preprocessing, and it was 36. Mind you, those variables are scaled and rotated, and we still need all 52 original features to construct them. Thus more effort is needed to construct a prediction and to explain it. With the above method this is easier, since areas under curves represent numbers of observations.
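The criterion itself did not survive in this copy of the post, so the sketch below is only one plausible reading of the idea: estimate a density per outcome class on a shared grid, integrate the absolute difference between each pair of curves, scale it by the total area under both curves, and keep the feature if any pair exceeds a threshold. The function names, the 0.5 threshold and the synthetic data are all assumptions, not the author's values.

```r
# Area between two class-conditional density curves of one feature,
# scaled by the total area under both curves (result lies in [0, 1]).
scaled_area_between <- function(x, y, n = 512) {
  rng <- range(c(x, y))
  pad <- 0.1 * diff(rng)
  # Evaluate both kernel density estimates on the same grid
  dx <- density(x, from = rng[1] - pad, to = rng[2] + pad, n = n)
  dy <- density(y, from = rng[1] - pad, to = rng[2] + pad, n = n)
  step <- dx$x[2] - dx$x[1]
  # Riemann-sum approximations of the two integrals
  between <- sum(abs(dx$y - dy$y)) * step
  total   <- sum(dx$y + dy$y) * step
  between / total
}

# Keep a feature if ANY pair of classes is separated by more than `threshold`
feature_is_useful <- function(values, classes, threshold = 0.5) {
  pairs <- combn(unique(as.character(classes)), 2)
  scores <- apply(pairs, 2, function(p) {
    scaled_area_between(values[classes == p[1]], values[classes == p[2]])
  })
  any(scores > threshold)
}

# Synthetic illustration: class "E" is shifted for the first feature only
set.seed(42)
classes <- rep(c("A", "B", "E"), each = 200)
shifted <- rnorm(600) + ifelse(classes == "E", 4, 0)
noise   <- rnorm(600)

feature_is_useful(shifted, classes)  # class "E" sits 4 standard deviations away
feature_is_useful(noise, classes)    # same distribution in every class
```

Because every density integrates to roughly one, the scaled score is close to 0 for overlapping curves and close to 1 for curves that barely overlap, which makes a single threshold comparable across features.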

Originally posted on Data Science Central

Analysis of Fuel Economy Data

Paul Grech

October 5, 2015

Contributed by Paul Grech. Paul took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his first class project (due in the 2nd week of the program).


Analyse fuel economy ratings in the automotive industry.

Compare vehicle efficiency of American automotive manufacturer, Cadillac with the automotive industry as a whole.

Sept 2014 - “We cannot deny the fact that we are leaving behind our traditional customer base,” de Nysschen said. “It will take several years before a sufficiently large part of the audience who until now have been concentrating on the German brands will find us in their consideration set.” Cadillac’s President - Johan de Nysschen

Compare vehicle efficiency of American automotive manufacturer, Cadillac, with self declared competition, the German luxury market.

What further comparisons will display insight into EPA ratings?

Analysis Overview

  1. Automotive Industry
  2. Cadillac vs Automotive Industry
  3. Cadillac vs German Luxury Market
  4. Cadillac vs German Luxury Market by Vehicle Class

Importing the Data

Import the data and select the columns needed for analysis. Then remove all zeros in the city and highway MPG data, as they would skew results: replace them with NA so that no calculations are performed on data that is not present.

# Load the packages used throughout this analysis
library(dplyr)
library(ggplot2)

# Import data and convert to a dplyr data frame (tibble)
# (the file path was omitted in the original post)
FuelData <- read.csv("", stringsAsFactors = FALSE)
FuelData <- as_tibble(FuelData)

# Create data frame including information necessary for analysis
# (VClass is kept because section 4 groups by vehicle class)
FuelDataV1 <- select(FuelData,
mfrCode, year, make, model,
engId, eng_dscr, cylinders, displ, sCharger, tCharger,
trans_dscr, trany, drive,
startStop, phevBlended,
city08, comb08, highway08,
VClass)

# Replace zero values in MPG data with NA
# (applied to the columns that are averaged later)
FuelDataV1$city08[FuelDataV1$city08 == 0] <- NA
FuelDataV1$comb08[FuelDataV1$comb08 == 0] <- NA
FuelDataV1$highway08[FuelDataV1$highway08 == 0] <- NA

1: Automotive Industry

Visualize city and highway EPA ratings of the entire automotive industry.


How have EPA ratings for city and highway improved across the automotive industry as a whole?

Note: There is no need to include the combined rating, as it is simply a weighted calculation of the city and highway ratings, defaulting to 55% city / 45% highway, which can be adjusted on the website.

IndCityMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "City")
IndHwyMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "Highway")

Comp.Ind <- rbind(IndCityMPG, IndHwyMPG)

ggplot(data = Comp.Ind, aes(x = year, y = MPG, linetype = MPGType)) +
geom_point() + geom_line() + theme_bw() +
ggtitle("Industry\n(city & highway MPG)")



Data visualization shows relatively poor EPA ratings throughout the 1980s, 1990s and early to mid 2000s, with the first drastic improvement in these ratings occurring around 2008. One significant event around this time was the recession hitting America. Consumers having less disposable income, along with increased oil prices, likely fueled competition to develop fuel-efficient powertrains across the automotive industry as a whole.

2: Cadillac vs Automotive Industry

Visualize Cadillac's city and highway EPA ratings with that of the automotive industry.


How does Cadillac perform when compared to the automotive industry as a whole?
IndCityMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "City")
IndHwyMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "Highway")
CadCityMPG <- filter(FuelDataV1, make == "Cadillac") %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Cadillac") %>%
mutate(., MPGType = "City")
CadHwyMPG <- filter(FuelDataV1, make == "Cadillac") %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Cadillac") %>%
mutate(., MPGType = "Highway")

Comp.Ind.Cad <- rbind(IndCityMPG, IndHwyMPG, CadCityMPG, CadHwyMPG)

ggplot(data = Comp.Ind.Cad, aes(x = year, y = MPG, color = Label, linetype = MPGType)) +
geom_point() + geom_line() + theme_bw() +
scale_color_manual(name = "Cadillac / Industry", values = c("blue","#666666")) +
ggtitle("Cadillac vs Industry\n(city & highway MPG)")



Cadillac was chosen as a brand of interest because it is currently redefining its brand as a whole. It is important to analyze past performance to have a complete understanding of how Cadillac has been viewed for several decades.

In 2002, Cadillac dropped to its lowest performance. Why did this occur? Because the entire fleet was made up of the same 4.6L V8 mated to a 4-speed automatic transmission, or as some would say... slush-box. The image Cadillac had at this time was of a retirement vehicle to be shipped to its owner's new retirement home in Florida, with a soft ride, smooth, powerful delivery and no performance. With the latest generation of Cadillacs being performance oriented, beginning with the LS2-sourced CTS-V and now including the ATS-V and CTS-V along with several other V-Sport models, a rebranding is crucial in order to appeal to a new market of buyers.

It is also interesting to note that although an increasing number of performance models are being produced, fuel efficiency is not lacking. The gap noted above has narrowed even as more performance models have been developed, two trends that rarely go together.

3: Cadillac vs German Luxury Market

Cadillac has recently targeted the German luxury market consisting of the following manufacturers:
  • Audi
  • BMW
  • Mercedes-Benz


How does Cadillac perform when compared with the German Luxury Market?
# Calculate Cadillac average Highway / City MPG past 2000
CadCityMPG <- filter(CadCityMPG, year > 2000)
CadHwyMPG <- filter(CadHwyMPG, year > 2000)

# Calculate Audi average Highway / City MPG
AudCityMPG <- filter(FuelDataV1, make == "Audi", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Audi") %>%
mutate(., MPGType = "City")
AudHwyMPG <- filter(FuelDataV1, make == "Audi", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Audi") %>%
mutate(., MPGType = "Highway")

# Calculate BMW average Highway / City MPG
BMWCityMPG <- filter(FuelDataV1, make == "BMW", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "BMW") %>%
mutate(., MPGType = "City")
BMWHwyMPG <- filter(FuelDataV1, make == "BMW", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "BMW") %>%
mutate(., MPGType = "Highway")

# Calculate Mercedes-Benz average Highway / City MPG
MbzCityMPG <- filter(FuelDataV1, make == "Mercedes-Benz", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Merc-Benz") %>%
mutate(., MPGType = "City")
MbzHwyMPG <- filter(FuelDataV1, make == "Mercedes-Benz", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Merc-Benz") %>%
mutate(., MPGType = "Highway")

# Concatenate all highway/city MPG data for
# Cadillac vs. German competitors
CompGerCadCity <- rbind(CadCityMPG, AudCityMPG, BMWCityMPG, MbzCityMPG)
CompGerCadHwy <- rbind(CadHwyMPG, AudHwyMPG, BMWHwyMPG, MbzHwyMPG)

ggplot(data = CompGerCadCity, aes(x = year, y = MPG, color = Label)) + 
geom_line() + geom_point() + theme_bw() +
scale_color_manual(name = "Cadillac vs German Luxury Market",
values = c("#333333", "#666666", "blue","#999999")) +
ggtitle("CITY MPG\n(Cad vs Audi vs BMW vs Mercedes-Benz)")

ggplot(data = CompGerCadHwy, aes(x = year, y = MPG, color = Label)) + 
geom_line() + geom_point() + theme_bw() +
scale_color_manual(name = "Cadillac vs German Luxury Market",
values = c("#333333", "#666666", "blue","#999999")) +
ggtitle(label = "HIGHWAY MPG\n(Cad vs Audi vs BMW vs Mercedes-Benz)")
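The blocks above repeat the same filter, group and average steps for every make and rating type. As a sketch only, the repetition could be folded into one helper. `yearly_mpg` and its arguments are hypothetical names, and the version below uses base R (rather than dplyr) and a toy data frame so that it runs on its own:

```r
# Mean MPG per model year for one make (or for all rows when make = NULL).
# `mpg_col` is the name of the rating column, e.g. "city08" or "highway08".
yearly_mpg <- function(data, mpg_col, label, mpg_type,
                       make = NULL, min_year = -Inf) {
  keep <- data$year > min_year
  if (!is.null(make)) keep <- keep & data$make == make
  rows <- data[keep, , drop = FALSE]
  # NA ratings are dropped inside mean(), as in mean(..., na.rm = TRUE) above
  means <- tapply(rows[[mpg_col]], rows$year, mean, na.rm = TRUE)
  data.frame(year = as.integer(names(means)), MPG = as.numeric(means),
             Label = label, MPGType = mpg_type, stringsAsFactors = FALSE)
}

# Toy data with invented ratings, standing in for FuelDataV1
toy <- data.frame(
  year      = c(2001, 2001, 2002, 2002, 2002),
  make      = c("Audi", "BMW", "Audi", "Audi", "BMW"),
  city08    = c(18, 17, 20, NA, 19),
  highway08 = c(26, 25, 28, 30, 27)
)

# One call per make replaces one hand-written block per make
makes <- c("Audi", "BMW")
city <- do.call(rbind, lapply(makes, function(m) {
  yearly_mpg(toy, "city08", label = m, mpg_type = "City",
             make = m, min_year = 2000)
}))
city
```

The resulting data frame has the same `year`, `MPG`, `Label` and `MPGType` columns as the hand-written versions, so it could be passed to the same `ggplot()` calls.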



“Mr. Ellinghaus, a German who came to Cadillac in January from pen maker Montblanc International after more than a decade at BMW, said he has spent the past 11 months doing “foundational work” to craft an overarching brand theme for Cadillac’s marketing, which he says relied too heavily on product-centric, me-too comparisons.

“In engineering terms, it makes a lot of sense to benchmark the cars against BMW,” Mr. Ellinghaus said. But he added: “From a communication point of view, you must not follow this rule.”

Despite the comments made by Mr. Ellinghaus, the end goal is for consumers to compare Cadillac with Audi, BMW and Mercedes-Benz. The fact that this is already happening is a huge success for a company that, only ten years ago, would never have been mentioned in the same sentence as the German luxury market.

Data visualization shows that Cadillac is rated on par with its German competitors and, at the same time, has not had any significant dips, unlike all the other manufacturers. The continued increase in performance combined with the rebranding suggests that Cadillac is on a path to success.

4: Cadillac vs German Luxury Market by Vehicle Class

Every manufacturer has its strengths and weaknesses. It is important to assess and recognize these attributes to best determine where an increase in R&D spending is needed and where to maintain a competitive advantage for the consumer by vehicle class.


In what vehicle class is Cadillac excelling or falling behind?
# Filter only Cadillac and German luxury market
German <- filter(FuelDataV1, make %in% c("Cadillac", "Audi", "BMW", "Mercedes-Benz"))
# Group vehicle classes into more generic classes
# NOTE: the name of the new column was lost in the original post;
# VClassGeneric is a placeholder, and ignore.case = TRUE is an assumed argument
German$VClassGeneric <- ifelse(grepl("Compact", German$VClass, ignore.case = TRUE), "Compact",
ifelse(grepl("Wagons", German$VClass), "Wagons",
ifelse(grepl("Utility", German$VClass), "SUV",
ifelse(grepl("Special", German$VClass), "SpecUV", German$VClass))))

# Focus on vehicle model years past 2000
German <- filter(German, year > 2000)
# Vans, Passenger Type are only specific to one company and are not needed for this analysis
German <- filter(German, VClassGeneric != "Vans, Passenger Type")

# Average ratings per class and year; NA ratings (former zeros) are skipped
IndClass <- filter(German, make %in% c("Audi", "BMW", "Mercedes-Benz")) %>%
group_by(VClassGeneric, year) %>%
summarize(AvgCity = mean(city08, na.rm = TRUE), AvgHwy = mean(highway08, na.rm = TRUE))
CadClass <- filter(German, make %in% c("Cadillac")) %>%
group_by(VClassGeneric, year) %>%
summarize(AvgCity = mean(city08, na.rm = TRUE), AvgHwy = mean(highway08, na.rm = TRUE))

##### Join tables #####
# .x columns hold the German average, .y columns hold Cadillac
CadIndClass <- left_join(IndClass, CadClass, by = c("year", "VClassGeneric"))
CadIndClass$DifCity <- (CadIndClass$AvgCity.y - CadIndClass$AvgCity.x)
CadIndClass$DifHwy <- (CadIndClass$AvgHwy.y - CadIndClass$AvgHwy.x)

ggplot(CadIndClass, aes(x = year, ymax = DifCity, ymin = 0) ) + 
geom_linerange(color='grey20', size=0.5) +
geom_point(aes(y=DifCity), color = 'blue') +
geom_hline(yintercept = 0) +
theme_bw() +
facet_wrap(~ VClassGeneric) +
ggtitle("Cadillac vs German Luxury Market\n(city mpg by class)") +
xlab("Year") +
ylab("MPG Difference")

ggplot(CadIndClass, aes(x = year, ymax = DifHwy, ymin = 0) ) + 
geom_linerange(color='grey20', size=0.5) +
geom_point(aes(y=DifHwy), color='blue') +
geom_hline(yintercept = 0) +
theme_bw() +
facet_wrap(~ VClassGeneric) +
ggtitle("Cadillac vs German Luxury Market\n(highway mpg by class)") +
xlab("Year") +
ylab("MPG Difference")



The above data visualization displays the delta between Cadillac and the average (Audi, BMW, Mercedes-Benz) fuel economy ratings. Positive can then be considered above the average competition and negative, below the average competition.
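To make the sign convention concrete, here is a small base R sketch of the same join-and-subtract step. The numbers are invented, `class` is a stand-in column name, and `merge()` is used in place of dplyr's `left_join()` so the example needs no packages:

```r
# Toy per-class, per-year averages; the numbers are invented for illustration
german <- data.frame(class = c("Compact", "SUV"), year = 2015,
                     AvgCity = c(24, 18))
cadillac <- data.frame(class = c("Compact", "SUV"), year = 2015,
                       AvgCity = c(22, 19))

# The suffixes play the role of dplyr's .x (German average) and .y (Cadillac)
both <- merge(german, cadillac, by = c("class", "year"),
              suffixes = c(".x", ".y"))

# Cadillac minus German average: positive means Cadillac is ahead
both$DifCity <- both$AvgCity.y - both$AvgCity.x
both
```

In this toy table the compact class comes out negative (Cadillac behind) and the SUV class positive (Cadillac ahead), which is how the bars in the plots should be read.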

Cadillac falls short across all vehicle classes. One possible reason is that the same powertrains are used across multiple chassis.


Conclusion & Continued Analysis

  1. There is a clear improvement in EPA ratings as federal emission standards drive innovation for increased fleet fuel economy. It is important for automotive manufacturers to continue innovation and push for increased efficiency.
  2. Further analysis of the following areas offers additional research opportunities:
    • Drivetrain v.s. MPG
    • Sales data
    • Consumer reaction to new marketing strategies
    • Consumer demand for product or badge

Originally posted on Data Science Central


Guest blog post by Vincent Granville

There have been many variations on this theme: defining big data with 3Vs (or more, including velocity, variety, volume, veracity, value), as well as other representations such as the data science alphabet.

Here's an interesting Venn diagram that tries to define statistical computing (a sub-field of data science) with 7 sets and 9 intersections:

It was published in a scholarly paper entitled Computing in the Statistics Curricula (PDF document). Enjoy!


Learning R in Seven Simple Steps

Originally posted on Data Science Central

Guest blog post by Martijn Theuwissen, co-founder at DataCamp. Other R resources can be found here, and R Source code for various problems can be found here. A data science cheat sheet can be found here, to get you started with many aspects of data science, including R.

Learning R can be tricky, especially if you have no programming experience or are more familiar working with point-and-click statistical software versus a real programming language. This learning path is mainly for novice R users that are just getting started but it will also cover some of the latest changes in the language that might appeal to more advanced R users.

Creating this learning path was a continuous trade-off between being pragmatic and exhaustive. There are many excellent (free) resources on R out there, and unfortunately not all could be covered here. The material presented here is a mix of relevant documentation, online courses, books, and more that we believe is best to get you up to speed with R as fast as possible.

Data Video produced with R: click here and also here for source code and to watch the video. More here.

Here is an outline:

  • Step 0: Why you should learn R
  • Step 1: The Set-Up
  • Step 2: Understanding the R Syntax
  • Step 3: The core of R -> packages
  • Step 4: Help?!
  • Step 5: The Data Analysis Workflow
    • 5.1 Importing Data
    • 5.2 Data Manipulation
    • 5.3 Data Visualization
    • 5.4 The stats part
    • 5.5 Reporting your results
  • Step 6: Become an R wizard and discover exciting new stuff

Step 0: Why you should learn R

R is rapidly becoming the lingua franca of Data Science. Although it has its origins in academia, you will spot it today in an increasing number of business settings as well, where it competes with commercial software incumbents such as SAS, STATA and SPSS. Each year R gains in popularity, and in 2015 IEEE listed R among its top ten languages.

This implies that the demand for individuals with R knowledge is growing, and consequently learning R is definitely a smart investment career-wise (according to this survey R even is the highest paying skill). This growth is unlikely to plateau in the coming years, with large players such as Oracle & Microsoft stepping up by including R in their offerings.

Nevertheless, money should not be the only driver when deciding to learn a new technology or programming language. Luckily, R has a lot more to offer than a solid paycheck. By engaging yourself with R, you will become familiar with a highly diverse and interesting community. Namely, R is being used for a diverse set of tasks in fields such as finance, genomic analysis, real estate, paid advertising, and much more. All these fields are actively contributing to the development of R. You will encounter a diverse set of examples and applications on a daily basis, keeping things interesting and giving you the ability to apply your knowledge to a diverse range of problems.

Have fun!

Step 1: The Set-Up

Before you can actually start working in R, you need to download a copy of it on your local computer. R is continuously evolving and different versions have been released since R was born in 1993 with (funny) names such as World-Famous Astronaut and Wooden Christmas-Tree. Installing R is pretty straightforward and there are binaries available for Linux, Mac and Windows from the Comprehensive R Archive Network (CRAN).

Once R is installed, you should consider installing one of R’s integrated development environments (IDEs) as well (although you could also work with the basic R console if you prefer). Two fairly established IDEs are RStudio and Architect. In case you prefer a graphical user interface, you should check out R-commander.

Step 2: Understanding the R Syntax

Learning the syntax of a programming language like R is very similar to the way you would learn a natural language like French or Spanish: by practice & by doing. One of the best ways to learn R by doing is through the following (online) tutorials:

Next to these online tutorials there are also some very good introductory books and written tutorials to get you started:

Step 3: The core of R -> packages

Every R package is simply a bundle of code that serves a specific purpose and is designed to be reusable by other developers. In addition to the primary codebase, packages often include data, documentation, and tests. As an R user, you can simply download a particular package (some are even pre-installed) and start using its functionalities. Everyone can develop R packages, and everyone can share their R packages with others.

The above is an extremely powerful concept and one of the key reasons R is so successful as a language and as a community. Namely, you don’t need to do all the hard-core programming yourself or understand every complex detail of a particular algorithm or visualization. You can simply use the out-of-the-box functions that come with the relevant package as an interface to such functionalities. As such, it is useful to have an understanding of R’s package ecosystem.

Many R packages are available from the Comprehensive R Archive Network, and you can install them using the install.packages function. What is great about CRAN is that it associates packages with a particular task via Task Views. Alternatively, you can find R packages on Bioconductor, GitHub and Bitbucket.
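As a small illustration of the install-then-load pattern, the sketch below wraps it in a helper. `ensure_package` is a hypothetical name, not part of R; it only calls `install.packages()` when the package is missing:

```r
# Install a package from CRAN only if it is not already available, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call never needs network access;
# swap in any CRAN package name, e.g. "ggplot2"
ensure_package("stats")
```

In everyday use the two steps are usually typed directly: `install.packages("ggplot2")` once per machine, then `library(ggplot2)` in every session that needs it.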

Looking for a particular package and corresponding documentation? Try Rdocumentation, where you can easily search packages from CRAN, GitHub and Bioconductor.

Step 4: Help?!

You will quickly find out that for every R question you solve, five new ones will pop up. Luckily, there are many ways to get help:

  • Within R you can make use of its built-in help system. For example, the command `?plot` will provide you with the documentation on the plot function.
  • R puts a big emphasis on documentation. The previously mentioned Rdocumentation is a great website for browsing the documentation of different packages and functions.
  • Stack Overflow is a great resource for seeking answers to common R questions or for asking questions yourself.
  • There are numerous blogs & posts on the web covering R, such as KDnuggets and R-bloggers.
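The built-in help system offers a few more entry points than `?plot`. A short sketch using only functions that ship with base R:

```r
# Documentation for a function; `?plot` is shorthand for help("plot")
plot_help <- help("plot")

# Find functions whose names match a pattern (here: names starting with "read.")
readers <- apropos("^read\\.")
head(readers)

# Inspect a function's arguments without leaving the console
args(read.csv)

# In an interactive session, also try:
#   example(mean)              # runs the examples from the help page
#   help.search("regression")  # searches the installed help pages
```

`apropos()` is handy when you remember part of a function's name, while `help.search()` (or its shorthand `??`) searches by topic.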

Step 5: The Data Analysis Workflow

Once you have an understanding of R’s syntax, the package ecosystem, and how to get help, it’s time to focus on how R can be useful for the most common tasks in the data analysis workflow.

5.1 Importing Data

Before you can start performing analysis, you first need to get your data into R. The good thing is that you can import all sorts of data formats into R; the hard part is that different types often need a different approach:

If you want to learn more on how to import data into R check an online Importing Data into R tutorial or  this post on data importing.
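For flat files, base R already covers the common case. The sketch below writes a tiny CSV to a temporary file so that it is self-contained; the column names and values are made up:

```r
# Write a small CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("make,year,city08",
             "Audi,2015,24",
             "Cadillac,2015,22"), csv_path)

# Flat files: read.csv() is base R's workhorse for comma-separated data
vehicles <- read.csv(csv_path, stringsAsFactors = FALSE)
str(vehicles)

# The same kind of text can also be parsed directly, without touching the disk
inline <- read.csv(text = "x,y\n1,a\n2,b", stringsAsFactors = FALSE)
inline
```

Other formats (Excel sheets, databases, files from other statistical software) generally need a dedicated package, which is where the different approaches come in.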

5.2 Data Manipulation

Performing data manipulation with R is a broad topic as you can see in for example this Data Wrangling with R video by RStudio or the book Data Manipulation with R. This is a list of packages in R that you should master when performing data manipulations:

  • The tidyr package for tidying your data.
  • The stringr package for string manipulation.
  • When working with data frame like objects it is best to make yourself familiar with the dplyr package (try this course). However, in case of heavy data wrangling tasks, it makes more sense to check out the blazingly fast data.table package (see this syntax cheatsheet for help).
  • When working with times and dates, install the lubridate package, which makes it a bit easier to work with these.
  • Packages like zoo, xts and quantmod offer great support for time series analysis in R.

5.3 Data Visualization

One of the main reasons R is the favorite tool of data analysts and scientists is its data visualization capabilities. Tons of beautiful plots are created with R, as shown by all the posts on FlowingData, such as this famous Facebook visualization.

Credit card fraud scheme featuring time, location, and loss per event, using R: click here for source

If you want to get started with visualizations in R, take some time to study the ggplot2 package, one of the (if not the) most famous packages in R for creating graphs and plots. ggplot2 makes intensive use of the grammar of graphics, and as a result is very intuitive in usage (you’re continuously building part of your graphs so it’s a bit like playing with Lego). There are tons of resources to get you started, such as this interactive coding tutorial, a cheatsheet and an upcoming book by Hadley Wickham.

Besides ggplot2 there are multiple other packages that allow you to create highly engaging graphics and that have good learning resources to get you up to speed. Some of our favourites are:

If you want to see more packages for visualizations see the CRAN task view. In case you run into issues plotting your data this post might help as well.

Next to the “traditional” graphs, R is able to handle and visualize spatial data as well. You can easily visualize spatial data and models on top of static maps from sources such as Google Maps and OpenStreetMap with a package such as ggmap. Other great packages are choroplethr, developed by Ari Lamstein of Trulia, and tmap. Take this tutorial on Introduction to visualising spatial data in R if you want to learn more.

5.4 The stats part

In case you are new to statistics, there are some very solid sources that explain the basic concepts while making use of R:

Note that these resources are aimed at beginners. If you want to go more advanced you can look at the multiple resources there are for machine learning with R. Books such as Mastering Machine Learning with R and Machine Learning with R explain the different concepts very well, and online resources like the Kaggle Machine Learning course help you practice the different concepts. Furthermore there are some very interesting blogs to kickstart your ML knowledge like Machine Learning Mastery or this post.

5.5 Reporting your results

One of the best ways to share your models, visualizations, etc. is through dynamic documents. R Markdown (based on knitr and pandoc) is a great tool for reporting your data analysis in a reproducible manner through HTML, Word, PDF, ioslides, etc. This 4 hour tutorial on Reporting with R Markdown explains the basics of R Markdown. Once you are creating your own markdown documents, make sure this cheat sheet is on your desk.

Step 6: Become an R wizard and discover exciting new stuff

R is a fast-evolving language. Its adoption in academia and business is skyrocketing, and consequently the rate of new features and tools within R is rapidly increasing. These are some of the new technologies and packages that excite us the most:

Once you have some experience with R, a great way to level up your R skillset is the free book Advanced R by Hadley Wickham. In addition, you can start practicing your R skills by competing with fellow Data Science Enthusiasts on Kaggle, an online platform for data-mining and predictive modelling competitions. Here you have the opportunity to work on fun cases such as this Titanic data set.

To end, you are now probably ready to start contributing to R yourself by writing your own packages. Enjoy!


New book on data mining and statistics

New book:

Numeric Computation and Statistical Data Analysis on the Java Platform (by S.Chekanov)

710 pages. Springer International Publishing AG. 2016. ISBN 978-3-319-28531-3.


About this book: Numerical computation, knowledge discovery and statistical data analysis integrated with powerful 2D and 3D graphics for visualization are the key topics of this book. The Python code examples powered by the Java platform can easily be transformed to other programming languages, such as Java, Groovy, Ruby and BeanShell. This book equips the reader with a computational platform which, unlike other statistical programs, is not limited by a single programming language.

Originally posted on Data Science Central


Dealing with Outliers is like searching for a needle in a haystack

This is a guest repost by Jacob Joseph.

An outlier is an observation or point that is distant from other observations/points. But how would you quantify the distance of an observation from other observations to qualify it as an outlier? Outliers are also referred to as observations whose probability of occurring is low. But, again, what constitutes low?

There are parametric methods and non-parametric methods that are employed to identify outliers. Parametric methods assume some underlying distribution, such as the normal distribution, whereas non-parametric approaches have no such requirement. Additionally, you could do a univariate analysis, studying a single variable at a time, or a multivariate analysis, studying more than one variable at the same time, to identify outliers.

The question arises: which approach and which analysis is the right one? Unfortunately, there is no single right answer. It depends on the end purpose of identifying such outliers. You may want to analyze the variable in isolation, or use it among a set of variables to build a predictive model.

Let’s try to identify outliers visually.

Assume we have the data for Revenue and Operating System for Mobile devices for an app. Below is the subset of the data:

How can we identify outliers in the Revenue?

We shall try to detect outliers using both a parametric and a non-parametric approach.

Parametric Approach

Comparison of Actual, Lognormal and Normal Density Plot

The x-axis in the above plot represents the Revenues, and the y-axis the probability density of the observed Revenue value. The density curve for the actual data is shaded in pink, the normal distribution in green and the log-normal distribution in blue. The probability density for the actual distribution is calculated from the observed data, whereas the densities for both the normal and log-normal distributions are computed from the observed mean and standard deviation of the Revenues.
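The post does not say how the log-normal parameters were derived from the observed mean and standard deviation. One common way, assumed here, is the method of moments, which picks the log-scale parameters so that the log-normal has exactly the observed mean and standard deviation. The data below are synthetic, not the revenue data from the plot:

```r
# Synthetic positive-valued data standing in for the Revenues
set.seed(7)
revenue <- rlnorm(1000, meanlog = 9, sdlog = 0.8)
m <- mean(revenue)
s <- sd(revenue)

# Method of moments: log-scale parameters that reproduce mean m and sd s
sdlog   <- sqrt(log(1 + (s / m)^2))
meanlog <- log(m) - sdlog^2 / 2

# Evaluate the three curves that the plot compares
grid <- seq(min(revenue), max(revenue), length.out = 200)
normal_density    <- dnorm(grid, mean = m, sd = s)
lognormal_density <- dlnorm(grid, meanlog = meanlog, sdlog = sdlog)
empirical_density <- density(revenue)  # kernel estimate from the data itself
```

Plotting `normal_density` and `lognormal_density` against `grid`, together with `empirical_density`, gives a comparison of the same kind as the figure above.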

Outliers could be identified by calculating the probability of the occurrence of an observation or by calculating how far the observation is from the mean. For example, in the case of a normal distribution, observations more than 3 standard deviations above or below the mean could be classified as outliers.
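The three-standard-deviation rule can be sketched in a few lines of R. `flag_sigma_outliers` is a hypothetical helper name, and the data are synthetic with two planted extremes:

```r
# Flag observations more than k standard deviations from the mean
flag_sigma_outliers <- function(x, k = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  abs(z) > k
}

# Synthetic example: 200 typical values plus two planted extremes at the end
set.seed(123)
x <- c(rnorm(200, mean = 50, sd = 5), 120, -30)

# Positions of the flagged observations
which(flag_sigma_outliers(x))
```

Note that the extremes themselves inflate the mean and standard deviation used in the rule, which is one reason it can miss outliers in small samples.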

In the above case, if we assume a normal distribution, there could be many outlier candidates, especially among observations with revenue beyond 60,000. The log-normal plot does a better job than the normal distribution, but only because the underlying actual distribution has the characteristics of a log-normal distribution. This cannot be taken as the general case, since determining the distribution or its parameters beforehand (a priori) is extremely difficult. One could infer the parameters by fitting a curve to the data, but a change in the underlying parameters, such as the mean and/or standard deviation, due to new incoming data will change the location and shape of the curve, as observed in the plots below:

Comparison of Density Plot for change in mean and standard deviation for Normal Distribution

Comparison of Density Plot for change in mean and standard deviation for LogNormal Distribution

The above plots show the shift in location or the spread of the density curve based on an assumed change in mean or standard deviation of the underlying distribution. It is evident that a shift in the parameters of a distribution is likely to influence the identification of outliers.

Non-Parametric Approach

Let’s look at a simple non-parametric approach like a box plot to identify the outliers.

Non Parametric approach to detect outlier with box plots (univariate approach)

In the box plot shown above, we can identify 7 observations, which could be classified as potential outliers, marked in green. These observations are beyond the whiskers. 

In the data, we have also been provided information on the OS. Would we identify the same outliers, if we plot the Revenue based on OS??

Non Parametric approach to detect outlier with box plots (bivariate approach)

In the above box plot, we are doing a bivariate analysis, taking 2 variables at a time, which is a special case of multivariate analysis. There appear to be 3 outlier candidates for iOS and none for Android. This is due to the difference in the distribution of Revenue for Android and iOS users. So, analyzing the Revenue variable on its own (i.e., univariate analysis), we identified 7 outlier candidates, which dropped to 3 candidates when a bivariate analysis was performed.
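The effect of grouping can be reproduced on a toy example. The sketch below again assumes whiskers at 1.5 times the IQR, and the revenue figures and group sizes are invented for illustration, so the counts differ from the 7 and 3 candidates found in the post's data.

```python
import numpy as np

def iqr_mask(values, whisker=1.5):
    """Mark points beyond whisker * IQR from the quartiles (assumed whisker rule)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - whisker * iqr) | (values > q3 + whisker * iqr)

def outliers_by_group(values, groups, whisker=1.5):
    """Apply the whisker rule separately within each group."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    mask = np.zeros(values.shape, dtype=bool)
    for group in np.unique(groups):
        in_group = groups == group
        mask[in_group] = iqr_mask(values[in_group], whisker)
    return mask

# Invented data: many low-revenue Android users, a few high-revenue iOS users.
revenue = [10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 40, 42]
os_name = ["Android"] * 10 + ["iOS"] * 2

pooled = iqr_mask(revenue)                    # univariate: ignores the OS
grouped = outliers_by_group(revenue, os_name) # bivariate: Revenue within each OS
print(pooled.sum(), grouped.sum())
```

In this toy data the two iOS values look extreme when all users are pooled, but are unremarkable within their own group, so the number of candidates drops once the OS is taken into account.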

Both parametric and non-parametric approaches can be used to identify outliers, depending on the characteristics of the underlying distribution. If the mean accurately represents the center of the distribution and the data set is large enough, a parametric approach could be used; if the median better represents the center of the distribution, a non-parametric approach is more suitable.

Dealing with outliers in a multivariate scenario becomes all the more tedious. Clustering, a popular data mining technique and a non-parametric method, could be used to identify outliers in such a case.

Originally posted on Data Science Central


Guest blog post by Ahmet Taspinar

One of the most important tasks in Machine Learning is classification (a form of supervised machine learning). Classification is used to predict the class of entries in the test set (a dataset whose entries have not been labelled yet) with a model which was constructed from a training set. You could think of classifying crime in the field of Pre-Policing, classifying patients in the Health sector, or classifying houses in the Real-Estate sector. Another field in which classification is big is Natural Language Processing (NLP). This is the field of science with the goal of making machines (computers) understand (written) human language. You could think of Text Categorization, Sentiment Analysis, Spam detection and Topic Categorization.

For classification tasks there are three widely used algorithms: Naive Bayes, Logistic Regression / Maximum Entropy, and Support Vector Machines. We have already seen how Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it assumes conditional independence of its features. This simplification makes the NB classifier easy to implement, but it is also unrealistic in most cases and leads to lower accuracy. A direct improvement on the NB classifier is an algorithm which does not assume conditional independence but tries to estimate the weight vectors (feature values) directly.

This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea, which can be implemented with a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

1. Regression Analysis

Regression Analysis is the field of mathematics where the goal is to find a function which best fits a dataset. Let's say we have a dataset containing n datapoints; X = ( x^{(1)}, x^{(2)}, .., x^{(n)} ). For each of these (input) datapoints there is a corresponding (output) y^{(i)}-value. Here the x-datapoints are called the independent variables and y the dependent variable; the value of y^{(i)} depends on the value of x^{(i)}, while the value of x^{(i)} may be freely chosen without any restriction imposed on it by any other variable.
The goal of Regression analysis is to find a function f(X) which best describes the relationship between X and Y. In the field of Machine Learning, this function is called the hypothesis function and is denoted as h_{\theta}(x).




If we can find such a function, we can say we have successfully built a Regression model. If the input data lives in a 2D space, this boils down to finding a curve which fits through the datapoints. In the 3D case we have to find a plane, and in higher dimensions a hyperplane.

To give an example, let's say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset Y which contains the final grades of n students. Dataset X contains the values of the independent variables. Our initial assumption is that the final grade only depends on the studying time. The variable x^{(i)} therefore indicates how many hours student i has studied. The first thing we would do is visualize this data:


[Figure: two scatter plots of the data, regression_left2 (left) and regression_right2 (right)]

If the results look like the figure on the left, then we are out of luck. It looks like the points are distributed randomly and there is no correlation between Y and X at all. However, if they look like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes this correlation.


This function could for example be:

h_{\theta}(x) = \theta_0 + \theta_1 \cdot x

or

h_{\theta}(x) = \theta_0 + \theta_1 \cdot x^2

where \theta_0 and \theta_1 are the parameters of our model, whose values are determined from the data.
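To make this concrete, here is a small sketch that estimates \theta_0 and \theta_1 for either hypothesis. It uses ordinary least squares via NumPy as a stand-in, which is an assumption of this example and not necessarily the fitting method used in the rest of the blog; the function name `fit_hypothesis` and the hours and grades are made up.

```python
import numpy as np

def fit_hypothesis(x, y, power=1):
    """Least-squares estimate of (theta_0, theta_1) for h(x) = theta_0 + theta_1 * x**power."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones for theta_0 and a column x**power for theta_1.
    design = np.column_stack([np.ones_like(x), x ** power])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return theta

# Made-up data: grades generated exactly from h(x) = 2.0 + 1.5 * x.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grades = 2.0 + 1.5 * hours

theta0, theta1 = fit_hypothesis(hours, grades, power=1)
print(theta0, theta1)  # close to 2.0 and 1.5 for this noise-free data
```

Calling `fit_hypothesis(hours, grades, power=2)` fits the second, quadratic form instead; comparing the residuals of the two fits is one way to choose between them.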


1.1. Multivariate Regression

Evaluating the results from the previous section, we may find them unsatisfying; the function does not fit the datapoints well enough. Our initial assumption is probably incomplete: taking only the studying time into account is not enough. The final grade depends not only on the studying time, but also on how much the students slept the night before the exam. Now the dataset contains an additional variable which represents the sleeping time. Our dataset is then given by X = ( (x_1^{(1)}, x_2^{(1)}), (x_1^{(2)}, x_2^{(2)}), .., (x_1^{(n)}, x_2^{(n)}) ). In this dataset x_1^{(i)} indicates how many hours student i has studied and x_2^{(i)} indicates how many hours they have slept.
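With two independent variables, a natural extension of the earlier hypothesis is h_{\theta}(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2. That linear form is an assumption of this sketch, as are the least-squares fit, the name `fit_multivariate`, and the invented study and sleep hours.

```python
import numpy as np

def fit_multivariate(X, y):
    """Least-squares estimate of theta for h(x) = theta_0 + theta_1*x_1 + ... + theta_k*x_k."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that theta_0 acts as the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return theta

# Invented (hours studied, hours slept) pairs for six students.
X = [(2, 8), (4, 6), (6, 7), (8, 5), (3, 9), (5, 4)]
# Grades generated exactly from theta = (1.0, 0.5, 0.25), so the fit should recover them.
y = [1.0 + 0.5 * studied + 0.25 * slept for studied, slept in X]

theta = fit_multivariate(X, y)
print(theta)
```

Each row of `X` is one student, so adding a further variable only means adding a further column.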

See the rest of the blog here, including Linear vs Non-linear, Gradient Descent, Logistic Regression, and Text Classification and Sentiment Analysis.
